Skip to content

[CUDA] Change '__activemask' to use '__nvvm_activemask()' #79892

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 29, 2024

Conversation

jhuber6
Copy link
Contributor

@jhuber6 jhuber6 commented Jan 29, 2024

Summary:
We recently added builitin support for this function.

Summary:
We recently added builitin support for this function.
@jhuber6 jhuber6 requested review from Artem-B and jlebar January 29, 2024 20:29
@llvmbot llvmbot added clang Clang issues not falling into any other category backend:X86 clang:headers Headers provided by Clang, e.g. for intrinsics labels Jan 29, 2024
@llvmbot
Copy link
Member

llvmbot commented Jan 29, 2024

@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-clang

Author: Joseph Huber (jhuber6)

Changes

Summary:
We recently added builitin support for this function.


Full diff: https://github.com/llvm/llvm-project/pull/79892.diff

1 Files Affected:

  • (modified) clang/lib/Headers/__clang_cuda_intrinsics.h (+1-3)
diff --git a/clang/lib/Headers/__clang_cuda_intrinsics.h b/clang/lib/Headers/__clang_cuda_intrinsics.h
index 3c3948863c1d453..a04e8b6de44d053 100644
--- a/clang/lib/Headers/__clang_cuda_intrinsics.h
+++ b/clang/lib/Headers/__clang_cuda_intrinsics.h
@@ -215,9 +215,7 @@ inline __device__ unsigned int __activemask() {
 #if CUDA_VERSION < 9020
   return __nvvm_vote_ballot(1);
 #else
-  unsigned int mask;
-  asm volatile("activemask.b32 %0;" : "=r"(mask));
-  return mask;
+  return __nvvm_activemask();
 #endif
 }
 

@jhuber6
Copy link
Contributor Author

jhuber6 commented Jan 29, 2024

I've actually encountered some really strange behavior when trying to update libc to use the new intrinsic. The following returns a common 64-bit value to be compatible with AMDGPU's 64 lane wide mode. When I run this against the test suite, it fails on tests that specifically check against divergence.

This works

[[clang::convergent, gnu::noinline]]  uint64_t get_lane_mask() {
  uint32_t mask;              
  mask = __nvvm_activemask();
  return mask;               
} 

But this does not

[[clang::convergent, gnu::noinline]] uint64_t get_lane_mask() {
  return __nvvm_activemask();     
} 

If I check the PTX, the main difference seems to be the cvt instruction, here's the output respectively.

.weak .func  (.param .b64 func_retval0) _ZN22__llvm_libc_19_0_0_git3gpu13get_lane_maskEv()
{
  .reg .b32   %r<2>;
  .reg .b64   %rd<2>;

// %bb.0:                               // %entry
  activemask.b32  %r1;
  cvt.u64.u32   %rd1, %r1;
  st.param.b64  [func_retval0+0], %rd1;
  ret;
}
.weak .func  (.param .b64 func_retval0) _ZN22__llvm_libc_19_0_0_git3gpu13get_lane_maskEv()
{
  .reg .b32   %r<2>;
  .reg .b64   %rd<2>;

// %bb.0:                               // %entry
  activemask.b32  %r1;
  cvt.s64.s32   %rd1, %r1;
  st.param.b64  [func_retval0+0], %rd1;
  ret;
}

So, the difference is that the version that works uses cvt.u64.u32 while the version that's broken uses cvt.s64.s32. This means that likely this is returning a "signed" value, and the conversion is treating it like a negative number when all threads are active. @Artem-B is there a correct way to assert that this is unsigned so it does the correct thing?

@jhuber6
Copy link
Contributor Author

jhuber6 commented Jan 29, 2024

Scratch that, I missed Ui in the builtin definition. I'll do a quick fix.

@jhuber6 jhuber6 merged commit 51379a9 into llvm:main Jan 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:X86 clang:headers Headers provided by Clang, e.g. for intrinsics clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants