[SYCL][CUDA][libclc] Add atomic loads and stores with various memory orders and scopes #5191
Conversation
This test failure does not look connected to the changes in this PR. Also, I cannot reproduce it locally.
Disable Basic/get_backend test due to sporadic failures in prs: intel/llvm#5191 intel/llvm#5121
/verify with intel/llvm-test-suite#648
The `libclc` part looks good to me.
It would be good to review LLVM intrinsics with @Artem-B.
You can either create another review request on reviews.llvm.org or extend https://reviews.llvm.org/D112718 and ping reviewers.
Looks OK to me, modulo a few nits.
Do you plan to upstream these changes?
```c
BUILTIN(__nvvm_volatile_ld##ADDR_SPACE##_d, "ddD*", "n")

#define LD_BUILTIN_TYPES(ORDER, SCOPE, ADDR_SPACE) \
TARGET_BUILTIN(__nvvm##ORDER##SCOPE##_ld##ADDR_SPACE##_i, "iiD*", "n", SM_70) \
```
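For context, a sketch of how this macro would expand, assuming the token spellings `_acquire`, `_gpu`, and `_gen` (the actual invocation sites are not visible in this hunk):

```c
/* Hypothetical invocation; the ORDER/SCOPE/ADDR_SPACE spellings are
   assumptions based on the naming discussion below. */
LD_BUILTIN_TYPES(_acquire, _gpu, _gen)
/* The int overload shown above would then expand to: */
TARGET_BUILTIN(__nvvm_acquire_gpu_ld_gen_i, "iiD*", "n", SM_70)
```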
I think it would make things a bit more orderly if the builtins used an `__nvvm_ld`/`__nvvm_st` prefix and put the modifiers as suffixes in the same order they are used in the PTX instructions the builtins translate into. E.g. `__nvvm_ld_shared_acquire_gpu_i()` -> `ld.shared.acquire.gpu.u32`.
That would be nice to have, but I would argue that being consistent with other builtins is more important. I was looking especially at the atomic read-modify-write operations I worked on before, where we have, for example, `__nvvm_atom_acq_rel_sys_min_shared_i`.
OK. We can live with the order of modifiers matching that of `__nvvm_atom`.
That said, `__nvvm_atom` does have the instruction mnemonic `atom` in front, and so should these `ld` builtins.
My understanding is that these have `atom` at the start because they also have `atom` in the PTX, which is not true for `ld` and `st` instructions.
clang/lib/CodeGen/CGBuiltin.cpp
```cpp
auto MakeScopedLd = [&](unsigned IntrinsicID) {
  Value *Ptr = EmitScalarExpr(E->getArg(0));
  auto tmp = Builder.CreateCall(
      CGM.getIntrinsic(IntrinsicID, {Ptr->getType()->getPointerElementType(),
                                     Ptr->getType()}),
      {Ptr});
  return tmp;
};
auto MakeScopedSt = [&](unsigned IntrinsicID) {
  Value *Ptr = EmitScalarExpr(E->getArg(0));
  return Builder.CreateCall(
      CGM.getIntrinsic(
          IntrinsicID,
          {Ptr->getType(), Ptr->getType()->getPointerElementType()}),
      {Ptr, EmitScalarExpr(E->getArg(1))});
};
```
The pointer is in generic AS. We may want to `addrspacecast` it into the specific AS when we generate the calls to the AS-specific builtins. That would allow the compiler to eliminate the unnecessary ASC to generic AS it uses when the address of a variable is taken.
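A minimal sketch of that suggestion applied to `MakeScopedLd`, assuming the target address space for the chosen intrinsic is known at this point (`DestAS` below is hypothetical, not part of the patch):

```cpp
auto MakeScopedLd = [&](unsigned IntrinsicID, unsigned DestAS) {
  // Sketch only: DestAS is a hypothetical per-builtin address space number
  // (e.g. 1 for global on NVPTX). Casting up front lets the optimizer drop
  // the generic-AS round trip instead of keeping the pointer generic.
  Value *Ptr = EmitScalarExpr(E->getArg(0));
  llvm::Type *ElemTy = Ptr->getType()->getPointerElementType();
  Value *Cast = Builder.CreateAddrSpaceCast(
      Ptr, llvm::PointerType::get(ElemTy, DestAS));
  return Builder.CreateCall(
      CGM.getIntrinsic(IntrinsicID, {ElemTy, Cast->getType()}), {Cast});
};
```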
```c
// CHECK: call i32 @llvm.nvvm.ld.gen.i.volatile.i32.p0i32
__nvvm_volatile_ld_gen_i(ip);
// CHECK: call i32 @llvm.nvvm.ld.global.i.volatile.i32.p1i32
__nvvm_volatile_ld_global_i((__attribute__((address_space(1))) int *)ip);
```
Interesting. The builtins are declared with the pointer in the generic AS, yet here we're passing a pointer to the global AS and it does get us the right intrinsic overload. This is convenient.
The downside of this is that unless user explicitly casts the pointer to a specific AS, we'll likely have unnecessary ASC to the generic AS, as I've mentioned above. It's not a bug as accesses via generic AS are OK, but rather an optimization opportunity.
I am not sure I follow your thinking. This is lowered into a PTX instruction that uses global (not generic) addressing. Are you saying that the compiler will cast this pointer to generic AS and then back to global AS? Also, what does ASC stand for?
> I am not sure I follow your thinking.

My fault. I should not do code reviews at 1am. Please ignore the second paragraph.

> This is lowered into a PTX instruction that uses global (not generic) addressing.

This is the "interesting" part. Because the argument has been explicitly marked as being in global AS, things do work exactly as intended. In a way, you've created an equivalent of a function overload for the builtin for pointers in different AS, and that may allow clang to accept more than it should.

> Are you saying that the compiler will cast this pointer to generic AS and then back to global AS?

Normally, yes. See https://godbolt.org/z/aG1bd45ez.
For the calls of `load`, the pointer gets cast to global AS and then back to generic before passing it to a function. You can see it in the IR as well. Here you're probably saved by the fact that you handle call generation yourself, but you do sort of break the rules in the process. I.e. builtins with generic AS pointer arguments should not accept pointers in a specific AS. The test above demonstrates that these builtins do.
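For illustration, a minimal CUDA sketch of the kind of round trip being described (not the exact godbolt example linked above):

```cpp
// Sketch only. `load` takes a pointer in the generic AS.
__device__ int load(const int *p) { return *p; }

__device__ int g; // a __device__ variable lives in the global AS

__device__ int f() {
  // In the emitted IR, &g is addrspacecast from addrspace(1) to generic
  // for the call; once `load` is inlined, the access may be cast back.
  return load(&g);
}
```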
I'm curious what's going to happen if you actually call `__nvvm_volatile_ld_global_i(ip);`.
Technically, it does match the builtin signature, so clang will accept this. The way the builtin code gets called, it will likely end up calling `@llvm.nvvm.ld.global.i.volatile.i32.p0i32`, which would probably translate into an `ld.volatile` instruction with a generic pointer as an argument. I do not think that's a desired outcome.
I think builtins should have signatures with the pointers in the matching AS. This way clang will catch attempts to pass a wrong-AS pointer to them.

> Also, what does ASC stand for?

ASC = the `addrspacecast` LLVM IR instruction.
> For the calls of `load`, the pointer gets cast to global AS and then back to generic before passing it to a function.

I don't see any such casts to global and back to generic in the generated PTX, either in the godbolt example you linked or locally when I tried something similar with the new builtins.

> I'm curious what's going to happen if you actually call `__nvvm_volatile_ld_global_i(ip);`. Technically, it does match the builtin signature, so clang will accept this. The way the builtin code gets called, it will likely end up calling `@llvm.nvvm.ld.global.i.volatile.i32.p0i32`, which would probably translate into an `ld.volatile` instruction with a generic pointer as an argument. I do not think that's a desired outcome.

That is exactly what happens.

> I think builtins should have signatures with the pointers in the matching AS. This way clang will catch attempts to pass a wrong-AS pointer to them.

I cannot easily specify which AS the argument must be in when declaring builtins. While setting the address space numbers in builtin declarations works for CUDA, it does not work for OpenCL, which is our use case for this. I could try to throw an error in code generation if a builtin is called with an AS other than the one it accepts.
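A minimal sketch of what such a code-generation check could look like (`ExpectedAS` is a hypothetical per-builtin value, not part of the patch):

```cpp
Value *Ptr = EmitScalarExpr(E->getArg(0));
// Sketch only: reject a pointer whose AS differs from the (hypothetical)
// address space this builtin is documented to accept.
if (Ptr->getType()->getPointerAddressSpace() != ExpectedAS) {
  CGM.Error(E->getExprLoc(),
            "pointer argument is in the wrong address space for this builtin");
  return llvm::UndefValue::get(ConvertType(E->getType()));
}
```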
I would suggest that we merge this as it is to unblock other atomic-related PRs.
When we figure out how to enable typecasting between target and OpenCL language address spaces, we can come back to this and change the builtins in the way suggested in this thread. Regardless of what we decide here, this will need to be done for the atomic read-modify-write builtins that have already been merged.
I'm okay with that plan if @Artem-B has no objections.
Committing the changes as is to unblock other work sounds reasonable to me, provided that the AS-specific builtins API is fixed before the changes get merged upstream or released to a wider audience.
I think it would be great to contribute these changes to llvm-project.
I do not think I've seen it. In the future, please do ping the review and explicitly mention reviewers if there's no activity. There are a lot of review emails flying around, and sometimes things do fall through the cracks. I'm off for the holidays, but I'll take a look when I'm back.
Just commented a nit.
I am not familiar with this functionality and so I leave the review to @bader.
Approving to unblock, since the code looks OK to me. But please wait for approval from @bader or someone else who is familiar with this functionality.
@t4c1, I'm a bit lost in the PR dependencies, so I need your help here to avoid merging something that will break CI checks.
Merging this will not break anything. intel/llvm-test-suite#648 just tests new features added by this PR and does indeed need #4853 merged first. Once this and #5192 are merged, #4853 will be unblocked, so that one can be merged. To summarize, the atomic PRs should be merged in this order: this PR (#5191) and #5192 first, then #4853, then intel/llvm-test-suite#648.
EDIT: added #5192
Adds atomic loads and stores with various memory orders and scopes. These are implemented by adding intrinsics and builtins for PTX loads and stores.
Tests for this are here: intel/llvm-test-suite#648
This blocks #4853 (but it is not the only blocker).
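For reference, a hedged usage sketch: `__nvvm_volatile_ld_gen_i` appears verbatim in the tests above, while the acquire/release spellings below are inferred from the `LD_BUILTIN_TYPES` macro pattern and may differ in the final patch:

```cpp
// Sketch only; the acquire/release builtin names are inferred, not confirmed.
__device__ int example(int *ip) {
  int a = __nvvm_volatile_ld_gen_i(ip);    // ld.volatile via a generic pointer
  int b = __nvvm_acquire_gpu_ld_gen_i(ip); // inferred: ld.acquire.gpu
  __nvvm_release_gpu_st_gen_i(ip, a + b);  // inferred: st.release.gpu
  return b;
}
```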