[SYCL][CUDA][libclc] Add atomic loads and stores with various memory orders and scopes #5191
Conversation
This test failure does not look connected to the changes in this PR. Also, I cannot reproduce it locally.
Disable Basic/get_backend test due to sporadic failures in prs: intel/llvm#5191 intel/llvm#5121
/verify with intel/llvm-test-suite#648
The `libclc` part looks good to me.
It would be good to review LLVM intrinsics with @Artem-B.
You can either create another review request on reviews.llvm.org or extend https://reviews.llvm.org/D112718 and ping reviewers.
Looks OK to me, modulo a few nits.
Do you plan to upstream these changes?
```c
BUILTIN(__nvvm_volatile_ld##ADDR_SPACE##_d, "ddD*", "n")

#define LD_BUILTIN_TYPES(ORDER, SCOPE, ADDR_SPACE) \
TARGET_BUILTIN(__nvvm##ORDER##SCOPE##_ld##ADDR_SPACE##_i, "iiD*", "n", SM_70) \
```
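For context, a sketch of how this macro would expand, assuming the token spellings `_acquire`, `_gpu`, and `_gen` (the actual invocation sites are not visible in this hunk):

```c
/* Hypothetical invocation; the ORDER/SCOPE/ADDR_SPACE spellings are
   assumptions based on the naming discussion below. */
LD_BUILTIN_TYPES(_acquire, _gpu, _gen)
/* The int overload shown above would then expand to: */
TARGET_BUILTIN(__nvvm_acquire_gpu_ld_gen_i, "iiD*", "n", SM_70)
```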
I think it would make things a bit more orderly if the builtins used an `__nvvm_ld`/`__nvvm_st` prefix and put the modifiers as suffixes in the same order they are used in the PTX instructions the builtins translate into. E.g. `__nvvm_ld_shared_acquire_gpu_i()` -> `ld.shared.acquire.gpu.u32`.
That would be nice to have, but I would argue that being consistent with other builtins is more important. I was looking especially at the atomic read-modify-write operations I worked on before, where we have, for example, `__nvvm_atom_acq_rel_sys_min_shared_i`.
OK. We can live with the order of modifiers matching that of `__nvvm_atom`.
That said, `__nvvm_atom` does have the instruction mnemonic `atom` in front, and so should these `ld` builtins.
My understanding is that these have `atom` at the start because they also have `atom` in the PTX, which is not true for `ld` and `st` instructions.
clang/lib/CodeGen/CGBuiltin.cpp
```cpp
auto MakeScopedLd = [&](unsigned IntrinsicID) {
  Value *Ptr = EmitScalarExpr(E->getArg(0));
  auto tmp = Builder.CreateCall(
      CGM.getIntrinsic(IntrinsicID, {Ptr->getType()->getPointerElementType(),
                                     Ptr->getType()}),
      {Ptr});
  return tmp;
};
auto MakeScopedSt = [&](unsigned IntrinsicID) {
  Value *Ptr = EmitScalarExpr(E->getArg(0));
  return Builder.CreateCall(
      CGM.getIntrinsic(
          IntrinsicID,
          {Ptr->getType(), Ptr->getType()->getPointerElementType()}),
      {Ptr, EmitScalarExpr(E->getArg(1))});
};
```
The pointer is in generic AS. We may want to `addrspacecast` it into the specific AS when we generate the calls to the AS-specific builtins. That would allow the compiler to eliminate the unnecessary ASC to generic AS it uses when the address of a variable is taken.
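A minimal sketch of that suggestion applied to `MakeScopedLd`, assuming the target address space for the chosen intrinsic is known at this point (`DestAS` below is hypothetical, not part of the patch):

```cpp
auto MakeScopedLd = [&](unsigned IntrinsicID, unsigned DestAS) {
  // Sketch only: DestAS is a hypothetical per-builtin address space number
  // (e.g. 1 for global on NVPTX). Casting up front lets the optimizer drop
  // the generic-AS round trip instead of keeping the pointer generic.
  Value *Ptr = EmitScalarExpr(E->getArg(0));
  llvm::Type *ElemTy = Ptr->getType()->getPointerElementType();
  Value *Cast = Builder.CreateAddrSpaceCast(
      Ptr, llvm::PointerType::get(ElemTy, DestAS));
  return Builder.CreateCall(
      CGM.getIntrinsic(IntrinsicID, {ElemTy, Cast->getType()}), {Cast});
};
```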
```c
// CHECK: call i32 @llvm.nvvm.ld.gen.i.volatile.i32.p0i32
__nvvm_volatile_ld_gen_i(ip);
// CHECK: call i32 @llvm.nvvm.ld.global.i.volatile.i32.p1i32
__nvvm_volatile_ld_global_i((__attribute__((address_space(1))) int *)ip);
```
Interesting. The builtins are declared with the pointer in the generic AS, yet here we're passing a pointer to the global AS and it does get us the right intrinsic overload. This is convenient.
The downside of this is that unless user explicitly casts the pointer to a specific AS, we'll likely have unnecessary ASC to the generic AS, as I've mentioned above. It's not a bug as accesses via generic AS are OK, but rather an optimization opportunity.
I am not sure I follow your thinking. This is lowered into a PTX instruction that uses global (not generic) addressing. Are you saying that the compiler will cast this pointer to generic AS and then back to global AS? Also, what does ASC stand for?
> I am not sure I follow your thinking.

My fault. I should not do code reviews at 1am. Please ignore the second paragraph.

> This is lowered into a PTX instruction that uses global (not generic) addressing.

This is the "interesting" part. Because the argument has been explicitly marked as being in global AS, things do work exactly as intended. In a way, you've created an equivalent of a function overload for the builtin for pointers in different AS, and that may allow clang to accept more than it should.

> Are you saying that the compiler will cast this pointer to generic AS and then back to global AS?

Normally, yes. See https://godbolt.org/z/aG1bd45ez.
For the calls of `load`, the pointer gets cast to global AS and then back to generic before passing it to a function. You can see it in the IR as well. Here you're probably saved by the fact that you handle call generation yourself, but you do sort of break the rules in the process. I.e. builtins with generic AS pointer arguments should not accept pointers in a specific AS. The test above demonstrates that these builtins do.
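For illustration, a minimal CUDA sketch of the kind of round trip being described (not the exact godbolt example linked above):

```cpp
// Sketch only. `load` takes a pointer in the generic AS.
__device__ int load(const int *p) { return *p; }

__device__ int g; // a __device__ variable lives in the global AS

__device__ int f() {
  // In the emitted IR, &g is addrspacecast from addrspace(1) to generic
  // for the call; once `load` is inlined, the access may be cast back.
  return load(&g);
}
```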
I'm curious what's going to happen if you actually call `__nvvm_volatile_ld_global_i(ip);`.
Technically, it does match the builtin signature, so clang will accept this. The way the builtin code gets called, it will likely end up calling `@llvm.nvvm.ld.global.i.volatile.i32.p0i32`, which would probably translate into an `ld.volatile` instruction with a generic pointer as an argument. I do not think that's a desired outcome.
I think builtins should have signatures with the pointers in the matching AS. This way clang will catch attempts to pass a wrong-AS pointer to them.

> Also, what does ASC stand for?

ASC = the `addrspacecast` LLVM IR instruction.
> For the calls of `load`, the pointer gets cast to global AS and then back to generic before passing it to a function.

I don't see any such casts to global and back to generic in the generated PTX, either in the godbolt example you linked or locally when I tried something similar with the new builtins.

> I'm curious what's going to happen if you actually call `__nvvm_volatile_ld_global_i(ip);`. Technically, it does match the builtin signature, so clang will accept this. The way the builtin code gets called, it will likely end up calling `@llvm.nvvm.ld.global.i.volatile.i32.p0i32`, which would probably translate into an `ld.volatile` instruction with a generic pointer as an argument. I do not think that's a desired outcome.

That is exactly what happens.

> I think builtins should have signatures with the pointers in the matching AS. This way clang will catch attempts to pass a wrong-AS pointer to them.

I cannot easily specify which AS the argument must be in when declaring builtins. While setting the address space numbers in builtin declarations works for CUDA, it does not work for OpenCL, which is our use case for this. I could try to throw an error in code generation if a builtin is called with an AS other than the one it accepts.
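A minimal sketch of what such a code-generation check could look like (`ExpectedAS` is a hypothetical per-builtin value, not part of the patch):

```cpp
Value *Ptr = EmitScalarExpr(E->getArg(0));
// Sketch only: reject a pointer whose AS differs from the (hypothetical)
// address space this builtin is documented to accept.
if (Ptr->getType()->getPointerAddressSpace() != ExpectedAS) {
  CGM.Error(E->getExprLoc(),
            "pointer argument is in the wrong address space for this builtin");
  return llvm::UndefValue::get(ConvertType(E->getType()));
}
```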
I would suggest that we merge this as it is to unblock other atomic-related PRs.
When we figure out how to enable typecasting between target and OpenCL language address spaces, we can come back to this and change the builtins in the way suggested in this thread. Regardless of what we decide here, this will need to be done for the atomic read-modify-write builtins that have already been merged.
I'm okay with that plan if @Artem-B has no objections.
Committing the changes as is to unblock other work sounds reasonable to me, provided that the AS-specific builtins API is fixed before the changes get merged upstream or released to a wider audience.
I think it would be great to contribute these changes to llvm-project.
I do not think I've seen it. In the future, please do ping the review and explicitly mention reviewers if there's no activity. There are a lot of review emails flying around, and sometimes things do fall through the cracks. I'm off for the holidays, but I'll take a look when I'm back.
Just commented a nit.
I am not familiar with this functionality and so I leave the review to @bader.
Approving to unblock, since the code looks OK to me. But please wait for approval from @bader or someone else who is familiar with this functionality.
@t4c1, I'm a bit lost in the PR dependencies, so I need your help here to avoid merging something that will break CI checks.
Merging this will not break anything. intel/llvm-test-suite#648 just tests new features added by this PR and does indeed need #4853 merged first. Once this and #5192 are merged, #4853 will be unblocked, so that one can be merged. To summarize, the atomic PRs should be merged in this order: this PR (#5191) and #5192 first, then #4853, then intel/llvm-test-suite#648.
EDIT: added #5192
Adds atomic loads and stores with various memory orders and scopes. These are implemented by adding intrinsics and builtins for PTX loads and stores.
Tests for this are here: intel/llvm-test-suite#648
This blocks #4853 (but it is not the only blocker).
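For reference, a hedged usage sketch: `__nvvm_volatile_ld_gen_i` appears verbatim in the tests above, while the acquire/release spellings below are inferred from the `LD_BUILTIN_TYPES` macro pattern and may differ in the final patch:

```cpp
// Sketch only; the acquire/release builtin names are inferred, not confirmed.
__device__ int example(int *ip) {
  int a = __nvvm_volatile_ld_gen_i(ip);    // ld.volatile via a generic pointer
  int b = __nvvm_acquire_gpu_ld_gen_i(ip); // inferred: ld.acquire.gpu
  __nvvm_release_gpu_st_gen_i(ip, a + b);  // inferred: st.release.gpu
  return b;
}
```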