[SYCL][CUDA][libclc] Add atomic loads and stores with various memory orders and scopes #5191


Merged 5 commits on Jan 17, 2022

Conversation

Contributor

@t4c1 t4c1 commented Dec 21, 2021

Adds atomic loads and stores with various memory orders and scopes. These are implemented by adding intrinsics and builtins for PTX loads and stores.
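The builtins added here expose acquire/release-style orderings at the PTX level. As an illustration of the semantics involved (not the patch itself), here is a minimal C++ sketch using std::atomic as a stand-in for the PTX ld.acquire/st.release instructions; the function names are invented for the example:

```cpp
#include <atomic>

// Stand-ins for the ordering guarantees the new builtins expose:
// a release store makes prior writes visible to an acquire load
// that observes the stored value.
std::atomic<int> flag{0};
int payload = 0;

void producer() {
  payload = 42;                              // plain write
  flag.store(1, std::memory_order_release);  // ~ st.release
}

bool consumer() {
  if (flag.load(std::memory_order_acquire) == 1)  // ~ ld.acquire
    return payload == 42;  // guaranteed visible after the acquire load
  return false;
}
```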

Tests for this are here: intel/llvm-test-suite#648

This blocks #4853 (but it is not the only blocker).

@t4c1
Contributor Author

t4c1 commented Dec 21, 2021

This test failure does not look connected to the changes in this PR, and I cannot reproduce it locally.

denis-kabanov added a commit to denis-kabanov/llvm-test-suite that referenced this pull request Dec 21, 2021
Disable Basic/get_backend test due to sporadic failures in prs:
intel/llvm#5191
intel/llvm#5121
bader pushed a commit to intel/llvm-test-suite that referenced this pull request Dec 22, 2021
Disable Basic/get_backend test due to sporadic failures in prs:
intel/llvm#5191
intel/llvm#5121
@bader
Contributor

bader commented Dec 22, 2021

/verify with intel/llvm-test-suite#648

bader previously approved these changes Dec 22, 2021
Contributor

@bader bader left a comment


libclc part looks good to me.
It would be good to review LLVM intrinsics with @Artem-B.
You can either create another review request on reviews.llvm.org or extend https://reviews.llvm.org/D112718 and ping reviewers.

Contributor

@Artem-B Artem-B left a comment


Looks OK to me, modulo a few nits.

Do you plan to upstream these changes?

BUILTIN(__nvvm_volatile_ld##ADDR_SPACE##_d, "ddD*", "n")

#define LD_BUILTIN_TYPES(ORDER, SCOPE, ADDR_SPACE) \
TARGET_BUILTIN(__nvvm##ORDER##SCOPE##_ld##ADDR_SPACE##_i, "iiD*", "n", SM_70) \
Contributor


I think it would make things a bit more orderly if the builtins used a __nvvm_ld/__nvvm_st prefix and put the modifiers as suffixes in the same order they are used in the PTX instructions the builtins translate into.

E.g. __nvvm_ld_shared_acquire_gpu_i() -> ld.shared.acquire.gpu.u32
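In other words, the suggested scheme derives the builtin name mechanically from the modifiers of the PTX mnemonic, in the same order. A hypothetical helper (purely illustrative, not part of clang) makes the mapping explicit:

```cpp
#include <string>

// Hypothetical name builder for the suggested scheme: the operation,
// then the modifiers in the same order they appear in the PTX mnemonic.
std::string builtinName(const std::string &op, const std::string &space,
                        const std::string &order, const std::string &scope,
                        const std::string &type) {
  return "__nvvm_" + op + "_" + space + "_" + order + "_" + scope + "_" + type;
}
```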

Contributor Author

@t4c1 t4c1 Dec 23, 2021


That would be nice to have, but I would argue that being consistent with the other builtins is more important. I was looking in particular at the atomic read-modify-write operations I worked on before, where we have, for example, __nvvm_atom_acq_rel_sys_min_shared_i.

Contributor


OK. We can live with the order of modifiers matching that of __nvvm_atom.

That said, __nvvm_atom does have the instruction mnemonic atom in front and so should these ld builtins.

Contributor Author


My understanding is that these have atom at the start because they also have atom in the PTX, which is not true for ld and st instructions.

Comment on lines 17434 to 17449
auto MakeScopedLd = [&](unsigned IntrinsicID) {
  Value *Ptr = EmitScalarExpr(E->getArg(0));
  auto tmp = Builder.CreateCall(
      CGM.getIntrinsic(IntrinsicID, {Ptr->getType()->getPointerElementType(),
                                     Ptr->getType()}),
      {Ptr});
  return tmp;
};
auto MakeScopedSt = [&](unsigned IntrinsicID) {
  Value *Ptr = EmitScalarExpr(E->getArg(0));
  return Builder.CreateCall(
      CGM.getIntrinsic(
          IntrinsicID,
          {Ptr->getType(), Ptr->getType()->getPointerElementType()}),
      {Ptr, EmitScalarExpr(E->getArg(1))});
};
Contributor


The pointer is in the generic AS. We may want to addrspacecast it into the specific AS when we generate the calls to the AS-specific builtins. That would allow the compiler to eliminate the unnecessary ASC to the generic AS that it inserts when the address of a variable is taken.

// CHECK: call i32 @llvm.nvvm.ld.gen.i.volatile.i32.p0i32
__nvvm_volatile_ld_gen_i(ip);
// CHECK: call i32 @llvm.nvvm.ld.global.i.volatile.i32.p1i32
__nvvm_volatile_ld_global_i((__attribute__((address_space(1))) int *)ip);
Contributor


Interesting. The builtins are declared with the pointer in the generic AS, yet here we're passing a pointer to the global AS and it does get us the right intrinsic overload. This is convenient.

The downside of this is that unless user explicitly casts the pointer to a specific AS, we'll likely have unnecessary ASC to the generic AS, as I've mentioned above. It's not a bug as accesses via generic AS are OK, but rather an optimization opportunity.

Contributor Author


I am not sure I follow your thinking. This is lowered into a PTX instruction that uses global (not generic) addressing. Are you saying that the compiler will cast this pointer to the generic AS and then back to the global AS? Also, what does ASC stand for?

Contributor

@Artem-B Artem-B Dec 23, 2021


I am not sure I follow your thinking.

My fault. I should not do code reviews at 1am. Please ignore the second paragraph.

This is lowered into a PTX instruction that uses global (not generic) addressing.

This is the "interesting" part. Because the argument has been explicitly marked as being in global AS, things do work exactly as intended. In a way you've created an equivalent of a function overload for the builtin for pointers in different AS and that may allow clang to accept more than it should.

Are you saying that compiler will cast this pointer to generic AS and than back to global AS?

Normally, yes. See https://godbolt.org/z/aG1bd45ez.

For the calls of load the pointer gets cast to global AS and then back to generic, before passing it to a function. You can see it in the IR, as well. Here you're probably saved by the fact that you handle call generation yourself, but you do sort of break the rules in the process. I.e. builtins with generic AS pointer arguments should not accept pointers in the specific AS. The test above demonstrates that these builtins do.

I'm curious what's going to happen if you actually call __nvvm_volatile_ld_global_i(ip);.
Technically, it does match the builtin signature, so clang will accept this. The way builtin code gets called, it will likely end up calling @llvm.nvvm.ld.global.i.volatile.i32.p0i32 that would probably translate into ld.volatile instruction with a generic pointer as an argument. I do not think it's a desired outcome.

I think builtins should have signatures with the pointers in the matching AS. This way clang will catch attempts to pass a wrong-AS pointer to them.

Also, what does ASC stand for?

ASC = addrspacecast LLVM IR instruction.

Contributor Author


For the calls of load the pointer gets cast to global AS and then back to generic, before passing it to a function.

I don't see any such casts to global and back to generic in the generated PTX, neither in the godbolt example you linked nor locally when I tried something similar with the new builtins.

I'm curious what's going to happen if you actually call __nvvm_volatile_ld_global_i(ip);.
Technically, it does match the builtin signature, so clang will accept this. The way builtin code gets called, it will likely end up calling @llvm.nvvm.ld.global.i.volatile.i32.p0i32 that would probably translate into ld.volatile instruction with a generic pointer as an argument. I do not think it's a desired outcome.

That is exactly what happens.

I think builtins should have signatures with the pointers in the matching AS. This way clang will catch attempts to pass a wrong-AS pointer to them.

I cannot easily specify which AS the argument must be in when declaring the builtins. While setting the address space numbers in builtin declarations works for CUDA, it does not work for OpenCL, which is our use case for this. I could try to emit an error during code generation if a builtin is called with an AS other than the one it accepts.
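The check mentioned here could look roughly like the following abstract model (the enum and helper are invented for illustration; a real implementation would inspect the pointer's address space in the clang AST during code generation):

```cpp
// Abstract model of the proposed diagnostic: accept the call only when
// the argument's address space matches the one the builtin expects.
// A generic-AS builtin can accept any pointer, since access through the
// generic AS is always legal; an AS-specific builtin needs an exact match.
enum class AddrSpace { Generic, Global, Shared };

bool addrSpaceMatches(AddrSpace builtinExpects, AddrSpace argHas) {
  return builtinExpects == AddrSpace::Generic || builtinExpects == argHas;
}
```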

Contributor Author


I would suggest that we merge this as it is to unblock other atomic-related PRs.

Once we figure out how to enable casting between target and OpenCL language address spaces, we can come back to this and change the builtins in the way suggested in this thread. Regardless of what we decide here, this will need to be done for the atomic read-modify-write builtins that have already been merged.

Contributor


I'm okay with that plan if @Artem-B has no objections.

Contributor


Committing the changes as is to unblock other work sounds reasonable to me, provided that the AS-specific builtins API is fixed before the changes get merged upstream or released to a wider audience.

@bader
Contributor

bader commented Dec 23, 2021

Do you plan to upstream these changes?

I think it would be great to contribute these changes to llvm-project.
LLVM/Clang changes can be committed right away, but libclc changes rely on other work, which should be upstreamed first.
This is the second patch adding NVPTX intrinsics. The first patch, https://reviews.llvm.org/D112718, is already up for review, and we would appreciate your help with it (it has been waiting for almost two months).

@Artem-B
Contributor

Artem-B commented Dec 23, 2021

First patch https://reviews.llvm.org/D112718 is already up for review and we would appreciate your help with it (it's been waiting for almost two months).

I do not think I've seen it. In the future, please do ping the review and explicitly mention reviewers if there's no activity. There are a lot of review emails flying around, and sometimes things do fall through the cracks. I'm off for the holidays, but I'll take a look when I'm back.

@t4c1 t4c1 requested review from a team as code owners January 13, 2022 09:20
Contributor

@elizabethandrews elizabethandrews left a comment


Just commented on a nit.

I am not familiar with this functionality, so I leave the review to @bader.

bader previously approved these changes Jan 14, 2022
Contributor

@elizabethandrews elizabethandrews left a comment


Approving to unblock, since the code looks OK to me, but please wait for approval from @bader or someone else who is familiar with this functionality.

@bader
Contributor

bader commented Jan 17, 2022

@t4c1, I'm a bit lost in the PR dependencies, so I need your help here to avoid merging something that will break CI checks.

@t4c1
Contributor Author

t4c1 commented Jan 17, 2022

Merging this will not break anything.

intel/llvm-test-suite#648 just tests new features added by this PR and does indeed need #4853 merged first.

Once this and #5192 are merged, #4853 will be unblocked, so that one can be merged.

To summarize: atomic PRs should be merged in the order:

EDIT: added #5192

@bader bader merged commit e15ac50 into intel:sycl Jan 17, 2022
@t4c1 t4c1 deleted the ptx_atomic_ldst branch March 15, 2022 08:51