[SYCL][CUDA] support launch bounds #9772
Can you please update the PR description to describe this change and add a test?
Interesting! Does this need to be an attribute, or could we make it a compile-time kernel property instead?
When a SYCL kernel needs a sycl::local_accessor, Clang creates metadata for the local accessor. That metadata somehow overrides the metadata for the bound value (e.g. the maximum number of threads per block). As a result, the bound value is not written to the NVIDIA PTX assembly file, even though it is added by the function "addNVVMMetadata" in clang/lib/CodeGen/TargetInfo.cpp. Do you have any more clues about the issue?
Do you happen to have a repro for it? I'd like to have a look.
Any SYCL program using a SYCL local accessor is a reproducer. Thanks.
@jchlanda Do you know the cause after taking a look?
This has slipped off my radar completely, sorry. Let me have a look.
I can reproduce the error and I think I know what's going on. The metadata is correctly generated, but once the local accessor is added the "maxntidx" node loses its reference to the kernel:
!0 = distinct !{ptr @_ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_E12blockReverse, !"kernel", i32 1}
!1 = distinct !{null, !"maxntidx", i32 256}
For it to be valid, the node has to point at the kernel:
!1 = distinct !{ptr @_ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_E12blockReverse, !"maxntidx", i32 256}
Let me find who's misbehaving here.
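To spot this symptom quickly in a dumped module, a small text scan over the IR may be enough. The following is only a sketch: `findNullKernelAnnotations` is a hypothetical helper, not part of Clang or LLVM, and it assumes the textual IR has one metadata node per line in the form shown above.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not an LLVM API): returns every metadata node line
// whose first operand is `null`, e.g.
//   !1 = distinct !{null, !"maxntidx", i32 256}
// Assumes one metadata node per line, as in the dump quoted above.
std::vector<std::string> findNullKernelAnnotations(const std::string &ir) {
  std::vector<std::string> bad;
  std::istringstream in(ir);
  std::string line;
  while (std::getline(in, line)) {
    // Only consider metadata node definitions such as `!N = distinct !{...}`.
    if (line.empty() || line[0] != '!')
      continue;
    const std::size_t open = line.find("!{");
    if (open == std::string::npos)
      continue;
    // The first operand starts right after "!{"; a null here means the
    // annotation no longer refers to any kernel.
    if (line.compare(open + 2, 5, "null,") == 0)
      bad.push_back(line);
  }
  return bad;
}
```

Running it over the output of the device compilation should return an empty list once the kernel reference is preserved.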
@steffenlarsen @npmiller
Many users have requested the feature. I tried to get started. Help is needed to finish the definition and implementation of the kernel attribute for "minBlocksPerMultiprocessor" to support launch bounds in CUDA. Thanks.
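For context on why both bounds matter: in CUDA, the compiler uses the maximum threads per block together with "minBlocksPerMultiprocessor" to derive an upper limit on registers per thread, so that the requested number of blocks can be resident on one multiprocessor. A rough sketch of that arithmetic follows. The function name `registerBudgetPerThread` is hypothetical, the figure of 65536 registers per multiprocessor is an assumption that depends on the device, and the real compiler also accounts for allocation granularity and per-thread hardware caps.

```cpp
#include <cassert>

// Hypothetical illustration, not a CUDA or SYCL API: approximate upper limit
// on registers per thread so that `minBlocksPerMultiprocessor` blocks of
// `maxThreadsPerBlock` threads fit on one multiprocessor.
// Ignores register allocation granularity and per-thread hardware caps.
constexpr unsigned registerBudgetPerThread(unsigned regsPerMultiprocessor,
                                           unsigned maxThreadsPerBlock,
                                           unsigned minBlocksPerMultiprocessor = 1) {
  // Precondition: both bounds are non-zero.
  return regsPerMultiprocessor /
         (maxThreadsPerBlock * minBlocksPerMultiprocessor);
}

// Assumed device: 65536 registers per multiprocessor.
static_assert(registerBudgetPerThread(65536, 256) == 256, "one block");
static_assert(registerBudgetPerThread(65536, 256, 2) == 128, "two blocks");
```

Doubling the minimum number of resident blocks halves the per-thread register budget, which is why the second bound has to reach the backend alongside the first.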