[SYCL] Extend global offset intrinsic removal #11909
Conversation
Extend intel#11674 by modifying the GlobalOffset optimization pass to always replace uses of loads from the llvm.nvvm.implicit.offset and llvm.amdgcn.implicit.offset intrinsics with constant zeros in the original, non-offset kernel; hence, perform the optimization even when -enable-global-offset=true (the default). Recursively duplicate functions that contain calls to the implicit offset intrinsic, and let the implicit-offset kernel entry point call only the original functions (i.e. do not call the functions with added offset arguments). Remove the zero allocations from the original kernel entry points.
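For illustration, here is a minimal C++ sketch of the load-to-zero rewrite described above, written against generic LLVM APIs. It is not the actual pass code, and the helper name `replaceImplicitOffsetLoadsWithZero` is made up:

```cpp
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Hypothetical helper (not the actual pass code): fold every load reached
// through the given implicit-offset intrinsic declaration to constant zero.
static void replaceImplicitOffsetLoadsWithZero(Function &OffsetIntrinsic) {
  SmallVector<CallInst *, 4> Calls;
  SmallVector<GetElementPtrInst *, 8> GEPs;
  SmallVector<LoadInst *, 8> Loads;

  for (User *U : OffsetIntrinsic.users()) {
    auto *Call = dyn_cast<CallInst>(U);
    if (!Call)
      continue;
    Calls.push_back(Call);
    // The intrinsic returns a pointer to the 3-element offset array; collect
    // every load reached from it, directly or through GEPs.
    SmallVector<User *, 8> Worklist(Call->user_begin(), Call->user_end());
    while (!Worklist.empty()) {
      User *V = Worklist.pop_back_val();
      if (auto *GEP = dyn_cast<GetElementPtrInst>(V)) {
        Worklist.append(GEP->user_begin(), GEP->user_end());
        GEPs.push_back(GEP);
      } else if (auto *LI = dyn_cast<LoadInst>(V)) {
        Loads.push_back(LI);
      }
    }
  }

  // Rewrite the loads first, then erase users before their definitions
  // (GEPs in reverse discovery order, so nested GEPs go before their bases).
  for (LoadInst *LI : Loads) {
    LI->replaceAllUsesWith(Constant::getNullValue(LI->getType()));
    LI->eraseFromParent();
  }
  for (GetElementPtrInst *GEP : reverse(GEPs))
    GEP->eraseFromParent();
  for (CallInst *Call : Calls)
    Call->eraseFromParent();
}
```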
Use std::get to retrieve an element from a tuple that was previously a pair. No longer mention the deleted allocas of zeros.
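As a purely illustrative example of that API change (the tuple contents here are invented, not the patch's actual element types):

```cpp
#include <string>
#include <tuple>

int main() {
  // What was previously a std::pair accessed via .first/.second becomes a
  // std::tuple accessed via std::get<Index>.
  std::tuple<int, std::string> Entry{3, "kernel_with_offset"};
  int NumArgs = std::get<0>(Entry);       // was Entry.first
  std::string Name = std::get<1>(Entry);  // was Entry.second
  return NumArgs == 3 && !Name.empty() ? 0 : 1;
}
```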
Fix an incomplete removal of calls to the implicit offset intrinsics; the global-offset-multiple-calls-from-one-function.ll test highlighted this flaw in the IR. Recurse only into non-cloned functions when adding offset parameters, and obtain the call instructions to be replaced through a global VMap.
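A minimal sketch of the VMap idea, assuming LLVM's standard cloning utilities; the helper below is hypothetical and not the patch's code:

```cpp
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Transforms/Utils/Cloning.h"
#include "llvm/Transforms/Utils/ValueMapper.h"

using namespace llvm;

// Illustrative only: CloneFunction fills VMap with original-value -> clone
// entries, so a call instruction recorded against the original function can
// be looked up in the clone instead of being re-discovered by a traversal.
static CallInst *cloneAndFindCall(Function *Original, CallInst *OrigCall,
                                  ValueToValueMapTy &VMap) {
  Function *Clone = CloneFunction(Original, VMap);
  (void)Clone; // CloneFunction inserts the clone into the module itself
  Value *Mapped = VMap.lookup(OrigCall);
  return cast_or_null<CallInst>(Mapped);
}
```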
Check for functions with and without offset parameters in the NVPTX and AMDGPU global-offset test cases. Take into account that functions that contain calls to the global-offset intrinsic, or that transitively call one of these functions, now have two versions: one with and one without the offset argument. Check that the correct ones are called. Also add a new test case for mixed functions, i.e. functions containing direct calls to the intrinsic alongside functions that transitively call the intrinsic, to check that the recursion inside the global-offset pass handles these correctly.
Copy the global offset correctly to addrspace(3) inside the kernel, and do not perform this operation if the cloned function is not a kernel entry point. Adapt the AMDGPU test cases to account for the copying of the global offset.
Before the global-offset optimization, non-offset and offset kernels called the same function with an added offset argument. The non-offset versions allocated an array of 3 zeros in addrspace(5) and passed it as an argument to the function calls containing calls to the intrinsic. Since allocating and filling the array is not possible in addrspace(4), the offset version adds an extra alloca in addrspace(5) and performs an addrspace cast to addrspace(4). This is no longer required, since the non-offset version no longer performs an alloca of zeros. However, the optimization is rendered almost useless because the global-offset feature is deprecated. Adapt the comment to reflect this.
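A rough sketch of the alloca-plus-addrspacecast pattern described above, with the AMDGPU address-space numbers (5 = private, 4 = constant) hard-coded and the helper name invented for illustration:

```cpp
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Illustrative only: materialize a private (addrspace(5)) copy of the
// 3 x i32 offset array in the kernel entry block and cast it to the
// constant address space (4), mirroring the pattern described above.
static Value *emitPrivateOffsetCopy(Function &Kernel) {
  BasicBlock &Entry = Kernel.getEntryBlock();
  IRBuilder<> Builder(&Entry, Entry.getFirstInsertionPt());

  Type *OffsetTy = ArrayType::get(Builder.getInt32Ty(), 3);
  AllocaInst *Private =
      Builder.CreateAlloca(OffsetTy, /*AddrSpace=*/5, nullptr, "offset.priv");
  return Builder.CreateAddrSpaceCast(
      Private, PointerType::get(Kernel.getContext(), /*AddressSpace=*/4),
      "offset.const");
}
```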
continue;

// Kernel entry points need additional processing and changed metadata.
if (EntryPointMetadata.count(Caller) != 0)
Zero is false.
I don't know if it's a good idea to change it. There are two reasons:
- We're dealing here with natural numbers (counts), not bools
- I didn't introduce the change originally, so it would add more diff
EntryPointMetadata is a map and therefore .count() is often used as contains, i.e. the implicit conversion to bool should be OK in that context. But I agree with (2)
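For reference, a small self-contained example of that equivalence using std::map (the names are invented; LLVM's map-like containers behave the same way for this purpose):

```cpp
#include <cassert>
#include <map>

int main() {
  std::map<int, const char *> EntryPoints{{1, "kernel_a"}};
  // For a map, count() is either 0 or 1, so its implicit conversion to bool
  // is equivalent to contains() (C++20).
  assert(static_cast<bool>(EntryPoints.count(1)) == EntryPoints.contains(1));
  assert(static_cast<bool>(EntryPoints.count(2)) == EntryPoints.contains(2));
  return 0;
}
```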