[SYCL][CUDA][HIP] Fix enable-global-offset flag #11674
Conversation
Apart from a few comments, this LGTM
Unfortunately, and most likely, the dpcpp-tools team doesn't have a member who is familiar with this pass.
You're replacing calls to the intrinsics and uses of their results with known constants. Thus, the commit message needs clarifying and expanding; otherwise the code and tests look really good.
Once the PR description and commit message are fixed, this LGTM.
Force-pushed from 60a65d8 to 41c0fb8.
Force-pushed from 41c0fb8 to 67a89c9.
The uses of each global-offset intrinsic are fully removed from the kernel when `-enable-global-offset=false`. Afterwards, the offset intrinsic is removed as well.
Instead of simply replacing all uses of the global offset intrinsics, remove their CallInsts and the GEPs and Loads that use them. Uses of each Load are replaced by constant zeros. This ensures that no usage fragments are left in the kernel after the global offset pass is completed.
The test cases for AMDGPU and NVPTX check whether all GEPs, Loads, and the implicit offset intrinsic have been removed.
llvm/test/CodeGen/NVPTX/global-offset-removal.ll and llvm/test/CodeGen/AMDGPU/global-offset-removal.ll are not supposed to pass when the optimization would simply delete the Load uses.
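To make the mechanism concrete, here is a minimal sketch (not the code from this PR; all names are illustrative) of how the collected Loads can be rewired to zero before the dead GEPs and intrinsic calls are erased:

```cpp
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Sketch only: assumes the Loads plus the GEPs/intrinsic calls feeding them
// were already collected (e.g. by a traversal like the getLoads helper
// shown further down).
static void eraseImplicitOffsetUses(SmallVectorImpl<LoadInst *> &Loads,
                                    SmallVectorImpl<Instruction *> &Chain) {
  // In the non-offset kernel every loaded offset component is known to be 0,
  // so forward that constant to all users of each Load first.
  for (LoadInst *L : Loads) {
    L->replaceAllUsesWith(ConstantInt::get(L->getType(), 0));
    L->eraseFromParent();
  }
  // With the Loads gone, the GEPs and the intrinsic call are dead; erase them
  // users-before-definitions, i.e. in reverse collection order.
  for (Instruction *I : reverse(Chain))
    I->eraseFromParent();
}
```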
Force-pushed from 67a89c9 to 0fef339.
ping @intel/llvm-gatekeepers to get this merged
Approval from @maksimsab is required.
@steffenlarsen @maksimsab already gave their approval, if I have understood it correctly:
(Resolved discussion above)
@@ -59,11 +59,21 @@ ModulePass *llvm::createGlobalOffsetPassLegacy() {
  return new GlobalOffsetLegacy();
}

// Recursive helper function to collect Loads from GEPs in a BFS fashion.
static void getLoads(Instruction *P, SmallVectorImpl<Instruction *> &Traversed,
nit: it was a bit confusing with the name 'Traversed'. Can we not just call this 'PtrUses' as well?
Also, the function name seems a bit incomplete. Maybe getLoadsAndGEPs?
I prefer the name Traversed here honestly, since it captures the approach that this function follows a bit better. It does a BFS for finding Loads and GEPs while keeping track of the already traversed instructions.
The name getLoads focuses on the goal of this function: to collect the Loads in a small vector. The traversal of GEPs could be seen more as a byproduct.
// Recursive helper function to collect Loads from GEPs in a BFS fashion.
static void getLoads(Instruction *P, SmallVectorImpl<Instruction *> &Traversed,
                     SmallVectorImpl<LoadInst *> &Loads) {
  Traversed.push_back(P);
Maybe an assert at the very top of this function body to check for *P being either a Load or a GEP would be better?
I mean, the check is already taking place in the dyn_cast<LoadInst>(P), and there is the assertion if the else branch is taken: assert(isa<GetElementPtrInst>(*P)).
If the instruction is not a Load or a GEP, the code will also fail on that assert without another check in the first line, so I think it would add a sort of redundancy here.
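For reference, a rough sketch of what the body of the helper under discussion could look like, pieced together from the fragments quoted above; the actual PR code may differ in its details:

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Instructions.h"
#include <cassert>

using namespace llvm;

// Recursive helper to collect Loads reachable from P, where P is expected to
// be either a LoadInst or a GetElementPtrInst rooted at the implicit-offset
// intrinsic (sketch; see the diff above for the real signature).
static void getLoads(Instruction *P, SmallVectorImpl<Instruction *> &Traversed,
                     SmallVectorImpl<LoadInst *> &Loads) {
  Traversed.push_back(P);
  if (auto *LI = dyn_cast<LoadInst>(P)) {
    // A Load terminates this branch of the traversal.
    Loads.push_back(LI);
  } else {
    // Anything that is not a Load must be a GEP; keep walking its users
    // (assumed to be instructions in this sketch).
    assert(isa<GetElementPtrInst>(*P));
    for (User *U : P->users())
      getLoads(cast<Instruction>(U), Traversed, Loads);
  }
}
```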
LGTM. Just some minor nits. I can approve on behalf of DPC++ Tools. Thanks
Thank you for pointing that out, @MartinWehking! When change requests are made, formal approval is normally needed for gatekeepers to merge the PR. Some gatekeepers can still merge despite it, but generally we try to avoid it as much as possible. Please address @asudarsa's comments and I will happily invoke those powers, given the information. 😄
LGTM
Fixes #10624 (we think)
Extend #11674 by modifying the globaloffset optimization pass to always replace uses of Loads from the `llvm.nvvm.implicit.offset` and `llvm.amdgcn.implicit.offset` intrinsics with constant zeros in the original non-offset kernel. Hence, perform the optimization even when `-enable-global-offset=true` (default). Recursively duplicate functions containing calls to the implicit offset intrinsic and let the implicit offset kernel entry point call only the original functions (i.e. do not call the functions with added offset arguments). Remove zero allocations for the original kernel entry points.
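To illustrate the duplication step mentioned here, a rough sketch using LLVM's generic cloning utility; the helper name and the `_with_offset` suffix are made up for illustration, and the actual pass additionally rewrites the offset-taking variant rather than just cloning:

```cpp
#include "llvm/ADT/Twine.h"
#include "llvm/IR/Function.h"
#include "llvm/Transforms/Utils/Cloning.h"

using namespace llvm;

// Illustrative only: clone a function that (transitively) calls the implicit
// offset intrinsic, so the offset-taking entry point can call the clone while
// the original kernel keeps the zero-offset body.
static Function *cloneForOffsetEntryPoint(Function &F) {
  ValueToValueMapTy VMap;                        // old value -> cloned value
  Function *Clone = CloneFunction(&F, VMap);     // inserted into F's module
  Clone->setName(F.getName() + "_with_offset");  // hypothetical suffix
  return Clone;
}
```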
Modify the SYCL globaloffset pass to remove calls to the `llvm.nvvm.implicit.offset` and `llvm.amdgcn.implicit.offset` intrinsics from the IR when `-enable-global-offset=false`. Remove their respective uses, i.e. GEPs and Loads, and replace further uses of the latter with 0 constants.
Ensure that these intrinsics no longer occur during target lowering. Previously, in some cases a compilation error was thrown because the intrinsic could not be selected for the AMDGPU and NVPTX targets.
Based on inspection of the IR, calls to the intrinsic were probably expected to be fully removed after the globaloffset pass.
Replace Loads from the intrinsic with known constants and enable further optimization of the IR to remove dead code. In the cases we observed, several kernels with an implicit global offset contained useless stores to the stack that were not removed.
Fixes #10624 (we think)
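For context, `-enable-global-offset` is an LLVM command-line option, so it reaches the middle end e.g. via `-mllvm -enable-global-offset=false`. A minimal sketch of how such an option is typically declared; the actual declaration in the GlobalOffset pass sources may use a different default or description:

```cpp
#include "llvm/Support/CommandLine.h"

using namespace llvm;

// Illustrative declaration only. The pass can branch on this flag to decide
// between building the offset-taking kernel variants and erasing the
// implicit-offset intrinsic together with all of its uses.
static cl::opt<bool> EnableGlobalOffset(
    "enable-global-offset", cl::init(true),
    cl::desc("Enable SYCL global offset intrinsic handling"));
```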