[SYCL][CUDA][HIP] Fix enable-global-offset flag #11674

MartinWehking · 2023-10-26T14:23:53Z

Modify the SYCL globaloffset pass so that, when -enable-global-offset=false, it removes calls to the llvm.nvvm.implicit.offset and llvm.amdgcn.implicit.offset intrinsics from the IR.
Remove their respective uses, i.e. GEPs and Loads, and replace further uses of the Loads with zero constants.
This ensures that the intrinsics no longer occur during target lowering.
Previously, compilation failed in some cases because the intrinsic could not be selected for the AMDGPU and NVPTX targets.
Judging from the IR, all calls to the intrinsic were probably expected to be fully removed after the globaloffset pass.
Replacing Loads from the intrinsic with known constants also enables further optimization of the IR to remove dead code.
In the cases we observed, useless stores to the stack were not removed from several kernels with an implicit global offset.
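
To illustrate why a known zero helps later cleanup, here is a minimal sketch on a toy expression tree. It is not the pass itself and uses no LLVM API: `Expr`, `leaf`, `add`, `zeroOffsetLoads`, `simplify` and `offsetAddFoldsAway` are hypothetical names invented for this example, and `simplify` stands in for whatever generic folding runs after the pass.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>

// Hypothetical expression node standing in for an IR value (not an LLVM type).
struct Expr {
  enum Kind { Const, WorkItemId, OffsetLoad, Add };
  Kind kind = Const;
  std::int64_t value = 0;          // meaningful for Const only
  std::unique_ptr<Expr> lhs, rhs;  // meaningful for Add only
};
using ExprPtr = std::unique_ptr<Expr>;

ExprPtr leaf(Expr::Kind kind, std::int64_t value = 0) {
  auto e = std::make_unique<Expr>();
  e->kind = kind;
  e->value = value;
  return e;
}

ExprPtr add(ExprPtr lhs, ExprPtr rhs) {
  auto e = std::make_unique<Expr>();
  e->kind = Expr::Add;
  e->lhs = std::move(lhs);
  e->rhs = std::move(rhs);
  return e;
}

// Step 1: every load of the implicit offset becomes the constant 0.
void zeroOffsetLoads(ExprPtr &e) {
  if (e->kind == Expr::OffsetLoad) {
    e = leaf(Expr::Const, 0);
    return;
  }
  if (e->kind == Expr::Add) {
    zeroOffsetLoads(e->lhs);
    zeroOffsetLoads(e->rhs);
  }
}

bool isZero(const ExprPtr &e) {
  return e->kind == Expr::Const && e->value == 0;
}

// Step 2: a generic cleanup that folds x + 0 and 0 + x down to x.
void simplify(ExprPtr &e) {
  if (e->kind != Expr::Add) return;
  simplify(e->lhs);
  simplify(e->rhs);
  if (isZero(e->rhs)) {
    ExprPtr kept = std::move(e->lhs);
    e = std::move(kept);
  } else if (isZero(e->lhs)) {
    ExprPtr kept = std::move(e->rhs);
    e = std::move(kept);
  }
}

// Usage: id + load(offset) collapses to id once the load is a known zero.
bool offsetAddFoldsAway() {
  ExprPtr e = add(leaf(Expr::WorkItemId), leaf(Expr::OffsetLoad));
  zeroOffsetLoads(e);
  simplify(e);
  return e->kind == Expr::WorkItemId;
}
```

As long as the load stays opaque, the addition has to be kept; once it is a literal zero, ordinary folding can drop it along with anything that only existed to feed it.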

Fixes #10624

(we think)

pasaulais

Apart from a few comments, this LGTM

llvm/lib/SYCLLowerIR/GlobalOffset.cpp

maksimsab · 2023-10-26T15:09:30Z

Unfortunately, the dpcpp-tools team most likely doesn't have a member who is familiar with this pass.
@steffenlarsen @jchlanda I see that you were contributors here. Your opinions would be really helpful here.

llvm/lib/SYCLLowerIR/GlobalOffset.cpp

ldrumm · 2023-10-27T10:07:36Z

You're replacing calls to the intrinsics and uses of their result with known constants.

So the commit message needs clarifying and expanding; otherwise, the code and tests look really good.

ldrumm

Once the PR description and commit message are fixed, this LGTM.

The uses of each global-offset intrinsic are fully removed from the kernel when -enable-global-offset=false. Afterwards, the offset intrinsic is removed as well.

Instead of simply replacing all uses of the global offset intrinsics, remove all of their CallInsts together with the GEPs and Loads that use them. Uses of each Load are replaced by constant zeros. This ensures that no usage fragments are left in the kernel after the global offset pass is completed.

The test cases for AMDGPU and NVPTX check that all GEPs, Loads and the implicit offset intrinsic have been removed.

llvm/test/CodeGen/NVPTX/global-offset-removal.ll and llvm/test/CodeGen/AMDGPU/global-offset-removal.ll are meant to fail if the optimization merely deletes the uses of the Loads.

llvm/lib/SYCLLowerIR/GlobalOffset.cpp

MartinWehking · 2023-11-08T10:42:23Z

ping @intel/llvm-gatekeepers to get this merged

steffenlarsen · 2023-11-08T12:27:00Z

Approval from @maksimsab is required.

MartinWehking · 2023-11-08T15:11:01Z

> Approval from @maksimsab is required.

@steffenlarsen @maksimsab already gave their approval, if I have understood it correctly:

> I will be OOO until 13-th of November. Please, if you decide to fix that then don't wait for my LGTM. Just go ahead.

(Resolved discussion above)

asudarsa · 2023-11-08T15:24:51Z

llvm/lib/SYCLLowerIR/GlobalOffset.cpp

@@ -59,11 +59,21 @@ ModulePass *llvm::createGlobalOffsetPassLegacy() {
  return new GlobalOffsetLegacy();
 }

+// Recursive helper function to collect Loads from GEPs in a BFS fashion.
+static void getLoads(Instruction *P, SmallVectorImpl<Instruction *> &Traversed,


nit: the name 'Traversed' was a bit confusing. Could we call this 'PtrUses' as well?
Also, the function name seems a bit incomplete. Maybe getLoadsAndGEPs?

Honestly, I prefer the name Traversed here, since it captures the approach this function follows a bit better.
It does a BFS to find Loads and GEPs while keeping track of the instructions already traversed.
The function name focuses on the goal of this function: to collect the Loads in a small vector.
The traversal of GEPs can be seen more as a byproduct.

asudarsa · 2023-11-08T15:26:11Z

llvm/lib/SYCLLowerIR/GlobalOffset.cpp

+// Recursive helper function to collect Loads from GEPs in a BFS fashion.
+static void getLoads(Instruction *P, SmallVectorImpl<Instruction *> &Traversed,
+                     SmallVectorImpl<LoadInst *> &Loads) {
+  Traversed.push_back(P);


Maybe an assert at the very top of this function body, checking that *P is either a Load or a GEP, would be better?

The check already takes place in the dyn_cast<LoadInst>(P) and, if the else branch is taken, in the assertion assert(isa<GetElementPtrInst>(*P));.
If the instruction is neither a Load nor a GEP, the code will already fail without another assert in the first line, so I think adding one would be redundant here.
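
The shape of the helper discussed in the two threads above can be sketched on a toy graph. This is an illustration under stated assumptions, not the code from GlobalOffset.cpp: `Node` is a hypothetical stand-in for an LLVM instruction, plain kind comparisons replace `dyn_cast`/`isa`, and `demoGetLoads` is invented for the example. Only the names `getLoads`, `Traversed` and `Loads` come from the diff.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for llvm::Instruction: only a kind and a user list.
struct Node {
  enum Kind { GEP, Load, Other };
  Kind kind;
  std::vector<Node *> users;
};

// Record P as traversed. A Load is collected and ends the walk; anything else
// must be a GEP, whose users are visited in turn.
void getLoads(Node *P, std::vector<Node *> &Traversed,
              std::vector<Node *> &Loads) {
  Traversed.push_back(P);
  if (P->kind == Node::Load) {
    Loads.push_back(P);
    return;
  }
  // Together with the Load test above, this already rejects any other kind,
  // so a separate assert on entry would repeat the same check.
  assert(P->kind == Node::GEP && "expected a Load or a GEP");
  for (Node *U : P->users)
    getLoads(U, Traversed, Loads);
}

// Usage: gep1 is used by load1 and gep2; gep2 is used by load2.
bool demoGetLoads() {
  Node load1{Node::Load, {}}, load2{Node::Load, {}};
  Node gep2{Node::GEP, {&load2}};
  Node gep1{Node::GEP, {&load1, &gep2}};

  std::vector<Node *> Traversed, Loads;
  getLoads(&gep1, Traversed, Loads);

  return Traversed.size() == 4 && Loads.size() == 2 &&
         Loads[0] == &load1 && Loads[1] == &load2;
}
```

Note that this recursive sketch visits nodes depth-first. `Traversed` ends up holding every visited node in visiting order, while `Loads` holds only the values whose uses need replacing. A caller could walk `Traversed` backwards to erase users before their definitions; that is an assumption about how such a list might be used, not something the thread states.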

asudarsa

LGTM. Just some minor nits. I can approve on behalf of DPC++ Tools. Thanks

steffenlarsen · 2023-11-08T15:55:12Z

> > Approval from @maksimsab is required.
>
> @steffenlarsen @maksimsab already gave their approval if I have understood it correctly:
>
> > I will be OOO until 13-th of November. Please, if you decide to fix that then don't wait for my LGTM. Just go ahead.
>
> (Resolved discussion above)

Thank you for pointing that out, @MartinWehking! When change requests are made, formal approval is normally needed for gatekeepers to merge the PR. Some gatekeepers can still merge despite it, but generally we try to avoid it as much as possible. Please address @asudarsa's comments and I will happily invoke those powers, given the information. 😄

maksimsab

LGTM

JackAKirk · 2023-11-13T18:10:37Z

Fixes #10624

(we think)

Extend #11674 by modifying the globaloffset optimization pass to always replace uses of Loads from the `llvm.nvvm.implicit.offset` and `llvm.amdgcn.implicit.offset` intrinsics with constant zeros in the original non-offset kernel. Hence, perform the optimization even when `-enable-global-offset=true` (default). Recursively duplicate functions containing calls to the implicit offset intrinsic and let the implicit offset kernel entry point only call the original functions (i.e. do not call the functions with added offset arguments). Remove zero allocations for the original kernel entry points.

MartinWehking requested a review from a team as a code owner October 26, 2023 14:23

pasaulais reviewed Oct 26, 2023

jchlanda approved these changes Oct 27, 2023

ldrumm self-requested a review October 27, 2023 10:08

ldrumm requested changes Oct 27, 2023

MartinWehking changed the title ~~[SYCL][CUDA][HIP] Remove global offset intrinsics when explicitly deactivated~~ [SYCL][CUDA][HIP] Remove calls to global offset intrinsics and uses of their result with known constants when explicitly deactivated Oct 27, 2023

MartinWehking force-pushed the global-offset-fix branch from 60a65d8 to 41c0fb8 Compare October 27, 2023 12:31

MartinWehking force-pushed the global-offset-fix branch from 41c0fb8 to 67a89c9 Compare October 27, 2023 12:46

Martin added 8 commits October 27, 2023 13:49

Replace uses of the global-offset intrinsic

a208542

Add test cases for implicit offset optimization

c9d9d1f

Apply suggestions

e7ade02

Add zext check to AMDGPU global offset test

638c4c4

Add comment to clarify usage of reverse function

96bdba9

Apply suggestions

be1ccba

Fix formatting

0fef339

MartinWehking force-pushed the global-offset-fix branch from 67a89c9 to 0fef339 Compare October 27, 2023 12:55

MartinWehking changed the title ~~[SYCL][CUDA][HIP] Remove calls to global offset intrinsics and uses of their result with known constants when explicitly deactivated~~ [SYCL][CUDA][HIP] Fix enable-global-offset flag Oct 27, 2023

pasaulais approved these changes Oct 27, 2023

ldrumm approved these changes Oct 27, 2023

maksimsab requested changes Oct 31, 2023

maksimsab self-requested a review October 31, 2023 17:19

asudarsa reviewed Nov 8, 2023

asudarsa approved these changes Nov 8, 2023

maksimsab approved these changes Nov 13, 2023

aelovikov-intel merged commit 00cf4c2 into intel:sycl Nov 13, 2023

GeorgeWeb mentioned this pull request Nov 13, 2023

[SYCL][HIP] Poor Memory Bandwidth due to Unnecessary Memory Write Traffic, 50% slower than OpenSYCL #10624

Closed

MartinWehking mentioned this pull request Dec 4, 2023

[SYCL] Extend global offset intrinsic removal #11909

Merged
