Skip to content

[SYCL][CUDA][HIP] Fix enable-global-offset flag #11674

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Nov 13, 2023

Conversation

MartinWehking
Copy link
Contributor

@MartinWehking MartinWehking commented Oct 26, 2023

  • Modify the globaloffset pass to remove calls to the llvm.nvvm.implicit.offset and llvm.amdgcn.implicit.offset from the IR during the SYCL globaloffset pass when -enable-global-offset=false.
    Remove their respective uses, i.e. GEPs and Loads and replace further uses of the latter with 0 constants.
    Ensure that these intrinsics do not occur anymore during target lowering.
    Before, in some cases a compilation error was thrown because the intrinsic could not be selected for the AMDGPU and NVPTX targets.
    Based on the inspection of the IR, any calls of the intrinsic were probably expected to be fully removed after the globaloffset pass.

  • Replace Loads from the intrinsic with known constants and enable further optimization of the IR to remove dead code.
    In our observed cases, several kernels with implicit global offset failed to remove useless stores to the stack.

Fixes #10624

(we think)

@MartinWehking MartinWehking requested a review from a team as a code owner October 26, 2023 14:23
@MartinWehking MartinWehking temporarily deployed to WindowsCILock October 26, 2023 14:33 — with GitHub Actions Inactive
Copy link

@pasaulais pasaulais left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from a few comments, this LGTM

@maksimsab
Copy link
Contributor

Unfortunately and most likely, dpcpp-tools team doesn't have a member that is familiar with this pass.
@steffenlarsen @jchlanda I see that you were contributors here. Your opinions would be really helpful here.

@MartinWehking MartinWehking temporarily deployed to WindowsCILock October 26, 2023 15:26 — with GitHub Actions Inactive
@MartinWehking MartinWehking temporarily deployed to WindowsCILock October 27, 2023 09:42 — with GitHub Actions Inactive
@ldrumm
Copy link
Contributor

ldrumm commented Oct 27, 2023

You're replacing calls to the intrinsics and uses of their result with known constants.

Thus, the commit message needs clarifying and expanding, otherwise the code and tests look really good

@ldrumm ldrumm self-requested a review October 27, 2023 10:08
Copy link
Contributor

@ldrumm ldrumm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the PR description and commit message are fixed this LGTM

@MartinWehking MartinWehking temporarily deployed to WindowsCILock October 27, 2023 10:29 — with GitHub Actions Inactive
@MartinWehking MartinWehking changed the title [SYCL][CUDA][HIP] Remove global offset intrinsics when explicitly deactivated [SYCL][CUDA][HIP] Remove calls to global offset intrinsics and uses of their result with known constants when explicitly deactivated Oct 27, 2023
@MartinWehking MartinWehking temporarily deployed to WindowsCILock October 27, 2023 12:32 — with GitHub Actions Inactive
@MartinWehking MartinWehking temporarily deployed to WindowsCILock October 27, 2023 12:47 — with GitHub Actions Inactive
Martin added 8 commits October 27, 2023 13:49
The uses of each global-offset intrinsic are fully removed from the
kernel when -enable-global-offset=false. Afterwards, the offset
intrinsic is removed as well.
Instead of simply replacing all uses of the global offset intrinsics,
remove all its CallInsts and GEPs and Loads that use it.
Usages of each Load are replaced by constant zeros.
This ensures that no usage fragments are left in the kernel
after the global offset pass is completed.
The test cases for the AMDGPU and NVPTX check if all GEPS,
Loads and the implicit offset intrinsic have been removed
llvm/test/CodeGen/NVPTX/global-offset-removal.ll and
llvm/test/CodeGen/AMDGPU/global-offset-removal.ll are not supposed
to pass when the optimization would simply delete the Load usages
@MartinWehking MartinWehking temporarily deployed to WindowsCILock October 27, 2023 13:00 — with GitHub Actions Inactive
@MartinWehking MartinWehking temporarily deployed to WindowsCILock October 27, 2023 13:40 — with GitHub Actions Inactive
@MartinWehking MartinWehking changed the title [SYCL][CUDA][HIP] Remove calls to global offset intrinsics and uses of their result with known constants when explicitly deactivated [SYCL][CUDA][HIP] Fix enable-global-offset flag Oct 27, 2023
@maksimsab maksimsab self-requested a review October 31, 2023 17:19
@MartinWehking
Copy link
Contributor Author

ping @intel/llvm-gatekeepers to get this merged

@steffenlarsen
Copy link
Contributor

Approval from @maksimsab is required.

@MartinWehking
Copy link
Contributor Author

MartinWehking commented Nov 8, 2023

Approval from @maksimsab is required.

@steffenlarsen @maksimsab already gave their approval if I have understood it correctly:

I will be OOO until 13-th of November. Please, if you decide to fix that then don't wait for my LGTM. Just go ahead.

(Resolved discussion above)

@@ -59,11 +59,21 @@ ModulePass *llvm::createGlobalOffsetPassLegacy() {
return new GlobalOffsetLegacy();
}

// Recursive helper function to collect Loads from GEPs in a BFS fashion.
static void getLoads(Instruction *P, SmallVectorImpl<Instruction *> &Traversed,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it was a bit confusing with the name 'Traversed'. Can we not just call this also as 'PtrUses'?
Also, the function name seems a bit incomplete. May be getLoadsAndGEPs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the name traversed here honestly, since it is capturing the approach that this function follows a bit better.
It does a BFS for finding Loads and GEPs while keeping track of the already traversed functions.
The name focuses on the goal of this function: To collect the Loads in a small vector.
The traversal of GEPs could be more seen as a byproduct.

// Recursive helper function to collect Loads from GEPs in a BFS fashion.
static void getLoads(Instruction *P, SmallVectorImpl<Instruction *> &Traversed,
SmallVectorImpl<LoadInst *> &Loads) {
Traversed.push_back(P);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be an assert at the very top of this function body to check for *p being either a load or a GEP might be better?

Copy link
Contributor Author

@MartinWehking MartinWehking Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the check is already taking place in the dyn_cast<LoadInst>(P) and the assertion if the else branch is taken: assert(isa<GetElementPtrInst>(*P));.
If the instruction is not a Load or GEP, the code will also throw without another assert in the first line, so I think it would add a sort of redundancy here.

Copy link
Contributor

@asudarsa asudarsa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just some minor nits. I can approve on behalf of DPC++ Tools. Thanks

@steffenlarsen
Copy link
Contributor

Approval from @maksimsab is required.

@steffenlarsen @maksimsab already gave their approval if I have understood it correctly:

I will be OOO until 13-th of November. Please, if you decide to fix that then don't wait for my LGTM. Just go ahead.

(Resolved discussion above)

Thank you for pointing that out, @MartinWehking ! When change requests are made, formal approval is normally needed for gatekeepers to merge the PR. Some gatekeepers can still merge despite it, but generally we try to avoid it as much as possible. Please address @asudarsa 's comments and I will happily invoke those powers, given the information. 😄

Copy link
Contributor

@maksimsab maksimsab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@JackAKirk
Copy link
Contributor

JackAKirk commented Nov 13, 2023

Fixes #10624

(we think)

@aelovikov-intel aelovikov-intel merged commit 00cf4c2 into intel:sycl Nov 13, 2023
MartinWehking pushed a commit to MartinWehking/llvm that referenced this pull request Nov 14, 2023
Extend intel#11674 by modifying the globaloffset optimization pass to always
replace uses of Loads from the llvm.nvvm.implicit.offset and
llvm.amdgcn.implicit.offset intrinsics with constant zeros in the
original non-offset kernel.

Hence, perform the optimization even when
-enable-global-offset=true (default).

Duplicate recursively functions containing calls to the implicit offset
intrinsic and let the implicit offset kernel entry point only call
the original functions (i.e. do not call the functions with added offset
arguments).

Remove zero allocations for the original kernel entry points.
againull pushed a commit that referenced this pull request Dec 8, 2023
Extend #11674 by modifying the globaloffset optimization pass to always
replace uses of Loads from the `llvm.nvvm.implicit.offset` and
`llvm.amdgcn.implicit.offset intrinsics` with constant zeros in the
original non-offset kernel.
Hence, perform the optimization even when
`-enable-global-offset=true` (default).
Duplicate recursively functions containing calls to the implicit offset
intrinsic and let the implicit offset kernel entry point only call
the original functions (i.e. do not call the functions with added offset
arguments).
Remove zero allocations for the original kernel entry points.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[SYCL][HIP] Poor Memory Bandwidth due to Unnecessary Memory Write Traffic, 50% slower than OpenSYCL
9 participants