[AMDGPU] Create new option for force flush load counter #124974

rampitec
Collaborator

In certain situations it is beneficial to wait for all outstanding
loads, regardless of which specific load's data is needed. This may
reduce the number of cache requests.

Fixes: SWDEV-511507

@rampitec rampitec requested review from arsenm and kerbowa January 29, 2025 19:40
@rampitec rampitec marked this pull request as ready for review January 29, 2025 19:41
@llvmbot
Member

llvmbot commented Jan 29, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Stanislav Mekhanoshin (rampitec)

Changes

In certain situations it is beneficial to wait for all outstanding
loads, regardless of which specific load's data is needed. This may
reduce the number of cache requests.

Fixes: SWDEV-511507


Full diff: https://github.com/llvm/llvm-project/pull/124974.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp (+8)
  • (added) llvm/test/CodeGen/AMDGPU/load-store-cnt.ll (+48)
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index de2095fa60ffd4..3d6419778f4b1c 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -53,6 +53,11 @@ static cl::opt<bool>
                                "s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)"),
                       cl::init(false), cl::Hidden);
 
+static cl::opt<bool> ForceEmitZeroLoadFlag(
+    "amdgpu-waitcnt-load-forcezero",
+    cl::desc("Force all waitcnt load counters to wait until 0"),
+    cl::init(false), cl::Hidden);
+
 namespace {
 // Class of object that encapsulates latest instruction counter score
 // associated with the operand.  Used for determining whether
@@ -1850,6 +1855,9 @@ bool SIInsertWaitcnts::generateWaitcntInstBefore(MachineInstr &MI,
       Wait.BvhCnt = 0;
   }
 
+  if (ForceEmitZeroLoadFlag && Wait.LoadCnt != ~0u)
+    Wait.LoadCnt = 0;
+
   return generateWaitcnt(Wait, MI.getIterator(), *MI.getParent(), ScoreBrackets,
                          OldWaitcntInstr);
 }
diff --git a/llvm/test/CodeGen/AMDGPU/load-store-cnt.ll b/llvm/test/CodeGen/AMDGPU/load-store-cnt.ll
new file mode 100644
index 00000000000000..a7fccde4166713
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/load-store-cnt.ll
@@ -0,0 +1,48 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -march=amdgcn -mcpu=gfx1100 < %s | FileCheck --check-prefixes=DEFAULT %s
+; RUN: llc -march=amdgcn -mcpu=gfx1100 -amdgpu-waitcnt-load-forcezero < %s | FileCheck --check-prefixes=LDZERO %s
+
+define amdgpu_kernel void @copy(ptr addrspace(1) noalias nocapture readonly %src1, ptr addrspace(1) noalias nocapture readonly %src2, ptr addrspace(1) noalias nocapture writeonly %dst1, ptr addrspace(1) noalias nocapture writeonly %dst2) {
+; DEFAULT-LABEL: copy:
+; DEFAULT:       ; %bb.0:
+; DEFAULT-NEXT:    s_load_b256 s[0:7], s[4:5], 0x24
+; DEFAULT-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
+; DEFAULT-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; DEFAULT-NEXT:    v_lshlrev_b32_e32 v0, 2, v0
+; DEFAULT-NEXT:    s_waitcnt lgkmcnt(0)
+; DEFAULT-NEXT:    s_clause 0x1
+; DEFAULT-NEXT:    global_load_b32 v1, v0, s[0:1]
+; DEFAULT-NEXT:    global_load_b32 v2, v0, s[2:3]
+; DEFAULT-NEXT:    s_waitcnt vmcnt(1)
+; DEFAULT-NEXT:    global_store_b32 v0, v1, s[4:5]
+; DEFAULT-NEXT:    s_waitcnt vmcnt(0)
+; DEFAULT-NEXT:    global_store_b32 v0, v2, s[6:7]
+; DEFAULT-NEXT:    s_endpgm
+;
+; LDZERO-LABEL: copy:
+; LDZERO:       ; %bb.0:
+; LDZERO-NEXT:    s_load_b256 s[0:7], s[4:5], 0x24
+; LDZERO-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
+; LDZERO-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; LDZERO-NEXT:    v_lshlrev_b32_e32 v0, 2, v0
+; LDZERO-NEXT:    s_waitcnt lgkmcnt(0)
+; LDZERO-NEXT:    s_clause 0x1
+; LDZERO-NEXT:    global_load_b32 v1, v0, s[0:1]
+; LDZERO-NEXT:    global_load_b32 v2, v0, s[2:3]
+; LDZERO-NEXT:    s_waitcnt vmcnt(0)
+; LDZERO-NEXT:    s_clause 0x1
+; LDZERO-NEXT:    global_store_b32 v0, v1, s[4:5]
+; LDZERO-NEXT:    global_store_b32 v0, v2, s[6:7]
+; LDZERO-NEXT:    s_endpgm
+  %id = tail call i32 @llvm.amdgcn.workitem.id.x()
+  %idx = zext i32 %id to i64
+  %gep.ld1 = getelementptr inbounds nuw float, ptr addrspace(1) %src1, i64 %idx
+  %v1 = load float, ptr addrspace(1) %gep.ld1, align 4
+  %gep.ld2 = getelementptr inbounds nuw float, ptr addrspace(1) %src2, i64 %idx
+  %v2 = load float, ptr addrspace(1) %gep.ld2, align 4
+  %gep.st1 = getelementptr inbounds nuw float, ptr addrspace(1) %dst1, i64 %idx
+  store float %v1, ptr addrspace(1) %gep.st1, align 4
+  %gep.st2 = getelementptr inbounds nuw float, ptr addrspace(1) %dst2, i64 %idx
+  store float %v2, ptr addrspace(1) %gep.st2, align 4
+  ret void
+}
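
The guard added to `generateWaitcntInstBefore` in the diff above can be modeled in isolation. This is only a minimal sketch, not the pass's actual code: `WaitModel`, `NoWait`, `applyForceZeroLoad`, and `checkModel` are hypothetical names, and it assumes (based on the `!= ~0u` comparison in the diff) that `~0u` marks a counter with no pending wait requirement.

```cpp
#include <cassert>

// Assumption: ~0u means "no wait is required on this counter".
constexpr unsigned NoWait = ~0u;

// Hypothetical stand-in for the pass's wait descriptor; only the
// load counter is modeled here.
struct WaitModel {
  unsigned LoadCnt;
};

// Mirrors the guard from the diff: an already required load wait is
// tightened to 0, but no wait is introduced where none was needed.
WaitModel applyForceZeroLoad(WaitModel Wait, bool ForceZeroLoad) {
  if (ForceZeroLoad && Wait.LoadCnt != NoWait)
    Wait.LoadCnt = 0;
  return Wait;
}

// Small self-check matching the first store in the test above:
// a required wait of 1 becomes 0 only when the flag is set.
void checkModel() {
  assert(applyForceZeroLoad(WaitModel{1}, true).LoadCnt == 0);
  assert(applyForceZeroLoad(WaitModel{1}, false).LoadCnt == 1);
  assert(applyForceZeroLoad(WaitModel{NoWait}, true).LoadCnt == NoWait);
}
```

This lines up with the test output: the `s_waitcnt vmcnt(1)` before the first store in the DEFAULT run becomes `s_waitcnt vmcnt(0)` in the LDZERO run, and the second wait is no longer emitted.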

Member

@kerbowa kerbowa left a comment

LGTM.

My only comment is that it's strange that amdgpu-waitcnt-forcezero always emits a waitcnt after every instruction, while this new option waits until some waitcnt is required before flushing to zero. Maybe we should revamp this option later so that you can select the counter to flush on the command line, and also change this behavior. These flags are useful for debugging.
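
The difference described in this comment can be sketched as two policies over a single counter. This is a hypothetical model for illustration only: `FlushPolicy`, `resolveLoadWait`, and `checkPolicies` do not exist in LLVM, and the `AlwaysZero` case follows the reviewer's description of amdgpu-waitcnt-forcezero rather than the pass's source.

```cpp
#include <cassert>

// Assumption: ~0u means "no wait is required on this counter".
constexpr unsigned NoWait = ~0u;

// Hypothetical policies, named for this sketch only.
enum class FlushPolicy {
  Off,             // keep whatever wait the analysis computed
  TightenRequired, // this PR's option: only tighten an existing wait
  AlwaysZero       // as the reviewer describes amdgpu-waitcnt-forcezero
};

// Returns the load-counter wait to emit for a given required wait.
unsigned resolveLoadWait(unsigned Required, FlushPolicy Policy) {
  switch (Policy) {
  case FlushPolicy::AlwaysZero:
    return 0;
  case FlushPolicy::TightenRequired:
    return Required == NoWait ? NoWait : 0;
  case FlushPolicy::Off:
    break;
  }
  return Required;
}

// The policies differ only when no wait was required at all.
void checkPolicies() {
  assert(resolveLoadWait(NoWait, FlushPolicy::TightenRequired) == NoWait);
  assert(resolveLoadWait(NoWait, FlushPolicy::AlwaysZero) == 0);
  assert(resolveLoadWait(1, FlushPolicy::TightenRequired) == 0);
}
```

A per-counter selection on the command line, as suggested in the comment, would amount to choosing such a policy for each counter separately; how that would be exposed is left open here.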

@rampitec rampitec merged commit 8a20c64 into main Jan 30, 2025
12 checks passed
@rampitec rampitec deleted the users/rampitec/01-27-_amdgpu_create_new_option_for_force_flush_load_counter branch January 30, 2025 19:14
@llvm-ci
Collaborator

llvm-ci commented Jan 30, 2025

LLVM Buildbot has detected a new failure on builder openmp-offload-libc-amdgpu-runtime running on omp-vega20-1 while building llvm at step 7 "Add check check-offload".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/73/builds/12753

Here is the relevant piece of the build log for reference:
Step 7 (Add check check-offload) failure: test (failure)
******************** TEST 'libomptarget :: amdgcn-amd-amdhsa :: mapping/data_member_ref.cpp' FAILED ********************
Exit Code: 2

Command Output (stdout):
--
# RUN: at line 1
/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/clang++ -fopenmp    -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src  -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib  -fopenmp-targets=amdgcn-amd-amdhsa /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/mapping/data_member_ref.cpp -o /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/mapping/Output/data_member_ref.cpp.tmp -Xoffload-linker -lc -Xoffload-linker -lm /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a && /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/mapping/Output/data_member_ref.cpp.tmp | /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/mapping/data_member_ref.cpp
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/clang++ -fopenmp -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib -fopenmp-targets=amdgcn-amd-amdhsa /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/mapping/data_member_ref.cpp -o /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/mapping/Output/data_member_ref.cpp.tmp -Xoffload-linker -lc -Xoffload-linker -lm /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/mapping/Output/data_member_ref.cpp.tmp
# note: command had no output on stdout or stderr
# error: command failed with exit status: -11
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/mapping/data_member_ref.cpp
# .---command stderr------------
# | FileCheck error: '<stdin>' is empty.
# | FileCheck command line:  /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/mapping/data_member_ref.cpp
# `-----------------------------
# error: command failed with exit status: 2

--

********************

