[AMDGPU] Create new option for force flush load counter #124974

Conversation
In certain situations it is beneficial to wait for all outstanding loads regardless of which specific load's data we need. This may allow us to reduce the number of cache requests.

Fixes: SWDEV-511507

@llvm/pr-subscribers-backend-amdgpu

Author: Stanislav Mekhanoshin (rampitec)

Changes

In certain situations it is beneficial to wait for all outstanding loads regardless of which specific load's data we need. This may allow us to reduce the number of cache requests.

Fixes: SWDEV-511507

Full diff: https://github.com/llvm/llvm-project/pull/124974.diff

2 Files Affected:
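For reference, the new flag is a hidden `cl::opt`, so it is passed directly to `llc` (or via `-mllvm` from a driver). The invocation below mirrors the RUN lines in the test added by this patch; the input file name is illustrative.

```shell
# Default behavior: waits only on the specific load whose data is needed.
llc -march=amdgcn -mcpu=gfx1100 load-store-cnt.ll -o default.s

# With the new option: any required load-counter wait is flushed to zero,
# i.e. all outstanding loads complete before the dependent instruction.
llc -march=amdgcn -mcpu=gfx1100 -amdgpu-waitcnt-load-forcezero \
    load-store-cnt.ll -o forcezero.s
```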
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index de2095fa60ffd4..3d6419778f4b1c 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -53,6 +53,11 @@ static cl::opt<bool>
"s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)"),
cl::init(false), cl::Hidden);
+static cl::opt<bool> ForceEmitZeroLoadFlag(
+ "amdgpu-waitcnt-load-forcezero",
+ cl::desc("Force all waitcnt load counters to wait until 0"),
+ cl::init(false), cl::Hidden);
+
namespace {
// Class of object that encapsulates latest instruction counter score
// associated with the operand. Used for determining whether
@@ -1850,6 +1855,9 @@ bool SIInsertWaitcnts::generateWaitcntInstBefore(MachineInstr &MI,
Wait.BvhCnt = 0;
}
+ if (ForceEmitZeroLoadFlag && Wait.LoadCnt != ~0u)
+ Wait.LoadCnt = 0;
+
return generateWaitcnt(Wait, MI.getIterator(), *MI.getParent(), ScoreBrackets,
OldWaitcntInstr);
}
diff --git a/llvm/test/CodeGen/AMDGPU/load-store-cnt.ll b/llvm/test/CodeGen/AMDGPU/load-store-cnt.ll
new file mode 100644
index 00000000000000..a7fccde4166713
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/load-store-cnt.ll
@@ -0,0 +1,48 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -march=amdgcn -mcpu=gfx1100 < %s | FileCheck --check-prefixes=DEFAULT %s
+; RUN: llc -march=amdgcn -mcpu=gfx1100 -amdgpu-waitcnt-load-forcezero < %s | FileCheck --check-prefixes=LDZERO %s
+
+define amdgpu_kernel void @copy(ptr addrspace(1) noalias nocapture readonly %src1, ptr addrspace(1) noalias nocapture readonly %src2, ptr addrspace(1) noalias nocapture writeonly %dst1, ptr addrspace(1) noalias nocapture writeonly %dst2) {
+; DEFAULT-LABEL: copy:
+; DEFAULT: ; %bb.0:
+; DEFAULT-NEXT: s_load_b256 s[0:7], s[4:5], 0x24
+; DEFAULT-NEXT: v_and_b32_e32 v0, 0x3ff, v0
+; DEFAULT-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; DEFAULT-NEXT: v_lshlrev_b32_e32 v0, 2, v0
+; DEFAULT-NEXT: s_waitcnt lgkmcnt(0)
+; DEFAULT-NEXT: s_clause 0x1
+; DEFAULT-NEXT: global_load_b32 v1, v0, s[0:1]
+; DEFAULT-NEXT: global_load_b32 v2, v0, s[2:3]
+; DEFAULT-NEXT: s_waitcnt vmcnt(1)
+; DEFAULT-NEXT: global_store_b32 v0, v1, s[4:5]
+; DEFAULT-NEXT: s_waitcnt vmcnt(0)
+; DEFAULT-NEXT: global_store_b32 v0, v2, s[6:7]
+; DEFAULT-NEXT: s_endpgm
+;
+; LDZERO-LABEL: copy:
+; LDZERO: ; %bb.0:
+; LDZERO-NEXT: s_load_b256 s[0:7], s[4:5], 0x24
+; LDZERO-NEXT: v_and_b32_e32 v0, 0x3ff, v0
+; LDZERO-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; LDZERO-NEXT: v_lshlrev_b32_e32 v0, 2, v0
+; LDZERO-NEXT: s_waitcnt lgkmcnt(0)
+; LDZERO-NEXT: s_clause 0x1
+; LDZERO-NEXT: global_load_b32 v1, v0, s[0:1]
+; LDZERO-NEXT: global_load_b32 v2, v0, s[2:3]
+; LDZERO-NEXT: s_waitcnt vmcnt(0)
+; LDZERO-NEXT: s_clause 0x1
+; LDZERO-NEXT: global_store_b32 v0, v1, s[4:5]
+; LDZERO-NEXT: global_store_b32 v0, v2, s[6:7]
+; LDZERO-NEXT: s_endpgm
+ %id = tail call i32 @llvm.amdgcn.workitem.id.x()
+ %idx = zext i32 %id to i64
+ %gep.ld1 = getelementptr inbounds nuw float, ptr addrspace(1) %src1, i64 %idx
+ %v1 = load float, ptr addrspace(1) %gep.ld1, align 4
+ %gep.ld2 = getelementptr inbounds nuw float, ptr addrspace(1) %src2, i64 %idx
+ %v2 = load float, ptr addrspace(1) %gep.ld2, align 4
+ %gep.st1 = getelementptr inbounds nuw float, ptr addrspace(1) %dst1, i64 %idx
+ store float %v1, ptr addrspace(1) %gep.st1, align 4
+ %gep.st2 = getelementptr inbounds nuw float, ptr addrspace(1) %dst2, i64 %idx
+ store float %v2, ptr addrspace(1) %gep.st2, align 4
+ ret void
+}
LGTM.

My only comment is that it's strange that `amdgpu-waitcnt-forcezero` always emits a waitcnt after every instruction, while this new option waits until some waitcnt is already required before flushing it to zero. Maybe we should revamp this option later so that the counter to flush can be selected on the command line, and also change this behavior. These flags are useful for debugging.
LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/73/builds/12753

Here is the relevant piece of the build log for reference.