[AMDGPU] Fix selection of s_load_b96 on GFX11 #108029


Merged
merged 6 commits into llvm:main from load-constant-always-uniform
Sep 12, 2024

Conversation

jayfoad
Contributor

@jayfoad jayfoad commented Sep 10, 2024

Fix a bug which resulted in selection of s_load_b96 on GFX11, which only
exists in GFX12.

The root cause was a mismatch between legalization and selection. The
condition used to check that the load was uniform in legalization
(SITargetLowering::LowerLOAD) was "!Op->isDivergent()". The condition
used to detect a non-uniform load during selection
(AMDGPUDAGToDAGISel::isUniformLoad()) was
"N->isDivergent() && !AMDGPUInstrInfo::isUniformMMO(MMO)". This makes a
difference when IR uniformity analysis has more information than SDAG's
built-in analysis. In the test case this is because IR UA reports that
everything is uniform if isSingleLaneExecution() returns true, e.g. if
the specified max flat workgroup size is 1, but SDAG does not have this
optimization.

The immediate fix is to use the same condition to detect uniform loads
in legalization and selection. In future SDAG should learn about
isSingleLaneExecution(), and then it could probably stop relying on IR
metadata to detect uniform loads.

@llvmbot
Member

llvmbot commented Sep 10, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Jay Foad (jayfoad)

Changes


Full diff: https://github.com/llvm/llvm-project/pull/108029.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+20-19)
  • (added) llvm/test/CodeGen/AMDGPU/load-constant-always-uniform.ll (+42)
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 736f714ac1a77c..8b2e47a0b1b921 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -10256,6 +10256,7 @@ SDValue SITargetLowering::LowerLOAD(SDValue Op, SelectionDAG &DAG) const {
   LoadSDNode *Load = cast<LoadSDNode>(Op);
   ISD::LoadExtType ExtType = Load->getExtensionType();
   EVT MemVT = Load->getMemoryVT();
+  MachineMemOperand *MMO = Load->getMemOperand();
 
   if (ExtType == ISD::NON_EXTLOAD && MemVT.getSizeInBits() < 32) {
     if (MemVT == MVT::i16 && isTypeLegal(MVT::i16))
@@ -10266,7 +10267,6 @@ SDValue SITargetLowering::LowerLOAD(SDValue Op, SelectionDAG &DAG) const {
 
     SDValue Chain = Load->getChain();
     SDValue BasePtr = Load->getBasePtr();
-    MachineMemOperand *MMO = Load->getMemOperand();
 
     EVT RealMemVT = (MemVT == MVT::i1) ? MVT::i8 : MVT::i16;
 
@@ -10322,9 +10322,10 @@ SDValue SITargetLowering::LowerLOAD(SDValue Op, SelectionDAG &DAG) const {
 
   unsigned NumElements = MemVT.getVectorNumElements();
 
-  if (AS == AMDGPUAS::CONSTANT_ADDRESS ||
-      AS == AMDGPUAS::CONSTANT_ADDRESS_32BIT) {
-    if (!Op->isDivergent() && Alignment >= Align(4) && NumElements < 32) {
+  if ((!Op->isDivergent() || AMDGPUInstrInfo::isUniformMMO(MMO)) &&
+      Alignment >= Align(4) && NumElements < 32) {
+    if (AS == AMDGPUAS::CONSTANT_ADDRESS ||
+        AS == AMDGPUAS::CONSTANT_ADDRESS_32BIT) {
       if (MemVT.isPow2VectorType() ||
           (Subtarget->hasScalarDwordx3Loads() && NumElements == 3))
         return SDValue();
@@ -10334,24 +10335,24 @@ SDValue SITargetLowering::LowerLOAD(SDValue Op, SelectionDAG &DAG) const {
     // have the same legalization requirements as global and private
     // loads.
     //
-  }
 
-  if (AS == AMDGPUAS::CONSTANT_ADDRESS ||
-      AS == AMDGPUAS::CONSTANT_ADDRESS_32BIT ||
-      AS == AMDGPUAS::GLOBAL_ADDRESS) {
-    if (Subtarget->getScalarizeGlobalBehavior() && !Op->isDivergent() &&
-        Load->isSimple() && isMemOpHasNoClobberedMemOperand(Load) &&
-        Alignment >= Align(4) && NumElements < 32) {
-      if (MemVT.isPow2VectorType() ||
-          (Subtarget->hasScalarDwordx3Loads() && NumElements == 3))
-        return SDValue();
-      return WidenOrSplitVectorLoad(Op, DAG);
+    if (AS == AMDGPUAS::CONSTANT_ADDRESS ||
+        AS == AMDGPUAS::CONSTANT_ADDRESS_32BIT ||
+        AS == AMDGPUAS::GLOBAL_ADDRESS) {
+      if (Subtarget->getScalarizeGlobalBehavior() && Load->isSimple() &&
+          isMemOpHasNoClobberedMemOperand(Load)) {
+        if (MemVT.isPow2VectorType() ||
+            (Subtarget->hasScalarDwordx3Loads() && NumElements == 3))
+          return SDValue();
+        return WidenOrSplitVectorLoad(Op, DAG);
+      }
+      // Non-uniform loads will be selected to MUBUF instructions, so they
+      // have the same legalization requirements as global and private
+      // loads.
+      //
     }
-    // Non-uniform loads will be selected to MUBUF instructions, so they
-    // have the same legalization requirements as global and private
-    // loads.
-    //
   }
+
   if (AS == AMDGPUAS::CONSTANT_ADDRESS ||
       AS == AMDGPUAS::CONSTANT_ADDRESS_32BIT ||
       AS == AMDGPUAS::GLOBAL_ADDRESS ||
diff --git a/llvm/test/CodeGen/AMDGPU/load-constant-always-uniform.ll b/llvm/test/CodeGen/AMDGPU/load-constant-always-uniform.ll
new file mode 100644
index 00000000000000..719a9e0ccb0144
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/load-constant-always-uniform.ll
@@ -0,0 +1,42 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 < %s | FileCheck %s --check-prefixes=CHECK,GFX11
+
+define amdgpu_cs void @test_uniform_load_b96(i32 %arg) "amdgpu-flat-work-group-size"="1,1" {
+; CHECK-LABEL: test_uniform_load_b96:
+; CHECK:       ; %bb.0: ; %bb
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; CHECK-NEXT:    v_lshlrev_b64 v[0:1], 2, v[0:1]
+; CHECK-NEXT:    v_readfirstlane_b32 s0, v0
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_2)
+; CHECK-NEXT:    v_readfirstlane_b32 s1, v1
+; CHECK-NEXT:    v_mov_b32_e32 v0, 0
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_clause 0x1
+; CHECK-NEXT:    s_load_b64 s[2:3], s[0:1], 0x0
+; CHECK-NEXT:    s_load_b32 s0, s[0:1], 0x8
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    v_mov_b32_e32 v2, s3
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; CHECK-NEXT:    v_or3_b32 v2, s2, v2, s0
+; CHECK-NEXT:    global_store_b32 v[0:1], v2, off
+; CHECK-NEXT:    s_nop 0
+; CHECK-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; CHECK-NEXT:    s_endpgm
+bb:
+  %i = zext i32 %arg to i64
+  %i1 = getelementptr <{ [4294967295 x i32] }>, ptr addrspace(1) null, i64 0, i32 0, i64 %i
+  %i2 = load i32, ptr addrspace(1) %i1, align 4
+  %i3 = add nuw i32 %arg, 1
+  %i4 = zext i32 %i3 to i64
+  %i5 = getelementptr <{ [4294967295 x i32] }>, ptr addrspace(1) null, i64 0, i32 0, i64 %i4
+  %i6 = load i32, ptr addrspace(1) %i5, align 4
+  %i7 = add nuw i32 %arg, 2
+  %i8 = zext i32 %i7 to i64
+  %i9 = getelementptr <{ [4294967295 x i32] }>, ptr addrspace(1) null, i64 0, i32 0, i64 %i8
+  %i10 = load i32, ptr addrspace(1) %i9, align 4
+  %i11 = or i32 %i2, %i6
+  %i12 = or i32 %i10, %i11
+  store i32 %i12, ptr addrspace(1) null, align 4
+  ret void
+}

@jayfoad
Contributor Author

jayfoad commented Sep 10, 2024

In future SDAG should learn about isSingleLaneExecution(),

I've raised #108038 for this.

Collaborator

@piotrAMD piotrAMD left a comment


LGTM.
I assume global-isel doesn't have this problem?

}

Collaborator


Nit: Spurious extra line.

Contributor Author


I did that on purpose to separate these two cases (which are guarded by (!Op->isDivergent() || AMDGPUInstrInfo::isUniformMMO(MMO)) && ...) from the next case.

Collaborator


Ok, makes sense.

@jayfoad
Contributor Author

jayfoad commented Sep 10, 2024

LGTM. I assume global-isel doesn't have this problem?

I think GISel is OK because it is already calling AMDGPUInstrInfo::isUniformMMO from AMDGPURegisterBankInfo::isScalarLoadLegal.

(Subtarget->hasScalarDwordx3Loads() && NumElements == 3))
return SDValue();
return WidenOrSplitVectorLoad(Op, DAG);
if (AS == AMDGPUAS::CONSTANT_ADDRESS ||
Contributor


AMDGPUAS::CONSTANT_ADDRESS and AMDGPUAS::CONSTANT_ADDRESS_32BIT do nothing here, as they are completely handled by the previous if()?
If so, you could hoist the inner if (Subtarget... and merge the AS == ... condition into it?

@@ -0,0 +1,42 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 < %s | FileCheck %s --check-prefixes=CHECK,GFX11
Contributor


Perhaps this should also have gfx1200 run line to show that selection on GFX12 uses s_load_b96?

@ssahasra ssahasra requested review from ssahasra and removed request for ssahasra September 11, 2024 06:22
%i11 = or i32 %i2, %i6
%i12 = or i32 %i10, %i11
store i32 %i12, ptr addrspace(1) null, align 4
ret void
Contributor


Can you simplify this test? I would expect this IR to already be vectorized.

; GFX12-NEXT: s_endpgm
bb:
%i = zext i32 %arg to i64
%i1 = getelementptr i32, ptr addrspace(1) null, i64 %i
Contributor


A non-null base pointer would be better.

@jayfoad jayfoad merged commit c657a6f into llvm:main Sep 12, 2024
5 of 8 checks passed
@jayfoad jayfoad deleted the load-constant-always-uniform branch September 12, 2024 12:41