
[AMDGPU] Inplace FI elimination during PEI for scalar copy instruction #99556


Merged
20 commits merged on Jul 31, 2024

Conversation

@PankajDwivedi-25 (Contributor) commented Jul 18, 2024

eliminateFrameIndex wasn't handling the FI copy to scalar registers, and the default implementation breaks the code while trying to handle it. This patch handles the broken lowering and also takes care of some edge cases that might arise. The tricky case is a non-zero offset where SCC and VCC are live and no SGPR pair is available.
Co-authored by @arsenm
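
For illustration, here is roughly what the in-place elimination produces in the simple cases, condensed from the MIR tests added by this patch (register choices are whatever the scavenger picks):

    Offset 0, SCC dead — before PEI:
      renamable $sgpr4 = S_MOV_B32 %stack.0
    after PEI:
      $sgpr4 = S_LSHR_B32 $sgpr32, 6, implicit-def dead $scc

    Offset 0, SCC live — before PEI:
      renamable $sgpr4 = S_MOV_B32 %stack.0
    after PEI:
      $vgpr0 = V_LSHRREV_B32_e64 6, $sgpr32, implicit $exec
      $sgpr4 = V_READFIRSTLANE_B32 $vgpr0, implicit $exec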

@llvmbot (Member) commented Jul 18, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Pankaj Dwivedi (PankajDwivedi-25)

Changes

During the prologepilog pass, while trying to eliminate a frame index for a scalar copy instruction, we ended up in a case where two spills tried to use the same emergency stack slot, which leads to almost a dead end with only one emergency stack slot available.
This is solved by avoiding the spilling in the first place and instead eliminating the frame index in place for the scalar copy instruction.
In-place frame index elimination for the scalar copy instruction solves the issue, but this patch still might not cover all the corner cases.
Co-authored by @arsenm


Patch is 39.87 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/99556.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp (+52-18)
  • (modified) llvm/test/CodeGen/AMDGPU/frame-index.mir (+324)
  • (added) llvm/test/CodeGen/AMDGPU/reduce-frame-index.mir (+176)
diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
index 8a315aa822786..05153383a3a21 100644
--- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
@@ -2449,7 +2449,8 @@ bool SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI,
                                             ? &AMDGPU::SReg_32RegClass
                                             : &AMDGPU::VGPR_32RegClass;
         bool IsCopy = MI->getOpcode() == AMDGPU::V_MOV_B32_e32 ||
-                      MI->getOpcode() == AMDGPU::V_MOV_B32_e64;
+                      MI->getOpcode() == AMDGPU::V_MOV_B32_e64 ||
+                      MI->getOpcode() == AMDGPU::S_MOV_B32;
         Register ResultReg =
             IsCopy ? MI->getOperand(0).getReg()
                    : RS->scavengeRegisterBackwards(*RC, MI, false, 0);
@@ -2458,7 +2459,13 @@ bool SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI,
         if (Offset == 0) {
           unsigned OpCode = IsSALU && !LiveSCC ? AMDGPU::S_LSHR_B32
                                                : AMDGPU::V_LSHRREV_B32_e64;
-          auto Shift = BuildMI(*MBB, MI, DL, TII->get(OpCode), ResultReg);
+          Register TmpResultReg = ResultReg;
+          if (IsSALU && LiveSCC) {
+            TmpResultReg = RS->scavengeRegisterBackwards(
+                AMDGPU::VGPR_32RegClass, MI, false, 0);
+          }
+
+          auto Shift = BuildMI(*MBB, MI, DL, TII->get(OpCode), TmpResultReg);
           if (OpCode == AMDGPU::V_LSHRREV_B32_e64)
             // For V_LSHRREV, the operands are reversed (the shift count goes
             // first).
@@ -2468,11 +2475,13 @@ bool SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI,
           if (IsSALU && !LiveSCC)
             Shift.getInstr()->getOperand(3).setIsDead(); // Mark SCC as dead.
           if (IsSALU && LiveSCC) {
-            Register NewDest = RS->scavengeRegisterBackwards(
-                AMDGPU::SReg_32RegClass, Shift, false, 0);
+            Register NewDest =
+                IsCopy ? ResultReg
+                       : RS->scavengeRegisterBackwards(AMDGPU::SReg_32RegClass,
+                                                       Shift, false, 0);
             BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_READFIRSTLANE_B32),
                     NewDest)
-                .addReg(ResultReg);
+                .addReg(TmpResultReg);
             ResultReg = NewDest;
           }
         } else {
@@ -2523,22 +2532,47 @@ bool SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI,
 
             // We may have 1 free scratch SGPR even though a carry out is
             // unavailable. Only one additional mov is needed.
-            Register TmpScaledReg = RS->scavengeRegisterBackwards(
-                AMDGPU::SReg_32_XM0RegClass, MI, false, 0, false);
-            Register ScaledReg = TmpScaledReg.isValid() ? TmpScaledReg : FrameReg;
-
-            BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_LSHR_B32), ScaledReg)
-              .addReg(FrameReg)
-              .addImm(ST.getWavefrontSizeLog2());
-            BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_ADD_I32), ScaledReg)
-                .addReg(ScaledReg, RegState::Kill)
-                .addImm(Offset);
+            Register TmpScaledReg =
+                IsCopy && IsSALU
+                    ? ResultReg
+                    : RS->scavengeRegisterBackwards(AMDGPU::SReg_32_XM0RegClass,
+                                                    MI, false, 0, false);
+            Register ScaledReg =
+                TmpScaledReg.isValid() ? TmpScaledReg : FrameReg;
+            Register TmpResultReg = ScaledReg;
+
+            if (!LiveSCC) {
+              BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_LSHR_B32), TmpResultReg)
+                  .addReg(FrameReg)
+                  .addImm(ST.getWavefrontSizeLog2());
+              BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_ADD_I32), TmpResultReg)
+                  .addReg(TmpResultReg, RegState::Kill)
+                  .addImm(Offset);
+            } else {
+              TmpResultReg = RS->scavengeRegisterBackwards(
+                  AMDGPU::VGPR_32RegClass, MI, false, 0);
+
+              BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_LSHR_B32_e64),
+                      TmpResultReg)
+                  .addImm(ST.getWavefrontSizeLog2())
+                  .addReg(FrameReg);
+              auto Add = BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_ADD_U32_e32),
+                                 TmpResultReg);
+              Add.addImm(Offset).addReg(TmpResultReg, RegState::Kill);
+              Register NewDest =
+                  IsCopy ? ResultReg
+                         : RS->scavengeRegisterBackwards(
+                               AMDGPU::SReg_32RegClass, Add, false, 0);
+              BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_READFIRSTLANE_B32),
+                      NewDest)
+                  .addReg(TmpResultReg);
+              ResultReg = NewDest;
+            }
             if (!IsSALU)
               BuildMI(*MBB, MI, DL, TII->get(AMDGPU::COPY), ResultReg)
-                  .addReg(ScaledReg, RegState::Kill);
+                  .addReg(TmpResultReg, RegState::Kill);
             else
-              ResultReg = ScaledReg;
-
+              ResultReg = TmpResultReg;
             // If there were truly no free SGPRs, we need to undo everything.
             if (!TmpScaledReg.isValid()) {
               BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_ADD_I32), ScaledReg)
diff --git a/llvm/test/CodeGen/AMDGPU/frame-index.mir b/llvm/test/CodeGen/AMDGPU/frame-index.mir
index d8736c5276a26..f4476a497f754 100644
--- a/llvm/test/CodeGen/AMDGPU/frame-index.mir
+++ b/llvm/test/CodeGen/AMDGPU/frame-index.mir
@@ -211,3 +211,327 @@ body:             |
     renamable $vgpr0 = V_ADD_CO_U32_e32 %stack.0, killed $vgpr0, implicit-def dead $vcc, implicit $exec
 ...
 
+---
+name: materialize_fi_s_mov_b32_offset_0_dead_scc
+tracksRegLiveness: true
+stack:
+  - { id: 0, type: default, size: 4, alignment: 4, stack-id: default }
+machineFunctionInfo:
+  stackPtrOffsetReg: '$sgpr32'
+body:             |
+  bb.0:
+    ; GCN-LABEL: name: materialize_fi_s_mov_b32_offset_0_dead_scc
+    ; GCN: $sgpr4 = S_LSHR_B32 $sgpr32, 6, implicit-def dead $scc
+    ; GCN-NEXT: S_ENDPGM 0, implicit $sgpr4
+  renamable $sgpr4 = S_MOV_B32 %stack.0
+  S_ENDPGM 0, implicit $sgpr4
+
+...
+
+---
+name: materialize_fi_s_mov_b32_offset_0_live_scc
+tracksRegLiveness: true
+stack:
+  - { id: 0, type: default, size: 4, alignment: 4, stack-id: default }
+machineFunctionInfo:
+  stackPtrOffsetReg: '$sgpr32'
+body:             |
+  bb.0:
+    liveins: $sgpr4, $sgpr5
+
+    ; GCN-LABEL: name: materialize_fi_s_mov_b32_offset_0_live_scc
+    ; GCN: liveins: $sgpr4, $sgpr5
+    ; GCN-NEXT: {{  $}}
+    ; GCN-NEXT: S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
+    ; GCN-NEXT: $vgpr0 = V_LSHRREV_B32_e64 6, $sgpr32, implicit $exec
+    ; GCN-NEXT: $sgpr4 = V_READFIRSTLANE_B32 $vgpr0, implicit $exec
+    ; GCN-NEXT: S_ENDPGM 0, implicit $sgpr4, implicit $scc
+  S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
+  renamable $sgpr4 = S_MOV_B32 %stack.0
+  S_ENDPGM 0, implicit $sgpr4, implicit $scc
+
+...
+
+# FI#0 is filler to get a non-0 offset for FI#1
+---
+name: materialize_fi_s_mov_b32_offset_64_dead_scc
+tracksRegLiveness: true
+stack:
+  - { id: 0, type: default, size: 64, alignment: 16, stack-id: default }
+  - { id: 1, type: default, size: 4, alignment: 4, stack-id: default }
+machineFunctionInfo:
+  stackPtrOffsetReg: '$sgpr32'
+body:             |
+  bb.0:
+    ; GCN-LABEL: name: materialize_fi_s_mov_b32_offset_64_dead_scc
+    ; GCN: $sgpr4 = S_LSHR_B32 $sgpr32, 6, implicit-def $scc
+    ; GCN-NEXT: $sgpr4 = S_ADD_I32 killed $sgpr4, 64, implicit-def $scc
+    ; GCN-NEXT: S_ENDPGM 0, implicit $sgpr4
+  renamable $sgpr4 = S_MOV_B32 %stack.1
+  S_ENDPGM 0, implicit $sgpr4
+
+...
+
+---
+name: materialize_fi_s_mov_b32_offset_68_dead_scc
+tracksRegLiveness: true
+stack:
+  - { id: 0, type: default, size: 68, alignment: 16, stack-id: default }
+  - { id: 1, type: default, size: 4, alignment: 4, stack-id: default }
+machineFunctionInfo:
+  stackPtrOffsetReg: '$sgpr32'
+body:             |
+  bb.0:
+    ; GCN-LABEL: name: materialize_fi_s_mov_b32_offset_68_dead_scc
+    ; GCN: $sgpr4 = S_LSHR_B32 $sgpr32, 6, implicit-def $scc
+    ; GCN-NEXT: $sgpr4 = S_ADD_I32 killed $sgpr4, 68, implicit-def $scc
+    ; GCN-NEXT: S_ENDPGM 0, implicit $sgpr4
+  renamable $sgpr4 = S_MOV_B32 %stack.1
+  S_ENDPGM 0, implicit $sgpr4
+
+...
+
+---
+name: materialize_fi_s_mov_b32_offset_64_live_scc
+tracksRegLiveness: true
+stack:
+  - { id: 0, type: default, size: 64, alignment: 16, stack-id: default }
+  - { id: 1, type: default, size: 4, alignment: 4, stack-id: default }
+machineFunctionInfo:
+  stackPtrOffsetReg: '$sgpr32'
+body:             |
+  bb.0:
+    liveins: $sgpr4, $sgpr5
+
+    ; GCN-LABEL: name: materialize_fi_s_mov_b32_offset_64_live_scc
+    ; GCN: liveins: $sgpr4, $sgpr5
+    ; GCN-NEXT: {{  $}}
+    ; GCN-NEXT: S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
+    ; GCN-NEXT: $vgpr0 = V_LSHR_B32_e64 6, $sgpr32, implicit $exec
+    ; GCN-NEXT: $vgpr0 = V_ADD_U32_e32 64, killed $vgpr0, implicit $exec
+    ; GCN-NEXT: $sgpr4 = V_READFIRSTLANE_B32 $vgpr0, implicit $exec
+    ; GCN-NEXT: S_ENDPGM 0, implicit $sgpr4, implicit $scc
+  S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
+  renamable $sgpr4 = S_MOV_B32 %stack.1
+  S_ENDPGM 0, implicit $sgpr4, implicit $scc
+
+...
+
+---
+name: materialize_fi_s_mov_b32_offset_68_live_scc
+tracksRegLiveness: true
+stack:
+  - { id: 0, type: default, size: 68, alignment: 16, stack-id: default }
+  - { id: 1, type: default, size: 4, alignment: 4, stack-id: default }
+machineFunctionInfo:
+  stackPtrOffsetReg: '$sgpr32'
+body:             |
+  bb.0:
+    liveins: $sgpr4, $sgpr5
+
+    ; GCN-LABEL: name: materialize_fi_s_mov_b32_offset_68_live_scc
+    ; GCN: liveins: $sgpr4, $sgpr5
+    ; GCN-NEXT: {{  $}}
+    ; GCN-NEXT: S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
+    ; GCN-NEXT: $vgpr0 = V_LSHR_B32_e64 6, $sgpr32, implicit $exec
+    ; GCN-NEXT: $vgpr0 = V_ADD_U32_e32 68, killed $vgpr0, implicit $exec
+    ; GCN-NEXT: $sgpr4 = V_READFIRSTLANE_B32 $vgpr0, implicit $exec
+    ; GCN-NEXT: S_ENDPGM 0, implicit $sgpr4, implicit $scc
+  S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
+  renamable $sgpr4 = S_MOV_B32 %stack.1
+  S_ENDPGM 0, implicit $sgpr4, implicit $scc
+
+...
+
+# FIXME: This is finding a VGPR
+---
+name: materialize_fi_s_mov_b32_offset_0_live_scc__no_free_vgprs
+tracksRegLiveness: true
+stack:
+  - { id: 0, type: default, size: 4, alignment: 16, stack-id: default }
+machineFunctionInfo:
+  scratchRSrcReg:  $sgpr0_sgpr1_sgpr2_sgpr3
+  stackPtrOffsetReg: '$sgpr32'
+  occupancy: 10
+
+body:             |
+  bb.0:
+    liveins: $sgpr4, $sgpr5, $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39
+
+    ; GCN-LABEL: name: materialize_fi_s_mov_b32_offset_0_live_scc__no_free_vgprs
+    ; GCN: liveins: $sgpr4, $sgpr5, $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39, $vgpr40, $vgpr41, $vgpr42, $vgpr43, $vgpr44, $vgpr45, $vgpr46, $vgpr47, $vgpr56, $vgpr57, $vgpr58, $vgpr59, $vgpr60, $vgpr61, $vgpr62, $vgpr63
+    ; GCN-NEXT: {{  $}}
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr40, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 60, 0, 0, implicit $exec :: (store (s32) into %stack.1, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr41, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 56, 0, 0, implicit $exec :: (store (s32) into %stack.2, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr42, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 52, 0, 0, implicit $exec :: (store (s32) into %stack.3, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr43, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 48, 0, 0, implicit $exec :: (store (s32) into %stack.4, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr44, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 44, 0, 0, implicit $exec :: (store (s32) into %stack.5, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr45, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 40, 0, 0, implicit $exec :: (store (s32) into %stack.6, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr46, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 36, 0, 0, implicit $exec :: (store (s32) into %stack.7, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr47, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 32, 0, 0, implicit $exec :: (store (s32) into %stack.8, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr56, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 28, 0, 0, implicit $exec :: (store (s32) into %stack.9, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr57, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 24, 0, 0, implicit $exec :: (store (s32) into %stack.10, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr58, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 20, 0, 0, implicit $exec :: (store (s32) into %stack.11, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr59, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 16, 0, 0, implicit $exec :: (store (s32) into %stack.12, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr60, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 12, 0, 0, implicit $exec :: (store (s32) into %stack.13, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr61, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 8, 0, 0, implicit $exec :: (store (s32) into %stack.14, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr62, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, implicit $exec :: (store (s32) into %stack.15, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr63, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, implicit $exec :: (store (s32) into %stack.16, addrspace 5)
+    ; GCN-NEXT: S_NOP 0, implicit-def $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7
+    ; GCN-NEXT: S_NOP 0, implicit-def $vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
+    ; GCN-NEXT: S_NOP 0, implicit-def $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23
+    ; GCN-NEXT: S_NOP 0, implicit-def $vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
+    ; GCN-NEXT: S_NOP 0, implicit $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39
+    ; GCN-NEXT: S_NOP 0, implicit-def $vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47
+    ; GCN-NEXT: S_NOP 0, implicit-def $vgpr48_vgpr49_vgpr50_vgpr51_vgpr52_vgpr53_vgpr54_vgpr55
+    ; GCN-NEXT: S_NOP 0, implicit-def $vgpr56_vgpr57_vgpr58_vgpr59_vgpr60_vgpr61_vgpr62_vgpr63
+    ; GCN-NEXT: S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 68, 0, 0, implicit $exec :: (store (s32) into %stack.17, addrspace 5)
+    ; GCN-NEXT: $vgpr0 = V_LSHR_B32_e64 6, $sgpr32, implicit $exec
+    ; GCN-NEXT: $vgpr0 = V_ADD_U32_e32 64, killed $vgpr0, implicit $exec
+    ; GCN-NEXT: $sgpr4 = V_READFIRSTLANE_B32 $vgpr0, implicit $exec
+    ; GCN-NEXT: $vgpr0 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 68, 0, 0, implicit $exec :: (load (s32) from %stack.17, addrspace 5)
+    ; GCN-NEXT: S_NOP 0, implicit $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7
+    ; GCN-NEXT: S_NOP 0, implicit $vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
+    ; GCN-NEXT: S_NOP 0, implicit $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23
+    ; GCN-NEXT: S_NOP 0, implicit $vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
+    ; GCN-NEXT: S_NOP 0, implicit $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39
+    ; GCN-NEXT: S_NOP 0, implicit $vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47
+    ; GCN-NEXT: S_NOP 0, implicit $vgpr48_vgpr49_vgpr50_vgpr51_vgpr52_vgpr53_vgpr54_vgpr55
+    ; GCN-NEXT: S_NOP 0, implicit $vgpr56_vgpr57_vgpr58_vgpr59_vgpr60_vgpr61_vgpr62_vgpr63
+    ; GCN-NEXT: $vgpr63 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, implicit $exec :: (load (s32) from %stack.16, addrspace 5)
+    ; GCN-NEXT: $vgpr62 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, implicit $exec :: (load (s32) from %stack.15, addrspace 5)
+    ; GCN-NEXT: $vgpr61 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 8, 0, 0, implicit $exec :: (load (s32) from %stack.14, addrspace 5)
+    ; GCN-NEXT: $vgpr60 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 12, 0, 0, implicit $exec :: (load (s32) from %stack.13, addrspace 5)
+    ; GCN-NEXT: $vgpr59 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 16, 0, 0, implicit $exec :: (load (s32) from %stack.12, addrspace 5)
+    ; GCN-NEXT: $vgpr58 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 20, 0, 0, implicit $exec :: (load (s32) from %stack.11, addrspace 5)
+    ; GCN-NEXT: $vgpr57 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 24, 0, 0, implicit $exec :: (load (s32) from %stack.10, addrspace 5)
+    ; GCN-NEXT: $vgpr56 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 28, 0, 0, implicit $exec :: (load (s32) from %stack.9, addrspace 5)
+    ; GCN-NEXT: $vgpr47 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 32, 0, 0, implicit $exec :: (load (s32) from %stack.8, addrspace 5)
+    ; GCN-NEXT: $vgpr46 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 36, 0, 0, implicit $exec :: (load (s32) from %stack.7, addrspace 5)
+    ; GCN-NEXT: $vgpr45 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 40, 0, 0, implicit $exec :: (load (s32) from %stack.6, addrspace 5)
+    ; GCN-NEXT: $vgpr44 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 44, 0, 0, implicit $exec :: (load (s32) from %stack.5, addrspace 5)
+    ; GCN-NEXT: $vgpr43 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 48, 0, 0, implicit $exec :: (load (s32) from %stack.4, addrspace 5)
+    ; GCN-NEXT: $vgpr42 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 52, 0, 0, implicit $exec :: (load (s32) from %stack.3, addrspace 5)
+    ; GCN-NEXT: $vgpr41 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 56, 0, 0, implicit $exec :: (load (s32) from %stack.2, addrspace 5)
+    ; GCN-NEXT: $vgpr40 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 60, 0, 0, implicit $exec :: (load (s32) from %stack.1, addrspace 5)
+    ; GCN-NEXT: S_ENDPGM 0, implicit $sgpr4, implicit $scc
+    S_NOP 0, implicit-def $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7
+    S_NOP 0, implicit-def $vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
+    S_NOP 0, implicit-def $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23
+    S_NOP 0, implicit-def $vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
+    S_NOP 0, implicit $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39
+    S_NOP 0, implicit-def $vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47
+    S_NOP 0, implicit-def $vgpr48_vgpr49_vgpr50_vgpr51_vgpr52_vgpr53_vgpr54_vgpr55
+    S_NOP 0, implicit-def $vgpr56_vgpr57_vgpr58_vgpr59_vgpr60_vgpr61_vgpr62_vgpr63
+    S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
+
+    renamable $sgpr4 = S_MOV_B32 %stack.0
+
+    S_NOP 0, implicit $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7
+    S_NOP 0, implicit $vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
+    S_NOP 0, implicit $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23
+    S_NOP 0, implicit $vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
+    S_NOP 0, implicit $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39
+    S_NOP 0, implicit $vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47
+    S_NOP 0, implicit $vgpr48_vgpr49_vgpr50_vgpr51_vgpr52_vgpr53_vgpr54_vgpr55
+    S_NOP 0, implicit $vgpr56_vgpr57_vgpr58_vgpr59_vgpr60_vgpr61_vgpr62_vgpr63
+    S_ENDPGM 0, implicit $sgpr4, implicit $scc
+
+...
+
+# FIXME: This is clobbering scc
+---
+name: materialize_fi_s_mov_b32_offset_96_live_scc__no_free_vgprs
+tracksRegLiveness: true
+stack:
+  - { id: 0, type: default, size: 64, alignment: 16, stack-id: default }
+  - { id: 1, type: default, size: 4, alignment: 4, stack-id: default }
+machineFunctionInfo:
+  scratchRSrcReg:  $sgpr0_sgpr1_sgpr2_sgpr3
+  stackPtrOffsetReg: '$sgpr32'
+body:             |
+  bb.0:
+    liveins: $sgpr4, $sgpr5
+
+    ; GCN-LABEL: name: materialize_fi_s_mov_b32_offset...
[truncated]

@cdevadas (Collaborator) left a comment:

Rewrite the PR description. Include the point that eliminateFrameIndex wasn't handling the FI copy to Scalar registers and the default implementation breaks the code while trying to handle it. There is no need to mention the two spills and the emergency spill slots.

ret void
}

attributes #0 = { "amdgpu-stack-objects" "target-cpu"="gfx90a" }
Contributor:

Shouldn't need the IR section

Contributor Author:

removed.

renamable $vgpr0_vgpr1 = COPY killed renamable $sgpr22_sgpr23, implicit $exec
renamable $vgpr0_vgpr1 = COPY killed renamable $sgpr20_sgpr21, implicit $exec
renamable $vgpr0_vgpr1 = COPY killed renamable $sgpr18_sgpr19, implicit $exec
dead $sgpr30_sgpr31 = SI_CALL killed renamable $sgpr16_sgpr17, @killer, csr_amdgpu_gfx90ainsts, implicit killed $sgpr4_sgpr5, implicit killed $sgpr6_sgpr7, implicit killed $sgpr8_sgpr9, implicit killed $sgpr10_sgpr11, implicit killed $sgpr12, implicit killed $sgpr13, implicit killed $sgpr14, implicit killed $sgpr15, implicit $sgpr0_sgpr1_sgpr2_sgpr3, implicit-def $vgpr0, implicit-def $vgpr1
Contributor:

Replace the function reference with 0

Contributor Author:

resolved

IsCopy && IsSALU
? ResultReg
: RS->scavengeRegisterBackwards(AMDGPU::SReg_32_XM0RegClass,
MI, false, 0, false);
Contributor:

Suggested change
MI, false, 0, false);
MI, false, 0, /*AllowSpill=*/false);

Contributor Author:

resolved.

.addImm(Offset);
} else {
TmpResultReg = RS->scavengeRegisterBackwards(
AMDGPU::VGPR_32RegClass, MI, false, 0);
Contributor:

Suggested change
AMDGPU::VGPR_32RegClass, MI, false, 0);
AMDGPU::VGPR_32RegClass, MI, false, 0, /*AllowSpill=*/true);

Contributor Author:

resolved.

Comment on lines 2558 to 2559
auto Add = BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_ADD_U32_e32),
TmpResultReg);
Contributor:

You need to use getAddNoCarry here and then handle the failure case too

Contributor Author:

addressed in the current patch.

@@ -211,3 +211,327 @@ body: |
renamable $vgpr0 = V_ADD_CO_U32_e32 %stack.0, killed $vgpr0, implicit-def dead $vcc, implicit $exec
...

---
Contributor:

In a pre-commit, can you generate GFX8 and GFX9 checks in this test

Contributor Author:

addressed.

Register NewDest =
IsCopy ? ResultReg
: RS->scavengeRegisterBackwards(
AMDGPU::SReg_32RegClass, Add, false, 0);
Contributor:

Suggested change
AMDGPU::SReg_32RegClass, Add, false, 0);
AMDGPU::SReg_32RegClass, Add, false, 0, /*AllowSpill=*/true);

Contributor:

Is the case where this scavenge spill happens tested?

Contributor Author:

Yes, I believe the last two test cases in frame-index.mir are the spill cases, and they are working fine.

Comment on lines +2545 to +2550
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_LSHR_B32), TmpResultReg)
.addReg(FrameReg)
.addImm(ST.getWavefrontSizeLog2());
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_ADD_I32), TmpResultReg)
.addReg(TmpResultReg, RegState::Kill)
.addImm(Offset);
Contributor:

This exact sequence is repeated above?

Contributor Author:

I don't see these two instructions above; for offset 0 we have the shift but not the add.

Comment on lines +2565 to +2567
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_READFIRSTLANE_B32),
NewDest)
.addReg(TmpResultReg);
Contributor:

This whole sequence is repeated above?

Contributor Author:

Yes, that is the case when the offset is 0.

github-actions bot commented Jul 19, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@@ -0,0 +1,72 @@
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx90a -verify-machineinstrs -run-pass=prologepilog %s -o - | FileCheck %s

Contributor:

Test file name is still too generic

Contributor Author:

that is renamed now.

Comment on lines 2560 to 2561
if ((Add = TII->getAddNoCarry(*MBB, MI, DL, TmpResultReg, *RS)) ==
nullptr) {
Contributor:

It would be easiest to read if we handled the positive cases where this returned an add before the fallback path.

Contributor Author:

That case will not arise now; the null case is handled inside getAddNoCarry.

Contributor:

I don't understand; getAddNoCarry has a failure path which you are checking for.

Contributor Author:

Now getAddNoCarry will return one of these three: V_ADD_U32_e32, V_MAD_U32_U24_e64, or V_ADD_CO_U32_e64.

Contributor Author:

There are a couple more test cases that fail if I try to handle this inside getAddNoCarry,
and the code above that expects nullptr would not be reachable.
else {
MachineInstrBuilder MIB;
if (!IsSALU) {
if ((MIB = TII->getAddNoCarry(*MBB, MI, DL, ResultReg, *RS)) !=
nullptr) {

// scavenged? This a way around to avoid carry, need follow-up.
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32), ResultReg)
.addImm(Offset);
Add = BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_MAD_I32_I24_e64),
Contributor:

We know we can use the unsigned version here. The offset will always be a small positive constant and the SP/FP will always be positive

@PankajDwivedi-25 (Contributor Author) Jul 24, 2024:

got it, replacing v_mad_i32_i24 with v_mad_u32_u24

// scavenged? This a way around to avoid carry, need follow-up.
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32), ResultReg)
.addImm(Offset);
Add = BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_MAD_I32_I24_e64),
Contributor:

You could handle using mad like this directly in getAddNoCarry.

I was also originally thinking of folding in the FP scaling into the mad. I.e. v_mad_u32_u24 fp, wavesize, offset

Contributor Author:

Then I may have to modify the interface for getAddNoCarry, since we are using a scavenged register to shift the offset into.
Or should I handle that part as it is at present?

Contributor:

I guess it's split to avoid the scavenging inside. I guess just reorder the handling for now

Contributor:

But it's going to shift around if we can fold in the shift by wavesize

Contributor Author:

There are more cases that need to be handled if I move the handling inside getAddNoCarry.
It hits assert(MIB->getOpcode() == AMDGPU::V_ADD_CO_U32_e64 && "Need to reuse carry out register");
where we are trying to handle the !IsSALU cases.

For now I have left it custom handled.

# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=tonga -verify-machineinstrs -run-pass=prologepilog %s -o - | FileCheck -check-prefixes=GFX8,GCN %s
# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs -run-pass=prologepilog %s -o - | FileCheck -check-prefixes=GFX900,GCN %s
# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx90a -verify-machineinstrs -run-pass=prologepilog %s -o - | FileCheck -check-prefixes=GFX90A,GCN %s

Contributor:

The test name shouldn't mention failure; this is just the lowering handling. Also, it's s_mov, not copy.

Contributor Author:

sure.

@@ -0,0 +1,574 @@
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=tonga -verify-machineinstrs -run-pass=prologepilog %s -o - | FileCheck -check-prefixes=GFX8,GCN %s
Contributor:

How about eliminate-frame-index-s-mov-b32.mir

Contributor Author:

great!

Comment on lines 2573 to 2574
// since emergency stack slot is already used for spilling VGPR
// scavenged? This a way around to avoid carry, need follow-up.
Contributor:

Needs capitalization and rephrasing. I'm not sure I understand the comment. Is everything working now?

Contributor Author:

yes, all cases are working now.

Comment on lines 2575 to 2576
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32), ResultReg)
.addImm(Offset);
Contributor:

Instead you can materialize in the VGPR TmpResultReg. That saves the constant bus use for the frame pointer register to directly materialize in one mad

@PankajDwivedi-25 (Contributor Author) Jul 24, 2024:

It will clobber the shift value in TmpResultReg, right?

Contributor:

Not if you replace the whole sequence. It's the free register you just scavenged, so it would be
tmp_vgpr = offset
tmp_vgpr = mad24 FrameReg, wavesize, tmp_vgpr

Contributor:

Also you should add an assert that the offset is positive. I think we need to switch this to the signed mad when we switch the scratch instruction ABI to use negative offsets

@PankajDwivedi-25 (Contributor Author) Jul 29, 2024:

Not if you replace the whole sequence. It's the free register you just scavenged, so it would be: tmp_vgpr = offset; tmp_vgpr = mad24 FrameReg, wavesize, tmp_vgpr

That's a really great idea, but now it looks like we are trying to replace a right shift with a left shift?

In the VGPR we are performing a right shift by WavefrontSizeLog2, which is effectively a divide by WavefrontSize,
but now FrameReg * WavefrontSize is effectively a left shift by WavefrontSizeLog2.
Is mad the correct operator then?

Contributor Author:

Also you should add an assert that the offset is positive. I think we need to switch this to the signed mad when we switch the scratch instruction ABI to use negative offsets

sure, will add it.

@arsenm (Contributor) Jul 29, 2024:

We start with a frame pointer with a wave space value, and an offset in lane-space. We are materializing a lane space value. You can either do a right shift of the frame pointer to get to lane space, or a left shift of the offset to get to wavespace. We can right shift after the computation to get back to the desired per-lane value.

I also spent some time with alive2, and unfortunately we can only use the mad24 on gfx9 and gfx10. https://alive2.llvm.org/ce/z/_gVKVG. The stack size increased in gfx11 so that the maximum sized object computation can no longer be done with the 24-bit multiply.
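
For reference, a sketch of the identity behind this reassociation (assuming the intermediate value does not overflow 32 bits), with $W$ the wavefront size:

\[
\left\lfloor \frac{\mathrm{FrameReg}}{W} \right\rfloor + \mathrm{Offset}
\;=\; \left\lfloor \frac{\mathrm{FrameReg} + \mathrm{Offset}\cdot W}{W} \right\rfloor
\]

So scaling the offset up into wave space, doing the mul + add there, and shifting right once at the end yields the same per-lane value as shifting the frame pointer first.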

Contributor:

Actually we might be able to - I wrote the assumes, assuming the FP and the offset are the maximum. But these are actually linked. The condition is more like fp + offset < maximum value

Contributor:

Hmm. Still seems like it would require a 26-bit multiply in wave32

Contributor Author:

I have made the changes; is this what you suggested?

// We have to produce a carry out, and there isn't a free SGPR
// pair for it. This a way around to avoid carry.
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32), ResultReg)
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_MOV_B32_e32),
Contributor:

When you do this reassociation, you need to multiply the offset by the wavesize. You're doing the mul + add in terms of waves
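
Pulling the pieces of this thread together, the intended expansion for the live-SCC, no-free-SGPR case would look roughly like the sketch below (register names and immediates are placeholders, not output of the final patch; the operand order and the shift back to lane space were still being adjusted here and in a follow-up commit):

    $vgpr_tmp = V_MOV_B32_e32 <Offset>, implicit $exec
    $vgpr_tmp = V_MAD_U32_U24_e64 killed $vgpr_tmp, <WavefrontSize>, $sgpr32, 0, implicit $exec
    $vgpr_tmp = V_LSHRREV_B32_e64 <log2 WavefrontSize>, killed $vgpr_tmp, implicit $exec
    $sgpr_dst = V_READFIRSTLANE_B32 $vgpr_tmp, implicit $exec

That is, materialize the offset in a scavenged VGPR, do the mul + add in wave space with the mad (which, unlike V_ADD_CO_U32, clobbers no carry register), shift the result back to lane space, and read it into the destination SGPR.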

@@ -368,9 +368,9 @@ body: |
; GFX8-NEXT: S_NOP 0, implicit-def $sgpr56_sgpr57_sgpr58_sgpr59_sgpr60_sgpr61_sgpr62_sgpr63
; GFX8-NEXT: V_CMP_EQ_U32_e32 0, killed $vgpr0, implicit-def $vcc, implicit $exec
; GFX8-NEXT: S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
; GFX8-NEXT: $vgpr0 = V_MOV_B32_e32 64, implicit $exec
Contributor:

This test also should get gfx10, gfx11, and gfx12 runlines (maybe leave for a separate commit though)

Contributor Author:

added.

BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_MOV_B32_e32),
TmpResultReg)
.addImm(Offset);
assert(Offset > 0 && "Offset is positive");
Contributor:

Suggested change
assert(Offset > 0 && "Offset is positive");
assert(Offset > 0 && isUInt<24>(2 * ST.getMaxWaveScratchSize()) &&
"offset is unsafe for v_mad_u32_u24");

This assert should also be strengthened to cover the range; I need to work out the exact condition, but it would be something like this.

Contributor:

I think we can squeeze a bit more range out of this with more instructions, but we shouldn't need to on gfx11/12 since they can just directly encode the larger offset in the add

@arsenm (Contributor) Jul 29, 2024:

Plus when we use scratch instructions, the SP value is in lane values anyway so we can do everything in one add

Comment on lines 2579 to 2581
.addReg(FrameReg)
.addImm(ST.getWavefrontSize())
.addReg(TmpResultReg, RegState::Kill)
Contributor:

This has the operands swapped. This should be doing the multiply of the offset and adding the frame reg

Suggested change
.addReg(FrameReg)
.addImm(ST.getWavefrontSize())
.addReg(TmpResultReg, RegState::Kill)
.addReg(TmpResultReg, RegState::Kill)
.addImm(ST.getWavefrontSize())
.addReg(FrameReg)

} else
Add.addImm(Offset).addReg(TmpResultReg, RegState::Kill);
} else {
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_MOV_B32_e32),
Contributor:

This needs comments about why this is doing this reassociation vs. above

@arsenm (Contributor) commented Jul 29, 2024

It would also be good if we had an end to end IR test that hit these new paths

// value. We can either do a right shift of the frame pointer to
// get to lane space, or a left shift of the offset to get to
// wavespace. We can right shift after the computation to get
// back to the desired per-lane value.
Contributor:

Mention we are using the mad_u32_u24 primarily as an add with no carry out clobber

@@ -2580,6 +2580,8 @@ bool SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI,
// get to lane space, or a left shift of the offset to get to
// wavespace. We can right shift after the computation to get
// back to the desired per-lane value.
// we are using the mad_u32_u24 primarily as an add with no
Contributor:

Comments should be capitalized

@arsenm (Contributor) left a comment:

It would still be good if we had an IR test that hit these cases

@arsenm (Contributor) commented Jul 30, 2024

It would still be good if we had an IR test that hit these cases

I think I've managed to craft one that will need some adjustment to hit the various cases

Comment on lines 2557 to 2559
BuildMI(*MBB, *Add, DL, TII->get(AMDGPU::V_LSHR_B32_e64),
TmpResultReg)
.addImm(ST.getWavefrontSizeLog2())
Contributor:

This should use V_LSHRREV_B32_e64

Comment on lines 2591 to 2594
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_LSHR_B32_e64),
TmpResultReg)
.addImm(ST.getWavefrontSizeLog2())
.addReg(FrameReg);
Contributor:

lshrrev

} else
Add.addImm(Offset).addReg(TmpResultReg, RegState::Kill);
} else {
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_MOV_B32_e32),
Contributor:

If the offset is an inline immediate, you can just fold it directly into the mad24 operand as well

Contributor:

Should address this in a follow up

Contributor:

We also should special case folding frame indexes into S_ADD_I32/S_OR_B32

@arsenm (Contributor) left a comment:

I'll open another PR with additional IR tests

@arsenm merged commit 57d10b4 into llvm:main on Jul 31, 2024
7 checks passed
@PankajDwivedi-25 (Contributor Author):

I'll open another PR with additional IR tests

Sure Matt, will add the changes you suggested.

@arsenm (Contributor) commented Jul 31, 2024

I'll open another PR with additional IR tests

This is #101306

PankajDwivedi-25 added a commit that referenced this pull request Aug 2, 2024
In case of v_mad we have materialized the offset in vgpr and mad is
performed in wave space, later vgpr have to be shifted back in lane
space. [#99556](#99556)
introduces a bug.

Co-authored-by: Pankajdwivedi-25 <[email protected]>
jrbyrnes pushed a commit to jrbyrnes/llvm-project that referenced this pull request Aug 16, 2024
llvm#99556)

eliminateFrameIndex wasn't handling the FI copy to Scalar registers and
the default implementation breaks the code while trying to handle it.
This patch handles the broken lowering and also takes care of some edge
cases that might arise. This case is tricky for non-zero offset, scc &
vcc is live and we don't find sgpr pair available.
Co-authored by @arsenm

---------
Co-authored-by: Matt Arsenault <[email protected]>
Co-authored-by: PankajDwivedi-25 <[email protected]>

Change-Id: I760538f594d267bdc414833c4489486e426fb55b
jrbyrnes pushed a commit to jrbyrnes/llvm-project that referenced this pull request Aug 16, 2024
In case of v_mad we have materialized the offset in vgpr and mad is
performed in wave space, later vgpr have to be shifted back in lane
space. [llvm#99556](llvm#99556)
introduces a bug.

Co-authored-by: Pankajdwivedi-25 <[email protected]>
Change-Id: Ic02338cf087cb3c95964b7273609a73a3e7b2d03
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Sep 11, 2024
llvm#99556)

eliminateFrameIndex wasn't handling the FI copy to Scalar registers and
the default implementation breaks the code while trying to handle it.
This patch handles the broken lowering and also takes care of some edge
cases that might arise. This case is tricky for non-zero offset, scc &
vcc is live and we don't find sgpr pair available.
Co-authored by @arsenm

---------
Co-authored-by: Matt Arsenault <[email protected]>
Co-authored-by: PankajDwivedi-25 <[email protected]>

Change-Id: I760538f594d267bdc414833c4489486e426fb55b
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Sep 11, 2024
In case of v_mad we have materialized the offset in vgpr and mad is
performed in wave space, later vgpr have to be shifted back in lane
space. [llvm#99556](llvm#99556)
introduces a bug.

Co-authored-by: Pankajdwivedi-25 <[email protected]>
Change-Id: Ic02338cf087cb3c95964b7273609a73a3e7b2d03
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Nov 22, 2024
llvm#99556)

eliminateFrameIndex wasn't handling the FI copy to Scalar registers and
the default implementation breaks the code while trying to handle it.
This patch handles the broken lowering and also takes care of some edge
cases that might arise. This case is tricky for non-zero offset, scc &
vcc is live and we don't find sgpr pair available.
Co-authored by @arsenm

---------
Co-authored-by: Matt Arsenault <[email protected]>
Co-authored-by: PankajDwivedi-25 <[email protected]>

Change-Id: I8db13328357723da96c1bfaa2fd2368430b17100
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Nov 22, 2024
In case of v_mad we have materialized the offset in vgpr and mad is
performed in wave space, later vgpr have to be shifted back in lane
space. [llvm#99556](llvm#99556)
introduces a bug.

Co-authored-by: Pankajdwivedi-25 <[email protected]>
Change-Id: I23db2a321c9bf3e28b5e173ddc95fa98c4d16fd6