
[AMDGPU] Inplace FI elimination during PEI for scalar copy instruction #99556


Merged
20 commits merged on Jul 31, 2024

Conversation

@PankajDwivedi-25 (Contributor) commented Jul 18, 2024

eliminateFrameIndex wasn't handling the FI copy to scalar registers, and the default implementation breaks the code while trying to handle it. This patch handles the broken lowering and also takes care of some edge cases that might arise. The tricky case is a non-zero offset where SCC and VCC are live and no SGPR pair is available.
Co-authored by @arsenm
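
For illustration, here is roughly what the in-place elimination produces in the simple cases, condensed from the MIR tests added by this patch (register choices are whatever the scavenger picks):

    Offset 0, SCC dead — before PEI:
      renamable $sgpr4 = S_MOV_B32 %stack.0
    after PEI:
      $sgpr4 = S_LSHR_B32 $sgpr32, 6, implicit-def dead $scc

    Offset 0, SCC live — before PEI:
      renamable $sgpr4 = S_MOV_B32 %stack.0
    after PEI:
      $vgpr0 = V_LSHRREV_B32_e64 6, $sgpr32, implicit $exec
      $sgpr4 = V_READFIRSTLANE_B32 $vgpr0, implicit $exec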

@llvmbot (Member) commented Jul 18, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Pankaj Dwivedi (PankajDwivedi-25)

Changes

During the prologepilog pass, while trying to eliminate a frame index for a scalar copy instruction, we ended up in a case where two spills tried to use the same emergency stack slot, which leads to almost a dead end with only one emergency stack slot available.
This is solved by avoiding the spilling in the first place and instead eliminating the frame index in place for the scalar copy instruction.
In-place frame index elimination for the scalar copy instruction solves the issue, but this patch still might not cover all the corner cases.
Co-authored by @arsenm


Patch is 39.87 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/99556.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp (+52-18)
  • (modified) llvm/test/CodeGen/AMDGPU/frame-index.mir (+324)
  • (added) llvm/test/CodeGen/AMDGPU/reduce-frame-index.mir (+176)
diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
index 8a315aa822786..05153383a3a21 100644
--- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
@@ -2449,7 +2449,8 @@ bool SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI,
                                             ? &AMDGPU::SReg_32RegClass
                                             : &AMDGPU::VGPR_32RegClass;
         bool IsCopy = MI->getOpcode() == AMDGPU::V_MOV_B32_e32 ||
-                      MI->getOpcode() == AMDGPU::V_MOV_B32_e64;
+                      MI->getOpcode() == AMDGPU::V_MOV_B32_e64 ||
+                      MI->getOpcode() == AMDGPU::S_MOV_B32;
         Register ResultReg =
             IsCopy ? MI->getOperand(0).getReg()
                    : RS->scavengeRegisterBackwards(*RC, MI, false, 0);
@@ -2458,7 +2459,13 @@ bool SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI,
         if (Offset == 0) {
           unsigned OpCode = IsSALU && !LiveSCC ? AMDGPU::S_LSHR_B32
                                                : AMDGPU::V_LSHRREV_B32_e64;
-          auto Shift = BuildMI(*MBB, MI, DL, TII->get(OpCode), ResultReg);
+          Register TmpResultReg = ResultReg;
+          if (IsSALU && LiveSCC) {
+            TmpResultReg = RS->scavengeRegisterBackwards(
+                AMDGPU::VGPR_32RegClass, MI, false, 0);
+          }
+
+          auto Shift = BuildMI(*MBB, MI, DL, TII->get(OpCode), TmpResultReg);
           if (OpCode == AMDGPU::V_LSHRREV_B32_e64)
             // For V_LSHRREV, the operands are reversed (the shift count goes
             // first).
@@ -2468,11 +2475,13 @@ bool SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI,
           if (IsSALU && !LiveSCC)
             Shift.getInstr()->getOperand(3).setIsDead(); // Mark SCC as dead.
           if (IsSALU && LiveSCC) {
-            Register NewDest = RS->scavengeRegisterBackwards(
-                AMDGPU::SReg_32RegClass, Shift, false, 0);
+            Register NewDest =
+                IsCopy ? ResultReg
+                       : RS->scavengeRegisterBackwards(AMDGPU::SReg_32RegClass,
+                                                       Shift, false, 0);
             BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_READFIRSTLANE_B32),
                     NewDest)
-                .addReg(ResultReg);
+                .addReg(TmpResultReg);
             ResultReg = NewDest;
           }
         } else {
@@ -2523,22 +2532,47 @@ bool SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI,
 
             // We may have 1 free scratch SGPR even though a carry out is
             // unavailable. Only one additional mov is needed.
-            Register TmpScaledReg = RS->scavengeRegisterBackwards(
-                AMDGPU::SReg_32_XM0RegClass, MI, false, 0, false);
-            Register ScaledReg = TmpScaledReg.isValid() ? TmpScaledReg : FrameReg;
-
-            BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_LSHR_B32), ScaledReg)
-              .addReg(FrameReg)
-              .addImm(ST.getWavefrontSizeLog2());
-            BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_ADD_I32), ScaledReg)
-                .addReg(ScaledReg, RegState::Kill)
-                .addImm(Offset);
+            Register TmpScaledReg =
+                IsCopy && IsSALU
+                    ? ResultReg
+                    : RS->scavengeRegisterBackwards(AMDGPU::SReg_32_XM0RegClass,
+                                                    MI, false, 0, false);
+            Register ScaledReg =
+                TmpScaledReg.isValid() ? TmpScaledReg : FrameReg;
+            Register TmpResultReg = ScaledReg;
+
+            if (!LiveSCC) {
+              BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_LSHR_B32), TmpResultReg)
+                  .addReg(FrameReg)
+                  .addImm(ST.getWavefrontSizeLog2());
+              BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_ADD_I32), TmpResultReg)
+                  .addReg(TmpResultReg, RegState::Kill)
+                  .addImm(Offset);
+            } else {
+              TmpResultReg = RS->scavengeRegisterBackwards(
+                  AMDGPU::VGPR_32RegClass, MI, false, 0);
+
+              BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_LSHR_B32_e64),
+                      TmpResultReg)
+                  .addImm(ST.getWavefrontSizeLog2())
+                  .addReg(FrameReg);
+              auto Add = BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_ADD_U32_e32),
+                                 TmpResultReg);
+              Add.addImm(Offset).addReg(TmpResultReg, RegState::Kill);
+              Register NewDest =
+                  IsCopy ? ResultReg
+                         : RS->scavengeRegisterBackwards(
+                               AMDGPU::SReg_32RegClass, Add, false, 0);
+              BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_READFIRSTLANE_B32),
+                      NewDest)
+                  .addReg(TmpResultReg);
+              ResultReg = NewDest;
+            }
             if (!IsSALU)
               BuildMI(*MBB, MI, DL, TII->get(AMDGPU::COPY), ResultReg)
-                  .addReg(ScaledReg, RegState::Kill);
+                  .addReg(TmpResultReg, RegState::Kill);
             else
-              ResultReg = ScaledReg;
-
+              ResultReg = TmpResultReg;
             // If there were truly no free SGPRs, we need to undo everything.
             if (!TmpScaledReg.isValid()) {
               BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_ADD_I32), ScaledReg)
diff --git a/llvm/test/CodeGen/AMDGPU/frame-index.mir b/llvm/test/CodeGen/AMDGPU/frame-index.mir
index d8736c5276a26..f4476a497f754 100644
--- a/llvm/test/CodeGen/AMDGPU/frame-index.mir
+++ b/llvm/test/CodeGen/AMDGPU/frame-index.mir
@@ -211,3 +211,327 @@ body:             |
     renamable $vgpr0 = V_ADD_CO_U32_e32 %stack.0, killed $vgpr0, implicit-def dead $vcc, implicit $exec
 ...
 
+---
+name: materialize_fi_s_mov_b32_offset_0_dead_scc
+tracksRegLiveness: true
+stack:
+  - { id: 0, type: default, size: 4, alignment: 4, stack-id: default }
+machineFunctionInfo:
+  stackPtrOffsetReg: '$sgpr32'
+body:             |
+  bb.0:
+    ; GCN-LABEL: name: materialize_fi_s_mov_b32_offset_0_dead_scc
+    ; GCN: $sgpr4 = S_LSHR_B32 $sgpr32, 6, implicit-def dead $scc
+    ; GCN-NEXT: S_ENDPGM 0, implicit $sgpr4
+  renamable $sgpr4 = S_MOV_B32 %stack.0
+  S_ENDPGM 0, implicit $sgpr4
+
+...
+
+---
+name: materialize_fi_s_mov_b32_offset_0_live_scc
+tracksRegLiveness: true
+stack:
+  - { id: 0, type: default, size: 4, alignment: 4, stack-id: default }
+machineFunctionInfo:
+  stackPtrOffsetReg: '$sgpr32'
+body:             |
+  bb.0:
+    liveins: $sgpr4, $sgpr5
+
+    ; GCN-LABEL: name: materialize_fi_s_mov_b32_offset_0_live_scc
+    ; GCN: liveins: $sgpr4, $sgpr5
+    ; GCN-NEXT: {{  $}}
+    ; GCN-NEXT: S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
+    ; GCN-NEXT: $vgpr0 = V_LSHRREV_B32_e64 6, $sgpr32, implicit $exec
+    ; GCN-NEXT: $sgpr4 = V_READFIRSTLANE_B32 $vgpr0, implicit $exec
+    ; GCN-NEXT: S_ENDPGM 0, implicit $sgpr4, implicit $scc
+  S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
+  renamable $sgpr4 = S_MOV_B32 %stack.0
+  S_ENDPGM 0, implicit $sgpr4, implicit $scc
+
+...
+
+# FI#0 is filler to get a non-0 offset for FI#1
+---
+name: materialize_fi_s_mov_b32_offset_64_dead_scc
+tracksRegLiveness: true
+stack:
+  - { id: 0, type: default, size: 64, alignment: 16, stack-id: default }
+  - { id: 1, type: default, size: 4, alignment: 4, stack-id: default }
+machineFunctionInfo:
+  stackPtrOffsetReg: '$sgpr32'
+body:             |
+  bb.0:
+    ; GCN-LABEL: name: materialize_fi_s_mov_b32_offset_64_dead_scc
+    ; GCN: $sgpr4 = S_LSHR_B32 $sgpr32, 6, implicit-def $scc
+    ; GCN-NEXT: $sgpr4 = S_ADD_I32 killed $sgpr4, 64, implicit-def $scc
+    ; GCN-NEXT: S_ENDPGM 0, implicit $sgpr4
+  renamable $sgpr4 = S_MOV_B32 %stack.1
+  S_ENDPGM 0, implicit $sgpr4
+
+...
+
+---
+name: materialize_fi_s_mov_b32_offset_68_dead_scc
+tracksRegLiveness: true
+stack:
+  - { id: 0, type: default, size: 68, alignment: 16, stack-id: default }
+  - { id: 1, type: default, size: 4, alignment: 4, stack-id: default }
+machineFunctionInfo:
+  stackPtrOffsetReg: '$sgpr32'
+body:             |
+  bb.0:
+    ; GCN-LABEL: name: materialize_fi_s_mov_b32_offset_68_dead_scc
+    ; GCN: $sgpr4 = S_LSHR_B32 $sgpr32, 6, implicit-def $scc
+    ; GCN-NEXT: $sgpr4 = S_ADD_I32 killed $sgpr4, 68, implicit-def $scc
+    ; GCN-NEXT: S_ENDPGM 0, implicit $sgpr4
+  renamable $sgpr4 = S_MOV_B32 %stack.1
+  S_ENDPGM 0, implicit $sgpr4
+
+...
+
+---
+name: materialize_fi_s_mov_b32_offset_64_live_scc
+tracksRegLiveness: true
+stack:
+  - { id: 0, type: default, size: 64, alignment: 16, stack-id: default }
+  - { id: 1, type: default, size: 4, alignment: 4, stack-id: default }
+machineFunctionInfo:
+  stackPtrOffsetReg: '$sgpr32'
+body:             |
+  bb.0:
+    liveins: $sgpr4, $sgpr5
+
+    ; GCN-LABEL: name: materialize_fi_s_mov_b32_offset_64_live_scc
+    ; GCN: liveins: $sgpr4, $sgpr5
+    ; GCN-NEXT: {{  $}}
+    ; GCN-NEXT: S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
+    ; GCN-NEXT: $vgpr0 = V_LSHR_B32_e64 6, $sgpr32, implicit $exec
+    ; GCN-NEXT: $vgpr0 = V_ADD_U32_e32 64, killed $vgpr0, implicit $exec
+    ; GCN-NEXT: $sgpr4 = V_READFIRSTLANE_B32 $vgpr0, implicit $exec
+    ; GCN-NEXT: S_ENDPGM 0, implicit $sgpr4, implicit $scc
+  S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
+  renamable $sgpr4 = S_MOV_B32 %stack.1
+  S_ENDPGM 0, implicit $sgpr4, implicit $scc
+
+...
+
+---
+name: materialize_fi_s_mov_b32_offset_68_live_scc
+tracksRegLiveness: true
+stack:
+  - { id: 0, type: default, size: 68, alignment: 16, stack-id: default }
+  - { id: 1, type: default, size: 4, alignment: 4, stack-id: default }
+machineFunctionInfo:
+  stackPtrOffsetReg: '$sgpr32'
+body:             |
+  bb.0:
+    liveins: $sgpr4, $sgpr5
+
+    ; GCN-LABEL: name: materialize_fi_s_mov_b32_offset_68_live_scc
+    ; GCN: liveins: $sgpr4, $sgpr5
+    ; GCN-NEXT: {{  $}}
+    ; GCN-NEXT: S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
+    ; GCN-NEXT: $vgpr0 = V_LSHR_B32_e64 6, $sgpr32, implicit $exec
+    ; GCN-NEXT: $vgpr0 = V_ADD_U32_e32 68, killed $vgpr0, implicit $exec
+    ; GCN-NEXT: $sgpr4 = V_READFIRSTLANE_B32 $vgpr0, implicit $exec
+    ; GCN-NEXT: S_ENDPGM 0, implicit $sgpr4, implicit $scc
+  S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
+  renamable $sgpr4 = S_MOV_B32 %stack.1
+  S_ENDPGM 0, implicit $sgpr4, implicit $scc
+
+...
+
+# FIXME: This is finding a VGPR
+---
+name: materialize_fi_s_mov_b32_offset_0_live_scc__no_free_vgprs
+tracksRegLiveness: true
+stack:
+  - { id: 0, type: default, size: 4, alignment: 16, stack-id: default }
+machineFunctionInfo:
+  scratchRSrcReg:  $sgpr0_sgpr1_sgpr2_sgpr3
+  stackPtrOffsetReg: '$sgpr32'
+  occupancy: 10
+
+body:             |
+  bb.0:
+    liveins: $sgpr4, $sgpr5, $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39
+
+    ; GCN-LABEL: name: materialize_fi_s_mov_b32_offset_0_live_scc__no_free_vgprs
+    ; GCN: liveins: $sgpr4, $sgpr5, $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39, $vgpr40, $vgpr41, $vgpr42, $vgpr43, $vgpr44, $vgpr45, $vgpr46, $vgpr47, $vgpr56, $vgpr57, $vgpr58, $vgpr59, $vgpr60, $vgpr61, $vgpr62, $vgpr63
+    ; GCN-NEXT: {{  $}}
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr40, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 60, 0, 0, implicit $exec :: (store (s32) into %stack.1, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr41, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 56, 0, 0, implicit $exec :: (store (s32) into %stack.2, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr42, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 52, 0, 0, implicit $exec :: (store (s32) into %stack.3, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr43, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 48, 0, 0, implicit $exec :: (store (s32) into %stack.4, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr44, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 44, 0, 0, implicit $exec :: (store (s32) into %stack.5, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr45, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 40, 0, 0, implicit $exec :: (store (s32) into %stack.6, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr46, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 36, 0, 0, implicit $exec :: (store (s32) into %stack.7, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr47, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 32, 0, 0, implicit $exec :: (store (s32) into %stack.8, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr56, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 28, 0, 0, implicit $exec :: (store (s32) into %stack.9, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr57, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 24, 0, 0, implicit $exec :: (store (s32) into %stack.10, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr58, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 20, 0, 0, implicit $exec :: (store (s32) into %stack.11, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr59, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 16, 0, 0, implicit $exec :: (store (s32) into %stack.12, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr60, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 12, 0, 0, implicit $exec :: (store (s32) into %stack.13, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr61, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 8, 0, 0, implicit $exec :: (store (s32) into %stack.14, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr62, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, implicit $exec :: (store (s32) into %stack.15, addrspace 5)
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr63, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, implicit $exec :: (store (s32) into %stack.16, addrspace 5)
+    ; GCN-NEXT: S_NOP 0, implicit-def $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7
+    ; GCN-NEXT: S_NOP 0, implicit-def $vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
+    ; GCN-NEXT: S_NOP 0, implicit-def $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23
+    ; GCN-NEXT: S_NOP 0, implicit-def $vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
+    ; GCN-NEXT: S_NOP 0, implicit $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39
+    ; GCN-NEXT: S_NOP 0, implicit-def $vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47
+    ; GCN-NEXT: S_NOP 0, implicit-def $vgpr48_vgpr49_vgpr50_vgpr51_vgpr52_vgpr53_vgpr54_vgpr55
+    ; GCN-NEXT: S_NOP 0, implicit-def $vgpr56_vgpr57_vgpr58_vgpr59_vgpr60_vgpr61_vgpr62_vgpr63
+    ; GCN-NEXT: S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
+    ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 68, 0, 0, implicit $exec :: (store (s32) into %stack.17, addrspace 5)
+    ; GCN-NEXT: $vgpr0 = V_LSHR_B32_e64 6, $sgpr32, implicit $exec
+    ; GCN-NEXT: $vgpr0 = V_ADD_U32_e32 64, killed $vgpr0, implicit $exec
+    ; GCN-NEXT: $sgpr4 = V_READFIRSTLANE_B32 $vgpr0, implicit $exec
+    ; GCN-NEXT: $vgpr0 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 68, 0, 0, implicit $exec :: (load (s32) from %stack.17, addrspace 5)
+    ; GCN-NEXT: S_NOP 0, implicit $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7
+    ; GCN-NEXT: S_NOP 0, implicit $vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
+    ; GCN-NEXT: S_NOP 0, implicit $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23
+    ; GCN-NEXT: S_NOP 0, implicit $vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
+    ; GCN-NEXT: S_NOP 0, implicit $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39
+    ; GCN-NEXT: S_NOP 0, implicit $vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47
+    ; GCN-NEXT: S_NOP 0, implicit $vgpr48_vgpr49_vgpr50_vgpr51_vgpr52_vgpr53_vgpr54_vgpr55
+    ; GCN-NEXT: S_NOP 0, implicit $vgpr56_vgpr57_vgpr58_vgpr59_vgpr60_vgpr61_vgpr62_vgpr63
+    ; GCN-NEXT: $vgpr63 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, implicit $exec :: (load (s32) from %stack.16, addrspace 5)
+    ; GCN-NEXT: $vgpr62 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, implicit $exec :: (load (s32) from %stack.15, addrspace 5)
+    ; GCN-NEXT: $vgpr61 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 8, 0, 0, implicit $exec :: (load (s32) from %stack.14, addrspace 5)
+    ; GCN-NEXT: $vgpr60 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 12, 0, 0, implicit $exec :: (load (s32) from %stack.13, addrspace 5)
+    ; GCN-NEXT: $vgpr59 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 16, 0, 0, implicit $exec :: (load (s32) from %stack.12, addrspace 5)
+    ; GCN-NEXT: $vgpr58 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 20, 0, 0, implicit $exec :: (load (s32) from %stack.11, addrspace 5)
+    ; GCN-NEXT: $vgpr57 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 24, 0, 0, implicit $exec :: (load (s32) from %stack.10, addrspace 5)
+    ; GCN-NEXT: $vgpr56 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 28, 0, 0, implicit $exec :: (load (s32) from %stack.9, addrspace 5)
+    ; GCN-NEXT: $vgpr47 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 32, 0, 0, implicit $exec :: (load (s32) from %stack.8, addrspace 5)
+    ; GCN-NEXT: $vgpr46 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 36, 0, 0, implicit $exec :: (load (s32) from %stack.7, addrspace 5)
+    ; GCN-NEXT: $vgpr45 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 40, 0, 0, implicit $exec :: (load (s32) from %stack.6, addrspace 5)
+    ; GCN-NEXT: $vgpr44 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 44, 0, 0, implicit $exec :: (load (s32) from %stack.5, addrspace 5)
+    ; GCN-NEXT: $vgpr43 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 48, 0, 0, implicit $exec :: (load (s32) from %stack.4, addrspace 5)
+    ; GCN-NEXT: $vgpr42 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 52, 0, 0, implicit $exec :: (load (s32) from %stack.3, addrspace 5)
+    ; GCN-NEXT: $vgpr41 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 56, 0, 0, implicit $exec :: (load (s32) from %stack.2, addrspace 5)
+    ; GCN-NEXT: $vgpr40 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 60, 0, 0, implicit $exec :: (load (s32) from %stack.1, addrspace 5)
+    ; GCN-NEXT: S_ENDPGM 0, implicit $sgpr4, implicit $scc
+    S_NOP 0, implicit-def $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7
+    S_NOP 0, implicit-def $vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
+    S_NOP 0, implicit-def $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23
+    S_NOP 0, implicit-def $vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
+    S_NOP 0, implicit $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39
+    S_NOP 0, implicit-def $vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47
+    S_NOP 0, implicit-def $vgpr48_vgpr49_vgpr50_vgpr51_vgpr52_vgpr53_vgpr54_vgpr55
+    S_NOP 0, implicit-def $vgpr56_vgpr57_vgpr58_vgpr59_vgpr60_vgpr61_vgpr62_vgpr63
+    S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
+
+    renamable $sgpr4 = S_MOV_B32 %stack.0
+
+    S_NOP 0, implicit $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7
+    S_NOP 0, implicit $vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
+    S_NOP 0, implicit $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23
+    S_NOP 0, implicit $vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
+    S_NOP 0, implicit $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39
+    S_NOP 0, implicit $vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47
+    S_NOP 0, implicit $vgpr48_vgpr49_vgpr50_vgpr51_vgpr52_vgpr53_vgpr54_vgpr55
+    S_NOP 0, implicit $vgpr56_vgpr57_vgpr58_vgpr59_vgpr60_vgpr61_vgpr62_vgpr63
+    S_ENDPGM 0, implicit $sgpr4, implicit $scc
+
+...
+
+# FIXME: This is clobbering scc
+---
+name: materialize_fi_s_mov_b32_offset_96_live_scc__no_free_vgprs
+tracksRegLiveness: true
+stack:
+  - { id: 0, type: default, size: 64, alignment: 16, stack-id: default }
+  - { id: 1, type: default, size: 4, alignment: 4, stack-id: default }
+machineFunctionInfo:
+  scratchRSrcReg:  $sgpr0_sgpr1_sgpr2_sgpr3
+  stackPtrOffsetReg: '$sgpr32'
+body:             |
+  bb.0:
+    liveins: $sgpr4, $sgpr5
+
+    ; GCN-LABEL: name: materialize_fi_s_mov_b32_offset...
[truncated]

@cdevadas (Collaborator) left a comment:

Rewrite the PR description. Include the point that eliminateFrameIndex wasn't handling the FI copy to Scalar registers and the default implementation breaks the code while trying to handle it. There is no need to mention the two spills and the emergency spill slots.

ret void
}

attributes #0 = { "amdgpu-stack-objects" "target-cpu"="gfx90a" }
Contributor:

Shouldn't need the IR section

Contributor Author:

removed.

renamable $vgpr0_vgpr1 = COPY killed renamable $sgpr22_sgpr23, implicit $exec
renamable $vgpr0_vgpr1 = COPY killed renamable $sgpr20_sgpr21, implicit $exec
renamable $vgpr0_vgpr1 = COPY killed renamable $sgpr18_sgpr19, implicit $exec
dead $sgpr30_sgpr31 = SI_CALL killed renamable $sgpr16_sgpr17, @killer, csr_amdgpu_gfx90ainsts, implicit killed $sgpr4_sgpr5, implicit killed $sgpr6_sgpr7, implicit killed $sgpr8_sgpr9, implicit killed $sgpr10_sgpr11, implicit killed $sgpr12, implicit killed $sgpr13, implicit killed $sgpr14, implicit killed $sgpr15, implicit $sgpr0_sgpr1_sgpr2_sgpr3, implicit-def $vgpr0, implicit-def $vgpr1
Contributor:

Replace the function reference with 0

Contributor Author:

resolved

IsCopy && IsSALU
? ResultReg
: RS->scavengeRegisterBackwards(AMDGPU::SReg_32_XM0RegClass,
MI, false, 0, false);
Contributor:

Suggested change
MI, false, 0, false);
MI, false, 0, /*AllowSpill=*/false);

Contributor Author:

resolved.

.addImm(Offset);
} else {
TmpResultReg = RS->scavengeRegisterBackwards(
AMDGPU::VGPR_32RegClass, MI, false, 0);
Contributor:

Suggested change
AMDGPU::VGPR_32RegClass, MI, false, 0);
AMDGPU::VGPR_32RegClass, MI, false, 0, /*AllowSpill=*/true);

Contributor Author:

resolved.

Comment on lines 2558 to 2559
auto Add = BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_ADD_U32_e32),
TmpResultReg);
Contributor:

You need to use getAddNoCarry here and then handle the failure case too

Contributor Author:

addressed in the current patch.

@@ -211,3 +211,327 @@ body: |
renamable $vgpr0 = V_ADD_CO_U32_e32 %stack.0, killed $vgpr0, implicit-def dead $vcc, implicit $exec
...

---
Contributor:

In a pre-commit, can you generate GFX8 and GFX9 checks in this test

Contributor Author:

addressed.

Register NewDest =
IsCopy ? ResultReg
: RS->scavengeRegisterBackwards(
AMDGPU::SReg_32RegClass, Add, false, 0);
Contributor:

Suggested change
AMDGPU::SReg_32RegClass, Add, false, 0);
AMDGPU::SReg_32RegClass, Add, false, 0, /*AllowSpill=*/true);

Contributor:

Is the case where this scavenge spill happens tested?

Contributor Author:

Yes, I believe the last two test cases in frame-index.mir are the spill cases, and they are working fine.

Comment on lines +2545 to +2550
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_LSHR_B32), TmpResultReg)
.addReg(FrameReg)
.addImm(ST.getWavefrontSizeLog2());
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_ADD_I32), TmpResultReg)
.addReg(TmpResultReg, RegState::Kill)
.addImm(Offset);
Contributor:

This exact sequence is repeated above?

Contributor Author:

I don't see these two instructions above; for offset 0 we have the shift but not the add.

Comment on lines +2565 to +2567
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_READFIRSTLANE_B32),
NewDest)
.addReg(TmpResultReg);
Contributor:

This whole sequence is repeated above?

Contributor Author:

Yes, that is the case when the offset is 0.

github-actions bot commented Jul 19, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@@ -0,0 +1,72 @@
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx90a -verify-machineinstrs -run-pass=prologepilog %s -o - | FileCheck %s

Contributor:

Test file name is still too generic

Contributor Author:

that is renamed now.

Comment on lines 2560 to 2561
if ((Add = TII->getAddNoCarry(*MBB, MI, DL, TmpResultReg, *RS)) ==
nullptr) {
Contributor:

It would be easiest to read if we handled the positive cases where this returned an add before the fallback path.

Contributor Author:

That case will not arise now; the null case is handled inside getAddNoCarry.

Contributor:

I don't understand; getAddNoCarry has a failure path which you are checking for.

Contributor Author:

Now getAddNoCarry will return one of these three: V_ADD_U32_e32, V_MAD_U32_U24_e64, or V_ADD_CO_U32_e64.

Contributor Author:

There are a couple more test cases that fail if I try to handle this inside getAddNoCarry,
and the code above that expects nullptr would not be reachable.
else {
MachineInstrBuilder MIB;
if (!IsSALU) {
if ((MIB = TII->getAddNoCarry(*MBB, MI, DL, ResultReg, *RS)) !=
nullptr) {

// scavenged? This a way around to avoid carry, need follow-up.
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32), ResultReg)
.addImm(Offset);
Add = BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_MAD_I32_I24_e64),
Contributor:

We know we can use the unsigned version here. The offset will always be a small positive constant and the SP/FP will always be positive

@PankajDwivedi-25 (Contributor Author) Jul 24, 2024:

got it, replacing v_mad_i32_i24 with v_mad_u32_u24

// scavenged? This a way around to avoid carry, need follow-up.
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32), ResultReg)
.addImm(Offset);
Add = BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_MAD_I32_I24_e64),
Contributor:

You could handle using mad like this directly in getAddNoCarry.

I was also originally thinking of folding in the FP scaling into the mad. I.e. v_mad_u32_u24 fp, wavesize, offset

Contributor Author:

Then I may have to modify the interface for getAddNoCarry, since we are using a scavenged register to shift the offset into.
Or should I handle that part as it is at present?

Contributor:

I guess it's split to avoid the scavenging inside. I guess just reorder the handling for now

Contributor:

But it's going to shift around if we can fold in the shift by wavesize

Contributor Author:

There are more cases that need to be handled if I move the handling inside getAddNoCarry.
It hits assert(MIB->getOpcode() == AMDGPU::V_ADD_CO_U32_e64 && "Need to reuse carry out register");
where we are trying to handle the !IsSALU cases.

For now I have left it custom handled.

# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=tonga -verify-machineinstrs -run-pass=prologepilog %s -o - | FileCheck -check-prefixes=GFX8,GCN %s
# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs -run-pass=prologepilog %s -o - | FileCheck -check-prefixes=GFX900,GCN %s
# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx90a -verify-machineinstrs -run-pass=prologepilog %s -o - | FileCheck -check-prefixes=GFX90A,GCN %s

Contributor:

The test name shouldn't mention failure; this is just the lowering handling. Also, it's s_mov, not copy.

Contributor Author:

sure.

@@ -0,0 +1,574 @@
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=tonga -verify-machineinstrs -run-pass=prologepilog %s -o - | FileCheck -check-prefixes=GFX8,GCN %s
Contributor:

How about eliminate-frame-index-s-mov-b32.mir

Contributor Author:

great!

Comment on lines 2573 to 2574
// since emergency stack slot is already used for spilling VGPR
// scavenged? This a way around to avoid carry, need follow-up.
Contributor:

Needs capitalization and rephrasing. I'm not sure I understand the comment. Is everything working now?

Contributor Author:

yes, all cases are working now.

Comment on lines 2575 to 2576
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32), ResultReg)
.addImm(Offset);
Contributor:

Instead you can materialize in the VGPR TmpResultReg. That saves the constant bus use for the frame pointer register to directly materialize in one mad

@PankajDwivedi-25 (Contributor Author) Jul 24, 2024:

It will clobber the shift value in TmpResultReg, right?

Contributor:

Not if you replace the whole sequence. It's the free register you just scavenged, so it would be
tmp_vgpr = offset
tmp_vgpr = mad24 FrameReg, wavesize, tmp_vgpr

Contributor:

Also you should add an assert that the offset is positive. I think we need to switch this to the signed mad when we switch the scratch instruction ABI to use negative offsets

@PankajDwivedi-25 (Contributor Author) Jul 29, 2024:

Not if you replace the whole sequence. It's the free register you just scavenged, so it would be: tmp_vgpr = offset; tmp_vgpr = mad24 FrameReg, wavesize, tmp_vgpr

That's a really great idea, but now it looks like we are trying to replace a right shift with a left shift?

In the VGPR we are performing a right shift by WavefrontSizeLog2, which is effectively a divide by WavefrontSize,
but now FrameReg * WavefrontSize is effectively a left shift by WavefrontSizeLog2.
Is mad the correct operator then?

Contributor Author:

Also you should add an assert that the offset is positive. I think we need to switch this to the signed mad when we switch the scratch instruction ABI to use negative offsets

sure, will add it.

@arsenm (Contributor) Jul 29, 2024:

We start with a frame pointer with a wave space value, and an offset in lane-space. We are materializing a lane space value. You can either do a right shift of the frame pointer to get to lane space, or a left shift of the offset to get to wavespace. We can right shift after the computation to get back to the desired per-lane value.

I also spent some time with alive2, and unfortunately we can only use the mad24 on gfx9 and gfx10. https://alive2.llvm.org/ce/z/_gVKVG. The stack size increased in gfx11 so that the maximum sized object computation can no longer be done with the 24-bit multiply.
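
For reference, a sketch of the identity behind this reassociation (assuming the intermediate value does not overflow 32 bits), with $W$ the wavefront size:

\[
\left\lfloor \frac{\mathrm{FrameReg}}{W} \right\rfloor + \mathrm{Offset}
\;=\; \left\lfloor \frac{\mathrm{FrameReg} + \mathrm{Offset}\cdot W}{W} \right\rfloor
\]

So scaling the offset up into wave space, doing the mul + add there, and shifting right once at the end yields the same per-lane value as shifting the frame pointer first.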

Contributor:

Actually we might be able to - I wrote the assumes, assuming the FP and the offset are the maximum. But these are actually linked. The condition is more like fp + offset < maximum value

Contributor:

Hmm. Still seems like it would require a 26-bit multiply in wave32

Contributor Author:

I have made the changes; is this what you suggested?

// We have to produce a carry out, and there isn't a free SGPR
// pair for it. This a way around to avoid carry.
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32), ResultReg)
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_MOV_B32_e32),
Contributor:

When you do this reassociation, you need to multiply the offset by the wavesize. You're doing the mul + add in terms of waves
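
Pulling the pieces of this thread together, the intended expansion for the live-SCC, no-free-SGPR case would look roughly like the sketch below (register names and immediates are placeholders, not output of the final patch; the operand order and the shift back to lane space were still being adjusted here and in a follow-up commit):

    $vgpr_tmp = V_MOV_B32_e32 <Offset>, implicit $exec
    $vgpr_tmp = V_MAD_U32_U24_e64 killed $vgpr_tmp, <WavefrontSize>, $sgpr32, 0, implicit $exec
    $vgpr_tmp = V_LSHRREV_B32_e64 <log2 WavefrontSize>, killed $vgpr_tmp, implicit $exec
    $sgpr_dst = V_READFIRSTLANE_B32 $vgpr_tmp, implicit $exec

That is, materialize the offset in a scavenged VGPR, do the mul + add in wave space with the mad (which, unlike V_ADD_CO_U32, clobbers no carry register), shift the result back to lane space, and read it into the destination SGPR.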

@@ -368,9 +368,9 @@ body: |
; GFX8-NEXT: S_NOP 0, implicit-def $sgpr56_sgpr57_sgpr58_sgpr59_sgpr60_sgpr61_sgpr62_sgpr63
; GFX8-NEXT: V_CMP_EQ_U32_e32 0, killed $vgpr0, implicit-def $vcc, implicit $exec
; GFX8-NEXT: S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
; GFX8-NEXT: $vgpr0 = V_MOV_B32_e32 64, implicit $exec
Contributor:

This test also should get gfx10, gfx11, and gfx12 runlines (maybe leave for a separate commit though)

Contributor Author:

added.

BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_MOV_B32_e32),
TmpResultReg)
.addImm(Offset);
assert(Offset > 0 && "Offset is positive");
Contributor:

Suggested change
assert(Offset > 0 && "Offset is positive");
assert(Offset > 0 && isUInt<24>(2 * ST.getMaxWaveScratchSize()) &&
"offset is unsafe for v_mad_u32_u24");

This assert should also be strengthened to cover the range; I need to work out the exact condition, but it would be something like this.

Contributor:

I think we can squeeze a bit more range out of this with more instructions, but we shouldn't need to on gfx11/12 since they can just directly encode the larger offset in the add

@arsenm (Contributor) Jul 29, 2024:

Plus when we use scratch instructions, the SP value is in lane values anyway so we can do everything in one add

Comment on lines 2579 to 2581
.addReg(FrameReg)
.addImm(ST.getWavefrontSize())
.addReg(TmpResultReg, RegState::Kill)
Contributor:

This has the operands swapped. This should be doing the multiply of the offset and adding the frame reg

Suggested change
.addReg(FrameReg)
.addImm(ST.getWavefrontSize())
.addReg(TmpResultReg, RegState::Kill)
.addReg(TmpResultReg, RegState::Kill)
.addImm(ST.getWavefrontSize())
.addReg(FrameReg)

} else
Add.addImm(Offset).addReg(TmpResultReg, RegState::Kill);
} else {
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_MOV_B32_e32),
Contributor:

This needs comments about why this is doing this reassociation vs. above

@arsenm (Contributor) commented Jul 29, 2024

It would also be good if we had an end to end IR test that hit these new paths

// value. We can either do a right shift of the frame pointer to
// get to lane space, or a left shift of the offset to get to
// wavespace. We can right shift after the computation to get
// back to the desired per-lane value.
Contributor:

Mention we are using the mad_u32_u24 primarily as an add with no carry out clobber

@@ -2580,6 +2580,8 @@ bool SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI,
// get to lane space, or a left shift of the offset to get to
// wavespace. We can right shift after the computation to get
// back to the desired per-lane value.
// we are using the mad_u32_u24 primarily as an add with no
Contributor:

Comments should be capitalized

@arsenm (Contributor) left a comment:

It would still be good if we had an IR test that hit these cases

@arsenm (Contributor) commented Jul 30, 2024

It would still be good if we had an IR test that hit these cases

I think I've managed to craft one that will need some adjustment to hit the various cases

Comment on lines 2557 to 2559
BuildMI(*MBB, *Add, DL, TII->get(AMDGPU::V_LSHR_B32_e64),
TmpResultReg)
.addImm(ST.getWavefrontSizeLog2())
Contributor:

This should use V_LSHRREV_B32_e64

Comment on lines 2591 to 2594
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_LSHR_B32_e64),
TmpResultReg)
.addImm(ST.getWavefrontSizeLog2())
.addReg(FrameReg);
Contributor:

lshrrev

} else
Add.addImm(Offset).addReg(TmpResultReg, RegState::Kill);
} else {
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_MOV_B32_e32),
Contributor:

If the offset is an inline immediate, you can just fold it directly into the mad24 operand as well

Contributor:

Should address this in a follow up

Contributor:

We also should special case folding frame indexes into S_ADD_I32/S_OR_B32

@arsenm (Contributor) left a comment:

I'll open another PR with additional IR tests

@arsenm merged commit 57d10b4 into llvm:main on Jul 31, 2024
7 checks passed
@PankajDwivedi-25 (Contributor Author):

I'll open another PR with additional IR tests

Sure Matt, will add the changes you suggested.

@arsenm (Contributor) commented Jul 31, 2024

I'll open another PR with additional IR tests

This is #101306

PankajDwivedi-25 added a commit that referenced this pull request Aug 2, 2024
In case of v_mad we have materialized the offset in vgpr and mad is
performed in wave space, later vgpr have to be shifted back in lane
space. [#99556](#99556)
introduces a bug.

Co-authored-by: Pankajdwivedi-25 <[email protected]>
jrbyrnes pushed a commit to jrbyrnes/llvm-project that referenced this pull request Aug 16, 2024
llvm#99556)

eliminateFrameIndex wasn't handling the FI copy to Scalar registers and
the default implementation breaks the code while trying to handle it.
This patch handles the broken lowering and also takes care of some edge
cases that might arise. This case is tricky for non-zero offset, scc &
vcc is live and we don't find sgpr pair available.
Co-authored by @arsenm

---------
Co-authored-by: Matt Arsenault <[email protected]>
Co-authored-by: PankajDwivedi-25 <[email protected]>

Change-Id: I760538f594d267bdc414833c4489486e426fb55b
jrbyrnes pushed a commit to jrbyrnes/llvm-project that referenced this pull request Aug 16, 2024
In case of v_mad we have materialized the offset in vgpr and mad is
performed in wave space, later vgpr have to be shifted back in lane
space. [llvm#99556](llvm#99556)
introduces a bug.

Co-authored-by: Pankajdwivedi-25 <[email protected]>
Change-Id: Ic02338cf087cb3c95964b7273609a73a3e7b2d03
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Sep 11, 2024
llvm#99556)

eliminateFrameIndex wasn't handling the FI copy to Scalar registers and
the default implementation breaks the code while trying to handle it.
This patch handles the broken lowering and also takes care of some edge
cases that might arise. This case is tricky for non-zero offset, scc &
vcc is live and we don't find sgpr pair available.
Co-authored by @arsenm

---------
Co-authored-by: Matt Arsenault <[email protected]>
Co-authored-by: PankajDwivedi-25 <[email protected]>

Change-Id: I760538f594d267bdc414833c4489486e426fb55b
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Sep 11, 2024
In case of v_mad we have materialized the offset in vgpr and mad is
performed in wave space, later vgpr have to be shifted back in lane
space. [llvm#99556](llvm#99556)
introduces a bug.

Co-authored-by: Pankajdwivedi-25 <[email protected]>
Change-Id: Ic02338cf087cb3c95964b7273609a73a3e7b2d03
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Nov 22, 2024
llvm#99556)

eliminateFrameIndex wasn't handling the FI copy to Scalar registers and
the default implementation breaks the code while trying to handle it.
This patch handles the broken lowering and also takes care of some edge
cases that might arise. This case is tricky for non-zero offset, scc &
vcc is live and we don't find sgpr pair available.
Co-authored by @arsenm

---------
Co-authored-by: Matt Arsenault <[email protected]>
Co-authored-by: PankajDwivedi-25 <[email protected]>

Change-Id: I8db13328357723da96c1bfaa2fd2368430b17100
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Nov 22, 2024
In case of v_mad we have materialized the offset in vgpr and mad is
performed in wave space, later vgpr have to be shifted back in lane
space. [llvm#99556](llvm#99556)
introduces a bug.

Co-authored-by: Pankajdwivedi-25 <[email protected]>
Change-Id: I23db2a321c9bf3e28b5e173ddc95fa98c4d16fd6