[AMDGPU] Inplace FI elimination during PEI for scalar copy instruction #99556
@llvm/pr-subscribers-backend-amdgpu

Author: Pankaj Dwivedi (PankajDwivedi-25)

Changes: During the prolog/epilog insertion (PEI) pass, while trying to eliminate a frame index for a scalar copy instruction, we ended up in a case where two spills try to use the same emergency stack slot, which leads to almost a dead end with only one emergency stack slot.

Patch is 39.87 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/99556.diff

3 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
index 8a315aa822786..05153383a3a21 100644
--- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
@@ -2449,7 +2449,8 @@ bool SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI,
? &AMDGPU::SReg_32RegClass
: &AMDGPU::VGPR_32RegClass;
bool IsCopy = MI->getOpcode() == AMDGPU::V_MOV_B32_e32 ||
- MI->getOpcode() == AMDGPU::V_MOV_B32_e64;
+ MI->getOpcode() == AMDGPU::V_MOV_B32_e64 ||
+ MI->getOpcode() == AMDGPU::S_MOV_B32;
Register ResultReg =
IsCopy ? MI->getOperand(0).getReg()
: RS->scavengeRegisterBackwards(*RC, MI, false, 0);
@@ -2458,7 +2459,13 @@ bool SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI,
if (Offset == 0) {
unsigned OpCode = IsSALU && !LiveSCC ? AMDGPU::S_LSHR_B32
: AMDGPU::V_LSHRREV_B32_e64;
- auto Shift = BuildMI(*MBB, MI, DL, TII->get(OpCode), ResultReg);
+ Register TmpResultReg = ResultReg;
+ if (IsSALU && LiveSCC) {
+ TmpResultReg = RS->scavengeRegisterBackwards(
+ AMDGPU::VGPR_32RegClass, MI, false, 0);
+ }
+
+ auto Shift = BuildMI(*MBB, MI, DL, TII->get(OpCode), TmpResultReg);
if (OpCode == AMDGPU::V_LSHRREV_B32_e64)
// For V_LSHRREV, the operands are reversed (the shift count goes
// first).
@@ -2468,11 +2475,13 @@ bool SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI,
if (IsSALU && !LiveSCC)
Shift.getInstr()->getOperand(3).setIsDead(); // Mark SCC as dead.
if (IsSALU && LiveSCC) {
- Register NewDest = RS->scavengeRegisterBackwards(
- AMDGPU::SReg_32RegClass, Shift, false, 0);
+ Register NewDest =
+ IsCopy ? ResultReg
+ : RS->scavengeRegisterBackwards(AMDGPU::SReg_32RegClass,
+ Shift, false, 0);
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_READFIRSTLANE_B32),
NewDest)
- .addReg(ResultReg);
+ .addReg(TmpResultReg);
ResultReg = NewDest;
}
} else {
@@ -2523,22 +2532,47 @@ bool SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI,
// We may have 1 free scratch SGPR even though a carry out is
// unavailable. Only one additional mov is needed.
- Register TmpScaledReg = RS->scavengeRegisterBackwards(
- AMDGPU::SReg_32_XM0RegClass, MI, false, 0, false);
- Register ScaledReg = TmpScaledReg.isValid() ? TmpScaledReg : FrameReg;
-
- BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_LSHR_B32), ScaledReg)
- .addReg(FrameReg)
- .addImm(ST.getWavefrontSizeLog2());
- BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_ADD_I32), ScaledReg)
- .addReg(ScaledReg, RegState::Kill)
- .addImm(Offset);
+ Register TmpScaledReg =
+ IsCopy && IsSALU
+ ? ResultReg
+ : RS->scavengeRegisterBackwards(AMDGPU::SReg_32_XM0RegClass,
+ MI, false, 0, false);
+ Register ScaledReg =
+ TmpScaledReg.isValid() ? TmpScaledReg : FrameReg;
+ Register TmpResultReg = ScaledReg;
+
+ if (!LiveSCC) {
+ BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_LSHR_B32), TmpResultReg)
+ .addReg(FrameReg)
+ .addImm(ST.getWavefrontSizeLog2());
+ BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_ADD_I32), TmpResultReg)
+ .addReg(TmpResultReg, RegState::Kill)
+ .addImm(Offset);
+ } else {
+ TmpResultReg = RS->scavengeRegisterBackwards(
+ AMDGPU::VGPR_32RegClass, MI, false, 0);
+
+ BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_LSHR_B32_e64),
+ TmpResultReg)
+ .addImm(ST.getWavefrontSizeLog2())
+ .addReg(FrameReg);
+ auto Add = BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_ADD_U32_e32),
+ TmpResultReg);
+ Add.addImm(Offset).addReg(TmpResultReg, RegState::Kill);
+ Register NewDest =
+ IsCopy ? ResultReg
+ : RS->scavengeRegisterBackwards(
+ AMDGPU::SReg_32RegClass, Add, false, 0);
+ BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_READFIRSTLANE_B32),
+ NewDest)
+ .addReg(TmpResultReg);
+ ResultReg = NewDest;
+ }
if (!IsSALU)
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::COPY), ResultReg)
- .addReg(ScaledReg, RegState::Kill);
+ .addReg(TmpResultReg, RegState::Kill);
else
- ResultReg = ScaledReg;
-
+ ResultReg = TmpResultReg;
// If there were truly no free SGPRs, we need to undo everything.
if (!TmpScaledReg.isValid()) {
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_ADD_I32), ScaledReg)
diff --git a/llvm/test/CodeGen/AMDGPU/frame-index.mir b/llvm/test/CodeGen/AMDGPU/frame-index.mir
index d8736c5276a26..f4476a497f754 100644
--- a/llvm/test/CodeGen/AMDGPU/frame-index.mir
+++ b/llvm/test/CodeGen/AMDGPU/frame-index.mir
@@ -211,3 +211,327 @@ body: |
renamable $vgpr0 = V_ADD_CO_U32_e32 %stack.0, killed $vgpr0, implicit-def dead $vcc, implicit $exec
...
+---
+name: materialize_fi_s_mov_b32_offset_0_dead_scc
+tracksRegLiveness: true
+stack:
+ - { id: 0, type: default, size: 4, alignment: 4, stack-id: default }
+machineFunctionInfo:
+ stackPtrOffsetReg: '$sgpr32'
+body: |
+ bb.0:
+ ; GCN-LABEL: name: materialize_fi_s_mov_b32_offset_0_dead_scc
+ ; GCN: $sgpr4 = S_LSHR_B32 $sgpr32, 6, implicit-def dead $scc
+ ; GCN-NEXT: S_ENDPGM 0, implicit $sgpr4
+ renamable $sgpr4 = S_MOV_B32 %stack.0
+ S_ENDPGM 0, implicit $sgpr4
+
+...
+
+---
+name: materialize_fi_s_mov_b32_offset_0_live_scc
+tracksRegLiveness: true
+stack:
+ - { id: 0, type: default, size: 4, alignment: 4, stack-id: default }
+machineFunctionInfo:
+ stackPtrOffsetReg: '$sgpr32'
+body: |
+ bb.0:
+ liveins: $sgpr4, $sgpr5
+
+ ; GCN-LABEL: name: materialize_fi_s_mov_b32_offset_0_live_scc
+ ; GCN: liveins: $sgpr4, $sgpr5
+ ; GCN-NEXT: {{ $}}
+ ; GCN-NEXT: S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
+ ; GCN-NEXT: $vgpr0 = V_LSHRREV_B32_e64 6, $sgpr32, implicit $exec
+ ; GCN-NEXT: $sgpr4 = V_READFIRSTLANE_B32 $vgpr0, implicit $exec
+ ; GCN-NEXT: S_ENDPGM 0, implicit $sgpr4, implicit $scc
+ S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
+ renamable $sgpr4 = S_MOV_B32 %stack.0
+ S_ENDPGM 0, implicit $sgpr4, implicit $scc
+
+...
+
+# FI#0 is filler to get a non-0 offset for FI#1
+---
+name: materialize_fi_s_mov_b32_offset_64_dead_scc
+tracksRegLiveness: true
+stack:
+ - { id: 0, type: default, size: 64, alignment: 16, stack-id: default }
+ - { id: 1, type: default, size: 4, alignment: 4, stack-id: default }
+machineFunctionInfo:
+ stackPtrOffsetReg: '$sgpr32'
+body: |
+ bb.0:
+ ; GCN-LABEL: name: materialize_fi_s_mov_b32_offset_64_dead_scc
+ ; GCN: $sgpr4 = S_LSHR_B32 $sgpr32, 6, implicit-def $scc
+ ; GCN-NEXT: $sgpr4 = S_ADD_I32 killed $sgpr4, 64, implicit-def $scc
+ ; GCN-NEXT: S_ENDPGM 0, implicit $sgpr4
+ renamable $sgpr4 = S_MOV_B32 %stack.1
+ S_ENDPGM 0, implicit $sgpr4
+
+...
+
+---
+name: materialize_fi_s_mov_b32_offset_68_dead_scc
+tracksRegLiveness: true
+stack:
+ - { id: 0, type: default, size: 68, alignment: 16, stack-id: default }
+ - { id: 1, type: default, size: 4, alignment: 4, stack-id: default }
+machineFunctionInfo:
+ stackPtrOffsetReg: '$sgpr32'
+body: |
+ bb.0:
+ ; GCN-LABEL: name: materialize_fi_s_mov_b32_offset_68_dead_scc
+ ; GCN: $sgpr4 = S_LSHR_B32 $sgpr32, 6, implicit-def $scc
+ ; GCN-NEXT: $sgpr4 = S_ADD_I32 killed $sgpr4, 68, implicit-def $scc
+ ; GCN-NEXT: S_ENDPGM 0, implicit $sgpr4
+ renamable $sgpr4 = S_MOV_B32 %stack.1
+ S_ENDPGM 0, implicit $sgpr4
+
+...
+
+---
+name: materialize_fi_s_mov_b32_offset_64_live_scc
+tracksRegLiveness: true
+stack:
+ - { id: 0, type: default, size: 64, alignment: 16, stack-id: default }
+ - { id: 1, type: default, size: 4, alignment: 4, stack-id: default }
+machineFunctionInfo:
+ stackPtrOffsetReg: '$sgpr32'
+body: |
+ bb.0:
+ liveins: $sgpr4, $sgpr5
+
+ ; GCN-LABEL: name: materialize_fi_s_mov_b32_offset_64_live_scc
+ ; GCN: liveins: $sgpr4, $sgpr5
+ ; GCN-NEXT: {{ $}}
+ ; GCN-NEXT: S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
+ ; GCN-NEXT: $vgpr0 = V_LSHR_B32_e64 6, $sgpr32, implicit $exec
+ ; GCN-NEXT: $vgpr0 = V_ADD_U32_e32 64, killed $vgpr0, implicit $exec
+ ; GCN-NEXT: $sgpr4 = V_READFIRSTLANE_B32 $vgpr0, implicit $exec
+ ; GCN-NEXT: S_ENDPGM 0, implicit $sgpr4, implicit $scc
+ S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
+ renamable $sgpr4 = S_MOV_B32 %stack.1
+ S_ENDPGM 0, implicit $sgpr4, implicit $scc
+
+...
+
+---
+name: materialize_fi_s_mov_b32_offset_68_live_scc
+tracksRegLiveness: true
+stack:
+ - { id: 0, type: default, size: 68, alignment: 16, stack-id: default }
+ - { id: 1, type: default, size: 4, alignment: 4, stack-id: default }
+machineFunctionInfo:
+ stackPtrOffsetReg: '$sgpr32'
+body: |
+ bb.0:
+ liveins: $sgpr4, $sgpr5
+
+ ; GCN-LABEL: name: materialize_fi_s_mov_b32_offset_68_live_scc
+ ; GCN: liveins: $sgpr4, $sgpr5
+ ; GCN-NEXT: {{ $}}
+ ; GCN-NEXT: S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
+ ; GCN-NEXT: $vgpr0 = V_LSHR_B32_e64 6, $sgpr32, implicit $exec
+ ; GCN-NEXT: $vgpr0 = V_ADD_U32_e32 68, killed $vgpr0, implicit $exec
+ ; GCN-NEXT: $sgpr4 = V_READFIRSTLANE_B32 $vgpr0, implicit $exec
+ ; GCN-NEXT: S_ENDPGM 0, implicit $sgpr4, implicit $scc
+ S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
+ renamable $sgpr4 = S_MOV_B32 %stack.1
+ S_ENDPGM 0, implicit $sgpr4, implicit $scc
+
+...
+
+# FIXME: This is finding a VGPR
+---
+name: materialize_fi_s_mov_b32_offset_0_live_scc__no_free_vgprs
+tracksRegLiveness: true
+stack:
+ - { id: 0, type: default, size: 4, alignment: 16, stack-id: default }
+machineFunctionInfo:
+ scratchRSrcReg: $sgpr0_sgpr1_sgpr2_sgpr3
+ stackPtrOffsetReg: '$sgpr32'
+ occupancy: 10
+
+body: |
+ bb.0:
+ liveins: $sgpr4, $sgpr5, $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39
+
+ ; GCN-LABEL: name: materialize_fi_s_mov_b32_offset_0_live_scc__no_free_vgprs
+ ; GCN: liveins: $sgpr4, $sgpr5, $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39, $vgpr40, $vgpr41, $vgpr42, $vgpr43, $vgpr44, $vgpr45, $vgpr46, $vgpr47, $vgpr56, $vgpr57, $vgpr58, $vgpr59, $vgpr60, $vgpr61, $vgpr62, $vgpr63
+ ; GCN-NEXT: {{ $}}
+ ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr40, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 60, 0, 0, implicit $exec :: (store (s32) into %stack.1, addrspace 5)
+ ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr41, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 56, 0, 0, implicit $exec :: (store (s32) into %stack.2, addrspace 5)
+ ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr42, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 52, 0, 0, implicit $exec :: (store (s32) into %stack.3, addrspace 5)
+ ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr43, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 48, 0, 0, implicit $exec :: (store (s32) into %stack.4, addrspace 5)
+ ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr44, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 44, 0, 0, implicit $exec :: (store (s32) into %stack.5, addrspace 5)
+ ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr45, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 40, 0, 0, implicit $exec :: (store (s32) into %stack.6, addrspace 5)
+ ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr46, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 36, 0, 0, implicit $exec :: (store (s32) into %stack.7, addrspace 5)
+ ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr47, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 32, 0, 0, implicit $exec :: (store (s32) into %stack.8, addrspace 5)
+ ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr56, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 28, 0, 0, implicit $exec :: (store (s32) into %stack.9, addrspace 5)
+ ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr57, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 24, 0, 0, implicit $exec :: (store (s32) into %stack.10, addrspace 5)
+ ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr58, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 20, 0, 0, implicit $exec :: (store (s32) into %stack.11, addrspace 5)
+ ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr59, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 16, 0, 0, implicit $exec :: (store (s32) into %stack.12, addrspace 5)
+ ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr60, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 12, 0, 0, implicit $exec :: (store (s32) into %stack.13, addrspace 5)
+ ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr61, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 8, 0, 0, implicit $exec :: (store (s32) into %stack.14, addrspace 5)
+ ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr62, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, implicit $exec :: (store (s32) into %stack.15, addrspace 5)
+ ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr63, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, implicit $exec :: (store (s32) into %stack.16, addrspace 5)
+ ; GCN-NEXT: S_NOP 0, implicit-def $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7
+ ; GCN-NEXT: S_NOP 0, implicit-def $vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
+ ; GCN-NEXT: S_NOP 0, implicit-def $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23
+ ; GCN-NEXT: S_NOP 0, implicit-def $vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
+ ; GCN-NEXT: S_NOP 0, implicit $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39
+ ; GCN-NEXT: S_NOP 0, implicit-def $vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47
+ ; GCN-NEXT: S_NOP 0, implicit-def $vgpr48_vgpr49_vgpr50_vgpr51_vgpr52_vgpr53_vgpr54_vgpr55
+ ; GCN-NEXT: S_NOP 0, implicit-def $vgpr56_vgpr57_vgpr58_vgpr59_vgpr60_vgpr61_vgpr62_vgpr63
+ ; GCN-NEXT: S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
+ ; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 68, 0, 0, implicit $exec :: (store (s32) into %stack.17, addrspace 5)
+ ; GCN-NEXT: $vgpr0 = V_LSHR_B32_e64 6, $sgpr32, implicit $exec
+ ; GCN-NEXT: $vgpr0 = V_ADD_U32_e32 64, killed $vgpr0, implicit $exec
+ ; GCN-NEXT: $sgpr4 = V_READFIRSTLANE_B32 $vgpr0, implicit $exec
+ ; GCN-NEXT: $vgpr0 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 68, 0, 0, implicit $exec :: (load (s32) from %stack.17, addrspace 5)
+ ; GCN-NEXT: S_NOP 0, implicit $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7
+ ; GCN-NEXT: S_NOP 0, implicit $vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
+ ; GCN-NEXT: S_NOP 0, implicit $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23
+ ; GCN-NEXT: S_NOP 0, implicit $vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
+ ; GCN-NEXT: S_NOP 0, implicit $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39
+ ; GCN-NEXT: S_NOP 0, implicit $vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47
+ ; GCN-NEXT: S_NOP 0, implicit $vgpr48_vgpr49_vgpr50_vgpr51_vgpr52_vgpr53_vgpr54_vgpr55
+ ; GCN-NEXT: S_NOP 0, implicit $vgpr56_vgpr57_vgpr58_vgpr59_vgpr60_vgpr61_vgpr62_vgpr63
+ ; GCN-NEXT: $vgpr63 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, implicit $exec :: (load (s32) from %stack.16, addrspace 5)
+ ; GCN-NEXT: $vgpr62 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, implicit $exec :: (load (s32) from %stack.15, addrspace 5)
+ ; GCN-NEXT: $vgpr61 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 8, 0, 0, implicit $exec :: (load (s32) from %stack.14, addrspace 5)
+ ; GCN-NEXT: $vgpr60 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 12, 0, 0, implicit $exec :: (load (s32) from %stack.13, addrspace 5)
+ ; GCN-NEXT: $vgpr59 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 16, 0, 0, implicit $exec :: (load (s32) from %stack.12, addrspace 5)
+ ; GCN-NEXT: $vgpr58 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 20, 0, 0, implicit $exec :: (load (s32) from %stack.11, addrspace 5)
+ ; GCN-NEXT: $vgpr57 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 24, 0, 0, implicit $exec :: (load (s32) from %stack.10, addrspace 5)
+ ; GCN-NEXT: $vgpr56 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 28, 0, 0, implicit $exec :: (load (s32) from %stack.9, addrspace 5)
+ ; GCN-NEXT: $vgpr47 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 32, 0, 0, implicit $exec :: (load (s32) from %stack.8, addrspace 5)
+ ; GCN-NEXT: $vgpr46 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 36, 0, 0, implicit $exec :: (load (s32) from %stack.7, addrspace 5)
+ ; GCN-NEXT: $vgpr45 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 40, 0, 0, implicit $exec :: (load (s32) from %stack.6, addrspace 5)
+ ; GCN-NEXT: $vgpr44 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 44, 0, 0, implicit $exec :: (load (s32) from %stack.5, addrspace 5)
+ ; GCN-NEXT: $vgpr43 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 48, 0, 0, implicit $exec :: (load (s32) from %stack.4, addrspace 5)
+ ; GCN-NEXT: $vgpr42 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 52, 0, 0, implicit $exec :: (load (s32) from %stack.3, addrspace 5)
+ ; GCN-NEXT: $vgpr41 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 56, 0, 0, implicit $exec :: (load (s32) from %stack.2, addrspace 5)
+ ; GCN-NEXT: $vgpr40 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 60, 0, 0, implicit $exec :: (load (s32) from %stack.1, addrspace 5)
+ ; GCN-NEXT: S_ENDPGM 0, implicit $sgpr4, implicit $scc
+ S_NOP 0, implicit-def $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7
+ S_NOP 0, implicit-def $vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
+ S_NOP 0, implicit-def $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23
+ S_NOP 0, implicit-def $vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
+ S_NOP 0, implicit $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39
+ S_NOP 0, implicit-def $vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47
+ S_NOP 0, implicit-def $vgpr48_vgpr49_vgpr50_vgpr51_vgpr52_vgpr53_vgpr54_vgpr55
+ S_NOP 0, implicit-def $vgpr56_vgpr57_vgpr58_vgpr59_vgpr60_vgpr61_vgpr62_vgpr63
+ S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
+
+ renamable $sgpr4 = S_MOV_B32 %stack.0
+
+ S_NOP 0, implicit $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7
+ S_NOP 0, implicit $vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
+ S_NOP 0, implicit $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23
+ S_NOP 0, implicit $vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
+ S_NOP 0, implicit $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39
+ S_NOP 0, implicit $vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47
+ S_NOP 0, implicit $vgpr48_vgpr49_vgpr50_vgpr51_vgpr52_vgpr53_vgpr54_vgpr55
+ S_NOP 0, implicit $vgpr56_vgpr57_vgpr58_vgpr59_vgpr60_vgpr61_vgpr62_vgpr63
+ S_ENDPGM 0, implicit $sgpr4, implicit $scc
+
+...
+
+# FIXME: This is clobbering scc
+---
+name: materialize_fi_s_mov_b32_offset_96_live_scc__no_free_vgprs
+tracksRegLiveness: true
+stack:
+ - { id: 0, type: default, size: 64, alignment: 16, stack-id: default }
+ - { id: 1, type: default, size: 4, alignment: 4, stack-id: default }
+machineFunctionInfo:
+ scratchRSrcReg: $sgpr0_sgpr1_sgpr2_sgpr3
+ stackPtrOffsetReg: '$sgpr32'
+body: |
+ bb.0:
+ liveins: $sgpr4, $sgpr5
+
+ ; GCN-LABEL: name: materialize_fi_s_mov_b32_offset...
[truncated]
Rewrite the PR description. Include the point that eliminateFrameIndex wasn't handling the FI copy to Scalar registers and the default implementation breaks the code while trying to handle it. There is no need to mention the two spills and the emergency spill slots.
  ret void
}

attributes #0 = { "amdgpu-stack-objects" "target-cpu"="gfx90a" }
Shouldn't need the IR section
removed.
renamable $vgpr0_vgpr1 = COPY killed renamable $sgpr22_sgpr23, implicit $exec
renamable $vgpr0_vgpr1 = COPY killed renamable $sgpr20_sgpr21, implicit $exec
renamable $vgpr0_vgpr1 = COPY killed renamable $sgpr18_sgpr19, implicit $exec
dead $sgpr30_sgpr31 = SI_CALL killed renamable $sgpr16_sgpr17, @killer, csr_amdgpu_gfx90ainsts, implicit killed $sgpr4_sgpr5, implicit killed $sgpr6_sgpr7, implicit killed $sgpr8_sgpr9, implicit killed $sgpr10_sgpr11, implicit killed $sgpr12, implicit killed $sgpr13, implicit killed $sgpr14, implicit killed $sgpr15, implicit $sgpr0_sgpr1_sgpr2_sgpr3, implicit-def $vgpr0, implicit-def $vgpr1
Replace the function reference with 0
resolved
IsCopy && IsSALU
    ? ResultReg
    : RS->scavengeRegisterBackwards(AMDGPU::SReg_32_XM0RegClass,
                                    MI, false, 0, false);
-        MI, false, 0, false);
+        MI, false, 0, /*AllowSpill=*/false);
resolved.
        .addImm(Offset);
  } else {
    TmpResultReg = RS->scavengeRegisterBackwards(
        AMDGPU::VGPR_32RegClass, MI, false, 0);
-        AMDGPU::VGPR_32RegClass, MI, false, 0);
+        AMDGPU::VGPR_32RegClass, MI, false, 0, /*AllowSpill=*/true);
resolved.
auto Add = BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_ADD_U32_e32),
                   TmpResultReg);
You need to use getAddNoCarry here and then handle the failure case too
addressed in the current patch.
@@ -211,3 +211,327 @@ body: |
  renamable $vgpr0 = V_ADD_CO_U32_e32 %stack.0, killed $vgpr0, implicit-def dead $vcc, implicit $exec
...

---
In a pre-commit, can you generate GFX8 and GFX9 checks in this test
addressed.
Register NewDest =
    IsCopy ? ResultReg
           : RS->scavengeRegisterBackwards(
                 AMDGPU::SReg_32RegClass, Add, false, 0);
-                 AMDGPU::SReg_32RegClass, Add, false, 0);
+                 AMDGPU::SReg_32RegClass, Add, false, 0, /*AllowSpill=*/true);
Is the case where this scavenge spill happens tested?
Yes, I guess the last two test cases in frame-index.mir are the spill cases, and they are working fine.
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_LSHR_B32), TmpResultReg)
    .addReg(FrameReg)
    .addImm(ST.getWavefrontSizeLog2());
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_ADD_I32), TmpResultReg)
    .addReg(TmpResultReg, RegState::Kill)
    .addImm(Offset);
This exact sequence is repeated above?
I don't see these two instructions above; for offset 0 we have the shift but not the add.
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_READFIRSTLANE_B32),
        NewDest)
    .addReg(TmpResultReg);
This whole sequence is repeated above?
yes, that is when offset is 0.
✅ With the latest revision this PR passed the C/C++ code formatter.
@@ -0,0 +1,72 @@
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx90a -verify-machineinstrs -run-pass=prologepilog %s -o - | FileCheck %s
Test file name is still too generic
that is renamed now.
if ((Add = TII->getAddNoCarry(*MBB, MI, DL, TmpResultReg, *RS)) ==
    nullptr) {
It would be easiest to read if we handled the positive cases where this returned an add before the fallback path.
This case will not arise now; the null case is handled inside getAddNoCarry.
I don't understand, getAddNoCarry has a failure path which you are checking for
Now getAddNoCarry will return one of these three: V_ADD_U32_e32, V_MAD_U32_U24_e64, or V_ADD_CO_U32_e64.
There are a couple more test cases that fail if I try to handle this inside getAddNoCarry, and the code above that expects nullptr would become unreachable.
else {
  MachineInstrBuilder MIB;
  if (!IsSALU) {
    if ((MIB = TII->getAddNoCarry(*MBB, MI, DL, ResultReg, *RS)) !=
        nullptr) {
// scavenged? This a way around to avoid carry, need follow-up.
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32), ResultReg)
    .addImm(Offset);
Add = BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_MAD_I32_I24_e64),
We know we can use the unsigned version here. The offset will always be a small positive constant and the SP/FP will always be positive
Got it, replacing v_mad_i32_i24 with v_mad_u32_u24.
// scavenged? This a way around to avoid carry, need follow-up.
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32), ResultReg)
    .addImm(Offset);
Add = BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_MAD_I32_I24_e64),
You could handle using mad like this directly in getAddNoCarry.
I was also originally thinking of folding in the FP scaling into the mad. I.e. v_mad_u32_u24 fp, wavesize, offset
Then I may have to modify the interface of getAddNoCarry, since we are using a scavenged register to shift the Offset into. Or should I keep that part as it is for now?
I guess it's split to avoid the scavenging inside. I guess just reorder the handling for now
But it's going to shift around if we can fold in the shift by wavesize
There are more cases that need to be handled if I move the handling inside getAddNoCarry.
It hits assert(MIB->getOpcode() == AMDGPU::V_ADD_CO_U32_e64 && "Need to reuse carry out register"); where we are trying to handle the !SALU cases.
For now I have left it custom handled.
# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=tonga -verify-machineinstrs -run-pass=prologepilog %s -o - | FileCheck -check-prefixes=GFX8,GCN %s
# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs -run-pass=prologepilog %s -o - | FileCheck -check-prefixes=GFX900,GCN %s
# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx90a -verify-machineinstrs -run-pass=prologepilog %s -o - | FileCheck -check-prefixes=GFX90A,GCN %s
Test name shouldn't mention failure; this is just the lowering handling. Also, it's s_mov, not copy.
sure.
@@ -0,0 +1,574 @@
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=tonga -verify-machineinstrs -run-pass=prologepilog %s -o - | FileCheck -check-prefixes=GFX8,GCN %s |
How about eliminate-frame-index-s-mov-b32.mir
great!
// since emergency stack slot is already used for spilling VGPR
// scavenged? This a way around to avoid carry, need follow-up.
Needs capitalization and rephrasing. I'm not sure I understand the comment. Is everything working now?
yes, all cases are working now.
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32), ResultReg)
    .addImm(Offset);
Instead you can materialize in the VGPR TmpResultReg. That saves the constant bus use for the frame pointer register, so you can directly materialize in one mad.
It will clobber the shift value in TmpResultReg, right?
Not if you replace the whole sequence. It's the free register you just scavenged, so it would be
tmp_vgpr = offset
tmp_vgpr = mad24 FrameReg, wavesize, tmp_vgpr
Also you should add an assert that the offset is positive. I think we need to switch this to the signed mad when we switch the scratch instruction ABI to use negative offsets
Not if you replace the whole sequence. It's the free register you just scavenged, so it would be
tmp_vgpr = offset
tmp_vgpr = mad24 FrameReg, wavesize, tmp_vgpr
That's a really great idea, but now it looks like we are trying to replace a right shift with a left shift?
In the VGPR we perform a right shift by WavefrontSizeLog2, which is effectively a divide by WavefrontSize,
but now FrameReg * WavefrontSize is effectively a left shift by WavefrontSizeLog2?
Is mad the correct operator then?
Also you should add an assert that the offset is positive. I think we need to switch this to the signed mad when we switch the scratch instruction ABI to use negative offsets
sure, will add it.
We start with a frame pointer with a wave space value, and an offset in lane-space. We are materializing a lane space value. You can either do a right shift of the frame pointer to get to lane space, or a left shift of the offset to get to wavespace. We can right shift after the computation to get back to the desired per-lane value.
I also spent some time with alive2, and unfortunately we can only use the mad24 on gfx9 and gfx10. https://alive2.llvm.org/ce/z/_gVKVG. The stack size increased in gfx11 so that the maximum sized object computation can no longer be done with the 24-bit multiply.
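The wave-space/lane-space reassociation described above can be sanity-checked with plain integer arithmetic. This is an illustrative Python sketch, not LLVM code; the wave size and value ranges are made-up test values:

```python
# Frame pointer (fp) holds a wave-space (scaled) value; offset is a
# per-lane value. The canonical lowering right-shifts fp first:
#     result = (fp >> log2(wavesize)) + offset
# The mad form left-shifts the offset into wave space, adds, then
# right-shifts the sum back:
#     result = (offset * wavesize + fp) >> log2(wavesize)
# Adding offset * wavesize cannot disturb fp's low log2(wavesize)
# bits, so the two forms agree exactly for non-negative offsets.

WAVE_SIZE = 64                       # wave64; gfx10+ also supports wave32
LOG2_WS = WAVE_SIZE.bit_length() - 1 # 6 for wave64

def shift_then_add(fp, offset):
    return (fp >> LOG2_WS) + offset

def mad_then_shift(fp, offset):
    return (offset * WAVE_SIZE + fp) >> LOG2_WS

for fp in range(0, 4096, 4):
    for offset in range(0, 256, 4):
        assert shift_then_add(fp, offset) == mad_then_shift(fp, offset)
print("reassociation holds")
```

The equality holds without any alignment assumption on fp, because the added term `offset << LOG2_WS` has all-zero low bits.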
Actually we might be able to - I wrote the assumes, assuming the FP and the offset are the maximum. But these are actually linked. The condition is more like fp + offset < maximum value
Hmm. Still seems like it would require a 26-bit multiply in wave32
I have made the changes; is this what you suggested?
// We have to produce a carry out, and there isn't a free SGPR
// pair for it. This a way around to avoid carry.
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32), ResultReg)
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_MOV_B32_e32),
When you do this reassociation, you need to multiply the offset by the wavesize. You're doing the mul + add in terms of waves
@@ -368,9 +368,9 @@ body: |
; GFX8-NEXT: S_NOP 0, implicit-def $sgpr56_sgpr57_sgpr58_sgpr59_sgpr60_sgpr61_sgpr62_sgpr63
; GFX8-NEXT: V_CMP_EQ_U32_e32 0, killed $vgpr0, implicit-def $vcc, implicit $exec
; GFX8-NEXT: S_CMP_EQ_I32 $sgpr4, $sgpr5, implicit-def $scc
; GFX8-NEXT: $vgpr0 = V_MOV_B32_e32 64, implicit $exec
This test also should get gfx10, gfx11, and gfx12 runlines (maybe leave for a separate commit though)
added.
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_MOV_B32_e32),
        TmpResultReg)
    .addImm(Offset);
assert(Offset > 0 && "Offset is positive"); |
assert(Offset > 0 && "Offset is positive");
assert(Offset > 0 && isUInt<24>(2 * ST.getMaxWaveScratchSize()) &&
       "offset is unsafe for v_mad_u32_u24");
This assert should also be strengthened to cover the range; I need to work out the exact condition, but something like this.
I think we can squeeze a bit more range out of this with more instructions, but we shouldn't need to on gfx11/12 since they can just directly encode the larger offset in the add
Plus when we use scratch instructions, the SP value is in lane values anyway so we can do everything in one add
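The 24-bit range condition being discussed can be expressed as a quick numeric check. The scratch-size constants below are hypothetical placeholders, not the real `ST.getMaxWaveScratchSize()` values for any subtarget:

```python
# v_mad_u32_u24 truncates its two multiply operands to 24 bits, so the
# suggested assert requires the relevant bound to fit in an unsigned
# 24-bit value; otherwise the larger offset must be handled by a plain
# add (directly encodable on gfx11/12, per the discussion above).

def is_uint24(x):
    """Mirrors llvm::isUInt<24>: value representable in 24 unsigned bits."""
    return 0 <= x < (1 << 24)

# Hypothetical limits for illustration only:
OLD_MAX_WAVE_SCRATCH = 1 << 22   # pretend pre-gfx11 limit
NEW_MAX_WAVE_SCRATCH = 1 << 25   # pretend enlarged gfx11 limit

assert is_uint24(2 * OLD_MAX_WAVE_SCRATCH)       # mad24 path would be safe
assert not is_uint24(2 * NEW_MAX_WAVE_SCRATCH)   # needs the wider add instead
```

With wave32, the frame value carries two fewer scaled bits per lane, which is where the "26-bit multiply" remark above comes from.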
.addReg(FrameReg)
.addImm(ST.getWavefrontSize())
.addReg(TmpResultReg, RegState::Kill)
This has the operands swapped. This should be doing the multiply of the offset and adding the frame reg
.addReg(FrameReg)
.addImm(ST.getWavefrontSize())
.addReg(TmpResultReg, RegState::Kill)
.addReg(TmpResultReg, RegState::Kill)
.addImm(ST.getWavefrontSize())
.addReg(FrameReg) |
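The operand swap matters because the mad computes src0 * src1 + src2: with FrameReg in the multiply position, the scaling is applied to the wrong term. A numeric sketch (illustrative Python with made-up values, not LLVM code):

```python
WAVE_SIZE = 64   # wave64
LOG2_WS = 6

fp = 1 << 12     # frame pointer, already in wave-space (scaled) units
offset = 8       # per-lane frame-index offset

# Original (buggy) operand order: FrameReg * wavesize + offset
wrong = (fp * WAVE_SIZE + offset) >> LOG2_WS
# Suggested order: offset * wavesize + FrameReg
right = (offset * WAVE_SIZE + fp) >> LOG2_WS

# Only the suggested order matches the canonical shift-then-add lowering:
assert right == (fp >> LOG2_WS) + offset
assert wrong != right
print(wrong, right)
```

Here `wrong` re-scales the already-scaled frame pointer, producing a wave-space address where a per-lane one is needed.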
} else
  Add.addImm(Offset).addReg(TmpResultReg, RegState::Kill);
} else {
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_MOV_B32_e32), |
This needs comments about why this is doing this reassociation vs. above
It would also be good if we had an end to end IR test that hit these new paths |
// value. We can either do a right shift of the frame pointer to
// get to lane space, or a left shift of the offset to get to
// wavespace. We can right shift after the computation to get
// back to the desired per-lane value. |
Mention we are using the mad_u32_u24 primarily as an add with no carry out clobber
@@ -2580,6 +2580,8 @@ bool SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI,
// get to lane space, or a left shift of the offset to get to
// wavespace. We can right shift after the computation to get
// back to the desired per-lane value.
// we are using the mad_u32_u24 primarily as an add with no |
Comments should be capitalized
It would still be good if we had an IR test that hit these cases
I think I've managed to craft one that will need some adjustment to hit the various cases |
BuildMI(*MBB, *Add, DL, TII->get(AMDGPU::V_LSHR_B32_e64),
        TmpResultReg)
.addImm(ST.getWavefrontSizeLog2()) |
This should use V_LSHRREV_B32_e64
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_LSHR_B32_e64),
        TmpResultReg)
    .addImm(ST.getWavefrontSizeLog2())
.addReg(FrameReg); |
lshrrev
} else
  Add.addImm(Offset).addReg(TmpResultReg, RegState::Kill);
} else {
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_MOV_B32_e32), |
If the offset is an inline immediate, you can just fold it directly into the mad24 operand as well
Should address this in a follow up
We also should special case folding frame indexes into S_ADD_I32/S_OR_B32
I'll open another PR with additional IR tests
Sure Matt, will add the changes you suggested.
This is #101306 |
In the case of v_mad we have materialized the offset in a VGPR and the mad is performed in wave space; later the VGPR has to be shifted back into lane space. [#99556](#99556) introduced a bug. Co-authored-by: Pankajdwivedi-25 <[email protected]>
eliminateFrameIndex wasn't handling the FI copy to scalar registers, and the default implementation breaks the code while trying to handle it. This patch handles the broken lowering and also takes care of some edge cases that might arise. The tricky case is a non-zero offset where scc and vcc are live and no SGPR pair is available.
Co-authored by @arsenm