[AMDGPU] Restore SP correctly in functions with dynamic allocas #122743
Conversation
Currently, the AMDGPU backend sets up FP and then increments SP by a fixed size, from FP, in the prolog and decrements it by the same amount in the epilog.

Prolog:

    fp = sp + (alignment - 1)
    fp &= -alignment
    sp += frameSize + alignment

Epilog:

    sp -= (frameSize + alignment)

In the presence of dynamic alloca, this leads to incorrect restoration of SP. This patch enforces the presence of a base pointer for all functions with dynamic allocas, and SP is restored from the saved BP in the epilog.

Prolog:

    fp = sp + (alignment - 1)
    fp &= -alignment
    bp = sp
    sp += frameSize + alignment

Epilog:

    sp = bp + frameSize + alignment
    sp -= (frameSize + alignment)

(Note: for dynamic allocas with default alignment, SP can be restored from the saved FP as well. However, for the sake of uniformity, the presence of BP is enforced.)

Fixes: SWDEV-408164
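To make the failure mode concrete, here is a small standalone sketch (not backend code; the entry SP, frame size, and alloca size are made-up numbers) that models the SP bookkeeping from the sequences above:

```cpp
#include <cassert>
#include <cstdint>

int main() {
  const uint32_t EntrySP = 0x1000;   // SP at function entry (hypothetical)
  const uint32_t FrameSize = 0x400;  // fixed frame size incl. alignment padding
  const uint32_t DynAlloc = 0x200;   // space carved out by a dynamic alloca

  // Old scheme: the epilog only undoes the fixed prolog increment.
  uint32_t SP = EntrySP;
  SP += FrameSize;                   // prolog
  SP += DynAlloc;                    // dynamic alloca grows the stack at runtime
  SP -= FrameSize;                   // old epilog
  assert(SP == EntrySP + DynAlloc);  // SP is off by DynAlloc -- the bug

  // New scheme: the prolog also copies SP into BP; the epilog rebases from BP.
  SP = EntrySP;
  const uint32_t BP = SP;            // prolog: bp = sp
  SP += FrameSize;                   // prolog
  SP += DynAlloc;                    // dynamic alloca
  SP = BP + FrameSize;               // epilog: sp = bp + frame size
  SP -= FrameSize;                   // existing epilog decrement
  assert(SP == EntrySP);             // SP restored correctly
  return 0;
}
```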
@llvm/pr-subscribers-backend-amdgpu @llvm/pr-subscribers-llvm-globalisel

Author: Aaditya (easyonaadit)

Patch is 75.29 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/122743.diff

8 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
index dcd4f0f65e8ef2..274416ec054817 100644
--- a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
@@ -1259,6 +1259,17 @@ void SIFrameLowering::emitEpilogue(MachineFunction &MF,
Register FramePtrRegScratchCopy;
Register SGPRForFPSaveRestoreCopy =
FuncInfo->getScratchSGPRCopyDstReg(FramePtrReg);
+
+ if (MFI.hasVarSizedObjects()) {
+ assert(TRI.hasBasePointer(MF) &&
+ "Variable sized objects require base pointer to be setup!");
+ Register BasePtrReg = TRI.getBaseRegister();
+ // Restore SP to fixed frame size
+ BuildMI(MBB, MBBI, DL, TII->get(AMDGPU::S_ADD_I32), StackPtrReg)
+ .addReg(BasePtrReg)
+ .addImm(RoundedSize * getScratchScaleFactor(ST))
+ .setMIFlag(MachineInstr::FrameDestroy);
+ }
if (FPSaved) {
// CSR spill restores should use FP as base register. If
// SGPRForFPSaveRestoreCopy is not true, restore the previous value of FP
diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
index 704435dad65d7b..5a4e6ec48da823 100644
--- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
@@ -525,8 +525,11 @@ Register SIRegisterInfo::getFrameRegister(const MachineFunction &MF) const {
bool SIRegisterInfo::hasBasePointer(const MachineFunction &MF) const {
// When we need stack realignment, we can't reference off of the
// stack pointer, so we reserve a base pointer.
+ // For functions with dynamically sized stack objects, we need to reference
+ // off the base pointer in the epilog to restore the stack frame.
const MachineFrameInfo &MFI = MF.getFrameInfo();
- return MFI.getNumFixedObjects() && shouldRealignStack(MF);
+ return (MFI.getNumFixedObjects() && shouldRealignStack(MF)) ||
+ MFI.hasVarSizedObjects();
}
Register SIRegisterInfo::getBaseRegister() const { return AMDGPU::SGPR34; }
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/dynamic-alloca-uniform.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/dynamic-alloca-uniform.ll
index ae055ea041297e..e497ed6526a059 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/dynamic-alloca-uniform.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/dynamic-alloca-uniform.ll
@@ -69,6 +69,8 @@ define void @func_dynamic_stackalloc_sgpr_align4() {
; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX9-NEXT: s_mov_b32 s7, s33
; GFX9-NEXT: s_mov_b32 s33, s32
+; GFX9-NEXT: s_mov_b32 s8, s34
+; GFX9-NEXT: s_mov_b32 s34, s32
; GFX9-NEXT: s_addk_i32 s32, 0x400
; GFX9-NEXT: s_getpc_b64 s[4:5]
; GFX9-NEXT: s_add_u32 s4, s4, gv@gotpcrel32@lo+4
@@ -86,6 +88,8 @@ define void @func_dynamic_stackalloc_sgpr_align4() {
; GFX9-NEXT: s_and_b32 s4, s4, -16
; GFX9-NEXT: s_lshl_b32 s4, s4, 6
; GFX9-NEXT: s_add_u32 s32, s6, s4
+; GFX9-NEXT: s_add_i32 s32, s34, 0x400
+; GFX9-NEXT: s_mov_b32 s34, s8
; GFX9-NEXT: s_addk_i32 s32, 0xfc00
; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: s_setpc_b64 s[30:31]
@@ -95,6 +99,8 @@ define void @func_dynamic_stackalloc_sgpr_align4() {
; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX10-NEXT: s_mov_b32 s7, s33
; GFX10-NEXT: s_mov_b32 s33, s32
+; GFX10-NEXT: s_mov_b32 s8, s34
+; GFX10-NEXT: s_mov_b32 s34, s32
; GFX10-NEXT: s_addk_i32 s32, 0x200
; GFX10-NEXT: s_getpc_b64 s[4:5]
; GFX10-NEXT: s_add_u32 s4, s4, gv@gotpcrel32@lo+4
@@ -112,6 +118,8 @@ define void @func_dynamic_stackalloc_sgpr_align4() {
; GFX10-NEXT: s_and_b32 s4, s4, -16
; GFX10-NEXT: s_lshl_b32 s4, s4, 5
; GFX10-NEXT: s_add_u32 s32, s6, s4
+; GFX10-NEXT: s_add_i32 s32, s34, 0x200
+; GFX10-NEXT: s_mov_b32 s34, s8
; GFX10-NEXT: s_addk_i32 s32, 0xfe00
; GFX10-NEXT: s_setpc_b64 s[30:31]
;
@@ -120,13 +128,15 @@ define void @func_dynamic_stackalloc_sgpr_align4() {
; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-NEXT: s_mov_b32 s3, s33
; GFX11-NEXT: s_mov_b32 s33, s32
+; GFX11-NEXT: s_mov_b32 s4, s34
+; GFX11-NEXT: s_mov_b32 s34, s32
; GFX11-NEXT: s_add_i32 s32, s32, 16
; GFX11-NEXT: s_getpc_b64 s[0:1]
; GFX11-NEXT: s_add_u32 s0, s0, gv@gotpcrel32@lo+4
; GFX11-NEXT: s_addc_u32 s1, s1, gv@gotpcrel32@hi+12
-; GFX11-NEXT: v_mov_b32_e32 v0, 0
-; GFX11-NEXT: s_load_b64 s[0:1], s[0:1], 0x0
; GFX11-NEXT: s_mov_b32 s2, s32
+; GFX11-NEXT: s_load_b64 s[0:1], s[0:1], 0x0
+; GFX11-NEXT: v_mov_b32_e32 v0, 0
; GFX11-NEXT: s_mov_b32 s33, s3
; GFX11-NEXT: scratch_store_b32 off, v0, s2
; GFX11-NEXT: s_waitcnt lgkmcnt(0)
@@ -136,8 +146,10 @@ define void @func_dynamic_stackalloc_sgpr_align4() {
; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
; GFX11-NEXT: s_and_b32 s0, s0, -16
; GFX11-NEXT: s_lshl_b32 s0, s0, 5
-; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX11-NEXT: s_add_u32 s32, s2, s0
+; GFX11-NEXT: s_add_i32 s32, s34, 16
+; GFX11-NEXT: s_mov_b32 s34, s4
; GFX11-NEXT: s_add_i32 s32, s32, -16
; GFX11-NEXT: s_setpc_b64 s[30:31]
%n = load i32, ptr addrspace(4) @gv, align 4
@@ -210,6 +222,8 @@ define void @func_dynamic_stackalloc_sgpr_align16() {
; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX9-NEXT: s_mov_b32 s7, s33
; GFX9-NEXT: s_mov_b32 s33, s32
+; GFX9-NEXT: s_mov_b32 s8, s34
+; GFX9-NEXT: s_mov_b32 s34, s32
; GFX9-NEXT: s_addk_i32 s32, 0x400
; GFX9-NEXT: s_getpc_b64 s[4:5]
; GFX9-NEXT: s_add_u32 s4, s4, gv@gotpcrel32@lo+4
@@ -227,6 +241,8 @@ define void @func_dynamic_stackalloc_sgpr_align16() {
; GFX9-NEXT: s_and_b32 s4, s4, -16
; GFX9-NEXT: s_lshl_b32 s4, s4, 6
; GFX9-NEXT: s_add_u32 s32, s6, s4
+; GFX9-NEXT: s_add_i32 s32, s34, 0x400
+; GFX9-NEXT: s_mov_b32 s34, s8
; GFX9-NEXT: s_addk_i32 s32, 0xfc00
; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: s_setpc_b64 s[30:31]
@@ -236,6 +252,8 @@ define void @func_dynamic_stackalloc_sgpr_align16() {
; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX10-NEXT: s_mov_b32 s7, s33
; GFX10-NEXT: s_mov_b32 s33, s32
+; GFX10-NEXT: s_mov_b32 s8, s34
+; GFX10-NEXT: s_mov_b32 s34, s32
; GFX10-NEXT: s_addk_i32 s32, 0x200
; GFX10-NEXT: s_getpc_b64 s[4:5]
; GFX10-NEXT: s_add_u32 s4, s4, gv@gotpcrel32@lo+4
@@ -253,6 +271,8 @@ define void @func_dynamic_stackalloc_sgpr_align16() {
; GFX10-NEXT: s_and_b32 s4, s4, -16
; GFX10-NEXT: s_lshl_b32 s4, s4, 5
; GFX10-NEXT: s_add_u32 s32, s6, s4
+; GFX10-NEXT: s_add_i32 s32, s34, 0x200
+; GFX10-NEXT: s_mov_b32 s34, s8
; GFX10-NEXT: s_addk_i32 s32, 0xfe00
; GFX10-NEXT: s_setpc_b64 s[30:31]
;
@@ -261,13 +281,15 @@ define void @func_dynamic_stackalloc_sgpr_align16() {
; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-NEXT: s_mov_b32 s3, s33
; GFX11-NEXT: s_mov_b32 s33, s32
+; GFX11-NEXT: s_mov_b32 s4, s34
+; GFX11-NEXT: s_mov_b32 s34, s32
; GFX11-NEXT: s_add_i32 s32, s32, 16
; GFX11-NEXT: s_getpc_b64 s[0:1]
; GFX11-NEXT: s_add_u32 s0, s0, gv@gotpcrel32@lo+4
; GFX11-NEXT: s_addc_u32 s1, s1, gv@gotpcrel32@hi+12
-; GFX11-NEXT: v_mov_b32_e32 v0, 0
-; GFX11-NEXT: s_load_b64 s[0:1], s[0:1], 0x0
; GFX11-NEXT: s_mov_b32 s2, s32
+; GFX11-NEXT: s_load_b64 s[0:1], s[0:1], 0x0
+; GFX11-NEXT: v_mov_b32_e32 v0, 0
; GFX11-NEXT: s_mov_b32 s33, s3
; GFX11-NEXT: scratch_store_b32 off, v0, s2
; GFX11-NEXT: s_waitcnt lgkmcnt(0)
@@ -277,8 +299,10 @@ define void @func_dynamic_stackalloc_sgpr_align16() {
; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
; GFX11-NEXT: s_and_b32 s0, s0, -16
; GFX11-NEXT: s_lshl_b32 s0, s0, 5
-; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX11-NEXT: s_add_u32 s32, s2, s0
+; GFX11-NEXT: s_add_i32 s32, s34, 16
+; GFX11-NEXT: s_mov_b32 s34, s4
; GFX11-NEXT: s_add_i32 s32, s32, -16
; GFX11-NEXT: s_setpc_b64 s[30:31]
%n = load i32, ptr addrspace(4) @gv, align 16
@@ -355,6 +379,8 @@ define void @func_dynamic_stackalloc_sgpr_align32(ptr addrspace(1) %out) {
; GFX9-NEXT: s_mov_b32 s6, s33
; GFX9-NEXT: s_add_i32 s33, s32, 0x7c0
; GFX9-NEXT: s_and_b32 s33, s33, 0xfffff800
+; GFX9-NEXT: s_mov_b32 s7, s34
+; GFX9-NEXT: s_mov_b32 s34, s32
; GFX9-NEXT: s_addk_i32 s32, 0x1000
; GFX9-NEXT: s_getpc_b64 s[4:5]
; GFX9-NEXT: s_add_u32 s4, s4, gv@gotpcrel32@lo+4
@@ -373,6 +399,8 @@ define void @func_dynamic_stackalloc_sgpr_align32(ptr addrspace(1) %out) {
; GFX9-NEXT: s_and_b32 s4, s4, -16
; GFX9-NEXT: s_lshl_b32 s4, s4, 6
; GFX9-NEXT: s_add_u32 s32, s5, s4
+; GFX9-NEXT: s_add_i32 s32, s34, 0x1000
+; GFX9-NEXT: s_mov_b32 s34, s7
; GFX9-NEXT: s_addk_i32 s32, 0xf000
; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: s_setpc_b64 s[30:31]
@@ -382,8 +410,10 @@ define void @func_dynamic_stackalloc_sgpr_align32(ptr addrspace(1) %out) {
; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX10-NEXT: s_mov_b32 s6, s33
; GFX10-NEXT: s_add_i32 s33, s32, 0x3e0
-; GFX10-NEXT: s_addk_i32 s32, 0x800
+; GFX10-NEXT: s_mov_b32 s7, s34
; GFX10-NEXT: s_and_b32 s33, s33, 0xfffffc00
+; GFX10-NEXT: s_mov_b32 s34, s32
+; GFX10-NEXT: s_addk_i32 s32, 0x800
; GFX10-NEXT: s_getpc_b64 s[4:5]
; GFX10-NEXT: s_add_u32 s4, s4, gv@gotpcrel32@lo+4
; GFX10-NEXT: s_addc_u32 s5, s5, gv@gotpcrel32@hi+12
@@ -401,6 +431,8 @@ define void @func_dynamic_stackalloc_sgpr_align32(ptr addrspace(1) %out) {
; GFX10-NEXT: s_and_b32 s4, s4, -16
; GFX10-NEXT: s_lshl_b32 s4, s4, 5
; GFX10-NEXT: s_add_u32 s32, s5, s4
+; GFX10-NEXT: s_add_i32 s32, s34, 0x800
+; GFX10-NEXT: s_mov_b32 s34, s7
; GFX10-NEXT: s_addk_i32 s32, 0xf800
; GFX10-NEXT: s_setpc_b64 s[30:31]
;
@@ -409,8 +441,10 @@ define void @func_dynamic_stackalloc_sgpr_align32(ptr addrspace(1) %out) {
; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-NEXT: s_mov_b32 s2, s33
; GFX11-NEXT: s_add_i32 s33, s32, 31
-; GFX11-NEXT: s_add_i32 s32, s32, 64
+; GFX11-NEXT: s_mov_b32 s3, s34
; GFX11-NEXT: s_and_not1_b32 s33, s33, 31
+; GFX11-NEXT: s_mov_b32 s34, s32
+; GFX11-NEXT: s_add_i32 s32, s32, 64
; GFX11-NEXT: s_getpc_b64 s[0:1]
; GFX11-NEXT: s_add_u32 s0, s0, gv@gotpcrel32@lo+4
; GFX11-NEXT: s_addc_u32 s1, s1, gv@gotpcrel32@hi+12
@@ -429,7 +463,8 @@ define void @func_dynamic_stackalloc_sgpr_align32(ptr addrspace(1) %out) {
; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
; GFX11-NEXT: s_lshl_b32 s0, s0, 5
; GFX11-NEXT: s_add_u32 s32, s1, s0
-; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; GFX11-NEXT: s_add_i32 s32, s34, 64
+; GFX11-NEXT: s_mov_b32 s34, s3
; GFX11-NEXT: s_addk_i32 s32, 0xffc0
; GFX11-NEXT: s_setpc_b64 s[30:31]
%n = load i32, ptr addrspace(4) @gv
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/non-entry-alloca.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/non-entry-alloca.ll
index 69abef02d3d924..f4e6b7c033b3c4 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/non-entry-alloca.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/non-entry-alloca.ll
@@ -151,8 +151,10 @@ define void @func_non_entry_block_static_alloca_align4(ptr addrspace(1) %out, i3
; GCN: ; %bb.0: ; %entry
; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GCN-NEXT: s_mov_b32 s7, s33
+; GCN-NEXT: s_mov_b32 s8, s34
; GCN-NEXT: v_cmp_eq_u32_e32 vcc, 0, v2
; GCN-NEXT: s_mov_b32 s33, s32
+; GCN-NEXT: s_mov_b32 s34, s32
; GCN-NEXT: s_addk_i32 s32, 0x400
; GCN-NEXT: s_and_saveexec_b64 s[4:5], vcc
; GCN-NEXT: s_cbranch_execz .LBB2_3
@@ -178,8 +180,10 @@ define void @func_non_entry_block_static_alloca_align4(ptr addrspace(1) %out, i3
; GCN-NEXT: .LBB2_3: ; %bb.2
; GCN-NEXT: s_or_b64 exec, exec, s[4:5]
; GCN-NEXT: v_mov_b32_e32 v0, 0
+; GCN-NEXT: s_add_i32 s32, s34, 0x400
; GCN-NEXT: global_store_dword v[0:1], v0, off
; GCN-NEXT: s_waitcnt vmcnt(0)
+; GCN-NEXT: s_mov_b32 s34, s8
; GCN-NEXT: s_addk_i32 s32, 0xfc00
; GCN-NEXT: s_mov_b32 s33, s7
; GCN-NEXT: s_setpc_b64 s[30:31]
@@ -216,8 +220,10 @@ define void @func_non_entry_block_static_alloca_align64(ptr addrspace(1) %out, i
; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GCN-NEXT: s_mov_b32 s7, s33
; GCN-NEXT: s_add_i32 s33, s32, 0xfc0
+; GCN-NEXT: s_mov_b32 s8, s34
; GCN-NEXT: v_cmp_eq_u32_e32 vcc, 0, v2
; GCN-NEXT: s_and_b32 s33, s33, 0xfffff000
+; GCN-NEXT: s_mov_b32 s34, s32
; GCN-NEXT: s_addk_i32 s32, 0x2000
; GCN-NEXT: s_and_saveexec_b64 s[4:5], vcc
; GCN-NEXT: s_cbranch_execz .LBB3_2
@@ -240,8 +246,10 @@ define void @func_non_entry_block_static_alloca_align64(ptr addrspace(1) %out, i
; GCN-NEXT: .LBB3_2: ; %bb.1
; GCN-NEXT: s_or_b64 exec, exec, s[4:5]
; GCN-NEXT: v_mov_b32_e32 v0, 0
+; GCN-NEXT: s_add_i32 s32, s34, 0x2000
; GCN-NEXT: global_store_dword v[0:1], v0, off
; GCN-NEXT: s_waitcnt vmcnt(0)
+; GCN-NEXT: s_mov_b32 s34, s8
; GCN-NEXT: s_addk_i32 s32, 0xe000
; GCN-NEXT: s_mov_b32 s33, s7
; GCN-NEXT: s_setpc_b64 s[30:31]
diff --git a/llvm/test/CodeGen/AMDGPU/amdpal-callable.ll b/llvm/test/CodeGen/AMDGPU/amdpal-callable.ll
index a5f915c48ebeea..2bd58f41ec7905 100644
--- a/llvm/test/CodeGen/AMDGPU/amdpal-callable.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdpal-callable.ll
@@ -149,15 +149,15 @@ attributes #0 = { nounwind }
; GCN-NEXT: dynamic_stack:
; GCN-NEXT: .backend_stack_size: 0x10{{$}}
; GCN-NEXT: .lds_size: 0{{$}}
-; GCN-NEXT: .sgpr_count: 0x28{{$}}
+; GCN-NEXT: .sgpr_count: 0x2a{{$}}
; GCN-NEXT: .stack_frame_size_in_bytes: 0x10{{$}}
; SDAG-NEXT: .vgpr_count: 0x2{{$}}
; GISEL-NEXT: .vgpr_count: 0x3{{$}}
; GCN-NEXT: dynamic_stack_loop:
; GCN-NEXT: .backend_stack_size: 0x10{{$}}
; GCN-NEXT: .lds_size: 0{{$}}
-; SDAG-NEXT: .sgpr_count: 0x25{{$}}
-; GISEL-NEXT: .sgpr_count: 0x26{{$}}
+; SDAG-NEXT: .sgpr_count: 0x27{{$}}
+; GISEL-NEXT: .sgpr_count: 0x28{{$}}
; GCN-NEXT: .stack_frame_size_in_bytes: 0x10{{$}}
; SDAG-NEXT: .vgpr_count: 0x3{{$}}
; GISEL-NEXT: .vgpr_count: 0x4{{$}}
@@ -182,22 +182,22 @@ attributes #0 = { nounwind }
; GCN-NEXT: no_stack_extern_call:
; GCN-NEXT: .backend_stack_size: 0x10{{$}}
; GCN-NEXT: .lds_size: 0{{$}}
-; GFX8-NEXT: .sgpr_count: 0x28{{$}}
-; GFX9-NEXT: .sgpr_count: 0x2c{{$}}
+; GFX8-NEXT: .sgpr_count: 0x2a{{$}}
+; GFX9-NEXT: .sgpr_count: 0x2e{{$}}
; GCN-NEXT: .stack_frame_size_in_bytes: 0x10{{$}}
; GCN-NEXT: .vgpr_count: 0x2b{{$}}
; GCN-NEXT: no_stack_extern_call_many_args:
; GCN-NEXT: .backend_stack_size: 0x90{{$}}
; GCN-NEXT: .lds_size: 0{{$}}
-; GFX8-NEXT: .sgpr_count: 0x28{{$}}
-; GFX9-NEXT: .sgpr_count: 0x2c{{$}}
+; GFX8-NEXT: .sgpr_count: 0x2a{{$}}
+; GFX9-NEXT: .sgpr_count: 0x2e{{$}}
; GCN-NEXT: .stack_frame_size_in_bytes: 0x90{{$}}
; GCN-NEXT: .vgpr_count: 0x2b{{$}}
; GCN-NEXT: no_stack_indirect_call:
; GCN-NEXT: .backend_stack_size: 0x10{{$}}
; GCN-NEXT: .lds_size: 0{{$}}
-; GFX8-NEXT: .sgpr_count: 0x28{{$}}
-; GFX9-NEXT: .sgpr_count: 0x2c{{$}}
+; GFX8-NEXT: .sgpr_count: 0x2a{{$}}
+; GFX9-NEXT: .sgpr_count: 0x2e{{$}}
; GCN-NEXT: .stack_frame_size_in_bytes: 0x10{{$}}
; GCN-NEXT: .vgpr_count: 0x2b{{$}}
; GCN-NEXT: simple_lds:
@@ -227,15 +227,15 @@ attributes #0 = { nounwind }
; GCN-NEXT: simple_stack_extern_call:
; GCN-NEXT: .backend_stack_size: 0x20{{$}}
; GCN-NEXT: .lds_size: 0{{$}}
-; GFX8-NEXT: .sgpr_count: 0x28{{$}}
-; GFX9-NEXT: .sgpr_count: 0x2c{{$}}
+; GFX8-NEXT: .sgpr_count: 0x2a{{$}}
+; GFX9-NEXT: .sgpr_count: 0x2e{{$}}
; GCN-NEXT: .stack_frame_size_in_bytes: 0x20{{$}}
; GCN-NEXT: .vgpr_count: 0x2b{{$}}
; GCN-NEXT: simple_stack_indirect_call:
; GCN-NEXT: .backend_stack_size: 0x20{{$}}
; GCN-NEXT: .lds_size: 0{{$}}
-; GFX8-NEXT: .sgpr_count: 0x28{{$}}
-; GFX9-NEXT: .sgpr_count: 0x2c{{$}}
+; GFX8-NEXT: .sgpr_count: 0x2a{{$}}
+; GFX9-NEXT: .sgpr_count: 0x2e{{$}}
; GCN-NEXT: .stack_frame_size_in_bytes: 0x20{{$}}
; GCN-NEXT: .vgpr_count: 0x2b{{$}}
; GCN-NEXT: simple_stack_recurse:
diff --git a/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll b/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll
index 9acb3a42ae102c..c298a5609ddb35 100644
--- a/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll
+++ b/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll
@@ -1064,10 +1064,12 @@ define void @test_dynamic_stackalloc_device_uniform(i32 %n) {
; GFX9-SDAG-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX9-SDAG-NEXT: v_lshl_add_u32 v0, v0, 2, 15
; GFX9-SDAG-NEXT: s_mov_b32 s9, s33
+; GFX9-SDAG-NEXT: s_mov_b32 s10, s34
; GFX9-SDAG-NEXT: v_and_b32_e32 v0, -16, v0
; GFX9-SDAG-NEXT: s_mov_b64 s[4:5], exec
; GFX9-SDAG-NEXT: s_mov_b32 s6, 0
; GFX9-SDAG-NEXT: s_mov_b32 s33, s32
+; GFX9-SDAG-NEXT: s_mov_b32 s34, s32
; GFX9-SDAG-NEXT: s_addk_i32 s32, 0x400
; GFX9-SDAG-NEXT: .LBB8_1: ; =>This Inner Loop Header: Depth=1
; GFX9-SDAG-NEXT: s_ff1_i32_b64 s7, s[4:5]
@@ -1082,8 +1084,10 @@ define void @test_dynamic_stackalloc_device_uniform(i32 %n) {
; GFX9-SDAG-NEXT: v_lshl_add_u32 v0, s6, 6, v0
; GFX9-SDAG-NEXT: v_readfirstlane_b32 s32, v0
; GFX9-SDAG-NEXT: v_mov_b32_e32 v0, 0x7b
+; GFX9-SDAG-NEXT: s_add_i32 s32, s34, 0x400
; GFX9-SDAG-NEXT: buffer_store_dword v0, off, s[0:3], s4
; GFX9-SDAG-NEXT: s_waitcnt vmcnt(0)
+; GFX9-SDAG-NEXT: s_mov_b32 s34, s10
; GFX9-SDAG-NEXT: s_addk_i32 s32, 0xfc00
; GFX9-SDAG-NEXT: s_mov_b32 s33, s9
; GFX9-SDAG-NEXT: s_setpc_b64 s[30:31]
@@ -1093,10 +1097,12 @@ define void @test_dynamic_stackalloc_device_uniform(i32 %n) {
; GFX9-GISEL-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX9-GISEL-NEXT: v_lshl_add_u32 v0, v0, 2, 15
; GFX9-GISEL-NEXT: s_mov_b32 s9, s33
+; GFX9-GISEL-NEXT: s_mov_b32 s10, s34
; GFX9-GISEL-NEXT: v_and_b32_e32 v0, -16, v0
; GFX9-GISEL-NEXT: s_mov_b64 s[4:5], exec
; GFX9-GISEL-NEXT: s_mov_b32 s6, 0
; GFX9-GISEL-NEXT: s_mov_b32 s33, s32
+; GFX9-GISEL-NEXT: s_mov_b32 s34, s32
; GFX9-GISEL-NEXT: s_addk_i32 s32, 0x400
; GFX9-GISEL-NEXT: .LBB8_1: ; =>This Inner Loop Header: Depth=1
; GFX9-GISEL-NEXT: s_ff1_i32_b64 s7, s[4:5]
@@ -1111,8 +1117,10 @@ define void @test_dynamic_stackalloc_device_uniform(i32 %n) {
; GFX9-GISEL-NEXT: s_add_u32 s32, s4, s5
; GFX9-GISEL-NEXT: v_mov_b32_e32 v0, 0x7b
; GFX9-GISEL-NEXT: v_mov_b32_e32 v1, s4
+; GFX9-GISEL-NEXT: s_add_i32 s32, s34, 0x400
; GFX9-GISEL-NEXT: buffer_store_dword v0, v1, s[0:3], 0 offen
; GFX9-GISEL-NEXT: s_waitcnt vmcnt(0)
+; GFX9-GISEL-NEXT: s_mov_b32 s34, s10
; GFX9-GISEL-NEXT: s_addk_i32 s32, 0xfc00
; GFX9-GISEL-NEXT: s_mov_b32 s33, s9
; GFX9-GISEL-NEXT: s_setpc_b64 s[30:31]
@@ -1122,17 +1130,18 @@ define void @test_dynamic_stackalloc_device_uniform(i32 %n) {
; GFX11-SDAG-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-SDAG-NEXT: v_lshl_add_u32 v0, v0, 2, 15
; GFX11-SDAG-NEXT: s_mov_b32 s4, s33
+; GFX11-SDAG-NEXT: s_mov_b32 s5, s34
; GFX11-SDAG-NEXT: s_mov_b32 s1, exec_lo
; GFX11-SDAG-NEXT: s_mov_b32 s0, 0
-; GFX11-SDAG-NEXT: s_mov_b32 s33, s32
; GFX11-SDAG-NEXT: v_and_b32_e32 v0, -16, v0
+; GFX11-SDAG-NEXT: s_mov_b32 s33, s32
+; GFX11-SDAG-NEXT: s_mov_b32 s34, s32
; GFX11-SDAG-NEXT: s_add_i32 s32, s32, 16
; GFX11-SDAG-NEXT: .LBB8_1: ; =>This Inner Loop Header: Depth=1
; GFX11-SDAG-NEXT: s_ctz_i32_b32 s2, s1
-; GFX11-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
+; GFX11-SDAG-NEXT: s_delay_alu in...
[truncated]
  const MachineFrameInfo &MFI = MF.getFrameInfo();
- return MFI.getNumFixedObjects() && shouldRealignStack(MF);
+ return (MFI.getNumFixedObjects() && shouldRealignStack(MF)) ||
+        MFI.hasVarSizedObjects();
I don't think we need to expand the conditions for having a base pointer
If the function doesn't access any incoming parameters, BP is not being set up.
For dynamic allocas with default alignment, I can use the frame pointer for restoration.
However, for a function with an over-aligned dynamic alloca, I couldn't find a way to reverse the calculations done in the prolog without using the base pointer (I have another idea, but it would require heavy changes in PEI).
Shouldn't this require marking BP as reserved so that regalloc won't use the specific SGPR (S34 at the moment) for allocation?
We should prefer the use-FP path and use BP as a fallback only if really necessary. What is the idea?
Also should probably handle hasOpaqueSPAdjustment
> Shouldn't this require marking BP as reserved so that regalloc won't use the specific SGPR (S34 at the moment) for allocation?

This function (hasBasePointer) is being called to decide whether we should reserve BP or not. I was under the impression that this change would suffice; will something more be needed to reserve it?
> We should prefer the use-FP path and use BP as a fallback only if really necessary. What is the idea?

The FP path is ideal for dynamic allocas with default alignment.

When some object is over-aligned, the prolog currently does:

    sp += frameSize + alignment

This works, but it is not the right calculation; we are allocating excess space on the stack. Once FP and BP have been established, we should just do:

    sp = aligned_fp + frameSize

I wasn't sure about the design decision behind the current calculations, so I didn't want to tinker with them. I guess this was needed when we didn't have BP support, since there is no other way to unwind the stack in the epilog if it has been realigned in the prolog. I assume this will need an overhaul in the PEI code, though.
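For the over-alignment point above, a quick standalone sketch (made-up numbers, not backend code) of why `sp += frameSize + alignment` reserves more scratch than `sp = aligned_fp + frameSize` would need:

```cpp
#include <cassert>
#include <cstdint>

int main() {
  const uint32_t EntrySP = 0x1010;   // hypothetical entry SP, not a multiple of Align
  const uint32_t Align = 0x80;       // requested over-alignment
  const uint32_t FrameSize = 0x100;  // fixed frame size

  // Prolog realignment: fp = align_up(sp, Align)
  const uint32_t FP = (EntrySP + Align - 1) & ~(Align - 1);  // 0x1080

  const uint32_t SPCurrent = EntrySP + FrameSize + Align;  // what the prolog does today: 0x1190
  const uint32_t SPTight = FP + FrameSize;                 // what the comment proposes: 0x1180

  assert(FP % Align == 0);       // FP is realigned
  assert(SPTight <= SPCurrent);  // the current scheme reserves extra padding
  return 0;
}
```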
> Also should probably handle hasOpaqueSPAdjustment

I'll have to look into this. Is it part of this patch?
@@ -1259,6 +1259,17 @@ void SIFrameLowering::emitEpilogue(MachineFunction &MF,
   Register FramePtrRegScratchCopy;
   Register SGPRForFPSaveRestoreCopy =
       FuncInfo->getScratchSGPRCopyDstReg(FramePtrReg);
+
+  if (MFI.hasVarSizedObjects()) {
It's any kind of dynamic stack adjustment. I think this is the same condition as hasFP (that's what I was using in my last attempt at this).
hasFP checks for variable-sized objects, stack realignment, the disable-frame-pointer-elimination optimization, and these intrinsics: @llvm.frameaddress, @llvm.experimental.stackmap, @llvm.experimental.patchpoint.
Of these, I think we only need SP restoration in case of variable-sized objects and stack realignment.
If a function has only stack realignment without variable-sized objects, that is handled in the existing prolog/epilog.
So I think we only need to handle SP restoration for variable-sized objects (unless there is something else which does a dynamic stack adjustment).
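As a summary of this sub-thread, a hypothetical helper (not part of the patch) expressing the condition being argued for; it only relies on the existing MachineFrameInfo query:

```cpp
#include "llvm/CodeGen/MachineFrameInfo.h"

// Hypothetical helper, not part of the patch: the epilog only needs to
// re-derive SP when something may have moved it by a non-constant amount
// after the prolog, i.e. when the function has variable-sized objects.
// Plain stack realignment without dynamic allocas is already undone by
// the existing fixed-size epilog adjustment.
static bool needsDynamicSPRestore(const llvm::MachineFrameInfo &MFI) {
  return MFI.hasVarSizedObjects();
}
```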
The epilogue sequence is effectively:

Can you edit the description so these sequences make sense? Also, for formatting multiple lines of code it is better to use triple-backticks around the block, instead of single-backticks on each line. Or just indent the block by 4 spaces; then you don't need any backticks at all.

AMDGPU has a growing-up stack, and all the calculations are based on that. The calculations depend on whether there has been a dynamic allocation.

Got it, got it.
    BuildMI(MBB, MBBI, DL, TII->get(AMDGPU::S_ADD_I32), StackPtrReg)
        .addReg(FramePtrReg)
        .addImm(RoundedSize * getScratchScaleFactor(ST))
        .setMIFlag(MachineInstr::FrameDestroy);
Should SP restoration be marked as FrameDestroy? SP is actually modified during ISel, and there is no counterpart FrameSetup action in the prolog for this.
Ping @arsenm

@easyonaadit can you see if we could restore the Stack Pointer (SP) in the function epilog directly from FP/BP (based on which frame register holds the stack base at its entry)? If that can be done, we can avoid this patch and any other subtract we currently do in the epilog to restore the SP at return, as given here:
Currently, the AMDGPU backend sets up FP and then
increments SP by a fixed size, from FP, in the prolog and
decrements it by the same amount in the epilog.
Prolog:

    fp = sp + (alignment - 1)
    fp &= -alignment
    sp += frameSize + alignment

Epilog:

    sp -= (frameSize + alignment)
In the presence of dynamic alloca, this leads to incorrect
restoration of SP. This patch enforces the presence of a
base pointer for all functions with dynamic allocas, and
SP is restored from the saved BP in the epilog.
Prolog:

    fp = sp + (alignment - 1)
    fp &= -alignment
    bp = sp
    sp += frameSize + alignment

Epilog:

    sp = bp + frameSize + alignment
    sp -= (frameSize + alignment)
(Note: for dynamic allocas with default alignment, SP can be restored with saved FP as well. However, for the sake of uniformity, presence of BP is enforced)
Fixes: SWDEV-408164