Skip to content

[AMDGPU] Remove implicit definition of register group when restoring the last sub-register after a spill #133986

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 5, 2025

Conversation

bababuck
Copy link
Contributor

@bababuck bababuck commented Apr 1, 2025

Extension of https://reviews.llvm.org/D141101 to remove the define flag from the last stack memory access. Fixes case where COPY instructions are used for some of the stack restoration, but the copies get optimized away during the machine-cp pass.

Prior to this change, was possible to produce the following code:
$agpr16_agpr17_agpr18_agpr19 = SCRATCH_LOAD_DWORDX4_ST 64, 0, implicit $exec, implicit $flat_scr, implicit-def $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23_agpr24_agpr25_agpr26_agpr27_agpr28_agpr29_agpr30_agpr31 :: (load (s128) from %stack.17, align 4, addrspace 5)
$agpr20_agpr21_agpr22_agpr23 = SCRATCH_LOAD_DWORDX4_ST 80, 0, implicit $exec, implicit $flat_scr :: (load (s128) from %stack.17 + 16, align 4, addrspace 5)
$agpr24_agpr25_agpr26_agpr27 = SCRATCH_LOAD_DWORDX4_ST 96, 0, implicit $exec, implicit $flat_scr :: (load (s128) from %stack.17 + 32, align 4, addrspace 5)
$agpr31 = COPY $agpr112, implicit $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23_agpr24_agpr25_agpr26_agpr27_agpr28_agpr29_agpr30_agpr31
$agpr30 = COPY $agpr208, implicit $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23_agpr24_agpr25_agpr26_agpr27_agpr28_agpr29_agpr30_agpr31
$agpr28_agpr29 = SCRATCH_LOAD_DWORDX2_ST 112, 0, implicit $exec, implicit $flat_scr, implicit-def $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23_agpr24_agpr25_agpr26_agpr27_agpr28_agpr29_agpr30_agpr31 :: (load (s64) from %stack.17 + 48, align 4, addrspace 5)

where $agpr30 = COPY $agpr208 would be optimized away by machine-cp pass. Instead, change to:
$agpr28_agpr29 = SCRATCH_LOAD_DWORDX2_ST 112, 0, implicit $exec, implicit $flat_scr :: (load (s64) from %stack.17 + 48, align 4, addrspace 5)

Fixes #131386.

Made the simple fix, but I'm not completely comfortable with this change since the reason for the previous inclusion of IsLastSubReg is unclear to me.

@krzysz00

Copy link

github-actions bot commented Apr 1, 2025

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@@ -4111,10 +4111,10 @@ body: |
; GFX908-NEXT: S_CMP_EQ_U32 0, 0, implicit-def $scc
; GFX908-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr1, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, implicit $exec :: (store (s32) into %stack.2, addrspace 5)
; GFX908-NEXT: $vgpr1 = V_MOV_B32_e32 8200, implicit $exec
; GFX908-NEXT: $vgpr0 = V_ACCVGPR_READ_B32_e64 $agpr0, implicit $exec, implicit-def $agpr0_agpr1
; GFX908-NEXT: $vgpr0 = V_ACCVGPR_READ_B32_e64 $agpr0, implicit $exec, implicit-def $agpr0_agpr1, implicit $agpr0_agpr1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be a side effect where more implicit register reads are now included, unclear to me why this occurs. I will dig further to try to understand unless someone here has any thoughts.

@bababuck
Copy link
Contributor Author

bababuck commented Apr 5, 2025

If there is any interest, here's the emulator I wrote to help with debug: https://github.com/bababuck/AMDGPU

@bababuck
Copy link
Contributor Author

bababuck commented Apr 8, 2025

Ping

Adding @arsenm, apologies is this is out of your area, but you seem to be involved in the AMDGPU backend.

Copy link
Contributor

@krzysz00 krzysz00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a straightforward bugfix to me but I don't understand the backend well enough to approve it

@bababuck
Copy link
Contributor Author

This looks like a straightforward bugfix to me but I don't understand the backend well enough to approve it

Do you know who would be an appropriate reviewer?

@krzysz00 krzysz00 requested review from arsenm, kosarev and jplehr April 22, 2025 22:50
@krzysz00
Copy link
Contributor

Pinging the people Github suggested

@llvmbot
Copy link
Member

llvmbot commented Apr 23, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Ryan Buchner (bababuck)

Changes

Extension of https://reviews.llvm.org/D141101 to remove the define flag from the last stack memory access. Fixes case where COPY instructions are used for some of the stack restoration, but the copies get optimized away during the machine-cp pass.

Prior to this change, was possible to produce the following code:
$agpr16_agpr17_agpr18_agpr19 = SCRATCH_LOAD_DWORDX4_ST 64, 0, implicit $exec, implicit $flat_scr, implicit-def $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23_agpr24_agpr25_agpr26_agpr27_agpr28_agpr29_agpr30_agpr31 :: (load (s128) from %stack.17, align 4, addrspace 5)
$agpr20_agpr21_agpr22_agpr23 = SCRATCH_LOAD_DWORDX4_ST 80, 0, implicit $exec, implicit $flat_scr :: (load (s128) from %stack.17 + 16, align 4, addrspace 5)
$agpr24_agpr25_agpr26_agpr27 = SCRATCH_LOAD_DWORDX4_ST 96, 0, implicit $exec, implicit $flat_scr :: (load (s128) from %stack.17 + 32, align 4, addrspace 5)
$agpr31 = COPY $agpr112, implicit $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23_agpr24_agpr25_agpr26_agpr27_agpr28_agpr29_agpr30_agpr31
$agpr30 = COPY $agpr208, implicit $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23_agpr24_agpr25_agpr26_agpr27_agpr28_agpr29_agpr30_agpr31
$agpr28_agpr29 = SCRATCH_LOAD_DWORDX2_ST 112, 0, implicit $exec, implicit $flat_scr, implicit-def $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23_agpr24_agpr25_agpr26_agpr27_agpr28_agpr29_agpr30_agpr31 :: (load (s64) from %stack.17 + 48, align 4, addrspace 5)

where $agpr30 = COPY $agpr208 would be optimized away by machine-cp pass. Instead, change to:
$agpr28_agpr29 = SCRATCH_LOAD_DWORDX2_ST 112, 0, implicit $exec, implicit $flat_scr :: (load (s64) from %stack.17 + 48, align 4, addrspace 5)

Fixes #131386.

Made the simple fix, but I'm not completely comfortable with this change since the reason for the previous inclusion of IsLastSubReg is unclear to me.

@krzysz00


Patch is 372.61 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/133986.diff

9 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp (+4-2)
  • (modified) llvm/test/CodeGen/AMDGPU/accvgpr-spill-scc-clobber.mir (+20-20)
  • (modified) llvm/test/CodeGen/AMDGPU/av_spill_cross_bb_usage.mir (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/pei-build-av-spill.mir (+210-210)
  • (modified) llvm/test/CodeGen/AMDGPU/pei-build-spill.mir (+142-142)
  • (modified) llvm/test/CodeGen/AMDGPU/spill-agpr-partially-undef.mir (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/spill-reg-tuple-super-reg-use.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/vgpr-spill-scc-clobber.mir (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/vgpr-spill.mir (+5-5)
diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
index c1ac9491b2363..da373a2a18df5 100644
--- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
@@ -1910,8 +1910,10 @@ void SIRegisterInfo::buildSpillLoadStore(
       MIB->setAsmPrinterFlag(MachineInstr::ReloadReuse);
     }
 
-    if (NeedSuperRegImpOperand && (IsFirstSubReg || IsLastSubReg))
-      MIB.addReg(ValueReg, RegState::Implicit | SrcDstRegState);
+    if (NeedSuperRegImpOperand && IsFirstSubReg) {
+      unsigned State = RegState::Implicit | SrcDstRegState;
+      MIB.addReg(ValueReg, State);
+    }
 
     // The epilog restore of a wwm-scratch register can cause undesired
     // optimization during machine-cp post PrologEpilogInserter if the same
diff --git a/llvm/test/CodeGen/AMDGPU/accvgpr-spill-scc-clobber.mir b/llvm/test/CodeGen/AMDGPU/accvgpr-spill-scc-clobber.mir
index c91b686697b9d..a99f44776eda9 100644
--- a/llvm/test/CodeGen/AMDGPU/accvgpr-spill-scc-clobber.mir
+++ b/llvm/test/CodeGen/AMDGPU/accvgpr-spill-scc-clobber.mir
@@ -1046,7 +1046,7 @@ body:             |
   ; GFX908-NEXT:   $vgpr0 = BUFFER_LOAD_DWORD_OFFEN $vgpr1, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, implicit $exec :: (load (s32) from %stack.1, addrspace 5)
   ; GFX908-NEXT:   $agpr0 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr0, implicit $exec, implicit-def $agpr0_agpr1
   ; GFX908-NEXT:   $vgpr0 = BUFFER_LOAD_DWORD_OFFEN $vgpr1, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, implicit $exec :: (load (s32) from %stack.1 + 4, addrspace 5)
-  ; GFX908-NEXT:   $agpr1 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr0, implicit $exec, implicit-def $agpr0_agpr1
+  ; GFX908-NEXT:   $agpr1 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr0, implicit $exec
   ; GFX908-NEXT:   $vgpr1 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, implicit $exec :: (load (s32) from %stack.2, addrspace 5)
   ; GFX908-NEXT:   S_CBRANCH_SCC1 %bb.2, implicit $scc
   ; GFX908-NEXT: {{  $}}
@@ -1293,7 +1293,7 @@ body:             |
   ; GFX90A-NEXT:   BUFFER_STORE_DWORD_OFFSET killed $vgpr40, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 704, 0, 0, implicit $exec :: (store (s32) into %stack.226, addrspace 5)
   ; GFX90A-NEXT:   $vgpr40 = V_MOV_B32_e32 8904, implicit $exec
   ; GFX90A-NEXT:   $agpr0 = BUFFER_LOAD_DWORD_OFFEN $vgpr40, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, implicit $exec, implicit-def $agpr0_agpr1 :: (load (s32) from %stack.1, addrspace 5)
-  ; GFX90A-NEXT:   $agpr1 = BUFFER_LOAD_DWORD_OFFEN killed $vgpr40, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, implicit $exec, implicit-def $agpr0_agpr1 :: (load (s32) from %stack.1 + 4, addrspace 5)
+  ; GFX90A-NEXT:   $agpr1 = BUFFER_LOAD_DWORD_OFFEN killed $vgpr40, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, implicit $exec :: (load (s32) from %stack.1 + 4, addrspace 5)
   ; GFX90A-NEXT:   $vgpr40 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 704, 0, 0, implicit $exec :: (load (s32) from %stack.226, addrspace 5)
   ; GFX90A-NEXT:   S_CBRANCH_SCC1 %bb.2, implicit $scc
   ; GFX90A-NEXT: {{  $}}
@@ -1544,7 +1544,7 @@ body:             |
   ; GFX908-FLATSCR-NEXT:   $vgpr0 = SCRATCH_LOAD_DWORD $vgpr1, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.1, addrspace 5)
   ; GFX908-FLATSCR-NEXT:   $agpr0 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr0, implicit $exec, implicit-def $agpr0_agpr1
   ; GFX908-FLATSCR-NEXT:   $vgpr0 = SCRATCH_LOAD_DWORD $vgpr1, 4, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.1 + 4, addrspace 5)
-  ; GFX908-FLATSCR-NEXT:   $agpr1 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr0, implicit $exec, implicit-def $agpr0_agpr1
+  ; GFX908-FLATSCR-NEXT:   $agpr1 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr0, implicit $exec
   ; GFX908-FLATSCR-NEXT:   $vgpr1 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.2, addrspace 5)
   ; GFX908-FLATSCR-NEXT:   S_CBRANCH_SCC1 %bb.2, implicit $scc
   ; GFX908-FLATSCR-NEXT: {{  $}}
@@ -2071,7 +2071,7 @@ body:             |
   ; GFX908-NEXT:   $vgpr0 = BUFFER_LOAD_DWORD_OFFEN $vgpr1, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, implicit $exec :: (load (s32) from %stack.1 + 4, addrspace 5)
   ; GFX908-NEXT:   $agpr1 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr0, implicit $exec
   ; GFX908-NEXT:   $vgpr0 = BUFFER_LOAD_DWORD_OFFEN $vgpr1, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 8, 0, 0, implicit $exec :: (load (s32) from %stack.1 + 8, addrspace 5)
-  ; GFX908-NEXT:   $agpr2 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr0, implicit $exec, implicit-def $agpr0_agpr1_agpr2
+  ; GFX908-NEXT:   $agpr2 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr0, implicit $exec
   ; GFX908-NEXT:   $vgpr1 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, implicit $exec :: (load (s32) from %stack.2, addrspace 5)
   ; GFX908-NEXT:   S_CBRANCH_SCC1 %bb.2, implicit $scc
   ; GFX908-NEXT: {{  $}}
@@ -2319,7 +2319,7 @@ body:             |
   ; GFX90A-NEXT:   $vgpr40 = V_MOV_B32_e32 8904, implicit $exec
   ; GFX90A-NEXT:   $agpr0 = BUFFER_LOAD_DWORD_OFFEN $vgpr40, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, implicit $exec, implicit-def $agpr0_agpr1_agpr2 :: (load (s32) from %stack.1, addrspace 5)
   ; GFX90A-NEXT:   $agpr1 = BUFFER_LOAD_DWORD_OFFEN $vgpr40, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, implicit $exec :: (load (s32) from %stack.1 + 4, addrspace 5)
-  ; GFX90A-NEXT:   $agpr2 = BUFFER_LOAD_DWORD_OFFEN killed $vgpr40, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 8, 0, 0, implicit $exec, implicit-def $agpr0_agpr1_agpr2 :: (load (s32) from %stack.1 + 8, addrspace 5)
+  ; GFX90A-NEXT:   $agpr2 = BUFFER_LOAD_DWORD_OFFEN killed $vgpr40, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 8, 0, 0, implicit $exec :: (load (s32) from %stack.1 + 8, addrspace 5)
   ; GFX90A-NEXT:   $vgpr40 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 704, 0, 0, implicit $exec :: (load (s32) from %stack.226, addrspace 5)
   ; GFX90A-NEXT:   S_CBRANCH_SCC1 %bb.2, implicit $scc
   ; GFX90A-NEXT: {{  $}}
@@ -2572,7 +2572,7 @@ body:             |
   ; GFX908-FLATSCR-NEXT:   $vgpr0 = SCRATCH_LOAD_DWORD $vgpr1, 4, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.1 + 4, addrspace 5)
   ; GFX908-FLATSCR-NEXT:   $agpr1 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr0, implicit $exec
   ; GFX908-FLATSCR-NEXT:   $vgpr0 = SCRATCH_LOAD_DWORD $vgpr1, 8, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.1 + 8, addrspace 5)
-  ; GFX908-FLATSCR-NEXT:   $agpr2 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr0, implicit $exec, implicit-def $agpr0_agpr1_agpr2
+  ; GFX908-FLATSCR-NEXT:   $agpr2 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr0, implicit $exec
   ; GFX908-FLATSCR-NEXT:   $vgpr1 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.2, addrspace 5)
   ; GFX908-FLATSCR-NEXT:   S_CBRANCH_SCC1 %bb.2, implicit $scc
   ; GFX908-FLATSCR-NEXT: {{  $}}
@@ -4111,10 +4111,10 @@ body:             |
   ; GFX908-NEXT:   S_CMP_EQ_U32 0, 0, implicit-def $scc
   ; GFX908-NEXT:   BUFFER_STORE_DWORD_OFFSET killed $vgpr1, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, implicit $exec :: (store (s32) into %stack.2, addrspace 5)
   ; GFX908-NEXT:   $vgpr1 = V_MOV_B32_e32 8200, implicit $exec
-  ; GFX908-NEXT:   $vgpr0 = V_ACCVGPR_READ_B32_e64 $agpr0, implicit $exec, implicit-def $agpr0_agpr1
+  ; GFX908-NEXT:   $vgpr0 = V_ACCVGPR_READ_B32_e64 $agpr0, implicit $exec, implicit-def $agpr0_agpr1, implicit $agpr0_agpr1
   ; GFX908-NEXT:   BUFFER_STORE_DWORD_OFFEN $vgpr0, $vgpr1, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, implicit $exec, implicit $agpr0_agpr1 :: (store (s32) into %stack.1, addrspace 5)
-  ; GFX908-NEXT:   $vgpr0 = V_ACCVGPR_READ_B32_e64 $agpr1, implicit $exec
-  ; GFX908-NEXT:   BUFFER_STORE_DWORD_OFFEN $vgpr0, $vgpr1, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, implicit $exec, implicit $agpr0_agpr1 :: (store (s32) into %stack.1 + 4, addrspace 5)
+  ; GFX908-NEXT:   $vgpr0 = V_ACCVGPR_READ_B32_e64 $agpr1, implicit $exec, implicit $agpr0_agpr1
+  ; GFX908-NEXT:   BUFFER_STORE_DWORD_OFFEN $vgpr0, $vgpr1, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, implicit $exec :: (store (s32) into %stack.1 + 4, addrspace 5)
   ; GFX908-NEXT:   $vgpr1 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, implicit $exec :: (load (s32) from %stack.2, addrspace 5)
   ; GFX908-NEXT:   S_CBRANCH_SCC1 %bb.2, implicit $scc
   ; GFX908-NEXT: {{  $}}
@@ -4361,7 +4361,7 @@ body:             |
   ; GFX90A-NEXT:   BUFFER_STORE_DWORD_OFFSET killed $vgpr40, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 704, 0, 0, implicit $exec :: (store (s32) into %stack.226, addrspace 5)
   ; GFX90A-NEXT:   $vgpr40 = V_MOV_B32_e32 8904, implicit $exec
   ; GFX90A-NEXT:   BUFFER_STORE_DWORD_OFFEN $agpr0, $vgpr40, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, implicit $exec, implicit-def $agpr0_agpr1, implicit $agpr0_agpr1 :: (store (s32) into %stack.1, addrspace 5)
-  ; GFX90A-NEXT:   BUFFER_STORE_DWORD_OFFEN $agpr1, killed $vgpr40, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, implicit $exec, implicit $agpr0_agpr1 :: (store (s32) into %stack.1 + 4, addrspace 5)
+  ; GFX90A-NEXT:   BUFFER_STORE_DWORD_OFFEN $agpr1, killed $vgpr40, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, implicit $exec :: (store (s32) into %stack.1 + 4, addrspace 5)
   ; GFX90A-NEXT:   $vgpr40 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 704, 0, 0, implicit $exec :: (load (s32) from %stack.226, addrspace 5)
   ; GFX90A-NEXT:   S_CBRANCH_SCC1 %bb.2, implicit $scc
   ; GFX90A-NEXT: {{  $}}
@@ -4609,10 +4609,10 @@ body:             |
   ; GFX908-FLATSCR-NEXT:   SCRATCH_STORE_DWORD_SADDR killed $vgpr1, $sgpr32, 0, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.2, addrspace 5)
   ; GFX908-FLATSCR-NEXT:   $vgpr1 = V_MOV_B32_e32 $sgpr32, implicit $exec
   ; GFX908-FLATSCR-NEXT:   $vgpr1 = V_ADD_U32_e32 8200, $vgpr1, implicit $exec
-  ; GFX908-FLATSCR-NEXT:   $vgpr0 = V_ACCVGPR_READ_B32_e64 $agpr0, implicit $exec, implicit-def $agpr0_agpr1
+  ; GFX908-FLATSCR-NEXT:   $vgpr0 = V_ACCVGPR_READ_B32_e64 $agpr0, implicit $exec, implicit-def $agpr0_agpr1, implicit $agpr0_agpr1
   ; GFX908-FLATSCR-NEXT:   SCRATCH_STORE_DWORD $vgpr0, $vgpr1, 0, 0, implicit $exec, implicit $flat_scr, implicit $agpr0_agpr1 :: (store (s32) into %stack.1, addrspace 5)
-  ; GFX908-FLATSCR-NEXT:   $vgpr0 = V_ACCVGPR_READ_B32_e64 $agpr1, implicit $exec
-  ; GFX908-FLATSCR-NEXT:   SCRATCH_STORE_DWORD $vgpr0, $vgpr1, 4, 0, implicit $exec, implicit $flat_scr, implicit $agpr0_agpr1 :: (store (s32) into %stack.1 + 4, addrspace 5)
+  ; GFX908-FLATSCR-NEXT:   $vgpr0 = V_ACCVGPR_READ_B32_e64 $agpr1, implicit $exec, implicit $agpr0_agpr1
+  ; GFX908-FLATSCR-NEXT:   SCRATCH_STORE_DWORD $vgpr0, $vgpr1, 4, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.1 + 4, addrspace 5)
   ; GFX908-FLATSCR-NEXT:   $vgpr1 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.2, addrspace 5)
   ; GFX908-FLATSCR-NEXT:   S_CBRANCH_SCC1 %bb.2, implicit $scc
   ; GFX908-FLATSCR-NEXT: {{  $}}
@@ -5132,12 +5132,12 @@ body:             |
   ; GFX908-NEXT:   S_CMP_EQ_U32 0, 0, implicit-def $scc
   ; GFX908-NEXT:   BUFFER_STORE_DWORD_OFFSET killed $vgpr1, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, implicit $exec :: (store (s32) into %stack.2, addrspace 5)
   ; GFX908-NEXT:   $vgpr1 = V_MOV_B32_e32 8200, implicit $exec
-  ; GFX908-NEXT:   $vgpr0 = V_ACCVGPR_READ_B32_e64 $agpr0, implicit $exec, implicit-def $agpr0_agpr1_agpr2
+  ; GFX908-NEXT:   $vgpr0 = V_ACCVGPR_READ_B32_e64 $agpr0, implicit $exec, implicit-def $agpr0_agpr1_agpr2, implicit $agpr0_agpr1_agpr2
   ; GFX908-NEXT:   BUFFER_STORE_DWORD_OFFEN $vgpr0, $vgpr1, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, implicit $exec, implicit $agpr0_agpr1_agpr2 :: (store (s32) into %stack.1, addrspace 5)
   ; GFX908-NEXT:   $vgpr0 = V_ACCVGPR_READ_B32_e64 $agpr1, implicit $exec
   ; GFX908-NEXT:   BUFFER_STORE_DWORD_OFFEN $vgpr0, $vgpr1, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, implicit $exec :: (store (s32) into %stack.1 + 4, addrspace 5)
-  ; GFX908-NEXT:   $vgpr0 = V_ACCVGPR_READ_B32_e64 $agpr2, implicit $exec
-  ; GFX908-NEXT:   BUFFER_STORE_DWORD_OFFEN $vgpr0, $vgpr1, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 8, 0, 0, implicit $exec, implicit $agpr0_agpr1_agpr2 :: (store (s32) into %stack.1 + 8, addrspace 5)
+  ; GFX908-NEXT:   $vgpr0 = V_ACCVGPR_READ_B32_e64 $agpr2, implicit $exec, implicit $agpr0_agpr1_agpr2
+  ; GFX908-NEXT:   BUFFER_STORE_DWORD_OFFEN $vgpr0, $vgpr1, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 8, 0, 0, implicit $exec :: (store (s32) into %stack.1 + 8, addrspace 5)
   ; GFX908-NEXT:   $vgpr1 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, implicit $exec :: (load (s32) from %stack.2, addrspace 5)
   ; GFX908-NEXT:   S_CBRANCH_SCC1 %bb.2, implicit $scc
   ; GFX908-NEXT: {{  $}}
@@ -5385,7 +5385,7 @@ body:             |
   ; GFX90A-NEXT:   $vgpr40 = V_MOV_B32_e32 8904, implicit $exec
   ; GFX90A-NEXT:   BUFFER_STORE_DWORD_OFFEN $agpr0, $vgpr40, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, implicit $exec, implicit-def $agpr0_agpr1_agpr2, implicit $agpr0_agpr1_agpr2 :: (store (s32) into %stack.1, addrspace 5)
   ; GFX90A-NEXT:   BUFFER_STORE_DWORD_OFFEN $agpr1, $vgpr40, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, implicit $exec :: (store (s32) into %stack.1 + 4, addrspace 5)
-  ; GFX90A-NEXT:   BUFFER_STORE_DWORD_OFFEN $agpr2, killed $vgpr40, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 8, 0, 0, implicit $exec, implicit $agpr0_agpr1_agpr2 :: (store (s32) into %stack.1 + 8, addrspace 5)
+  ; GFX90A-NEXT:   BUFFER_STORE_DWORD_OFFEN $agpr2, killed $vgpr40, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 8, 0, 0, implicit $exec :: (store (s32) into %stack.1 + 8, addrspace 5)
   ; GFX90A-NEXT:   $vgpr40 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 704, 0, 0, implicit $exec :: (load (s32) from %stack.226, addrspace 5)
   ; GFX90A-NEXT:   S_CBRANCH_SCC1 %bb.2, implicit $scc
   ; GFX90A-NEXT: {{  $}}
@@ -5633,12 +5633,12 @@ body:             |
   ; GFX908-FLATSCR-NEXT:   SCRATCH_STORE_DWORD_SADDR killed $vgpr1, $sgpr32, 0, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.2, addrspace 5)
   ; GFX908-FLATSCR-NEXT:   $vgpr1 = V_MOV_B32_e32 $sgpr32, implicit $exec
   ; GFX908-FLATSCR-NEXT:   $vgpr1 = V_ADD_U32_e32 8200, $vgpr1, implicit $exec
-  ; GFX908-FLATSCR-NEXT:   $vgpr0 = V_ACCVGPR_READ_B32_e64 $agpr0, implicit $exec, implicit-def $agpr0_agpr1_agpr2
+  ; GFX908-FLATSCR-NEXT:   $vgpr0 = V_ACCVGPR_READ_B32_e64 $agpr0, implicit $exec, implicit-def $agpr0_agpr1_agpr2, implicit $agpr0_agpr1_agpr2
   ; GFX908-FLATSCR-NEXT:   SCRATCH_STORE_DWORD $vgpr0, $vgpr1, 0, 0, implicit $exec, implicit $flat_scr, implicit $agpr0_agpr1_agpr2 :: (store (s32) into %stack.1, addrspace 5)
   ; GFX908-FLATSCR-NEXT:   $vgpr0 = V_ACCVGPR_READ_B32_e64 $agpr1, implicit $exec
   ; GFX908-FLATSCR-NEXT:   SCRATCH_STORE_DWORD $vgpr0, $vgpr1, 4, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.1 + 4, addrspace 5)
-  ; GFX908-FLATSCR-NEXT:   $vgpr0 = V_ACCVGPR_READ_B32_e64 $agpr2, implicit $exec
-  ; GFX908-FLATSCR-NEXT:   SCRATCH_STORE_DWORD $vgpr0, $vgpr1, 8, 0, implicit $exec, implicit $flat_scr, implicit $agpr0_agpr1_agpr2 :: (store (s32) into %stack.1 + 8, addrspace 5)
+  ; GFX908-FLATSCR-NEXT:   $vgpr0 = V_ACCVGPR_READ_B32_e64 $agpr2, implicit $exec, implicit $agpr0_agpr1_agpr2
+  ; GFX908-FLATSCR-NEXT:   SCRATCH_STORE_DWORD $vgpr0, $vgpr1, 8, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %stack.1 + 8, addrspace 5)
   ; GFX908-FLATSCR-NEXT:   $vgpr1 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32) from %stack.2, addrspace 5)
   ; GFX908-FLATSCR-NEXT:   S_CBRANCH_SCC1 %bb.2, implicit $scc
   ; GFX908-FLATSCR-NEXT: {{  $}}
diff --git a/llvm/test/CodeGen/AMDGPU/av_spill_cross_bb_usage.mir b/llvm/test/CodeGen/AMDGPU/av_spill_cross_bb_usage.mir
index 3228962ed01f7..63c7eb6e7c83b 100644
--- a/llvm/test/CodeGen/AMDGPU/av_spill_cross_bb_usage.mir
+++ b/llvm/test/CodeGen/AMDGPU/av_spill_cross_bb_usage.mir
@@ -65,16 +65,16 @@ body:             |
   ; GCN-NEXT:   $vgpr40 = SI_SPILL_S32_TO_VGPR $sgpr30, 0, $vgpr40, implicit-def $sgpr30_sgpr31, implicit $sgpr30_sgpr31
   ; GCN-NEXT:   $vgpr40 = SI_SPILL_S32_TO_VGPR $sgpr31, 1, $vgpr40, implicit $sgpr30_sgpr31
   ; GCN-NEXT:   BUFFER_STORE_DWORD_OFFSET killed $vgpr14, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 52, 0, 0, implicit $exec, implicit-def $vgpr14_vgpr15, implicit $vgpr14_vgpr15 :: (store (s32) into %stack.1, addrspace 5)
-  ; GCN-NEXT:   BUFFER_STORE_DWORD_OFFSET killed $vgpr15, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 56, 0, 0, implicit $exec, implicit killed $vgpr14_vgpr15 :: (store (s32) into %stack.1 + 4, addrspace 5)
+  ; GCN-NEXT:   BUFFER_STORE_DWORD_OFFSET killed $vgpr15, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 56, 0, 0, implicit $exec :: (store (s32) into %stack.1 + 4, addrspace 5)
   ; GCN-NEXT:   BUFFER_STORE_DWORD_OFFSET killed $vgpr10, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 60, 0, 0, implicit $exec, implicit-def $vgpr10_vgpr11, implicit $vgpr10_vgpr11 :: (store (s32) into %stack.2, addrspace 5)
-  ; GCN-NEXT:   BUFFER_STORE_DWORD_OFFSET killed $vgpr11, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 64, 0, 0, implicit $exec, implicit killed $vgpr10_vgpr11 :: (store (s32) into %stack.2 + 4, addrspace 5)
+  ; GCN-NEXT:   BUFFER_STORE_DWORD_OFFSET killed $vgpr11, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 64, 0, 0, implicit $exec :: (store (s32) into %stack.2 + 4, addrspace 5)
   ; GCN-NEXT:   dead $sgpr30_sgpr31 = SI_CALL killed renamable $sgpr16_sgpr17, 0, csr_amdgpu, implicit-def dead $vgpr0
   ; GCN-NEXT:   $vgpr14 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 52, 0, 0, implicit $exec, implicit-def $vgpr14_vgpr15 :: (load (s32) from %stack.1, addrspace 5)
-  ; GCN-NEXT:   $vgpr15 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 56, 0, 0, implicit $exec, implicit-def $vgpr14_vgpr15 :: (load (s32) from %stack.1 + 4, addrspace 5)
+  ; GCN-NEXT:   $vgpr15 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 56, 0, 0, implicit $exec :: (load (s32) from %stack.1 + 4, addrspace 5)
   ; GCN-NEXT:   renamable $vgpr0_vgpr1 = nofpexcept V_FMA_F64_e64 0, killed $vgpr45_vgpr46, 0, killed $vgpr41_vgpr42, 0, killed $vgpr60_vgpr61, 0, 0, implicit $mode, implicit $exec
   ; GCN-NEXT:   FLAT_STORE_DWORDX2 killed renamable $vgpr58_vgpr59, killed renamable $vgpr0_vgpr1, 0, 0, implicit $exec, implicit $flat_scr :: (store (s64))
   ; GCN-NEXT:   $vgpr0 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 60, 0, 0, implicit $exec, implicit-def $vgpr0_vgpr1 :: (load (s32) from %stack.2, addrspace 5)
-  ; GCN-NEXT:   $vgpr1 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 64, 0, 0, implicit $exec, implicit-def $vgpr0_vgpr1 :: (load (s32) from %stack.2 + 4, addrspace 5)
+  ; GCN-NEXT:   $vgpr1 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 64, 0, 0, implicit $exec :: (load (s32) from %stack.2 + 4, addrspace 5)
   ; GCN-NEXT:   FLAT_STORE_DWORDX2 killed renamable $vgpr0_vgpr1, killed renamable $vgpr56_vgpr57, 0, 0, implicit $exec, implicit $flat_scr :: (store (s64))
   ; GCN-NEXT: {{  $}}
   ; GCN-NEXT: bb.2:
diff --git a/llvm/test/CodeGen/AMDGPU/pei-build-av-spill.mir b/llvm/test/CodeGen/AMDGPU/pei-build-av-spill.mir
index 9ce2464e0968a..45820091d59ce 100644
--- a/llvm/test/CodeGen/AMDGPU/pei-build-av-spill.mir
+++ b/llvm/test/CodeGen/AMDGPU/pei-build-av-spill.mir
@@ -95,9 +95,9 @@ body:             |
     ; MUBUF-LABEL: name: test_spill_av_v2
     ; MUBUF: $vgpr0_vgpr1 = IMPLICIT_DEF
     ; MUBUF-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, implicit $exec, implicit-def $vgpr0_vgpr1, implicit $vgpr0_vgpr1 :: (store (s32) into %stack.0, addrspace 5)
-    ; MUBUF-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr1, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, implicit $exec, implicit killed $vgpr0_vgpr1 :: (store (s32) into %stack.0 + 4, addrspace 5)
+    ; MUBUF-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr1, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, implicit $exec :: (store (s32) into %stack.0 + 4, addrspace 5)
     ; MUBUF-NEXT: $vgpr0 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, implicit $exec, implicit-def $vgpr0_vgpr1 :: (load (s32) from %stack.0, addrspace 5)
-    ; MUBUF-NEXT: $vgpr1 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, ...
[truncated]

@arsenm arsenm requested a review from jrbyrnes April 23, 2025 10:27
Copy link
Contributor

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reduced version of the failing situation with the lowered spill + machine-cp would be good

Comment on lines 1913 to 1916
if (NeedSuperRegImpOperand && IsFirstSubReg) {
unsigned State = RegState::Implicit | SrcDstRegState;
MIB.addReg(ValueReg, State);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (NeedSuperRegImpOperand && IsFirstSubReg) {
unsigned State = RegState::Implicit | SrcDstRegState;
MIB.addReg(ValueReg, State);
}
if (NeedSuperRegImpOperand && IsFirstSubReg)
MIB.addReg(ValueReg, RegState::Implicit | SrcDstRegState);

Can leave the second line alone, this is only dropping the IsLastSubReg part

@bababuck
Copy link
Contributor Author

bababuck commented Apr 29, 2025

Updated with the following changes:

  • Added new MIR test that reduces the failing scenario into a few lines of MIR code
  • Update the source code change to only change the if statement (no formatting changes to the surrounding code)
  • Rather than dropping IsLastSubReg, I instead qualify it with checking that the register state is not a define. I noticed in the new test I wrote that an implicit-kill was being dropped after my original change.
  • Updated the branch to be on top of the most recent main

@bababuck bababuck requested a review from arsenm April 29, 2025 21:54
Copy link
Contributor

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First line description should only contain direct description, not the issue number

--- |
define amdgpu_kernel void @agpr_spill_copy() #0 { ret void }

attributes #0 = { "amdgpu-num-vgpr"="32" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid using amdgpu-num-vgpr. Use amdgpu-waves-per-eu (does this even need a budget restriction? There's no actual allocation)

Copy link
Contributor Author

@bababuck bababuck Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without a budget, the registers will spill from vgpr into available agpr registers rather than onto the stack. The bug only occurs when there is enough register pressure such that some of the registers in the register group spill onto the stack, and some spill into the agprs. Technically, I could remove the budget restriction, but that would require adding code to the test to put all of the agprs into use (that seems much messier to me).

I made the change to using amdgpu-waves-per-eu in this latest push, just curious as to why this is preferred? It seems like a more indirect way of getting the behavior that I want (which is limiting the number of vgpr's to 32).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

amdgpu-num-vgpr is hacky and introduces a lot of edge cases that don't actually work, and I want to remove it. It's not future proof when the number of registers change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't using amdgpu-waves-per-eu also not robust to if the number of registers change in the context of this test? Because I'm relying on the fact that there are 256 registers so with amdgpu-waves-per-eu=8, 32 registers are allocated for each wave, which is the condition I need for the failing case to occur. Is there a better way to write the test to avoid this dependency; my end goal is to have exactly 2 unused agpr's when the register spill occurs.

Copy link
Contributor

@arsenm arsenm May 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Future proof for future targets. This test is fixed to a specific target. You should use the equivalent amdgpu-waves-per-eu

@@ -0,0 +1,43 @@
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
# RUN: llc -mtriple=amdgcn--amdhsa -mcpu=gfx942 -verify-machineinstrs -run-pass prologepilog,machine-cp -o - %s | FileCheck -check-prefix=GFX942 %s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# RUN: llc -mtriple=amdgcn--amdhsa -mcpu=gfx942 -verify-machineinstrs -run-pass prologepilog,machine-cp -o - %s | FileCheck -check-prefix=GFX942 %s
# RUN: llc -mtriple=amdgcn--amdhsa -mcpu=gfx942 -run-pass prologepilog,machine-cp -o - %s | FileCheck -check-prefix=GFX942 %s

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is updated with the latest push.

@bababuck bababuck requested a review from arsenm April 30, 2025 16:35
@arsenm
Copy link
Contributor

arsenm commented May 1, 2025

First line description should only contain direct description, not the issue number

Still should fix description

--- |
define amdgpu_kernel void @agpr_spill_copy() #0 { ret void }

attributes #0 = { "amdgpu-waves-per-eu"="8" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
attributes #0 = { "amdgpu-waves-per-eu"="8" }
attributes #0 = { "amdgpu-waves-per-eu"="8,8" }

@bababuck bababuck changed the title [AMDGPU] Fix for 131386 by reducing implicit definitions on register restoration [AMDGPU] Remove implicit definition of register group when restoring the last sub-register after a spill May 1, 2025
bababuck added 2 commits May 1, 2025 13:09
…-reg's are saved to agpr's

Replicates llvm#131386.

Currently, this test "fails", note the missing copy to $agpr14.
…the last sub-register after a spill

Extension of https://reviews.llvm.org/D141101 to remove the define flag from
the last stack memory access. Fixes case where COPY instructions are used
for some of the stack restoration, but the copies get optimized away during the
machine-cp pass.

Prior to this change, was possible to produce the following code:
  $agpr16_agpr17_agpr18_agpr19 = SCRATCH_LOAD_DWORDX4_ST 64, 0, implicit $exec, implicit $flat_scr, implicit-def $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23_agpr24_agpr25_agpr26_agpr27_agpr28_agpr29_agpr30_agpr31 :: (load (s128) from %stack.17, align 4, addrspace 5)
  $agpr20_agpr21_agpr22_agpr23 = SCRATCH_LOAD_DWORDX4_ST 80, 0, implicit $exec, implicit $flat_scr :: (load (s128) from %stack.17 + 16, align 4, addrspace 5)
  $agpr24_agpr25_agpr26_agpr27 = SCRATCH_LOAD_DWORDX4_ST 96, 0, implicit $exec, implicit $flat_scr :: (load (s128) from %stack.17 + 32, align 4, addrspace 5)
  $agpr31 = COPY $agpr112, implicit $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23_agpr24_agpr25_agpr26_agpr27_agpr28_agpr29_agpr30_agpr31
  $agpr30 = COPY $agpr208, implicit $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23_agpr24_agpr25_agpr26_agpr27_agpr28_agpr29_agpr30_agpr31
  $agpr28_agpr29 = SCRATCH_LOAD_DWORDX2_ST 112, 0, implicit $exec, implicit $flat_scr, implicit-def $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23_agpr24_agpr25_agpr26_agpr27_agpr28_agpr29_agpr30_agpr31 :: (load (s64) from %stack.17 + 48, align 4, addrspace 5)

where `$agpr30 = COPY $agpr208` would be optimized away by `machine-cp` pass.
Instead, change to:
  $agpr28_agpr29 = SCRATCH_LOAD_DWORDX2_ST 112, 0, implicit $exec, implicit $flat_scr :: (load (s64) from %stack.17 + 48, align 4, addrspace 5)

Fixes llvm#131386.
@bababuck bababuck force-pushed the rbuchner/131386 branch from 99b3a44 to b137354 Compare May 1, 2025 21:33
@bababuck
Copy link
Contributor Author

bababuck commented May 1, 2025

Fixed:

  • Issue and commit titles to no longer include issue number in the first line
  • "amdgpu-waves-per-eu"="8" --> "amdgpu-waves-per-eu"="8,8"
  • Rebased to latest main

if (NeedSuperRegImpOperand && (IsFirstSubReg || IsLastSubReg))
bool IsSrcDstDef = SrcDstRegState & RegState::Define;
if (NeedSuperRegImpOperand &&
(IsFirstSubReg || (IsLastSubReg && !IsSrcDstDef)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what the second condition is doing here, why isn't this just IsFirstSubReg && isDef?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the IsLastSubReg is needed for other flags, namely RegState::Kill. Removing it, the following change is seen in the test I wrote:

-    ; GFX942-NEXT: SCRATCH_STORE_DWORDX2_SADDR killed $vgpr12_vgpr13, $sgpr32, 48, 0, implicit $exec, implicit $flat_scr, implicit killed $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15 :: (store (s64) into %stack.0 + 48, align 4, addrspace 5)
+    ; GFX942-NEXT: SCRATCH_STORE_DWORDX2_SADDR killed $vgpr12_vgpr13, $sgpr32, 48, 0, implicit $exec, implicit $flat_scr :: (store (s64) into %stack.0 + 48, align 4, addrspace 5)

This seems to be the analog of implicit def on the first sub-register.

@arsenm arsenm merged commit 4fb7d19 into llvm:main May 5, 2025
11 checks passed
Copy link

github-actions bot commented May 5, 2025

@bababuck Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

GeorgeARM pushed a commit to GeorgeARM/llvm-project that referenced this pull request May 7, 2025
…the last sub-register after a spill (llvm#133986)

Extension of https://reviews.llvm.org/D141101 to remove the define flag
from the last stack memory access. Fixes case where COPY instructions
are used for some of the stack restoration, but the copies get optimized
away during the machine-cp pass.

Prior to this change, was possible to produce the following code:
$agpr16_agpr17_agpr18_agpr19 = SCRATCH_LOAD_DWORDX4_ST 64, 0, implicit
$exec, implicit $flat_scr, implicit-def
$agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23_agpr24_agpr25_agpr26_agpr27_agpr28_agpr29_agpr30_agpr31
:: (load (s128) from %stack.17, align 4, addrspace 5)
$agpr20_agpr21_agpr22_agpr23 = SCRATCH_LOAD_DWORDX4_ST 80, 0, implicit
$exec, implicit $flat_scr :: (load (s128) from %stack.17 + 16, align 4,
addrspace 5)
$agpr24_agpr25_agpr26_agpr27 = SCRATCH_LOAD_DWORDX4_ST 96, 0, implicit
$exec, implicit $flat_scr :: (load (s128) from %stack.17 + 32, align 4,
addrspace 5)
$agpr31 = COPY $agpr112, implicit
$agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23_agpr24_agpr25_agpr26_agpr27_agpr28_agpr29_agpr30_agpr31
$agpr30 = COPY $agpr208, implicit
$agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23_agpr24_agpr25_agpr26_agpr27_agpr28_agpr29_agpr30_agpr31
$agpr28_agpr29 = SCRATCH_LOAD_DWORDX2_ST 112, 0, implicit $exec,
implicit $flat_scr, implicit-def
$agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23_agpr24_agpr25_agpr26_agpr27_agpr28_agpr29_agpr30_agpr31
:: (load (s64) from %stack.17 + 48, align 4, addrspace 5)

where `$agpr30 = COPY $agpr208` would be optimized away by `machine-cp`
pass. Instead, change to:
$agpr28_agpr29 = SCRATCH_LOAD_DWORDX2_ST 112, 0, implicit $exec,
implicit $flat_scr :: (load (s64) from %stack.17 + 48, align 4,
addrspace 5)

Fixes llvm#131386.

Made the simple fix, but I'm not completely comfortable with this change
since the reason for the previous inclusion of `IsLastSubReg` is unclear
to me.

@krzysz00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[AMDGPU] Incorrect outputs from tiling kernel after PR #129464
4 participants