[AArch64] Restore Z-registers before P-registers #79623
The reverse order for restoring SVE registers in the epilogue was added to guarantee performance. Since then, further work has improved SVE frame restore, and the scheduler may also reorder the restores, undoing the reverse order.
@llvm/pr-subscribers-backend-aarch64
Author: None (CarolineConcatto)
Changes
The reverse order for restoring SVE registers in the epilogue was added to guarantee performance. Since then, further work has improved SVE frame restore, and the scheduler may also reorder the restores, undoing the reverse order.
Patch is 258.32 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/79623.diff
18 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp b/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
index cffd414221c30cf..74e94e85d1eb9d5 100644
--- a/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
@@ -3186,11 +3186,6 @@ bool AArch64FrameLowering::restoreCalleeSavedRegisters(
return MIB->getIterator();
};
- // SVE objects are always restored in reverse order.
- for (const RegPairInfo &RPI : reverse(RegPairs))
- if (RPI.isScalable())
- EmitMI(RPI);
-
if (homogeneousPrologEpilog(MF, &MBB)) {
auto MIB = BuildMI(MBB, MBBI, DL, TII.get(AArch64::HOM_Epilog))
.setMIFlag(MachineInstr::FrameDestroy);
@@ -3204,8 +3199,6 @@ bool AArch64FrameLowering::restoreCalleeSavedRegisters(
if (ReverseCSRRestoreSeq) {
MachineBasicBlock::iterator First = MBB.end();
for (const RegPairInfo &RPI : reverse(RegPairs)) {
- if (RPI.isScalable())
- continue;
MachineBasicBlock::iterator It = EmitMI(RPI);
if (First == MBB.end())
First = It;
@@ -3214,8 +3207,6 @@ bool AArch64FrameLowering::restoreCalleeSavedRegisters(
MBB.splice(MBBI, &MBB, First);
} else {
for (const RegPairInfo &RPI : RegPairs) {
- if (RPI.isScalable())
- continue;
(void)EmitMI(RPI);
}
}
diff --git a/llvm/test/CodeGen/AArch64/active_lane_mask.ll b/llvm/test/CodeGen/AArch64/active_lane_mask.ll
index a65c5d66677946a..15a61910fb10099 100644
--- a/llvm/test/CodeGen/AArch64/active_lane_mask.ll
+++ b/llvm/test/CodeGen/AArch64/active_lane_mask.ll
@@ -191,8 +191,8 @@ define <vscale x 32 x i1> @lane_mask_nxv32i1_i32(i32 %index, i32 %TC) {
; CHECK-NEXT: uzp1 p3.h, p3.h, p4.h
; CHECK-NEXT: cmphi p0.s, p0/z, z25.s, z1.s
; CHECK-NEXT: uzp1 p4.h, p5.h, p6.h
-; CHECK-NEXT: ldr p6, [sp, #5, mul vl] // 2-byte Folded Reload
; CHECK-NEXT: ldr p5, [sp, #6, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p6, [sp, #5, mul vl] // 2-byte Folded Reload
; CHECK-NEXT: uzp1 p2.h, p2.h, p0.h
; CHECK-NEXT: uzp1 p0.b, p1.b, p3.b
; CHECK-NEXT: uzp1 p1.b, p4.b, p2.b
diff --git a/llvm/test/CodeGen/AArch64/framelayout-sve-calleesaves-fix.mir b/llvm/test/CodeGen/AArch64/framelayout-sve-calleesaves-fix.mir
index 3dba21d59b4087e..aed314507361918 100644
--- a/llvm/test/CodeGen/AArch64/framelayout-sve-calleesaves-fix.mir
+++ b/llvm/test/CodeGen/AArch64/framelayout-sve-calleesaves-fix.mir
@@ -19,8 +19,8 @@
; CHECK-NEXT: // implicit-def: $p4
; CHECK-NEXT: addvl sp, sp, #1
; CHECK-NEXT: .cfi_escape 0x0f, 0x0c, 0x8f, 0x00, 0x11, 0x10, 0x22, 0x11, 0x10, 0x92, 0x2e, 0x00, 0x1e, 0x22 // sp + 16 + 16 * VG
- ; CHECK-NEXT: ldr p4, [sp, #7, mul vl] // 2-byte Folded Reload
; CHECK-NEXT: ldr z8, [sp, #1, mul vl] // 16-byte Folded Reload
+ ; CHECK-NEXT: ldr p4, [sp, #7, mul vl] // 2-byte Folded Reload
; CHECK-NEXT: addvl sp, sp, #2
; CHECK-NEXT: .cfi_def_cfa wsp, 16
; CHECK-NEXT: .cfi_restore z8
diff --git a/llvm/test/CodeGen/AArch64/framelayout-sve.mir b/llvm/test/CodeGen/AArch64/framelayout-sve.mir
index 213d7919e4a7270..8bfd12067d77c9d 100644
--- a/llvm/test/CodeGen/AArch64/framelayout-sve.mir
+++ b/llvm/test/CodeGen/AArch64/framelayout-sve.mir
@@ -710,9 +710,9 @@ body: |
# CHECK: $sp = frame-destroy ADDXri $sp, 32, 0
# CHECK-NEXT: frame-destroy CFI_INSTRUCTION escape 0x0f, 0x0c, 0x8f, 0x00, 0x11, 0x10, 0x22, 0x11, 0x08, 0x92, 0x2e, 0x00, 0x1e, 0x22
-# CHECK: $p6 = frame-destroy LDR_PXI $sp, 5
-# CHECK: $p5 = frame-destroy LDR_PXI $sp, 6
# CHECK: $p4 = frame-destroy LDR_PXI $sp, 7
+# CHECK: $p5 = frame-destroy LDR_PXI $sp, 6
+# CHECK: $p6 = frame-destroy LDR_PXI $sp, 5
# CHECK: $sp = frame-destroy ADDVL_XXI $sp, 1
# CHECK-NEXT: frame-destroy CFI_INSTRUCTION def_cfa $wsp, 16
# CHECK-NEXT: early-clobber $sp, $fp = frame-destroy LDRXpost $sp, 16 :: (load (s64) from %stack.4)
@@ -772,9 +772,9 @@ body: |
# CHECK: $sp = frame-destroy ADDXri $sp, 32, 0
# CHECK-NEXT: frame-destroy CFI_INSTRUCTION escape 0x0f, 0x0c, 0x8f, 0x00, 0x11, 0x10, 0x22, 0x11, 0x18, 0x92, 0x2e, 0x00, 0x1e, 0x22
-# CHECK-NEXT: $z10 = frame-destroy LDR_ZXI $sp, 0
-# CHECK-NEXT: $z9 = frame-destroy LDR_ZXI $sp, 1
# CHECK-NEXT: $z8 = frame-destroy LDR_ZXI $sp, 2
+# CHECK-NEXT: $z9 = frame-destroy LDR_ZXI $sp, 1
+# CHECK-NEXT: $z10 = frame-destroy LDR_ZXI $sp, 0
# CHECK-NEXT: $sp = frame-destroy ADDVL_XXI $sp, 3
# CHECK-NEXT: frame-destroy CFI_INSTRUCTION def_cfa $wsp, 16
# CHECK-NEXT: frame-destroy CFI_INSTRUCTION restore $z8
@@ -873,14 +873,14 @@ body: |
# CHECK-NEXT: frame-destroy CFI_INSTRUCTION escape 0x0f, 0x0d, 0x8f, 0x00, 0x11, 0x20, 0x22, 0x11, 0x98, 0x01, 0x92, 0x2e, 0x00, 0x1e, 0x22
# CHECK: $sp = frame-destroy ADDVL_XXI $sp, 1
# CHECK-NEXT: frame-destroy CFI_INSTRUCTION escape 0x0f, 0x0d, 0x8f, 0x00, 0x11, 0x20, 0x22, 0x11, 0x90, 0x01, 0x92, 0x2e, 0x00, 0x1e, 0x22
-# CHECK: $p15 = frame-destroy LDR_PXI $sp, 4
-# CHECK: $p14 = frame-destroy LDR_PXI $sp, 5
-# CHECK: $p5 = frame-destroy LDR_PXI $sp, 14
-# CHECK: $p4 = frame-destroy LDR_PXI $sp, 15
-# CHECK: $z23 = frame-destroy LDR_ZXI $sp, 2
-# CHECK: $z22 = frame-destroy LDR_ZXI $sp, 3
-# CHECK: $z9 = frame-destroy LDR_ZXI $sp, 16
# CHECK: $z8 = frame-destroy LDR_ZXI $sp, 17
+# CHECK: $z9 = frame-destroy LDR_ZXI $sp, 16
+# CHECK: $z22 = frame-destroy LDR_ZXI $sp, 3
+# CHECK: $z23 = frame-destroy LDR_ZXI $sp, 2
+# CHECK: $p4 = frame-destroy LDR_PXI $sp, 15
+# CHECK: $p5 = frame-destroy LDR_PXI $sp, 14
+# CHECK: $p14 = frame-destroy LDR_PXI $sp, 5
+# CHECK: $p15 = frame-destroy LDR_PXI $sp, 4
# CHECK: $sp = frame-destroy ADDVL_XXI $sp, 18
# CHECK-NEXT: frame-destroy CFI_INSTRUCTION def_cfa $wsp, 32
# CHECK-NEXT: frame-destroy CFI_INSTRUCTION restore $z8
@@ -1037,14 +1037,14 @@ body: |
# CHECK-NEXT: $sp = frame-setup ANDXri killed $[[TMP]]
# CHECK: $sp = frame-destroy ADDVL_XXI $fp, -18
-# CHECK-NEXT: $p15 = frame-destroy LDR_PXI $sp, 4
-# CHECK-NEXT: $p14 = frame-destroy LDR_PXI $sp, 5
-# CHECK: $p5 = frame-destroy LDR_PXI $sp, 14
-# CHECK-NEXT: $p4 = frame-destroy LDR_PXI $sp, 15
+# CHECK: $z8 = frame-destroy LDR_ZXI $sp, 17
+# CHECK-NEXT: $z9 = frame-destroy LDR_ZXI $sp, 16
+# CHECK: $z22 = frame-destroy LDR_ZXI $sp, 3
# CHECK-NEXT: $z23 = frame-destroy LDR_ZXI $sp, 2
-# CHECK-NEXT: $z22 = frame-destroy LDR_ZXI $sp, 3
-# CHECK: $z9 = frame-destroy LDR_ZXI $sp, 16
-# CHECK-NEXT: $z8 = frame-destroy LDR_ZXI $sp, 17
+# CHECK-NEXT: $p4 = frame-destroy LDR_PXI $sp, 15
+# CHECK-NEXT: $p5 = frame-destroy LDR_PXI $sp, 14
+# CHECK: $p14 = frame-destroy LDR_PXI $sp, 5
+# CHECK-NEXT: $p15 = frame-destroy LDR_PXI $sp, 4
# CHECK-NEXT: frame-destroy CFI_INSTRUCTION restore $z8
# CHECK-NEXT: frame-destroy CFI_INSTRUCTION restore $z9
# CHECK-NEXT: frame-destroy CFI_INSTRUCTION restore $z10
@@ -1198,10 +1198,10 @@ body: |
# CHECK: $sp = frame-destroy ADDVL_XXI $sp, 7
# CHECK-NEXT: frame-destroy CFI_INSTRUCTION escape 0x0f, 0x0c, 0x8f, 0x00, 0x11, 0x10, 0x22, 0x11, 0x18, 0x92, 0x2e, 0x00, 0x1e, 0x22
-# CHECK-NEXT: $p15 = frame-destroy LDR_PXI $sp, 6
-# CHECK-NEXT: $p4 = frame-destroy LDR_PXI $sp, 7
-# CHECK-NEXT: $z23 = frame-destroy LDR_ZXI $sp, 1
# CHECK-NEXT: $z8 = frame-destroy LDR_ZXI $sp, 2
+# CHECK-NEXT: $z23 = frame-destroy LDR_ZXI $sp, 1
+# CHECK-NEXT: $p4 = frame-destroy LDR_PXI $sp, 7
+# CHECK-NEXT: $p15 = frame-destroy LDR_PXI $sp, 6
# CHECK-NEXT: $sp = frame-destroy ADDVL_XXI $sp, 3
# CHECK-NEXT: frame-destroy CFI_INSTRUCTION def_cfa $wsp, 16
# CHECK-NEXT: frame-destroy CFI_INSTRUCTION restore $z8
diff --git a/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll b/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll
index 5d0c9127d3ebb24..23acc57159907b9 100644
--- a/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll
+++ b/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll
@@ -219,34 +219,34 @@ define <vscale x 2 x double> @streaming_compatible_with_scalable_vectors(<vscale
; CHECK-NEXT: ldr z1, [sp] // 16-byte Folded Reload
; CHECK-NEXT: fadd z0.d, z1.d, z0.d
; CHECK-NEXT: addvl sp, sp, #2
-; CHECK-NEXT: ldr p15, [sp, #4, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr z23, [sp, #2, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr z22, [sp, #3, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr p14, [sp, #5, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr z21, [sp, #4, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr z20, [sp, #5, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr p13, [sp, #6, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr z19, [sp, #6, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr z18, [sp, #7, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr p12, [sp, #7, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr z17, [sp, #8, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr z16, [sp, #9, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr p11, [sp, #8, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr z15, [sp, #10, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr z14, [sp, #11, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr p10, [sp, #9, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr z13, [sp, #12, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr z12, [sp, #13, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr p9, [sp, #10, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr z11, [sp, #14, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr z10, [sp, #15, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr p8, [sp, #11, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr z9, [sp, #16, mul vl] // 16-byte Folded Reload
; CHECK-NEXT: ldr z8, [sp, #17, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr p7, [sp, #12, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr p6, [sp, #13, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr p5, [sp, #14, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr z9, [sp, #16, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z10, [sp, #15, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z11, [sp, #14, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z12, [sp, #13, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z13, [sp, #12, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z14, [sp, #11, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z15, [sp, #10, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z16, [sp, #9, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z17, [sp, #8, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z18, [sp, #7, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z19, [sp, #6, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z20, [sp, #5, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z21, [sp, #4, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z22, [sp, #3, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z23, [sp, #2, mul vl] // 16-byte Folded Reload
; CHECK-NEXT: ldr p4, [sp, #15, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p5, [sp, #14, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p6, [sp, #13, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p7, [sp, #12, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p8, [sp, #11, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p9, [sp, #10, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p10, [sp, #9, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p11, [sp, #8, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p12, [sp, #7, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p13, [sp, #6, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p14, [sp, #5, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p15, [sp, #4, mul vl] // 2-byte Folded Reload
; CHECK-NEXT: addvl sp, sp, #18
; CHECK-NEXT: ldp x30, x19, [sp, #16] // 16-byte Folded Reload
; CHECK-NEXT: ldr x29, [sp], #32 // 8-byte Folded Reload
@@ -311,34 +311,34 @@ define <vscale x 2 x i1> @streaming_compatible_with_predicate_vectors(<vscale x
; CHECK-NEXT: ldr p1, [sp, #6, mul vl] // 2-byte Folded Reload
; CHECK-NEXT: and p0.b, p1/z, p1.b, p0.b
; CHECK-NEXT: addvl sp, sp, #1
-; CHECK-NEXT: ldr p15, [sp, #4, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr z23, [sp, #2, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr z22, [sp, #3, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr p14, [sp, #5, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr z21, [sp, #4, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr z20, [sp, #5, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr p13, [sp, #6, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr z19, [sp, #6, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr z18, [sp, #7, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr p12, [sp, #7, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr z17, [sp, #8, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr z16, [sp, #9, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr p11, [sp, #8, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr z15, [sp, #10, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr z14, [sp, #11, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr p10, [sp, #9, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr z13, [sp, #12, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr z12, [sp, #13, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr p9, [sp, #10, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr z11, [sp, #14, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr z10, [sp, #15, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr p8, [sp, #11, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr z9, [sp, #16, mul vl] // 16-byte Folded Reload
; CHECK-NEXT: ldr z8, [sp, #17, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr p7, [sp, #12, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr p6, [sp, #13, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr p5, [sp, #14, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr z9, [sp, #16, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z10, [sp, #15, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z11, [sp, #14, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z12, [sp, #13, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z13, [sp, #12, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z14, [sp, #11, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z15, [sp, #10, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z16, [sp, #9, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z17, [sp, #8, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z18, [sp, #7, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z19, [sp, #6, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z20, [sp, #5, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z21, [sp, #4, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z22, [sp, #3, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z23, [sp, #2, mul vl] // 16-byte Folded Reload
; CHECK-NEXT: ldr p4, [sp, #15, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p5, [sp, #14, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p6, [sp, #13, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p7, [sp, #12, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p8, [sp, #11, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p9, [sp, #10, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p10, [sp, #9, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p11, [sp, #8, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p12, [sp, #7, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p13, [sp, #6, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p14, [sp, #5, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p15, [sp, #4, mul vl] // 2-byte Folded Reload
; CHECK-NEXT: addvl sp, sp, #18
; CHECK-NEXT: ldp x30, x19, [sp, #16] // 16-byte Folded Reload
; CHECK-NEXT: ldr x29, [sp], #32 // 8-byte Folded Reload
diff --git a/llvm/test/CodeGen/AArch64/sme-streaming-interface.ll b/llvm/test/CodeGen/AArch64/sme-streaming-interface.ll
index dd7d6470ad7b084..efb904b9bb333e3 100644
--- a/llvm/test/CodeGen/AArch64/sme-streaming-interface.ll
+++ b/llvm/test/CodeGen/AArch64/sme-streaming-interface.ll
@@ -187,34 +187,34 @@ define <vscale x 4 x i32> @smstart_clobber_sve(<vscale x 4 x i32> %x) nounwind {
; CHECK-NEXT: smstop sm
; CHECK-NEXT: ldr z0, [sp] // 16-byte Folded Reload
; CHECK-NEXT: addvl sp, sp, #1
-; CHECK-NEXT: ldr p15, [sp, #4, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr z23, [sp, #2, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr z22, [sp, #3, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr p14, [sp, #5, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr z21, [sp, #4, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr z20, [sp, #5, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr p13, [sp, #6, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr z19, [sp, #6, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr z18, [sp, #7, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr p12, [sp, #7, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr z17, [sp, #8, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr z16, [sp, #9, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr p11, [sp, #8, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr z15, [sp, #10, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr z14, [sp, #11, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr p10, [sp, #9, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr z13, [sp, #12, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr z12, [sp, #13, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr p9, [sp, #10, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr z11, [sp, #14, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr z10, [sp, #15, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr p8, [sp, #11, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr z9, [sp, #16, mul vl] // 16-byte Folded Reload
; CHECK-NEXT: ldr z8, [sp, #17, mul vl] // 16-byte Folded Reload
-; CHECK-NEXT: ldr p7, [sp, #12, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr p6, [sp, #13, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr p5, [sp, #14, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr z9, [sp, #16, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z10, [sp, #15, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z11, [sp, #14, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z12, [sp, #13, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z13, [sp, #12, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z14, [sp, #11, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z15, [sp, #10, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z16, [sp, #9, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z17, [sp, #8, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z18, [sp, #7, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z19, [sp, #6, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z20, [sp, #5, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z21, [sp, #4, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z22, [sp, #3, mul vl] // 16-byte Folded Reload
+; CHECK-NEXT: ldr z23, [sp, #2, mul vl] // 16-byte Folded Reload
; CHECK-NEXT: ldr p4, [sp, #15, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p5, [sp, #14, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p6, [sp, #13, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p7, [sp, #12, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p8, [sp, #11, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p9, [sp, #10, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p10, [sp, #9, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p11, [sp, #8, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p12, [sp, #7, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p13, [sp, #6, mul vl] // 2-byte Folded Reload
+; CHECK-NEXT: ldr p14, [...
[truncated]
; CHECK-NEXT: ldr z9, [sp, #16, mul vl] // 16-byte Folded Reload
; CHECK-NEXT: ldr z10, [sp, #15, mul vl] // 16-byte Folded Reload
; CHECK-NEXT: ldr z11, [sp, #14, mul vl] // 16-byte Folded Reload
; CHECK-NEXT: ldr z12, [sp, #13, mul vl] // 16-byte Folded Reload
; CHECK-NEXT: ldr z13, [sp, #12, mul vl] // 16-byte Folded Reload
; CHECK-NEXT: ldr z14, [sp, #11, mul vl] // 16-byte Folded Reload
; CHECK-NEXT: ldr z15, [sp, #10, mul vl] // 16-byte Folded Reload
; CHECK-NEXT: ldr z16, [sp, #9, mul vl] // 16-byte Folded Reload
; CHECK-NEXT: ldr z17, [sp, #8, mul vl] // 16-byte Folded Reload
; CHECK-NEXT: ldr z18, [sp, #7, mul vl] // 16-byte Folded Reload
; CHECK-NEXT: ldr z19, [sp, #6, mul vl] // 16-byte Folded Reload
; CHECK-NEXT: ldr z20, [sp, #5, mul vl] // 16-byte Folded Reload
; CHECK-NEXT: ldr z21, [sp, #4, mul vl] // 16-byte Folded Reload
; CHECK-NEXT: ldr z22, [sp, #3, mul vl] // 16-byte Folded Reload
; CHECK-NEXT: ldr z23, [sp, #2, mul vl] // 16-byte Folded Reload
; CHECK-NEXT: ldr p4, [sp, #15, mul vl] // 2-byte Folded Reload
; CHECK-NEXT: ldr p5, [sp, #14, mul vl] // 2-byte Folded Reload
; CHECK-NEXT: ldr p6, [sp, #13, mul vl] // 2-byte Folded Reload
; CHECK-NEXT: ldr p7, [sp, #12, mul vl] // 2-byte Folded Reload
; CHECK-NEXT: ldr p8, [sp, #11, mul vl] // 2-byte Folded Reload
; CHECK-NEXT: ldr p9, [sp, #10, mul vl] // 2-byte Folded Reload
; CHECK-NEXT: ldr p10, [sp, #9, mul vl] // 2-byte Folded Reload
; CHECK-NEXT: ldr p11, [sp, #8, mul vl] // 2-byte Folded Reload
; CHECK-NEXT: ldr p12, [sp, #7, mul vl] // 2-byte Folded Reload
; CHECK-NEXT: ldr p13, [sp, #6, mul vl] // 2-byte Folded Reload
; CHECK-NEXT: ldr p14, [sp, #5, mul vl] // 2-byte Folded Reload
; CHECK-NEXT: ldr p15, [sp, #4, mul vl] // 2-byte Folded Reload
I would probably recommend keeping incrementing addresses for performance reasons, but you could split the order of things such that you end up with:
ldr z23, [sp, #2, mul vl] // 16-byte Folded Reload
ldr z22, [sp, #3, mul vl] // 16-byte Folded Reload
...
ldr z8, [sp, #17, mul vl] // 16-byte Folded Reload
ldr p15, [sp, #4, mul vl] // 2-byte Folded Reload
ldr p14, [sp, #5, mul vl] // 2-byte Folded Reload
...
ldr p4, [sp, #15, mul vl] // 2-byte Folded Reload
You could do that by doing a stable_sort of the RegPairs (rather than a regular sort), so that you only swap when the two regpairs are a combination of P and Z registers. A stable_sort should retain the original order within the P and Z spill/reload sequences.
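A minimal sketch of the stable_sort idea above, assuming a simplified stand-in for RegPairInfo. The struct RegPair, the enum RegType and the helper sortZBeforeP are hypothetical names for illustration, not the LLVM definitions. The comparator only orders a Z entry ahead of a P entry, so entries of the same kind keep their original relative order:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-in for LLVM's RegPairInfo.
enum class RegType { ZPR, PPR };
struct RegPair {
  RegType Type;
  int Reg;    // register number, e.g. 23 for z23
  int Offset; // slot index as in "[sp, #Offset, mul vl]"
};

// Move every Z entry ahead of every P entry while keeping the existing
// relative order inside each group.
void sortZBeforeP(std::vector<RegPair> &Pairs) {
  auto IsZ = [](const RegPair &R) { return R.Type == RegType::ZPR; };
  std::stable_sort(Pairs.begin(), Pairs.end(),
                   [](const RegPair &A, const RegPair &B) {
                     // Only "Z before P" is ordered; two entries of the
                     // same type compare as equivalent.
                     return A.Type == RegType::ZPR && B.Type == RegType::PPR;
                   });
  // Postcondition: all Z entries now precede all P entries.
  assert(std::is_partitioned(Pairs.begin(), Pairs.end(), IsZ));
}
```

Applied to an interleaved sequence such as p15, z23, z22, p14, this would yield z23, z22, p15, p14: the Z reloads come first, and each group still walks its slots in increasing address order.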
@@ -3196,9 +3196,21 @@ bool AArch64FrameLowering::restoreCalleeSavedRegisters(
return true;
}

SmallVector<RegPairInfo, 8> RegPairsScalable = RegPairs;
The copy here is not necessary (and stable_sort may do another pair of copies, sigh)
LGTM. I would suggest dropping the NFC from the commit message and rewording it a bit. Something similar to:
auto PPRBegin =
std::find_if(RegPairs.begin(), RegPairs.end(), [](const RegPairInfo &c) {
return c.Type == RegPairInfo::PPR;
});
nit:
-auto PPRBegin =
-std::find_if(RegPairs.begin(), RegPairs.end(), [](const RegPairInfo &c) {
-return c.Type == RegPairInfo::PPR;
-});
+auto IsPPR = [](const RegPairInfo &c) { return c.Type == RegPairInfo::PPR; };
+auto PPRBegin = std::find_if(RegPairs.begin(), RegPairs.end(), IsPPR);
+auto PPREnd = std::find_if(RegPairs.rbegin(), RegPairs.rend(), IsPPR);
(same suggestion for ZPR)
RegPairs.rbegin(), RegPairs.rend(),
[](const RegPairInfo &c) { return c.Type == RegPairInfo::ZPR; });
std::reverse(ZPRBegin, ZPREnd.base());

if (ReverseCSRRestoreSeq) {
If ReverseCSRRestoreSeq == true, then it always reverses all RegPairs. That means that if RegPairs has { ..., P4, P5, ... Z8, Z9, .. } and ReverseCSRRestoreSeq == true, the Z registers will be restored first, which I thought is what you want to avoid (reading the commit message in the PR).
That said, I think this option is never really used other than in some of the tests. Which makes me wonder, can we remove it? @francisvm
We can remove it! I added it for experiments at first and it didn't turn out useful in the end.
// For performance reasons restore SVE register in increasing order
auto IsPPR = [](const RegPairInfo &c) { return c.Type == RegPairInfo::PPR; };
auto PPRBegin = std::find_if(RegPairs.begin(), RegPairs.end(), IsPPR);
auto PPREnd = std::find_if(RegPairs.rbegin(), RegPairs.rend(), IsPPR);
std::reverse(PPRBegin, PPREnd.base());
I'm seeing a crash here in a downstream project, running on Windows compiled via MSVC.
Attached the .bc file and the .ll output from llvm-dis [file.bc]: 79623_repro_files.zip
If I revert 3f0404a, the crash goes away.
In debug builds this is hitting the following assert in the CRT, so it looks like this std::reverse is incorrect and just happens to be passing on libc++:
#if _ITERATOR_DEBUG_LEVEL != 0
template <class _Ty>
constexpr void _Verify_range(const _Ty* const _First, const _Ty* const _Last) noexcept {
// special case range verification for pointers
_STL_VERIFY(_First <= _Last, "transposed pointer range");
}
#endif // _ITERATOR_DEBUG_LEVEL != 0
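One possible reading of the assertion above, offered as an assumption rather than a confirmed diagnosis: when no entry matches, the forward std::find_if returns end() and the reverse search returns rend(), whose .base() is begin(). std::reverse then receives the range [end(), begin()), which is transposed for any non-empty container. A minimal sketch of a guarded version follows. The simplified RegPairInfo and the helper reverseRun are hypothetical, and this is not necessarily the fix that was proposed upstream:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-in for LLVM's RegPairInfo.
struct RegPairInfo {
  enum RegType { GPR, ZPR, PPR } Type;
  int Reg;
};

// Reverse the span from the first to the last entry of type T, if any.
// Assumes entries of type T are contiguous, as in the snippet under review.
void reverseRun(std::vector<RegPairInfo> &Pairs, RegPairInfo::RegType T) {
  auto IsT = [T](const RegPairInfo &R) { return R.Type == T; };
  auto Begin = std::find_if(Pairs.begin(), Pairs.end(), IsT);
  if (Begin == Pairs.end())
    return; // No match: the reverse search would give rend(), i.e. base() == begin().
  // A match exists, so base() points one past the last matching entry.
  auto End = std::find_if(Pairs.rbegin(), Pairs.rend(), IsT).base();
  assert(Begin <= End && "transposed iterator range");
  std::reverse(Begin, End);
}
```

With the early return, a function that saves no P-registers (or no Z-registers) never hands std::reverse an inverted range, regardless of which standard library checks its iterators.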
OK, I will revert this patch and 493f101 until we find a fix for it.
Just sent a proposed fix: #82392
This reverts commit 3f0404a. std::reverse is breaking some builds
Patches reverted:
This reverts commit 3f0404a.
This fixes the Windows error reported on #16483. An upstream revert or fix-forward is being discussed on llvm/llvm-project#79623 (this local revert applies cleanly from the LLVM commit we are pinned to). ci-extra: build_test_all_windows --------- Co-authored-by: Scott Todd <[email protected]>
This is needed by PR #77665 [1], which uses a P-register while restoring Z-registers. The reverse order for restoring SVE registers in the epilogue was added to guarantee performance, but further work has since improved SVE frame restore, and the scheduler may also reorder the restores, undoing the reverse order. This also fixes the problem reported on Windows with std::reverse and .base(). [1] llvm#77665
This is needed by PR #77665 [1], which uses a P-register while restoring Z-registers. The reverse order for restoring SVE registers in the epilogue was added to guarantee performance, but further work has since improved SVE frame restore, and the scheduler may also reorder the restores, undoing the reverse order. This also fixes the problem reported in PR #79623 on Windows with std::reverse and .base(). [1] #77665