[AArch64] Override isLSRCostLess, take number of instructions into account #84189


Merged

Conversation

huntergr-arm
Collaborator

Adds an AArch64-specific version of isLSRCostLess, changing the relative importance of the various terms in the formulae being evaluated.

This has been split out from my vscale-aware LSR work, see the RFC for reference: https://discourse.llvm.org/t/rfc-vscale-aware-loopstrengthreduce/77131

I intend to do some benchmarking of this independently of the LSR work to check that there are no major regressions.
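
The gist of the change is the priority order of the tuple comparison. For illustration, here is a minimal standalone sketch of the old and new orderings; the AArch64 order matches the patch below, while the generic order is an approximation of TargetTransformInfoImplBase and may not match the current source exactly.

#include <tuple>

// Illustrative stand-in for llvm::TargetTransformInfo::LSRCost. The field
// names mirror the real struct, but this is a sketch, not LLVM code.
struct LSRCost {
  unsigned Insns, NumRegs, AddRecCost, NumIVMuls, NumBaseAdds;
  unsigned ImmCost, SetupCost, ScaleCost;
};

// Approximate generic ordering: register pressure dominates, and the
// instruction count is never consulted at all.
bool genericIsLSRCostLess(const LSRCost &C1, const LSRCost &C2) {
  return std::tie(C1.NumRegs, C1.AddRecCost, C1.NumIVMuls, C1.NumBaseAdds,
                  C1.ScaleCost, C1.ImmCost, C1.SetupCost) <
         std::tie(C2.NumRegs, C2.AddRecCost, C2.NumIVMuls, C2.NumBaseAdds,
                  C2.ScaleCost, C2.ImmCost, C2.SetupCost);
}

// The ordering this patch adopts for AArch64: instruction count becomes the
// second tie-breaker after register count, and base additions move ahead of
// the addrec and IV-multiply terms.
bool aarch64IsLSRCostLess(const LSRCost &C1, const LSRCost &C2) {
  return std::tie(C1.NumRegs, C1.Insns, C1.NumBaseAdds, C1.AddRecCost,
                  C1.NumIVMuls, C1.ScaleCost, C1.ImmCost, C1.SetupCost) <
         std::tie(C2.NumRegs, C2.Insns, C2.NumBaseAdds, C2.AddRecCost,
                  C2.NumIVMuls, C2.ScaleCost, C2.ImmCost, C2.SetupCost);
}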

@llvmbot
Member

llvmbot commented Mar 6, 2024

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-aarch64

Author: Graham Hunter (huntergr-arm)

Changes

Patch is 32.12 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/84189.diff

9 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (+19)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h (+3)
  • (modified) llvm/test/CodeGen/AArch64/arm64-2011-10-18-LdStOptBug.ll (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/arm64-ldp-cluster.ll (+6-6)
  • (modified) llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions-predicated-scalable.ll (+44-48)
  • (modified) llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions-scalable.ll (+56-70)
  • (modified) llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions.ll (+8-9)
  • (modified) llvm/test/CodeGen/AArch64/zext-to-tbl.ll (+96-97)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/AArch64/lsr-reuse.ll (+4-2)
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index 755b034764ed2d..9ed98267e35fcc 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -58,6 +58,9 @@ static cl::opt<unsigned> InlineCallPenaltyChangeSM(
 static cl::opt<bool> EnableOrLikeSelectOpt("enable-aarch64-or-like-select",
                                            cl::init(true), cl::Hidden);
 
+static cl::opt<bool> EnableLSRCostOpt("enable-aarch64-lsr-cost-opt",
+                                      cl::init(true), cl::Hidden);
+
 namespace {
 class TailFoldingOption {
   // These bitfields will only ever be set to something non-zero in operator=,
@@ -4152,3 +4155,19 @@ bool AArch64TTIImpl::shouldTreatInstructionLikeSelect(const Instruction *I) {
     return true;
   return BaseT::shouldTreatInstructionLikeSelect(I);
 }
+
+bool AArch64TTIImpl::isLSRCostLess(const TargetTransformInfo::LSRCost &C1,
+                                   const TargetTransformInfo::LSRCost &C2) {
+  // AArch64 specific here is adding the number of instructions to the
+  // comparison (though not as the first consideration, as some targets do)
+  // along with changing the priority of the base additions.
+  // TODO: Maybe a more nuanced tradeoff between instruction count
+  // and number of registers? To be investigated at a later date.
+  if (EnableLSRCostOpt)
+    return std::tie(C1.NumRegs, C1.Insns, C1.NumBaseAdds, C1.AddRecCost,
+                    C1.NumIVMuls, C1.ScaleCost, C1.ImmCost, C1.SetupCost) <
+           std::tie(C2.NumRegs, C2.Insns, C2.NumBaseAdds, C2.AddRecCost,
+                    C2.NumIVMuls, C2.ScaleCost, C2.ImmCost, C2.SetupCost);
+
+  return TargetTransformInfoImplBase::isLSRCostLess(C1, C2);
+}
\ No newline at end of file
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
index de39dea2be43e1..f438cf7f615920 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
@@ -424,6 +424,9 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
   }
 
   std::optional<unsigned> getMinPageSize() const { return 4096; }
+
+  bool isLSRCostLess(const TargetTransformInfo::LSRCost &C1,
+                     const TargetTransformInfo::LSRCost &C2);
 };
 
 } // end namespace llvm
diff --git a/llvm/test/CodeGen/AArch64/arm64-2011-10-18-LdStOptBug.ll b/llvm/test/CodeGen/AArch64/arm64-2011-10-18-LdStOptBug.ll
index 3b6c4fa875e604..dafdcf82f311d4 100644
--- a/llvm/test/CodeGen/AArch64/arm64-2011-10-18-LdStOptBug.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-2011-10-18-LdStOptBug.ll
@@ -12,7 +12,7 @@ entry:
 
 for.body:
 ; CHECK: for.body
-; CHECK: ldr w{{[0-9]+}}, [x{{[0-9]+}}, x{{[0-9]+}}]
+; CHECK: ldr w{{[0-9]+}}, [x{{[0-9]+}}]
 ; CHECK: add x[[REG:[0-9]+]],
 ; CHECK:                      x[[REG]], #1, lsl  #12
   %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
diff --git a/llvm/test/CodeGen/AArch64/arm64-ldp-cluster.ll b/llvm/test/CodeGen/AArch64/arm64-ldp-cluster.ll
index 8c7b31fd34c488..114203e46f196b 100644
--- a/llvm/test/CodeGen/AArch64/arm64-ldp-cluster.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-ldp-cluster.ll
@@ -176,13 +176,13 @@ exit:
 ; CHECK: ********** MI Scheduling **********
 ; CHECK: LDURDi_LDRDui:%bb.1 vector_body
 ;
-; CHECK: Cluster ld/st SU(2) - SU(6)
-; CHECK: Cluster ld/st SU(3) - SU(7)
+; CHECK: Cluster ld/st SU(0) - SU(4)
+; CHECK: Cluster ld/st SU(1) - SU(5)
 ;
-; CHECK: SU(2): %{{[0-9]+}}:fpr64 = LDURDi
-; CHECK: SU(3): %{{[0-9]+}}:fpr64 = LDURDi
-; CHECK: SU(6): %{{[0-9]+}}:fpr64 = LDRDui
-; CHECK: SU(7): %{{[0-9]+}}:fpr64 = LDRDui
+; CHECK: SU(0): %{{[0-9]+}}:fpr64 = LDURDi
+; CHECK: SU(1): %{{[0-9]+}}:fpr64 = LDURDi
+; CHECK: SU(4): %{{[0-9]+}}:fpr64 = LDRDui
+; CHECK: SU(5): %{{[0-9]+}}:fpr64 = LDRDui
 ;
 define void @LDURDi_LDRDui(ptr nocapture readonly %arg) {
 entry:
diff --git a/llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions-predicated-scalable.ll b/llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions-predicated-scalable.ll
index 467c3c254fc2d3..cb219bf28c5109 100644
--- a/llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions-predicated-scalable.ll
+++ b/llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions-predicated-scalable.ll
@@ -14,31 +14,29 @@ target triple = "aarch64"
 define %"class.std::complex" @complex_mul_v2f64(ptr %a, ptr %b) {
 ; CHECK-LABEL: complex_mul_v2f64:
 ; CHECK:       // %bb.0: // %entry
-; CHECK-NEXT:    mov w9, #100 // =0x64
+; CHECK-NEXT:    mov w8, #100 // =0x64
 ; CHECK-NEXT:    mov z1.d, #0 // =0x0
 ; CHECK-NEXT:    ptrue p0.d
-; CHECK-NEXT:    whilelo p1.d, xzr, x9
-; CHECK-NEXT:    cntd x10
-; CHECK-NEXT:    mov x8, xzr
-; CHECK-NEXT:    rdvl x11, #2
-; CHECK-NEXT:    mov x12, x10
+; CHECK-NEXT:    whilelo p1.d, xzr, x8
+; CHECK-NEXT:    cntd x9
+; CHECK-NEXT:    rdvl x10, #2
+; CHECK-NEXT:    mov x11, x9
 ; CHECK-NEXT:    zip2 z0.d, z1.d, z1.d
 ; CHECK-NEXT:    zip1 z1.d, z1.d, z1.d
 ; CHECK-NEXT:  .LBB0_1: // %vector.body
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    zip2 p3.d, p1.d, p1.d
-; CHECK-NEXT:    add x13, x0, x8
-; CHECK-NEXT:    add x14, x1, x8
-; CHECK-NEXT:    zip1 p2.d, p1.d, p1.d
 ; CHECK-NEXT:    mov z6.d, z1.d
 ; CHECK-NEXT:    mov z7.d, z0.d
-; CHECK-NEXT:    whilelo p1.d, x12, x9
-; CHECK-NEXT:    add x8, x8, x11
-; CHECK-NEXT:    add x12, x12, x10
-; CHECK-NEXT:    ld1d { z2.d }, p3/z, [x13, #1, mul vl]
-; CHECK-NEXT:    ld1d { z4.d }, p3/z, [x14, #1, mul vl]
-; CHECK-NEXT:    ld1d { z3.d }, p2/z, [x13]
-; CHECK-NEXT:    ld1d { z5.d }, p2/z, [x14]
+; CHECK-NEXT:    zip1 p2.d, p1.d, p1.d
+; CHECK-NEXT:    whilelo p1.d, x11, x8
+; CHECK-NEXT:    add x11, x11, x9
+; CHECK-NEXT:    ld1d { z2.d }, p3/z, [x0, #1, mul vl]
+; CHECK-NEXT:    ld1d { z4.d }, p3/z, [x1, #1, mul vl]
+; CHECK-NEXT:    ld1d { z3.d }, p2/z, [x0]
+; CHECK-NEXT:    ld1d { z5.d }, p2/z, [x1]
+; CHECK-NEXT:    add x1, x1, x10
+; CHECK-NEXT:    add x0, x0, x10
 ; CHECK-NEXT:    fcmla z7.d, p0/m, z4.d, z2.d, #0
 ; CHECK-NEXT:    fcmla z6.d, p0/m, z5.d, z3.d, #0
 ; CHECK-NEXT:    fcmla z7.d, p0/m, z4.d, z2.d, #90
@@ -115,32 +113,30 @@ define %"class.std::complex" @complex_mul_predicated_v2f64(ptr %a, ptr %b, ptr %
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    mov z1.d, #0 // =0x0
 ; CHECK-NEXT:    ptrue p0.d
-; CHECK-NEXT:    cntd x10
-; CHECK-NEXT:    neg x11, x10
-; CHECK-NEXT:    mov w12, #100 // =0x64
+; CHECK-NEXT:    cntd x9
+; CHECK-NEXT:    neg x10, x9
+; CHECK-NEXT:    mov w11, #100 // =0x64
 ; CHECK-NEXT:    mov x8, xzr
-; CHECK-NEXT:    mov x9, xzr
-; CHECK-NEXT:    and x11, x11, x12
-; CHECK-NEXT:    rdvl x12, #2
+; CHECK-NEXT:    and x10, x10, x11
+; CHECK-NEXT:    rdvl x11, #2
 ; CHECK-NEXT:    zip2 z0.d, z1.d, z1.d
 ; CHECK-NEXT:    zip1 z1.d, z1.d, z1.d
 ; CHECK-NEXT:  .LBB1_1: // %vector.body
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    ld1w { z2.d }, p0/z, [x2, x9, lsl #2]
-; CHECK-NEXT:    add x13, x0, x8
-; CHECK-NEXT:    add x14, x1, x8
+; CHECK-NEXT:    ld1w { z2.d }, p0/z, [x2, x8, lsl #2]
 ; CHECK-NEXT:    mov z6.d, z1.d
 ; CHECK-NEXT:    mov z7.d, z0.d
-; CHECK-NEXT:    add x9, x9, x10
-; CHECK-NEXT:    add x8, x8, x12
+; CHECK-NEXT:    add x8, x8, x9
 ; CHECK-NEXT:    cmpne p1.d, p0/z, z2.d, #0
-; CHECK-NEXT:    cmp x11, x9
+; CHECK-NEXT:    cmp x10, x8
 ; CHECK-NEXT:    zip2 p2.d, p1.d, p1.d
 ; CHECK-NEXT:    zip1 p1.d, p1.d, p1.d
-; CHECK-NEXT:    ld1d { z2.d }, p2/z, [x13, #1, mul vl]
-; CHECK-NEXT:    ld1d { z4.d }, p2/z, [x14, #1, mul vl]
-; CHECK-NEXT:    ld1d { z3.d }, p1/z, [x13]
-; CHECK-NEXT:    ld1d { z5.d }, p1/z, [x14]
+; CHECK-NEXT:    ld1d { z2.d }, p2/z, [x0, #1, mul vl]
+; CHECK-NEXT:    ld1d { z4.d }, p2/z, [x1, #1, mul vl]
+; CHECK-NEXT:    ld1d { z3.d }, p1/z, [x0]
+; CHECK-NEXT:    ld1d { z5.d }, p1/z, [x1]
+; CHECK-NEXT:    add x1, x1, x11
+; CHECK-NEXT:    add x0, x0, x11
 ; CHECK-NEXT:    fcmla z7.d, p0/m, z4.d, z2.d, #0
 ; CHECK-NEXT:    fcmla z6.d, p0/m, z5.d, z3.d, #0
 ; CHECK-NEXT:    fcmla z7.d, p0/m, z4.d, z2.d, #90
@@ -217,33 +213,33 @@ exit.block:                                     ; preds = %vector.body
 define %"class.std::complex" @complex_mul_predicated_x2_v2f64(ptr %a, ptr %b, ptr %cond) {
 ; CHECK-LABEL: complex_mul_predicated_x2_v2f64:
 ; CHECK:       // %bb.0: // %entry
-; CHECK-NEXT:    mov w10, #100 // =0x64
+; CHECK-NEXT:    mov w8, #100 // =0x64
 ; CHECK-NEXT:    mov z1.d, #0 // =0x0
 ; CHECK-NEXT:    ptrue p0.d
-; CHECK-NEXT:    whilelo p1.d, xzr, x10
-; CHECK-NEXT:    mov x8, xzr
-; CHECK-NEXT:    mov x9, xzr
-; CHECK-NEXT:    cntd x11
-; CHECK-NEXT:    rdvl x12, #2
+; CHECK-NEXT:    whilelo p1.d, xzr, x8
+; CHECK-NEXT:    cntd x9
+; CHECK-NEXT:    rdvl x10, #2
+; CHECK-NEXT:    cnth x11
+; CHECK-NEXT:    mov x12, x9
 ; CHECK-NEXT:    zip2 z0.d, z1.d, z1.d
 ; CHECK-NEXT:    zip1 z1.d, z1.d, z1.d
 ; CHECK-NEXT:  .LBB2_1: // %vector.body
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    ld1w { z2.d }, p1/z, [x2, x9, lsl #2]
-; CHECK-NEXT:    add x13, x0, x8
-; CHECK-NEXT:    add x14, x1, x8
+; CHECK-NEXT:    ld1w { z2.d }, p1/z, [x2]
 ; CHECK-NEXT:    mov z6.d, z1.d
 ; CHECK-NEXT:    mov z7.d, z0.d
-; CHECK-NEXT:    add x9, x9, x11
-; CHECK-NEXT:    add x8, x8, x12
+; CHECK-NEXT:    add x2, x2, x11
 ; CHECK-NEXT:    cmpne p1.d, p1/z, z2.d, #0
 ; CHECK-NEXT:    zip2 p3.d, p1.d, p1.d
 ; CHECK-NEXT:    zip1 p2.d, p1.d, p1.d
-; CHECK-NEXT:    whilelo p1.d, x9, x10
-; CHECK-NEXT:    ld1d { z2.d }, p3/z, [x13, #1, mul vl]
-; CHECK-NEXT:    ld1d { z4.d }, p3/z, [x14, #1, mul vl]
-; CHECK-NEXT:    ld1d { z3.d }, p2/z, [x13]
-; CHECK-NEXT:    ld1d { z5.d }, p2/z, [x14]
+; CHECK-NEXT:    whilelo p1.d, x12, x8
+; CHECK-NEXT:    add x12, x12, x9
+; CHECK-NEXT:    ld1d { z2.d }, p3/z, [x0, #1, mul vl]
+; CHECK-NEXT:    ld1d { z4.d }, p3/z, [x1, #1, mul vl]
+; CHECK-NEXT:    ld1d { z3.d }, p2/z, [x0]
+; CHECK-NEXT:    ld1d { z5.d }, p2/z, [x1]
+; CHECK-NEXT:    add x1, x1, x10
+; CHECK-NEXT:    add x0, x0, x10
 ; CHECK-NEXT:    fcmla z7.d, p0/m, z4.d, z2.d, #0
 ; CHECK-NEXT:    fcmla z6.d, p0/m, z5.d, z3.d, #0
 ; CHECK-NEXT:    fcmla z7.d, p0/m, z4.d, z2.d, #90
diff --git a/llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions-scalable.ll b/llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions-scalable.ll
index 1696ac8709d406..933b5f05975106 100644
--- a/llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions-scalable.ll
+++ b/llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions-scalable.ll
@@ -15,30 +15,27 @@ define %"class.std::complex" @complex_mul_v2f64(ptr %a, ptr %b) {
 ; CHECK-LABEL: complex_mul_v2f64:
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    mov z1.d, #0 // =0x0
-; CHECK-NEXT:    ptrue p1.b
-; CHECK-NEXT:    cntd x9
 ; CHECK-NEXT:    ptrue p0.d
-; CHECK-NEXT:    neg x9, x9
-; CHECK-NEXT:    mov w10, #100 // =0x64
-; CHECK-NEXT:    mov x8, xzr
-; CHECK-NEXT:    and x10, x9, x10
-; CHECK-NEXT:    rdvl x11, #2
+; CHECK-NEXT:    cntd x8
+; CHECK-NEXT:    neg x8, x8
+; CHECK-NEXT:    mov w9, #100 // =0x64
+; CHECK-NEXT:    rdvl x10, #2
+; CHECK-NEXT:    and x9, x8, x9
 ; CHECK-NEXT:    zip2 z0.d, z1.d, z1.d
 ; CHECK-NEXT:    zip1 z1.d, z1.d, z1.d
 ; CHECK-NEXT:  .LBB0_1: // %vector.body
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    add x12, x0, x8
-; CHECK-NEXT:    add x13, x1, x8
-; CHECK-NEXT:    ld1b { z2.b }, p1/z, [x0, x8]
-; CHECK-NEXT:    ld1d { z3.d }, p0/z, [x12, #1, mul vl]
-; CHECK-NEXT:    ld1b { z4.b }, p1/z, [x1, x8]
-; CHECK-NEXT:    ld1d { z5.d }, p0/z, [x13, #1, mul vl]
-; CHECK-NEXT:    adds x10, x10, x9
-; CHECK-NEXT:    add x8, x8, x11
-; CHECK-NEXT:    fcmla z1.d, p0/m, z4.d, z2.d, #0
-; CHECK-NEXT:    fcmla z0.d, p0/m, z5.d, z3.d, #0
-; CHECK-NEXT:    fcmla z1.d, p0/m, z4.d, z2.d, #90
-; CHECK-NEXT:    fcmla z0.d, p0/m, z5.d, z3.d, #90
+; CHECK-NEXT:    ld1d { z2.d }, p0/z, [x0, #1, mul vl]
+; CHECK-NEXT:    ld1d { z3.d }, p0/z, [x0]
+; CHECK-NEXT:    adds x9, x9, x8
+; CHECK-NEXT:    ld1d { z4.d }, p0/z, [x1, #1, mul vl]
+; CHECK-NEXT:    ld1d { z5.d }, p0/z, [x1]
+; CHECK-NEXT:    add x1, x1, x10
+; CHECK-NEXT:    add x0, x0, x10
+; CHECK-NEXT:    fcmla z1.d, p0/m, z5.d, z3.d, #0
+; CHECK-NEXT:    fcmla z0.d, p0/m, z4.d, z2.d, #0
+; CHECK-NEXT:    fcmla z1.d, p0/m, z5.d, z3.d, #90
+; CHECK-NEXT:    fcmla z0.d, p0/m, z4.d, z2.d, #90
 ; CHECK-NEXT:    b.ne .LBB0_1
 ; CHECK-NEXT:  // %bb.2: // %exit.block
 ; CHECK-NEXT:    uzp1 z2.d, z1.d, z0.d
@@ -105,13 +102,11 @@ define %"class.std::complex" @complex_mul_nonzero_init_v2f64(ptr %a, ptr %b) {
 ; CHECK-NEXT:    fmov d0, #1.00000000
 ; CHECK-NEXT:    mov z1.d, #0 // =0x0
 ; CHECK-NEXT:    fmov d2, #2.00000000
-; CHECK-NEXT:    cntd x9
-; CHECK-NEXT:    mov w10, #100 // =0x64
-; CHECK-NEXT:    ptrue p1.b
-; CHECK-NEXT:    neg x9, x9
-; CHECK-NEXT:    mov x8, xzr
-; CHECK-NEXT:    and x10, x9, x10
-; CHECK-NEXT:    rdvl x11, #2
+; CHECK-NEXT:    cntd x8
+; CHECK-NEXT:    mov w9, #100 // =0x64
+; CHECK-NEXT:    neg x8, x8
+; CHECK-NEXT:    rdvl x10, #2
+; CHECK-NEXT:    and x9, x8, x9
 ; CHECK-NEXT:    sel z3.d, p0, z0.d, z1.d
 ; CHECK-NEXT:    mov z1.d, p0/m, z2.d
 ; CHECK-NEXT:    ptrue p0.d
@@ -119,18 +114,17 @@ define %"class.std::complex" @complex_mul_nonzero_init_v2f64(ptr %a, ptr %b) {
 ; CHECK-NEXT:    zip1 z1.d, z1.d, z3.d
 ; CHECK-NEXT:  .LBB1_1: // %vector.body
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    add x12, x0, x8
-; CHECK-NEXT:    add x13, x1, x8
-; CHECK-NEXT:    ld1b { z2.b }, p1/z, [x0, x8]
-; CHECK-NEXT:    ld1d { z3.d }, p0/z, [x12, #1, mul vl]
-; CHECK-NEXT:    ld1b { z4.b }, p1/z, [x1, x8]
-; CHECK-NEXT:    ld1d { z5.d }, p0/z, [x13, #1, mul vl]
-; CHECK-NEXT:    adds x10, x10, x9
-; CHECK-NEXT:    add x8, x8, x11
-; CHECK-NEXT:    fcmla z1.d, p0/m, z4.d, z2.d, #0
-; CHECK-NEXT:    fcmla z0.d, p0/m, z5.d, z3.d, #0
-; CHECK-NEXT:    fcmla z1.d, p0/m, z4.d, z2.d, #90
-; CHECK-NEXT:    fcmla z0.d, p0/m, z5.d, z3.d, #90
+; CHECK-NEXT:    ld1d { z2.d }, p0/z, [x0, #1, mul vl]
+; CHECK-NEXT:    ld1d { z3.d }, p0/z, [x0]
+; CHECK-NEXT:    adds x9, x9, x8
+; CHECK-NEXT:    ld1d { z4.d }, p0/z, [x1, #1, mul vl]
+; CHECK-NEXT:    ld1d { z5.d }, p0/z, [x1]
+; CHECK-NEXT:    add x1, x1, x10
+; CHECK-NEXT:    add x0, x0, x10
+; CHECK-NEXT:    fcmla z1.d, p0/m, z5.d, z3.d, #0
+; CHECK-NEXT:    fcmla z0.d, p0/m, z4.d, z2.d, #0
+; CHECK-NEXT:    fcmla z1.d, p0/m, z5.d, z3.d, #90
+; CHECK-NEXT:    fcmla z0.d, p0/m, z4.d, z2.d, #90
 ; CHECK-NEXT:    b.ne .LBB1_1
 ; CHECK-NEXT:  // %bb.2: // %exit.block
 ; CHECK-NEXT:    uzp1 z2.d, z1.d, z0.d
@@ -190,45 +184,37 @@ define %"class.std::complex" @complex_mul_v2f64_unrolled(ptr %a, ptr %b) {
 ; CHECK-LABEL: complex_mul_v2f64_unrolled:
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    mov z1.d, #0 // =0x0
-; CHECK-NEXT:    ptrue p1.b
-; CHECK-NEXT:    cntw x9
 ; CHECK-NEXT:    ptrue p0.d
-; CHECK-NEXT:    neg x9, x9
-; CHECK-NEXT:    mov w10, #1000 // =0x3e8
-; CHECK-NEXT:    rdvl x12, #2
-; CHECK-NEXT:    mov x8, xzr
-; CHECK-NEXT:    and x10, x9, x10
+; CHECK-NEXT:    cntw x8
+; CHECK-NEXT:    neg x8, x8
+; CHECK-NEXT:    mov w9, #1000 // =0x3e8
+; CHECK-NEXT:    rdvl x10, #4
+; CHECK-NEXT:    and x9, x8, x9
 ; CHECK-NEXT:    zip2 z0.d, z1.d, z1.d
 ; CHECK-NEXT:    zip1 z1.d, z1.d, z1.d
-; CHECK-NEXT:    add x11, x1, x12
-; CHECK-NEXT:    add x12, x0, x12
-; CHECK-NEXT:    rdvl x13, #4
 ; CHECK-NEXT:    mov z2.d, z1.d
 ; CHECK-NEXT:    mov z3.d, z0.d
 ; CHECK-NEXT:  .LBB2_1: // %vector.body
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    add x14, x0, x8
-; CHECK-NEXT:    add x15, x12, x8
-; CHECK-NEXT:    add x16, x1, x8
-; CHECK-NEXT:    add x17, x11, x8
-; CHECK-NEXT:    ld1b { z4.b }, p1/z, [x0, x8]
-; CHECK-NEXT:    ld1d { z5.d }, p0/z, [x14, #1, mul vl]
-; CHECK-NEXT:    ld1b { z6.b }, p1/z, [x12, x8]
-; CHECK-NEXT:    ld1b { z7.b }, p1/z, [x1, x8]
-; CHECK-NEXT:    ld1d { z16.d }, p0/z, [x16, #1, mul vl]
-; CHECK-NEXT:    ld1d { z17.d }, p0/z, [x15, #1, mul vl]
-; CHECK-NEXT:    ld1b { z18.b }, p1/z, [x11, x8]
-; CHECK-NEXT:    ld1d { z19.d }, p0/z, [x17, #1, mul vl]
-; CHECK-NEXT:    adds x10, x10, x9
-; CHECK-NEXT:    add x8, x8, x13
-; CHECK-NEXT:    fcmla z1.d, p0/m, z7.d, z4.d, #0
-; CHECK-NEXT:    fcmla z0.d, p0/m, z16.d, z5.d, #0
-; CHECK-NEXT:    fcmla z2.d, p0/m, z18.d, z6.d, #0
-; CHECK-NEXT:    fcmla z3.d, p0/m, z19.d, z17.d, #0
-; CHECK-NEXT:    fcmla z1.d, p0/m, z7.d, z4.d, #90
-; CHECK-NEXT:    fcmla z0.d, p0/m, z16.d, z5.d, #90
-; CHECK-NEXT:    fcmla z2.d, p0/m, z18.d, z6.d, #90
-; CHECK-NEXT:    fcmla z3.d, p0/m, z19.d, z17.d, #90
+; CHECK-NEXT:    ld1d { z4.d }, p0/z, [x0, #1, mul vl]
+; CHECK-NEXT:    ld1d { z5.d }, p0/z, [x0]
+; CHECK-NEXT:    adds x9, x9, x8
+; CHECK-NEXT:    ld1d { z6.d }, p0/z, [x0, #3, mul vl]
+; CHECK-NEXT:    ld1d { z7.d }, p0/z, [x1, #1, mul vl]
+; CHECK-NEXT:    ld1d { z16.d }, p0/z, [x1]
+; CHECK-NEXT:    ld1d { z17.d }, p0/z, [x0, #2, mul vl]
+; CHECK-NEXT:    add x0, x0, x10
+; CHECK-NEXT:    ld1d { z18.d }, p0/z, [x1, #3, mul vl]
+; CHECK-NEXT:    ld1d { z19.d }, p0/z, [x1, #2, mul vl]
+; CHECK-NEXT:    add x1, x1, x10
+; CHECK-NEXT:    fcmla z1.d, p0/m, z16.d, z5.d, #0
+; CHECK-NEXT:    fcmla z0.d, p0/m, z7.d, z4.d, #0
+; CHECK-NEXT:    fcmla z3.d, p0/m, z18.d, z6.d, #0
+; CHECK-NEXT:    fcmla z2.d, p0/m, z19.d, z17.d, #0
+; CHECK-NEXT:    fcmla z1.d, p0/m, z16.d, z5.d, #90
+; CHECK-NEXT:    fcmla z0.d, p0/m, z7.d, z4.d, #90
+; CHECK-NEXT:    fcmla z3.d, p0/m, z18.d, z6.d, #90
+; CHECK-NEXT:    fcmla z2.d, p0/m, z19.d, z17.d, #90
 ; CHECK-NEXT:    b.ne .LBB2_1
 ; CHECK-NEXT:  // %bb.2: // %exit.block
 ; CHECK-NEXT:    uzp1 z4.d, z2.d, z3.d
diff --git a/llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions.ll b/llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions.ll
index 44d0a9392ba629..aed3072bb4af37 100644
--- a/llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions.ll
+++ b/llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions.ll
@@ -148,17 +148,16 @@ define %"struct.std::complex" @complex_mul_v2f64_unrolled(ptr %a, ptr %b) {
 ; CHECK-NEXT:    adrp x8, .LCPI2_0
 ; CHECK-NEXT:    movi v3.2d, #0000000000000000
 ; CHECK-NEXT:    ldr q2, [x8, :lo12:.LCPI2_0]
-; CHECK-NEXT:    mov x8, xzr
+; CHECK-NEXT:    add x8, x0, #32
+; CHECK-NEXT:    add x9, x1, #32
+; CHECK-NEXT:    mov x10, #-100 // =0xffffffffffffff9c
 ; CHECK-NEXT:  .LBB2_1: // %vector.body
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    add x9, x0, x8
-; CHECK-NEXT:    add x10, x1, x8
-; CHECK-NEXT:    add x8, x8, #64
-; CHECK-NEXT:    ldp q5, q4, [x9]
-; CHECK-NEXT:    cmp x8, #1600
-; CHECK-NEXT:    ldp q7, q6, [x10]
-; CHECK-NEXT:    ldp q17, q16, [x9, #32]
-; CHECK-NEXT:    ldp q19, q18, [x10, #32]
+; CHECK-NEXT:    ldp q5, q4, [x8, #-32]
+; CHECK-NEXT:    adds x10, x10, #4
+; CHECK-NEXT:    ldp q7, q6, [x9, #-32]
+; CHECK-NEXT:    ldp q17, q16, [x8], #64
+; CHECK-NEXT:    ldp q19, q18, [x9], #64
 ; CHECK-NEXT:    fcmla v2.2d, v7.2d, v5.2d, #0
 ; CHECK-NEXT:    fcmla v0.2d, v6.2d, v4.2d, #0
 ; CHECK-NEXT:    fcmla v1.2d, v19.2d, v17.2d, #0
diff --git a/llvm/test/CodeGen/AArch64/zext-to-tbl.ll b/llvm/test/CodeGen/AArch64/zext-to-tbl.ll
index 08ad34c7b03ba0..54d7ecfaa8caf3 100644
--- a/llvm/test/CodeGen/AArch64/zext-to-tbl.ll
+++ b/llvm/test/CodeGen/AArch64/zext-to-tbl.ll
@@ -1669,42 +1669,41 @@ define void @zext_v8i8_to_v8i64_with_add_in_sequence_in_loop(ptr %src, ptr %dst)
 ; CHECK-LABEL: zext_v8i8_to_v8i64_with_add_in_sequence_in_loop:
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:  Lloh18:
-; CHECK-NEXT:    adrp x9, lCPI17_0@PAGE
+; CHECK-NEXT:    adrp x8, lCPI17_0@PAGE
 ; CHECK-NEXT:  Lloh19:
-; CHECK-NEXT:    adrp x10, lCPI17_1@PAGE
-; CHECK-NEXT:    mov x8, xzr
+; CHECK-NEXT:    adrp x9, lCPI17_1@PAGE
+; CHECK-NEXT:    mov w10, #128 ; =0x80
 ; CHECK-NEXT:  Lloh20:
-; CHECK-NEXT:    ldr q0, [x9, lCPI17_0@PAGEOFF]
+; CHECK-NEXT:    ldr q0, [x8, lCPI17_0@PAGEOFF]
 ; CHECK-NEXT:  Lloh21:
-; CHECK-NEXT:    ldr q1, [x10, lCPI17_1@PAGEOFF]
+; CHECK-NEXT:    ldr q1, [x9, lCPI17_1@PAGEOFF]
+; CHECK-NEXT:    add x8, x1, #64
 ; CHECK-NEXT:    add x9, x0, #8
 ; CHECK-NEXT:  LBB17_1: ; %loop
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    ldp d2, d3, [x9, #-8]
-; CHECK-NEXT:    ad...
[truncated]
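
To make the comparison semantics concrete, here is a small runnable example with invented cost values (hypothetical, not taken from the patch) showing how the lexicographic tuple ordering can flip the winner between the two priority lists:

#include <cassert>
#include <tuple>

int main() {
  // Two hypothetical candidates in the AArch64 field order: NumRegs, Insns,
  // NumBaseAdds, AddRecCost, NumIVMuls, ScaleCost, ImmCost, SetupCost.
  auto A = std::make_tuple(4u, 10u, 1u, 2u, 0u, 0u, 0u, 0u);
  auto B = std::make_tuple(4u, 12u, 0u, 1u, 0u, 0u, 0u, 0u);

  // std::tuple's operator< is lexicographic: NumRegs ties (4 == 4), so the
  // decision falls to Insns, where A wins (10 < 12); fields further right
  // are never consulted once an earlier one differs.
  assert(A < B);

  // The same candidates rearranged into the (assumed) generic order of
  // NumRegs, AddRecCost, NumIVMuls, NumBaseAdds, ScaleCost, ImmCost,
  // SetupCost, which has no Insns term: B now wins on AddRecCost (1 < 2).
  auto A2 = std::make_tuple(4u, 2u, 0u, 1u, 0u, 0u, 0u);
  auto B2 = std::make_tuple(4u, 1u, 0u, 0u, 0u, 0u, 0u);
  assert(B2 < A2);
  return 0;
}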

@paulwalker-arm (Collaborator) left a comment

This looks like a good change to me. I'm happy to trust you'll do the right thing if your performance runs suggest otherwise.

@huntergr-arm force-pushed the aarch64-islsrcostless-override branch from c5c2568 to 9ebb1f6 on June 6, 2024 at 10:50
@huntergr-arm
Collaborator Author

Rebased to address merge conflicts.

I thought I'd seen a regression in bwaves from SPEC, but it turns out its performance is just unstable, at least on the machine I was benchmarking on. So I think we see almost no performance difference from this patch, but it does remove a few instructions from some loops. I'll commit as is, but we'll probably want to revisit the ordering later without the strict focus on vscale-relative offsets that I had when originally writing this.

@huntergr-arm merged commit e16f2f5 into llvm:main on Jun 6, 2024
7 checks passed
@huntergr-arm deleted the aarch64-islsrcostless-override branch on June 6, 2024 at 13:45