[LV][AArch64] Prefer Fixed over Scalable if cost-model is equal (Neoverse V2) #95819

Merged
merged 1 commit into llvm:main from lv-prefer-fixed on Jul 17, 2024

Conversation

sjoerdmeijer
Collaborator

For the Neoverse V2, prefer fixed-width vectorisation if the cost-model assigns an equal cost to fixed and scalable vectorisation. This improves 7 kernels from TSVC-2 by about 2x, and does not affect SPEC2017 INT and FP.

This tends to benefit small kernels, like the ones in TSVC, for a number of reasons: processing the predicates does not come entirely for free, NEON tends to generate slightly less code, which can have a big impact on these small kernels, and then there are second-order effects where SVE codegen is slightly less optimal in some areas.

This codegen strategy of generating more NEON is in line with GCC's codegen strategy, which is actually even more aggressive in generating NEON when no predication is required. We could be smarter and more aggressive about generating NEON too (and improve performance), but this seems like a good and straightforward first step.

This depends on #95818
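
For context, a minimal sketch of the kind of target hook #95818 adds; the subtarget accessor name below is an assumption based on later comments in this thread, not the actual patch:

bool AArch64TTIImpl::preferFixedIfEqualToScalable() const {
  // The loop vectoriser queries this TTI hook; the AArch64 implementation
  // simply forwards to a per-CPU subtarget feature
  // (FeatureUseFixedOverScalableIfEqualCost), so only CPUs that set the
  // feature (initially the Neoverse V2) opt in.
  return ST->useFixedOverScalableIfEqualCost(); // assumed generated accessor
}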

@llvmbot
Member

llvmbot commented Jun 17, 2024

@llvm/pr-subscribers-llvm-analysis
@llvm/pr-subscribers-backend-aarch64

@llvm/pr-subscribers-llvm-transforms

Author: Sjoerd Meijer (sjoerdmeijer)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/95819.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+4-1)
  • (added) llvm/test/Transforms/LoopVectorize/AArch64/prefer-fixed-if-equal-to-scalable.ll (+196)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index c7c19ef456c7c..cfe5e0157b67b 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -4780,7 +4780,10 @@ bool LoopVectorizationPlanner::isMoreProfitable(
   // Assume vscale may be larger than 1 (or the value being tuned for),
   // so that scalable vectorization is slightly favorable over fixed-width
   // vectorization.
-  bool PreferScalable = A.Width.isScalable() && !B.Width.isScalable();
+  bool PreferScalable = false;
+  if (!TTI.preferFixedIfEqualToScalable())
+     PreferScalable = A.Width.isScalable() && !B.Width.isScalable();
+
   auto CmpFn = [PreferScalable](const InstructionCost &LHS,
                                 const InstructionCost &RHS) {
     return PreferScalable ? LHS <= RHS : LHS < RHS;
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/prefer-fixed-if-equal-to-scalable.ll b/llvm/test/Transforms/LoopVectorize/AArch64/prefer-fixed-if-equal-to-scalable.ll
new file mode 100644
index 0000000000000..1567485ef02a9
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/prefer-fixed-if-equal-to-scalable.ll
@@ -0,0 +1,196 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -S < %s -passes=loop-vectorize | FileCheck %s
+
+target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128-Fn32"
+target triple = "aarch64-unknown-linux-gnu"
+
+@a = dso_local local_unnamed_addr global [32000 x float] zeroinitializer, align 64
+@b = dso_local local_unnamed_addr global [32000 x float] zeroinitializer, align 64
+
+define void @NeoverseV2() local_unnamed_addr #0 {
+; CHECK-LABEL: define void @NeoverseV2(
+; CHECK-SAME: ) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[TMP0:%.*]] = add i64 [[INDEX]], 0
+; CHECK-NEXT:    [[TMP1:%.*]] = add i64 [[INDEX]], 4
+; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 [[TMP0]]
+; CHECK-NEXT:    [[TMP3:%.*]] = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 [[TMP1]]
+; CHECK-NEXT:    [[TMP4:%.*]] = getelementptr inbounds float, ptr [[TMP2]], i32 0
+; CHECK-NEXT:    [[TMP5:%.*]] = getelementptr inbounds float, ptr [[TMP2]], i32 4
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x float>, ptr [[TMP4]], align 4
+; CHECK-NEXT:    [[WIDE_LOAD1:%.*]] = load <4 x float>, ptr [[TMP5]], align 4
+; CHECK-NEXT:    [[TMP6:%.*]] = getelementptr inbounds [32000 x float], ptr @b, i64 0, i64 [[TMP0]]
+; CHECK-NEXT:    [[TMP7:%.*]] = getelementptr inbounds [32000 x float], ptr @b, i64 0, i64 [[TMP1]]
+; CHECK-NEXT:    [[TMP8:%.*]] = getelementptr inbounds float, ptr [[TMP6]], i32 0
+; CHECK-NEXT:    [[TMP9:%.*]] = getelementptr inbounds float, ptr [[TMP6]], i32 4
+; CHECK-NEXT:    [[WIDE_LOAD2:%.*]] = load <4 x float>, ptr [[TMP8]], align 4
+; CHECK-NEXT:    [[WIDE_LOAD3:%.*]] = load <4 x float>, ptr [[TMP9]], align 4
+; CHECK-NEXT:    [[TMP10:%.*]] = fadd fast <4 x float> [[WIDE_LOAD2]], [[WIDE_LOAD]]
+; CHECK-NEXT:    [[TMP11:%.*]] = fadd fast <4 x float> [[WIDE_LOAD3]], [[WIDE_LOAD1]]
+; CHECK-NEXT:    [[TMP12:%.*]] = add nuw nsw i64 [[TMP0]], 16000
+; CHECK-NEXT:    [[TMP13:%.*]] = add nuw nsw i64 [[TMP1]], 16000
+; CHECK-NEXT:    [[TMP14:%.*]] = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 [[TMP12]]
+; CHECK-NEXT:    [[TMP15:%.*]] = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 [[TMP13]]
+; CHECK-NEXT:    [[TMP16:%.*]] = getelementptr inbounds float, ptr [[TMP14]], i32 0
+; CHECK-NEXT:    [[TMP17:%.*]] = getelementptr inbounds float, ptr [[TMP14]], i32 4
+; CHECK-NEXT:    store <4 x float> [[TMP10]], ptr [[TMP16]], align 4
+; CHECK-NEXT:    store <4 x float> [[TMP11]], ptr [[TMP17]], align 4
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
+; CHECK-NEXT:    [[TMP18:%.*]] = icmp eq i64 [[INDEX_NEXT]], 16000
+; CHECK-NEXT:    br i1 [[TMP18]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br i1 true, label %[[FOR_COND_CLEANUP:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ 16000, %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    br label %[[FOR_BODY:.*]]
+; CHECK:       [[FOR_COND_CLEANUP]]:
+; CHECK-NEXT:    ret void
+; CHECK:       [[FOR_BODY]]:
+; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY]] ]
+; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 [[INDVARS_IV]]
+; CHECK-NEXT:    [[TMP19:%.*]] = load float, ptr [[ARRAYIDX]], align 4
+; CHECK-NEXT:    [[ARRAYIDX2:%.*]] = getelementptr inbounds [32000 x float], ptr @b, i64 0, i64 [[INDVARS_IV]]
+; CHECK-NEXT:    [[TMP20:%.*]] = load float, ptr [[ARRAYIDX2]], align 4
+; CHECK-NEXT:    [[ADD:%.*]] = fadd fast float [[TMP20]], [[TMP19]]
+; CHECK-NEXT:    [[TMP21:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 16000
+; CHECK-NEXT:    [[ARRAYIDX5:%.*]] = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 [[TMP21]]
+; CHECK-NEXT:    store float [[ADD]], ptr [[ARRAYIDX5]], align 4
+; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 16000
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[FOR_COND_CLEANUP]], label %[[FOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+;
+entry:
+  br label %for.body
+
+for.cond.cleanup:
+  ret void
+
+for.body:
+  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+  %arrayidx = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 %indvars.iv
+  %0 = load float, ptr %arrayidx, align 4
+  %arrayidx2 = getelementptr inbounds [32000 x float], ptr @b, i64 0, i64 %indvars.iv
+  %1 = load float, ptr %arrayidx2, align 4
+  %add = fadd fast float %1, %0
+  %2 = add nuw nsw i64 %indvars.iv, 16000
+  %arrayidx5 = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 %2
+  store float %add, ptr %arrayidx5, align 4
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %exitcond.not = icmp eq i64 %indvars.iv.next, 16000
+  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+}
+
+define void @NeoverseV1() #1 {
+; CHECK-LABEL: define void @NeoverseV1(
+; CHECK-SAME: ) #[[ATTR1:[0-9]+]] {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP1:%.*]] = mul i64 [[TMP0]], 8
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 16000, [[TMP1]]
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP3:%.*]] = mul i64 [[TMP2]], 8
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 16000, [[TMP3]]
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 16000, [[N_MOD_VF]]
+; CHECK-NEXT:    [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP5:%.*]] = mul i64 [[TMP4]], 8
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[TMP6:%.*]] = add i64 [[INDEX]], 0
+; CHECK-NEXT:    [[TMP7:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP8:%.*]] = mul i64 [[TMP7]], 4
+; CHECK-NEXT:    [[TMP9:%.*]] = add i64 [[TMP8]], 0
+; CHECK-NEXT:    [[TMP10:%.*]] = mul i64 [[TMP9]], 1
+; CHECK-NEXT:    [[TMP11:%.*]] = add i64 [[INDEX]], [[TMP10]]
+; CHECK-NEXT:    [[TMP12:%.*]] = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 [[TMP6]]
+; CHECK-NEXT:    [[TMP13:%.*]] = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 [[TMP11]]
+; CHECK-NEXT:    [[TMP14:%.*]] = getelementptr inbounds float, ptr [[TMP12]], i32 0
+; CHECK-NEXT:    [[TMP15:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP16:%.*]] = mul i64 [[TMP15]], 4
+; CHECK-NEXT:    [[TMP17:%.*]] = getelementptr inbounds float, ptr [[TMP12]], i64 [[TMP16]]
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <vscale x 4 x float>, ptr [[TMP14]], align 4
+; CHECK-NEXT:    [[WIDE_LOAD1:%.*]] = load <vscale x 4 x float>, ptr [[TMP17]], align 4
+; CHECK-NEXT:    [[TMP18:%.*]] = getelementptr inbounds [32000 x float], ptr @b, i64 0, i64 [[TMP6]]
+; CHECK-NEXT:    [[TMP19:%.*]] = getelementptr inbounds [32000 x float], ptr @b, i64 0, i64 [[TMP11]]
+; CHECK-NEXT:    [[TMP20:%.*]] = getelementptr inbounds float, ptr [[TMP18]], i32 0
+; CHECK-NEXT:    [[TMP21:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP22:%.*]] = mul i64 [[TMP21]], 4
+; CHECK-NEXT:    [[TMP23:%.*]] = getelementptr inbounds float, ptr [[TMP18]], i64 [[TMP22]]
+; CHECK-NEXT:    [[WIDE_LOAD2:%.*]] = load <vscale x 4 x float>, ptr [[TMP20]], align 4
+; CHECK-NEXT:    [[WIDE_LOAD3:%.*]] = load <vscale x 4 x float>, ptr [[TMP23]], align 4
+; CHECK-NEXT:    [[TMP24:%.*]] = fadd fast <vscale x 4 x float> [[WIDE_LOAD2]], [[WIDE_LOAD]]
+; CHECK-NEXT:    [[TMP25:%.*]] = fadd fast <vscale x 4 x float> [[WIDE_LOAD3]], [[WIDE_LOAD1]]
+; CHECK-NEXT:    [[TMP26:%.*]] = add nuw nsw i64 [[TMP6]], 16000
+; CHECK-NEXT:    [[TMP27:%.*]] = add nuw nsw i64 [[TMP11]], 16000
+; CHECK-NEXT:    [[TMP28:%.*]] = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 [[TMP26]]
+; CHECK-NEXT:    [[TMP29:%.*]] = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 [[TMP27]]
+; CHECK-NEXT:    [[TMP30:%.*]] = getelementptr inbounds float, ptr [[TMP28]], i32 0
+; CHECK-NEXT:    [[TMP31:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP32:%.*]] = mul i64 [[TMP31]], 4
+; CHECK-NEXT:    [[TMP33:%.*]] = getelementptr inbounds float, ptr [[TMP28]], i64 [[TMP32]]
+; CHECK-NEXT:    store <vscale x 4 x float> [[TMP24]], ptr [[TMP30]], align 4
+; CHECK-NEXT:    store <vscale x 4 x float> [[TMP25]], ptr [[TMP33]], align 4
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP5]]
+; CHECK-NEXT:    [[TMP34:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP34]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 16000, [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[FOR_COND_CLEANUP:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    br label %[[FOR_BODY:.*]]
+; CHECK:       [[FOR_COND_CLEANUP]]:
+; CHECK-NEXT:    ret void
+; CHECK:       [[FOR_BODY]]:
+; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY]] ]
+; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 [[INDVARS_IV]]
+; CHECK-NEXT:    [[TMP35:%.*]] = load float, ptr [[ARRAYIDX]], align 4
+; CHECK-NEXT:    [[ARRAYIDX2:%.*]] = getelementptr inbounds [32000 x float], ptr @b, i64 0, i64 [[INDVARS_IV]]
+; CHECK-NEXT:    [[TMP36:%.*]] = load float, ptr [[ARRAYIDX2]], align 4
+; CHECK-NEXT:    [[ADD:%.*]] = fadd fast float [[TMP36]], [[TMP35]]
+; CHECK-NEXT:    [[TMP37:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 16000
+; CHECK-NEXT:    [[ARRAYIDX5:%.*]] = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 [[TMP37]]
+; CHECK-NEXT:    store float [[ADD]], ptr [[ARRAYIDX5]], align 4
+; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 16000
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[FOR_COND_CLEANUP]], label %[[FOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
+;
+entry:
+  br label %for.body
+
+for.cond.cleanup:
+  ret void
+
+for.body:
+  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+  %arrayidx = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 %indvars.iv
+  %0 = load float, ptr %arrayidx, align 4
+  %arrayidx2 = getelementptr inbounds [32000 x float], ptr @b, i64 0, i64 %indvars.iv
+  %1 = load float, ptr %arrayidx2, align 4
+  %add = fadd fast float %1, %0
+  %2 = add nuw nsw i64 %indvars.iv, 16000
+  %arrayidx5 = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 %2
+  store float %add, ptr %arrayidx5, align 4
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %exitcond.not = icmp eq i64 %indvars.iv.next, 16000
+  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+}
+
+attributes #0 = { mustprogress nofree norecurse nosync nounwind memory(readwrite, argmem: none, inaccessiblemem: none) uwtable vscale_range(1,16) "approx-func-fp-math"="true" "frame-pointer"="non-leaf" "no-infs-fp-math"="true" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="neoverse-v2" "target-features"="+sve,+sve2,+v9a" "unsafe-fp-math"="true" }
+
+attributes #1 = { mustprogress nofree norecurse nosync nounwind memory(readwrite, argmem: none, inaccessiblemem: none) uwtable vscale_range(1,16) "approx-func-fp-math"="true" "frame-pointer"="non-leaf" "no-infs-fp-math"="true" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="neoverse-v1" "target-features"="+sve,+v9a" "unsafe-fp-math"="true" }
+;.
+; CHECK: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
+; CHECK: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}
+; CHECK: [[META2]] = !{!"llvm.loop.unroll.runtime.disable"}
+; CHECK: [[LOOP3]] = distinct !{[[LOOP3]], [[META2]], [[META1]]}
+; CHECK: [[LOOP4]] = distinct !{[[LOOP4]], [[META1]], [[META2]]}
+; CHECK: [[LOOP5]] = distinct !{[[LOOP5]], [[META2]], [[META1]]}
+;.


github-actions bot commented Jun 17, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

sjoerdmeijer added a commit to sjoerdmeijer/llvm-project that referenced this pull request Jun 17, 2024
This adds a new hook to prefer fixed width loop vectorization over
scalable. This will be used in the loop vectoriser to generate more NEON
code instead of SVE if cost-model assigns equal costs to fixed and
scalable versions of the vectorised loop.

This is used in llvm#95819.
Comment on lines 4784 to 4819
if (!TTI.preferFixedIfEqualToScalable())
PreferScalable = A.Width.isScalable() && !B.Width.isScalable();
Collaborator

This may be better as

if (!TTI.preferFixedIfEqualToScalable())
  PreferScalable = !A.Width.isScalable() && B.Width.isScalable();
else
  PreferScalable = A.Width.isScalable() && !B.Width.isScalable();

This way it will not be dependent upon the order it visits the vplans.

PreferScalable could then do with a better name too.

Collaborator Author

I have reverted to my initial implementation, because your suggestion has a wider impact that I would like to look at in a follow-up; I do agree it would make sense.


attributes #0 = { mustprogress nofree norecurse nosync nounwind memory(readwrite, argmem: none, inaccessiblemem: none) uwtable vscale_range(1,16) "approx-func-fp-math"="true" "frame-pointer"="non-leaf" "no-infs-fp-math"="true" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="neoverse-v2" "target-features"="+sve,+sve2,+v9a" "unsafe-fp-math"="true" }

attributes #1 = { mustprogress nofree norecurse nosync nounwind memory(readwrite, argmem: none, inaccessiblemem: none) uwtable vscale_range(1,16) "approx-func-fp-math"="true" "frame-pointer"="non-leaf" "no-infs-fp-math"="true" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="neoverse-v1" "target-features"="+sve,+v9a" "unsafe-fp-math"="true" }
Collaborator

Can you make this a CPU without a 256bit SVE vector length? Or maybe cpu=generic.

@david-arm
Contributor

Hi @sjoerdmeijer, in the loops that you looked at where SVE performed worse, did the SVE code itself look optimal, i.e. as good as it can be? The only thing I'm nervous about here is that we're switching to NEON due to poor codegen or a bad cost model, rather than due to a fundamental problem with SVE. For example, if we're generating more code in the SVE loop, then that suggests the costs for the SVE loop are just too low, so by hiking the costs we would favour NEON anyway. We should definitely fix any cost model issues like that where we can. Also, if you have any examples where SVE codegen is actually worse, can you create issues for them? Thanks!

@david-arm
Contributor

Just to be clear, I'm not against the idea of having a flag or enabling it by default on neoverse-v2, but I am just a bit nervous about any problems it may hide that we could be usefully fixing, that's all.

@paulwalker-arm
Collaborator

paulwalker-arm commented Jun 18, 2024

Considering the rationale as explained in the second paragraph, where I see phrases like "predicates does not come entirely for free", "NEON tends to generate slightly less code" and "SVE codegen is slightly less optimal", I'm opposed to having this flag. If there's another rationale then great, please elaborate, but otherwise this looks to me like a hack to work around cost-model deficiencies and generally poor code generation, both of which should be fixed rather than hidden.

@sjoerdmeijer
Collaborator Author

This is actually quite a complicated story; it's a combination of a few factors: micro-architectural reasons and (SVE) codegen reasons. To give a better introduction to the problem, we have a number of examples similar to this:

 for (int i = 0; i < 32000/2; i++) {
        a[i+k] = a[i] + b[i];
 }

This is GCC's output (and LLVM's output with this patch):

   .L3:
        ldr     q31, [x20, x0]
        ldr     q30, [x19, x0]
        fadd    v31.4s, v31.4s, v30.4s
        str     q31, [x21, x0]
        add     x0, x0, 16
        cmp     x0, x28
        bne     .L3

LLVM's output is something like this:

   .LBB0_3:
        add     x9, x19, x8, lsl #2
        add     x10, x20, x8, lsl #2
        ld1w    { z0.s }, p0/z, [x19, x8, lsl #2]
        ld1w    { z2.s }, p0/z, [x20, x8, lsl #2]
        add     x8, x8, x21
        ld1w    { z1.s }, p0/z, [x9, x28, lsl #2]
        ld1w    { z3.s }, p0/z, [x10, x28, lsl #2]
        add     x10, x9, x26
        cmp     x8, x22
        fadd    z0.s, z2.s, z0.s
        fadd    z1.s, z3.s, z1.s
        st1w    { z0.s }, p0, [x9, x23, lsl #2]
        st1w    { z1.s }, p0, [x10, x28, lsl #2]
        b.ne    .LBB0_3

There is nothing fundamentally wrong with LLVM's codegen, but it performs a lot worse.

One of the micro-architectural reasons is documented in section "4.1 Dispatch constraints" of the SWOG:

The dispatch stage can process up to 8 MOPs per cycle and dispatch up to 16 μOPs per cycle,

The smaller kernels fit within these dispatch constraints; the bigger ones don't, resulting in significant performance differences.

Most of the performance can be clawed back by interleaving more. But then there's clearly a code-quality issue: the amount of code necessary to get on par with the NEON kernel would be disproportionate.

Two more subjective arguments:

  • there is no need to go predicated for these kinds of examples,
  • GCC prefers this codegen strategy for the same reasons.

Other factors are the slightly more complicated SVE addressing modes, also resulting in more MOPs.

We are investigating other micro-architectural issues, but I cannot comment on this yet.

@david-arm
Contributor

There is nothing fundamentally wrong with LLVM's codegen, but it performs a lot worse.

Does that in fact suggest for this case that we should be interleaving less in LLVM and be tuning that accordingly? GCC's output does use NEON, but there is no interleaving. Have you tried disabling interleaving when using SVE in LLVM so that it's a fairer comparison with GCC? I'd be interested to know how that performs.

@david-arm
Contributor

I definitely see issues with generating the addresses in the SVE version, and we should fix that, but if we assume those issues are fixed and we have a fair comparison with GCC, do we still see the SVE loop being slower?

@sjoerdmeijer
Collaborator Author

sjoerdmeijer commented Jun 18, 2024

There is nothing fundamentally wrong with LLVM's codegen, but it performs a lot worse.

Does that in fact suggest for this case that we should be interleaving less in LLVM and be tuning that accordingly? GCC's output does use NEON, but there is no interleaving. Have you tried disabling interleaving when using SVE in LLVM so that it's a fairer comparison with GCC? I'd be interested to know how that performs.

Yes, we have also experimented with that, e.g. nano-benchmarked this:

loop:
   ld1w  {z0.s}, p0/z, [x1, x9, lsl #2]
   ld1w  {z1.s}, p0/z, [x2, x9, lsl #2]
   incw  x9
   cmp   x9, x10
   fadd  z0.s, z1.s, z0.s
   st1w  {z0.s}, p0, [x3, x9, lsl #2]
   b.ne  loop

where we hoisted the adds out of the loop, and this also performs worse.

This is the different micro-architectural reason I was talking about. We have a suspicion of what this could be, but cannot confirm it yet.

So, I hope I have given enough examples to demonstrate that unexpected performance results happen. It is not necessarily SVE's fault, but there's a higher chance of hitting micro-architectural issues and secondary effects for this class of small kernels and loops. I think there's also an element of "keeping it simple" to it, if you see what I mean.

@sjoerdmeijer
Collaborator Author

@paulwalker-arm , @david-arm : I think a very fair question is why we should be generating SVE code for this type of kernel? Can you explain the benefit?

@paulwalker-arm
Collaborator

paulwalker-arm commented Jun 18, 2024

@paulwalker-arm , @david-arm : I think a very fair question is why we should be generating SVE code for this type of kernel? Can you explain the benefit?

The general answer is the cost model says there's no reason not to use scalable vectors, and with the potential for them to be bigger than their fixed length counterparts there's scope for more performance beyond the assumed minimum.

What worries me is that you're not suggesting there's a subtle performance difference but in fact the difference is pretty large. I just think this should be captured as either a more accurate cost model or implementation specific code generation decisions (e.g. UseScalarIncVL and UseSVEFPLD1R).

@sjoerdmeijer
Collaborator Author

@paulwalker-arm , @david-arm : I think a very fair question is why we should be generating SVE code for this type of kernel? Can you explain the benefit?

The general answer is the cost model says there's no reason not to use scalable vectors, and with the potential for them to be bigger than their fixed length counterparts there's scope for more performance beyond the assumed minimum.

What worries me is that you're not suggesting there's a subtle performance difference but in fact the difference is pretty large. I just think this should be captured as either a more accurate cost model or implementation specific code generation decisions (e.g. UseScalarIncVL and UseSVEFPLD1R).

Yes, this is the only reason: performance portability of the same binary running on a potentially bigger SVE implementation. This is not helping the Neoverse V2 though. We get very unpredictable and bad performance for small SVE kernels, and that's the problem we are dealing with here. The reasons are twofold: micro-architectural reasons, and less mature SVE codegen. The latter can be fixed over time, but then there is still the former. And I do want to stress again that predication for this class of loops is simply not necessary.

UseScalarIncVL is what we are investigating; we might indeed want to propose disabling that for the Neoverse V2, but that is yet another thing.

So, all of these problems go away if we resort to NEON when the cost-model assigns equal cost. I could think about a narrower heuristic: if the kernel is small, there are no masked loads/stores, and the cost-model is a tie, favour fixed, something along those lines. But I am afraid you're not going to like that either, because you just like to see SVE code here. Cost-modelling is what we are doing here one way or another, but I think this is not just a matter of "let's increase the cost of instruction XYZ a little bit". Preferring fixed if the cost-model cannot decide looks very reasonable to me.

@davemgreen
Collaborator

I agree that this seems perfectly sensible to me. I tried it on a few other machines (a mixture of emulation and real hardware) and it looked profitable to pick Neon on a tie. (Even the Neoverse V1 with a 256-bit SVE length apparently came out with an ever-so-slight, meaninglessly small performance gain. Obviously there were a lot of things getting better and worse.)

The fact that Neon has LDP and postinc would feel like enough of a reason to favour Neon on a tie. Cost modelling in the mid-end is always going to be fairly high level and there will always be ties. Favouring Neon sounds like the most reasonable option between the two, provided we have the performance to back it up.

@paulwalker-arm
Collaborator

I could think about a narrower heuristic: if the kernel is small, there are no masked loads/stores, and the cost-model is a tie, favour fixed, something along those lines. But I am afraid you're not going to like that either, because you just like to see SVE code here.

Please don't think me an SVE zealot here. I'd just rather see the compiler make decisions based on data from the IR rather than implement a switch that effectively means "ignore the data". I appreciate this might not always be clean, for example take a look at AArch64TTIImpl::preferPredicateOverEpilogue. This comes under the same "micro-architectural reasons that we cannot fully represent" but at least that's showing a decision based on the input IR that's concrete (i.e. we think predication is bad for small loops) that can be evolved over time.

@paulwalker-arm
Collaborator

paulwalker-arm commented Jun 18, 2024

The fact that Neon has LDP and postinc would feel like enough of a reason to favour Neon on a tie.

Whereas I'd say "The fact that Neon has LDP and postinc would feel like enough of a reason for Neon to have a lower cost when comparing equivalent vector lengths."

@sjoerdmeijer
Collaborator Author

sjoerdmeijer commented Jun 18, 2024

The fact that Neon has LDP and postinc would feel like enough of a reason to favour Neon on a tie.

Whereas I'd say "The fact that Neon has LDP and postinc would feel like enough of a reason for Neon to have a lower cost when comparing equivalent vector lengths."

I don't think tweaking the cost of load instructions is going to cut it here. It is too fragile and going to have so many side effects that it probably won't achieve what we want. I don't really see that as a way forward.

I'd just rather see the compiler make decisions based on data from the IR rather than implement a switch that effectively means "ignore the data".

I see what you mean, but I both agree and disagree with that statement. First of all, we unleash the cost-model on fixed and scalable and all sorts of VFs and combinations, and if there is a tie we happen to default to scalable. Thus, the status quo is that we make an arbitrary decision that is similarly not based on any IR or empirical data. The "arbitration" in case of a tie and the proposal to change to fixed is based on empirical data though, which is clearly an improvement over the status quo. However, perhaps we agree on this point now, and the remaining point of discussion is that we could make the arbitration a somewhat better-informed decision, now or in the future, by implementing a similar hook:

I appreciate this might not always be clean, for example take a look at AArch64TTIImpl::preferPredicateOverEpilogue.

As the original author of that hook, I am very well aware and I am happy with this suggestion and this:

but at least that's showing a decision based on the input IR that's concrete (i.e. we think predication is bad for small loops)

It's similar I think to the heuristic that I sketched earlier: a scan of the loop body to see if small loops don't require predication. And so I am happy to go down that route to progress this. Please let me know what you think.

@david-arm
Contributor

It would be very useful to have some performance data attached to this PR for benchmarks such as SPEC and/or real-world applications demonstrating the problems. As far as I understand it, TSVC is not really meant for benchmarking and is in fact meant to showcase the breadth of a compiler's vectorisation capabilities, although I do understand it's useful for highlighting potential issues. Like @paulwalker-arm said, it's not about blindly favouring SVE in the face of overwhelming data, but more about making the best choices based on data.

Here are my thoughts:

  1. If there is a problem with the SVE inc instructions then there is probably a 1-3 line fix for that. Would that fix most of the issues? It would be good to know and if it's proven to be an issue we should do that regardless.
  2. If there is a problem with poor codegen when interleaving with SVE due to the calculation of load addresses then we should definitely fix that, but it's not really a fundamental reason why SVE is worse than NEON. It seems to me that's a problem with the compiler, rather than a problem with the architecture and we should look into it.
  3. It is true that NEON has instructions such as LDP and STP that SVE2 does not have but those mostly come into play when interleaving (which GCC isn't doing in the example above). What this patch is doing is effectively disabling SVE for, I assume, the majority of loops, even if we're not interleaving. For example, many of the loops in the SPEC Fortran benchmarks are too large for interleaving.

Having said that, I definitely understand the LDP/STP problem, but I agree with @paulwalker-arm that, if possible, it would be nice to see this taken into account with better cost modelling. This is just a suggestion, but you could tweak this patch as follows:

  1. Instead of choosing upfront whether we prefer a fixed or scalable VF in the event of a tie, we could instead record the tie for later use.
  2. After invoking selectInterleaveCount from LoopVectorize.cpp we can then call a version of this new TTI hook that passes in the Loop pointer, the fixed-width VF and the chosen interleave count. At that point we can choose between the tie at a point in the code where we are able to make a more informed choice.
  3. In your new hook you can have code like this:
bool preferFixedIfEqualToScalable(const Loop *L, ElementCount VF, unsigned IC) {
  if (IC == 1) // Don't expect to use LDP or STP
    return false;

  for (BasicBlock *BB : L->blocks())
    for (Instruction &I : *BB)
      if ((isa<LoadInst>(&I) || isa<StoreInst>(&I)) &&
          preferNeonForInterleavedLoadStore(&I, VF, IC))
        return true;

  return false;
}

where preferNeonForInterleavedLoadStore can check whether or not it's possible to use LDP or STP here.
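
Purely for illustration, a sketch of what such a helper could look like; the name and signature follow the pseudocode above, and none of this exists in tree:

static bool preferNeonForInterleavedLoadStore(Instruction *I, ElementCount VF,
                                              unsigned IC) {
  // Only fixed-width, interleaved loops can form LDP/STP pairs.
  if (VF.isScalable() || IC < 2)
    return false;
  // Treat plain (non-atomic, non-volatile) loads and stores as candidates; a
  // real implementation would also have to check that the access is
  // consecutive, since gathers/scatters and masked accesses cannot use
  // LDP/STP and are better served by SVE anyway.
  if (auto *LI = dyn_cast<LoadInst>(I))
    return LI->isSimple();
  if (auto *SI = dyn_cast<StoreInst>(I))
    return SI->isSimple();
  return false;
}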

What do you think?

@sjoerdmeijer
Collaborator Author

sjoerdmeijer commented Jun 19, 2024

I think this is excellent. Thanks for writing this all up! This is pretty much what I now had in mind too, so yes, let's go for this approach.

To be clear, I don't care about TSVC. It's well appreciated that it's synthetic, and I could argue that TSVC is as synthetic as x264 from SPEC INT, which is the app in the suite most amenable to vectorisation, because real x264 implementations are not written like that. But the way I look at this is getting the basics right. Small loops do exist, and if you create a reproducer, for example, unpredictable and bad performance is really not helping, and that's an understatement. I already mentioned my performance data: SPEC INT + FP are unaffected, and I see 2x improvements in TSVC and another benchmark. I will get additional data for the llvm test-suite, but it sounded like @davemgreen did a sweep too.

I will now go ahead and change this patch as suggested. Please do shout if there are still concerns.
I do want to note that I am intending to keep #95818 because I only want to do this for the V2. Our new hook preferFixedIfEqualToScalable will first check TTI.preferFixedIfEqualToScalable() and early-exit, because I can't do the benchmarking on (most) other targets.

@david-arm
Contributor

I will now go ahead and change this patch as suggested. Please do shout if there are still concerns.
I do want to note that I am intending to keep #95818 because I only want to do this for the V2.

Yeah sure, my suggested implementation was perhaps a bit simplistic. :) You're right that it needs a check for the CPU as well. If other CPUs with similar issues would benefit then they could be added here over time as well.

@paulwalker-arm
Collaborator

paulwalker-arm commented Jun 19, 2024

Thanks all, sounds like we have a way forward. Ultimately I think we should transition the cost modelling to be tree based rather than instruction based because that's where much of the frailty comes from. That's very much long term work, so not specifically relevant to this immediate goal.

Whilst I look forward to the results, I do want to emphasise the "is inc/dec VL bad for Neoverse V2" point. That feels like something we should get to the bottom of asap and just add the necessary feature flag.

I have a final preferFixedIfEqualToScalable query though. Whilst I'm not against its use in this new context I do wonder if instead you could just use getVScaleForTuning() which will be set to 1 for Neoverse V2. My assumption here is that you basically want to ask "if we're tuning based on the assumption the SVE registers are no bigger than the Neon ones".

@sjoerdmeijer
Collaborator Author

Thanks all, sounds like we have a way forward. Ultimately I think we should transition the cost modelling to be tree based rather than instruction based because that's where much of the frailty comes from. That's very much long term work, so not specifically relevant to this immediate goal.

sgtm.

Whilst I look forward to the results, I do want to emphasise the "is inc/dec VL bad for Neoverse V2" point. That feels like something we should get to the bottom of asap and just add the necessary feature flag.

I will ping you an email with details about this because I can't discuss that here. I will do that early next week because I am off tomorrow + Friday.

I have a final preferFixedIfEqualToScalable query though. Whilst I'm not against its use in this new context I do wonder if instead you could just use getVScaleForTuning() which will be set to 1 for Neoverse V2. My assumption here is that you basically want to ask "if we're tuning based on the assumption the SVE registers are no bigger than the Neon ones".

Yes, you're right, that's one of the main questions here. I will add it as an early-exit or fold it into TTI.preferFixedIfEqualToScalable() so that it checks both whether the target has FeatureUseFixedOverScalableIfEqualCost set and whether vscale = 1. I agree that ideally we would just have getVScaleForTuning() and perhaps wouldn't need FeatureUseFixedOverScalableIfEqualCost and TTI.preferFixedIfEqualToScalable(), but I didn't want to do this for all targets with vscale = 1 at the same time.
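
A rough sketch of the folded check being discussed (the accessor names are assumptions, not actual code):

bool AArch64TTIImpl::preferFixedIfEqualToScalable() const {
  // Prefer fixed width only when the CPU opts in via the subtarget feature
  // and we are tuning for vscale == 1, i.e. the SVE registers are assumed to
  // be no wider than the 128-bit NEON registers.
  return ST->useFixedOverScalableIfEqualCost() &&
         ST->getVScaleForTuning() == 1;
}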

@davemgreen
Collaborator

LDP/STP was only one of the reasons to prefer fixed-width over scalable vectorization on a tie. I have been trying to think of reasons why we would want to favour scalable vectorization over fixed-width, and if we know that the scalable factor will not be higher, as is the case here, then I can't think of anything that really makes scalable vectors the better choice. @david-arm I see where you are coming from with your suggestion, but it feels like an over-engineered way of only solving part of the problem, and at the end of the day, when we don't know any better and can't make the decision based on anything else, fixed width should be the preferred option.

@llvmbot added the backend:AArch64 and llvm:analysis labels Jun 24, 2024
@sjoerdmeijer
Collaborator Author

To create a new baseline, I have merged the TTI changes in here (and addressed the minor comments).

I have not yet added more logic to preferFixedIfEqualToScalable. I am struggling a little and have (unsurprisingly) a lot of sympathy for Dave's remark about this looking a bit over-engineered, but I will give it another try as this is what I promised to do.

if (!TTI.preferFixedOverScalableIfEqualCost())
  PreferScalable = !A.Width.isScalable() && B.Width.isScalable();
else
  PreferScalable = A.Width.isScalable() && !B.Width.isScalable();
@dc03-work
Contributor

dc03-work commented Jun 25, 2024

This can be moved into the initializer of the variable. May also be easier to read if the if-statement condition is inverted.

@sjoerdmeijer
Collaborator Author

I have added LoopVectorizationPlanner::preferFixedOverScalableIfEqualCost. As I wrote in the comments there, the initial idea is that if the cost-model is a tie and gathers/scatters or predication are required, then SVE is probably more efficient, so favour SVE in these cases. The way I look at this function is that it provides a little framework to come up with something better in the future; it should be general enough to allow this. I spent some time creating a test case for the code path "cost model is a tie, but there are gathers/scatters, so let's favour SVE", but failed within the time I spent on this; again, I hope the framework is enough. Let me know what you think @david-arm , @paulwalker-arm

@sjoerdmeijer
Collaborator Author

gentle ping

@sjoerdmeijer
Collaborator Author

As we don't want to base codegen strategies on one benchmark only, here are two examples from production code, @paulwalker-arm , @david-arm , @davemgreen .

The first example is extracted from an HPC app; this is the SVE kernel before:

   39204:       a5e8438b        ld1d    {z11.d}, p0/z, [x28, x8, lsl #3]
   39208:       a5e842cc        ld1d    {z12.d}, p0/z, [x22, x8, lsl #3]
   3920c:       a5e8426d        ld1d    {z13.d}, p0/z, [x19, x8, lsl #3]
   39210:       65cb01ab        fadd    z11.d, z13.d, z11.d
   39214:       a5e843ce        ld1d    {z14.d}, p0/z, [x30, x8, lsl #3]
   39218:       a5e84349        ld1d    {z9.d}, p0/z, [x26, x8, lsl #3]
   3921c:       a5e842aa        ld1d    {z10.d}, p0/z, [x21, x8, lsl #3]
   39220:       65cc01cc        fadd    z12.d, z14.d, z12.d
   39224:       a5e8416d        ld1d    {z13.d}, p0/z, [x11, x8, lsl #3]
   39228:       a5e841ee        ld1d    {z14.d}, p0/z, [x15, x8, lsl #3]
   3922c:       a5e8413f        ld1d    {z31.d}, p0/z, [x9, x8, lsl #3]
   39230:       a5e840e8        ld1d    {z8.d}, p0/z, [x7, x8, lsl #3]
   39234:       65eda169        fmsb    z9.d, p0/m, z11.d, z13.d
   39238:       65eea18a        fmsb    z10.d, p0/m, z12.d, z14.d
   3923c:       65fe013f        fmla    z31.d, p0/m, z9.d, z30.d
   39240:       65fe0148        fmla    z8.d, p0/m, z10.d, z30.d
   39244:       e5e8431f        st1d    {z31.d}, p0, [x24, x8, lsl #3]
   39248:       e5e841a8        st1d    {z8.d}, p0, [x13, x8, lsl #3]
   3924c:       04b0e3e8        incw    x8
   39250:       eb08023f        cmp     x17, x8
   39254:       54fffd81        b.ne    39204

And here's the NEON kernel after, which is a case where LDP/STP is helping a lot:

   39600:       ad7fe4f8        ldp     q24, q25, [x7, #-16]
   39604:       ad7fecba        ldp     q26, q27, [x5, #-16]
   39608:       f1001318        subs    x24, x24, #0x4
   3960c:       910080e7        add     x7, x7, #0x20
   39610:       910080a5        add     x5, x5, #0x20
   39614:       4e79d779        fadd    v25.2d, v27.2d, v25.2d
   39618:       ad7fdeb6        ldp     q22, q23, [x21, #-16]
   3961c:       ad7fd6f4        ldp     q20, q21, [x23, #-16]
   39620:       910082f7        add     x23, x23, #0x20
   39624:       910082b5        add     x21, x21, #0x20
   39628:       4e78d758        fadd    v24.2d, v26.2d, v24.2d
   3962c:       ad7fee9a        ldp     q26, q27, [x20, #-16]
   39630:       91008294        add     x20, x20, #0x20
   39634:       4ef7cf3b        fmls    v27.2d, v25.2d, v23.2d
   39638:       4ef6cf1a        fmls    v26.2d, v24.2d, v22.2d
   3963c:       4e7fcf75        fmla    v21.2d, v27.2d, v31.2d
   39640:       4e7fcf54        fmla    v20.2d, v26.2d, v31.2d
   39644:       ad3fd6d4        stp     q20, q21, [x22, #-16]
   39648:       910082d6        add     x22, x22, #0x20
   3964c:       54fffda1        b.ne    39600

The second production code snippet is a dequantisation kernel:

 void foo( unsigned char * __restrict__ A, const unsigned char * __restrict__ B, int N) {
    for (int i = 0; i < N; ++i) {
        A[i * 2] = (unsigned char)(B[i] & 0xf);
        A[i * 2 + 1] = ((unsigned char)(B[i] & 0xf0) >> 4);
    }
}

It's an example where, first of all, SVE predication is unnecessary and, second, it leads to code bloat:

.LBB0_9:                                // =>This Inner Loop Header: Depth=1
        add     x12, x1, x10
        ld1b    { z0.b }, p0/z, [x1, x10]
        addvl   x10, x10, #2
        ld1b    { z2.b }, p0/z, [x12, #1, mul vl]
        lsr     z1.b, z0.b, #4
        and     z0.b, z0.b, #0xf
        lsr     z3.b, z2.b, #4
        and     z2.b, z2.b, #0xf
        st2b    { z0.b, z1.b }, p0, [x11]
        st2b    { z2.b, z3.b }, p0, [x11, #2, mul vl]
        addvl   x11, x11, #4
        cmp     x9, x10
        b.ne    .LBB0_9

Even if the SVE codegen could be optimised here, it cannot compete with this NEON kernel in either code quality or performance:

.L4:
        ldr     q31, [x3], 16
        and     v30.16b, v31.16b, v29.16b
        ushr    v31.16b, v31.16b, 4
        st2     {v30.16b - v31.16b}, [x4], 32
        cmp     x3, x5
        bne     .L4

@david-arm
Contributor

Hi @sjoerdmeijer, thanks for putting up the new patch! I realise I'm backtracking on this at the risk of you being annoyed :), but I now wonder if it's worth reverting, for now, to the original version that always favours fixed-width vectorisation in the event of a cost-model tie, specifically for neoverse-v2. I do absolutely think that where we find issues with the cost model or codegen using scalable vectors we should try to fix those, but I can buy the argument that for many simple loops in practice, when using scalable vectorisation, the SVE predication does not come entirely for free and that there is a slight reason to favour fixed-width.

@sjoerdmeijer
Collaborator Author

Absolutely no problem. :) Thanks for your patience and help with this, @david-arm. I have reverted this to the previous version, which should have all minor comments addressed.

@@ -244,6 +244,10 @@ def FeatureExperimentalZeroingPseudos
def FeatureUseScalarIncVL : SubtargetFeature<"use-scalar-inc-vl",
"UseScalarIncVL", "true", "Prefer inc/dec over add+cnt">;

def FeatureUseFixedOverScalableIfEqualCost: SubtargetFeature<"use-fixed-over-scalable-equal-cost",
Contributor

Perhaps I'm being a bit pedantic here (tell me if I am!), but I think it makes sense to have the flag match the feature name, so the flag should be "use-fixed-over-scalable-if-equal-cost". My thinking is that it's only 3 extra characters in an already long name, and is a bit clearer and more consistent. :)

@@ -4780,7 +4780,10 @@ bool LoopVectorizationPlanner::isMoreProfitable(
// Assume vscale may be larger than 1 (or the value being tuned for),
// so that scalable vectorization is slightly favorable over fixed-width
// vectorization.
bool PreferScalable = A.Width.isScalable() && !B.Width.isScalable();
bool PreferScalable = A.Width.isScalable() && !B.Width.isScalable();
Contributor

nit: There's an extra white space between bool and PreferScalable.

@@ -4780,7 +4780,10 @@ bool LoopVectorizationPlanner::isMoreProfitable(
// Assume vscale may be larger than 1 (or the value being tuned for),
// so that scalable vectorization is slightly favorable over fixed-width
// vectorization.
bool PreferScalable = A.Width.isScalable() && !B.Width.isScalable();
bool PreferScalable = A.Width.isScalable() && !B.Width.isScalable();
if (TTI.preferFixedOverScalableIfEqualCost())
Contributor

Is it worth rewriting this to be something like:

  bool PreferScalable = !TTI.preferFixedOverScalableIfEqualCost() &&
      A.Width.isScalable() && !B.Width.isScalable();

br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
}

attributes #0 = { mustprogress nofree norecurse nosync nounwind memory(readwrite, argmem: none, inaccessiblemem: none) uwtable vscale_range(1,16) "target-cpu"="neoverse-v2" "target-features"="+sve,+sve2,+v9a" }
Contributor

I think all the attributes except for vscale_range and target-cpu can be removed here.

Also, since you've added a new target feature it probably also makes sense to have a test for "target-cpu"="generic" "target-features"="+use-fixed-over-scalable-equal-cost" if possible.

@a = dso_local local_unnamed_addr global [32000 x float] zeroinitializer, align 64
@b = dso_local local_unnamed_addr global [32000 x float] zeroinitializer, align 64

define void @NeoverseV2() local_unnamed_addr #0 {
Contributor

nit: Can you remove local_unnamed_addr from both functions as a cleanup?

@@ -0,0 +1,196 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
Contributor

I wonder if it's necessary to run utils/update_test_checks.py on this test, since all you're really testing is that we vectorised with a particular VF. For example, you could reduce the number of CHECK lines in each function to just the bare minimum:

; CHECK-LABEL: define void @NeoverseV2(
; CHECK:    store <4 x float>

@@ -0,0 +1,196 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
; RUN: opt -S < %s -passes=loop-vectorize | FileCheck %s
Contributor

I wonder if this way of testing is potentially fragile due to future cost model changes. It might end up choosing the values you expect anyway regardless of your new feature. I've not tried this out, but can you use -force-target-instruction-cost=1 to ensure you're testing the new code path?

@sjoerdmeijer
Collaborator Author

Thanks @david-arm , have addressed all comments.


define void @GenericCPUPreferFixed() #2 {
; CHECK-LABEL: define void @GenericCPUPreferFixed(
; CHECK: store <vscale x 4 x float>
Contributor

Hmm, this looks odd. Perhaps I've misunderstood how the feature works, but this seems to be preferring scalable? I was expecting us to use VF=4 here.

Collaborator Author

Ah yes, that does look odd. We are certainly expecting VF=4.
Maybe there's some interaction with -force-target-instruction-cost=1; I will have a look at what is going on here.

Collaborator Author

The reason is that the cost-model isn't a tie anymore. The cost-model indicates vscale is better, so this isn't about preferring fixed:

  Vector loop of width vscale x 4 costs: 1

vs.

  Vector loop of width 4 costs: 3

The negative tests not using the Neoverse V2 are a bit of a pain to come up with; I would need to find other examples where the cost model is a tie for cpu=generic.

Contributor

Ah, of course, I see. The only way you can test the feature by itself, then, is to set the CPU to something where VScaleForTuning is set to 1, e.g. neoverse-n2 or cortex-a710. But then that feels a bit unreliable too. In that case, it may be easiest just to remove this particular test?

Collaborator Author

Thanks, I have removed this test, and I don't think we lose (much) testing/coverage.

[LV][AArch64] Prefer Fixed over Scalable if cost-model is equal (Neoverse V2)

For the Neoverse V2 we would like to prefer fixed width over scalable
vectorisation if the cost-model assigns an equal cost to both for certain
loops. This improves 7 kernels from TSVC-2 and several production kernels by
about 2x, and does not affect SPEC2017 INT and FP. This also adds a new TTI
hook that can steer the loop vectorizer to preferring fixed width
vectorization, which can be set per CPU. For now, this is only enabled for the
Neoverse V2.

There are 3 reasons why preferring NEON might be better in the case the
cost-model is a tie and the SVE vector size is the same as NEON (128-bit):
architectural reasons, micro-architectural reasons, and SVE codegen reasons.
The latter will be improved over time, so the more important reasons are the
former two, i.e. the (micro-)architectural reasons: the use of LDP/STP
instructions, which are not available in SVE2, and the avoidance of
predication.

For what it is worth: this codegen strategy of generating more NEON is in line
with GCC's codegen strategy, which is actually even more aggressive in
generating NEON when no predication is required. We could be smarter about the
decision making, but this seems to be a good first step in the right direction,
and we can always revise this later (for example, make the target hook more
general).
@david-arm
Contributor

david-arm left a comment

LGTM! Thanks for making the changes.

@sjoerdmeijer
Collaborator Author

Thanks for all your help with this, really appreciate this.

@sjoerdmeijer sjoerdmeijer merged commit c5329c8 into llvm:main Jul 17, 2024
7 checks passed
@sjoerdmeijer sjoerdmeijer deleted the lv-prefer-fixed branch July 17, 2024 09:59
wangpc-pp added a commit to wangpc-pp/llvm-project that referenced this pull request Jul 25, 2024
This is inspired by llvm#95819.

Some kernels like s000 show some improvements, and we can reduce the
code for calculating the vector length and fully unroll the tail epilogue.

Currently, we add a SubtargetFeature for this and the processors
can add it if needed.
yuxuanchen1997 pushed a commit that referenced this pull request Jul 25, 2024
[LV][AArch64] Prefer Fixed over Scalable if cost-model is equal (Neoverse V2) (#95819)

Differential Revision: https://phabricator.intern.facebook.com/D60250994