[LV][AArch64] Prefer Fixed over Scalable if cost-model is equal (Neoverse V2) #95819
Conversation
@llvm/pr-subscribers-llvm-analysis @llvm/pr-subscribers-llvm-transforms

Author: Sjoerd Meijer (sjoerdmeijer)

Changes

For the Neoverse V2, prefer fixed width vectorisation if the cost-model assigns an equal cost to fixed and scalable vectorisation. This improves 7 kernels from TSVC-2 by about 2x, and does not affect SPEC2017 INT and FP.

This tends to benefit small kernels, like the ones in TSVC, for a number of reasons: processing the predicates does not come entirely for free, NEON tends to generate slightly less code, which can have a big impact on these small kernels, and then there are second-order effects in that SVE codegen is slightly less optimal in some areas.

This codegen strategy to generate more NEON is in line with GCC's codegen strategy, which is actually even more aggressive in generating NEON when no predication is required. We could be smarter and more aggressive about generating more NEON (and improve performance), but this seems to be a good and straightforward first step.

This depends on #95818

Full diff: https://github.com/llvm/llvm-project/pull/95819.diff

2 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index c7c19ef456c7c..cfe5e0157b67b 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -4780,7 +4780,10 @@ bool LoopVectorizationPlanner::isMoreProfitable(
// Assume vscale may be larger than 1 (or the value being tuned for),
// so that scalable vectorization is slightly favorable over fixed-width
// vectorization.
- bool PreferScalable = A.Width.isScalable() && !B.Width.isScalable();
+ bool PreferScalable = false;
+ if (!TTI.preferFixedIfEqualToScalable())
+ PreferScalable = A.Width.isScalable() && !B.Width.isScalable();
+
auto CmpFn = [PreferScalable](const InstructionCost &LHS,
const InstructionCost &RHS) {
return PreferScalable ? LHS <= RHS : LHS < RHS;
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/prefer-fixed-if-equal-to-scalable.ll b/llvm/test/Transforms/LoopVectorize/AArch64/prefer-fixed-if-equal-to-scalable.ll
new file mode 100644
index 0000000000000..1567485ef02a9
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/prefer-fixed-if-equal-to-scalable.ll
@@ -0,0 +1,196 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -S < %s -passes=loop-vectorize | FileCheck %s
+
+target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128-Fn32"
+target triple = "aarch64-unknown-linux-gnu"
+
+@a = dso_local local_unnamed_addr global [32000 x float] zeroinitializer, align 64
+@b = dso_local local_unnamed_addr global [32000 x float] zeroinitializer, align 64
+
+define void @NeoverseV2() local_unnamed_addr #0 {
+; CHECK-LABEL: define void @NeoverseV2(
+; CHECK-SAME: ) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK: [[VECTOR_PH]]:
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
+; CHECK-NEXT: [[TMP1:%.*]] = add i64 [[INDEX]], 4
+; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 [[TMP0]]
+; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 [[TMP1]]
+; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds float, ptr [[TMP2]], i32 0
+; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds float, ptr [[TMP2]], i32 4
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x float>, ptr [[TMP4]], align 4
+; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <4 x float>, ptr [[TMP5]], align 4
+; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds [32000 x float], ptr @b, i64 0, i64 [[TMP0]]
+; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds [32000 x float], ptr @b, i64 0, i64 [[TMP1]]
+; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds float, ptr [[TMP6]], i32 0
+; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds float, ptr [[TMP6]], i32 4
+; CHECK-NEXT: [[WIDE_LOAD2:%.*]] = load <4 x float>, ptr [[TMP8]], align 4
+; CHECK-NEXT: [[WIDE_LOAD3:%.*]] = load <4 x float>, ptr [[TMP9]], align 4
+; CHECK-NEXT: [[TMP10:%.*]] = fadd fast <4 x float> [[WIDE_LOAD2]], [[WIDE_LOAD]]
+; CHECK-NEXT: [[TMP11:%.*]] = fadd fast <4 x float> [[WIDE_LOAD3]], [[WIDE_LOAD1]]
+; CHECK-NEXT: [[TMP12:%.*]] = add nuw nsw i64 [[TMP0]], 16000
+; CHECK-NEXT: [[TMP13:%.*]] = add nuw nsw i64 [[TMP1]], 16000
+; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 [[TMP12]]
+; CHECK-NEXT: [[TMP15:%.*]] = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 [[TMP13]]
+; CHECK-NEXT: [[TMP16:%.*]] = getelementptr inbounds float, ptr [[TMP14]], i32 0
+; CHECK-NEXT: [[TMP17:%.*]] = getelementptr inbounds float, ptr [[TMP14]], i32 4
+; CHECK-NEXT: store <4 x float> [[TMP10]], ptr [[TMP16]], align 4
+; CHECK-NEXT: store <4 x float> [[TMP11]], ptr [[TMP17]], align 4
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
+; CHECK-NEXT: [[TMP18:%.*]] = icmp eq i64 [[INDEX_NEXT]], 16000
+; CHECK-NEXT: br i1 [[TMP18]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: br i1 true, label %[[FOR_COND_CLEANUP:.*]], label %[[SCALAR_PH]]
+; CHECK: [[SCALAR_PH]]:
+; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 16000, %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT: br label %[[FOR_BODY:.*]]
+; CHECK: [[FOR_COND_CLEANUP]]:
+; CHECK-NEXT: ret void
+; CHECK: [[FOR_BODY]]:
+; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY]] ]
+; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 [[INDVARS_IV]]
+; CHECK-NEXT: [[TMP19:%.*]] = load float, ptr [[ARRAYIDX]], align 4
+; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds [32000 x float], ptr @b, i64 0, i64 [[INDVARS_IV]]
+; CHECK-NEXT: [[TMP20:%.*]] = load float, ptr [[ARRAYIDX2]], align 4
+; CHECK-NEXT: [[ADD:%.*]] = fadd fast float [[TMP20]], [[TMP19]]
+; CHECK-NEXT: [[TMP21:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 16000
+; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 [[TMP21]]
+; CHECK-NEXT: store float [[ADD]], ptr [[ARRAYIDX5]], align 4
+; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 16000
+; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label %[[FOR_COND_CLEANUP]], label %[[FOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+;
+entry:
+ br label %for.body
+
+for.cond.cleanup:
+ ret void
+
+for.body:
+ %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+ %arrayidx = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 %indvars.iv
+ %0 = load float, ptr %arrayidx, align 4
+ %arrayidx2 = getelementptr inbounds [32000 x float], ptr @b, i64 0, i64 %indvars.iv
+ %1 = load float, ptr %arrayidx2, align 4
+ %add = fadd fast float %1, %0
+ %2 = add nuw nsw i64 %indvars.iv, 16000
+ %arrayidx5 = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 %2
+ store float %add, ptr %arrayidx5, align 4
+ %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+ %exitcond.not = icmp eq i64 %indvars.iv.next, 16000
+ br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+}
+
+define void @NeoverseV1() #1 {
+; CHECK-LABEL: define void @NeoverseV1(
+; CHECK-SAME: ) #[[ATTR1:[0-9]+]] {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 8
+; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 16000, [[TMP1]]
+; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK: [[VECTOR_PH]]:
+; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 8
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 16000, [[TMP3]]
+; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 16000, [[N_MOD_VF]]
+; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 8
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[TMP6:%.*]] = add i64 [[INDEX]], 0
+; CHECK-NEXT: [[TMP7:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP8:%.*]] = mul i64 [[TMP7]], 4
+; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[TMP8]], 0
+; CHECK-NEXT: [[TMP10:%.*]] = mul i64 [[TMP9]], 1
+; CHECK-NEXT: [[TMP11:%.*]] = add i64 [[INDEX]], [[TMP10]]
+; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 [[TMP6]]
+; CHECK-NEXT: [[TMP13:%.*]] = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 [[TMP11]]
+; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds float, ptr [[TMP12]], i32 0
+; CHECK-NEXT: [[TMP15:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP16:%.*]] = mul i64 [[TMP15]], 4
+; CHECK-NEXT: [[TMP17:%.*]] = getelementptr inbounds float, ptr [[TMP12]], i64 [[TMP16]]
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x float>, ptr [[TMP14]], align 4
+; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <vscale x 4 x float>, ptr [[TMP17]], align 4
+; CHECK-NEXT: [[TMP18:%.*]] = getelementptr inbounds [32000 x float], ptr @b, i64 0, i64 [[TMP6]]
+; CHECK-NEXT: [[TMP19:%.*]] = getelementptr inbounds [32000 x float], ptr @b, i64 0, i64 [[TMP11]]
+; CHECK-NEXT: [[TMP20:%.*]] = getelementptr inbounds float, ptr [[TMP18]], i32 0
+; CHECK-NEXT: [[TMP21:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP22:%.*]] = mul i64 [[TMP21]], 4
+; CHECK-NEXT: [[TMP23:%.*]] = getelementptr inbounds float, ptr [[TMP18]], i64 [[TMP22]]
+; CHECK-NEXT: [[WIDE_LOAD2:%.*]] = load <vscale x 4 x float>, ptr [[TMP20]], align 4
+; CHECK-NEXT: [[WIDE_LOAD3:%.*]] = load <vscale x 4 x float>, ptr [[TMP23]], align 4
+; CHECK-NEXT: [[TMP24:%.*]] = fadd fast <vscale x 4 x float> [[WIDE_LOAD2]], [[WIDE_LOAD]]
+; CHECK-NEXT: [[TMP25:%.*]] = fadd fast <vscale x 4 x float> [[WIDE_LOAD3]], [[WIDE_LOAD1]]
+; CHECK-NEXT: [[TMP26:%.*]] = add nuw nsw i64 [[TMP6]], 16000
+; CHECK-NEXT: [[TMP27:%.*]] = add nuw nsw i64 [[TMP11]], 16000
+; CHECK-NEXT: [[TMP28:%.*]] = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 [[TMP26]]
+; CHECK-NEXT: [[TMP29:%.*]] = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 [[TMP27]]
+; CHECK-NEXT: [[TMP30:%.*]] = getelementptr inbounds float, ptr [[TMP28]], i32 0
+; CHECK-NEXT: [[TMP31:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP32:%.*]] = mul i64 [[TMP31]], 4
+; CHECK-NEXT: [[TMP33:%.*]] = getelementptr inbounds float, ptr [[TMP28]], i64 [[TMP32]]
+; CHECK-NEXT: store <vscale x 4 x float> [[TMP24]], ptr [[TMP30]], align 4
+; CHECK-NEXT: store <vscale x 4 x float> [[TMP25]], ptr [[TMP33]], align 4
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP5]]
+; CHECK-NEXT: [[TMP34:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT: br i1 [[TMP34]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 16000, [[N_VEC]]
+; CHECK-NEXT: br i1 [[CMP_N]], label %[[FOR_COND_CLEANUP:.*]], label %[[SCALAR_PH]]
+; CHECK: [[SCALAR_PH]]:
+; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT: br label %[[FOR_BODY:.*]]
+; CHECK: [[FOR_COND_CLEANUP]]:
+; CHECK-NEXT: ret void
+; CHECK: [[FOR_BODY]]:
+; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY]] ]
+; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 [[INDVARS_IV]]
+; CHECK-NEXT: [[TMP35:%.*]] = load float, ptr [[ARRAYIDX]], align 4
+; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds [32000 x float], ptr @b, i64 0, i64 [[INDVARS_IV]]
+; CHECK-NEXT: [[TMP36:%.*]] = load float, ptr [[ARRAYIDX2]], align 4
+; CHECK-NEXT: [[ADD:%.*]] = fadd fast float [[TMP36]], [[TMP35]]
+; CHECK-NEXT: [[TMP37:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 16000
+; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 [[TMP37]]
+; CHECK-NEXT: store float [[ADD]], ptr [[ARRAYIDX5]], align 4
+; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 16000
+; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label %[[FOR_COND_CLEANUP]], label %[[FOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
+;
+entry:
+ br label %for.body
+
+for.cond.cleanup:
+ ret void
+
+for.body:
+ %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+ %arrayidx = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 %indvars.iv
+ %0 = load float, ptr %arrayidx, align 4
+ %arrayidx2 = getelementptr inbounds [32000 x float], ptr @b, i64 0, i64 %indvars.iv
+ %1 = load float, ptr %arrayidx2, align 4
+ %add = fadd fast float %1, %0
+ %2 = add nuw nsw i64 %indvars.iv, 16000
+ %arrayidx5 = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 %2
+ store float %add, ptr %arrayidx5, align 4
+ %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+ %exitcond.not = icmp eq i64 %indvars.iv.next, 16000
+ br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+}
+
+attributes #0 = { mustprogress nofree norecurse nosync nounwind memory(readwrite, argmem: none, inaccessiblemem: none) uwtable vscale_range(1,16) "approx-func-fp-math"="true" "frame-pointer"="non-leaf" "no-infs-fp-math"="true" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="neoverse-v2" "target-features"="+sve,+sve2,+v9a" "unsafe-fp-math"="true" }
+
+attributes #1 = { mustprogress nofree norecurse nosync nounwind memory(readwrite, argmem: none, inaccessiblemem: none) uwtable vscale_range(1,16) "approx-func-fp-math"="true" "frame-pointer"="non-leaf" "no-infs-fp-math"="true" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="neoverse-v1" "target-features"="+sve,+v9a" "unsafe-fp-math"="true" }
+;.
+; CHECK: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
+; CHECK: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}
+; CHECK: [[META2]] = !{!"llvm.loop.unroll.runtime.disable"}
+; CHECK: [[LOOP3]] = distinct !{[[LOOP3]], [[META2]], [[META1]]}
+; CHECK: [[LOOP4]] = distinct !{[[LOOP4]], [[META1]], [[META2]]}
+; CHECK: [[LOOP5]] = distinct !{[[LOOP5]], [[META2]], [[META1]]}
+;.
✅ With the latest revision this PR passed the C/C++ code formatter.
This adds a new hook to prefer fixed width loop vectorization over scalable. This will be used in the loop vectoriser to generate more NEON code instead of SVE if the cost-model assigns equal costs to fixed and scalable versions of the vectorised loop. This is used in llvm#95819.
if (!TTI.preferFixedIfEqualToScalable())
  PreferScalable = A.Width.isScalable() && !B.Width.isScalable();
This may be better as
if (!TTI.preferFixedIfEqualToScalable())
PreferScalable = !A.Width.isScalable() && B.Width.isScalable();
else
PreferScalable = A.Width.isScalable() && !B.Width.isScalable();
This way it will not be dependent upon the order it visits the vplans.
PreferScalable could then do with a better name too.
I have reverted to my initial implementation: your suggestion has a wider impact that I would like to look at in a follow-up, because I do agree it would make sense.
attributes #0 = { mustprogress nofree norecurse nosync nounwind memory(readwrite, argmem: none, inaccessiblemem: none) uwtable vscale_range(1,16) "approx-func-fp-math"="true" "frame-pointer"="non-leaf" "no-infs-fp-math"="true" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="neoverse-v2" "target-features"="+sve,+sve2,+v9a" "unsafe-fp-math"="true" }

attributes #1 = { mustprogress nofree norecurse nosync nounwind memory(readwrite, argmem: none, inaccessiblemem: none) uwtable vscale_range(1,16) "approx-func-fp-math"="true" "frame-pointer"="non-leaf" "no-infs-fp-math"="true" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="neoverse-v1" "target-features"="+sve,+v9a" "unsafe-fp-math"="true" }
Can you make this a CPU without a 256-bit SVE vector length? Or maybe cpu=generic.
Hi @sjoerdmeijer, in the loops that you looked at where SVE performed worse, did the SVE code itself look optimal, i.e. as good as it can be? The only thing I'm nervous about here is that we're switching to NEON due to poor codegen or a bad cost model, rather than due to a fundamental problem with SVE. For example, if we're generating more code in the SVE loop then that suggests the costs for the SVE loop are just too low, so by hiking the costs we would favour NEON anyway. We should definitely fix any cost model issues like that where we can. Also, if you have any examples where SVE codegen is actually worse can you create issues for them? Thanks!
Just to be clear, I'm not against the idea of having a flag or enabling it by default on neoverse-v2, but I am just a bit nervous about any problems it may hide that we could be usefully fixing, that's all.
Considering the rationale as explained in the second paragraph, where I see phrases like "predicates does not come entirely for free", "NEON tends to generate slightly less code" and "SVE codegen is slightly less optimal", I'm opposed to having this flag. If there's other rationale then great, please elaborate, but otherwise to me this looks like a hack to work around cost model deficiencies and generally poor code generation, both of which should be fixed rather than hidden.
This is actually quite a complicated story; it's a combination of a few factors: a few micro-architectural reasons and (SVE) codegen reasons. To give a better introduction to the problem, we have a number of examples similar to this:
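For illustration, a kernel of this shape (reconstructed from the IR test added in this patch, so the exact source is an assumption) looks like:

// Small TSVC-style kernel: known trip count, contiguous accesses, no need for
// predication. Mirrors the loop in the new .ll test: a[i + 16000] = a[i] + b[i].
constexpr int N = 16000;
float a[2 * N], b[2 * N];

void kernel() {
  for (int i = 0; i < N; i++)
    a[i + N] = a[i] + b[i];
}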
This is GCC's output, and LLVM's output with this patch:
LLVM's output is something like this:
There is nothing fundamentally wrong with LLVM's codegen, but it performs a lot worse. One of the micro-architectural reasons is documented in section "4.1 Dispatch constraints" of the SWOG:
The smaller kernels fit in these dispatch constraints, the bigger ones don't, resulting in significant performance differences. Most of the performance can be clawed back by interleaving more. But then there's clearly a code quality issue: the amount of code necessary to get on par with the NEON kernel would be disproportionate. Two more subjective arguments:
Other factors are slightly more complicated SVE addressing modes, also resulting in more MOPS. We are investigating other micro-architectural issues, but I cannot comment on this yet.
Does that in fact suggest for this case that we should be interleaving less in LLVM and be tuning that accordingly? GCC's output does use NEON, but there is no interleaving. Have you tried disabling interleaving when using SVE in LLVM so that it's a fairer comparison with GCC? I'd be interested to know how that performs.
I definitely see issues with generating the addresses in the SVE version and we should fix that, but if we assume those issues are fixed and we have a fair comparison with GCC, do we still see the SVE loop being slower?
Yes, we have also experimented with that, e.g. we nano-benchmarked this:
where we hoisted the adds out of the loop, and this also performs worse. This is the other micro-architectural reason I was talking about. We do have a suspicion what this could be, but cannot confirm it yet. So, I hope I have given enough examples that demonstrate that unexpected performance results happen, not necessarily SVE's fault, but there's a higher chance of hitting micro-architectural issues and secondary effects for this class of small kernels and loops. I think there's also an element of "keeping it simple" to it, if you see what I mean.
@paulwalker-arm, @david-arm: I think a very fair question is why we should be generating SVE code for this type of kernel? Can you explain the benefit?
The general answer is the cost model says there's no reason not to use scalable vectors, and with the potential for them to be bigger than their fixed length counterparts there's scope for more performance beyond the assumed minimum. What worries me is that you're not suggesting there's a subtle performance difference but in fact the difference is pretty large. I just think this should be captured as either a more accurate cost model or implementation-specific code generation decisions (e.g. …).
Yes, this is the only reason: performance portability of the same binary running on a potentially bigger SVE implementation. This is not helping the Neoverse V2 though. We get very unpredictable and bad performance for small SVE kernels, and that's the problem we are dealing with here. The reasons are twofold: micro-architectural reasons, and less mature SVE codegen. The latter can be fixed over time, but then there is still the former. And I do want to stress again that predication for this class of loops is simply not necessary.
So, all of these problems go away if we resort to NEON when the cost-model assigns equal cost. I could think about a narrower heuristic: if the kernel is small, there are no masked loads/stores, and the cost-model is a tie, favour fixed, something along those lines. But I am afraid you're not going to like that either, because you just like to see SVE code here. Cost-modeling is what we are doing here one way or another, but I think this is not just a matter of "let's increase the cost of instruction XYZ a little bit". Preferring fixed if the cost-model cannot decide looks very reasonable to me.
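A rough sketch of that narrower heuristic, with simplified stand-in types (hypothetical; not what this patch implements):

// Only break a cost tie in favour of fixed-width when the loop is small and
// scalable vectorization would not need predication.
struct LoopSummary {
  unsigned NumInstructions;   // rough size of the loop body
  bool NeedsMaskedLoadStore;  // would the scalable plan require predication?
};

bool preferFixedOnTie(const LoopSummary &L) {
  const unsigned SmallLoopThreshold = 16;  // purely illustrative value
  return L.NumInstructions <= SmallLoopThreshold && !L.NeedsMaskedLoadStore;
}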
I agree that this seems perfectly sensible to me. I tried it on a few other machines (a mixture of emulation and real hardware) and it looked profitable to pick NEON on a tie. (Even the Neoverse V1 with a 256-bit SVE length came out with an ever-so-slight, meaninglessly small performance gain apparently. Obviously there were a lot of things getting better and worse.) The fact that NEON has LDP and postinc would feel like enough of a reason to favour NEON on a tie. Cost modelling in the mid-end is always going to be fairly high level and there will always be ties. Favouring NEON sounds like the most reasonable option between the two, provided we have performance to back it up.
Please don't think me an SVE zealot here. I'd just rather see the compiler make decisions based on data from the IR rather than implement a switch that effectively means "ignore the data". I appreciate this might not always be clean, for example take a look at …
Whereas I'd say "The fact that NEON has LDP and postinc would feel like enough of a reason for NEON to have a lower cost when comparing equivalent vector lengths."
I don't think tweaking the cost of a load instruction is going to cut it here. It is too fragile and is going to have so many side effects that it probably doesn't achieve what we want. I don't really see that as a way forward.
I see what you mean but I agree and disagree with that statement. First of all, we unleash the cost-model on fixed and scalable and all sorts of VFs and combinations, and if there is a tie we happen to default to scalable. Thus, the status quo is that we make an arbitrary decision that is similarly not based on any data or IR or empirical data. The "arbitration" in case of a tie and proposal to change to fixed is based on empirical data though, which is clearly an improvement over the status quo. However, perhaps we agree on this point now, and the remaining point of discussion is that we could make the arbitration a somewhat better informed decision now or in the future by implementing a similar hook:
As the original author of that hook, I am very well aware and I am happy with this suggestion and this:
It's similar, I think, to the heuristic that I sketched earlier: a scan of the loop body to see if small loops don't require predication. And so I am happy to go down that route to progress this. Please let me know what you think.
So here are my thoughts. It would be very useful to have some performance data attached to this PR for benchmarks such as SPEC and/or real-world applications demonstrating the problems. As far as I understand it, TSVC is not really meant for benchmarking and is in fact meant to showcase the breadth of a compiler's vectorisation capabilities, although I do understand it's useful for highlighting potential issues. Like @paulwalker-arm said, it's not about blindly favouring SVE in the face of overwhelming data, but more about making the best choices based on data. Here are my thoughts:
Having said that, I definitely understand the LDP/STP problem, but I agree with @paulwalker-arm that if possible it would be nice to see this taken into account with better cost modelling. This is just a suggestion, but you could tweak this patch as follows:
where … What do you think?
I think this is excellent. Thanks for writing this all up! This is pretty much what I now had in mind too, so yes, let's go for this approach. To be clear, I don't care about TSVC. It's well appreciated that it's synthetic, and I can argue that TSVC is as synthetic as x264 from SPEC INT, which is the app that is most amenable to vectorisation in the suite, because real x264 implementations are not implemented like that. But the way I look at this is getting the basics right. Small loops do exist, and if, for example, you create a reproducer, unpredictable and bad performance is really not helping, and that's an understatement. I already mentioned my performance data: SPEC INT + FP are unaffected, and I see 2x improvements in TSVC and some other benchmark. I will get additional data for the llvm test-suite. But it sounded like @davemgreen did a sweep too. I will now go ahead and change this patch as suggested. Please do shout if there are still concerns.
Yeah sure, my suggested implementation was perhaps a bit simplistic. :) You're right that it needs a check for the CPU as well. If other CPUs with similar issues would benefit then they could be added here over time as well.
Thanks all, sounds like we have a way forward. Ultimately I think we should transition the cost modelling to be tree-based rather than instruction-based, because that's where much of the frailty comes from. That's very much long-term work, so not specifically relevant to this immediate goal. Whilst I look forward to the results, I do want to emphasise the "is inc/dec VL bad for Neoverse V2" point. That feels like something we should get to the bottom of asap and just add the necessary feature flag. I have a final …
sgtm.
I will ping you an email with details about this because I can't discuss that here. I will do that early next week because I am off tomorrow + Friday.
Yes, you're right, that's one of the main questions here. I will add it as an early-exit or fold it into …
LDP/STP was only one of the reasons to prefer fixed-width over scalable vectorization on a tie. I have been trying to think of reasons why we would want to favour scalable vectorization over fixed-width, and if we know that the scalable vector length will not be higher, as is the case here, then I couldn't think of anything that really makes scalable vectors the better choice. @david-arm I see where you are coming from with your suggestion, but it feels like an over-engineered way of only solving part of the problem, and at the end of the day, when we don't know any better and can't make the decision based on anything else, fixed width should be the preferred option.
To create a new baseline, I have merged the TTI changes in here (and addressed the minor comments). I have not yet added more logic to preferFixedIfEqualToScalable, but I am struggling a little, and I have (unsurprisingly) a lot of sympathy for Dave's remark about this looking a bit over-engineered. I will give it another try though, as this is what I promised to do.
if (!TTI.preferFixedOverScalableIfEqualCost())
  PreferScalable = !A.Width.isScalable() && B.Width.isScalable();
else
  PreferScalable = A.Width.isScalable() && !B.Width.isScalable();
This can be moved into the initializer of the variable. May also be easier to read if the if-statement condition is inverted.
I have added …
Gentle ping.
As we don't want to base codegen strategies on one benchmark only, here are two examples from production code, @paulwalker-arm, @david-arm, @davemgreen. The first example is extracted from an HPC app; this is the SVE kernel before:
And here's the NEON kernel after, which is a case where LDP/STP is helping a lot:
The second production code snippet is a dequantisation kernel:
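For illustration, a dequantisation loop of this general shape (an assumed stand-in, not the actual production code):

#include <cstdint>

// Widen an int8 input and scale it into a float output; a plain counted loop,
// so no predication is required.
void dequantize(float *out, const int8_t *in, float scale, int n) {
  for (int i = 0; i < n; i++)
    out[i] = scale * static_cast<float>(in[i]);
}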
It's an example where, first of all, SVE predication is unnecessary and, second, it leads to code bloat:
Even if the SVE codegen could be optimised here, it cannot compete with this NEON kernel in both code quality and performance:
Hi @sjoerdmeijer, thanks for putting up the new patch! I realise I'm backtracking on this at the risk of you being annoyed :), but I now wonder if it's worth reverting, for now, to the original version that always favours fixed-width vectorisation in the event of a cost-model tie, specifically for neoverse-v2. I do absolutely think that where we find issues with the cost model or codegen using scalable vectors we should try to fix those, but I can buy the argument that for many simple loops in practice the SVE predication does not come entirely for free when using scalable vectorisation, and that there is a slight reason to favour fixed-width.
Absolutely no problem. :) Thanks for your patience and help with this, @david-arm. I have reverted this to the previous version, which should have all minor comments addressed.
@@ -244,6 +244,10 @@ def FeatureExperimentalZeroingPseudos
def FeatureUseScalarIncVL : SubtargetFeature<"use-scalar-inc-vl",
                                             "UseScalarIncVL", "true", "Prefer inc/dec over add+cnt">;

def FeatureUseFixedOverScalableIfEqualCost: SubtargetFeature<"use-fixed-over-scalable-equal-cost",
Perhaps I'm being a bit pedantic here (tell me if I am!), but I think it makes sense to have the flag match the feature name, so that the flag should be "use-fixed-over-scalable-if-equal-cost". My thinking is that it's only 3 extra characters in an already long name, and is a bit clearer/consistent. :)
@@ -4780,7 +4780,10 @@ bool LoopVectorizationPlanner::isMoreProfitable(
  // Assume vscale may be larger than 1 (or the value being tuned for),
  // so that scalable vectorization is slightly favorable over fixed-width
  // vectorization.
  bool PreferScalable = A.Width.isScalable() && !B.Width.isScalable();
  bool  PreferScalable = A.Width.isScalable() && !B.Width.isScalable();
nit: There's an extra white space between bool and PreferScalable.
@@ -4780,7 +4780,10 @@ bool LoopVectorizationPlanner::isMoreProfitable(
  // Assume vscale may be larger than 1 (or the value being tuned for),
  // so that scalable vectorization is slightly favorable over fixed-width
  // vectorization.
  bool PreferScalable = A.Width.isScalable() && !B.Width.isScalable();
  bool  PreferScalable = A.Width.isScalable() && !B.Width.isScalable();
  if (TTI.preferFixedOverScalableIfEqualCost())
Is it worth rewriting this to be something like:
bool PreferScalable = !TTI.preferFixedOverScalableIfEqualCost() &&
A.Width.isScalable() && !B.Width.isScalable();
  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
}

attributes #0 = { mustprogress nofree norecurse nosync nounwind memory(readwrite, argmem: none, inaccessiblemem: none) uwtable vscale_range(1,16) "target-cpu"="neoverse-v2" "target-features"="+sve,+sve2,+v9a" }
I think all the attributes except for vscale_range and target-cpu can be removed here. Also, since you've added a new target feature it probably also makes sense to have a test for "target-cpu"="generic" "target-features"="+use-fixed-over-scalable-equal-cost" if possible.
@a = dso_local local_unnamed_addr global [32000 x float] zeroinitializer, align 64
@b = dso_local local_unnamed_addr global [32000 x float] zeroinitializer, align 64

define void @NeoverseV2() local_unnamed_addr #0 {
nit: Can you remove local_unnamed_addr from both functions as a cleanup?
@@ -0,0 +1,196 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
I wonder if it's necessary to run utils/update_test_checks.py on this test, since all you're really testing is that we vectorised with a particular VF. For example, you could reduce the number of CHECK lines in each function to just the bare minimum:
; CHECK-LABEL: define void @NeoverseV2(
; CHECK: store <4 x float>
@@ -0,0 +1,196 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
; RUN: opt -S < %s -passes=loop-vectorize | FileCheck %s
I wonder if this way of testing is potentially fragile due to future cost model changes. It might end up choosing the values you expect anyway regardless of your new feature. I've not tried this out, but can you use -force-target-instruction-cost=1 to ensure you're testing the new code path?
Thanks @david-arm, I have addressed all comments.
define void @GenericCPUPreferFixed() #2 {
; CHECK-LABEL: define void @GenericCPUPreferFixed(
; CHECK: store <vscale x 4 x float>
Hmm, this looks odd. Perhaps I've misunderstood how the feature works, but this seems to be preferring scalable? I was expecting us to use VF=4 here.
Ah yes, that does look odd. We are certainly expecting VF=4. Maybe there's some interaction with -force-target-instruction-cost=1; I will have a look at what is going on here.
The reason is that the cost-model isn't a tie anymore. The cost-model indicates vscale is better, so it isn't about preferring fixed:
Vector loop of width vscale x 4 costs: 1
vs.
Vector loop of width 4 costs: 3
The negative tests not using Neoverse V2 are a bit of a pain to come up with; I would need to find other examples where the cost model is a tie for cpu=generic.
Ah of course, I see. The only way you can test the feature by itself then is to set the CPU to something where VScaleForTuning is set to 1, e.g. neoverse-n2 or cortex-a710. But then that feels a bit unreliable too. In that case, it may be easiest just to remove this particular test?
Thanks, I have removed this test, and I don't think we lose (much) testing/coverage.
[LV][AArch64] Prefer Fixed over Scalable if cost-model is equal (Neoverse V2)

For the Neoverse V2 we would like to prefer fixed width over scalable vectorisation if the cost-model assigns an equal cost to both for certain loops. This improves 7 kernels from TSVC-2 and several production kernels by about 2x, and does not affect SPEC2017 INT and FP.

This also adds a new TTI hook that can steer the loop vectorizer towards preferring fixed width vectorization, which can be set per CPU. For now, this is only enabled for the Neoverse V2.

There are 3 reasons why preferring NEON might be better in the case the cost-model is a tie and the SVE vector size is the same as NEON (128-bit): architectural reasons, micro-architectural reasons, and SVE codegen reasons. The latter will be improved over time, so the more important reasons are the former two. I.e., a (micro)architectural reason is the use of LDP/STP instructions, which are not available in SVE, and it avoids predication.

For what it is worth: this codegen strategy to generate more NEON is in line with GCC's codegen strategy, which is actually even more aggressive in generating NEON when no predication is required. We could be smarter about the decision making, but this seems to be a good first step in the right direction, and we can always revise this later (for example make the target hook more general).
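As a rough, self-contained model of the tie-break this introduces (simplified stand-in types; the real code lives in LoopVectorizationPlanner::isMoreProfitable and the new AArch64 TTI hook):

#include <cstdint>
#include <iostream>

struct Candidate {
  uint64_t Cost;  // cost-model estimate for the whole vector loop
  bool Scalable;  // true for <vscale x N> plans, false for fixed-width
};

// Stand-in for TTI.preferFixedOverScalableIfEqualCost(); true when the target
// (e.g. Neoverse V2) sets the new subtarget feature.
constexpr bool PreferFixedOnTie = true;

// A is considered more profitable than B; a scalable plan only wins a cost
// tie when the target has not asked for fixed-width to be preferred.
bool isMoreProfitable(const Candidate &A, const Candidate &B) {
  bool PreferScalable = !PreferFixedOnTie && A.Scalable && !B.Scalable;
  return PreferScalable ? A.Cost <= B.Cost : A.Cost < B.Cost;
}

int main() {
  Candidate SVE{10, true}, NEON{10, false};
  std::cout << (isMoreProfitable(SVE, NEON) ? "SVE wins the tie\n"
                                            : "NEON wins the tie\n");
}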
LGTM! Thanks for making the changes.
Thanks for all your help with this, I really appreciate it.
This is inspired by llvm#95819. Some kernels like s000 show improvements: we can reduce the code for calculating the vector length and fully unroll the tail epilogue. Currently, we add a SubtargetFeature for this, and processors can add it if needed.