
Commit 2a859b2

[AArch64] Change the cost of vector insert/extract to 2
The cost of vector instructions has always been high under AArch64, in order to add a high cost for inserts/extracts, shuffles and scalarization. This is a conservative approach to limit the scope of unusual SLP vectorization where the codegen ends up being quite poor, but it has always been higher than the correct costs would be for any specific core.

This relaxes that, reducing the vector insert/extract cost from 3 to 2. It is a generalization of D142359 to all AArch64 CPUs. The ScalarizationOverhead is also overridden for integer vectors at the same time, to remove the effect of lane 0 being considered free for integer vectors (something that should only be true for float when scalarizing).

The lower insert/extract cost will reduce the cost of inserts, extracts, shuffling and scalarization. The adjustment of ScalarizationOverhead will increase the cost for integer vectors, especially small ones. The end result will be a lower cost for float and long-integer types, and a somewhat higher cost for some smaller vectors. This, along with the raw insert/extract cost being lower, will generally mean more vectorization from the Loop and SLP vectorizers.

We may end up regretting this, as that vectorization is not always profitable. In all the benchmarking I have done this is generally an improvement in overall performance, and I've attempted to address the places where it wasn't with other cost model adjustments.

Differential Revision: https://reviews.llvm.org/D155459
1 parent 4a3c865 commit 2a859b2


46 files changed, 985 additions and 1372 deletions

llvm/lib/Target/AArch64/AArch64Subtarget.h

Lines changed: 1 addition & 1 deletion
@@ -103,7 +103,7 @@ class AArch64Subtarget final : public AArch64GenSubtargetInfo {
 #include "AArch64GenSubtargetInfo.inc"
 
   uint8_t MaxInterleaveFactor = 2;
-  uint8_t VectorInsertExtractBaseCost = 3;
+  uint8_t VectorInsertExtractBaseCost = 2;
   uint16_t CacheLineSize = 0;
   uint16_t PrefetchDistance = 0;
   uint16_t MinPrefetchStride = 1;

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

Lines changed: 12 additions & 0 deletions
@@ -2560,6 +2560,18 @@ InstructionCost AArch64TTIImpl::getVectorInstrCost(const Instruction &I,
   return getVectorInstrCostHelper(&I, Val, Index, true /* HasRealUse */);
 }
 
+InstructionCost AArch64TTIImpl::getScalarizationOverhead(
+    VectorType *Ty, const APInt &DemandedElts, bool Insert, bool Extract,
+    TTI::TargetCostKind CostKind) {
+  if (isa<ScalableVectorType>(Ty))
+    return InstructionCost::getInvalid();
+  if (Ty->getElementType()->isFloatingPointTy())
+    return BaseT::getScalarizationOverhead(Ty, DemandedElts, Insert, Extract,
+                                           CostKind);
+  return DemandedElts.popcount() * (Insert + Extract) *
+         ST->getVectorInsertExtractBaseCost();
+}
+
 InstructionCost AArch64TTIImpl::getArithmeticInstrCost(
     unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
     TTI::OperandValueInfo Op1Info, TTI::OperandValueInfo Op2Info,
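
To make the arithmetic in the new integer path concrete, here is a minimal standalone sketch of the final `return` expression in `getScalarizationOverhead`. It does not use LLVM: the function name `integerScalarizationOverhead`, the 64-bit mask standing in for `APInt DemandedElts`, and the constant `kVectorInsertExtractBaseCost` are assumptions made for illustration, and the scalable-vector and floating-point branches are left out.

```cpp
#include <bitset>
#include <cstdint>

// Hypothetical stand-in for ST->getVectorInsertExtractBaseCost() after this
// commit (the default changes from 3 to 2 in AArch64Subtarget.h).
constexpr unsigned kVectorInsertExtractBaseCost = 2;

// Sketch of the integer path only: every demanded lane pays the base cost
// once per requested operation (insert and/or extract). Unlike the
// floating-point path, lane 0 is not treated as free.
// demandedEltsMask is an illustrative replacement for APInt DemandedElts,
// limited to vectors of at most 64 lanes.
unsigned integerScalarizationOverhead(
    std::uint64_t demandedEltsMask, bool insert, bool extract,
    unsigned baseCost = kVectorInsertExtractBaseCost) {
  // Equivalent of DemandedElts.popcount().
  const unsigned demandedLanes =
      static_cast<unsigned>(std::bitset<64>(demandedEltsMask).count());
  // Equivalent of (Insert + Extract): 0, 1 or 2 operations per lane.
  const unsigned opsPerLane =
      static_cast<unsigned>(insert) + static_cast<unsigned>(extract);
  return demandedLanes * opsPerLane * baseCost;
}
```

Under these assumptions, a fully demanded four-lane integer vector with both inserts and extracts requested (`integerScalarizationOverhead(0xF, true, true)`) comes to 4 * 2 * 2 = 16.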

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h

Lines changed: 5 additions & 0 deletions
@@ -385,6 +385,11 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
                                  VectorType *SubTp,
                                  ArrayRef<const Value *> Args = std::nullopt);
 
+  InstructionCost getScalarizationOverhead(VectorType *Ty,
+                                           const APInt &DemandedElts,
+                                           bool Insert, bool Extract,
+                                           TTI::TargetCostKind CostKind);
+
   /// Return the cost of the scaling factor used in the addressing
   /// mode represented by AM for this target, for a load/store
   /// of the specified type.

llvm/test/Analysis/CostModel/AArch64/arith-fp.ll

Lines changed: 8 additions & 8 deletions
@@ -198,16 +198,16 @@ define i32 @fdiv(i32 %arg) {
 define i32 @frem(i32 %arg) {
 ; CHECK-LABEL: 'frem'
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %F16 = frem half undef, undef
-; CHECK-NEXT: Cost Model: Found an estimated cost of 26 for instruction: %V4F16 = frem <4 x half> undef, undef
-; CHECK-NEXT: Cost Model: Found an estimated cost of 58 for instruction: %V8F16 = frem <8 x half> undef, undef
-; CHECK-NEXT: Cost Model: Found an estimated cost of 116 for instruction: %V16F16 = frem <16 x half> undef, undef
+; CHECK-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V4F16 = frem <4 x half> undef, undef
+; CHECK-NEXT: Cost Model: Found an estimated cost of 44 for instruction: %V8F16 = frem <8 x half> undef, undef
+; CHECK-NEXT: Cost Model: Found an estimated cost of 88 for instruction: %V16F16 = frem <16 x half> undef, undef
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %F32 = frem float undef, undef
-; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V2F32 = frem <2 x float> undef, undef
-; CHECK-NEXT: Cost Model: Found an estimated cost of 26 for instruction: %V4F32 = frem <4 x float> undef, undef
-; CHECK-NEXT: Cost Model: Found an estimated cost of 52 for instruction: %V8F32 = frem <8 x float> undef, undef
+; CHECK-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V2F32 = frem <2 x float> undef, undef
+; CHECK-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V4F32 = frem <4 x float> undef, undef
+; CHECK-NEXT: Cost Model: Found an estimated cost of 40 for instruction: %V8F32 = frem <8 x float> undef, undef
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %F64 = frem double undef, undef
-; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V2F64 = frem <2 x double> undef, undef
-; CHECK-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V4F64 = frem <4 x double> undef, undef
+; CHECK-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V2F64 = frem <2 x double> undef, undef
+; CHECK-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %V4F64 = frem <4 x double> undef, undef
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
 ;
   %F16 = frem half undef, undef
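
The old and new `frem` costs in this hunk can all be reproduced by one simple formula, which helps show how the base cost change from 3 to 2 propagates through floating-point scalarization. The sketch below is a model fitted to the CHECK lines only, not the code path LLVM actually takes: the function `fremCostModel`, the assumption that vectors are split into 128-bit registers, and the assumption of two lane moves per non-zero lane (lane 0 being free) are all inferred from the numbers.

```cpp
#include <algorithm>

// Hypothetical model inferred from the CHECK lines in arith-fp.ll; it is not
// taken from the LLVM sources.
//   numElts  - number of vector lanes
//   eltBits  - element width in bits (16, 32 or 64)
//   baseCost - VectorInsertExtractBaseCost (3 before this commit, 2 after)
unsigned fremCostModel(unsigned numElts, unsigned eltBits, unsigned baseCost) {
  // Scalar frem cost, as reported by the scalar CHECK lines.
  const unsigned scalarFremCost = 2;
  // Assumption: wide vectors are split into 128-bit registers.
  const unsigned lanesPerReg = std::min(numElts, 128u / eltBits);
  const unsigned numRegs = numElts / lanesPerReg;
  // One scalar frem per lane.
  const unsigned scalarOps = lanesPerReg * scalarFremCost;
  // Assumption: two lane moves per lane except lane 0, which is free for
  // floating-point types.
  const unsigned laneMoves = 2 * (lanesPerReg - 1) * baseCost;
  return numRegs * (scalarOps + laneMoves);
}
```

For example, `fremCostModel(4, 32, 3)` gives 8 + 18 = 26 and `fremCostModel(4, 32, 2)` gives 8 + 12 = 20, matching the removed and added `<4 x float>` lines.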

llvm/test/Analysis/CostModel/AArch64/arith-overflow.ll

Lines changed: 6 additions & 6 deletions
@@ -355,9 +355,9 @@ declare {<64 x i8>, <64 x i1>} @llvm.smul.with.overflow.v64i8(<64 x i8>, <64 x
 define i32 @smul(i32 %arg) {
 ; RECIP-LABEL: 'smul'
 ; RECIP-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %I64 = call { i64, i1 } @llvm.smul.with.overflow.i64(i64 undef, i64 undef)
-; RECIP-NEXT: Cost Model: Found an estimated cost of 18 for instruction: %V2I64 = call { <2 x i64>, <2 x i1> } @llvm.smul.with.overflow.v2i64(<2 x i64> undef, <2 x i64> undef)
-; RECIP-NEXT: Cost Model: Found an estimated cost of 36 for instruction: %V4I64 = call { <4 x i64>, <4 x i1> } @llvm.smul.with.overflow.v4i64(<4 x i64> undef, <4 x i64> undef)
-; RECIP-NEXT: Cost Model: Found an estimated cost of 72 for instruction: %V8I64 = call { <8 x i64>, <8 x i1> } @llvm.smul.with.overflow.v8i64(<8 x i64> undef, <8 x i64> undef)
+; RECIP-NEXT: Cost Model: Found an estimated cost of 34 for instruction: %V2I64 = call { <2 x i64>, <2 x i1> } @llvm.smul.with.overflow.v2i64(<2 x i64> undef, <2 x i64> undef)
+; RECIP-NEXT: Cost Model: Found an estimated cost of 68 for instruction: %V4I64 = call { <4 x i64>, <4 x i1> } @llvm.smul.with.overflow.v4i64(<4 x i64> undef, <4 x i64> undef)
+; RECIP-NEXT: Cost Model: Found an estimated cost of 136 for instruction: %V8I64 = call { <8 x i64>, <8 x i1> } @llvm.smul.with.overflow.v8i64(<8 x i64> undef, <8 x i64> undef)
 ; RECIP-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %I32 = call { i32, i1 } @llvm.smul.with.overflow.i32(i32 undef, i32 undef)
 ; RECIP-NEXT: Cost Model: Found an estimated cost of 38 for instruction: %V4I32 = call { <4 x i32>, <4 x i1> } @llvm.smul.with.overflow.v4i32(<4 x i32> undef, <4 x i32> undef)
 ; RECIP-NEXT: Cost Model: Found an estimated cost of 76 for instruction: %V8I32 = call { <8 x i32>, <8 x i1> } @llvm.smul.with.overflow.v8i32(<8 x i32> undef, <8 x i32> undef)
@@ -437,9 +437,9 @@ declare {<64 x i8>, <64 x i1>} @llvm.umul.with.overflow.v64i8(<64 x i8>, <64 x
 define i32 @umul(i32 %arg) {
 ; RECIP-LABEL: 'umul'
 ; RECIP-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %I64 = call { i64, i1 } @llvm.umul.with.overflow.i64(i64 undef, i64 undef)
-; RECIP-NEXT: Cost Model: Found an estimated cost of 17 for instruction: %V2I64 = call { <2 x i64>, <2 x i1> } @llvm.umul.with.overflow.v2i64(<2 x i64> undef, <2 x i64> undef)
-; RECIP-NEXT: Cost Model: Found an estimated cost of 34 for instruction: %V4I64 = call { <4 x i64>, <4 x i1> } @llvm.umul.with.overflow.v4i64(<4 x i64> undef, <4 x i64> undef)
-; RECIP-NEXT: Cost Model: Found an estimated cost of 68 for instruction: %V8I64 = call { <8 x i64>, <8 x i1> } @llvm.umul.with.overflow.v8i64(<8 x i64> undef, <8 x i64> undef)
+; RECIP-NEXT: Cost Model: Found an estimated cost of 33 for instruction: %V2I64 = call { <2 x i64>, <2 x i1> } @llvm.umul.with.overflow.v2i64(<2 x i64> undef, <2 x i64> undef)
+; RECIP-NEXT: Cost Model: Found an estimated cost of 66 for instruction: %V4I64 = call { <4 x i64>, <4 x i1> } @llvm.umul.with.overflow.v4i64(<4 x i64> undef, <4 x i64> undef)
+; RECIP-NEXT: Cost Model: Found an estimated cost of 132 for instruction: %V8I64 = call { <8 x i64>, <8 x i1> } @llvm.umul.with.overflow.v8i64(<8 x i64> undef, <8 x i64> undef)
 ; RECIP-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %I32 = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 undef, i32 undef)
 ; RECIP-NEXT: Cost Model: Found an estimated cost of 37 for instruction: %V4I32 = call { <4 x i32>, <4 x i1> } @llvm.umul.with.overflow.v4i32(<4 x i32> undef, <4 x i32> undef)
 ; RECIP-NEXT: Cost Model: Found an estimated cost of 74 for instruction: %V8I32 = call { <8 x i32>, <8 x i1> } @llvm.umul.with.overflow.v8i32(<8 x i32> undef, <8 x i32> undef)

llvm/test/Analysis/CostModel/AArch64/bswap.ll

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ define void @neon() {
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v2i64 = call <2 x i64> @llvm.bswap.v2i64(<2 x i64> undef)
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v4i64 = call <4 x i64> @llvm.bswap.v4i64(<4 x i64> undef)
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v3i32 = call <3 x i32> @llvm.bswap.v3i32(<3 x i32> undef)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %v4i48 = call <4 x i48> @llvm.bswap.v4i48(<4 x i48> undef)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %v4i48 = call <4 x i48> @llvm.bswap.v4i48(<4 x i48> undef)
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
   %v4i16 = call <4 x i16> @llvm.bswap.v4i16(<4 x i16> undef)
