[AArch64] Improve cost model for legal subvec insert/extract #81135

Conversation
@llvm/pr-subscribers-llvm-analysis @llvm/pr-subscribers-backend-aarch64

Author: Graham Hunter (huntergr-arm)

Changes

Currently we model subvector inserts and extracts as shuffles, potentially going as far as scalarizing. If the types are legal then they can just be simple zip/unzip operations, or possibly even no-ops. Change the cost to a relatively small one to ensure that simple loops featuring such operations between fixed and scalable vector types that are effectively the same at a given SVE width can be unrolled and further optimized.

Full diff: https://github.com/llvm/llvm-project/pull/81135.diff

2 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index cdd2750521d2c9..0add50e5067886 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -568,6 +568,32 @@ AArch64TTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
}
return Cost;
}
+ case Intrinsic::vector_extract: {
+ // If both the vector argument and the return type are legal types, then
+ // this should be a no-op or simple operation; return a relatively low cost.
+ LLVMContext &C = RetTy->getContext();
+ EVT MRTy = getTLI()->getValueType(DL, RetTy);
+ EVT MPTy = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
+ TargetLoweringBase::LegalizeKind RLK = getTLI()->getTypeConversion(C, MRTy);
+ TargetLoweringBase::LegalizeKind PLK = getTLI()->getTypeConversion(C, MPTy);
+ if (RLK.first == TargetLoweringBase::TypeLegal &&
+ PLK.first == TargetLoweringBase::TypeLegal)
+ return InstructionCost(1);
+ break;
+ }
+ case Intrinsic::vector_insert: {
+ // If both the vector and subvector arguments are legal types, then this
+ // should be a no-op or simple operation; return a relatively low cost.
+ LLVMContext &C = RetTy->getContext();
+ EVT MTy0 = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
+ EVT MTy1 = getTLI()->getValueType(DL, ICA.getArgTypes()[1]);
+ TargetLoweringBase::LegalizeKind LK0 = getTLI()->getTypeConversion(C, MTy0);
+ TargetLoweringBase::LegalizeKind LK1 = getTLI()->getTypeConversion(C, MTy1);
+ if (LK0.first == TargetLoweringBase::TypeLegal &&
+ LK1.first == TargetLoweringBase::TypeLegal)
+ return InstructionCost(1);
+ break;
+ }
case Intrinsic::bitreverse: {
static const CostTblEntry BitreverseTbl[] = {
{Intrinsic::bitreverse, MVT::i32, 1},
diff --git a/llvm/test/Transforms/LoopUnroll/AArch64/scalable-vec-ins-ext.ll b/llvm/test/Transforms/LoopUnroll/AArch64/scalable-vec-ins-ext.ll
new file mode 100644
index 00000000000000..c69757abd4c1d2
--- /dev/null
+++ b/llvm/test/Transforms/LoopUnroll/AArch64/scalable-vec-ins-ext.ll
@@ -0,0 +1,91 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 3
+; RUN: opt -passes=loop-unroll,simplifycfg -S -mtriple aarch64 -mattr=+sve -aarch64-sve-vector-bits-min=128 %s | FileCheck %s -check-prefix=UNROLL-128
+; RUN: opt -passes=loop-unroll,simplifycfg -S -mtriple aarch64 -mattr=+sve -aarch64-sve-vector-bits-min=256 %s | FileCheck %s -check-prefix=UNROLL-256
+
+;; This test contains IR similar to what would be generated when SVE ACLE
+;; routines are used with fixed-width vector types -- lots of subvector inserts
+;; and extracts that are effectively just bitcasts since the types are the
+;; same at a given SVE bit size. We want to make sure that they are not a
+;; barrier to unrolling simple loops with a fixed trip count which could be
+;; further optimized.
+
+define void @test_ins_ext_cost(ptr readonly %a, ptr readonly %b, ptr readonly %c, ptr noalias %d) {
+; UNROLL-128-LABEL: define void @test_ins_ext_cost(
+; UNROLL-128-SAME: ptr readonly [[A:%.*]], ptr readonly [[B:%.*]], ptr readonly [[C:%.*]], ptr noalias [[D:%.*]]) #[[ATTR0:[0-9]+]] {
+; UNROLL-128-NEXT: entry:
+; UNROLL-128-NEXT: br label [[FOR_BODY:%.*]]
+; UNROLL-128: for.body:
+; UNROLL-128-NEXT: [[EXIT_COND:%.*]] = phi i1 [ true, [[ENTRY:%.*]] ], [ false, [[FOR_BODY]] ]
+; UNROLL-128-NEXT: [[IV:%.*]] = phi i64 [ 0, [[ENTRY]] ], [ 1, [[FOR_BODY]] ]
+; UNROLL-128-NEXT: [[GEP_A:%.*]] = getelementptr inbounds <8 x float>, ptr [[A]], i64 [[IV]]
+; UNROLL-128-NEXT: [[LOAD_A:%.*]] = load <8 x float>, ptr [[GEP_A]], align 16
+; UNROLL-128-NEXT: [[GEP_B:%.*]] = getelementptr inbounds <8 x float>, ptr [[B]], i64 [[IV]]
+; UNROLL-128-NEXT: [[LOAD_B:%.*]] = load <8 x float>, ptr [[GEP_B]], align 16
+; UNROLL-128-NEXT: [[GEP_C:%.*]] = getelementptr inbounds <8 x float>, ptr [[C]], i64 [[IV]]
+; UNROLL-128-NEXT: [[LOAD_C:%.*]] = load <8 x float>, ptr [[GEP_C]], align 16
+; UNROLL-128-NEXT: [[CAST_SCALABLE_B:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_B]], i64 0)
+; UNROLL-128-NEXT: [[CAST_SCALABLE_C:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_C]], i64 0)
+; UNROLL-128-NEXT: [[ADD:%.*]] = fadd <vscale x 4 x float> [[CAST_SCALABLE_B]], [[CAST_SCALABLE_C]]
+; UNROLL-128-NEXT: [[CAST_SCALABLE_A:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_A]], i64 0)
+; UNROLL-128-NEXT: [[MUL:%.*]] = fmul <vscale x 4 x float> [[CAST_SCALABLE_A]], [[ADD]]
+; UNROLL-128-NEXT: [[CAST_FIXED_D:%.*]] = tail call <8 x float> @llvm.vector.extract.v8f32.nxv4f32(<vscale x 4 x float> [[MUL]], i64 0)
+; UNROLL-128-NEXT: [[GEP_D:%.*]] = getelementptr inbounds <8 x float>, ptr [[D]], i64 0, i64 [[IV]]
+; UNROLL-128-NEXT: store <8 x float> [[CAST_FIXED_D]], ptr [[GEP_D]], align 16
+; UNROLL-128-NEXT: br i1 [[EXIT_COND]], label [[FOR_BODY]], label [[EXIT:%.*]]
+; UNROLL-128: exit:
+; UNROLL-128-NEXT: ret void
+;
+; UNROLL-256-LABEL: define void @test_ins_ext_cost(
+; UNROLL-256-SAME: ptr readonly [[A:%.*]], ptr readonly [[B:%.*]], ptr readonly [[C:%.*]], ptr noalias [[D:%.*]]) #[[ATTR0:[0-9]+]] {
+; UNROLL-256-NEXT: entry:
+; UNROLL-256-NEXT: [[LOAD_A:%.*]] = load <8 x float>, ptr [[A]], align 16
+; UNROLL-256-NEXT: [[LOAD_B:%.*]] = load <8 x float>, ptr [[B]], align 16
+; UNROLL-256-NEXT: [[LOAD_C:%.*]] = load <8 x float>, ptr [[C]], align 16
+; UNROLL-256-NEXT: [[CAST_SCALABLE_B:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_B]], i64 0)
+; UNROLL-256-NEXT: [[CAST_SCALABLE_C:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_C]], i64 0)
+; UNROLL-256-NEXT: [[ADD:%.*]] = fadd <vscale x 4 x float> [[CAST_SCALABLE_B]], [[CAST_SCALABLE_C]]
+; UNROLL-256-NEXT: [[CAST_SCALABLE_A:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_A]], i64 0)
+; UNROLL-256-NEXT: [[MUL:%.*]] = fmul <vscale x 4 x float> [[CAST_SCALABLE_A]], [[ADD]]
+; UNROLL-256-NEXT: [[CAST_FIXED_D:%.*]] = tail call <8 x float> @llvm.vector.extract.v8f32.nxv4f32(<vscale x 4 x float> [[MUL]], i64 0)
+; UNROLL-256-NEXT: store <8 x float> [[CAST_FIXED_D]], ptr [[D]], align 16
+; UNROLL-256-NEXT: [[GEP_A_1:%.*]] = getelementptr inbounds <8 x float>, ptr [[A]], i64 1
+; UNROLL-256-NEXT: [[LOAD_A_1:%.*]] = load <8 x float>, ptr [[GEP_A_1]], align 16
+; UNROLL-256-NEXT: [[GEP_B_1:%.*]] = getelementptr inbounds <8 x float>, ptr [[B]], i64 1
+; UNROLL-256-NEXT: [[LOAD_B_1:%.*]] = load <8 x float>, ptr [[GEP_B_1]], align 16
+; UNROLL-256-NEXT: [[GEP_C_1:%.*]] = getelementptr inbounds <8 x float>, ptr [[C]], i64 1
+; UNROLL-256-NEXT: [[LOAD_C_1:%.*]] = load <8 x float>, ptr [[GEP_C_1]], align 16
+; UNROLL-256-NEXT: [[CAST_SCALABLE_B_1:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_B_1]], i64 0)
+; UNROLL-256-NEXT: [[CAST_SCALABLE_C_1:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_C_1]], i64 0)
+; UNROLL-256-NEXT: [[ADD_1:%.*]] = fadd <vscale x 4 x float> [[CAST_SCALABLE_B_1]], [[CAST_SCALABLE_C_1]]
+; UNROLL-256-NEXT: [[CAST_SCALABLE_A_1:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_A_1]], i64 0)
+; UNROLL-256-NEXT: [[MUL_1:%.*]] = fmul <vscale x 4 x float> [[CAST_SCALABLE_A_1]], [[ADD_1]]
+; UNROLL-256-NEXT: [[CAST_FIXED_D_1:%.*]] = tail call <8 x float> @llvm.vector.extract.v8f32.nxv4f32(<vscale x 4 x float> [[MUL_1]], i64 0)
+; UNROLL-256-NEXT: [[GEP_D_1:%.*]] = getelementptr inbounds <8 x float>, ptr [[D]], i64 0, i64 1
+; UNROLL-256-NEXT: store <8 x float> [[CAST_FIXED_D_1]], ptr [[GEP_D_1]], align 16
+; UNROLL-256-NEXT: ret void
+;
+entry:
+ br label %for.body
+
+for.body:
+ %exit.cond = phi i1 [ true, %entry ], [ false, %for.body ]
+ %iv = phi i64 [ 0, %entry ], [ 1, %for.body ]
+ %gep.a = getelementptr inbounds <8 x float>, ptr %a, i64 %iv
+ %load.a = load <8 x float>, ptr %gep.a, align 16
+ %gep.b = getelementptr inbounds <8 x float>, ptr %b, i64 %iv
+ %load.b = load <8 x float>, ptr %gep.b, align 16
+ %gep.c = getelementptr inbounds <8 x float>, ptr %c, i64 %iv
+ %load.c = load <8 x float>, ptr %gep.c, align 16
+ %cast.scalable.b = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> %load.b, i64 0)
+ %cast.scalable.c = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> %load.c, i64 0)
+ %add = fadd <vscale x 4 x float> %cast.scalable.b, %cast.scalable.c
+ %cast.scalable.a = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> %load.a, i64 0)
+ %mul = fmul <vscale x 4 x float> %cast.scalable.a, %add
+ %cast.fixed.d = tail call <8 x float> @llvm.vector.extract.v8f32.nxv4f32(<vscale x 4 x float> %mul, i64 0)
+ %gep.d = getelementptr inbounds <8 x float>, ptr %d, i64 0, i64 %iv
+ store <8 x float> %cast.fixed.d, ptr %gep.d, align 16
+ br i1 %exit.cond, label %for.body, label %exit
+
+exit:
+ ret void
+}
@@ -568,6 +568,32 @@ AArch64TTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
  }
  return Cost;
}
case Intrinsic::vector_extract: {
I think it would be good to have proper cost model test cases in llvm/test/Analysis/CostModel/AArch64. You can look at existing examples as a guide.
TargetLoweringBase::LegalizeKind PLK = getTLI()->getTypeConversion(C, MPTy);
if (RLK.first == TargetLoweringBase::TypeLegal &&
    PLK.first == TargetLoweringBase::TypeLegal)
  return InstructionCost(1);
At first glance this seems too optimistic because there are cases where the cost is higher. See llvm/test/CodeGen/AArch64/sve-extract-scalable-vector.ll for some examples:
define <vscale x 4 x i1> @extract_nxv4i1_nxv16i1_0(<vscale x 16 x i1> %in) {
; CHECK-LABEL: extract_nxv4i1_nxv16i1_0:
; CHECK: // %bb.0:
; CHECK-NEXT: punpklo p0.h, p0.b
; CHECK-NEXT: punpklo p0.h, p0.b
; CHECK-NEXT: ret
%res = call <vscale x 4 x i1> @llvm.vector.extract.nxv4i1.nxv16i1(<vscale x 16 x i1> %in, i64 0)
ret <vscale x 4 x i1> %res
}
or
define <vscale x 2 x i1> @extract_nxv2i1_nxv16i1_0(<vscale x 16 x i1> %in) {
; CHECK-LABEL: extract_nxv2i1_nxv16i1_0:
; CHECK: // %bb.0:
; CHECK-NEXT: punpklo p0.h, p0.b
; CHECK-NEXT: punpklo p0.h, p0.b
; CHECK-NEXT: punpklo p0.h, p0.b
; CHECK-NEXT: ret
%res = call <vscale x 2 x i1> @llvm.vector.extract.nxv2i1.nxv16i1(<vscale x 16 x i1> %in, i64 0)
ret <vscale x 2 x i1> %res
}
These should have a higher cost than 1, I think. Or what about mixing legal fixed-width and scalable vector types, such as:
define void @foo(ptr %a, ptr %b) vscale_range(16,0) {
%op = load <vscale x 8 x i16>, ptr %a
  %ret = call <4 x i16> @llvm.vector.extract.v4i16.nxv8i16(<vscale x 8 x i16> %op, i64 4)
store <4 x i16> %ret, ptr %b
ret void
}
which leads to this assembly output:
foo: // @foo
str x29, [sp, #-16]! // 8-byte Folded Spill
addvl sp, sp, #-1
ptrue p0.h
ld1h { z0.h }, p0/z, [x0]
st1h { z0.h }, p0, [sp]
ldr d0, [sp, #8]
str d0, [x1]
addvl sp, sp, #1
ldr x29, [sp], #16 // 8-byte Folded Reload
ret
I've added a check to enforce that the lower cost is only allowed when the index is 0.
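For reference, a minimal sketch of what such an index-0 restriction could look like, written here as a standalone helper rather than quoting the actual patch; the name and placement are assumptions for illustration only:

```cpp
// Hypothetical helper (illustration only, not the committed patch): treat a
// subvector insert/extract as cheap only when both types are legal and the
// index operand is a constant zero.
static bool isCheapSubvectorAtIndexZero(const TargetLoweringBase &TLI,
                                        const DataLayout &DL, LLVMContext &C,
                                        Type *VecTy, Type *SubVecTy,
                                        const Value *IdxOp) {
  const auto *Idx = dyn_cast<ConstantInt>(IdxOp);
  if (!Idx || !Idx->isZero())
    return false;
  EVT VecVT = TLI.getValueType(DL, VecTy);
  EVT SubVecVT = TLI.getValueType(DL, SubVecTy);
  return TLI.getTypeConversion(C, VecVT).first ==
             TargetLoweringBase::TypeLegal &&
         TLI.getTypeConversion(C, SubVecVT).first ==
             TargetLoweringBase::TypeLegal;
}
```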
Just pointing out that the code isn't updated yet to handle predicates differently, as those inserts/extracts are indeed not free.
TargetLoweringBase::LegalizeKind LK1 = getTLI()->getTypeConversion(C, MTy1);
if (LK0.first == TargetLoweringBase::TypeLegal &&
    LK1.first == TargetLoweringBase::TypeLegal)
  return InstructionCost(1);
Same comment as the extract - I think this model is too simple and makes the cost of vector predicate inserts/extracts too cheap, as well as mixing legal fixed-width and scalable vector inserts. In particular, we should worry about inserting a fixed-width vector into a region of a scalable vector beyond the known minimum size.
Force-pushed from 0394c21 to 521921e.
if (RLK.first == TargetLoweringBase::TypeLegal &&
    PLK.first == TargetLoweringBase::TypeLegal && Idx &&
    Idx->getZExtValue() == 0)
  return InstructionCost(1);
Is there a reason this wouldn't be zero-cost?
Also, stylistically, to match the rest of this file, maybe return TTI::TCC_Free (if this is considered a cost of 0) or TTI::TCC_Basic (if this is considered a cost of 1) instead?
If the subvector is legal but smaller than the vector it's being inserted into, we would need an extra operation, e.g. inserting a <2 x float> (packed 64b NEON) into a <vscale x 2 x float> (unpacked 128b+ SVE) would need an unzip or unpack, I think?
My original idea with this was to only do it in cases where the size of the vectors was the same, but I was encouraged to extend that to legal types. I can walk it back to my initial idea if we're not sure about the extent of the change.
I guess the number of extra instructions required for extracting from an unpacked vector is 2log(SVEBitsPerBlock / (MinNumElts * EltSizeInBits)). For the insert that would require adding another TCC_Basic for the cost of the mov (when it's not inserting into an undef/poison value). This would automatically return TCC_Free (== 0) if the scalable vector is a packed vector.
// If arguments aren't actually supplied, then we cannot determine the
// value of the index.
if (ICA.getArgs().size() < 2)
nit:
-  if (ICA.getArgs().size() < 2)
+  if (ICA.getArgs().size() != 2)
Force-pushed from 521921e to c08072d.
Rebased on top of the updated precommit tests.
Ping.
Force-pushed from 9b6d7d6 to 37b3bbc.
EVT MRTy = getTLI()->getValueType(DL, RetTy);
EVT MPTy = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
nit:
-  EVT MRTy = getTLI()->getValueType(DL, RetTy);
-  EVT MPTy = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
+  EVT RetVT = getTLI()->getValueType(DL, RetTy);
+  EVT VecVT = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
?
if ((MRTy.isScalableVector() &&
     MRTy.getSizeInBits().getKnownMinValue() != AArch64::SVEBitsPerBlock) ||
When I comment this out, the tests still pass, so some test coverage is missing.
if ((MRTy.isScalableVector() &&
     MRTy.getSizeInBits().getKnownMinValue() != AArch64::SVEBitsPerBlock) ||
    (MPTy.isScalableVector() &&
     MPTy.getSizeInBits().getKnownMinValue() != AArch64::SVEBitsPerBlock))
-     MPTy.getSizeInBits().getKnownMinValue() != AArch64::SVEBitsPerBlock))
+     MPTy.getSizeInBits().getKnownMinValue() < AArch64::SVEBitsPerBlock))
Do you want to use a less-than comparison instead? Extracting a <vscale x 4 x float> from a <vscale x 8 x float> at index 0 is free. With this code it isn't free, but it's probably a good and common case to add?
Can you also create an isUnpackedVectorVT function to make this code a bit simpler?
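One possible shape for such a helper, as a sketch (the name comes from the suggestion above; the exact placement and wording of the committed code may differ):

```cpp
// A scalable vector whose known-minimum size is below a full 128-bit SVE
// register is "unpacked" and may need extra unpack/zip (or spill) work when
// inserting or extracting subvectors.
static bool isUnpackedVectorVT(EVT VecVT) {
  return VecVT.isScalableVector() &&
         VecVT.getSizeInBits().getKnownMinValue() < AArch64::SVEBitsPerBlock;
}
```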
EVT MRTy = getTLI()->getValueType(DL, RetTy);
EVT MPTy = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
// Skip this if either the return type or the vector argument are unpacked
// SVE types; they may get lowered to stack stores and loads.
> they may get lowered to stack stores and loads

Either that or uzp's?
I saw stores and loads being generated for non-predicate types (and we've already excluded those at this point). Might be something to fix later.
EVT MTy0 = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
EVT MTy1 = getTLI()->getValueType(DL, ICA.getArgTypes()[1]);
nit:
-  EVT MTy0 = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
-  EVT MTy1 = getTLI()->getValueType(DL, ICA.getArgTypes()[1]);
+  EVT VecVT = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
+  EVT SubVecVT = getTLI()->getValueType(DL, ICA.getArgTypes()[1]);
(MTy1.isScalableVector() &&
 MTy1.getSizeInBits().getKnownMinValue() != AArch64::SVEBitsPerBlock))
When I comment this out, the tests still pass, so some test coverage is missing.
@@ -45,27 +49,43 @@ define void @vector_insert_extract_idxzero_128b() #1 {
; TYPE_BASED_ONLY-NEXT: Cost Model: Invalid cost for instruction: %insert_nxv16i1_nxv2i1 = call <vscale x 16 x i1> @llvm.vector.insert.nxv16i1.nxv2i1(<vscale x 16 x i1> undef, <vscale x 2 x i1> undef, i64 0)
; TYPE_BASED_ONLY-NEXT: Cost Model: Invalid cost for instruction: %extract_nxv4i1_nxv16i1 = call <vscale x 4 x i1> @llvm.vector.extract.nxv4i1.nxv16i1(<vscale x 16 x i1> undef, i64 0)
; TYPE_BASED_ONLY-NEXT: Cost Model: Invalid cost for instruction: %extract_v8i1_nxv8i1 = call <8 x i1> @llvm.vector.extract.v8i1.nxv8i1(<vscale x 8 x i1> undef, i64 0)
; TYPE_BASED_ONLY-NEXT: Cost Model: Invalid cost for instruction: %insert_v2f32_nxv2f32 = call <vscale x 2 x float> @llvm.vector.insert.nxv2f32.v2f32(<vscale x 2 x float> undef, <2 x float> undef, i64 0) |
Perhaps not something to address in this patch, but I think it's wrong for this case to return an invalid cost, even when it cannot deduce anything for the operands. At worst it's a high cost for using spills and fills. But invalid cost basically says that we can't lower the general case, which isn't true.
That comes from the generic cost model code, which might just give up for scalable vectors. I guess we need to override the cost for more variants. But yes, in a later patch.
TargetLoweringBase::LegalizeKind RLK =
    getTLI()->getTypeConversion(C, RetVT);
TargetLoweringBase::LegalizeKind PLK =
    getTLI()->getTypeConversion(C, VecVT);
nit: This still uses the 'R' and 'P' names.
Thanks for the changes. LGTM.
TargetLoweringBase::LegalizeKind VecLK =
    getTLI()->getTypeConversion(C, VecVT);
const Value *Idx = IsExtract ? ICA.getArgs()[1] : ICA.getArgs()[2];
const ConstantInt *CIdx = dyn_cast<ConstantInt>(Idx);
Can this be just cast<> and the null check removed? I ask because the LangRef says the index operand must be a constant integer.
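For illustration, the simplification being suggested would look roughly like this, assuming the IsExtract and ICA variables from the quoted code above:

```cpp
// The LangRef guarantees the index operand of llvm.vector.insert/extract is
// a constant integer, so cast<> can replace dyn_cast<> plus the null check.
const Value *IdxOp = IsExtract ? ICA.getArgs()[1] : ICA.getArgs()[2];
uint64_t Idx = cast<ConstantInt>(IdxOp)->getZExtValue();
```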
Post-commit fixup patch for a request on #81135
Currently we model subvector inserts and extracts as shuffles, potentially going as far as scalarizing. If the types are legal then they can just be simple zip/unzip operations, or possibly even no-ops. Change the cost to a relatively small one to ensure that simple loops featuring such operations between fixed and scalable vector types that are effectively the same at a given SVE width can be unrolled and further optimized.