[AArch64] Improve cost model for legal subvec insert/extract #81135

Merged

Conversation

huntergr-arm
Collaborator

Currently we model subvector inserts and extracts as shuffles, potentially going as far as scalarizing. If the types are legal, then they can just be simple zip/unzip operations, or possibly even no-ops. Change the cost to a relatively small one to ensure that simple loops featuring such operations between fixed and scalable vector types that are effectively the same at a given SVE width can be unrolled and further optimized.

@llvmbot
Member

llvmbot commented Feb 8, 2024

@llvm/pr-subscribers-llvm-analysis
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-aarch64

Author: Graham Hunter (huntergr-arm)

Changes

Currently we model subvector inserts and extracts as shuffles, potentially going as far as scalarizing. If the types are legal, then they can just be simple zip/unzip operations, or possibly even no-ops. Change the cost to a relatively small one to ensure that simple loops featuring such operations between fixed and scalable vector types that are effectively the same at a given SVE width can be unrolled and further optimized.


Full diff: https://github.com/llvm/llvm-project/pull/81135.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (+26)
  • (added) llvm/test/Transforms/LoopUnroll/AArch64/scalable-vec-ins-ext.ll (+91)
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index cdd2750521d2c9..0add50e5067886 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -568,6 +568,32 @@ AArch64TTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
     }
     return Cost;
   }
+  case Intrinsic::vector_extract: {
+    // If both the vector argument and the return type are legal types, then
+    // this should be a no-op or simple operation; return a relatively low cost.
+    LLVMContext &C = RetTy->getContext();
+    EVT MRTy = getTLI()->getValueType(DL, RetTy);
+    EVT MPTy = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
+    TargetLoweringBase::LegalizeKind RLK = getTLI()->getTypeConversion(C, MRTy);
+    TargetLoweringBase::LegalizeKind PLK = getTLI()->getTypeConversion(C, MPTy);
+    if (RLK.first == TargetLoweringBase::TypeLegal &&
+        PLK.first == TargetLoweringBase::TypeLegal)
+      return InstructionCost(1);
+    break;
+  }
+  case Intrinsic::vector_insert: {
+    // If both the vector and subvector arguments are legal types, then this
+    // should be a no-op or simple operation; return a relatively low cost.
+    LLVMContext &C = RetTy->getContext();
+    EVT MTy0 = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
+    EVT MTy1 = getTLI()->getValueType(DL, ICA.getArgTypes()[1]);
+    TargetLoweringBase::LegalizeKind LK0 = getTLI()->getTypeConversion(C, MTy0);
+    TargetLoweringBase::LegalizeKind LK1 = getTLI()->getTypeConversion(C, MTy1);
+    if (LK0.first == TargetLoweringBase::TypeLegal &&
+        LK1.first == TargetLoweringBase::TypeLegal)
+      return InstructionCost(1);
+    break;
+  }
   case Intrinsic::bitreverse: {
     static const CostTblEntry BitreverseTbl[] = {
         {Intrinsic::bitreverse, MVT::i32, 1},
diff --git a/llvm/test/Transforms/LoopUnroll/AArch64/scalable-vec-ins-ext.ll b/llvm/test/Transforms/LoopUnroll/AArch64/scalable-vec-ins-ext.ll
new file mode 100644
index 00000000000000..c69757abd4c1d2
--- /dev/null
+++ b/llvm/test/Transforms/LoopUnroll/AArch64/scalable-vec-ins-ext.ll
@@ -0,0 +1,91 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 3
+; RUN: opt -passes=loop-unroll,simplifycfg -S -mtriple aarch64 -mattr=+sve -aarch64-sve-vector-bits-min=128 %s | FileCheck %s -check-prefix=UNROLL-128
+; RUN: opt -passes=loop-unroll,simplifycfg -S -mtriple aarch64 -mattr=+sve -aarch64-sve-vector-bits-min=256 %s | FileCheck %s -check-prefix=UNROLL-256
+
+;; This test contains IR similar to what would be generated when SVE ACLE
+;; routines are used with fixed-width vector types -- lots of subvector inserts
+;; and extracts that are effectively just bitcasts since the types are the
+;; same at a given SVE bit size. We want to make sure that they are not a
+;; barrier to unrolling simple loops with a fixed trip count which could be
+;; further optimized.
+
+define void @test_ins_ext_cost(ptr readonly %a, ptr readonly %b, ptr readonly %c, ptr noalias %d) {
+; UNROLL-128-LABEL: define void @test_ins_ext_cost(
+; UNROLL-128-SAME: ptr readonly [[A:%.*]], ptr readonly [[B:%.*]], ptr readonly [[C:%.*]], ptr noalias [[D:%.*]]) #[[ATTR0:[0-9]+]] {
+; UNROLL-128-NEXT:  entry:
+; UNROLL-128-NEXT:    br label [[FOR_BODY:%.*]]
+; UNROLL-128:       for.body:
+; UNROLL-128-NEXT:    [[EXIT_COND:%.*]] = phi i1 [ true, [[ENTRY:%.*]] ], [ false, [[FOR_BODY]] ]
+; UNROLL-128-NEXT:    [[IV:%.*]] = phi i64 [ 0, [[ENTRY]] ], [ 1, [[FOR_BODY]] ]
+; UNROLL-128-NEXT:    [[GEP_A:%.*]] = getelementptr inbounds <8 x float>, ptr [[A]], i64 [[IV]]
+; UNROLL-128-NEXT:    [[LOAD_A:%.*]] = load <8 x float>, ptr [[GEP_A]], align 16
+; UNROLL-128-NEXT:    [[GEP_B:%.*]] = getelementptr inbounds <8 x float>, ptr [[B]], i64 [[IV]]
+; UNROLL-128-NEXT:    [[LOAD_B:%.*]] = load <8 x float>, ptr [[GEP_B]], align 16
+; UNROLL-128-NEXT:    [[GEP_C:%.*]] = getelementptr inbounds <8 x float>, ptr [[C]], i64 [[IV]]
+; UNROLL-128-NEXT:    [[LOAD_C:%.*]] = load <8 x float>, ptr [[GEP_C]], align 16
+; UNROLL-128-NEXT:    [[CAST_SCALABLE_B:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_B]], i64 0)
+; UNROLL-128-NEXT:    [[CAST_SCALABLE_C:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_C]], i64 0)
+; UNROLL-128-NEXT:    [[ADD:%.*]] = fadd <vscale x 4 x float> [[CAST_SCALABLE_B]], [[CAST_SCALABLE_C]]
+; UNROLL-128-NEXT:    [[CAST_SCALABLE_A:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_A]], i64 0)
+; UNROLL-128-NEXT:    [[MUL:%.*]] = fmul <vscale x 4 x float> [[CAST_SCALABLE_A]], [[ADD]]
+; UNROLL-128-NEXT:    [[CAST_FIXED_D:%.*]] = tail call <8 x float> @llvm.vector.extract.v8f32.nxv4f32(<vscale x 4 x float> [[MUL]], i64 0)
+; UNROLL-128-NEXT:    [[GEP_D:%.*]] = getelementptr inbounds <8 x float>, ptr [[D]], i64 0, i64 [[IV]]
+; UNROLL-128-NEXT:    store <8 x float> [[CAST_FIXED_D]], ptr [[GEP_D]], align 16
+; UNROLL-128-NEXT:    br i1 [[EXIT_COND]], label [[FOR_BODY]], label [[EXIT:%.*]]
+; UNROLL-128:       exit:
+; UNROLL-128-NEXT:    ret void
+;
+; UNROLL-256-LABEL: define void @test_ins_ext_cost(
+; UNROLL-256-SAME: ptr readonly [[A:%.*]], ptr readonly [[B:%.*]], ptr readonly [[C:%.*]], ptr noalias [[D:%.*]]) #[[ATTR0:[0-9]+]] {
+; UNROLL-256-NEXT:  entry:
+; UNROLL-256-NEXT:    [[LOAD_A:%.*]] = load <8 x float>, ptr [[A]], align 16
+; UNROLL-256-NEXT:    [[LOAD_B:%.*]] = load <8 x float>, ptr [[B]], align 16
+; UNROLL-256-NEXT:    [[LOAD_C:%.*]] = load <8 x float>, ptr [[C]], align 16
+; UNROLL-256-NEXT:    [[CAST_SCALABLE_B:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_B]], i64 0)
+; UNROLL-256-NEXT:    [[CAST_SCALABLE_C:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_C]], i64 0)
+; UNROLL-256-NEXT:    [[ADD:%.*]] = fadd <vscale x 4 x float> [[CAST_SCALABLE_B]], [[CAST_SCALABLE_C]]
+; UNROLL-256-NEXT:    [[CAST_SCALABLE_A:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_A]], i64 0)
+; UNROLL-256-NEXT:    [[MUL:%.*]] = fmul <vscale x 4 x float> [[CAST_SCALABLE_A]], [[ADD]]
+; UNROLL-256-NEXT:    [[CAST_FIXED_D:%.*]] = tail call <8 x float> @llvm.vector.extract.v8f32.nxv4f32(<vscale x 4 x float> [[MUL]], i64 0)
+; UNROLL-256-NEXT:    store <8 x float> [[CAST_FIXED_D]], ptr [[D]], align 16
+; UNROLL-256-NEXT:    [[GEP_A_1:%.*]] = getelementptr inbounds <8 x float>, ptr [[A]], i64 1
+; UNROLL-256-NEXT:    [[LOAD_A_1:%.*]] = load <8 x float>, ptr [[GEP_A_1]], align 16
+; UNROLL-256-NEXT:    [[GEP_B_1:%.*]] = getelementptr inbounds <8 x float>, ptr [[B]], i64 1
+; UNROLL-256-NEXT:    [[LOAD_B_1:%.*]] = load <8 x float>, ptr [[GEP_B_1]], align 16
+; UNROLL-256-NEXT:    [[GEP_C_1:%.*]] = getelementptr inbounds <8 x float>, ptr [[C]], i64 1
+; UNROLL-256-NEXT:    [[LOAD_C_1:%.*]] = load <8 x float>, ptr [[GEP_C_1]], align 16
+; UNROLL-256-NEXT:    [[CAST_SCALABLE_B_1:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_B_1]], i64 0)
+; UNROLL-256-NEXT:    [[CAST_SCALABLE_C_1:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_C_1]], i64 0)
+; UNROLL-256-NEXT:    [[ADD_1:%.*]] = fadd <vscale x 4 x float> [[CAST_SCALABLE_B_1]], [[CAST_SCALABLE_C_1]]
+; UNROLL-256-NEXT:    [[CAST_SCALABLE_A_1:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_A_1]], i64 0)
+; UNROLL-256-NEXT:    [[MUL_1:%.*]] = fmul <vscale x 4 x float> [[CAST_SCALABLE_A_1]], [[ADD_1]]
+; UNROLL-256-NEXT:    [[CAST_FIXED_D_1:%.*]] = tail call <8 x float> @llvm.vector.extract.v8f32.nxv4f32(<vscale x 4 x float> [[MUL_1]], i64 0)
+; UNROLL-256-NEXT:    [[GEP_D_1:%.*]] = getelementptr inbounds <8 x float>, ptr [[D]], i64 0, i64 1
+; UNROLL-256-NEXT:    store <8 x float> [[CAST_FIXED_D_1]], ptr [[GEP_D_1]], align 16
+; UNROLL-256-NEXT:    ret void
+;
+entry:
+  br label %for.body
+
+for.body:
+  %exit.cond = phi i1 [ true, %entry ], [ false, %for.body ]
+  %iv = phi i64 [ 0, %entry ], [ 1, %for.body ]
+  %gep.a = getelementptr inbounds <8 x float>, ptr %a, i64 %iv
+  %load.a = load <8 x float>, ptr %gep.a, align 16
+  %gep.b = getelementptr inbounds <8 x float>, ptr %b, i64 %iv
+  %load.b = load <8 x float>, ptr %gep.b, align 16
+  %gep.c = getelementptr inbounds <8 x float>, ptr %c, i64 %iv
+  %load.c = load <8 x float>, ptr %gep.c, align 16
+  %cast.scalable.b = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> %load.b, i64 0)
+  %cast.scalable.c = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> %load.c, i64 0)
+  %add = fadd <vscale x 4 x float> %cast.scalable.b, %cast.scalable.c
+  %cast.scalable.a = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> %load.a, i64 0)
+  %mul = fmul <vscale x 4 x float> %cast.scalable.a, %add
+  %cast.fixed.d = tail call <8 x float> @llvm.vector.extract.v8f32.nxv4f32(<vscale x 4 x float> %mul, i64 0)
+  %gep.d = getelementptr inbounds <8 x float>, ptr %d, i64 0, i64 %iv
+  store <8 x float> %cast.fixed.d, ptr %gep.d, align 16
+  br i1 %exit.cond, label %for.body, label %exit
+
+exit:
+  ret void
+}

@@ -568,6 +568,32 @@ AArch64TTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
}
return Cost;
}
case Intrinsic::vector_extract: {
Contributor

I think it would be good to have proper cost model test cases in llvm/test/Analysis/CostModel/AArch64. You can look at existing examples as a guide.

TargetLoweringBase::LegalizeKind PLK = getTLI()->getTypeConversion(C, MPTy);
if (RLK.first == TargetLoweringBase::TypeLegal &&
PLK.first == TargetLoweringBase::TypeLegal)
return InstructionCost(1);
Contributor

At first glance this seems too optimistic because there are cases where the cost is higher. See llvm/test/CodeGen/AArch64/sve-extract-scalable-vector.ll for some examples:

define <vscale x 4 x i1> @extract_nxv4i1_nxv16i1_0(<vscale x 16 x i1> %in) {
; CHECK-LABEL: extract_nxv4i1_nxv16i1_0:
; CHECK:       // %bb.0:
; CHECK-NEXT:    punpklo p0.h, p0.b
; CHECK-NEXT:    punpklo p0.h, p0.b
; CHECK-NEXT:    ret
  %res = call <vscale x 4 x i1> @llvm.vector.extract.nxv4i1.nxv16i1(<vscale x 16 x i1> %in, i64 0)
  ret <vscale x 4 x i1> %res
}

or

define <vscale x 2 x i1> @extract_nxv2i1_nxv16i1_0(<vscale x 16 x i1> %in) {
; CHECK-LABEL: extract_nxv2i1_nxv16i1_0:
; CHECK:       // %bb.0:
; CHECK-NEXT:    punpklo p0.h, p0.b
; CHECK-NEXT:    punpklo p0.h, p0.b
; CHECK-NEXT:    punpklo p0.h, p0.b
; CHECK-NEXT:    ret
  %res = call <vscale x 2 x i1> @llvm.vector.extract.nxv2i1.nxv16i1(<vscale x 16 x i1> %in, i64 0)
  ret <vscale x 2 x i1> %res
}

These should have a higher cost than 1 I think. Or what about mixing legal fixed-width and scalable vector types such as:

define void @foo(ptr %a, ptr %b) vscale_range(16,0) {
  %op = load <vscale x 8 x i16>, ptr %a
  %ret = call <4 x i16> @llvm.vector.extract.v4i16.nxv8i16(<vscale x 8 x i16> %op, i64 4)
  store <4 x i16> %ret, ptr %b
  ret void
}

which leads to this assembly output:

foo:                                    // @foo
	str	x29, [sp, #-16]!                // 8-byte Folded Spill
	addvl	sp, sp, #-1
	ptrue	p0.h
	ld1h	{ z0.h }, p0/z, [x0]
	st1h	{ z0.h }, p0, [sp]
	ldr	d0, [sp, #8]
	str	d0, [x1]
	addvl	sp, sp, #1
	ldr	x29, [sp], #16                  // 8-byte Folded Reload
	ret

Collaborator Author

I've added a check to enforce that the lower cost is only allowed when the index is 0.
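
For readers following the thread, the guard being described has roughly the following shape. This is a hedged sketch, not a verbatim copy of the patch: the helper name cheapSubvectorCost and its factoring are illustrative, and the actual code is quoted further down in this thread.

#include "llvm/CodeGen/TargetLowering.h"
#include "llvm/IR/Constants.h"
#include "llvm/Support/Casting.h"
#include "llvm/Support/InstructionCost.h"
#include <optional>
using namespace llvm;

// Only treat a legal subvector insert/extract as cheap when the subvector
// starts at element 0; other offsets may need real shuffling.
static std::optional<InstructionCost>
cheapSubvectorCost(TargetLoweringBase::LegalizeKind RLK,
                   TargetLoweringBase::LegalizeKind PLK, const Value *IdxArg) {
  const auto *Idx = dyn_cast_or_null<ConstantInt>(IdxArg);
  if (RLK.first == TargetLoweringBase::TypeLegal &&
      PLK.first == TargetLoweringBase::TypeLegal && Idx &&
      Idx->getZExtValue() == 0)
    return InstructionCost(1); // zip/unzip or even a no-op
  return std::nullopt;         // fall back to the generic shuffle-based cost
}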

Collaborator

Just pointing out that the code isn't updated yet to handle predicates differently, as those inserts/extracts are indeed not free.

TargetLoweringBase::LegalizeKind LK1 = getTLI()->getTypeConversion(C, MTy1);
if (LK0.first == TargetLoweringBase::TypeLegal &&
LK1.first == TargetLoweringBase::TypeLegal)
return InstructionCost(1);
Contributor

Same comment as the extract - I think this model is too simple and makes the cost of vector predicate inserts/extracts too cheap, as well as mixing legal fixed-width and scalable vector inserts. In particular, we should worry about inserting a fixed-width vector into a region of a scalable vector beyond the known minimum size.

@huntergr-arm huntergr-arm force-pushed the users/huntergr-arm/spr/scalable-subvec-ins-ext-cost branch from 0394c21 to 521921e on February 9, 2024 16:14
@llvmbot llvmbot added the llvm:analysis Includes value tracking, cost tables and constant folding label Feb 9, 2024
@huntergr-arm huntergr-arm changed the base branch from main to users/huntergr-arm/spr/scalable-subvec-ins-ext-cost-tests February 12, 2024 11:28
if (RLK.first == TargetLoweringBase::TypeLegal &&
PLK.first == TargetLoweringBase::TypeLegal && Idx &&
Idx->getZExtValue() == 0)
return InstructionCost(1);
Collaborator

Is there a reason this wouldn't be zero-cost?

Also, stylistically to match the rest of this file, maybe return TTI::TCC_Free (if this is considered a cost of 0) or TTI::TCC_Basic (if this is considered a cost of 1) instead?

Collaborator Author (huntergr-arm, Feb 19, 2024)

If the subvector is legal but smaller than the vector it's being inserted into, we would need an extra operation; e.g., <2 x float> (packed 64b NEON) into <vscale x 2 x float> (unpacked 128b+ SVE) would need an unzip or unpack, I think?

My original idea with this was to only do it in cases where the size of the vectors was the same, but I was encouraged to extend that to legal types. I can walk it back to my initial idea if we're not sure about the extent of the change.

Collaborator

I guess the number of extra instructions required for extracting from an unpacked vector is log2(SVEBitsPerBlock / (MinNumElts * EltSizeInBits)). For the insert, that would require adding another TCC_Basic for the cost of the mov (when it's not inserting into an undef/poison value). This would automatically return TCC_Free (== 0) if the scalable vector is a packed vector.
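
As a rough illustration of that formula for non-predicate scalable vector types (a sketch only, not code from this PR; the helper name is made up, and SVEBitsPerBlock is 128 on AArch64):

#include "llvm/CodeGen/ValueTypes.h"
#include "llvm/Support/MathExtras.h"
#include <cassert>
using namespace llvm;

// Number of unpack-style steps needed when the extracted vector's
// known-minimum size is narrower than one full SVE register block;
// 0 when the vector is already packed (known-minimum size == block size).
static unsigned unpackedExtractSteps(EVT VT, unsigned SVEBitsPerBlock = 128) {
  assert(VT.isScalableVector() && "expected a scalable vector type");
  // Known-minimum size in bits == MinNumElts * EltSizeInBits.
  unsigned MinBits = VT.getSizeInBits().getKnownMinValue();
  return MinBits >= SVEBitsPerBlock ? 0 : Log2_32(SVEBitsPerBlock / MinBits);
}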


// If arguments aren't actually supplied, then we cannot determine the
// value of the index.
if (ICA.getArgs().size() < 2)
Collaborator

nit:

Suggested change
- if (ICA.getArgs().size() < 2)
+ if (ICA.getArgs().size() != 2)

TargetLoweringBase::LegalizeKind PLK = getTLI()->getTypeConversion(C, MPTy);
if (RLK.first == TargetLoweringBase::TypeLegal &&
PLK.first == TargetLoweringBase::TypeLegal)
return InstructionCost(1);
Collaborator

Just pointing out that the code isn't updated yet to handle predicates differently, as those inserts/extracts are indeed not free.

Base automatically changed from users/huntergr-arm/spr/scalable-subvec-ins-ext-cost-tests to main February 19, 2024 09:47
@huntergr-arm huntergr-arm force-pushed the users/huntergr-arm/spr/scalable-subvec-ins-ext-cost branch from 521921e to c08072d on February 19, 2024 14:11
@huntergr-arm
Collaborator Author

Rebased on top of the updated precommit tests.

@huntergr-arm
Collaborator Author

Ping.

@huntergr-arm huntergr-arm force-pushed the users/huntergr-arm/spr/scalable-subvec-ins-ext-cost branch from 9b6d7d6 to 37b3bbc on February 29, 2024 09:27
Comment on lines 582 to 583
EVT MRTy = getTLI()->getValueType(DL, RetTy);
EVT MPTy = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
Collaborator

nit:

Suggested change
- EVT MRTy = getTLI()->getValueType(DL, RetTy);
- EVT MPTy = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
+ EVT RetVT = getTLI()->getValueType(DL, RetTy);
+ EVT VecVT = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);

?

Comment on lines 586 to 587
if ((MRTy.isScalableVector() &&
MRTy.getSizeInBits().getKnownMinValue() != AArch64::SVEBitsPerBlock) ||
Collaborator

When I comment this out, the tests still pass, so some test coverage is missing.

if ((MRTy.isScalableVector() &&
MRTy.getSizeInBits().getKnownMinValue() != AArch64::SVEBitsPerBlock) ||
(MPTy.isScalableVector() &&
MPTy.getSizeInBits().getKnownMinValue() != AArch64::SVEBitsPerBlock))
Collaborator

Suggested change
- MPTy.getSizeInBits().getKnownMinValue() != AArch64::SVEBitsPerBlock))
+ MPTy.getSizeInBits().getKnownMinValue() < AArch64::SVEBitsPerBlock))

Do you want to use a less-than comparison instead? Extracting a <vscale x 4 x float> from a <vscale x 8 x float> at index 0 is free. With this code it isn't free, but it's probably a good and common case to add?

Can you also create an isUnpackedVectorVT function to make this code a bit simpler?
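
A minimal sketch of what such a helper might look like, assuming the less-than comparison suggested above; the name comes from the review comment, but the body is illustrative rather than the code that was eventually committed:

#include "llvm/CodeGen/ValueTypes.h"
using namespace llvm;

// A scalable vector type is "unpacked" if its known-minimum size is smaller
// than one full SVE register block (128 bits on AArch64).
static bool isUnpackedVectorVT(EVT VecVT, unsigned SVEBitsPerBlock = 128) {
  return VecVT.isScalableVector() &&
         VecVT.getSizeInBits().getKnownMinValue() < SVEBitsPerBlock;
}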

EVT MRTy = getTLI()->getValueType(DL, RetTy);
EVT MPTy = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
// Skip this if either the return type or the vector argument are unpacked
// SVE types; they may get lowered to stack stores and loads.
Collaborator

they may get lowered to stack stores and loads

Either that or uzp's?

Collaborator Author

I saw stores and loads being generated for non-predicate types (and we've already excluded those at this point). Might be something to fix later.

Comment on lines 611 to 612
EVT MTy0 = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
EVT MTy1 = getTLI()->getValueType(DL, ICA.getArgTypes()[1]);
Collaborator

nit:

Suggested change
- EVT MTy0 = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
- EVT MTy1 = getTLI()->getValueType(DL, ICA.getArgTypes()[1]);
+ EVT VecVT = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
+ EVT SubVecVT = getTLI()->getValueType(DL, ICA.getArgTypes()[1]);

Comment on lines 617 to 618
(MTy1.isScalableVector() &&
MTy1.getSizeInBits().getKnownMinValue() != AArch64::SVEBitsPerBlock))
Collaborator

When I comment this out, the tests still pass, so some test coverage is missing.

@@ -45,27 +49,43 @@ define void @vector_insert_extract_idxzero_128b() #1 {
; TYPE_BASED_ONLY-NEXT: Cost Model: Invalid cost for instruction: %insert_nxv16i1_nxv2i1 = call <vscale x 16 x i1> @llvm.vector.insert.nxv16i1.nxv2i1(<vscale x 16 x i1> undef, <vscale x 2 x i1> undef, i64 0)
; TYPE_BASED_ONLY-NEXT: Cost Model: Invalid cost for instruction: %extract_nxv4i1_nxv16i1 = call <vscale x 4 x i1> @llvm.vector.extract.nxv4i1.nxv16i1(<vscale x 16 x i1> undef, i64 0)
; TYPE_BASED_ONLY-NEXT: Cost Model: Invalid cost for instruction: %extract_v8i1_nxv8i1 = call <8 x i1> @llvm.vector.extract.v8i1.nxv8i1(<vscale x 8 x i1> undef, i64 0)
; TYPE_BASED_ONLY-NEXT: Cost Model: Invalid cost for instruction: %insert_v2f32_nxv2f32 = call <vscale x 2 x float> @llvm.vector.insert.nxv2f32.v2f32(<vscale x 2 x float> undef, <2 x float> undef, i64 0)
Collaborator

Perhaps not something to address in this patch, but I think it's wrong for this case to return an invalid cost, even when it cannot deduce anything for the operands. At worst it's a high cost for using spills and fills. But invalid cost basically says that we can't lower the general case, which isn't true.

Collaborator Author

That comes from the generic cost model code, which might just give up for scalable vectors. I guess we need to override the cost for more variants. But yes, in a later patch.

Comment on lines 593 to 596
TargetLoweringBase::LegalizeKind RLK =
getTLI()->getTypeConversion(C, RetVT);
TargetLoweringBase::LegalizeKind PLK =
getTLI()->getTypeConversion(C, VecVT);
Collaborator

nit: This still uses the 'R' and 'P' names.

Collaborator

sdesmalen-arm left a comment

Thanks for the changes. LGTM.

TargetLoweringBase::LegalizeKind VecLK =
getTLI()->getTypeConversion(C, VecVT);
const Value *Idx = IsExtract ? ICA.getArgs()[1] : ICA.getArgs()[2];
const ConstantInt *CIdx = dyn_cast<ConstantInt>(Idx);
Collaborator (paulwalker-arm, Mar 4, 2024)

Can this be just cast<> and the null check removed? I ask because the LangRef says the index operand must be a constant integer.
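
A small illustration of the difference (a sketch only, not the committed code; the helper name is made up):

#include "llvm/IR/Constants.h"
#include "llvm/Support/Casting.h"
using namespace llvm;

// The LangRef says the index operand of llvm.vector.insert/extract must be a
// constant integer, so cast<> (which asserts on a type mismatch) can replace
// dyn_cast<> plus an explicit null check.
static uint64_t getSubvectorIndex(const Value *IdxArg) {
  return cast<ConstantInt>(IdxArg)->getZExtValue();
}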

@huntergr-arm huntergr-arm merged commit 03f852f into main Mar 4, 2024
@huntergr-arm huntergr-arm deleted the users/huntergr-arm/spr/scalable-subvec-ins-ext-cost branch March 4, 2024 16:17
huntergr-arm added a commit that referenced this pull request Mar 5, 2024
Post-commit fixup patch for a request on #81135
Labels: backend:AArch64, llvm:analysis, llvm:transforms