[AArch64] Improve cost model for legal subvec insert/extract #81135

Merged

Conversation

huntergr-arm
Collaborator

Currently we model subvector inserts and extracts as shuffles, potentially going as far as scalarizing. If the types are legal, then they can just be simple zip/unzip operations, or possibly even no-ops. Change the cost to a relatively small one to ensure that simple loops featuring such operations between fixed and scalable vector types that are effectively the same at a given SVE width can be unrolled and further optimized.

@llvmbot
Member

llvmbot commented Feb 8, 2024

@llvm/pr-subscribers-llvm-analysis
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-aarch64

Author: Graham Hunter (huntergr-arm)

Changes

Currently we model subvector inserts and extracts as shuffles, potentially going as far as scalarizing. If the types are legal, then they can just be simple zip/unzip operations, or possibly even no-ops. Change the cost to a relatively small one to ensure that simple loops featuring such operations between fixed and scalable vector types that are effectively the same at a given SVE width can be unrolled and further optimized.


Full diff: https://github.com/llvm/llvm-project/pull/81135.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (+26)
  • (added) llvm/test/Transforms/LoopUnroll/AArch64/scalable-vec-ins-ext.ll (+91)
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index cdd2750521d2c9..0add50e5067886 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -568,6 +568,32 @@ AArch64TTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
     }
     return Cost;
   }
+  case Intrinsic::vector_extract: {
+    // If both the vector argument and the return type are legal types, then
+    // this should be a no-op or simple operation; return a relatively low cost.
+    LLVMContext &C = RetTy->getContext();
+    EVT MRTy = getTLI()->getValueType(DL, RetTy);
+    EVT MPTy = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
+    TargetLoweringBase::LegalizeKind RLK = getTLI()->getTypeConversion(C, MRTy);
+    TargetLoweringBase::LegalizeKind PLK = getTLI()->getTypeConversion(C, MPTy);
+    if (RLK.first == TargetLoweringBase::TypeLegal &&
+        PLK.first == TargetLoweringBase::TypeLegal)
+      return InstructionCost(1);
+    break;
+  }
+  case Intrinsic::vector_insert: {
+    // If both the vector and subvector arguments are legal types, then this
+    // should be a no-op or simple operation; return a relatively low cost.
+    LLVMContext &C = RetTy->getContext();
+    EVT MTy0 = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
+    EVT MTy1 = getTLI()->getValueType(DL, ICA.getArgTypes()[1]);
+    TargetLoweringBase::LegalizeKind LK0 = getTLI()->getTypeConversion(C, MTy0);
+    TargetLoweringBase::LegalizeKind LK1 = getTLI()->getTypeConversion(C, MTy1);
+    if (LK0.first == TargetLoweringBase::TypeLegal &&
+        LK1.first == TargetLoweringBase::TypeLegal)
+      return InstructionCost(1);
+    break;
+  }
   case Intrinsic::bitreverse: {
     static const CostTblEntry BitreverseTbl[] = {
         {Intrinsic::bitreverse, MVT::i32, 1},
diff --git a/llvm/test/Transforms/LoopUnroll/AArch64/scalable-vec-ins-ext.ll b/llvm/test/Transforms/LoopUnroll/AArch64/scalable-vec-ins-ext.ll
new file mode 100644
index 00000000000000..c69757abd4c1d2
--- /dev/null
+++ b/llvm/test/Transforms/LoopUnroll/AArch64/scalable-vec-ins-ext.ll
@@ -0,0 +1,91 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 3
+; RUN: opt -passes=loop-unroll,simplifycfg -S -mtriple aarch64 -mattr=+sve -aarch64-sve-vector-bits-min=128 %s | FileCheck %s -check-prefix=UNROLL-128
+; RUN: opt -passes=loop-unroll,simplifycfg -S -mtriple aarch64 -mattr=+sve -aarch64-sve-vector-bits-min=256 %s | FileCheck %s -check-prefix=UNROLL-256
+
+;; This test contains IR similar to what would be generated when SVE ACLE
+;; routines are used with fixed-width vector types -- lots of subvector inserts
+;; and extracts that are effectively just bitcasts since the types are the
+;; same at a given SVE bit size. We want to make sure that they are not a
+;; barrier to unrolling simple loops with a fixed trip count which could be
+;; further optimized.
+
+define void @test_ins_ext_cost(ptr readonly %a, ptr readonly %b, ptr readonly %c, ptr noalias %d) {
+; UNROLL-128-LABEL: define void @test_ins_ext_cost(
+; UNROLL-128-SAME: ptr readonly [[A:%.*]], ptr readonly [[B:%.*]], ptr readonly [[C:%.*]], ptr noalias [[D:%.*]]) #[[ATTR0:[0-9]+]] {
+; UNROLL-128-NEXT:  entry:
+; UNROLL-128-NEXT:    br label [[FOR_BODY:%.*]]
+; UNROLL-128:       for.body:
+; UNROLL-128-NEXT:    [[EXIT_COND:%.*]] = phi i1 [ true, [[ENTRY:%.*]] ], [ false, [[FOR_BODY]] ]
+; UNROLL-128-NEXT:    [[IV:%.*]] = phi i64 [ 0, [[ENTRY]] ], [ 1, [[FOR_BODY]] ]
+; UNROLL-128-NEXT:    [[GEP_A:%.*]] = getelementptr inbounds <8 x float>, ptr [[A]], i64 [[IV]]
+; UNROLL-128-NEXT:    [[LOAD_A:%.*]] = load <8 x float>, ptr [[GEP_A]], align 16
+; UNROLL-128-NEXT:    [[GEP_B:%.*]] = getelementptr inbounds <8 x float>, ptr [[B]], i64 [[IV]]
+; UNROLL-128-NEXT:    [[LOAD_B:%.*]] = load <8 x float>, ptr [[GEP_B]], align 16
+; UNROLL-128-NEXT:    [[GEP_C:%.*]] = getelementptr inbounds <8 x float>, ptr [[C]], i64 [[IV]]
+; UNROLL-128-NEXT:    [[LOAD_C:%.*]] = load <8 x float>, ptr [[GEP_C]], align 16
+; UNROLL-128-NEXT:    [[CAST_SCALABLE_B:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_B]], i64 0)
+; UNROLL-128-NEXT:    [[CAST_SCALABLE_C:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_C]], i64 0)
+; UNROLL-128-NEXT:    [[ADD:%.*]] = fadd <vscale x 4 x float> [[CAST_SCALABLE_B]], [[CAST_SCALABLE_C]]
+; UNROLL-128-NEXT:    [[CAST_SCALABLE_A:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_A]], i64 0)
+; UNROLL-128-NEXT:    [[MUL:%.*]] = fmul <vscale x 4 x float> [[CAST_SCALABLE_A]], [[ADD]]
+; UNROLL-128-NEXT:    [[CAST_FIXED_D:%.*]] = tail call <8 x float> @llvm.vector.extract.v8f32.nxv4f32(<vscale x 4 x float> [[MUL]], i64 0)
+; UNROLL-128-NEXT:    [[GEP_D:%.*]] = getelementptr inbounds <8 x float>, ptr [[D]], i64 0, i64 [[IV]]
+; UNROLL-128-NEXT:    store <8 x float> [[CAST_FIXED_D]], ptr [[GEP_D]], align 16
+; UNROLL-128-NEXT:    br i1 [[EXIT_COND]], label [[FOR_BODY]], label [[EXIT:%.*]]
+; UNROLL-128:       exit:
+; UNROLL-128-NEXT:    ret void
+;
+; UNROLL-256-LABEL: define void @test_ins_ext_cost(
+; UNROLL-256-SAME: ptr readonly [[A:%.*]], ptr readonly [[B:%.*]], ptr readonly [[C:%.*]], ptr noalias [[D:%.*]]) #[[ATTR0:[0-9]+]] {
+; UNROLL-256-NEXT:  entry:
+; UNROLL-256-NEXT:    [[LOAD_A:%.*]] = load <8 x float>, ptr [[A]], align 16
+; UNROLL-256-NEXT:    [[LOAD_B:%.*]] = load <8 x float>, ptr [[B]], align 16
+; UNROLL-256-NEXT:    [[LOAD_C:%.*]] = load <8 x float>, ptr [[C]], align 16
+; UNROLL-256-NEXT:    [[CAST_SCALABLE_B:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_B]], i64 0)
+; UNROLL-256-NEXT:    [[CAST_SCALABLE_C:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_C]], i64 0)
+; UNROLL-256-NEXT:    [[ADD:%.*]] = fadd <vscale x 4 x float> [[CAST_SCALABLE_B]], [[CAST_SCALABLE_C]]
+; UNROLL-256-NEXT:    [[CAST_SCALABLE_A:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_A]], i64 0)
+; UNROLL-256-NEXT:    [[MUL:%.*]] = fmul <vscale x 4 x float> [[CAST_SCALABLE_A]], [[ADD]]
+; UNROLL-256-NEXT:    [[CAST_FIXED_D:%.*]] = tail call <8 x float> @llvm.vector.extract.v8f32.nxv4f32(<vscale x 4 x float> [[MUL]], i64 0)
+; UNROLL-256-NEXT:    store <8 x float> [[CAST_FIXED_D]], ptr [[D]], align 16
+; UNROLL-256-NEXT:    [[GEP_A_1:%.*]] = getelementptr inbounds <8 x float>, ptr [[A]], i64 1
+; UNROLL-256-NEXT:    [[LOAD_A_1:%.*]] = load <8 x float>, ptr [[GEP_A_1]], align 16
+; UNROLL-256-NEXT:    [[GEP_B_1:%.*]] = getelementptr inbounds <8 x float>, ptr [[B]], i64 1
+; UNROLL-256-NEXT:    [[LOAD_B_1:%.*]] = load <8 x float>, ptr [[GEP_B_1]], align 16
+; UNROLL-256-NEXT:    [[GEP_C_1:%.*]] = getelementptr inbounds <8 x float>, ptr [[C]], i64 1
+; UNROLL-256-NEXT:    [[LOAD_C_1:%.*]] = load <8 x float>, ptr [[GEP_C_1]], align 16
+; UNROLL-256-NEXT:    [[CAST_SCALABLE_B_1:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_B_1]], i64 0)
+; UNROLL-256-NEXT:    [[CAST_SCALABLE_C_1:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_C_1]], i64 0)
+; UNROLL-256-NEXT:    [[ADD_1:%.*]] = fadd <vscale x 4 x float> [[CAST_SCALABLE_B_1]], [[CAST_SCALABLE_C_1]]
+; UNROLL-256-NEXT:    [[CAST_SCALABLE_A_1:%.*]] = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> [[LOAD_A_1]], i64 0)
+; UNROLL-256-NEXT:    [[MUL_1:%.*]] = fmul <vscale x 4 x float> [[CAST_SCALABLE_A_1]], [[ADD_1]]
+; UNROLL-256-NEXT:    [[CAST_FIXED_D_1:%.*]] = tail call <8 x float> @llvm.vector.extract.v8f32.nxv4f32(<vscale x 4 x float> [[MUL_1]], i64 0)
+; UNROLL-256-NEXT:    [[GEP_D_1:%.*]] = getelementptr inbounds <8 x float>, ptr [[D]], i64 0, i64 1
+; UNROLL-256-NEXT:    store <8 x float> [[CAST_FIXED_D_1]], ptr [[GEP_D_1]], align 16
+; UNROLL-256-NEXT:    ret void
+;
+entry:
+  br label %for.body
+
+for.body:
+  %exit.cond = phi i1 [ true, %entry ], [ false, %for.body ]
+  %iv = phi i64 [ 0, %entry ], [ 1, %for.body ]
+  %gep.a = getelementptr inbounds <8 x float>, ptr %a, i64 %iv
+  %load.a = load <8 x float>, ptr %gep.a, align 16
+  %gep.b = getelementptr inbounds <8 x float>, ptr %b, i64 %iv
+  %load.b = load <8 x float>, ptr %gep.b, align 16
+  %gep.c = getelementptr inbounds <8 x float>, ptr %c, i64 %iv
+  %load.c = load <8 x float>, ptr %gep.c, align 16
+  %cast.scalable.b = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> %load.b, i64 0)
+  %cast.scalable.c = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> %load.c, i64 0)
+  %add = fadd <vscale x 4 x float> %cast.scalable.b, %cast.scalable.c
+  %cast.scalable.a = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v8f32(<vscale x 4 x float> undef, <8 x float> %load.a, i64 0)
+  %mul = fmul <vscale x 4 x float> %cast.scalable.a, %add
+  %cast.fixed.d = tail call <8 x float> @llvm.vector.extract.v8f32.nxv4f32(<vscale x 4 x float> %mul, i64 0)
+  %gep.d = getelementptr inbounds <8 x float>, ptr %d, i64 0, i64 %iv
+  store <8 x float> %cast.fixed.d, ptr %gep.d, align 16
+  br i1 %exit.cond, label %for.body, label %exit
+
+exit:
+  ret void
+}

@@ -568,6 +568,32 @@ AArch64TTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
}
return Cost;
}
case Intrinsic::vector_extract: {
Contributor

I think it would be good to have proper cost model test cases in llvm/test/Analysis/CostModel/AArch64. You can look at existing examples as a guide.

TargetLoweringBase::LegalizeKind PLK = getTLI()->getTypeConversion(C, MPTy);
if (RLK.first == TargetLoweringBase::TypeLegal &&
PLK.first == TargetLoweringBase::TypeLegal)
return InstructionCost(1);
Contributor

At first glance this seems too optimistic because there are cases where the cost is higher. See llvm/test/CodeGen/AArch64/sve-extract-scalable-vector.ll for some examples:

define <vscale x 4 x i1> @extract_nxv4i1_nxv16i1_0(<vscale x 16 x i1> %in) {
; CHECK-LABEL: extract_nxv4i1_nxv16i1_0:
; CHECK:       // %bb.0:
; CHECK-NEXT:    punpklo p0.h, p0.b
; CHECK-NEXT:    punpklo p0.h, p0.b
; CHECK-NEXT:    ret
  %res = call <vscale x 4 x i1> @llvm.vector.extract.nxv4i1.nxv16i1(<vscale x 16 x i1> %in, i64 0)
  ret <vscale x 4 x i1> %res
}

or

define <vscale x 2 x i1> @extract_nxv2i1_nxv16i1_0(<vscale x 16 x i1> %in) {
; CHECK-LABEL: extract_nxv2i1_nxv16i1_0:
; CHECK:       // %bb.0:
; CHECK-NEXT:    punpklo p0.h, p0.b
; CHECK-NEXT:    punpklo p0.h, p0.b
; CHECK-NEXT:    punpklo p0.h, p0.b
; CHECK-NEXT:    ret
  %res = call <vscale x 2 x i1> @llvm.vector.extract.nxv2i1.nxv16i1(<vscale x 16 x i1> %in, i64 0)
  ret <vscale x 2 x i1> %res
}

These should have a higher cost than 1 I think. Or what about mixing legal fixed-width and scalable vector types such as:

define void @foo(ptr %a, ptr %b) vscale_range(16,0) {
  %op = load <vscale x 8 x i16>, ptr %a
  %ret = call <4 x i16> @llvm.vector.extract.v4i16.nxv8i16(<vscale x 8 x i16> %op, i64 4)
  store <4 x i16> %ret, ptr %b
  ret void
}

which leads to this assembly output:

foo:                                    // @foo
	str	x29, [sp, #-16]!                // 8-byte Folded Spill
	addvl	sp, sp, #-1
	ptrue	p0.h
	ld1h	{ z0.h }, p0/z, [x0]
	st1h	{ z0.h }, p0, [sp]
	ldr	d0, [sp, #8]
	str	d0, [x1]
	addvl	sp, sp, #1
	ldr	x29, [sp], #16                  // 8-byte Folded Reload
	ret

Collaborator Author

I've added a check to enforce that the lower cost is only allowed when the index is 0.
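
For readers following the thread, the guard being described has roughly the following shape. This is a hedged sketch, not a verbatim copy of the patch: the helper name cheapSubvectorCost and its factoring are illustrative, and the actual code is quoted further down in this thread.

#include "llvm/CodeGen/TargetLowering.h"
#include "llvm/IR/Constants.h"
#include "llvm/Support/Casting.h"
#include "llvm/Support/InstructionCost.h"
#include <optional>
using namespace llvm;

// Only treat a legal subvector insert/extract as cheap when the subvector
// starts at element 0; other offsets may need real shuffling.
static std::optional<InstructionCost>
cheapSubvectorCost(TargetLoweringBase::LegalizeKind RLK,
                   TargetLoweringBase::LegalizeKind PLK, const Value *IdxArg) {
  const auto *Idx = dyn_cast_or_null<ConstantInt>(IdxArg);
  if (RLK.first == TargetLoweringBase::TypeLegal &&
      PLK.first == TargetLoweringBase::TypeLegal && Idx &&
      Idx->getZExtValue() == 0)
    return InstructionCost(1); // zip/unzip or even a no-op
  return std::nullopt;         // fall back to the generic shuffle-based cost
}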

Collaborator

Just pointing out that the code isn't updated yet to handle predicates differently, as those inserts/extracts are indeed not free.

TargetLoweringBase::LegalizeKind LK1 = getTLI()->getTypeConversion(C, MTy1);
if (LK0.first == TargetLoweringBase::TypeLegal &&
LK1.first == TargetLoweringBase::TypeLegal)
return InstructionCost(1);
Contributor

Same comment as the extract - I think this model is too simple and makes the cost of vector predicate inserts/extracts too cheap, as well as mixing legal fixed-width and scalable vector inserts. In particular, we should worry about inserting a fixed-width vector into a region of a scalable vector beyond the known minimum size.

@huntergr-arm huntergr-arm force-pushed the users/huntergr-arm/spr/scalable-subvec-ins-ext-cost branch from 0394c21 to 521921e on February 9, 2024 16:14
@llvmbot llvmbot added the llvm:analysis Includes value tracking, cost tables and constant folding label Feb 9, 2024
@huntergr-arm huntergr-arm changed the base branch from main to users/huntergr-arm/spr/scalable-subvec-ins-ext-cost-tests February 12, 2024 11:28
if (RLK.first == TargetLoweringBase::TypeLegal &&
PLK.first == TargetLoweringBase::TypeLegal && Idx &&
Idx->getZExtValue() == 0)
return InstructionCost(1);
Collaborator

Is there a reason this wouldn't be zero-cost?

Also, stylistically to match the rest of this file, maybe return TTI::TCC_Free (if this is considered a cost of 0) or TTI::TCC_Basic (if this is considered a cost of 1) instead?

Collaborator Author (huntergr-arm, Feb 19, 2024)

If the subvector is legal but smaller than the vector it's being inserted into, we would need an extra operation; e.g., <2 x float> (packed 64b NEON) into <vscale x 2 x float> (unpacked 128b+ SVE) would need an unzip or unpack, I think?

My original idea with this was to only do it in cases where the size of the vectors was the same, but I was encouraged to extend that to legal types. I can walk it back to my initial idea if we're not sure about the extent of the change.

Collaborator

I guess the number of extra instructions required for extracting from an unpacked vector is log2(SVEBitsPerBlock / (MinNumElts * EltSizeInBits)). For the insert, that would require adding another TCC_Basic for the cost of the mov (when it's not inserting into an undef/poison value). This would automatically return TCC_Free (== 0) if the scalable vector is a packed vector.
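
As a rough illustration of that formula for non-predicate scalable vector types (a sketch only, not code from this PR; the helper name is made up, and SVEBitsPerBlock is 128 on AArch64):

#include "llvm/CodeGen/ValueTypes.h"
#include "llvm/Support/MathExtras.h"
#include <cassert>
using namespace llvm;

// Number of unpack-style steps needed when the extracted vector's
// known-minimum size is narrower than one full SVE register block;
// 0 when the vector is already packed (known-minimum size == block size).
static unsigned unpackedExtractSteps(EVT VT, unsigned SVEBitsPerBlock = 128) {
  assert(VT.isScalableVector() && "expected a scalable vector type");
  // Known-minimum size in bits == MinNumElts * EltSizeInBits.
  unsigned MinBits = VT.getSizeInBits().getKnownMinValue();
  return MinBits >= SVEBitsPerBlock ? 0 : Log2_32(SVEBitsPerBlock / MinBits);
}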


// If arguments aren't actually supplied, then we cannot determine the
// value of the index.
if (ICA.getArgs().size() < 2)
Collaborator

nit:

Suggested change
- if (ICA.getArgs().size() < 2)
+ if (ICA.getArgs().size() != 2)

TargetLoweringBase::LegalizeKind PLK = getTLI()->getTypeConversion(C, MPTy);
if (RLK.first == TargetLoweringBase::TypeLegal &&
PLK.first == TargetLoweringBase::TypeLegal)
return InstructionCost(1);
Collaborator

Just pointing out that the code isn't updated yet to handle predicates differently, as those inserts/extracts are indeed not free.

Base automatically changed from users/huntergr-arm/spr/scalable-subvec-ins-ext-cost-tests to main February 19, 2024 09:47
@huntergr-arm huntergr-arm force-pushed the users/huntergr-arm/spr/scalable-subvec-ins-ext-cost branch from 521921e to c08072d on February 19, 2024 14:11
@huntergr-arm
Collaborator Author

Rebased on top of the updated precommit tests.

@huntergr-arm
Collaborator Author

Ping.

@huntergr-arm huntergr-arm force-pushed the users/huntergr-arm/spr/scalable-subvec-ins-ext-cost branch from 9b6d7d6 to 37b3bbc on February 29, 2024 09:27
Comment on lines 582 to 583
EVT MRTy = getTLI()->getValueType(DL, RetTy);
EVT MPTy = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
Collaborator

nit:

Suggested change
- EVT MRTy = getTLI()->getValueType(DL, RetTy);
- EVT MPTy = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
+ EVT RetVT = getTLI()->getValueType(DL, RetTy);
+ EVT VecVT = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);

?

Comment on lines 586 to 587
if ((MRTy.isScalableVector() &&
MRTy.getSizeInBits().getKnownMinValue() != AArch64::SVEBitsPerBlock) ||
Collaborator

When I comment this out, the tests still pass, so some test coverage is missing.

if ((MRTy.isScalableVector() &&
MRTy.getSizeInBits().getKnownMinValue() != AArch64::SVEBitsPerBlock) ||
(MPTy.isScalableVector() &&
MPTy.getSizeInBits().getKnownMinValue() != AArch64::SVEBitsPerBlock))
Collaborator

Suggested change
- MPTy.getSizeInBits().getKnownMinValue() != AArch64::SVEBitsPerBlock))
+ MPTy.getSizeInBits().getKnownMinValue() < AArch64::SVEBitsPerBlock))

Do you want to use a less-than comparison instead? Extracting a <vscale x 4 x float> from a <vscale x 8 x float> at index 0 is free. With this code it isn't free, but it's probably a good and common case to add?

Can you also create an isUnpackedVectorVT function to make this code a bit simpler?
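
A minimal sketch of what such a helper might look like, assuming the less-than comparison suggested above; the name comes from the review comment, but the body is illustrative rather than the code that was eventually committed:

#include "llvm/CodeGen/ValueTypes.h"
using namespace llvm;

// A scalable vector type is "unpacked" if its known-minimum size is smaller
// than one full SVE register block (128 bits on AArch64).
static bool isUnpackedVectorVT(EVT VecVT, unsigned SVEBitsPerBlock = 128) {
  return VecVT.isScalableVector() &&
         VecVT.getSizeInBits().getKnownMinValue() < SVEBitsPerBlock;
}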

EVT MRTy = getTLI()->getValueType(DL, RetTy);
EVT MPTy = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
// Skip this if either the return type or the vector argument are unpacked
// SVE types; they may get lowered to stack stores and loads.
Collaborator

they may get lowered to stack stores and loads

Either that or uzp's?

Collaborator Author

I saw stores and loads being generated for non-predicate types (and we've already excluded those at this point). Might be something to fix later.

Comment on lines 611 to 612
EVT MTy0 = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
EVT MTy1 = getTLI()->getValueType(DL, ICA.getArgTypes()[1]);
Collaborator

nit:

Suggested change
- EVT MTy0 = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
- EVT MTy1 = getTLI()->getValueType(DL, ICA.getArgTypes()[1]);
+ EVT VecVT = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
+ EVT SubVecVT = getTLI()->getValueType(DL, ICA.getArgTypes()[1]);

Comment on lines 617 to 618
(MTy1.isScalableVector() &&
MTy1.getSizeInBits().getKnownMinValue() != AArch64::SVEBitsPerBlock))
Collaborator

When I comment this out, the tests still pass, so some test coverage is missing.

@@ -45,27 +49,43 @@ define void @vector_insert_extract_idxzero_128b() #1 {
; TYPE_BASED_ONLY-NEXT: Cost Model: Invalid cost for instruction: %insert_nxv16i1_nxv2i1 = call <vscale x 16 x i1> @llvm.vector.insert.nxv16i1.nxv2i1(<vscale x 16 x i1> undef, <vscale x 2 x i1> undef, i64 0)
; TYPE_BASED_ONLY-NEXT: Cost Model: Invalid cost for instruction: %extract_nxv4i1_nxv16i1 = call <vscale x 4 x i1> @llvm.vector.extract.nxv4i1.nxv16i1(<vscale x 16 x i1> undef, i64 0)
; TYPE_BASED_ONLY-NEXT: Cost Model: Invalid cost for instruction: %extract_v8i1_nxv8i1 = call <8 x i1> @llvm.vector.extract.v8i1.nxv8i1(<vscale x 8 x i1> undef, i64 0)
; TYPE_BASED_ONLY-NEXT: Cost Model: Invalid cost for instruction: %insert_v2f32_nxv2f32 = call <vscale x 2 x float> @llvm.vector.insert.nxv2f32.v2f32(<vscale x 2 x float> undef, <2 x float> undef, i64 0)
Collaborator

Perhaps not something to address in this patch, but I think it's wrong for this case to return an invalid cost, even when it cannot deduce anything for the operands. At worst it's a high cost for using spills and fills. But invalid cost basically says that we can't lower the general case, which isn't true.

Collaborator Author

That comes from the generic cost model code, which might just give up for scalable vectors. I guess we need to override the cost for more variants. But yes, in a later patch.

Comment on lines 593 to 596
TargetLoweringBase::LegalizeKind RLK =
getTLI()->getTypeConversion(C, RetVT);
TargetLoweringBase::LegalizeKind PLK =
getTLI()->getTypeConversion(C, VecVT);
Collaborator

nit: This still uses the 'R' and 'P' names.

Collaborator

sdesmalen-arm left a comment

Thanks for the changes. LGTM.

TargetLoweringBase::LegalizeKind VecLK =
getTLI()->getTypeConversion(C, VecVT);
const Value *Idx = IsExtract ? ICA.getArgs()[1] : ICA.getArgs()[2];
const ConstantInt *CIdx = dyn_cast<ConstantInt>(Idx);
Collaborator (paulwalker-arm, Mar 4, 2024)

Can this be just cast<> and the null check removed? I ask because the LangRef says the index operand must be a constant integer.
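
A small illustration of the difference (a sketch only, not the committed code; the helper name is made up):

#include "llvm/IR/Constants.h"
#include "llvm/Support/Casting.h"
using namespace llvm;

// The LangRef says the index operand of llvm.vector.insert/extract must be a
// constant integer, so cast<> (which asserts on a type mismatch) can replace
// dyn_cast<> plus an explicit null check.
static uint64_t getSubvectorIndex(const Value *IdxArg) {
  return cast<ConstantInt>(IdxArg)->getZExtValue();
}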

@huntergr-arm huntergr-arm merged commit 03f852f into main Mar 4, 2024
@huntergr-arm huntergr-arm deleted the users/huntergr-arm/spr/scalable-subvec-ins-ext-cost branch March 4, 2024 16:17
huntergr-arm added a commit that referenced this pull request Mar 5, 2024
Post-commit fixup patch for a request on #81135
Labels: backend:AArch64, llvm:analysis, llvm:transforms