[AArch64][CostModel] Increase the cost of illegal SVE int-to-fp converts #130756
If a scalable vector uitofp or sitofp effectively extends the size of each element as part of the conversion, the AArch64 backend will need to plant multiple unpacks before converting.
@llvm/pr-subscribers-llvm-analysis @llvm/pr-subscribers-backend-aarch64
Author: Graham Hunter (huntergr-arm)
Changes: If a scalable vector uitofp or sitofp effectively extends the size of each element as part of the conversion, the AArch64 backend will need to plant multiple unpacks before converting.
Patch is 28.65 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/130756.diff
2 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index 7cec8a17dfaaa..8091fb8f990bf 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -3144,6 +3144,21 @@ InstructionCost AArch64TTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
{ISD::SIGN_EXTEND, MVT::nxv8i32, MVT::nxv8i16, 2},
{ISD::SIGN_EXTEND, MVT::nxv8i64, MVT::nxv8i16, 6},
{ISD::SIGN_EXTEND, MVT::nxv4i64, MVT::nxv4i32, 2},
+
+ // Add cost for extending and converting to illegal -too wide- scalable
+ // Extending one size (e.g. i32 -> f64) takes 2 unpacks and 2 fcvts, while
+ // extending twice (e.g. i16 -> f64) takes 6 unpacks and 4 fcvts.
+ {ISD::SINT_TO_FP, MVT::nxv16f16, MVT::nxv16i8, 12},
+ {ISD::SINT_TO_FP, MVT::nxv16f32, MVT::nxv16i8, 22},
+ {ISD::SINT_TO_FP, MVT::nxv8f32, MVT::nxv8i16, 12},
+ {ISD::SINT_TO_FP, MVT::nxv8f64, MVT::nxv8i16, 22},
+ {ISD::SINT_TO_FP, MVT::nxv4f64, MVT::nxv4i32, 12},
+
+ {ISD::UINT_TO_FP, MVT::nxv16f16, MVT::nxv16i8, 12},
+ {ISD::UINT_TO_FP, MVT::nxv16f32, MVT::nxv16i8, 22},
+ {ISD::UINT_TO_FP, MVT::nxv8f32, MVT::nxv8i16, 12},
+ {ISD::UINT_TO_FP, MVT::nxv8f64, MVT::nxv8i16, 22},
+ {ISD::UINT_TO_FP, MVT::nxv4f64, MVT::nxv4i32, 12},
};
// We have to estimate a cost of fixed length operation upon
diff --git a/llvm/test/Analysis/CostModel/AArch64/sve-itofp.ll b/llvm/test/Analysis/CostModel/AArch64/sve-itofp.ll
new file mode 100644
index 0000000000000..12fd6411255f2
--- /dev/null
+++ b/llvm/test/Analysis/CostModel/AArch64/sve-itofp.ll
@@ -0,0 +1,268 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -passes="print<cost-model>" 2>&1 -disable-output -mtriple aarch64-linux-gnu -mattr=+sve -o - -S < %s | FileCheck %s
+
+target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
+target triple = "aarch64-unknown-linux-gnu"
+
+define void @sve-itofp() {
+; CHECK-LABEL: 'sve-itofp'
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv1si8_to_f16 = sitofp <vscale x 1 x i8> undef to <vscale x 1 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv1ui8_to_f16 = uitofp <vscale x 1 x i8> undef to <vscale x 1 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv1si16_to_f16 = sitofp <vscale x 1 x i16> undef to <vscale x 1 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv1ui16_to_f16 = uitofp <vscale x 1 x i16> undef to <vscale x 1 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv1si32_to_f16 = sitofp <vscale x 1 x i32> undef to <vscale x 1 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv1ui32_to_f16 = uitofp <vscale x 1 x i32> undef to <vscale x 1 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv1si64_to_f16 = sitofp <vscale x 1 x i64> undef to <vscale x 1 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv1ui64_to_f16 = uitofp <vscale x 1 x i64> undef to <vscale x 1 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv1si8_to_f32 = sitofp <vscale x 1 x i8> undef to <vscale x 1 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv1ui8_to_f32 = uitofp <vscale x 1 x i8> undef to <vscale x 1 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv1si16_to_f32 = sitofp <vscale x 1 x i16> undef to <vscale x 1 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv1ui16_to_f32 = uitofp <vscale x 1 x i16> undef to <vscale x 1 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv1si32_to_f32 = sitofp <vscale x 1 x i32> undef to <vscale x 1 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv1ui32_to_f32 = uitofp <vscale x 1 x i32> undef to <vscale x 1 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv1si64_to_f32 = sitofp <vscale x 1 x i64> undef to <vscale x 1 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv1ui64_to_f32 = uitofp <vscale x 1 x i64> undef to <vscale x 1 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv1si8_to_f64 = sitofp <vscale x 1 x i8> undef to <vscale x 1 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv1ui8_to_f64 = uitofp <vscale x 1 x i8> undef to <vscale x 1 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv1si16_to_f64 = sitofp <vscale x 1 x i16> undef to <vscale x 1 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv1ui16_to_f64 = uitofp <vscale x 1 x i16> undef to <vscale x 1 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv1si32_to_f64 = sitofp <vscale x 1 x i32> undef to <vscale x 1 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv1ui32_to_f64 = uitofp <vscale x 1 x i32> undef to <vscale x 1 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv1si64_to_f64 = sitofp <vscale x 1 x i64> undef to <vscale x 1 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv1ui64_to_f64 = uitofp <vscale x 1 x i64> undef to <vscale x 1 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2si8_to_f16 = sitofp <vscale x 2 x i8> undef to <vscale x 2 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2ui8_to_f16 = uitofp <vscale x 2 x i8> undef to <vscale x 2 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2si16_to_f16 = sitofp <vscale x 2 x i16> undef to <vscale x 2 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2ui16_to_f16 = uitofp <vscale x 2 x i16> undef to <vscale x 2 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2si32_to_f16 = sitofp <vscale x 2 x i32> undef to <vscale x 2 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2ui32_to_f16 = uitofp <vscale x 2 x i32> undef to <vscale x 2 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2si64_to_f16 = sitofp <vscale x 2 x i64> undef to <vscale x 2 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2ui64_to_f16 = uitofp <vscale x 2 x i64> undef to <vscale x 2 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2si8_to_f32 = sitofp <vscale x 2 x i8> undef to <vscale x 2 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2ui8_to_f32 = uitofp <vscale x 2 x i8> undef to <vscale x 2 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2si16_to_f32 = sitofp <vscale x 2 x i16> undef to <vscale x 2 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2ui16_to_f32 = uitofp <vscale x 2 x i16> undef to <vscale x 2 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2si32_to_f32 = sitofp <vscale x 2 x i32> undef to <vscale x 2 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2ui32_to_f32 = uitofp <vscale x 2 x i32> undef to <vscale x 2 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2si64_to_f32 = sitofp <vscale x 2 x i64> undef to <vscale x 2 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2ui64_to_f32 = uitofp <vscale x 2 x i64> undef to <vscale x 2 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2si8_to_f64 = sitofp <vscale x 2 x i8> undef to <vscale x 2 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2ui8_to_f64 = uitofp <vscale x 2 x i8> undef to <vscale x 2 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2si16_to_f64 = sitofp <vscale x 2 x i16> undef to <vscale x 2 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2ui16_to_f64 = uitofp <vscale x 2 x i16> undef to <vscale x 2 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2si32_to_f64 = sitofp <vscale x 2 x i32> undef to <vscale x 2 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2ui32_to_f64 = uitofp <vscale x 2 x i32> undef to <vscale x 2 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2si64_to_f64 = sitofp <vscale x 2 x i64> undef to <vscale x 2 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2ui64_to_f64 = uitofp <vscale x 2 x i64> undef to <vscale x 2 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv4si8_to_f16 = sitofp <vscale x 4 x i8> undef to <vscale x 4 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv4ui8_to_f16 = uitofp <vscale x 4 x i8> undef to <vscale x 4 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv4si16_to_f16 = sitofp <vscale x 4 x i16> undef to <vscale x 4 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv4ui16_to_f16 = uitofp <vscale x 4 x i16> undef to <vscale x 4 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv4si32_to_f16 = sitofp <vscale x 4 x i32> undef to <vscale x 4 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv4ui32_to_f16 = uitofp <vscale x 4 x i32> undef to <vscale x 4 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %nv4si64_to_f16 = sitofp <vscale x 4 x i64> undef to <vscale x 4 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %nv4ui64_to_f16 = uitofp <vscale x 4 x i64> undef to <vscale x 4 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv4si8_to_f32 = sitofp <vscale x 4 x i8> undef to <vscale x 4 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv4ui8_to_f32 = uitofp <vscale x 4 x i8> undef to <vscale x 4 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv4si16_to_f32 = sitofp <vscale x 4 x i16> undef to <vscale x 4 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv4ui16_to_f32 = uitofp <vscale x 4 x i16> undef to <vscale x 4 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv4si32_to_f32 = sitofp <vscale x 4 x i32> undef to <vscale x 4 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv4ui32_to_f32 = uitofp <vscale x 4 x i32> undef to <vscale x 4 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %nv4si64_to_f32 = sitofp <vscale x 4 x i64> undef to <vscale x 4 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %nv4ui64_to_f32 = uitofp <vscale x 4 x i64> undef to <vscale x 4 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %nv4si8_to_f64 = sitofp <vscale x 4 x i8> undef to <vscale x 4 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %nv4ui8_to_f64 = uitofp <vscale x 4 x i8> undef to <vscale x 4 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %nv4si16_to_f64 = sitofp <vscale x 4 x i16> undef to <vscale x 4 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %nv4ui16_to_f64 = uitofp <vscale x 4 x i16> undef to <vscale x 4 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %nv4si32_to_f64 = sitofp <vscale x 4 x i32> undef to <vscale x 4 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %nv4ui32_to_f64 = uitofp <vscale x 4 x i32> undef to <vscale x 4 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nv4si64_to_f64 = sitofp <vscale x 4 x i64> undef to <vscale x 4 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nv4ui64_to_f64 = uitofp <vscale x 4 x i64> undef to <vscale x 4 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv8si8_to_f16 = sitofp <vscale x 8 x i8> undef to <vscale x 8 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv8ui8_to_f16 = uitofp <vscale x 8 x i8> undef to <vscale x 8 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv8si16_to_f16 = sitofp <vscale x 8 x i16> undef to <vscale x 8 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv8ui16_to_f16 = uitofp <vscale x 8 x i16> undef to <vscale x 8 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %nv8si32_to_f16 = sitofp <vscale x 8 x i32> undef to <vscale x 8 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %nv8ui32_to_f16 = uitofp <vscale x 8 x i32> undef to <vscale x 8 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %nv8si64_to_f16 = sitofp <vscale x 8 x i64> undef to <vscale x 8 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %nv8ui64_to_f16 = uitofp <vscale x 8 x i64> undef to <vscale x 8 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %nv8si8_to_f32 = sitofp <vscale x 8 x i8> undef to <vscale x 8 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %nv8ui8_to_f32 = uitofp <vscale x 8 x i8> undef to <vscale x 8 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %nv8si16_to_f32 = sitofp <vscale x 8 x i16> undef to <vscale x 8 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %nv8ui16_to_f32 = uitofp <vscale x 8 x i16> undef to <vscale x 8 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nv8si32_to_f32 = sitofp <vscale x 8 x i32> undef to <vscale x 8 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nv8ui32_to_f32 = uitofp <vscale x 8 x i32> undef to <vscale x 8 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %nv8si64_to_f32 = sitofp <vscale x 8 x i64> undef to <vscale x 8 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %nv8ui64_to_f32 = uitofp <vscale x 8 x i64> undef to <vscale x 8 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %nv8si8_to_f64 = sitofp <vscale x 8 x i8> undef to <vscale x 8 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %nv8ui8_to_f64 = uitofp <vscale x 8 x i8> undef to <vscale x 8 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 22 for instruction: %nv8si16_to_f64 = sitofp <vscale x 8 x i16> undef to <vscale x 8 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 22 for instruction: %nv8ui16_to_f64 = uitofp <vscale x 8 x i16> undef to <vscale x 8 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 24 for instruction: %nv8si32_to_f64 = sitofp <vscale x 8 x i32> undef to <vscale x 8 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 24 for instruction: %nv8ui32_to_f64 = uitofp <vscale x 8 x i32> undef to <vscale x 8 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %nv8si64_to_f64 = sitofp <vscale x 8 x i64> undef to <vscale x 8 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %nv8ui64_to_f64 = uitofp <vscale x 8 x i64> undef to <vscale x 8 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %nv16si8_to_f16 = sitofp <vscale x 16 x i8> undef to <vscale x 16 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %nv16ui8_to_f16 = uitofp <vscale x 16 x i8> undef to <vscale x 16 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nv16si16_to_f16 = sitofp <vscale x 16 x i16> undef to <vscale x 16 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nv16ui16_to_f16 = uitofp <vscale x 16 x i16> undef to <vscale x 16 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %nv16si32_to_f16 = sitofp <vscale x 16 x i32> undef to <vscale x 16 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %nv16ui32_to_f16 = uitofp <vscale x 16 x i32> undef to <vscale x 16 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %nv16si64_to_f16 = sitofp <vscale x 16 x i64> undef to <vscale x 16 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %nv16ui64_to_f16 = uitofp <vscale x 16 x i64> undef to <vscale x 16 x half>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 22 for instruction: %nv16si8_to_f32 = sitofp <vscale x 16 x i8> undef to <vscale x 16 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 22 for instruction: %nv16ui8_to_f32 = uitofp <vscale x 16 x i8> undef to <vscale x 16 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 24 for instruction: %nv16si16_to_f32 = sitofp <vscale x 16 x i16> undef to <vscale x 16 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 24 for instruction: %nv16ui16_to_f32 = uitofp <vscale x 16 x i16> undef to <vscale x 16 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %nv16si32_to_f32 = sitofp <vscale x 16 x i32> undef to <vscale x 16 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %nv16ui32_to_f32 = uitofp <vscale x 16 x i32> undef to <vscale x 16 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %nv16si64_to_f32 = sitofp <vscale x 16 x i64> undef to <vscale x 16 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %nv16ui64_to_f32 = uitofp <vscale x 16 x i64> undef to <vscale x 16 x float>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 15 for instruction: %nv16si8_to_f64 = sitofp <vscale x 16 x i8> undef to <vscale x 16 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 15 for instruction: %nv16ui8_to_f64 = uitofp <vscale x 16 x i8> undef to <vscale x 16 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 44 for instruction: %nv16si16_to_f64 = sitofp <vscale x 16 x i16> undef to <vscale x 16 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 44 for instruction: %nv16ui16_to_f64 = uitofp <vscale x 16 x i16> undef to <vscale x 16 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 48 for instruction: %nv16si32_to_f64 = sitofp <vscale x 16 x i32> undef to <vscale x 16 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 48 for instruction: %nv16ui32_to_f64 = uitofp <vscale x 16 x i32> undef to <vscale x 16 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %nv16si64_to_f64 = sitofp <vscale x 16 x i64> undef to <vscale x 16 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %nv16ui64_to_f64 = uitofp <vscale x 16 x i64> undef to <vscale x 16 x double>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+
+ %nv1si8_to_f16 = sitofp <vscale x 1 x i8> undef to <vscale x 1 x half>
+ %nv1ui8_to_f16 = uitofp <vscale x 1 x i8> undef to <vscale x 1 x half>
+ %nv1si16_to_f16 = sitofp <vscale x 1 x i16> undef to <vscale x 1 x half>
+ %nv1ui16_to_f16 = uitofp <vscale x 1 x i16> undef to <vscale x 1 x half>
+ %nv1si32_to_f16 = sitofp <vscale x 1 x i32> undef to <vscale x 1 x half>
+ %nv1ui32_to_f16 = uitofp <vscale x 1 x i32> undef to <vscale x 1 x half>
+ %nv1si64_t...
[truncated]
✅ With the latest revision this PR passed the undef deprecator.
@@ -3144,6 +3144,21 @@ InstructionCost AArch64TTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
{ISD::SIGN_EXTEND, MVT::nxv8i32, MVT::nxv8i16, 2},
{ISD::SIGN_EXTEND, MVT::nxv8i64, MVT::nxv8i16, 6},
{ISD::SIGN_EXTEND, MVT::nxv4i64, MVT::nxv4i32, 2},
// Add cost for extending and converting to illegal -too wide- scalable
// Extending one size (e.g. i32 -> f64) takes 2 unpacks and 2 fcvts, while
I know that the cost-model is a bit of a guessing game, but is there any rationale behind picking a factor of 3? (i.e. why the cost is 12 instead of 4)
The fcvt instructions seem to have 1/2 to 1/8 the throughput (depending on type) compared to simple arithmetic instructions, e.g. add, so I bumped the cost of those. The numbers may not be the best overall, but don't seem to lead to regressions at present. We may want to try a range of values at some point to see if there's a better estimate.
Should the cost of converts of the 'not too wide' types then also be increased to reflect a higher reciprocal cost?
e.g. I see a cost of 1 for:
CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nv2si16_to_f64 = sitofp <vscale x 2 x i16> undef to <vscale x 2 x double>
We just discussed this offline, but just sharing my thoughts here: IMO the table should represent the cost of casts of legal types. Illegal types should be handled by generic code that multiplies the cost by the 'type legalization cost'. This is actually what happens for fixed-length types (see the code just below the table), but not (yet) for scalable types. Otherwise, any other illegal types that are not in the table (which includes types that cannot be represented by MVTs because they're "too wide") will get some default cost, which may be far too low.
It also seems that SINT_TO_FP records are missing in the table for scalable vector types (only FP_TO_SINT is handled). This is probably just a historical omission because this table gets updated/botched on an ad-hoc basis when people find that the cost is wrong for some workload, for some type and operation. It would be nice to clean this up.
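To make the suggestion concrete, here is a minimal standalone sketch of "table holds legal types only, generic code multiplies by the legalization cost". It does not use the LLVM API: `CastKey`, `LegalCastCosts` and `castCost` are hypothetical names, the table values are placeholders rather than numbers from this patch, and splitting is assumed to simply halve the element count.

```cpp
#include <cassert>
#include <map>
#include <tuple>

// Hypothetical key: (minimum element count, source bits, destination bits).
using CastKey = std::tuple<unsigned, unsigned, unsigned>;

// Placeholder costs for conversions whose destination fits in one register.
static const std::map<CastKey, unsigned> LegalCastCosts = {
    {CastKey{2, 64, 64}, 1}, // nxv2i64 -> nxv2f64
    {CastKey{2, 32, 64}, 1}, // nxv2i32 -> nxv2f64
    {CastKey{4, 32, 32}, 1}, // nxv4i32 -> nxv4f32
};

// Returns the table cost when an entry exists. Otherwise models one round of
// type-legalization splitting: halve the element count and double the cost.
unsigned castCost(unsigned NumElts, unsigned SrcBits, unsigned DstBits) {
  assert(NumElts > 0 && "element count must be positive");
  auto It = LegalCastCosts.find(CastKey{NumElts, SrcBits, DstBits});
  if (It != LegalCastCosts.end())
    return It->second;
  if (NumElts == 1)
    return 1; // Placeholder default for types that never match the table.
  return 2 * castCost(NumElts / 2, SrcBits, DstBits);
}
```

With these placeholder entries, `castCost(8, 32, 64)` misses the table twice, so the single-register cost of 1 is multiplied by 4. A type that is absent from the table at every width falls through to the default, which is the "far too low" case the comment warns about.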
So I've changed the approach slightly to use the pseudo-legalization from the base getCastInstrCost that NEON uses (note the code below the table is about using SVE for fixed-length, so doesn't always apply).
Using this approach, we'll still get some illegal types (e.g. mapping nxv2i16 -> nxv2f64, the input would be promoted to nxv2i64 but that's not done in the current code for NEON), but I'm covering the cases where the destination type is legal.
I've decided to back away from increasing the cost of direct fcvts here – even though they have less throughput than add, the NEON values are not written with that in mind so we might incorrectly decide to favour NEON (or scalar) code.
I'll rerun some benchmarking with these adjusted values to see whether there's any regression from doing this.
The numbers look OK to me for the most part.
Some of the Neon numbers might need to increase after #130665 due to the double-rounding. I believe SVE has native instructions for a lot of the problem cases, so it should hopefully not be so much of an issue there.
(I don't remember if we use undef for a reason that is different to poison or we just didn't change them yet. I usually just ignore that bot).
{ISD::UINT_TO_FP, MVT::nxv4f16, MVT::nxv4i32, 1},
// SVE: to nxv8f16
{ISD::SINT_TO_FP, MVT::nxv8f16, MVT::nxv8i8, 3},
Any reason for 3 as opposed to 2 here? (I'm not sure if the sxtb would disappear if the input was not from an arg too, but it is probably OK to include it).
define <vscale x 8 x half> @test(<vscale x 8 x i8> %a) {
%r = sitofp <vscale x 8 x i8> %a to <vscale x 8 x half>
ret <vscale x 8 x half> %r
}
ptrue p0.h
sxtb z0.h, p0/m, z0.h
scvtf z0.h, p0/m, z0.h
ret
// SVE: to nxv2f64
{ISD::SINT_TO_FP, MVT::nxv2f64, MVT::nxv2i8, 7},
{ISD::SINT_TO_FP, MVT::nxv2f64, MVT::nxv2i16, 5},
{ISD::SINT_TO_FP, MVT::nxv2f64, MVT::nxv2i32, 3},
Could this be 1, considering there is a native instruction for it?
IR:
define <vscale x 2 x double> @test(<vscale x 2 x i32> %a) {
%r = sitofp <vscale x 2 x i32> %a to <vscale x 2 x double>
ret <vscale x 2 x double> %r
}
ptrue p0.d
scvtf z0.d, p0/m, z0.s
ret
I based the entries on how the corresponding NEON entries are used. <vscale x 2 x i32> is not a legal type, and here it's treated as a packed vector – so rather than interleaving lanes the data is in the first half of the vector. The cost represents the required unpack operations on top of the fcvts themselves.
This isn't great, but it does mostly work with the multiply-by-legalization-factor approach discussed above with Sander.
This work is to address a particular regression in SPEC when max vector bandwidth is enabled, and the cost of a vplan with VF vscale x 8 is considered to be cheaper than a fixed VF of 8 due to the cost of the converts.
In the NEON case, a v8i16 is converted to a v8f64; TTI reaches this function, hits the call to BasicTTIImplBase::getCastInstrCost at the bottom, retries with v4i16 to v4f64, calls the base again and finally finds a match when called with v2i16 (an illegal type) to v2f64. That cost (4) then gets multiplied by the 2 rounds of splitting to give 16, and there's an extra penalty of 3 on top giving a score of 19.
For SVE, it was costed as 1 * 4 (for 2 rounds of splitting) + 3, giving 7. But NEON was able to use 4 tbl instructions, where SVE currently uses 6 unpack instructions. So now the line with a cost of 5 for nxv2i16 to nxv2f64 gives us a total cost of 23, and we now pick the fixed length VF instead, preventing the regression.
The same applies to the nxv2i32 to nxv2f64 case – we're expecting this to come from a conversion of nxv4i32 to nxv4f64, so the cost of the unpacks is bundled in the same way it is for NEON.
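The totals quoted above (19 for NEON, 7 and then 23 for SVE) all follow the same pattern: table entry times 2 per round of splitting, plus the extra penalty. A small sketch of that arithmetic, where `splitCastCost` is a hypothetical helper and the penalty of 3 is taken from the comment as-is:

```cpp
#include <cassert>

// Total = table entry * 2^(rounds of splitting) + extra penalty.
unsigned splitCastCost(unsigned EntryCost, unsigned SplitRounds,
                       unsigned Penalty) {
  assert(SplitRounds < 16 && "unreasonable number of splitting rounds");
  return EntryCost * (1u << SplitRounds) + Penalty;
}
```

`splitCastCost(4, 2, 3)` gives the NEON score of 19, `splitCastCost(1, 2, 3)` the old SVE score of 7, and `splitCastCost(5, 2, 3)` the new SVE score of 23, which is what tips the choice back to the fixed-length VF.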
I don't particularly like it, but I don't want to overhaul all the existing NEON code here – some of the numbers in the table date back to when the backend was first merged upstream, and I'm not sure how they were derived.
One possible alternative for this current patch would be to have an SVE-specific helper which calculates legalization separately (and comes up with the cost of the unpacks separately from the cost of the fcvt, which would allow us to change that cost in future if we decide to use tbl instructions for SVE as well), then only asks for the cost of a fully legalized fcvt to multiply by the number of registers required. I initially decided against that because I didn't want to reimplement a bunch of logic for legalizing the types, but it would be more accurate.
Would that be preferable?
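As a rough illustration of what such a helper might compute, the sketch below counts unpacks and fcvts separately so each can be priced independently. It is not part of the patch: the names are hypothetical, the source is assumed to be one full packed register with power-of-two element widths, and the counts for more than two widening steps are an extrapolation of the "2 unpacks and 2 fcvts" / "6 unpacks and 4 fcvts" figures in the code comment.

```cpp
#include <cassert>

// Hypothetical result type: operations needed for one widening convert.
struct WideningConvertOps {
  unsigned Unpacks;
  unsigned Converts;
};

// Counts operations for converting one full source register whose elements
// are SrcBits wide into elements DstBits wide.
WideningConvertOps countWideningOps(unsigned SrcBits, unsigned DstBits) {
  assert(SrcBits > 0 && DstBits >= SrcBits && "expects a widening convert");
  unsigned Parts = 1, Unpacks = 0;
  for (unsigned Bits = SrcBits; Bits < DstBits; Bits *= 2) {
    Parts *= 2;       // Each register splits into a low and a high half.
    Unpacks += Parts; // One unpack per resulting half.
  }
  return {Unpacks, Parts}; // One fcvt per final register.
}

// Prices the two kinds of operation separately, so the unpack cost could be
// swapped out later (e.g. if tbl were used instead) without touching fcvt.
unsigned wideningConvertCost(unsigned SrcBits, unsigned DstBits,
                             unsigned UnpackCost, unsigned ConvertCost) {
  WideningConvertOps Ops = countWideningOps(SrcBits, DstBits);
  return Ops.Unpacks * UnpackCost + Ops.Converts * ConvertCost;
}
```

For i32 to f64 this yields 2 unpacks and 2 converts, and for i16 to f64 it yields 6 unpacks and 4 converts, matching the comment in the patch.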
Okay so if I understand you correctly, the cost you've encoded here includes the cost of possibly (or likely) extracting a sub-vector as part of type legalisation (like extracting a <vscale x 4 x i16> from a <vscale x 8 x i16> source operand for example, when it has to split a <vscale x 8 x float> uitofp <vscale x 8 x i16> %in). Ideally the code in BaseT::getCastInstrCost would add some cost for extracting a subvector when calculating the legalisation cost, but this doesn't really happen anywhere at the moment.
If you want to reflect these costs in this table, can you decompose the cost into a "cost of convert" + "cost of type-legalisation/extract-subvector" in that case?
I've updated the code to include two kinds of entry in the table -- one where the destination type is legal, with minimal costs for them. The other is where the destination type is illegal (too wide) but the source type is either legal or narrower than legal and requires splitting, so I've added a penalty cost to them.
I've done that via symbolic constants, so we can change the numbers later without trying to figure out what we were modeling.
The target-independent code still gets involved in some of the cases and seems to work ok (e.g. nxv16i8 -> nxv16f64, since there's currently no MVT for the latter type and it has to be split at the EVT level before we can look it up in the table).
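For readers who have not opened the updated diff, a table built from symbolic constants might look roughly like the sketch below. The constant names, their values and the string-keyed table are all hypothetical and chosen only for illustration; the patch's actual constants and entries may differ.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical named components with placeholder values.
constexpr unsigned LegalConvertCost = 1; // One fcvt to a legal destination.
constexpr unsigned SplitPenalty = 3;     // Extra cost per split register.

struct CastEntry {
  const char *Dst;
  const char *Src;
  unsigned Cost;
};

static const CastEntry Entries[] = {
    // First kind: the destination type is legal, so the cost is minimal.
    {"nxv2f64", "nxv2i64", LegalConvertCost},
    // Second kind: the destination is too wide and is split into two
    // registers, each paying the convert cost plus the penalty.
    {"nxv4f64", "nxv4i32", 2 * (LegalConvertCost + SplitPenalty)},
};

// Returns the entry's cost, or 0 when the pair is not in the table.
unsigned lookupCost(const char *Dst, const char *Src) {
  assert(Dst && Src && "type names must not be null");
  for (const CastEntry &E : Entries)
    if (std::strcmp(E.Dst, Dst) == 0 && std::strcmp(E.Src, Src) == 0)
      return E.Cost;
  return 0;
}
```

Because every entry is an expression over the named constants, retuning the penalty later changes all affected entries at once without having to reverse-engineer what each number modelled.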
%r207 = sitofp <4 x i32> undef to <4 x double>
%r208 = uitofp <4 x i64> undef to <4 x double>
%r209 = sitofp <4 x i64> undef to <4 x double>
%r200 = uitofp <4 x i1> poison to <4 x double>
When making cost model changes I've typically avoided converting undef to poison to avoid the github pre-commit errors to make patches easier to review. If you do want to update the cost model tests to use poison I think that's best done as a NFC patch separately?
I've reverted the undef -> poison change for the existing test file, but the new file will use poison. I have left the changes to the duplicated lines though (we had two sets of i16 -> f64 tests and skipped over i32 -> f64; I assumed that wasn't intended.)
…, add wider entries to table
LGTM, if there are no other comments.
// FIXME: Use tbl instructions for SVE as well, at least in cases where the
// conversion is done in a loop.
I would remove this FIXME - mostly as it just doesn't feel like the right place for it and I think if we supported tbl extensions that would only apply to loops, and the cost model should probably be the worst-case of with and without tbl. (There is a chance we want to change that in the future to have something that can cost "invariant" vs "fixed" costs, similar for constants, but for the moment it is probably OK to stick with the higher).
Co-authored-by: Benjamin Maxwell <[email protected]>