[X86] Fold VPERMV3(X,M,Y) -> VPERMV(CONCAT(X,Y),WIDEN(M)) iff the CONCAT is free #122485
Conversation
…CAT is free

This extends the existing fold which concatenates X and Y if they are sequential subvectors extracted from the same source.

By using combineConcatVectorOps we can recognise other patterns where X and Y can be concatenated for free (e.g. sequential loads, concatenating repeated instructions etc.), which allows the VPERMV3 fold to be a lot more aggressive.

This required combineConcatVectorOps to be extended to fold the additional case of "concat(extract_subvector(x,lo), extract_subvector(x,hi)) -> extract_subvector(x)", similar to the original VPERMV3 fold where "x" was larger than the concat result type.

This also exposes more cases where we have repeated vector/subvector loads if they have multiple uses - e.g. where we're loading a ymm and the lo/hi xmm pairs independently - in the past we've always considered this to be relatively benign, but I'm not certain if we should now do more to keep these from splitting?
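To make the transform concrete, here is a small scalar reference model of why the fold is sound. This is plain C++ for illustration only - the helper names (permv3, permvConcat) are made up and this is not the SelectionDAG code, which is in the diff below. A two-source VPERMV3 over X and Y reads lanes from the virtual concatenation X:Y, so a single-source VPERMV over concat(X,Y) with the same index values produces the same low lanes; the rewrite is profitable whenever that concatenation is free. x86-style index masking (modulo the number of selectable lanes) is assumed.

```cpp
// Illustration only: scalar model of VPERMV3(X,M,Y) vs VPERMV(concat(X,Y),M).
#include <cassert>
#include <cstddef>
#include <vector>

// Two-source permute: mask lane I picks X[Idx] if Idx < N, else Y[Idx - N].
static std::vector<int> permv3(const std::vector<int> &X,
                               const std::vector<int> &Y,
                               const std::vector<unsigned> &M) {
  std::size_t N = X.size();
  std::vector<int> R(N);
  for (std::size_t I = 0; I != N; ++I) {
    std::size_t Idx = M[I] % (2 * N);
    R[I] = Idx < N ? X[Idx] : Y[Idx - N];
  }
  return R;
}

// Single-source permute over the concatenation; only the low M.size() lanes
// are modelled, matching the extract_subvector in the actual fold.
static std::vector<int> permvConcat(const std::vector<int> &XY,
                                    const std::vector<unsigned> &M) {
  std::vector<int> R(M.size());
  for (std::size_t I = 0; I != M.size(); ++I)
    R[I] = XY[M[I] % XY.size()];
  return R;
}

int main() {
  std::vector<int> X = {10, 11, 12, 13}, Y = {20, 21, 22, 23};
  std::vector<int> XY = {10, 11, 12, 13, 20, 21, 22, 23}; // concat(X, Y)
  std::vector<unsigned> M = {0, 7, 5, 2};                 // picks from both X and Y
  assert(permv3(X, Y, M) == permvConcat(XY, M));
  return 0;
}
```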
@llvm/pr-subscribers-backend-x86

Author: Simon Pilgrim (RKSimon)

Changes

This extends the existing fold which concatenates X and Y if they are sequential subvectors extracted from the same source. By using combineConcatVectorOps we can recognise other patterns where X and Y can be concatenated for free (e.g. sequential loads, concatenating repeated instructions etc.), which allows the VPERMV3 fold to be a lot more aggressive.

This required combineConcatVectorOps to be extended to fold the additional case of "concat(extract_subvector(x,lo), extract_subvector(x,hi)) -> extract_subvector(x)", similar to the original VPERMV3 fold where "x" was larger than the concat result type.

This also exposes more cases where we have repeated vector/subvector loads if they have multiple uses - e.g. where we're loading a ymm and the lo/hi xmm pairs independently - in the past we've always considered this to be relatively benign, but I'm not certain if we should now do more to keep these from splitting?

Patch is 312.16 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/122485.diff

22 Files Affected:
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index fbfcfc700ed62d..eda13286a22350 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -41699,6 +41699,11 @@ static SDValue canonicalizeLaneShuffleWithRepeatedOps(SDValue V,
return SDValue();
}
+static SDValue combineConcatVectorOps(const SDLoc &DL, MVT VT,
+ ArrayRef<SDValue> Ops, SelectionDAG &DAG,
+ TargetLowering::DAGCombinerInfo &DCI,
+ const X86Subtarget &Subtarget);
+
/// Try to combine x86 target specific shuffles.
static SDValue combineTargetShuffle(SDValue N, const SDLoc &DL,
SelectionDAG &DAG,
@@ -42399,25 +42404,17 @@ static SDValue combineTargetShuffle(SDValue N, const SDLoc &DL,
return SDValue();
}
case X86ISD::VPERMV3: {
- SDValue V1 = peekThroughBitcasts(N.getOperand(0));
- SDValue V2 = peekThroughBitcasts(N.getOperand(2));
- MVT SVT = V1.getSimpleValueType();
- // Combine VPERMV3 to widened VPERMV if the two source operands are split
- // from the same vector.
- if (V1.getOpcode() == ISD::EXTRACT_SUBVECTOR &&
- V1.getConstantOperandVal(1) == 0 &&
- V2.getOpcode() == ISD::EXTRACT_SUBVECTOR &&
- V2.getConstantOperandVal(1) == SVT.getVectorNumElements() &&
- V1.getOperand(0) == V2.getOperand(0)) {
- EVT NVT = V1.getOperand(0).getValueType();
- if (NVT.is256BitVector() ||
- (NVT.is512BitVector() && Subtarget.hasEVEX512())) {
- MVT WideVT = MVT::getVectorVT(
- VT.getScalarType(), NVT.getSizeInBits() / VT.getScalarSizeInBits());
+ // Combine VPERMV3 to widened VPERMV if the two source operands can be
+ // freely concatenated.
+ if (VT.is128BitVector() ||
+ (VT.is256BitVector() && Subtarget.useAVX512Regs())) {
+ SDValue Ops[] = {N.getOperand(0), N.getOperand(2)};
+ MVT WideVT = VT.getDoubleNumVectorElementsVT();
+ if (SDValue ConcatSrc =
+ combineConcatVectorOps(DL, WideVT, Ops, DAG, DCI, Subtarget)) {
SDValue Mask = widenSubVector(N.getOperand(1), false, Subtarget, DAG,
DL, WideVT.getSizeInBits());
- SDValue Perm = DAG.getNode(X86ISD::VPERMV, DL, WideVT, Mask,
- DAG.getBitcast(WideVT, V1.getOperand(0)));
+ SDValue Perm = DAG.getNode(X86ISD::VPERMV, DL, WideVT, Mask, ConcatSrc);
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, VT, Perm,
DAG.getIntPtrConstant(0, DL));
}
@@ -42425,6 +42422,9 @@ static SDValue combineTargetShuffle(SDValue N, const SDLoc &DL,
SmallVector<SDValue, 2> Ops;
SmallVector<int, 32> Mask;
if (getTargetShuffleMask(N, /*AllowSentinelZero=*/false, Ops, Mask)) {
+ assert(Mask.size() == NumElts && "Unexpected shuffle mask size");
+ SDValue V1 = peekThroughBitcasts(N.getOperand(0));
+ SDValue V2 = peekThroughBitcasts(N.getOperand(2));
MVT MaskVT = N.getOperand(1).getSimpleValueType();
// Canonicalize to VPERMV if both sources are the same.
if (V1 == V2) {
@@ -57367,10 +57367,8 @@ static SDValue combineConcatVectorOps(const SDLoc &DL, MVT VT,
Op0.getOperand(1));
}
- // concat(extract_subvector(v0,c0), extract_subvector(v1,c1)) -> vperm2x128.
- // Only concat of subvector high halves which vperm2x128 is best at.
// TODO: This should go in combineX86ShufflesRecursively eventually.
- if (VT.is256BitVector() && NumOps == 2) {
+ if (NumOps == 2) {
SDValue Src0 = peekThroughBitcasts(Ops[0]);
SDValue Src1 = peekThroughBitcasts(Ops[1]);
if (Src0.getOpcode() == ISD::EXTRACT_SUBVECTOR &&
@@ -57379,7 +57377,10 @@ static SDValue combineConcatVectorOps(const SDLoc &DL, MVT VT,
EVT SrcVT1 = Src1.getOperand(0).getValueType();
unsigned NumSrcElts0 = SrcVT0.getVectorNumElements();
unsigned NumSrcElts1 = SrcVT1.getVectorNumElements();
- if (SrcVT0.is256BitVector() && SrcVT1.is256BitVector() &&
+ // concat(extract_subvector(v0), extract_subvector(v1)) -> vperm2x128.
+ // Only concat of subvector high halves which vperm2x128 is best at.
+ if (VT.is256BitVector() && SrcVT0.is256BitVector() &&
+ SrcVT1.is256BitVector() &&
Src0.getConstantOperandAPInt(1) == (NumSrcElts0 / 2) &&
Src1.getConstantOperandAPInt(1) == (NumSrcElts1 / 2)) {
return DAG.getNode(X86ISD::VPERM2X128, DL, VT,
@@ -57387,6 +57388,14 @@ static SDValue combineConcatVectorOps(const SDLoc &DL, MVT VT,
DAG.getBitcast(VT, Src1.getOperand(0)),
DAG.getTargetConstant(0x31, DL, MVT::i8));
}
+ // concat(extract_subvector(x,lo), extract_subvector(x,hi)) -> x.
+ if (Src0.getOperand(0) == Src1.getOperand(0) &&
+ Src0.getConstantOperandAPInt(1) == 0 &&
+ Src1.getConstantOperandAPInt(1) ==
+ Src0.getValueType().getVectorNumElements()) {
+ return DAG.getBitcast(VT, extractSubVector(Src0.getOperand(0), 0, DAG,
+ DL, VT.getSizeInBits()));
+ }
}
}
diff --git a/llvm/test/CodeGen/X86/any_extend_vector_inreg_of_broadcast_from_memory.ll b/llvm/test/CodeGen/X86/any_extend_vector_inreg_of_broadcast_from_memory.ll
index 1305559bc04e0f..3d72319f59ca9e 100644
--- a/llvm/test/CodeGen/X86/any_extend_vector_inreg_of_broadcast_from_memory.ll
+++ b/llvm/test/CodeGen/X86/any_extend_vector_inreg_of_broadcast_from_memory.ll
@@ -1337,10 +1337,9 @@ define void @vec256_i16_widen_to_i32_factor2_broadcast_to_v8i32_factor8(ptr %in.
;
; AVX512BW-LABEL: vec256_i16_widen_to_i32_factor2_broadcast_to_v8i32_factor8:
; AVX512BW: # %bb.0:
-; AVX512BW-NEXT: vmovdqa (%rdi), %ymm0
-; AVX512BW-NEXT: vpmovsxbw {{.*#+}} ymm1 = [0,17,0,19,0,21,0,23,0,25,0,27,0,29,0,31]
-; AVX512BW-NEXT: vpermi2w 32(%rdi), %ymm0, %ymm1
-; AVX512BW-NEXT: vpaddb (%rsi), %zmm1, %zmm0
+; AVX512BW-NEXT: vpmovsxbw {{.*#+}} ymm0 = [0,17,0,19,0,21,0,23,0,25,0,27,0,29,0,31]
+; AVX512BW-NEXT: vpermw (%rdi), %zmm0, %zmm0
+; AVX512BW-NEXT: vpaddb (%rsi), %zmm0, %zmm0
; AVX512BW-NEXT: vmovdqa64 %zmm0, (%rdx)
; AVX512BW-NEXT: vzeroupper
; AVX512BW-NEXT: retq
@@ -1789,10 +1788,9 @@ define void @vec256_i64_widen_to_i128_factor2_broadcast_to_v2i128_factor2(ptr %i
;
; AVX512F-FAST-LABEL: vec256_i64_widen_to_i128_factor2_broadcast_to_v2i128_factor2:
; AVX512F-FAST: # %bb.0:
-; AVX512F-FAST-NEXT: vmovdqa (%rdi), %ymm0
-; AVX512F-FAST-NEXT: vpmovsxbq {{.*#+}} ymm1 = [0,5,0,7]
-; AVX512F-FAST-NEXT: vpermi2q 32(%rdi), %ymm0, %ymm1
-; AVX512F-FAST-NEXT: vpaddb (%rsi), %ymm1, %ymm0
+; AVX512F-FAST-NEXT: vpmovsxbq {{.*#+}} ymm0 = [0,5,0,7]
+; AVX512F-FAST-NEXT: vpermq (%rdi), %zmm0, %zmm0
+; AVX512F-FAST-NEXT: vpaddb (%rsi), %ymm0, %ymm0
; AVX512F-FAST-NEXT: vmovdqa %ymm0, (%rdx)
; AVX512F-FAST-NEXT: vzeroupper
; AVX512F-FAST-NEXT: retq
@@ -1808,10 +1806,9 @@ define void @vec256_i64_widen_to_i128_factor2_broadcast_to_v2i128_factor2(ptr %i
;
; AVX512DQ-FAST-LABEL: vec256_i64_widen_to_i128_factor2_broadcast_to_v2i128_factor2:
; AVX512DQ-FAST: # %bb.0:
-; AVX512DQ-FAST-NEXT: vmovdqa (%rdi), %ymm0
-; AVX512DQ-FAST-NEXT: vpmovsxbq {{.*#+}} ymm1 = [0,5,0,7]
-; AVX512DQ-FAST-NEXT: vpermi2q 32(%rdi), %ymm0, %ymm1
-; AVX512DQ-FAST-NEXT: vpaddb (%rsi), %ymm1, %ymm0
+; AVX512DQ-FAST-NEXT: vpmovsxbq {{.*#+}} ymm0 = [0,5,0,7]
+; AVX512DQ-FAST-NEXT: vpermq (%rdi), %zmm0, %zmm0
+; AVX512DQ-FAST-NEXT: vpaddb (%rsi), %ymm0, %ymm0
; AVX512DQ-FAST-NEXT: vmovdqa %ymm0, (%rdx)
; AVX512DQ-FAST-NEXT: vzeroupper
; AVX512DQ-FAST-NEXT: retq
@@ -1827,10 +1824,9 @@ define void @vec256_i64_widen_to_i128_factor2_broadcast_to_v2i128_factor2(ptr %i
;
; AVX512BW-FAST-LABEL: vec256_i64_widen_to_i128_factor2_broadcast_to_v2i128_factor2:
; AVX512BW-FAST: # %bb.0:
-; AVX512BW-FAST-NEXT: vmovdqa (%rdi), %ymm0
-; AVX512BW-FAST-NEXT: vpmovsxbq {{.*#+}} ymm1 = [0,5,0,7]
-; AVX512BW-FAST-NEXT: vpermi2q 32(%rdi), %ymm0, %ymm1
-; AVX512BW-FAST-NEXT: vpaddb (%rsi), %zmm1, %zmm0
+; AVX512BW-FAST-NEXT: vpmovsxbq {{.*#+}} ymm0 = [0,5,0,7]
+; AVX512BW-FAST-NEXT: vpermq (%rdi), %zmm0, %zmm0
+; AVX512BW-FAST-NEXT: vpaddb (%rsi), %zmm0, %zmm0
; AVX512BW-FAST-NEXT: vmovdqa64 %zmm0, (%rdx)
; AVX512BW-FAST-NEXT: vzeroupper
; AVX512BW-FAST-NEXT: retq
diff --git a/llvm/test/CodeGen/X86/avx512-shuffles/partial_permute.ll b/llvm/test/CodeGen/X86/avx512-shuffles/partial_permute.ll
index 5078130f180779..ba26eb9b649f8a 100644
--- a/llvm/test/CodeGen/X86/avx512-shuffles/partial_permute.ll
+++ b/llvm/test/CodeGen/X86/avx512-shuffles/partial_permute.ll
@@ -149,9 +149,10 @@ define <8 x i16> @test_masked_z_16xi16_to_8xi16_perm_mask3(<16 x i16> %vec, <8 x
define <8 x i16> @test_16xi16_to_8xi16_perm_mem_mask0(ptr %vp) {
; CHECK-LABEL: test_16xi16_to_8xi16_perm_mem_mask0:
; CHECK: # %bb.0:
-; CHECK-NEXT: vmovdqa (%rdi), %xmm1
; CHECK-NEXT: vpmovsxbw {{.*#+}} xmm0 = [0,7,13,3,5,13,3,9]
-; CHECK-NEXT: vpermi2w 16(%rdi), %xmm1, %xmm0
+; CHECK-NEXT: vpermw (%rdi), %ymm0, %ymm0
+; CHECK-NEXT: # kill: def $xmm0 killed $xmm0 killed $ymm0
+; CHECK-NEXT: vzeroupper
; CHECK-NEXT: retq
%vec = load <16 x i16>, ptr %vp
%res = shufflevector <16 x i16> %vec, <16 x i16> undef, <8 x i32> <i32 0, i32 7, i32 13, i32 3, i32 5, i32 13, i32 3, i32 9>
@@ -160,11 +161,12 @@ define <8 x i16> @test_16xi16_to_8xi16_perm_mem_mask0(ptr %vp) {
define <8 x i16> @test_masked_16xi16_to_8xi16_perm_mem_mask0(ptr %vp, <8 x i16> %vec2, <8 x i16> %mask) {
; CHECK-LABEL: test_masked_16xi16_to_8xi16_perm_mem_mask0:
; CHECK: # %bb.0:
-; CHECK-NEXT: vmovdqa (%rdi), %xmm2
-; CHECK-NEXT: vpmovsxbw {{.*#+}} xmm3 = [0,7,13,3,5,13,3,9]
-; CHECK-NEXT: vpermi2w 16(%rdi), %xmm2, %xmm3
+; CHECK-NEXT: # kill: def $xmm0 killed $xmm0 def $ymm0
+; CHECK-NEXT: vpmovsxbw {{.*#+}} xmm2 = [0,7,13,3,5,13,3,9]
; CHECK-NEXT: vptestnmw %xmm1, %xmm1, %k1
-; CHECK-NEXT: vmovdqu16 %xmm3, %xmm0 {%k1}
+; CHECK-NEXT: vpermw (%rdi), %ymm2, %ymm0 {%k1}
+; CHECK-NEXT: # kill: def $xmm0 killed $xmm0 killed $ymm0
+; CHECK-NEXT: vzeroupper
; CHECK-NEXT: retq
%vec = load <16 x i16>, ptr %vp
%shuf = shufflevector <16 x i16> %vec, <16 x i16> undef, <8 x i32> <i32 0, i32 7, i32 13, i32 3, i32 5, i32 13, i32 3, i32 9>
@@ -176,11 +178,11 @@ define <8 x i16> @test_masked_16xi16_to_8xi16_perm_mem_mask0(ptr %vp, <8 x i16>
define <8 x i16> @test_masked_z_16xi16_to_8xi16_perm_mem_mask0(ptr %vp, <8 x i16> %mask) {
; CHECK-LABEL: test_masked_z_16xi16_to_8xi16_perm_mem_mask0:
; CHECK: # %bb.0:
-; CHECK-NEXT: vmovdqa (%rdi), %xmm2
; CHECK-NEXT: vpmovsxbw {{.*#+}} xmm1 = [0,7,13,3,5,13,3,9]
; CHECK-NEXT: vptestnmw %xmm0, %xmm0, %k1
-; CHECK-NEXT: vpermi2w 16(%rdi), %xmm2, %xmm1 {%k1} {z}
-; CHECK-NEXT: vmovdqa %xmm1, %xmm0
+; CHECK-NEXT: vpermw (%rdi), %ymm1, %ymm0 {%k1} {z}
+; CHECK-NEXT: # kill: def $xmm0 killed $xmm0 killed $ymm0
+; CHECK-NEXT: vzeroupper
; CHECK-NEXT: retq
%vec = load <16 x i16>, ptr %vp
%shuf = shufflevector <16 x i16> %vec, <16 x i16> undef, <8 x i32> <i32 0, i32 7, i32 13, i32 3, i32 5, i32 13, i32 3, i32 9>
@@ -192,11 +194,12 @@ define <8 x i16> @test_masked_z_16xi16_to_8xi16_perm_mem_mask0(ptr %vp, <8 x i16
define <8 x i16> @test_masked_16xi16_to_8xi16_perm_mem_mask1(ptr %vp, <8 x i16> %vec2, <8 x i16> %mask) {
; CHECK-LABEL: test_masked_16xi16_to_8xi16_perm_mem_mask1:
; CHECK: # %bb.0:
-; CHECK-NEXT: vmovdqa (%rdi), %xmm2
-; CHECK-NEXT: vpmovsxbw {{.*#+}} xmm3 = [3,15,12,7,1,5,8,14]
-; CHECK-NEXT: vpermi2w 16(%rdi), %xmm2, %xmm3
+; CHECK-NEXT: # kill: def $xmm0 killed $xmm0 def $ymm0
+; CHECK-NEXT: vpmovsxbw {{.*#+}} xmm2 = [3,15,12,7,1,5,8,14]
; CHECK-NEXT: vptestnmw %xmm1, %xmm1, %k1
-; CHECK-NEXT: vmovdqu16 %xmm3, %xmm0 {%k1}
+; CHECK-NEXT: vpermw (%rdi), %ymm2, %ymm0 {%k1}
+; CHECK-NEXT: # kill: def $xmm0 killed $xmm0 killed $ymm0
+; CHECK-NEXT: vzeroupper
; CHECK-NEXT: retq
%vec = load <16 x i16>, ptr %vp
%shuf = shufflevector <16 x i16> %vec, <16 x i16> undef, <8 x i32> <i32 3, i32 15, i32 12, i32 7, i32 1, i32 5, i32 8, i32 14>
@@ -208,11 +211,11 @@ define <8 x i16> @test_masked_16xi16_to_8xi16_perm_mem_mask1(ptr %vp, <8 x i16>
define <8 x i16> @test_masked_z_16xi16_to_8xi16_perm_mem_mask1(ptr %vp, <8 x i16> %mask) {
; CHECK-LABEL: test_masked_z_16xi16_to_8xi16_perm_mem_mask1:
; CHECK: # %bb.0:
-; CHECK-NEXT: vmovdqa (%rdi), %xmm2
; CHECK-NEXT: vpmovsxbw {{.*#+}} xmm1 = [3,15,12,7,1,5,8,14]
; CHECK-NEXT: vptestnmw %xmm0, %xmm0, %k1
-; CHECK-NEXT: vpermi2w 16(%rdi), %xmm2, %xmm1 {%k1} {z}
-; CHECK-NEXT: vmovdqa %xmm1, %xmm0
+; CHECK-NEXT: vpermw (%rdi), %ymm1, %ymm0 {%k1} {z}
+; CHECK-NEXT: # kill: def $xmm0 killed $xmm0 killed $ymm0
+; CHECK-NEXT: vzeroupper
; CHECK-NEXT: retq
%vec = load <16 x i16>, ptr %vp
%shuf = shufflevector <16 x i16> %vec, <16 x i16> undef, <8 x i32> <i32 3, i32 15, i32 12, i32 7, i32 1, i32 5, i32 8, i32 14>
@@ -256,9 +259,10 @@ define <8 x i16> @test_masked_z_16xi16_to_8xi16_perm_mem_mask2(ptr %vp, <8 x i16
define <8 x i16> @test_16xi16_to_8xi16_perm_mem_mask3(ptr %vp) {
; CHECK-LABEL: test_16xi16_to_8xi16_perm_mem_mask3:
; CHECK: # %bb.0:
-; CHECK-NEXT: vmovdqa (%rdi), %xmm1
; CHECK-NEXT: vpmovsxbw {{.*#+}} xmm0 = [9,7,9,6,9,4,3,2]
-; CHECK-NEXT: vpermi2w 16(%rdi), %xmm1, %xmm0
+; CHECK-NEXT: vpermw (%rdi), %ymm0, %ymm0
+; CHECK-NEXT: # kill: def $xmm0 killed $xmm0 killed $ymm0
+; CHECK-NEXT: vzeroupper
; CHECK-NEXT: retq
%vec = load <16 x i16>, ptr %vp
%res = shufflevector <16 x i16> %vec, <16 x i16> undef, <8 x i32> <i32 9, i32 7, i32 9, i32 6, i32 9, i32 4, i32 3, i32 2>
@@ -267,11 +271,12 @@ define <8 x i16> @test_16xi16_to_8xi16_perm_mem_mask3(ptr %vp) {
define <8 x i16> @test_masked_16xi16_to_8xi16_perm_mem_mask3(ptr %vp, <8 x i16> %vec2, <8 x i16> %mask) {
; CHECK-LABEL: test_masked_16xi16_to_8xi16_perm_mem_mask3:
; CHECK: # %bb.0:
-; CHECK-NEXT: vmovdqa (%rdi), %xmm2
-; CHECK-NEXT: vpmovsxbw {{.*#+}} xmm3 = [9,7,9,6,9,4,3,2]
-; CHECK-NEXT: vpermi2w 16(%rdi), %xmm2, %xmm3
+; CHECK-NEXT: # kill: def $xmm0 killed $xmm0 def $ymm0
+; CHECK-NEXT: vpmovsxbw {{.*#+}} xmm2 = [9,7,9,6,9,4,3,2]
; CHECK-NEXT: vptestnmw %xmm1, %xmm1, %k1
-; CHECK-NEXT: vmovdqu16 %xmm3, %xmm0 {%k1}
+; CHECK-NEXT: vpermw (%rdi), %ymm2, %ymm0 {%k1}
+; CHECK-NEXT: # kill: def $xmm0 killed $xmm0 killed $ymm0
+; CHECK-NEXT: vzeroupper
; CHECK-NEXT: retq
%vec = load <16 x i16>, ptr %vp
%shuf = shufflevector <16 x i16> %vec, <16 x i16> undef, <8 x i32> <i32 9, i32 7, i32 9, i32 6, i32 9, i32 4, i32 3, i32 2>
@@ -283,11 +288,11 @@ define <8 x i16> @test_masked_16xi16_to_8xi16_perm_mem_mask3(ptr %vp, <8 x i16>
define <8 x i16> @test_masked_z_16xi16_to_8xi16_perm_mem_mask3(ptr %vp, <8 x i16> %mask) {
; CHECK-LABEL: test_masked_z_16xi16_to_8xi16_perm_mem_mask3:
; CHECK: # %bb.0:
-; CHECK-NEXT: vmovdqa (%rdi), %xmm2
; CHECK-NEXT: vpmovsxbw {{.*#+}} xmm1 = [9,7,9,6,9,4,3,2]
; CHECK-NEXT: vptestnmw %xmm0, %xmm0, %k1
-; CHECK-NEXT: vpermi2w 16(%rdi), %xmm2, %xmm1 {%k1} {z}
-; CHECK-NEXT: vmovdqa %xmm1, %xmm0
+; CHECK-NEXT: vpermw (%rdi), %ymm1, %ymm0 {%k1} {z}
+; CHECK-NEXT: # kill: def $xmm0 killed $xmm0 killed $ymm0
+; CHECK-NEXT: vzeroupper
; CHECK-NEXT: retq
%vec = load <16 x i16>, ptr %vp
%shuf = shufflevector <16 x i16> %vec, <16 x i16> undef, <8 x i32> <i32 9, i32 7, i32 9, i32 6, i32 9, i32 4, i32 3, i32 2>
@@ -579,9 +584,9 @@ define <8 x i16> @test_masked_z_32xi16_to_8xi16_perm_mask3(<32 x i16> %vec, <8 x
define <16 x i16> @test_32xi16_to_16xi16_perm_mem_mask0(ptr %vp) {
; CHECK-LABEL: test_32xi16_to_16xi16_perm_mem_mask0:
; CHECK: # %bb.0:
-; CHECK-NEXT: vmovdqa (%rdi), %ymm1
; CHECK-NEXT: vpmovsxbw {{.*#+}} ymm0 = [20,19,22,12,13,20,0,6,10,7,20,12,28,18,13,12]
-; CHECK-NEXT: vpermi2w 32(%rdi), %ymm1, %ymm0
+; CHECK-NEXT: vpermw (%rdi), %zmm0, %zmm0
+; CHECK-NEXT: # kill: def $ymm0 killed $ymm0 killed $zmm0
; CHECK-NEXT: retq
%vec = load <32 x i16>, ptr %vp
%res = shufflevector <32 x i16> %vec, <32 x i16> undef, <16 x i32> <i32 20, i32 19, i32 22, i32 12, i32 13, i32 20, i32 0, i32 6, i32 10, i32 7, i32 20, i32 12, i32 28, i32 18, i32 13, i32 12>
@@ -590,11 +595,11 @@ define <16 x i16> @test_32xi16_to_16xi16_perm_mem_mask0(ptr %vp) {
define <16 x i16> @test_masked_32xi16_to_16xi16_perm_mem_mask0(ptr %vp, <16 x i16> %vec2, <16 x i16> %mask) {
; CHECK-LABEL: test_masked_32xi16_to_16xi16_perm_mem_mask0:
; CHECK: # %bb.0:
-; CHECK-NEXT: vmovdqa (%rdi), %ymm2
-; CHECK-NEXT: vpmovsxbw {{.*#+}} ymm3 = [20,19,22,12,13,20,0,6,10,7,20,12,28,18,13,12]
-; CHECK-NEXT: vpermi2w 32(%rdi), %ymm2, %ymm3
+; CHECK-NEXT: # kill: def $ymm0 killed $ymm0 def $zmm0
+; CHECK-NEXT: vpmovsxbw {{.*#+}} ymm2 = [20,19,22,12,13,20,0,6,10,7,20,12,28,18,13,12]
; CHECK-NEXT: vptestnmw %ymm1, %ymm1, %k1
-; CHECK-NEXT: vmovdqu16 %ymm3, %ymm0 {%k1}
+; CHECK-NEXT: vpermw (%rdi), %zmm2, %zmm0 {%k1}
+; CHECK-NEXT: # kill: def $ymm0 killed $ymm0 killed $zmm0
; CHECK-NEXT: retq
%vec = load <32 x i16>, ptr %vp
%shuf = shufflevector <32 x i16> %vec, <32 x i16> undef, <16 x i32> <i32 20, i32 19, i32 22, i32 12, i32 13, i32 20, i32 0, i32 6, i32 10, i32 7, i32 20, i32 12, i32 28, i32 18, i32 13, i32 12>
@@ -606,11 +611,10 @@ define <16 x i16> @test_masked_32xi16_to_16xi16_perm_mem_mask0(ptr %vp, <16 x i1
define <16 x i16> @test_masked_z_32xi16_to_16xi16_perm_mem_mask0(ptr %vp, <16 x i16> %mask) {
; CHECK-LABEL: test_masked_z_32xi16_to_16xi16_perm_mem_mask0:
; CHECK: # %bb.0:
-; CHECK-NEXT: vmovdqa (%rdi), %ymm2
; CHECK-NEXT: vpmovsxbw {{.*#+}} ymm1 = [20,19,22,12,13,20,0,6,10,7,20,12,28,18,13,12]
; CHECK-NEXT: vptestnmw %ymm0, %ymm0, %k1
-; CHECK-NEXT: vpermi2w 32(%rdi), %ymm2, %ymm1 {%k1} {z}
-; CHECK-NEXT: vmovdqa %ymm1, %ymm0
+; CHECK-NEXT: vpermw (%rdi), %zmm1, %zmm0 {%k1} {z}
+; CHECK-NEXT: # kill: def $ymm0 killed $ymm0 killed $zmm0
; CHECK-NEXT: retq
%vec = load <32 x i16>, ptr %vp
%shuf = shufflevector <32 x i16> %vec, <32 x i16> undef, <16 x i32> <i32 20, i32 19, i32 22, i32 12, i32 13, i32 20, i32 0, i32 6, i32 10, i32 7, i32 20, i32 12, i32 28, i32 18, i32 13, i32 12>
@@ -622,11 +626,11 @@ define <16 x i16> @test_masked_z_32xi16_to_16xi16_perm_mem_mask0(ptr %vp, <16 x
define <16 x i16> @test_masked_32xi16_to_16xi16_perm_mem_mask1(ptr %vp, <16 x i16> %vec2, <16 x i16> %mask) {
; CHECK-LABEL: test_masked_32xi16_to_16xi16_perm_mem_mask1:
; CHECK: # %bb.0:
-; CHECK-NEXT: vmovdqa (%rdi), %ymm2
-; CHECK-NEXT: vpmovsxbw {{.*#+}} ymm3 = [22,13,21,1,14,8,5,16,15,17,24,28,15,9,14,25]
-; CHECK-NEXT: vpermi2w 32(%rdi), %ymm2, %ymm3
+; CHECK-NEXT: # kill: def $ymm0 killed $ymm0 def $zmm0
+; CHECK-NEXT: vpmovsxbw {{.*#+}} ymm2 = [22,13,21,1,14,8,5,16,15,17,24,28,15,9,14,25]
; CHECK-NEXT: vptestnmw %ymm1, %ymm1, %k1
-; CHECK-NEXT: vmovdqu16 %ymm3, %ymm0 {%k1}
+; CHECK-NEXT: vpermw (%rdi), %zmm2, %zmm0 {%k1}
+; CHECK-NEXT: # kill: def $ymm0 killed $ymm0 killed $zmm0
; CHECK-NEXT: retq
%vec = load <32 x i16>, ptr %vp
%shuf = shufflevector <32 x i16> %vec, <32 x i16> undef, <16 x i32> <i32 22, i32 13, i32 21, i32 1, i32 14, i32 8, i32 5, i32 16, i32 15, i32 17, i32 24, i32 28, i32 15, i32 9, i32 14, i32 25>
@@ -638,11 +642,10 @@ define <16 x i16> @test_masked_32xi16_to_16xi16_perm_mem_mask1(ptr %vp, <16 x i1
define <16 x i16> @test_masked_z_32xi16_to_16xi16_perm_mem_mask1(ptr %vp, <16 x i16> %mask) {
; CHECK-LABEL: test_masked_z_32xi16_to_16xi16_perm_mem_mask1:
; CHECK: # %bb.0:
-; CHECK-NEXT: vmovdqa (%rdi), %ymm2
; CHECK-NEXT: vpmovsxbw {{.*#+}} ymm1 = [22,13,21,1,14,8,5,16,15,17,24,28,15,9,14,25]
; CHECK-NEXT: vptestnmw %ymm0, %ymm0, %k1
-; CHECK-NEXT: vpermi2w 32(%rdi), %ymm2, %ymm1 {%k1} {z}
-; CHECK-NEXT: vmovdqa %ymm1, %ymm0
+; CHECK-NEXT: vpermw (%rdi), %zmm1, %zmm0 {%k1} {z}
+; CHECK-NEXT: # kill: def $ymm0 killed $ymm0 killed $zmm0
; CHECK-N...
[truncated]
LGTM
; AVX512BWVL-FAST-ALL-NEXT: vpermi2d 32(%rdi), %ymm0, %ymm1
; AVX512BWVL-FAST-ALL-NEXT: vmovdqa %ymm1, (%rsi)
; AVX512BWVL-FAST-ALL-NEXT: vmovaps {{.*#+}} ymm0 = [1,3,5,7,9,11,13,15]
; AVX512BWVL-FAST-ALL-NEXT: vpermps (%rdi), %zmm0, %zmm0
The vpermps has the same TP as vpermi2d, but it now loads double the memory. Is it a win here?
It's a domain crossing issue as X86FixupVectorConstants runs very late - for AVX512 class loads, are you OK with a full-width fp load being replaced with an integer s/zext load or broadcast?
Sorry for my bad English - I'm not talking about loading a double type. I mean the load size is getting doubled (from 256-bit to 512-bit).
np - I think I understood - what I was asking was whether you had any concerns regarding domain stalls on an AVX512-capable target for:
vpmovsxbd {{.*#+}} ymm0 = [1,3,5,7,9,11,13,15] ; INT DOMAIN - 64-bit load
vpermps (%rdi), %zmm0, %zmm0 ; PS DOMAIN
vs
vmovaps {{.*#+}} ymm0 = [1,3,5,7,9,11,13,15] ; FP DOMAIN - 256-bit load
vpermps (%rdi), %zmm0, %zmm0 ; PS DOMAIN
I don't think there would be, and we could tweak the X86FixupVectorConstants tables to allow integer s/zext loads to replace fp full width loads.
But we have the equivalent integer instruction VPERMD, which can solve domain stalls, doesn't it?
Though the 64-bit load vs. 256-bit load is a problem, my initial question was just about
vpermi2d 32(%rdi), %ymm0, %ymm1 ; 256-bit load
vs
vpermps (%rdi), %zmm0, %zmm0 ; 512-bit load
So putting them together, it looks like a net negative to me.
This should be solved now - we've gone from an i64 sextload + 2 x 256-bit loads to an i64 sextload + a 512-bit load.
Oh, I see. I missed the first 256-bit loads.
…PD domain moves (#122601)

For targets with free domain moves, or AVX512 support, allow the use of VPMOVSX/ZX extension loads to reduce the load sizes. I've limited this to extension to i32/i64 types as we're mostly interested in shuffle mask loading here, but we could include i16 types as well just as easily. Inspired by a regression on #122485
LGTM.
… the CONCAT is free

Minor followup to llvm#122485 - if the source operands were widened half-size subvectors, then attempt to concatenate the subvectors directly, and then adjust the shuffle mask so references to the second operand now refer to the upper half of the concat result.
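That description implies a simple index rebase, sketched below in plain C++ (the function name rebaseMaskForConcat is hypothetical; this is not the actual patch). Assuming N-element operands whose upper N/2 lanes are undef, mask values in [N, N + N/2) pointed at the low half of the second operand and must become [N/2, N) once the two half-width subvectors are concatenated into one N-element source; anything that addressed an undef upper half can become an undef (-1) sentinel.

```cpp
// Sketch only: rebase a VPERMV3 mask after replacing two widened half-size
// operands with a direct concat of their low halves. Assumes each original
// operand has NumElts lanes with only the low NumElts/2 lanes meaningful.
#include <vector>

static std::vector<int> rebaseMaskForConcat(const std::vector<int> &Mask,
                                            int NumElts) {
  int Half = NumElts / 2;
  std::vector<int> NewMask;
  NewMask.reserve(Mask.size());
  for (int M : Mask) {
    if (M >= 0 && M < Half)
      NewMask.push_back(M); // low half of operand 0 stays in place
    else if (M >= NumElts && M < NumElts + Half)
      NewMask.push_back(M - NumElts + Half); // operand 1 low half -> upper half of concat
    else
      NewMask.push_back(-1); // referenced an undef upper half (or was already undef)
  }
  return NewMask;
}
```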