
[X86] combineConcatVectorOps - concat per-lane v2f64/v4f64 shuffles into vXf64 vshufpd #143017

Merged
merged 6 commits into llvm:main from x86-concat-v2f64-shuffles on Jun 6, 2025

Conversation

RKSimon
Collaborator

@RKSimon RKSimon commented Jun 5, 2025

We can always concatenate v2f64/v4f64 per-lane shuffles into a single vshufpd instruction, assuming we can profitably concatenate at least one of its operands (or it's a unary shuffle).

I was really hoping to get this into combineX86ShufflesRecursively but it still can't handle concatenation as well as combineConcatVectorOps.
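
To make the per-lane fusion concrete, here is a minimal standalone C++ sketch (not LLVM code; the helper names and test values are made up for illustration). vshufpd applies the same 2-element selection independently to each 128-bit lane, with immediate bits [1:0] driving the low lane and bits [3:2] the high lane, so two independent v2f64 shuffles can be expressed as one v4f64 vshufpd over the concatenated operands:

// Minimal standalone sketch (not LLVM code): models how two independent
// 128-bit-lane f64 shuffles fold into one 256-bit VSHUFPD. All helper
// names and test values are illustrative assumptions.
#include <array>
#include <cassert>
#include <cstdio>

using V2 = std::array<double, 2>; // one 128-bit lane of f64
using V4 = std::array<double, 4>; // one 256-bit ymm of f64

// shufpd xmm semantics: dst[0] = a[imm&1], dst[1] = b[(imm>>1)&1].
static V2 shufpd128(V2 a, V2 b, unsigned imm) {
  return {a[imm & 1], b[(imm >> 1) & 1]};
}

// vshufpd ymm semantics: the same per-lane selection, with imm bits [1:0]
// controlling the low 128-bit lane and bits [3:2] the high lane.
static V4 vshufpd256(V4 a, V4 b, unsigned imm) {
  return {a[imm & 1], b[(imm >> 1) & 1],
          a[2 + ((imm >> 2) & 1)], b[2 + ((imm >> 3) & 1)]};
}

int main() {
  V2 lo_a{0, 1}, lo_b{10, 11};   // operands of the low-lane v2f64 shuffle
  V2 hi_a{20, 21}, hi_b{30, 31}; // operands of the high-lane v2f64 shuffle
  unsigned lo_imm = 0b01, hi_imm = 0b10; // two independent 2-bit masks

  // Concatenate the per-lane operands and fuse the two immediates.
  V4 cat_a{lo_a[0], lo_a[1], hi_a[0], hi_a[1]};
  V4 cat_b{lo_b[0], lo_b[1], hi_b[0], hi_b[1]};
  unsigned fused = (lo_imm & 3) | ((hi_imm & 3) << 2);

  V2 lo = shufpd128(lo_a, lo_b, lo_imm);
  V2 hi = shufpd128(hi_a, hi_b, hi_imm);
  V4 wide = vshufpd256(cat_a, cat_b, fused);

  assert(wide[0] == lo[0] && wide[1] == lo[1]); // low lane matches
  assert(wide[2] == hi[0] && wide[3] == hi[1]); // high lane matches
  std::puts("two v2f64 shuffles == one v4f64 vshufpd");
}

This lane-wise immediate packing is the same idea as the SHUFPDMask construction in the patch, which ORs bit 0 of each per-lane mask element into the 4-bit immediate.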

@llvmbot
Member

llvmbot commented Jun 5, 2025

@llvm/pr-subscribers-backend-x86

Author: Simon Pilgrim (RKSimon)

Changes

We can always concatenate v2f64 per-lane shuffles into a single vshufpd instruction, assuming we can profitably concatenate at least one of its operands

I was really hoping to get this into combineX86ShufflesRecursively but it still can't handle concatenation as well as combineConcatVectorOps.


Patch is 28.70 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/143017.diff

2 Files Affected:

  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+55-7)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i64-stride-2.ll (+160-222)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 760119bc62604..f3cc7d57fcfba 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -58492,14 +58492,23 @@ static SDValue combineConcatVectorOps(const SDLoc &DL, MVT VT,
       const APInt &SrcIdx0 = Src0.getConstantOperandAPInt(1);
       const APInt &SrcIdx1 = Src1.getConstantOperandAPInt(1);
       // concat(extract_subvector(v0), extract_subvector(v1)) -> vperm2x128.
-      // Only concat of subvector high halves which vperm2x128 is best at.
+      // Only concat of subvector high halves which vperm2x128 is best at or if
+      // it should fold into a subvector broadcast.
       if (VT.is256BitVector() && SrcVT0.is256BitVector() &&
-          SrcVT1.is256BitVector() && SrcIdx0 == (NumSrcElts0 / 2) &&
-          SrcIdx1 == (NumSrcElts1 / 2)) {
-        return DAG.getNode(X86ISD::VPERM2X128, DL, VT,
-                           DAG.getBitcast(VT, Src0.getOperand(0)),
-                           DAG.getBitcast(VT, Src1.getOperand(0)),
-                           DAG.getTargetConstant(0x31, DL, MVT::i8));
+          SrcVT1.is256BitVector()) {
+        assert((SrcIdx0 == 0 || SrcIdx0 == (NumSrcElts0 / 2)) &&
+               (SrcIdx1 == 0 || SrcIdx1 == (NumSrcElts1 / 2)) &&
+               "Bad subvector index");
+        if ((SrcIdx0 == (NumSrcElts0 / 2) && SrcIdx1 == (NumSrcElts1 / 2)) ||
+            (IsSplat && ISD::isNormalLoad(Src0.getOperand(0).getNode()))) {
+          unsigned Index = 0;
+          Index |= SrcIdx0 == 0 ? 0x00 : 0x01;
+          Index |= SrcIdx1 == 0 ? 0x20 : 0x30;
+          return DAG.getNode(X86ISD::VPERM2X128, DL, VT,
+                             DAG.getBitcast(VT, Src0.getOperand(0)),
+                             DAG.getBitcast(VT, Src1.getOperand(0)),
+                             DAG.getTargetConstant(Index, DL, MVT::i8));
+        }
       }
       // Widen extract_subvector
       // concat(extract_subvector(x,lo), extract_subvector(x,hi))
@@ -59312,6 +59321,45 @@ static SDValue combineConcatVectorOps(const SDLoc &DL, MVT VT,
     return DAG.getBitcast(VT, Res);
   }
 
+  // We can always convert per-lane vXf64 shuffles into VSHUFPD.
+  if (!IsSplat && NumOps == 2 && VT == MVT::v4f64 &&
+      all_of(Ops, [](SDValue Op) {
+        return Op.hasOneUse() && (Op.getOpcode() == X86ISD::MOVDDUP ||
+                                  Op.getOpcode() == X86ISD::SHUFP ||
+                                  Op.getOpcode() == X86ISD::VPERMILPI ||
+                                  Op.getOpcode() == X86ISD::BLENDI ||
+                                  Op.getOpcode() == X86ISD::UNPCKL ||
+                                  Op.getOpcode() == X86ISD::UNPCKH);
+      })) {
+    SmallVector<SDValue, 2> SrcOps0, SrcOps1;
+    SmallVector<int, 8> SrcMask0, SrcMask1;
+    if (getTargetShuffleMask(Ops[0], /*AllowSentinelZero=*/false, SrcOps0,
+                             SrcMask0) &&
+        getTargetShuffleMask(Ops[1], /*AllowSentinelZero=*/false, SrcOps1,
+                             SrcMask1)) {
+      assert(SrcMask0.size() == 2 && SrcMask1.size() == 2 && "Bad shuffles");
+      SDValue LHS[] = {SrcOps0[SrcMask0[0] / 2], SrcOps1[SrcMask1[0] / 2]};
+      SDValue RHS[] = {SrcOps0[SrcMask0[1] / 2], SrcOps1[SrcMask1[1] / 2]};
+      SDValue Concat0 =
+          combineConcatVectorOps(DL, VT, LHS, DAG, Subtarget, Depth + 1);
+      SDValue Concat1 =
+          combineConcatVectorOps(DL, VT, RHS, DAG, Subtarget, Depth + 1);
+      if (Concat0 || Concat1) {
+        unsigned SHUFPDMask = 0;
+        SHUFPDMask |= (SrcMask0[0] & 1) << 0;
+        SHUFPDMask |= (SrcMask0[1] & 1) << 1;
+        SHUFPDMask |= (SrcMask1[0] & 1) << 2;
+        SHUFPDMask |= (SrcMask1[1] & 1) << 3;
+        Concat0 =
+            Concat0 ? Concat0 : DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, LHS);
+        Concat1 =
+            Concat1 ? Concat1 : DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, RHS);
+        return DAG.getNode(X86ISD::SHUFP, DL, VT, Concat0, Concat1,
+                           DAG.getTargetConstant(SHUFPDMask, DL, MVT::i8));
+      }
+    }
+  }
+
   return SDValue();
 }
 
diff --git a/llvm/test/CodeGen/X86/vector-interleaved-store-i64-stride-2.ll b/llvm/test/CodeGen/X86/vector-interleaved-store-i64-stride-2.ll
index 8d68f88249a9e..3e9fed78b56b4 100644
--- a/llvm/test/CodeGen/X86/vector-interleaved-store-i64-stride-2.ll
+++ b/llvm/test/CodeGen/X86/vector-interleaved-store-i64-stride-2.ll
@@ -163,16 +163,14 @@ define void @store_i64_stride2_vf4(ptr %in.vecptr0, ptr %in.vecptr1, ptr %out.ve
 ;
 ; AVX-LABEL: store_i64_stride2_vf4:
 ; AVX:       # %bb.0:
-; AVX-NEXT:    vmovaps (%rsi), %xmm0
-; AVX-NEXT:    vmovaps (%rdi), %xmm1
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm2 = xmm1[1],xmm0[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm0 = xmm1[0],xmm0[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm0 = mem[0,1,0,1]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm1 = mem[0,1,0,1]
+; AVX-NEXT:    vshufpd {{.*#+}} ymm0 = ymm1[0],ymm0[0],ymm1[3],ymm0[3]
 ; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm1 = mem[0,1,0,1]
 ; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm2 = mem[0,1,0,1]
 ; AVX-NEXT:    vshufpd {{.*#+}} ymm1 = ymm2[0],ymm1[0],ymm2[3],ymm1[3]
 ; AVX-NEXT:    vmovapd %ymm1, 32(%rdx)
-; AVX-NEXT:    vmovaps %ymm0, (%rdx)
+; AVX-NEXT:    vmovapd %ymm0, (%rdx)
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
 ;
@@ -343,16 +341,12 @@ define void @store_i64_stride2_vf8(ptr %in.vecptr0, ptr %in.vecptr1, ptr %out.ve
 ;
 ; AVX-LABEL: store_i64_stride2_vf8:
 ; AVX:       # %bb.0:
-; AVX-NEXT:    vmovaps (%rsi), %xmm0
-; AVX-NEXT:    vmovaps 32(%rsi), %xmm1
-; AVX-NEXT:    vmovaps (%rdi), %xmm2
-; AVX-NEXT:    vmovaps 32(%rdi), %xmm3
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm4 = xmm2[1],xmm0[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm0 = xmm2[0],xmm0[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm4, %ymm0, %ymm0
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm2 = xmm3[1],xmm1[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm1 = xmm3[0],xmm1[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm2, %ymm1, %ymm1
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm0 = mem[0,1,0,1]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm1 = mem[0,1,0,1]
+; AVX-NEXT:    vshufpd {{.*#+}} ymm0 = ymm1[0],ymm0[0],ymm1[3],ymm0[3]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm1 = mem[0,1,0,1]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm2 = mem[0,1,0,1]
+; AVX-NEXT:    vshufpd {{.*#+}} ymm1 = ymm2[0],ymm1[0],ymm2[3],ymm1[3]
 ; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm2 = mem[0,1,0,1]
 ; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm3 = mem[0,1,0,1]
 ; AVX-NEXT:    vshufpd {{.*#+}} ymm2 = ymm3[0],ymm2[0],ymm3[3],ymm2[3]
@@ -360,9 +354,9 @@ define void @store_i64_stride2_vf8(ptr %in.vecptr0, ptr %in.vecptr1, ptr %out.ve
 ; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm4 = mem[0,1,0,1]
 ; AVX-NEXT:    vshufpd {{.*#+}} ymm3 = ymm4[0],ymm3[0],ymm4[3],ymm3[3]
 ; AVX-NEXT:    vmovapd %ymm3, 96(%rdx)
-; AVX-NEXT:    vmovapd %ymm2, 32(%rdx)
-; AVX-NEXT:    vmovaps %ymm1, 64(%rdx)
-; AVX-NEXT:    vmovaps %ymm0, (%rdx)
+; AVX-NEXT:    vmovapd %ymm2, 64(%rdx)
+; AVX-NEXT:    vmovapd %ymm1, (%rdx)
+; AVX-NEXT:    vmovapd %ymm0, 32(%rdx)
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
 ;
@@ -617,26 +611,18 @@ define void @store_i64_stride2_vf16(ptr %in.vecptr0, ptr %in.vecptr1, ptr %out.v
 ;
 ; AVX-LABEL: store_i64_stride2_vf16:
 ; AVX:       # %bb.0:
-; AVX-NEXT:    vmovaps (%rsi), %xmm0
-; AVX-NEXT:    vmovaps 32(%rsi), %xmm1
-; AVX-NEXT:    vmovaps 64(%rsi), %xmm2
-; AVX-NEXT:    vmovaps 96(%rsi), %xmm3
-; AVX-NEXT:    vmovaps (%rdi), %xmm4
-; AVX-NEXT:    vmovaps 32(%rdi), %xmm5
-; AVX-NEXT:    vmovaps 64(%rdi), %xmm6
-; AVX-NEXT:    vmovaps 96(%rdi), %xmm7
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm8 = xmm7[1],xmm3[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm3 = xmm7[0],xmm3[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm8, %ymm3, %ymm3
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm7 = xmm6[1],xmm2[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm2 = xmm6[0],xmm2[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm7, %ymm2, %ymm2
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm6 = xmm4[1],xmm0[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm0 = xmm4[0],xmm0[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm6, %ymm0, %ymm0
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm4 = xmm5[1],xmm1[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm1 = xmm5[0],xmm1[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm4, %ymm1, %ymm1
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm0 = mem[0,1,0,1]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm1 = mem[0,1,0,1]
+; AVX-NEXT:    vshufpd {{.*#+}} ymm0 = ymm1[0],ymm0[0],ymm1[3],ymm0[3]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm1 = mem[0,1,0,1]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm2 = mem[0,1,0,1]
+; AVX-NEXT:    vshufpd {{.*#+}} ymm1 = ymm2[0],ymm1[0],ymm2[3],ymm1[3]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm2 = mem[0,1,0,1]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm3 = mem[0,1,0,1]
+; AVX-NEXT:    vshufpd {{.*#+}} ymm2 = ymm3[0],ymm2[0],ymm3[3],ymm2[3]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm3 = mem[0,1,0,1]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm4 = mem[0,1,0,1]
+; AVX-NEXT:    vshufpd {{.*#+}} ymm3 = ymm4[0],ymm3[0],ymm4[3],ymm3[3]
 ; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm4 = mem[0,1,0,1]
 ; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm5 = mem[0,1,0,1]
 ; AVX-NEXT:    vshufpd {{.*#+}} ymm4 = ymm5[0],ymm4[0],ymm5[3],ymm4[3]
@@ -651,12 +637,12 @@ define void @store_i64_stride2_vf16(ptr %in.vecptr0, ptr %in.vecptr1, ptr %out.v
 ; AVX-NEXT:    vshufpd {{.*#+}} ymm7 = ymm8[0],ymm7[0],ymm8[3],ymm7[3]
 ; AVX-NEXT:    vmovapd %ymm7, 32(%rdx)
 ; AVX-NEXT:    vmovapd %ymm6, 96(%rdx)
-; AVX-NEXT:    vmovapd %ymm5, 160(%rdx)
-; AVX-NEXT:    vmovapd %ymm4, 224(%rdx)
-; AVX-NEXT:    vmovaps %ymm1, 64(%rdx)
-; AVX-NEXT:    vmovaps %ymm0, (%rdx)
-; AVX-NEXT:    vmovaps %ymm2, 128(%rdx)
-; AVX-NEXT:    vmovaps %ymm3, 192(%rdx)
+; AVX-NEXT:    vmovapd %ymm5, 64(%rdx)
+; AVX-NEXT:    vmovapd %ymm4, (%rdx)
+; AVX-NEXT:    vmovapd %ymm3, 160(%rdx)
+; AVX-NEXT:    vmovapd %ymm2, 128(%rdx)
+; AVX-NEXT:    vmovapd %ymm1, 192(%rdx)
+; AVX-NEXT:    vmovapd %ymm0, 224(%rdx)
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
 ;
@@ -1117,47 +1103,31 @@ define void @store_i64_stride2_vf32(ptr %in.vecptr0, ptr %in.vecptr1, ptr %out.v
 ;
 ; AVX-LABEL: store_i64_stride2_vf32:
 ; AVX:       # %bb.0:
-; AVX-NEXT:    vmovaps 224(%rsi), %xmm0
-; AVX-NEXT:    vmovaps 224(%rdi), %xmm1
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm2 = xmm1[1],xmm0[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm0 = xmm1[0],xmm0[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
-; AVX-NEXT:    vmovups %ymm0, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill
-; AVX-NEXT:    vmovaps 128(%rsi), %xmm1
-; AVX-NEXT:    vmovaps 128(%rdi), %xmm2
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm3 = xmm2[1],xmm1[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm1 = xmm2[0],xmm1[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm3, %ymm1, %ymm1
-; AVX-NEXT:    vmovaps (%rsi), %xmm2
-; AVX-NEXT:    vmovaps 32(%rsi), %xmm3
-; AVX-NEXT:    vmovaps 64(%rsi), %xmm4
-; AVX-NEXT:    vmovaps 96(%rsi), %xmm5
-; AVX-NEXT:    vmovaps (%rdi), %xmm6
-; AVX-NEXT:    vmovaps 32(%rdi), %xmm7
-; AVX-NEXT:    vmovaps 64(%rdi), %xmm8
-; AVX-NEXT:    vmovaps 96(%rdi), %xmm9
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm10 = xmm6[1],xmm2[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm2 = xmm6[0],xmm2[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm10, %ymm2, %ymm2
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm6 = xmm7[1],xmm3[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm3 = xmm7[0],xmm3[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm6, %ymm3, %ymm3
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm6 = xmm8[1],xmm4[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm4 = xmm8[0],xmm4[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm6, %ymm4, %ymm4
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm6 = xmm9[1],xmm5[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm5 = xmm9[0],xmm5[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm6, %ymm5, %ymm5
-; AVX-NEXT:    vmovaps 160(%rsi), %xmm6
-; AVX-NEXT:    vmovaps 160(%rdi), %xmm7
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm8 = xmm7[1],xmm6[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm6 = xmm7[0],xmm6[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm8, %ymm6, %ymm6
-; AVX-NEXT:    vmovaps 192(%rsi), %xmm7
-; AVX-NEXT:    vmovaps 192(%rdi), %xmm8
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm9 = xmm8[1],xmm7[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm7 = xmm8[0],xmm7[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm9, %ymm7, %ymm7
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm0 = mem[0,1,0,1]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm1 = mem[0,1,0,1]
+; AVX-NEXT:    vshufpd {{.*#+}} ymm0 = ymm1[0],ymm0[0],ymm1[3],ymm0[3]
+; AVX-NEXT:    vmovupd %ymm0, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm1 = mem[0,1,0,1]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm2 = mem[0,1,0,1]
+; AVX-NEXT:    vshufpd {{.*#+}} ymm1 = ymm2[0],ymm1[0],ymm2[3],ymm1[3]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm2 = mem[0,1,0,1]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm3 = mem[0,1,0,1]
+; AVX-NEXT:    vshufpd {{.*#+}} ymm2 = ymm3[0],ymm2[0],ymm3[3],ymm2[3]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm3 = mem[0,1,0,1]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm4 = mem[0,1,0,1]
+; AVX-NEXT:    vshufpd {{.*#+}} ymm3 = ymm4[0],ymm3[0],ymm4[3],ymm3[3]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm4 = mem[0,1,0,1]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm5 = mem[0,1,0,1]
+; AVX-NEXT:    vshufpd {{.*#+}} ymm4 = ymm5[0],ymm4[0],ymm5[3],ymm4[3]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm5 = mem[0,1,0,1]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm6 = mem[0,1,0,1]
+; AVX-NEXT:    vshufpd {{.*#+}} ymm5 = ymm6[0],ymm5[0],ymm6[3],ymm5[3]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm6 = mem[0,1,0,1]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm7 = mem[0,1,0,1]
+; AVX-NEXT:    vshufpd {{.*#+}} ymm6 = ymm7[0],ymm6[0],ymm7[3],ymm6[3]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm7 = mem[0,1,0,1]
+; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm8 = mem[0,1,0,1]
+; AVX-NEXT:    vshufpd {{.*#+}} ymm7 = ymm8[0],ymm7[0],ymm8[3],ymm7[3]
 ; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm8 = mem[0,1,0,1]
 ; AVX-NEXT:    vbroadcastf128 {{.*#+}} ymm9 = mem[0,1,0,1]
 ; AVX-NEXT:    vshufpd {{.*#+}} ymm8 = ymm9[0],ymm8[0],ymm9[3],ymm8[3]
@@ -1188,17 +1158,17 @@ define void @store_i64_stride2_vf32(ptr %in.vecptr0, ptr %in.vecptr1, ptr %out.v
 ; AVX-NEXT:    vmovapd %ymm12, 32(%rdx)
 ; AVX-NEXT:    vmovapd %ymm11, 96(%rdx)
 ; AVX-NEXT:    vmovapd %ymm10, 160(%rdx)
-; AVX-NEXT:    vmovapd %ymm9, 288(%rdx)
-; AVX-NEXT:    vmovapd %ymm8, 480(%rdx)
-; AVX-NEXT:    vmovaps %ymm7, 384(%rdx)
-; AVX-NEXT:    vmovaps %ymm6, 320(%rdx)
-; AVX-NEXT:    vmovaps %ymm5, 192(%rdx)
-; AVX-NEXT:    vmovaps %ymm4, 128(%rdx)
-; AVX-NEXT:    vmovaps %ymm3, 64(%rdx)
-; AVX-NEXT:    vmovaps %ymm2, (%rdx)
-; AVX-NEXT:    vmovaps %ymm1, 256(%rdx)
+; AVX-NEXT:    vmovapd %ymm9, 384(%rdx)
+; AVX-NEXT:    vmovapd %ymm8, 320(%rdx)
+; AVX-NEXT:    vmovapd %ymm7, 192(%rdx)
+; AVX-NEXT:    vmovapd %ymm6, 128(%rdx)
+; AVX-NEXT:    vmovapd %ymm5, 64(%rdx)
+; AVX-NEXT:    vmovapd %ymm4, (%rdx)
+; AVX-NEXT:    vmovapd %ymm3, 288(%rdx)
+; AVX-NEXT:    vmovapd %ymm2, 256(%rdx)
+; AVX-NEXT:    vmovapd %ymm1, 448(%rdx)
 ; AVX-NEXT:    vmovups {{[-0-9]+}}(%r{{[sb]}}p), %ymm0 # 32-byte Reload
-; AVX-NEXT:    vmovaps %ymm0, 448(%rdx)
+; AVX-NEXT:    vmovaps %ymm0, 480(%rdx)
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
 ;
@@ -2080,102 +2050,70 @@ define void @store_i64_stride2_vf64(ptr %in.vecptr0, ptr %in.vecptr1, ptr %out.v
 ; AVX-LABEL: store_i64_stride2_vf64:
 ; AVX:       # %bb.0:
 ; AVX-NEXT:    subq $424, %rsp # imm = 0x1A8
-; AVX-NEXT:    vmovaps (%rsi), %xmm0
-; AVX-NEXT:    vmovaps 32(%rsi), %xmm1
-; AVX-NEXT:    vmovaps 64(%rsi), %xmm2
-; AVX-NEXT:    vmovaps 96(%rsi), %xmm3
-; AVX-NEXT:    vmovaps (%rdi), %xmm4
-; AVX-NEXT:    vmovaps 32(%rdi), %xmm5
-; AVX-NEXT:    vmovaps 64(%rdi), %xmm6
-; AVX-NEXT:    vmovaps 96(%rdi), %xmm7
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm8 = xmm4[1],xmm0[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm0 = xmm4[0],xmm0[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm8, %ymm0, %ymm0
-; AVX-NEXT:    vmovups %ymm0, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm0 = xmm5[1],xmm1[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm1 = xmm5[0],xmm1[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm0, %ymm1, %ymm0
-; AVX-NEXT:    vmovups %ymm0, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm0 = xmm6[1],xmm2[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm1 = xmm6[0],xmm2[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm0, %ymm1, %ymm0
-; AVX-NEXT:    vmovups %ymm0, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm0 = xmm7[1],xmm3[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm1 = xmm7[0],xmm3[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm0, %ymm1, %ymm0
-; AVX-NEXT:    vmovups %ymm0, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill
-; AVX-NEXT:    vmovaps 128(%rsi), %xmm0
-; AVX-NEXT:    vmovaps 128(%rdi), %xmm1
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm2 = xmm1[1],xmm0[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm0 = xmm1[0],xmm0[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
-; AVX-NEXT:    vmovups %ymm0, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill
-; AVX-NEXT:    vmovaps 160(%rsi), %xmm0
-; AVX-NEXT:    vmovaps 160(%rdi), %xmm1
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm2 = xmm1[1],xmm0[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm0 = xmm1[0],xmm0[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
-; AVX-NEXT:    vmovups %ymm0, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill
-; AVX-NEXT:    vmovaps 192(%rsi), %xmm0
-; AVX-NEXT:    vmovaps 192(%rdi), %xmm1
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm2 = xmm1[1],xmm0[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm0 = xmm1[0],xmm0[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
-; AVX-NEXT:    vmovups %ymm0, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill
-; AVX-NEXT:    vmovaps 224(%rsi), %xmm0
-; AVX-NEXT:    vmovaps 224(%rdi), %xmm1
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm2 = xmm1[1],xmm0[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm0 = xmm1[0],xmm0[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
-; AVX-NEXT:    vmovups %ymm0, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill
-; AVX-NEXT:    vmovaps 256(%rsi), %xmm0
-; AVX-NEXT:    vmovaps 256(%rdi), %xmm1
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm2 = xmm1[1],xmm0[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm0 = xmm1[0],xmm0[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
-; AVX-NEXT:    vmovups %ymm0, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill
-; AVX-NEXT:    vmovaps 288(%rsi), %xmm0
-; AVX-NEXT:    vmovaps 288(%rdi), %xmm1
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm2 = xmm1[1],xmm0[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm0 = xmm1[0],xmm0[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
-; AVX-NEXT:    vmovups %ymm0, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill
-; AVX-NEXT:    vmovaps 320(%rsi), %xmm0
-; AVX-NEXT:    vmovaps 320(%rdi), %xmm1
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm2 = xmm1[1],xmm0[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm0 = xmm1[0],xmm0[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
-; AVX-NEXT:    vmovups %ymm0, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill
-; AVX-NEXT:    vmovaps 352(%rsi), %xmm0
-; AVX-NEXT:    vmovaps 352(%rdi), %xmm1
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm2 = xmm1[1],xmm0[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm0 = xmm1[0],xmm0[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
-; AVX-NEXT:    vmovups %ymm0, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill
-; AVX-NEXT:    vmovaps 384(%rsi), %xmm0
-; AVX-NEXT:    vmovaps 384(%rdi), %xmm1
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm2 = xmm1[1],xmm0[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm0 = xmm1[0],xmm0[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
-; AVX-NEXT:    vmovups %ymm0, (%rsp) # 32-byte Spill
-; AVX-NEXT:    vmovaps 416(%rsi), %xmm0
-; AVX-NEXT:    vmovaps 416(%rdi), %xmm1
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm2 = xmm1[1],xmm0[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm0 = xmm1[0],xmm0[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
-; AVX-NEXT:    vmovups %ymm0, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill
-; AVX-NEXT:    vmovaps 448(%rsi), %xmm0
-; AVX-NEXT:    vmovaps 448(%rdi), %xmm1
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm2 = xmm1[1],xmm0[1]
-; AVX-NEXT:    vmovlhps {{.*#+}} xmm0 = xmm1[0],xmm0[0]
-; AVX-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
-; AVX-NEXT:    vmovups %ymm0, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill
-; AVX-NEXT:    vmovaps 480(%rsi), %xmm0
-; AVX-NEXT:    vmovaps 480(%rdi), %xmm1
-; AVX-NEXT:    vunpckhpd {{.*#+}} xmm2 = x...
[truncated]

…vshufpd

We can always concatenate vXf64 per-lane shuffles into a single vshufpd instruction, assuming we can profitably concatenate at least one of its operands

I was really hoping to get this into combineX86ShufflesRecursively but it still can't handle concatenation as well as combineConcatVectorOps yet.
@RKSimon RKSimon force-pushed the x86-concat-v2f64-shuffles branch from 17eb1d3 to 9e9fe5e on June 6, 2025 09:01
@RKSimon RKSimon changed the title from "[X86] combineConcatVectorOps - concat mixed v2f64 shuffles into 4f64 vshufpd" to "[X86] combineConcatVectorOps - concat per-lane v2f64/v4f64 shuffles into vXf64 vshufpd" on Jun 6, 2025
@RKSimon RKSimon merged commit 399865c into llvm:main Jun 6, 2025
7 checks passed
@RKSimon RKSimon deleted the x86-concat-v2f64-shuffles branch June 6, 2025 15:41
@vzakhari
Contributor

Hi @RKSimon, this commit is my only suspect for the accuracy failure in 454.calculix (with flang -Ofast -march=native) on zen4. The regression happened between 306148b and 2c0a226. Reverting this commit fails due to conflicts, so I will try to reproduce the issue by rolling back to these older commits.

By any chance, was there any issue reported about this commit anywhere else?

Thank you.

@RKSimon
Collaborator Author

RKSimon commented Jun 10, 2025

Can you provide an asm diff?

@vzakhari
Contributor

Can you provide an asm diff?

Sure, I will. I just need to verify that the regression happens exactly after this commit. On the above range of commits, I do see changes in the shuffling, but I also see some reordering of FP math operations (which might be caused by some other pass kicking in after your change). I suppose -fassociative-math allows that reordering, but I will double check.

@vzakhari
Contributor

I created #143606. Let's move the discussion there.

rorth pushed a commit to rorth/llvm-project that referenced this pull request Jun 11, 2025
…nto vXf64 vshufpd (llvm#143017)

We can always concatenate v2f64/v4f64 per-lane shuffles into a single vshufpd instruction, assuming we can profitably concatenate at least one of its operands (or it's a unary shuffle).

I was really hoping to get this into combineX86ShufflesRecursively but it still can't handle concatenation/length changing as well as combineConcatVectorOps.
DhruvSrivastavaX pushed a commit to DhruvSrivastavaX/lldb-for-aix that referenced this pull request Jun 12, 2025
…nto vXf64 vshufpd (llvm#143017)

We can always concatenate v2f64/v4f64 per-lane shuffles into a single vshufpd instruction, assuming we can profitably concatenate at least one of its operands (or it's a unary shuffle).

I was really hoping to get this into combineX86ShufflesRecursively but it still can't handle concatenation/length changing as well as combineConcatVectorOps.
tomtor pushed a commit to tomtor/llvm-project that referenced this pull request Jun 14, 2025
…nto vXf64 vshufpd (llvm#143017)

We can always concatenate v2f64/v4f64 per-lane shuffles into a single vshufpd instruction, assuming we can profitably concatenate at least one of its operands (or it's a unary shuffle).

I was really hoping to get this into combineX86ShufflesRecursively but it still can't handle concatenation/length changing as well as combineConcatVectorOps.
@bgra8
Contributor

bgra8 commented Jun 20, 2025

@RKSimon we're seeing many significant increases in memory usage during compilation of certain translation units inside Google. In several cases the memory consumption goes above 12GB (it might go higher; we just stopped the compilation at this limit). The compilation times also increase significantly for such cases.

a6ace28 does not fix the issue.

We're working on a reproducer.

@alexfh
Contributor

alexfh commented Jun 20, 2025

The reduced test case is here: https://gcc.godbolt.org/z/e8f6bxPfT. Looks like an infinite loop, given that the size of the input is quite small.

And here is the profile of a few seconds of the Clang execution:

-   98.39%     0.00%  clang-checked  clang-checked     [.] llvm::MachineFunctionPass::runOnFunction(llvm::Function&)                                                                                                                                ◆
     llvm::MachineFunctionPass::runOnFunction(llvm::Function&)                                                                                                                                                                                      ▒
     llvm::SelectionDAGISelLegacy::runOnMachineFunction(llvm::MachineFunction&)                                                                                                                                                                     ▒
     llvm::SelectionDAGISel::runOnMachineFunction(llvm::MachineFunction&)                                                                                                                                                                           ▒
   - llvm::SelectionDAGISel::SelectAllBasicBlocks(llvm::Function const&)                                                                                                                                                                            ▒
      - 98.39% llvm::SelectionDAGISel::CodeGenAndEmitDAG()                                                                                                                                                                                          ▒
         - 98.13% llvm::SelectionDAG::Combine(llvm::CombineLevel, llvm::BatchAAResults*, llvm::CodeGenOptLevel)                                                                                                                                     ▒
            - 71.13% (anonymous namespace)::DAGCombiner::combine(llvm::SDNode*)                                                                                                                                                                     ▒
               - 45.17% llvm::X86TargetLowering::PerformDAGCombine(llvm::SDNode*, llvm::TargetLowering::DAGCombinerInfo&) const                                                                                                                     ▒
                  - 42.34% combineINSERT_SUBVECTOR(llvm::SDNode*, llvm::SelectionDAG&, llvm::TargetLowering::DAGCombinerInfo&, llvm::X86Subtarget const&)                                                                                           ▒
                     - 23.12% combineConcatVectorOps(llvm::SDLoc const&, llvm::MVT, llvm::ArrayRef<llvm::SDValue>, llvm::SelectionDAG&, llvm::X86Subtarget const&, unsigned int)                                                                    ▒
                        - 21.60% EltsFromConsecutiveLoads(llvm::EVT, llvm::ArrayRef<llvm::SDValue>, llvm::SDLoc const&, llvm::SelectionDAG&, llvm::X86Subtarget const&, bool)                                                                       ▒
                           - 16.38% EltsFromConsecutiveLoads(llvm::EVT, llvm::ArrayRef<llvm::SDValue>, llvm::SDLoc const&, llvm::SelectionDAG&, llvm::X86Subtarget const&, bool)::$_1::operator()(llvm::EVT, llvm::LoadSDNode*) const               ▒
                              - 8.97% llvm::SelectionDAG::makeEquivalentMemoryOrdering(llvm::SDValue, llvm::SDValue)                                                                                                                                ▒
                                 + 3.24% llvm::SelectionDAG::getNode(unsigned int, llvm::SDLoc const&, llvm::EVT, llvm::SDValue, llvm::SDValue, llvm::SDNodeFlags)                                                                                  ▒
                                 + 3.20% llvm::SelectionDAG::ReplaceAllUsesOfValueWith(llvm::SDValue, llvm::SDValue)                                                                                                                                ▒
                                 + 1.90% llvm::SelectionDAG::UpdateNodeOperands(llvm::SDNode*, llvm::SDValue, llvm::SDValue)                                                                                                                        ▒
                              + 6.70% llvm::SelectionDAG::getLoad(llvm::EVT, llvm::SDLoc const&, llvm::SDValue, llvm::SDValue, llvm::MachinePointerInfo, llvm::MaybeAlign, llvm::MachineMemOperand::Flags, llvm::AAMDNodes const&, llvm::MDNode cons▒
                           + 2.69% llvm::MachinePointerInfo::isDereferenceable(unsigned int, llvm::LLVMContext&, llvm::DataLayout const&) const                                                                                                     ▒
                     - 12.24% concatSubVectors(llvm::SDValue, llvm::SDValue, llvm::SelectionDAG&, llvm::SDLoc const&)                                                                                                                               ▒
                        - 10.83% insertSubVector(llvm::SDValue, llvm::SDValue, unsigned int, llvm::SelectionDAG&, llvm::SDLoc const&, unsigned int)                                                                                                 ▒
                           + 5.45% llvm::SelectionDAG::getNode(unsigned int, llvm::SDLoc const&, llvm::EVT, llvm::SDValue, llvm::SDValue, llvm::SDValue, llvm::SDNodeFlags)                                                                         ▒
                           + 4.06% llvm::SelectionDAG::getVectorIdxConstant(unsigned long, llvm::SDLoc const&, bool)                                                                                                                                ▒
                          0.88% llvm::SelectionDAG::getNode(unsigned int, llvm::SDLoc const&, llvm::EVT)                                                                                                                                            ▒
                     + 2.42% collectConcatOps(llvm::SDNode*, llvm::SmallVectorImpl<llvm::SDValue>&, llvm::SelectionDAG&)                                                                                                                            ▒
                     + 1.25% llvm::SelectionDAG::areNonVolatileConsecutiveLoads(llvm::LoadSDNode*, llvm::LoadSDNode*, unsigned int, int) const                                                                                                      ▒
                       1.21% llvm::MVT::getVectorVT(llvm::MVT, unsigned int)                                                                                                                                                                        ▒
                    1.71% combineLoad(llvm::SDNode*, llvm::SelectionDAG&, llvm::TargetLowering::DAGCombinerInfo&, llvm::X86Subtarget const&)                                                                                                        ▒
               - 15.12% (anonymous namespace)::DAGCombiner::visitINSERT_SUBVECTOR(llvm::SDNode*)                                                                                                                                                    ▒
                  - 14.54% (anonymous namespace)::DAGCombiner::SimplifyDemandedVectorElts(llvm::SDValue)                                                                                                                                            ▒
                     - 13.99% (anonymous namespace)::DAGCombiner::SimplifyDemandedVectorElts(llvm::SDValue, llvm::APInt const&, bool)                                                                                                               ▒
                        - 13.30% llvm::TargetLowering::SimplifyDemandedVectorElts(llvm::SDValue, llvm::APInt const&, llvm::APInt&, llvm::APInt&, llvm::TargetLowering::TargetLoweringOpt&, unsigned int, bool) const                                ▒
                           - 8.47% llvm::TargetLowering::SimplifyDemandedVectorElts(llvm::SDValue, llvm::APInt const&, llvm::APInt&, llvm::APInt&, llvm::TargetLowering::TargetLoweringOpt&, unsigned int, bool) const                              ▒
                              - 4.76% llvm::TargetLowering::SimplifyDemandedBits(llvm::SDValue, llvm::APInt const&, llvm::APInt const&, llvm::KnownBits&, llvm::TargetLowering::TargetLoweringOpt&, unsigned int, bool) const                       ▒
                                   1.09% llvm::SelectionDAG::computeKnownBits(llvm::SDValue, llvm::APInt const&, unsigned int) const                                                                                                                ▒
                           + 1.63% llvm::TargetLowering::SimplifyMultipleUseDemandedVectorElts(llvm::SDValue, llvm::APInt const&, llvm::SelectionDAG&, unsigned int) const                                                                          ▒
               - 7.30% (anonymous namespace)::DAGCombiner::visitLOAD(llvm::SDNode*)                                                                                                                                                                 ▒
                  + 1.69% llvm::SelectionDAG::ReplaceAllUsesOfValueWith(llvm::SDValue, llvm::SDValue)                                                                                                                                               ▒
                    1.09% (anonymous namespace)::DAGCombiner::deleteAndRecombine(llvm::SDNode*)                                                                                                                                                     ▒
                  + 0.95% llvm::SelectionDAG::InferPtrAlign(llvm::SDValue) const                                                                                                                                                                    ▒
                    0.73% (anonymous namespace)::DAGCombiner::AddToWorklist(llvm::SDNode*, bool, bool)                                                                                                                                              ▒
                  + 0.66% (anonymous namespace)::DAGCombiner::FindBetterChain(llvm::SDNode*, llvm::SDValue)                                                                                                                                         ▒
                 1.61% (anonymous namespace)::DAGCombiner::visitTokenFactor(llvm::SDNode*)                                                                                                                                                          ▒
            - 8.52% llvm::SelectionDAG::LegalizeOp(llvm::SDNode*, llvm::SmallSetVector<llvm::SDNode*, 16u>&)                                                                                                                                        ▒
               - 7.80% (anonymous namespace)::SelectionDAGLegalize::LegalizeOp(llvm::SDNode*)                                                                                                                                                       ▒
                  + 3.00% llvm::TargetLoweringBase::allowsMemoryAccessForAlignment(llvm::LLVMContext&, llvm::DataLayout const&, llvm::EVT, llvm::MachineMemOperand const&, unsigned int*) const                                                     ▒
                    1.20% llvm::TargetLoweringBase::getTypeConversion(llvm::LLVMContext&, llvm::EVT) const                                                                                                                                          ▒
            - 5.76% (anonymous namespace)::DAGCombiner::recursivelyDeleteUnusedNodes(llvm::SDNode*)                                                                                                                                                 ▒
                 0.97% llvm::SelectionDAG::DeallocateNode(llvm::SDNode*)                                                                                                                                                                            ▒
                 0.96% llvm::SetVector<llvm::SDNode*, llvm::SmallVector<llvm::SDNode*, 16u>, llvm::DenseSet<llvm::SDNode*, llvm::DenseMapInfo<llvm::SDNode*, void> >, 16u>::insert(llvm::SDNode* const&)                                            ▒
                 0.89% llvm::SelectionDAG::DeleteNode(llvm::SDNode*)                                                                                                                                                                                ▒
                 0.61% llvm::SetVector<llvm::SDNode*, llvm::SmallVector<llvm::SDNode*, 32u>, llvm::DenseSet<llvm::SDNode*, llvm::DenseMapInfo<llvm::SDNode*, void> >, 32u>::remove(llvm::SDNode* const&)                                            ▒
            - 4.15% llvm::SelectionDAG::ReplaceAllUsesWith(llvm::SDValue, llvm::SDValue)                                                                                                                                                            ▒
               - 3.07% llvm::SelectionDAG::AddModifiedNodeToCSEMaps(llvm::SDNode*)                                                                                                                                                                  ▒
                  + 2.59% llvm::FoldingSetBase::GetOrInsertNode(llvm::FoldingSetBase::Node*, llvm::FoldingSetBase::FoldingSetInfo const&)                                                                                                           ▒
            + 1.84% llvm::SelectionDAG::ReplaceAllUsesWith(llvm::SDNode*, llvm::SDNode*)                                                                                                                                                            ▒
            + 1.07% (anonymous namespace)::DAGCombiner::AddToWorklist(llvm::SDNode*, bool, bool)                                                                                                                                                    ▒
              0.84% llvm::SetVector<llvm::SDNode*, llvm::SmallVector<llvm::SDNode*, 32u>, llvm::DenseSet<llvm::SDNode*, llvm::DenseMapInfo<llvm::SDNode*, void> >, 32u>::insert(llvm::SDNode* const&)                                               ▒

@alexfh
Contributor

alexfh commented Jun 20, 2025

And a simpler reproducer (using code after SLP vectorizer): https://gcc.godbolt.org/z/bMGrr8Gze

@RKSimon
Collaborator Author

RKSimon commented Jun 20, 2025

Thanks - it looks like it's awakened a latent bug in load combining (which hasn't been touched for a long time...).

@RKSimon
Collaborator Author

RKSimon commented Jun 20, 2025

; ModuleID = 'bugpoint-reduced-simplified.bc'
source_filename = "bug.ll"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define void @blam(ptr readonly align 8 captures(none) dereferenceable(64) %arg) #0 {
  %getelementptr = getelementptr inbounds nuw i8, ptr %arg, i64 8
  %i = load <6 x i64>, ptr %getelementptr, align 8
  %i1 = shufflevector <6 x i64> %i, <6 x i64> poison, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
  store <4 x i64> %i1, ptr poison, align 8
  ret void
}
attributes #0 = { "target-features"="+avx" }

@RKSimon
Collaborator Author

RKSimon commented Jun 20, 2025

It looks like #140919 was the actual culprit - I'm putting together a partial reversion.

@RKSimon
Collaborator Author

RKSimon commented Jun 20, 2025

@alexfh @bgra8 please can you confirm whether #145077 fixes the regression on your end?
