[RISCV][CG]Use processShuffleMasks for per-register shuffles #120803
Conversation
Created using spr 1.3.5
@llvm/pr-subscribers-backend-risc-v

Author: Alexey Bataev (alexey-bataev)

Changes

Patch adds usage of processShuffleMasks in codegen.

Full diff: https://github.com/llvm/llvm-project/pull/120803.diff

2 Files Affected:
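For context when reading the diff: the new lowering hands the whole shuffle mask to processShuffleMasks, which splits it into per-destination-register sub-masks and reports the work for each destination register through three callbacks (no input needed, a single source register, or several source registers; the multi-input callback may fire more than once per destination register when inputs are combined in steps). The following signature sketch mirrors the call site in the diff; the authoritative declaration lives in llvm/include/llvm/Analysis/VectorUtils.h, and the parameter names here are only illustrative:

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/STLFunctionalExtras.h" // llvm::function_ref
using namespace llvm;

// Decomposes Mask into per-destination-register sub-masks. Source register
// indices >= NumOfSrcRegs refer to the second shuffle operand.
void processShuffleMasks(
    ArrayRef<int> Mask, unsigned NumOfSrcRegs, unsigned NumOfDestRegs,
    unsigned NumOfUsedRegs,
    // Destination register needs no input (entirely undef).
    function_ref<void()> NoInputAction,
    // (sub-mask, source reg index, destination reg index): the destination
    // register is produced from a single source register.
    function_ref<void(ArrayRef<int>, unsigned, unsigned)> SingleInputAction,
    // (sub-mask, first source reg index, second source reg index): the
    // destination register needs elements from more than one source register.
    function_ref<void(ArrayRef<int>, unsigned, unsigned)> ManyInputsAction);

In the patch below, the single-input callback becomes one per-register vector shuffle, and the multi-input callback chains two-input shuffles into the same destination register.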
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index ea8814aa2b4fc7..2ae9e78ed00bfb 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -5103,7 +5103,6 @@ static SDValue lowerShuffleViaVRegSplitting(ShuffleVectorSDNode *SVN,
SDValue V1 = SVN->getOperand(0);
SDValue V2 = SVN->getOperand(1);
ArrayRef<int> Mask = SVN->getMask();
- unsigned NumElts = VT.getVectorNumElements();
// If we don't know exact data layout, not much we can do. If this
// is already m1 or smaller, no point in splitting further.
@@ -5120,58 +5119,70 @@ static SDValue lowerShuffleViaVRegSplitting(ShuffleVectorSDNode *SVN,
MVT ElemVT = VT.getVectorElementType();
unsigned ElemsPerVReg = *VLen / ElemVT.getFixedSizeInBits();
- unsigned VRegsPerSrc = NumElts / ElemsPerVReg;
-
- SmallVector<std::pair<int, SmallVector<int>>>
- OutMasks(VRegsPerSrc, {-1, {}});
-
- // Check if our mask can be done as a 1-to-1 mapping from source
- // to destination registers in the group without needing to
- // write each destination more than once.
- for (unsigned DstIdx = 0; DstIdx < Mask.size(); DstIdx++) {
- int DstVecIdx = DstIdx / ElemsPerVReg;
- int DstSubIdx = DstIdx % ElemsPerVReg;
- int SrcIdx = Mask[DstIdx];
- if (SrcIdx < 0 || (unsigned)SrcIdx >= 2 * NumElts)
- continue;
- int SrcVecIdx = SrcIdx / ElemsPerVReg;
- int SrcSubIdx = SrcIdx % ElemsPerVReg;
- if (OutMasks[DstVecIdx].first == -1)
- OutMasks[DstVecIdx].first = SrcVecIdx;
- if (OutMasks[DstVecIdx].first != SrcVecIdx)
- // Note: This case could easily be handled by keeping track of a chain
- // of source values and generating two element shuffles below. This is
- // less an implementation question, and more a profitability one.
- return SDValue();
-
- OutMasks[DstVecIdx].second.resize(ElemsPerVReg, -1);
- OutMasks[DstVecIdx].second[DstSubIdx] = SrcSubIdx;
- }
EVT ContainerVT = getContainerForFixedLengthVector(DAG, VT, Subtarget);
MVT OneRegVT = MVT::getVectorVT(ElemVT, ElemsPerVReg);
MVT M1VT = getContainerForFixedLengthVector(DAG, OneRegVT, Subtarget);
assert(M1VT == getLMUL1VT(M1VT));
unsigned NumOpElts = M1VT.getVectorMinNumElements();
- SDValue Vec = DAG.getUNDEF(ContainerVT);
+ unsigned NormalizedVF = ContainerVT.getVectorMinNumElements();
+ unsigned NumOfSrcRegs = NormalizedVF / NumOpElts;
+ unsigned NumOfDestRegs = NormalizedVF / NumOpElts;
// The following semantically builds up a fixed length concat_vector
// of the component shuffle_vectors. We eagerly lower to scalable here
// to avoid DAG combining it back to a large shuffle_vector again.
V1 = convertToScalableVector(ContainerVT, V1, DAG, Subtarget);
V2 = convertToScalableVector(ContainerVT, V2, DAG, Subtarget);
- for (unsigned DstVecIdx = 0 ; DstVecIdx < OutMasks.size(); DstVecIdx++) {
- auto &[SrcVecIdx, SrcSubMask] = OutMasks[DstVecIdx];
- if (SrcVecIdx == -1)
+ SmallVector<SDValue> SubRegs(NumOfDestRegs);
+ unsigned RegCnt = 0;
+ unsigned PrevCnt = 0;
+ processShuffleMasks(
+ Mask, NumOfSrcRegs, NumOfDestRegs, NumOfDestRegs,
+ [&]() {
+ PrevCnt = RegCnt;
+ ++RegCnt;
+ },
+ [&, &DAG = DAG](ArrayRef<int> SrcSubMask, unsigned SrcVecIdx,
+ unsigned DstVecIdx) {
+ SDValue SrcVec = SrcVecIdx >= NumOfSrcRegs ? V2 : V1;
+ unsigned ExtractIdx = (SrcVecIdx % NumOfSrcRegs) * NumOpElts;
+ SDValue SubVec = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, M1VT, SrcVec,
+ DAG.getVectorIdxConstant(ExtractIdx, DL));
+ SubVec = convertFromScalableVector(OneRegVT, SubVec, DAG, Subtarget);
+ SubVec = DAG.getVectorShuffle(OneRegVT, DL, SubVec, SubVec, SrcSubMask);
+ SubRegs[RegCnt] = convertToScalableVector(M1VT, SubVec, DAG, Subtarget);
+ PrevCnt = RegCnt;
+ ++RegCnt;
+ },
+ [&, &DAG = DAG](ArrayRef<int> SrcSubMask, unsigned Idx1, unsigned Idx2) {
+ if (PrevCnt + 1 == RegCnt)
+ ++RegCnt;
+ SDValue SubVec1 = SubRegs[PrevCnt + 1];
+ if (!SubVec1) {
+ SDValue SrcVec = Idx1 >= NumOfSrcRegs ? V2 : V1;
+ unsigned ExtractIdx = (Idx1 % NumOfSrcRegs) * NumOpElts;
+ SubVec1 = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, M1VT, SrcVec,
+ DAG.getVectorIdxConstant(ExtractIdx, DL));
+ }
+ SubVec1 = convertFromScalableVector(OneRegVT, SubVec1, DAG, Subtarget);
+ SDValue SrcVec = Idx2 >= NumOfSrcRegs ? V2 : V1;
+ unsigned ExtractIdx = (Idx2 % NumOfSrcRegs) * NumOpElts;
+ SDValue SubVec2 = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, M1VT, SrcVec,
+ DAG.getVectorIdxConstant(ExtractIdx, DL));
+ SubVec2 = convertFromScalableVector(OneRegVT, SubVec2, DAG, Subtarget);
+ SubVec1 =
+ DAG.getVectorShuffle(OneRegVT, DL, SubVec1, SubVec2, SrcSubMask);
+ SubVec1 = convertToScalableVector(M1VT, SubVec1, DAG, Subtarget);
+ SubRegs[PrevCnt + 1] = SubVec1;
+ });
+ assert(RegCnt == NumOfDestRegs && "Whole vector must be processed");
+ SDValue Vec = DAG.getUNDEF(ContainerVT);
+ for (auto [I, V] : enumerate(SubRegs)) {
+ if (!V)
continue;
- unsigned ExtractIdx = (SrcVecIdx % VRegsPerSrc) * NumOpElts;
- SDValue SrcVec = (unsigned)SrcVecIdx >= VRegsPerSrc ? V2 : V1;
- SDValue SubVec = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, M1VT, SrcVec,
- DAG.getVectorIdxConstant(ExtractIdx, DL));
- SubVec = convertFromScalableVector(OneRegVT, SubVec, DAG, Subtarget);
- SubVec = DAG.getVectorShuffle(OneRegVT, DL, SubVec, SubVec, SrcSubMask);
- SubVec = convertToScalableVector(M1VT, SubVec, DAG, Subtarget);
- unsigned InsertIdx = DstVecIdx * NumOpElts;
- Vec = DAG.getNode(ISD::INSERT_SUBVECTOR, DL, ContainerVT, Vec, SubVec,
+ unsigned InsertIdx = I * NumOpElts;
+
+ Vec = DAG.getNode(ISD::INSERT_SUBVECTOR, DL, ContainerVT, Vec, V,
DAG.getVectorIdxConstant(InsertIdx, DL));
}
return convertFromScalableVector(VT, Vec, DAG, Subtarget);
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-exact-vlen.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-exact-vlen.ll
index f0ee780137300f..4e06d0094d945a 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-exact-vlen.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-exact-vlen.ll
@@ -168,12 +168,11 @@ define <4 x i64> @m2_splat_into_slide_two_source_v2_lo(<4 x i64> %v1, <4 x i64>
define <4 x i64> @m2_splat_into_slide_two_source(<4 x i64> %v1, <4 x i64> %v2) vscale_range(2,2) {
; CHECK-LABEL: m2_splat_into_slide_two_source:
; CHECK: # %bb.0:
-; CHECK-NEXT: vsetivli zero, 1, e8, mf8, ta, ma
-; CHECK-NEXT: vmv.v.i v0, 12
-; CHECK-NEXT: vsetivli zero, 4, e64, m2, ta, mu
+; CHECK-NEXT: vsetivli zero, 2, e64, m1, ta, ma
+; CHECK-NEXT: vslidedown.vi v13, v10, 1
+; CHECK-NEXT: vslideup.vi v13, v11, 1
; CHECK-NEXT: vrgather.vi v12, v8, 0
-; CHECK-NEXT: vslideup.vi v12, v10, 1, v0.t
-; CHECK-NEXT: vmv.v.v v8, v12
+; CHECK-NEXT: vmv2r.v v8, v12
; CHECK-NEXT: ret
%res = shufflevector <4 x i64> %v1, <4 x i64> %v2, <4 x i32> <i32 0, i32 0, i32 5, i32 6>
ret <4 x i64> %res
@@ -183,18 +182,17 @@ define void @shuffle1(ptr %explicit_0, ptr %explicit_1) vscale_range(2,2) {
; CHECK-LABEL: shuffle1:
; CHECK: # %bb.0:
; CHECK-NEXT: addi a0, a0, 252
+; CHECK-NEXT: vsetivli zero, 8, e32, m2, ta, ma
+; CHECK-NEXT: vmv.v.i v8, 0
; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma
-; CHECK-NEXT: vid.v v8
+; CHECK-NEXT: vid.v v10
; CHECK-NEXT: vsetivli zero, 3, e32, m1, ta, ma
-; CHECK-NEXT: vle32.v v9, (a0)
-; CHECK-NEXT: li a0, 175
-; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma
-; CHECK-NEXT: vsrl.vi v8, v8, 1
-; CHECK-NEXT: vmv.s.x v0, a0
-; CHECK-NEXT: vadd.vi v8, v8, 1
-; CHECK-NEXT: vrgather.vv v11, v9, v8
-; CHECK-NEXT: vsetivli zero, 8, e32, m2, ta, ma
-; CHECK-NEXT: vmerge.vim v8, v10, 0, v0
+; CHECK-NEXT: vle32.v v11, (a0)
+; CHECK-NEXT: vmv.v.i v0, 5
+; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, mu
+; CHECK-NEXT: vsrl.vi v10, v10, 1
+; CHECK-NEXT: vadd.vi v10, v10, 1
+; CHECK-NEXT: vrgather.vv v9, v11, v10, v0.t
; CHECK-NEXT: addi a0, a1, 672
; CHECK-NEXT: vs2r.v v8, (a0)
; CHECK-NEXT: ret
@@ -211,15 +209,15 @@ define void @shuffle1(ptr %explicit_0, ptr %explicit_1) vscale_range(2,2) {
define <16 x float> @shuffle2(<4 x float> %a) vscale_range(2,2) {
; CHECK-LABEL: shuffle2:
; CHECK: # %bb.0:
-; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma
-; CHECK-NEXT: vid.v v9
-; CHECK-NEXT: li a0, -97
-; CHECK-NEXT: vadd.vv v9, v9, v9
-; CHECK-NEXT: vrsub.vi v9, v9, 4
-; CHECK-NEXT: vmv.s.x v0, a0
-; CHECK-NEXT: vrgather.vv v13, v8, v9
; CHECK-NEXT: vsetivli zero, 16, e32, m4, ta, ma
-; CHECK-NEXT: vmerge.vim v8, v12, 0, v0
+; CHECK-NEXT: vmv1r.v v12, v8
+; CHECK-NEXT: vmv.v.i v8, 0
+; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, mu
+; CHECK-NEXT: vid.v v13
+; CHECK-NEXT: vadd.vv v13, v13, v13
+; CHECK-NEXT: vmv.v.i v0, 6
+; CHECK-NEXT: vrsub.vi v13, v13, 4
+; CHECK-NEXT: vrgather.vv v9, v12, v13, v0.t
; CHECK-NEXT: ret
%b = extractelement <4 x float> %a, i32 2
%c = insertelement <16 x float> <float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float undef, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00>, float %b, i32 5
@@ -231,16 +229,15 @@ define <16 x float> @shuffle2(<4 x float> %a) vscale_range(2,2) {
define i64 @extract_any_extend_vector_inreg_v16i64(<16 x i64> %a0, i32 %a1) vscale_range(2,2) {
; RV32-LABEL: extract_any_extend_vector_inreg_v16i64:
; RV32: # %bb.0:
-; RV32-NEXT: li a1, 16
-; RV32-NEXT: vsetivli zero, 16, e64, m8, ta, mu
+; RV32-NEXT: vsetivli zero, 16, e64, m8, ta, ma
; RV32-NEXT: vmv.v.i v16, 0
-; RV32-NEXT: vmv.s.x v0, a1
+; RV32-NEXT: vsetivli zero, 2, e64, m1, ta, mu
+; RV32-NEXT: vmv.v.i v0, 1
; RV32-NEXT: li a1, 32
-; RV32-NEXT: vrgather.vi v16, v8, 15, v0.t
-; RV32-NEXT: vsetvli zero, zero, e64, m8, ta, ma
+; RV32-NEXT: vrgather.vi v18, v15, 1, v0.t
+; RV32-NEXT: vsetivli zero, 1, e64, m8, ta, ma
; RV32-NEXT: vslidedown.vx v8, v16, a0
; RV32-NEXT: vmv.x.s a0, v8
-; RV32-NEXT: vsetivli zero, 1, e64, m8, ta, ma
; RV32-NEXT: vsrl.vx v8, v8, a1
; RV32-NEXT: vmv.x.s a1, v8
; RV32-NEXT: ret
@@ -258,13 +255,14 @@ define i64 @extract_any_extend_vector_inreg_v16i64(<16 x i64> %a0, i32 %a1) vsca
; RV64-NEXT: addi s0, sp, 256
; RV64-NEXT: .cfi_def_cfa s0, 0
; RV64-NEXT: andi sp, sp, -128
-; RV64-NEXT: li a1, -17
+; RV64-NEXT: vsetivli zero, 1, e8, mf8, ta, ma
+; RV64-NEXT: vmv.v.i v0, 1
; RV64-NEXT: vsetivli zero, 16, e64, m8, ta, ma
-; RV64-NEXT: vmv.s.x v0, a1
-; RV64-NEXT: vrgather.vi v16, v8, 15
-; RV64-NEXT: vmerge.vim v8, v16, 0, v0
+; RV64-NEXT: vmv.v.i v16, 0
+; RV64-NEXT: vsetivli zero, 2, e64, m1, ta, mu
+; RV64-NEXT: vrgather.vi v18, v15, 1, v0.t
; RV64-NEXT: mv s2, sp
-; RV64-NEXT: vs8r.v v8, (s2)
+; RV64-NEXT: vs8r.v v16, (s2)
; RV64-NEXT: andi a0, a0, 15
; RV64-NEXT: li a1, 8
; RV64-NEXT: call __muldi3
@@ -290,21 +288,16 @@ define i64 @extract_any_extend_vector_inreg_v16i64(<16 x i64> %a0, i32 %a1) vsca
define <4 x double> @shuffles_add(<4 x double> %0, <4 x double> %1) vscale_range(2,2) {
; CHECK-LABEL: shuffles_add:
; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 2, e64, m1, ta, mu
+; CHECK-NEXT: vmv1r.v v13, v10
+; CHECK-NEXT: vslideup.vi v13, v11, 1
+; CHECK-NEXT: vmv1r.v v8, v9
+; CHECK-NEXT: vmv.v.i v0, 1
+; CHECK-NEXT: vrgather.vi v12, v9, 0
+; CHECK-NEXT: vmv1r.v v9, v11
+; CHECK-NEXT: vrgather.vi v9, v10, 1, v0.t
; CHECK-NEXT: vsetivli zero, 4, e64, m2, ta, ma
-; CHECK-NEXT: vrgather.vi v12, v8, 2
-; CHECK-NEXT: vsetvli zero, zero, e16, mf2, ta, ma
-; CHECK-NEXT: vid.v v14
-; CHECK-NEXT: vmv.v.i v0, 12
-; CHECK-NEXT: vsetvli zero, zero, e64, m2, ta, ma
-; CHECK-NEXT: vrgather.vi v16, v8, 3
-; CHECK-NEXT: vsetvli zero, zero, e16, mf2, ta, ma
-; CHECK-NEXT: vadd.vv v8, v14, v14
-; CHECK-NEXT: vadd.vi v9, v8, -4
-; CHECK-NEXT: vadd.vi v8, v8, -3
-; CHECK-NEXT: vsetvli zero, zero, e64, m2, ta, mu
-; CHECK-NEXT: vrgatherei16.vv v12, v10, v9, v0.t
-; CHECK-NEXT: vrgatherei16.vv v16, v10, v8, v0.t
-; CHECK-NEXT: vfadd.vv v8, v12, v16
+; CHECK-NEXT: vfadd.vv v8, v12, v8
; CHECK-NEXT: ret
%3 = shufflevector <4 x double> %0, <4 x double> %1, <4 x i32> <i32 undef, i32 2, i32 4, i32 6>
%4 = shufflevector <4 x double> %0, <4 x double> %1, <4 x i32> <i32 undef, i32 3, i32 5, i32 7>
LGTM.
if (OutMasks[DstVecIdx].first == -1)
  OutMasks[DstVecIdx].first = SrcVecIdx;
if (OutMasks[DstVecIdx].first != SrcVecIdx)
  // Note: This case could easily be handled by keeping track of a chain
In addition to the functional issue which caused me to revert this change, your comments and tests don't address the profitability issue called out here. Please revise to include it.
Do you expect that the multi-shuffles may cause a perf degradation? Do you suggest adding some extra analysis and, if it requires more than one two-vector shuffle, just exiting early?
Possibly, yes. In particular, using a quadratic number of shuffles (with distinct indices and masks) vs a single one is a lot of code size increase.
I suspect this needs a bit of thought and investigation. I am not proposing any particular heuristic, and am open to being convinced that the right heuristic is to just blindly expand. I'd just like to see it explored and justified.
Though 1-2 shuffles is way too low a threshold. You definitely want something which allows at least the linear expansion of the code you replaced.
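One possible shape for such a check, purely as a sketch of the idea under discussion (the counting pass, the threshold, and the names here are assumptions, not part of this patch): run processShuffleMasks once in a count-only mode near the top of lowerShuffleViaVRegSplitting and bail out before building any nodes if the expansion would be clearly super-linear.

// Illustrative profitability gate only, not the reviewed patch: count the
// per-register shuffles the decomposition would emit, and fall back to the
// generic lowering if the count exceeds a linear budget.
unsigned NumSingleInputShuffles = 0;
unsigned NumTwoInputShuffles = 0;
processShuffleMasks(
    Mask, NumOfSrcRegs, NumOfDestRegs, NumOfDestRegs,
    []() { /* undef destination register: nothing emitted */ },
    [&](ArrayRef<int>, unsigned, unsigned) { ++NumSingleInputShuffles; },
    [&](ArrayRef<int>, unsigned, unsigned) { ++NumTwoInputShuffles; });
// Permit at least the linear expansion the old 1:1 register mapping produced;
// a quadratic blow-up (every destination register touched by many sources)
// is likely a net code-size loss. The factor of 2 is a placeholder.
if (NumSingleInputShuffles + 2 * NumTwoInputShuffles > 2 * NumOfDestRegs)
  return SDValue();

Under a gate like this, a fully quadratic mask (every destination register drawing from every source register) would still be rejected, while the shapes the previous 1-to-1 mapping handled would keep lowering here.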
@@ -168,12 +168,11 @@ define <4 x i64> @m2_splat_into_slide_two_source_v2_lo(<4 x i64> %v1, <4 x i64>
define <4 x i64> @m2_splat_into_slide_two_source(<4 x i64> %v1, <4 x i64> %v2) vscale_range(2,2) {
; CHECK-LABEL: m2_splat_into_slide_two_source:
; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 1, e8, mf8, ta, ma
Did I miss a separate commit which added additional testing for this change? It clearly isn't NFC, and the test changes here seem fairly minimal for something this involved. If not, please commit additional coverage (including the cause of the functional bug) before reapplying.
There are at least 3 tests which check the functionality; the changes just show up in this one.
You had added a couple tests in 78ab771; I had missed finding that on my first quick search.
Please add the following:
- A fully quadratic shuffle, possibly at multiple lmuls
- A mostly linear case - i.e. something which isn't linear, but you think is still a good example of the generality.
- Whatever the functional bug was.
Hm, noticed the usual auto update on the review didn't happen for some reason. I have reverted this in commit 6840521 for the reason described in the revert commit.
Unable to reproduce a crash, need a reproducer.
Ok, extracting now. ETA ~1-2 hours.
Reproducer:
Just feed the above to llc with your patch applied. I tried using bugpoint to reduce further, but it kept stumbling across other bugs.
Patch adds usage of processShuffleMasks in codegen in lowerShuffleViaVRegSplitting. This function is already used for X86 shuffle estimations and in DAGTypeLegalizer::SplitVecRes_VECTOR_SHUFFLE; using it here unifies the code.

Reviewers: preames, topperc, lukel97, wangpc-pp

Reviewed By: wangpc-pp

Pull Request: llvm/llvm-project#120803