[X86][Codegen] Shuffle certain shifts on i8 vectors to create opportunity for vectorized shift instructions #117980
base: main
Conversation
You can test this locally with the following command:

git-clang-format --diff 9f69da35e2e5438d0c042f76277fff397f6a1505 02249f3c811568e31e78b9290bb2189a089bc5ae --extensions cpp -- llvm/lib/Target/X86/X86ISelLowering.cpp

View the diff from clang-format here.

diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 90d7be73c6..7e0e0a5f95 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -29780,9 +29780,9 @@ template <typename InputTy, typename PermutationTy,
std::pair<typename InputTy::value_type,
typename PermutationTy::value_type>,
8>>
-static bool PermuteAndPairVector(
- const InputTy &Inputs, PermutationTy &Permutation,
- MapTy UnpairedInputs = MapTy()) {
+static bool PermuteAndPairVector(const InputTy &Inputs,
+ PermutationTy &Permutation,
+ MapTy UnpairedInputs = MapTy()) {
const auto Wildcard = ~typename InputTy::value_type();
SmallVector<typename PermutationTy::value_type, 16> WildcardPairs;
@@ -30258,7 +30258,8 @@ static SDValue LowerShift(SDValue Op, const X86Subtarget &Subtarget,
// Found a permutation P that can rearrange the shift amounts into adjacent
// pair or quad of same values. Rewrite the shift S1(x) into P^-1(S2(P(x))).
if (Profitable) {
- SDValue InnerShuffle = DAG.getVectorShuffle(VT, dl, R, DAG.getUNDEF(VT), Permutation);
+ SDValue InnerShuffle =
+ DAG.getVectorShuffle(VT, dl, R, DAG.getUNDEF(VT), Permutation);
SmallVector<SDValue, 64> NewShiftAmt;
for (int Index : Permutation) {
NewShiftAmt.push_back(Amt.getOperand(Index));
@@ -30267,7 +30268,8 @@ static SDValue LowerShift(SDValue Op, const X86Subtarget &Subtarget,
for (size_t I = 0; I < NewShiftAmt.size(); I += 2) {
SDValue Even = NewShiftAmt[I];
SDValue Odd = NewShiftAmt[I + 1];
- assert(Even.isUndef() || Odd.isUndef() || Even->getAsZExtVal() == Odd->getAsZExtVal());
+ assert(Even.isUndef() || Odd.isUndef() ||
+ Even->getAsZExtVal() == Odd->getAsZExtVal());
}
#endif
SDValue NewShiftVector = DAG.getBuildVector(VT, dl, NewShiftAmt);
@@ -30276,7 +30278,8 @@ static SDValue LowerShift(SDValue Op, const X86Subtarget &Subtarget,
for (size_t I = 0; I < Permutation.size(); ++I) {
InversePermutation[Permutation[I]] = I;
}
- SDValue OuterShuffle = DAG.getVectorShuffle(VT, dl, NewShift, DAG.getUNDEF(VT), InversePermutation);
+ SDValue OuterShuffle = DAG.getVectorShuffle(
+ VT, dl, NewShift, DAG.getUNDEF(VT), InversePermutation);
return OuterShuffle;
}
}
can you add a test case?
There should be quite a few tests that have already been changed by this.
@huangjd please can you regenerate the changed tests to get some ideas of what's going on so far? The cost of the shuffles has to be very low to be as cheap as pmullw. It could be that we might be better off just using the general vXi8 lowering with blendvb for some constant combos to get a similar effect to what you're trying here.
I am getting the test cases now; before that I am measuring the impact of this transformation. From some preliminary results I found that, if running in a loop where the CPU pipeline can be sufficiently filled, this transformation can be beneficial; otherwise it is questionable. Given that vector arithmetic operations are typically used in ML kernels or other highly parallel code, could there be a compile flag to toggle this behavior?
Lowering occurs in the DAG, which handles each basic block independently without any real understanding of whether it's part of a hot loop etc. - a compile flag would struggle to account for every case. This issue has come up a few times recently, but I'm not sure how easy it'd be to delay things like this to a MachineLICM pass afterwards given how different the codegen could turn out to be. Depending on your target CPU: I did start work on generic vXi8 shift lowering using GFNI instructions (#89644), which I haven't had time to go back to. Have you looked at anything similar?
@llvm/pr-subscribers-backend-x86

Author: William Huang (huangjd)

Changes

Vectorized shift instructions are not available for the i8 type. The current typical way to handle a shift on an i8 vector is to use two vector i16 multiplies to get the even and odd bytes separately and then combine them. If the shift amount is a constant vector and we can shuffle that constant vector so that each pair or quad of adjacent elements has the same value, we can obtain the result by using a vector shift on a widened type and then a vector AND to clear the bits that are supposed to be shifted out of each byte. This is typically faster than using vector multiplies, as long as the shuffle itself is also fast (because we need to shuffle the operand before the shift and shuffle the result back to its original order afterwards). An illustrative sketch of the rewrite follows the truncated diff below.

Patch is 52.60 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/117980.diff

9 Files Affected:
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 1c790f3813b7a4..5444d9a91da99c 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -28,7 +28,6 @@
#include "llvm/ADT/StringSwitch.h"
#include "llvm/Analysis/BlockFrequencyInfo.h"
#include "llvm/Analysis/ObjCARCUtil.h"
-#include "llvm/Analysis/ProfileSummaryInfo.h"
#include "llvm/Analysis/VectorUtils.h"
#include "llvm/CodeGen/IntrinsicLowering.h"
#include "llvm/CodeGen/MachineFrameInfo.h"
@@ -29766,6 +29765,113 @@ static SDValue convertShiftLeftToScale(SDValue Amt, const SDLoc &dl,
return SDValue();
}
+// Given a vector of values, find a permutation such that every adjacent even-
+// odd pair has the same value. ~0 is reserved as a special value for wildcard,
+// which can be paired with any value. Returns true if a permutation is found.
+// If output Permutation is not empty, permutation index starts at its previous
+// size, so that this function can concatenate the result of multiple calls.
+// UnpairedInputs contains values yet to be paired, mapping an unpaired value to
+// its current neighbor's value and index.
+// Do not use llvm::DenseMap as ~0 is reserved key.
+template <typename InputTy, typename PermutationTy,
+ typename MapTy =
+ SmallMapVector<typename InputTy::value_type,
+ std::pair<typename InputTy::value_type,
+ typename PermutationTy::value_type>,
+ 8>>
+static bool PermuteAndPairVector(
+ const InputTy &Inputs, PermutationTy &Permutation,
+ MapTy UnpairedInputs = MapTy()) {
+ static_assert(std::is_same<typename InputTy::value_type, uint8_t>::value);
+ const typename InputTy::value_type Wildcard = ~0;
+ SmallVector<typename PermutationTy::value_type, 16> WildcardPairs;
+
+ size_t OutputOffset = Permutation.size();
+ typename PermutationTy::value_type I = 0;
+ for (auto InputIt = Inputs.begin(), InputEnd = Inputs.end();
+ InputIt != InputEnd;) {
+ Permutation.push_back(OutputOffset + I);
+ Permutation.push_back(OutputOffset + I + 1);
+
+ auto Even = *InputIt++;
+ assert(InputIt != InputEnd && "Expected even number of elements");
+ auto Odd = *InputIt++;
+
+ // If both are wildcards, note it for later use by unpairable values.
+ if (Even == Wildcard && Odd == Wildcard) {
+ WildcardPairs.push_back(I);
+ }
+
+ // If both are equal, they are in good position.
+ if (Even != Odd) {
+ auto DoWork = [&](auto &This, auto ThisIndex, auto Other,
+ auto OtherIndex) {
+ if (This != Wildcard) {
+ // For non-wildcard value, check if it can pair with an existing
+ // unpaired value from UnpairedInputs, if so, swap with the unpaired
+ // value's neighbor, otherwise the current value is added to the map.
+ if (auto [MapIt, Inserted] = UnpairedInputs.try_emplace(
+ This, std::make_pair(Other, OtherIndex));
+ !Inserted) {
+ auto [SwapValue, SwapIndex] = MapIt->second;
+ std::swap(Permutation[OutputOffset + SwapIndex],
+ Permutation[OutputOffset + ThisIndex]);
+ This = SwapValue;
+ UnpairedInputs.erase(MapIt);
+
+ if (This == Other) {
+ if (This == Wildcard) {
+ // We freed up a wildcard pair by pairing two non-adjacent
+ // values, note it for later use by unpairable values.
+ WildcardPairs.push_back(I);
+ } else {
+ // The swapped element also forms a pair with Other, so it can
+ // be removed from the map.
+ assert(UnpairedInputs.count(This));
+ UnpairedInputs.erase(This);
+ }
+ } else {
+ // Swapped in an unpaired value, update its info.
+ if (This != Wildcard) {
+ assert(UnpairedInputs.count(This));
+ UnpairedInputs[This] = std::make_pair(Other, OtherIndex);
+ }
+ // If its neighbor is also in UnpairedInputs, update its info too.
+ if (auto OtherMapIt = UnpairedInputs.find(Other);
+ OtherMapIt != UnpairedInputs.end() &&
+ OtherMapIt->second.second == ThisIndex) {
+ OtherMapIt->second.first = This;
+ }
+ }
+ }
+ }
+ };
+ DoWork(Even, I, Odd, I + 1);
+ if (Even != Odd) {
+ DoWork(Odd, I + 1, Even, I);
+ }
+ }
+ I += 2;
+ }
+
+ // Now check if each remaining unpaired neighboring values can be swapped with
+ // a wildcard pair to form two paired values.
+ for (auto &[Unpaired, V] : UnpairedInputs) {
+ auto [Neighbor, NeighborIndex] = V;
+ if (Neighbor != Wildcard) {
+ assert(UnpairedInputs.count(Neighbor));
+ if (WildcardPairs.size()) {
+ std::swap(Permutation[OutputOffset + WildcardPairs.back()],
+ Permutation[OutputOffset + NeighborIndex]);
+ WildcardPairs.pop_back();
+ // Mark the neighbor as processed.
+ UnpairedInputs[Neighbor].first = Wildcard;
+ } else
+ return false;
+ }
+ }
+ return true;
+}
+
static SDValue LowerShift(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
MVT VT = Op.getSimpleValueType();
@@ -30044,6 +30150,136 @@ static SDValue LowerShift(SDValue Op, const X86Subtarget &Subtarget,
}
}
+ // SHL/SRL/SRA on vXi8 can be widened to vYi16 or vYi32 if the constant
+ // amounts can be shuffled such that every pair or quad of adjacent elements
+ // has the same value. This introduces an extra shuffle before and after the
+ // shift, and it is profitable if the operand is already a shuffle so that both
+ // can be merged or the extra shuffle is fast.
+ // (shift (shuffle X P1) S1) ->
+ // (shuffle (shift (shuffle X (shuffle P2 P1)) S2) P2^-1) where S2 can be
+ // widened, and P2^-1 is the inverse shuffle of P2.
+ // This is not profitable on XOP or AVX512 because it has 8/16-bit vector
+ // variable shift instructions.
+ // Picking out GFNI because normally it implies AVX512, and there is no
+ // latency data for CPU with GFNI and SSE or AVX only, but there are tests for
+ // such combination anyways.
+ if (ConstantAmt &&
+ (VT == MVT::v16i8 || VT == MVT::v32i8 || VT == MVT::v64i8) &&
+ R.hasOneUse() && Subtarget.hasSSSE3() && !Subtarget.hasAVX512() &&
+ !Subtarget.hasXOP() && !Subtarget.hasGFNI()) {
+ constexpr size_t LaneBytes = 16;
+ const size_t NumLanes = VT.getVectorNumElements() / LaneBytes;
+
+ SmallVector<int, 64> Permutation;
+ SmallVector<uint8_t, 64> ShiftAmt;
+ for (size_t I = 0; I < Amt.getNumOperands(); ++I) {
+ if (Amt.getOperand(I).isUndef())
+ ShiftAmt.push_back(~0);
+ else {
+ auto A = Amt.getConstantOperandVal(I);
+ ShiftAmt.push_back(A > 8 ? 8 : A);
+ }
+ }
+
+ // Check if we can find an in-lane shuffle to rearrange the shift amounts,
+ // if so, this transformation may be profitable. Cross-lane shuffle is
+ // almost never profitable because there is no general 1-instruction
+ // solution.
+ bool Profitable;
+ for (size_t I = 0; I < NumLanes; ++I) {
+ if (!(Profitable = PermuteAndPairVector(
+ ArrayRef(&ShiftAmt[I * LaneBytes], LaneBytes), Permutation)))
+ break;
+ }
+
+ // For AVX2, check if we can further rearrange shift amounts into adjacent
+ // quads, so that it can use VPS*LVD instead of VPMUL*W as it is 2 cycles
+ // faster.
+ bool IsAdjacentQuads = false;
+ if (Profitable && Subtarget.hasAVX2()) {
+ SmallVector<uint8_t, 64> EveryOtherShiftAmt;
+ for (size_t I = 0; I < Permutation.size(); I += 2) {
+ uint8_t Shift1 = ShiftAmt[Permutation[I]];
+ uint8_t Shift2 = ShiftAmt[Permutation[I + 1]];
+ assert(Shift1 == Shift2 || Shift1 == (uint8_t) ~0 ||
+ Shift2 == (uint8_t) ~0);
+ EveryOtherShiftAmt.push_back(Shift1 != (uint8_t) ~0 ? Shift1 : Shift2);
+ }
+ SmallVector<int, 32> Permutation2;
+ for (size_t I = 0; I < NumLanes; ++I) {
+ if (!(IsAdjacentQuads = PermuteAndPairVector(
+ ArrayRef(&EveryOtherShiftAmt[I * LaneBytes / 2],
+ LaneBytes / 2),
+ Permutation2)))
+ break;
+ }
+ if (IsAdjacentQuads) {
+ SmallVector<int, 64> CombinedPermutation;
+ for (int Index : Permutation2) {
+ CombinedPermutation.push_back(Permutation[Index * 2]);
+ CombinedPermutation.push_back(Permutation[Index * 2 + 1]);
+ }
+ std::swap(Permutation, CombinedPermutation);
+ }
+ }
+
+ // For right shifts, (V)PMULHUW needs 2 extra instructions to handle an
+ // amount of 0, making it unprofitable.
+ if (!IsAdjacentQuads && (Opc == ISD::SRL || Opc == ISD::SRA) &&
+ any_of(ShiftAmt, [](uint8_t x) { return x == 0; }))
+ Profitable = false;
+
+ bool IsOperandShuffle = R.getOpcode() == ISD::VECTOR_SHUFFLE;
+ // If operand R is a shuffle, one of the two shuffles introduced by this
+ // transformation can be merged with it, and the extra shuffle is 1 cycle.
+ // This is generally profitable because it eliminates one (or both) vector
+ // multiplication, which has to be scheduled at least 1 cycle apart.
+ // If operand R is not a shuffle, several cases are not profitable based on
+ // pipeline modeling, so we are excluding them here.
+ if (!IsOperandShuffle) {
+ // A hack to detect AMD CPU.
+ if (Subtarget.hasSSE4A() && Opc == ISD::SRA) {
+ if (Opc == ISD::SRA)
+ Profitable = false;
+ } else {
+ if ((Subtarget.hasAVX() && !Subtarget.hasAVX2()) ||
+ (Subtarget.hasAVX2() && !IsAdjacentQuads))
+ Profitable = false;
+ }
+ }
+
+ // Found a permutation P that can rearrange the shift amounts into adjacent
+ // pair or quad of same values. Rewrite the shift S1(x) into P^-1(S2(P(x))).
+ if (Profitable) {
+ SDValue InnerShuffle =
+ DAG.getVectorShuffle(VT, dl, R, DAG.getUNDEF(VT), Permutation);
+ SmallVector<SDValue, 64> NewShiftAmt;
+ for (int Index : Permutation) {
+ NewShiftAmt.push_back(Amt.getOperand(Index));
+ }
+ // If using (V)PMULHUW, any undef pair is resolved to shift by 8 so that
+ // it does not create extra instructions in case it is resolved to 0.
+ for (size_t I = 0; I < NewShiftAmt.size(); I += 2) {
+ SDValue &Even = NewShiftAmt[I];
+ SDValue &Odd = NewShiftAmt[I + 1];
+ assert(Even.isUndef() || Odd.isUndef() ||
+ Even->getAsZExtVal() == Odd->getAsZExtVal());
+ if (!IsAdjacentQuads && Even.isUndef() && Odd.isUndef())
+ Even = DAG.getConstant(8, dl, VT.getScalarType());
+ }
+
+ SDValue NewShiftVector = DAG.getBuildVector(VT, dl, NewShiftAmt);
+ SDValue NewShift = DAG.getNode(Opc, dl, VT, InnerShuffle, NewShiftVector);
+ SmallVector<int, 64> InversePermutation(Permutation.size());
+ for (size_t I = 0; I < Permutation.size(); ++I) {
+ InversePermutation[Permutation[I]] = I;
+ }
+ SDValue OuterShuffle = DAG.getVectorShuffle(
+ VT, dl, NewShift, DAG.getUNDEF(VT), InversePermutation);
+ return OuterShuffle;
+ }
+ }
+
// If possible, lower this packed shift into a vector multiply instead of
// expanding it into a sequence of scalar shifts.
// For v32i8 cases, it might be quicker to split/extend to vXi16 shifts.
diff --git a/llvm/test/CodeGen/X86/combine-sdiv.ll b/llvm/test/CodeGen/X86/combine-sdiv.ll
index 2b392e69297f07..b14c839a6f1f11 100644
--- a/llvm/test/CodeGen/X86/combine-sdiv.ll
+++ b/llvm/test/CodeGen/X86/combine-sdiv.ll
@@ -351,32 +351,20 @@ define <16 x i8> @combine_vec_sdiv_by_pow2b_v16i8(<16 x i8> %x) {
; SSE41-LABEL: combine_vec_sdiv_by_pow2b_v16i8:
; SSE41: # %bb.0:
; SSE41-NEXT: movdqa %xmm0, %xmm1
-; SSE41-NEXT: pxor %xmm0, %xmm0
-; SSE41-NEXT: pxor %xmm3, %xmm3
-; SSE41-NEXT: pcmpgtb %xmm1, %xmm3
-; SSE41-NEXT: pmovzxbw {{.*#+}} xmm2 = xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero,xmm3[4],zero,xmm3[5],zero,xmm3[6],zero,xmm3[7],zero
-; SSE41-NEXT: punpckhbw {{.*#+}} xmm3 = xmm3[8],xmm0[8],xmm3[9],xmm0[9],xmm3[10],xmm0[10],xmm3[11],xmm0[11],xmm3[12],xmm0[12],xmm3[13],xmm0[13],xmm3[14],xmm0[14],xmm3[15],xmm0[15]
-; SSE41-NEXT: movdqa {{.*#+}} xmm0 = [256,4,2,16,8,32,64,2]
-; SSE41-NEXT: pmullw %xmm0, %xmm3
-; SSE41-NEXT: psrlw $8, %xmm3
-; SSE41-NEXT: pmullw %xmm0, %xmm2
-; SSE41-NEXT: psrlw $8, %xmm2
-; SSE41-NEXT: packuswb %xmm3, %xmm2
+; SSE41-NEXT: pshufb {{.*#+}} xmm1 = xmm1[9,1,2,7,4,12,11,3,8,0,14,6,5,13,10,15]
+; SSE41-NEXT: pxor %xmm2, %xmm2
+; SSE41-NEXT: pcmpgtb %xmm1, %xmm2
+; SSE41-NEXT: pmulhuw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm2 # [1024,512,2048,4096,256,16384,8192,512]
+; SSE41-NEXT: pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm2
; SSE41-NEXT: paddb %xmm1, %xmm2
-; SSE41-NEXT: movdqa %xmm2, %xmm0
-; SSE41-NEXT: punpckhbw {{.*#+}} xmm0 = xmm0[8],xmm2[8],xmm0[9],xmm2[9],xmm0[10],xmm2[10],xmm0[11],xmm2[11],xmm0[12],xmm2[12],xmm0[13],xmm2[13],xmm0[14],xmm2[14],xmm0[15],xmm2[15]
-; SSE41-NEXT: psraw $8, %xmm0
-; SSE41-NEXT: movdqa {{.*#+}} xmm3 = [256,64,128,16,32,8,4,128]
-; SSE41-NEXT: pmullw %xmm3, %xmm0
-; SSE41-NEXT: psrlw $8, %xmm0
-; SSE41-NEXT: punpcklbw {{.*#+}} xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
-; SSE41-NEXT: psraw $8, %xmm2
-; SSE41-NEXT: pmullw %xmm3, %xmm2
-; SSE41-NEXT: psrlw $8, %xmm2
-; SSE41-NEXT: packuswb %xmm0, %xmm2
-; SSE41-NEXT: movaps {{.*#+}} xmm0 = [0,255,255,255,255,255,255,255,0,255,255,255,255,255,255,255]
-; SSE41-NEXT: pblendvb %xmm0, %xmm2, %xmm1
-; SSE41-NEXT: movdqa %xmm1, %xmm0
+; SSE41-NEXT: pmulhuw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm2 # [16384,32768,8192,4096,256,1024,2048,32768]
+; SSE41-NEXT: pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm2
+; SSE41-NEXT: movdqa {{.*#+}} xmm1 = [32,32,64,64,16,16,8,8,u,u,2,2,4,4,64,64]
+; SSE41-NEXT: pxor %xmm1, %xmm2
+; SSE41-NEXT: psubb %xmm1, %xmm2
+; SSE41-NEXT: pshufb {{.*#+}} xmm2 = zero,xmm2[1,2,7,4,12,11,3],zero,xmm2[0,14,6,5,13,10,15]
+; SSE41-NEXT: pshufb {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,zero,zero,zero,zero,xmm0[8],zero,zero,zero,zero,zero,zero,zero
+; SSE41-NEXT: por %xmm2, %xmm0
; SSE41-NEXT: retq
;
; AVX1-LABEL: combine_vec_sdiv_by_pow2b_v16i8:
@@ -2184,39 +2172,23 @@ define <16 x i8> @non_splat_minus_one_divisor_1(<16 x i8> %A) {
; SSE41-LABEL: non_splat_minus_one_divisor_1:
; SSE41: # %bb.0:
; SSE41-NEXT: movdqa %xmm0, %xmm1
-; SSE41-NEXT: pxor %xmm0, %xmm0
-; SSE41-NEXT: pxor %xmm3, %xmm3
-; SSE41-NEXT: pcmpgtb %xmm1, %xmm3
-; SSE41-NEXT: pxor %xmm4, %xmm4
-; SSE41-NEXT: punpcklbw {{.*#+}} xmm4 = xmm4[0],xmm3[0],xmm4[1],xmm3[1],xmm4[2],xmm3[2],xmm4[3],xmm3[3],xmm4[4],xmm3[4],xmm4[5],xmm3[5],xmm4[6],xmm3[6],xmm4[7],xmm3[7]
-; SSE41-NEXT: pmovzxbw {{.*#+}} xmm2 = xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero,xmm3[4],zero,xmm3[5],zero,xmm3[6],zero,xmm3[7],zero
-; SSE41-NEXT: psllw $1, %xmm2
-; SSE41-NEXT: pblendw {{.*#+}} xmm2 = xmm4[0,1],xmm2[2],xmm4[3,4,5],xmm2[6],xmm4[7]
-; SSE41-NEXT: psrlw $8, %xmm2
-; SSE41-NEXT: punpckhbw {{.*#+}} xmm3 = xmm3[8],xmm0[8],xmm3[9],xmm0[9],xmm3[10],xmm0[10],xmm3[11],xmm0[11],xmm3[12],xmm0[12],xmm3[13],xmm0[13],xmm3[14],xmm0[14],xmm3[15],xmm0[15]
-; SSE41-NEXT: pmullw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm3 # [256,2,2,2,2,128,2,128]
-; SSE41-NEXT: psrlw $8, %xmm3
-; SSE41-NEXT: packuswb %xmm3, %xmm2
+; SSE41-NEXT: pshufb {{.*#+}} xmm1 = xmm1[0,1,2,6,4,5,3,7,12,9,10,11,15,13,14,8]
+; SSE41-NEXT: pxor %xmm2, %xmm2
+; SSE41-NEXT: pcmpgtb %xmm1, %xmm2
+; SSE41-NEXT: pmulhuw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm2 # [256,512,256,256,512,512,32768,512]
+; SSE41-NEXT: pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm2
; SSE41-NEXT: paddb %xmm1, %xmm2
-; SSE41-NEXT: movdqa %xmm2, %xmm0
-; SSE41-NEXT: punpckhbw {{.*#+}} xmm0 = xmm0[8],xmm2[8],xmm0[9],xmm2[9],xmm0[10],xmm2[10],xmm0[11],xmm2[11],xmm0[12],xmm2[12],xmm0[13],xmm2[13],xmm0[14],xmm2[14],xmm0[15],xmm2[15]
-; SSE41-NEXT: psraw $8, %xmm0
-; SSE41-NEXT: movdqa %xmm0, %xmm3
-; SSE41-NEXT: psllw $1, %xmm3
-; SSE41-NEXT: psllw $7, %xmm0
-; SSE41-NEXT: pblendw {{.*#+}} xmm0 = xmm0[0,1,2,3,4],xmm3[5],xmm0[6],xmm3[7]
-; SSE41-NEXT: psrlw $8, %xmm0
-; SSE41-NEXT: punpcklbw {{.*#+}} xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
-; SSE41-NEXT: psraw $8, %xmm2
-; SSE41-NEXT: psllw $7, %xmm2
-; SSE41-NEXT: psrlw $8, %xmm2
-; SSE41-NEXT: packuswb %xmm0, %xmm2
-; SSE41-NEXT: movaps {{.*#+}} xmm0 = [0,0,255,0,0,0,255,0,0,255,255,255,255,255,255,255]
-; SSE41-NEXT: pblendvb %xmm0, %xmm2, %xmm1
-; SSE41-NEXT: movdqa {{.*#+}} xmm0 = [255,255,0,255,255,255,0,255,255,0,0,0,0,255,0,255]
-; SSE41-NEXT: pxor %xmm0, %xmm1
-; SSE41-NEXT: psubb %xmm0, %xmm1
-; SSE41-NEXT: movdqa %xmm1, %xmm0
+; SSE41-NEXT: pmulhuw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm2 # [256,32768,256,256,32768,32768,512,32768]
+; SSE41-NEXT: pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm2
+; SSE41-NEXT: movdqa {{.*#+}} xmm1 = [u,u,64,64,u,u,u,u,64,64,64,64,1,1,64,u]
+; SSE41-NEXT: pxor %xmm1, %xmm2
+; SSE41-NEXT: psubb %xmm1, %xmm2
+; SSE41-NEXT: pshufb {{.*#+}} xmm2 = zero,zero,xmm2[2],zero,zero,zero,xmm2[3],zero,zero,xmm2[9,10,11,8,13,14,12]
+; SSE41-NEXT: pshufb {{.*#+}} xmm0 = xmm0[0,1],zero,xmm0[3,4,5],zero,xmm0[7,8],zero,zero,zero,zero,zero,zero,zero
+; SSE41-NEXT: por %xmm2, %xmm0
+; SSE41-NEXT: movdqa {{.*#+}} xmm1 = [255,255,0,255,255,255,0,255,255,0,0,0,0,255,0,255]
+; SSE41-NEXT: pxor %xmm1, %xmm0
+; SSE41-NEXT: psubb %xmm1, %xmm0
; SSE41-NEXT: retq
;
; AVX1-LABEL: non_splat_minus_one_divisor_1:
@@ -2253,25 +2225,23 @@ define <16 x i8> @non_splat_minus_one_divisor_1(<16 x i8> %A) {
;
; AVX2-LABEL: non_splat_minus_one_divisor_1:
; AVX2: # %bb.0:
-; AVX2-NEXT: vpxor %xmm1, %xmm1, %xmm1
-; AVX2-NEXT: vpcmpgtb %xmm0, %xmm1, %xmm1
-; AVX2-NEXT: vpmovzxbw {{.*#+}} ymm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero,xmm1[8],zero,xmm1[9],zero,xmm1[10],zero,xmm1[11],zero,xmm1[12],zero,xmm1[13],zero,xmm1[14],zero,xmm1[15],zero
-; AVX2-NEXT: vpmullw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm1, %ymm1 # [256,256,2,256,256,256,2,256,256,2,2,2,2,128,2,128]
-; AVX2-NEXT: vpsrlw $8, %ymm1, %ymm1
-; AVX2-NEXT: vextracti128 $1, %ymm1, %xmm2
-; AVX2-NEXT: vpackuswb %xmm2, %xmm1, %xmm1
-; AVX2-NEXT: vpaddb %xmm1, %xmm0, %xmm1
-; AVX2-NEXT: vpmovsxbw %xmm1, %ymm1
-; AVX2-NEXT: vpmullw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm1, %ymm1 # [256,256,128,256,256,256,128,256,256,128,128,128,128,2,128,2]
-; AVX2-NEXT: vpsrlw $8, %ymm1, %ymm1
-; AVX2-NEXT: vextracti128 $1, %ymm1, %xmm2
-; AVX2-NEXT: vpackuswb %xmm2, %xmm1, %xmm1
-; AVX2-NEXT: vmovdqa {{.*#+}} xmm2 = [0,0,255,0,0,0,255,0,0,255,255,255,255,255,255,255]
-; AVX2-NEXT: vpblendvb %xmm2, %xmm1, %xmm0, %xmm0
+; AVX2-NEXT: vpshufb {{.*#+}} xmm1 = xmm0[14,8,2,6,4,5,3,7,12,9,10,11,15,13,0,1]
+; AVX2-NEXT: vpxor %xmm2, %xmm2, %xmm2
+; AVX2-NEXT: vpcmpgtb %xmm1, %xmm2, %xmm2
+; AVX2-NEXT: vpsrlvd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm2, %xmm2
+; AVX2-NEXT: vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm2, %xmm2
+; AVX2-NEXT: vpaddb %xmm2, %xmm1, %xmm1
+; AVX2-NEXT: vpsrlvd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT: vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT: vpbroadcastq {{.*#+}} xmm2 = [64,64,64,64,1,1,0,0,64,64,64,64,1,1,0,0]
+; AVX2-NEXT: vpxor %xmm2, %xmm1, %xmm1
+; AVX2-NEXT: vpsubb %xmm2, %xmm1, %xmm1
+; AVX2-NEXT: vpshufb {{.*#+}} xmm1 = zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,xmm1[9,10,11,8,13,0,12]
+; AVX2-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[0,1],zero,xmm0[3,4,5],zero,xmm0[7,8],zero,zero,zero,zero,zero,zero,zero
+; AVX2-NEXT: vpor %xmm0, %xmm1, %xmm0
; AVX2-NEXT: vmovdqa {{.*#+}} xmm1 = [255,255,0,255,255,255,0,255,255,0,0,0,0,255,0,255]
; AVX2-NEXT: vpxor %xmm1, %xmm0, %xmm0
; AVX2-NEXT: vpsub...
[truncated]
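To make the transformation described in the PR summary concrete, here is a hedged, scalar C++ model of the rewrite P^-1(S2(P(x))) (illustrative only, not code from the patch): the input bytes, shift amounts, and the hand-picked permutation are invented for this example, and the byte-pair arithmetic stands in for the widened vXi16 shift plus AND that the lowering emits.

// Scalar model: permute bytes so adjacent pairs share a shift amount, do one
// 16-bit shift per pair, mask off bits that crossed a byte boundary, then
// apply the inverse permutation. Checks against plain per-byte shifts.
#include <array>
#include <cassert>
#include <cstdint>
#include <cstdio>

int main() {
  std::array<uint8_t, 16> X = {1, 2,  3,  4,  5,  6,  7,  8,
                               9, 10, 11, 12, 13, 14, 15, 16};
  // Constant shift amounts; each value occurs an even number of times, so an
  // in-lane permutation can pair equal amounts.
  std::array<uint8_t, 16> Amt = {1, 3, 3, 1, 2, 5, 5, 2,
                                 4, 7, 7, 4, 6, 0, 0, 6};
  // Hand-picked permutation P: element I of the permuted vector is X[P[I]],
  // and Amt[P[2k]] == Amt[P[2k+1]] for every pair.
  std::array<int, 16> P = {0, 3,  1, 2,  4,  7,  5,  6,
                           8, 11, 9, 10, 12, 15, 13, 14};
  std::array<int, 16> PInv{};
  for (int I = 0; I < 16; ++I)
    PInv[P[I]] = I;

  std::array<uint8_t, 16> Ref{}, Shuffled{}, Widened{}, Result{};
  for (int I = 0; I < 16; ++I)
    Ref[I] = uint8_t(X[I] << Amt[I]); // reference: independent i8 shifts

  for (int I = 0; I < 16; ++I)
    Shuffled[I] = X[P[I]]; // inner shuffle
  for (int I = 0; I < 16; I += 2) {
    uint8_t A = Amt[P[I]];
    assert(A == Amt[P[I + 1]] && "pair must share a shift amount");
    // One element of the widened vXi16 shift (little-endian byte order).
    uint16_t Lane = uint16_t(Shuffled[I] | (Shuffled[I + 1] << 8));
    uint16_t Sh = uint16_t(Lane << A);
    uint8_t Mask = uint8_t(0xFF << A); // AND clears bits that crossed a byte
    Widened[I] = uint8_t(Sh) & Mask;
    Widened[I + 1] = uint8_t(Sh >> 8) & Mask;
  }
  for (int I = 0; I < 16; ++I)
    Result[I] = Widened[PInv[I]]; // outer (inverse) shuffle

  for (int I = 0; I < 16; ++I)
    assert(Result[I] == Ref[I]);
  std::printf("scalar model matches per-byte shifts\n");
  return 0;
}

In the generated code these steps roughly correspond to the pshufb, widened shift (or pmulhuw for right shifts), pand, and final pshufb sequences visible in the updated test checks.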
Updated affected tests (branch updated from 1a2d6fe to 3268bde)
Cost/benefit analysis below, assuming a fully utilized pipeline (the vXi8 column is the original latency; the vXi16 and vXi32 columns are the latencies for the shift widened to 16-bit and 32-bit elements, respectively).
Won't affect GFNI, as my patch does not apply on AVX512, and all GFNI CPUs have AVX512.
// Picking out GFNI because normally it implies AVX512, and there is no
// latency data for CPU with GFNI and SSE or AVX only, but there are tests for
// such combination anyways.
if (ConstantAmt &&
The lowering scheme immediately above this is very similar to what you're doing (and a lot easier to grok) - I'd recommend you look at extending that code instead of introducing this separate implementation.
The code above handles shift widening when adjacent pairs have the same shift amount. My patch tries to find a permutation to create such pairs, but does not perform the widening itself (it hands that off to the code above), so it is in fact different functionality and is better left in a separate section.
GFNI isn't AVX512-only - everything since Alderlake (P + E cores) has it, as well as some recent Atom cores (Tremont onwards).
For now I have made my patch mutually exclusive with GFNI (so if GFNI exists on the target CPU, the GFNI lowering will be applied and my transformation will not).
…ements. We have several vector shift lowering strategies that have to analyse the distribution of non-uniform constant vector shift amounts; at the moment there is very little sharing of data between these analyses. This patch creates a std::map of the different LEGAL constant shift amounts used, with a mask of which elements they are used in. So far I've only updated the shuffle(immshift(x,c1),immshift(x,c2)) lowering pattern to use it for clarity; there are several more that can be done in followups. It's hoped that the proposed patch llvm#117980 can be simplified after this patch as well. vec_shift6.ll - the existing shuffle(immshift(x,c1),immshift(x,c2)) lowering bails on out-of-range shift amounts, while this patch now skips them and treats them as UNDEF - this means we manage to fold more cases that before would have had to lower to a SHL->MUL pattern, including some legalized cases.
…ements. (#120270) We have several vector shift lowering strategies that have to analyse the distribution of non-uniform constant vector shift amounts; at the moment there is very little sharing of data between these analyses. This patch creates a SmallDenseMap of the different LEGAL constant shift amounts used, with a mask of which elements they are used in. So far I've only updated the shuffle(immshift(x,c1),immshift(x,c2)) lowering pattern to use it for clarity; there are several more that can be done in followups. It's hoped that the proposed patch #117980 can be simplified after this patch as well. vec_shift6.ll - the existing shuffle(immshift(x,c1),immshift(x,c2)) lowering bails on out-of-range shift amounts, while this patch now skips them and treats them as UNDEF - this means we manage to fold more cases that before would have had to lower to a SHL->MUL pattern, including some legalized cases.
@RKSimon How would pr120270 be incorporated into this patch? I saw it being mentioned there.
…ifferent latency on AMD Zen+, 2 and 3 CPU
UniqueCstAmt provides a lot of data that this patch currently has to derive itself - it should be a lot more straightforward to create a pair of shuffle permutations that pack/unpack the matching shift amounts together. I'd start with a simple iteration over the map and create the shuffle masks that pack the shifts into order; don't overcomplicate things.
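A minimal sketch of this suggested direction, assuming a UniqueCstAmt-style map from each legal constant shift amount to the element indices that use it. The map type, helper name, and plain STL containers below are illustrative assumptions, not the actual LLVM data structures, and lane boundaries and undef elements are deliberately ignored.

#include <cstdint>
#include <cstdio>
#include <map>
#include <optional>
#include <utility>
#include <vector>

// Build a "pack" shuffle mask that groups elements with equal shift amounts
// into adjacent positions, plus its inverse to restore the original order.
// Returns nullopt if some amount is used an odd number of times (it could not
// fill whole even/odd pairs) or some element has no constant amount.
static std::optional<std::pair<std::vector<int>, std::vector<int>>>
buildPackMasks(const std::map<uint8_t, std::vector<int>> &AmtToElts,
               int NumElts) {
  std::vector<int> Pack;
  Pack.reserve(NumElts);
  for (const auto &[Amt, Elts] : AmtToElts) {
    (void)Amt; // the amount itself feeds the widened shift vector, not the masks
    if (Elts.size() % 2 != 0)
      return std::nullopt; // unpaired amount: leave it to other lowerings
    Pack.insert(Pack.end(), Elts.begin(), Elts.end());
  }
  if (static_cast<int>(Pack.size()) != NumElts)
    return std::nullopt; // e.g. undef elements, not modelled here
  std::vector<int> Unpack(NumElts);
  for (int I = 0; I < NumElts; ++I)
    Unpack[Pack[I]] = I; // inverse permutation
  return std::make_pair(Pack, Unpack);
}

int main() {
  // Toy 8-element case: amount 1 used by elements {0,3,5,6}, amount 3 by the
  // rest. shuffle(x, Pack) groups them, the widened shift uses amounts
  // 1,1,1,1,3,3,3,3, and shuffle(result, Unpack) restores the original order.
  std::map<uint8_t, std::vector<int>> AmtToElts = {{1, {0, 3, 5, 6}},
                                                   {3, {1, 2, 4, 7}}};
  if (auto Masks = buildPackMasks(AmtToElts, 8)) {
    for (int M : Masks->first)
      std::printf("%d ", M);
    std::printf("| ");
    for (int M : Masks->second)
      std::printf("%d ", M);
    std::printf("\n"); // prints: 0 3 5 6 1 2 4 7 | 0 4 5 1 6 2 3 7
  }
  return 0;
}

The inverse mask is valid as a shuffle mask because shuffle(shuffle(x, Pack), Unpack) puts every element back in its original position.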