
[LV]: Teach LV to recursively (de)interleave. #89018


Merged · 13 commits · Dec 27, 2024

Conversation

hassnaaHamdi
Member

@hassnaaHamdi hassnaaHamdi commented Apr 17, 2024

Currently the available intrinsics map only to ld2/st2, which don't support an interleaving factor greater than 2.
This patch teaches the LoopVectorizer to use ld2/st2 recursively to support higher interleaving factors.

@llvmbot
Member

llvmbot commented Apr 17, 2024

@llvm/pr-subscribers-backend-risc-v

@llvm/pr-subscribers-backend-aarch64

Author: Hassnaa Hamdi (hassnaaHamdi)

Changes
  • Given an array of structs like this: struct xyzt { int x; int y; int z; int t; },
    LoopVectorize can't use scalable vectors to vectorize it,
    because scalable vectors have to use intrinsics to deinterleave,
    but no (de)interleave4 intrinsic is available.
  • This patch applies (de)interleave2 recursively to get the same results as (de)interleave4 would give,
    so the vectorizer can use scalable vectors.
  • Example: given vector.deinterleave2.nxv16i32(<vscale x 16 x i32> %vec),
    it will be deinterleaved into { <vscale x 8 x i32>, <vscale x 8 x i32> },
    then each extracted <vscale x 8 x i32> will be deinterleaved into { <vscale x 4 x i32>, <vscale x 4 x i32> },
    so the final result is { <vscale x 4 x i32>, <vscale x 4 x i32>, <vscale x 4 x i32>, <vscale x 4 x i32> },
    which is the same result deinterleave4 would produce.
  • Finally, targets that have factor-4 instructions can spot that sequence of (de)interleave2 calls and replace it with a single factor-4 operation.
  • This solution is expected to work for any interleaving factor that is a power of 2, as long as the target has the equivalent instructions.
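The recursive splitting can be sketched in Python (a model only, not LLVM code: plain lists stand in for scalable vectors, and deinterleave2 is modeled as an even/odd lane split):

```python
def deinterleave2(v):
    # Model of llvm.vector.deinterleave2: even lanes, then odd lanes.
    return v[0::2], v[1::2]

v = list(range(16))          # stand-in for <vscale x 16 x i32>
lo, hi = deinterleave2(v)    # two "nxv8i32" halves
r0, r1 = deinterleave2(lo)   # four "nxv4i32" leaves
r2, r3 = deinterleave2(hi)
# Together the four leaves hold exactly the four stride-4 subsequences
# that a factor-4 deinterleave would produce.
print(r0, r1, r2, r3)
```

Note that the leaves come out in tree order (strides 0, 2, 1, 3 here), which is why the lowering tracks which extract maps to which field of the wide load.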

Patch is 32.23 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/89018.diff

10 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/TargetLowering.h (+4)
  • (modified) llvm/lib/CodeGen/InterleavedAccessPass.cpp (+66-6)
  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+38-11)
  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.h (+2)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (+5-3)
  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.cpp (+37-4)
  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.h (+3-1)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+48-16)
  • (added) llvm/test/CodeGen/AArch64/sve-deinterleave-load.ll (+89)
  • (added) llvm/test/CodeGen/RISCV/rvv/sve-deinterleave-load.ll (+74)
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index e0ade02959025f..e233d430e98dd5 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -59,6 +59,8 @@
 #include <string>
 #include <utility>
 #include <vector>
+#include <stack>
+#include <queue>
 
 namespace llvm {
 
@@ -3145,6 +3147,7 @@ class TargetLoweringBase {
   /// \p DI is the deinterleave intrinsic.
   /// \p LI is the accompanying load instruction
   virtual bool lowerDeinterleaveIntrinsicToLoad(IntrinsicInst *DI,
+                                                std::queue<std::pair<unsigned, Value*>>& LeafNodes,
                                                 LoadInst *LI) const {
     return false;
   }
@@ -3156,6 +3159,7 @@ class TargetLoweringBase {
   /// \p II is the interleave intrinsic.
   /// \p SI is the accompanying store instruction
   virtual bool lowerInterleaveIntrinsicToStore(IntrinsicInst *II,
+                                               std::queue<Value*>& LeafNodes,
                                                StoreInst *SI) const {
     return false;
   }
diff --git a/llvm/lib/CodeGen/InterleavedAccessPass.cpp b/llvm/lib/CodeGen/InterleavedAccessPass.cpp
index 438ac1c3cc6e2c..73c3a63b61da3b 100644
--- a/llvm/lib/CodeGen/InterleavedAccessPass.cpp
+++ b/llvm/lib/CodeGen/InterleavedAccessPass.cpp
@@ -71,6 +71,7 @@
 #include "llvm/Transforms/Utils/Local.h"
 #include <cassert>
 #include <utility>
+#include <queue>
 
 using namespace llvm;
 
@@ -510,12 +511,52 @@ bool InterleavedAccessImpl::lowerDeinterleaveIntrinsic(
 
   LLVM_DEBUG(dbgs() << "IA: Found a deinterleave intrinsic: " << *DI << "\n");
 
+  std::stack<IntrinsicInst*> DeinterleaveTreeQueue;
+  std::queue<std::pair<unsigned, Value*>> LeafNodes;
+  std::map<IntrinsicInst*, bool>mp;
+  SmallVector<Instruction *> TempDeadInsts;
+
+  DeinterleaveTreeQueue.push(DI);
+  unsigned DILeafCount = 0;
+  while(!DeinterleaveTreeQueue.empty()) {
+    auto CurrentDI = DeinterleaveTreeQueue.top();
+    DeinterleaveTreeQueue.pop();
+    TempDeadInsts.push_back(CurrentDI);
+    bool RootFound = false;
+    for (auto UserExtract : CurrentDI->users()) { // iterate over extract users of deinterleave
+      Instruction *Extract = dyn_cast<Instruction>(UserExtract);
+      if (!Extract || Extract->getOpcode() != Instruction::ExtractValue)
+        continue;
+      bool IsLeaf = true;
+      for (auto UserDI : UserExtract->users()) { // iterate over deinterleave users of extract
+        IntrinsicInst *Child_DI = dyn_cast<IntrinsicInst>(UserDI);
+        if (!Child_DI || 
+            Child_DI->getIntrinsicID() != Intrinsic::experimental_vector_deinterleave2)
+            continue;
+        IsLeaf = false;
+        if (mp.count(Child_DI) == 0) {
+          DeinterleaveTreeQueue.push(Child_DI);
+        }
+        continue;
+      }
+      if (IsLeaf) {
+        RootFound = true;
+        LeafNodes.push(std::make_pair(DILeafCount, UserExtract));
+        TempDeadInsts.push_back(Extract);
+      }
+      else {
+        TempDeadInsts.push_back(Extract);
+      }
+    }
+    if (RootFound)
+      DILeafCount += CurrentDI->getNumUses();
+  }
   // Try and match this with target specific intrinsics.
-  if (!TLI->lowerDeinterleaveIntrinsicToLoad(DI, LI))
+  if (!TLI->lowerDeinterleaveIntrinsicToLoad(DI, LeafNodes, LI))
     return false;
 
   // We now have a target-specific load, so delete the old one.
-  DeadInsts.push_back(DI);
+  DeadInsts.insert(DeadInsts.end(), TempDeadInsts.rbegin(), TempDeadInsts.rend());
   DeadInsts.push_back(LI);
   return true;
 }
@@ -531,14 +572,33 @@ bool InterleavedAccessImpl::lowerInterleaveIntrinsic(
     return false;
 
   LLVM_DEBUG(dbgs() << "IA: Found an interleave intrinsic: " << *II << "\n");
-
+  std::queue<IntrinsicInst*> IeinterleaveTreeQueue;
+  std::queue<Value*> LeafNodes;
+  SmallVector<Instruction *> TempDeadInsts;
+
+  IeinterleaveTreeQueue.push(II);
+  while(!IeinterleaveTreeQueue.empty()) {
+    auto node = IeinterleaveTreeQueue.front();
+    TempDeadInsts.push_back(node);
+    IeinterleaveTreeQueue.pop();
+    for(unsigned i = 0; i < 2; i++) {
+      auto op = node->getOperand(i);
+      if(auto CurrentII = dyn_cast<IntrinsicInst>(op)) {
+        if (CurrentII->getIntrinsicID() != Intrinsic::experimental_vector_interleave2)
+            continue;
+        IeinterleaveTreeQueue.push(CurrentII);
+        continue;
+      }
+      LeafNodes.push(op);
+    }
+  }
   // Try and match this with target specific intrinsics.
-  if (!TLI->lowerInterleaveIntrinsicToStore(II, SI))
+  if (!TLI->lowerInterleaveIntrinsicToStore(II, LeafNodes, SI))
     return false;
 
   // We now have a target-specific store, so delete the old one.
   DeadInsts.push_back(SI);
-  DeadInsts.push_back(II);
+  DeadInsts.insert(DeadInsts.end(), TempDeadInsts.begin(), TempDeadInsts.end());
   return true;
 }
 
@@ -559,7 +619,7 @@ bool InterleavedAccessImpl::runOnFunction(Function &F) {
       // with a factor of 2.
       if (II->getIntrinsicID() == Intrinsic::experimental_vector_deinterleave2)
         Changed |= lowerDeinterleaveIntrinsic(II, DeadInsts);
-      if (II->getIntrinsicID() == Intrinsic::experimental_vector_interleave2)
+      else if (II->getIntrinsicID() == Intrinsic::experimental_vector_interleave2)
         Changed |= lowerInterleaveIntrinsic(II, DeadInsts);
     }
   }
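Both worklist walks in InterleavedAccessPass.cpp (the load-side deinterleave walk and the store-side interleave walk) reduce to the same idea: flatten a balanced binary tree of factor-2 nodes into its leaf operands. A toy model of the store-side walk, with nested tuples standing in for interleave2 calls (names are illustrative, not from the patch):

```python
from collections import deque

def collect_leaves(node):
    # BFS over a binary interleave tree; tuples model interleave2 nodes,
    # anything else is a leaf vector that would feed the wide stN store.
    leaves, work = [], deque([node])
    while work:
        n = work.popleft()
        for op in n:
            if isinstance(op, tuple):   # a nested interleave2 call
                work.append(op)
            else:
                leaves.append(op)
    return leaves

# interleave2(interleave2(a, b), interleave2(c, d))
tree = (("a", "b"), ("c", "d"))
print(collect_leaves(tree))
```

The number of leaves collected is then the interleave factor passed on to the target hook.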
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 7947d73f9a4dd0..ab8c01e2df5a9a 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -16345,15 +16345,15 @@ bool AArch64TargetLowering::lowerInterleavedStore(StoreInst *SI,
 }
 
 bool AArch64TargetLowering::lowerDeinterleaveIntrinsicToLoad(
-    IntrinsicInst *DI, LoadInst *LI) const {
+    IntrinsicInst *DI, std::queue<std::pair<unsigned, llvm::Value*>>& LeafNodes, LoadInst *LI) const {
   // Only deinterleave2 supported at present.
   if (DI->getIntrinsicID() != Intrinsic::experimental_vector_deinterleave2)
     return false;
 
-  // Only a factor of 2 supported at present.
-  const unsigned Factor = 2;
+  const unsigned Factor = std::max(2, (int)LeafNodes.size());
 
-  VectorType *VTy = cast<VectorType>(DI->getType()->getContainedType(0));
+  VectorType *VTy = (LeafNodes.size() > 0) ? cast<VectorType>(LeafNodes.front().second->getType()) :
+                    cast<VectorType>(DI->getType()->getContainedType(0));
   const DataLayout &DL = DI->getModule()->getDataLayout();
   bool UseScalable;
   if (!isLegalInterleavedAccessType(VTy, DL, UseScalable))
@@ -16409,8 +16409,27 @@ bool AArch64TargetLowering::lowerDeinterleaveIntrinsicToLoad(
     Result = Builder.CreateInsertValue(Result, Left, 0);
     Result = Builder.CreateInsertValue(Result, Right, 1);
   } else {
-    if (UseScalable)
+    if (UseScalable) {
       Result = Builder.CreateCall(LdNFunc, {Pred, BaseAddr}, "ldN");
+      if (Factor == 2) {
+        DI->replaceAllUsesWith(Result);
+        return true;
+      }
+      while (!LeafNodes.empty()) {
+        unsigned ExtractIndex = LeafNodes.front().first;
+        llvm::Value* CurrentExtract = LeafNodes.front().second;
+        LeafNodes.pop();
+        ExtractValueInst* ExtractValueInst = dyn_cast<llvm::ExtractValueInst>(CurrentExtract);
+      
+        SmallVector<unsigned, 4> NewIndices;
+        for (auto index : ExtractValueInst->indices())
+          NewIndices.push_back(index + ExtractIndex);
+
+        Value *extrc =Builder.CreateExtractValue(Result, NewIndices);
+        CurrentExtract->replaceAllUsesWith(extrc);
+      }
+      return true;
+    }
     else
       Result = Builder.CreateCall(LdNFunc, BaseAddr, "ldN");
   }
@@ -16420,15 +16439,15 @@ bool AArch64TargetLowering::lowerDeinterleaveIntrinsicToLoad(
 }
 
 bool AArch64TargetLowering::lowerInterleaveIntrinsicToStore(
-    IntrinsicInst *II, StoreInst *SI) const {
+    IntrinsicInst *II, std::queue<Value*>& LeafNodes, StoreInst *SI) const {
   // Only interleave2 supported at present.
   if (II->getIntrinsicID() != Intrinsic::experimental_vector_interleave2)
     return false;
 
-  // Only a factor of 2 supported at present.
-  const unsigned Factor = 2;
+  // leaf nodes are the nodes that will be interleaved
+  const unsigned Factor = LeafNodes.size();
 
-  VectorType *VTy = cast<VectorType>(II->getOperand(0)->getType());
+  VectorType *VTy = cast<VectorType>(LeafNodes.front()->getType());
   const DataLayout &DL = II->getModule()->getDataLayout();
   bool UseScalable;
   if (!isLegalInterleavedAccessType(VTy, DL, UseScalable))
@@ -16473,8 +16492,16 @@ bool AArch64TargetLowering::lowerInterleaveIntrinsicToStore(
       R = Builder.CreateExtractVector(StTy, II->getOperand(1), Idx);
     }
 
-    if (UseScalable)
-      Builder.CreateCall(StNFunc, {L, R, Pred, Address});
+    if (UseScalable) {
+      SmallVector<Value *> Args;
+      while (!LeafNodes.empty()) {
+        Args.push_back(LeafNodes.front());
+        LeafNodes.pop();
+      }
+      Args.push_back(Pred);
+      Args.push_back(Address);
+      Builder.CreateCall(StNFunc, Args);
+    }
     else
       Builder.CreateCall(StNFunc, {L, R, Address});
   }
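In the AArch64 factor > 2 path above, each surviving leaf extractvalue (originally indexing 0 or 1 into an inner deinterleave2 result) is rebased onto the single ldN aggregate by adding a per-subtree base offset to its local index. A toy model of that remapping (the `rebase` helper is hypothetical, not part of the patch):

```python
def rebase(leaves):
    # leaves: (base_offset, local_index) pairs, one per surviving
    # extractvalue of an inner deinterleave2.  Returns the field of the
    # wide ldN aggregate that each extract should read instead.
    return [base + idx for base, idx in leaves]

# Factor 4: two inner deinterleave2 results with base offsets 0 and 2.
leaves = [(0, 0), (0, 1), (2, 0), (2, 1)]
print(rebase(leaves))   # one field per lane group of the ld4 result
```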
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.h b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
index db6e8a00d2fb5e..85497a1f7ae41a 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.h
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
@@ -683,9 +683,11 @@ class AArch64TargetLowering : public TargetLowering {
                              unsigned Factor) const override;
 
   bool lowerDeinterleaveIntrinsicToLoad(IntrinsicInst *DI,
+                                        std::queue<std::pair<unsigned, Value*>>& LeafNodes,
                                         LoadInst *LI) const override;
 
   bool lowerInterleaveIntrinsicToStore(IntrinsicInst *II,
+                                       std::queue<Value*>& LeafNodes,
                                        StoreInst *SI) const override;
 
   bool isLegalAddImmediate(int64_t) const override;
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index e80931a03f30b6..35150928f0adb0 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -3315,15 +3315,17 @@ InstructionCost AArch64TTIImpl::getInterleavedMemoryOpCost(
   assert(Factor >= 2 && "Invalid interleave factor");
   auto *VecVTy = cast<VectorType>(VecTy);
 
-  if (VecTy->isScalableTy() && (!ST->hasSVE() || Factor != 2))
-    return InstructionCost::getInvalid();
+ unsigned MaxFactor = TLI->getMaxSupportedInterleaveFactor();
+ if (VecTy->isScalableTy() &&
+    (!ST->hasSVE() || Factor > MaxFactor))
+   return InstructionCost::getInvalid();
 
   // Vectorization for masked interleaved accesses is only enabled for scalable
   // VF.
   if (!VecTy->isScalableTy() && (UseMaskForCond || UseMaskForGaps))
     return InstructionCost::getInvalid();
 
-  if (!UseMaskForGaps && Factor <= TLI->getMaxSupportedInterleaveFactor()) {
+  if (!UseMaskForGaps && Factor <= MaxFactor) {
     unsigned MinElts = VecVTy->getElementCount().getKnownMinValue();
     auto *SubVecTy =
         VectorType::get(VecVTy->getElementType(),
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index dc7c6f83b98579..64e0a2bb1f2942 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -21025,6 +21025,7 @@ bool RISCVTargetLowering::lowerInterleavedStore(StoreInst *SI,
 }
 
 bool RISCVTargetLowering::lowerDeinterleaveIntrinsicToLoad(IntrinsicInst *DI,
+                                                           std::queue<std::pair<unsigned, Value*>>& LeafNodes,
                                                            LoadInst *LI) const {
   assert(LI->isSimple());
   IRBuilder<> Builder(LI);
@@ -21033,10 +21034,11 @@ bool RISCVTargetLowering::lowerDeinterleaveIntrinsicToLoad(IntrinsicInst *DI,
   if (DI->getIntrinsicID() != Intrinsic::experimental_vector_deinterleave2)
     return false;
 
-  unsigned Factor = 2;
+  unsigned Factor = std::max(2, (int)LeafNodes.size());
 
   VectorType *VTy = cast<VectorType>(DI->getOperand(0)->getType());
-  VectorType *ResVTy = cast<VectorType>(DI->getType()->getContainedType(0));
+  VectorType *ResVTy = (LeafNodes.size() > 0) ? cast<VectorType>(LeafNodes.front().second->getType()) :
+                        cast<VectorType>(DI->getType()->getContainedType(0));
 
   if (!isLegalInterleavedAccessType(ResVTy, Factor, LI->getAlign(),
                                     LI->getPointerAddressSpace(),
@@ -21064,6 +21066,27 @@ bool RISCVTargetLowering::lowerDeinterleaveIntrinsicToLoad(IntrinsicInst *DI,
                                            {ResVTy, XLenTy});
     VL = Constant::getAllOnesValue(XLenTy);
     Ops.append(Factor, PoisonValue::get(ResVTy));
+    Ops.append({LI->getPointerOperand(), VL});
+    Value *Vlseg = Builder.CreateCall(VlsegNFunc, Ops);
+    //-----------
+    if (Factor == 2) {
+      DI->replaceAllUsesWith(Vlseg);
+      return true;
+    }
+    unsigned ExtractIndex = 0;
+    while (!LeafNodes.empty()) {
+      ExtractIndex = LeafNodes.front().first;
+      auto CurrentExtract = LeafNodes.front().second;
+      LeafNodes.pop();
+      ExtractValueInst* ExtractValueInst = dyn_cast<llvm::ExtractValueInst>(CurrentExtract);
+      SmallVector<unsigned, 4> NewIndices;
+      for (auto index : ExtractValueInst->indices()) {
+        NewIndices.push_back(index + ExtractIndex);
+      }
+      Value *extrc = Builder.CreateExtractValue(Vlseg, NewIndices);
+      CurrentExtract->replaceAllUsesWith(extrc);
+    }
+    return true;
   }
 
   Ops.append({LI->getPointerOperand(), VL});
@@ -21075,6 +21098,7 @@ bool RISCVTargetLowering::lowerDeinterleaveIntrinsicToLoad(IntrinsicInst *DI,
 }
 
 bool RISCVTargetLowering::lowerInterleaveIntrinsicToStore(IntrinsicInst *II,
+                                                          std::queue<Value*>& LeafNodes,
                                                           StoreInst *SI) const {
   assert(SI->isSimple());
   IRBuilder<> Builder(SI);
@@ -21083,10 +21107,10 @@ bool RISCVTargetLowering::lowerInterleaveIntrinsicToStore(IntrinsicInst *II,
   if (II->getIntrinsicID() != Intrinsic::experimental_vector_interleave2)
     return false;
 
-  unsigned Factor = 2;
+  unsigned Factor = LeafNodes.size();
 
   VectorType *VTy = cast<VectorType>(II->getType());
-  VectorType *InVTy = cast<VectorType>(II->getOperand(0)->getType());
+  VectorType *InVTy = cast<VectorType>(LeafNodes.front()->getType());
 
   if (!isLegalInterleavedAccessType(InVTy, Factor, SI->getAlign(),
                                     SI->getPointerAddressSpace(),
@@ -21112,6 +21136,15 @@ bool RISCVTargetLowering::lowerInterleaveIntrinsicToStore(IntrinsicInst *II,
     VssegNFunc = Intrinsic::getDeclaration(SI->getModule(), IntrIds[Factor - 2],
                                            {InVTy, XLenTy});
     VL = Constant::getAllOnesValue(XLenTy);
+    SmallVector<Value *> Args;
+      while (!LeafNodes.empty()) {
+        Args.push_back(LeafNodes.front());
+        LeafNodes.pop();
+      }
+      Args.push_back(SI->getPointerOperand());
+      Args.push_back(VL);
+      Builder.CreateCall(VssegNFunc, Args);
+      return true;
   }
 
   Builder.CreateCall(VssegNFunc, {II->getOperand(0), II->getOperand(1),
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.h b/llvm/lib/Target/RISCV/RISCVISelLowering.h
index b10da3d40befb7..1f104cf3bc15d5 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.h
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.h
@@ -855,10 +855,12 @@ class RISCVTargetLowering : public TargetLowering {
   bool lowerInterleavedStore(StoreInst *SI, ShuffleVectorInst *SVI,
                              unsigned Factor) const override;
 
-  bool lowerDeinterleaveIntrinsicToLoad(IntrinsicInst *II,
+  bool lowerDeinterleaveIntrinsicToLoad(IntrinsicInst *DI,
+                                        std::queue<std::pair<unsigned, Value*>>& LeafNodes,
                                         LoadInst *LI) const override;
 
   bool lowerInterleaveIntrinsicToStore(IntrinsicInst *II,
+                                       std::queue<Value*>& LeafNodes,
                                        StoreInst *SI) const override;
 
   bool supportKCFIBundles() const override { return true; }
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 2057cab46135ff..41f8c5a72ce1e7 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -154,6 +154,7 @@
 #include <string>
 #include <tuple>
 #include <utility>
+#include <queue>
 
 using namespace llvm;
 
@@ -459,10 +460,23 @@ static Value *interleaveVectors(IRBuilderBase &Builder, ArrayRef<Value *> Vals,
   // Scalable vectors cannot use arbitrary shufflevectors (only splats), so
   // must use intrinsics to interleave.
   if (VecTy->isScalableTy()) {
-    VectorType *WideVecTy = VectorType::getDoubleElementsVectorType(VecTy);
-    return Builder.CreateIntrinsic(
-        WideVecTy, Intrinsic::experimental_vector_interleave2, Vals,
-        /*FMFSource=*/nullptr, Name);
+    SmallVector<Value *> Vecs(Vals);
+    unsigned AllNodesNum = (2*Vals.size()) - 1;
+    // last element in the vec should be the final interleaved result,
+    // so, skip processing last element.
+    AllNodesNum --;
+    // interleave each 2 consecutive nodes, and push result to the vec,
+    // so that we can interleave the interleaved results again if we have
+    // more than 2 vectors to interleave.
+    for (unsigned i = 0; i < AllNodesNum; i +=2) {
+      VectorType *VecTy = cast<VectorType>(Vecs[i]->getType());
+      VectorType *WideVecTy = VectorType::getDoubleElementsVectorType(VecTy);
+      auto InterleavedVec = Builder.CreateIntrinsic(
+        WideVecTy, Intrinsic::experimental_vector_interleave2,
+        {Vecs[i], Vecs[i+1]}, /*FMFSource=*/nullptr, Name);
+      Vecs.push_back(InterleavedVec);
+    }
+    return Vecs[Vecs.size()-1];
   }
 
   // Fixed length. Start by concatenating all vectors into a wide vector.
@@ -2519,7 +2533,7 @@ void InnerLoopVectorizer::vectorizeInterleaveGroup(
                              unsigned Part, Value *MaskForGaps) -> Value * {
     if (VF.isScalable()) {
       assert(!MaskForGaps && "Interleaved groups with gaps are not supported.");
-      assert(InterleaveFactor == 2 &&
+      assert(isPowerOf2_32(InterleaveFactor)  &&
              "Unsupported deinterleave factor for scalable vectors");
       auto *BlockInMaskPart = State.get(BlockInMask, Part);
       SmallVector<Value *, 2> Ops = {BlockInMaskPart, BlockInMaskPart};
@@ -2572,23 +2586,40 @@ void InnerLoopVectorizer::vectorizeInterleaveGroup(
     }
 
     if (VecTy->isScalableTy()) {
-      assert(InterleaveFactor == 2 &&
-             "Unsupported deinterleave factor for scalable vectors");
-
+      assert(isPowerOf2_32(InterleaveFactor)  &&
+            "Unsupported deinterleave factor for scalable vectors");
       for (unsigned Part = 0; Part < UF; ++Part) {
         // Scalable vectors cannot use arbitrary shufflevectors (only splats),
         // so must use intrinsics to deinterleave.
-        Value *DI = Builder.CreateIntrinsic(
-            Intrinsic::experimental_vector_deinterleave2, VecTy, NewLoads[Part],
-            /*FMFSource=*/nullptr, "strided.vec");
+        
+        std::queue<Value *>Queue;
+        Queue.push(NewLoads[Part]);
+        // NonLeaf represents how many times we will do deinterleaving,
+        // think of it as a tree, each node will be deinterleaved, untill we reach to
+        // the leaf nodes which will be the final results of deinterleaving.
+        unsigned NonLeaf = InterleaveFactor - 1;
+        for (unsigned i = 0; i < NonLeaf; i ++) {
+          auto Node = Queue.front();
+          Queue.pop();
+          auto DeinterleaveType = Node->getType();
+          Value *DI = Builder.CreateIntrinsic(
+            Intrinsic::experimental_vector_deinterleave2, DeinterleaveType, Node,
+            /*FMFSource=*/nullptr, "root.strided.vec");
+          Value *StridedVec1 = Builder.CreateExtractValue(DI, 0);
+          Value *Strid...
[truncated]
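The pairwise loop added to interleaveVectors can be modeled like this (a sketch only: lists stand in for vectors, and each pass appends the interleave2 of two consecutive worklist entries until a single wide vector remains):

```python
def interleave2(a, b):
    # Model of llvm.vector.interleave2: alternate lanes of a and b.
    out = []
    for x, y in zip(a, b):
        out += [x, y]
    return out

vals = [["a"], ["b"], ["c"], ["d"]]   # four single-lane leaves
nodes = list(vals)
i = 0
while i + 1 < len(nodes):             # processes (2*N - 1) - 1 nodes
    nodes.append(interleave2(nodes[i], nodes[i + 1]))
    i += 2
wide = nodes[-1]
print(wide)                            # -> ['a', 'c', 'b', 'd']
```

Note the tree lane order: the result is the factor-4 interleave of (vals[0], vals[2], vals[1], vals[3]). That permutation is its own inverse, so pairing this with the matching recursive deinterleave2 tree on the load side round-trips the data consistently.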

@llvmbot
Member

llvmbot commented Apr 17, 2024

@llvm/pr-subscribers-llvm-transforms

@@ -16473,8 +16492,16 @@ bool AArch64TargetLowering::lowerInterleaveIntrinsicToStore(
       R = Builder.CreateExtractVector(StTy, II->getOperand(1), Idx);
     }
 
-    if (UseScalable)
-      Builder.CreateCall(StNFunc, {L, R, Pred, Address});
+    if (UseScalable) {
+      SmallVector<Value *> Args;
+      while (!LeafNodes.empty()) {
+        Args.push_back(LeafNodes.front());
+        LeafNodes.pop();
+      }
+      Args.push_back(Pred);
+      Args.push_back(Address);
+      Builder.CreateCall(StNFunc, Args);
+    }
     else
       Builder.CreateCall(StNFunc, {L, R, Address});
   }
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.h b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
index db6e8a00d2fb5e..85497a1f7ae41a 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.h
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
@@ -683,9 +683,11 @@ class AArch64TargetLowering : public TargetLowering {
                              unsigned Factor) const override;
 
   bool lowerDeinterleaveIntrinsicToLoad(IntrinsicInst *DI,
+                                        std::queue<std::pair<unsigned, Value*>>& LeafNodes,
                                         LoadInst *LI) const override;
 
   bool lowerInterleaveIntrinsicToStore(IntrinsicInst *II,
+                                       std::queue<Value*>& LeafNodes,
                                        StoreInst *SI) const override;
 
   bool isLegalAddImmediate(int64_t) const override;
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index e80931a03f30b6..35150928f0adb0 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -3315,15 +3315,17 @@ InstructionCost AArch64TTIImpl::getInterleavedMemoryOpCost(
   assert(Factor >= 2 && "Invalid interleave factor");
   auto *VecVTy = cast<VectorType>(VecTy);
 
-  if (VecTy->isScalableTy() && (!ST->hasSVE() || Factor != 2))
-    return InstructionCost::getInvalid();
+ unsigned MaxFactor = TLI->getMaxSupportedInterleaveFactor();
+ if (VecTy->isScalableTy() &&
+    (!ST->hasSVE() || Factor > MaxFactor))
+   return InstructionCost::getInvalid();
 
   // Vectorization for masked interleaved accesses is only enabled for scalable
   // VF.
   if (!VecTy->isScalableTy() && (UseMaskForCond || UseMaskForGaps))
     return InstructionCost::getInvalid();
 
-  if (!UseMaskForGaps && Factor <= TLI->getMaxSupportedInterleaveFactor()) {
+  if (!UseMaskForGaps && Factor <= MaxFactor) {
     unsigned MinElts = VecVTy->getElementCount().getKnownMinValue();
     auto *SubVecTy =
         VectorType::get(VecVTy->getElementType(),
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index dc7c6f83b98579..64e0a2bb1f2942 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -21025,6 +21025,7 @@ bool RISCVTargetLowering::lowerInterleavedStore(StoreInst *SI,
 }
 
 bool RISCVTargetLowering::lowerDeinterleaveIntrinsicToLoad(IntrinsicInst *DI,
+                                                           std::queue<std::pair<unsigned, Value*>>& LeafNodes,
                                                            LoadInst *LI) const {
   assert(LI->isSimple());
   IRBuilder<> Builder(LI);
@@ -21033,10 +21034,11 @@ bool RISCVTargetLowering::lowerDeinterleaveIntrinsicToLoad(IntrinsicInst *DI,
   if (DI->getIntrinsicID() != Intrinsic::experimental_vector_deinterleave2)
     return false;
 
-  unsigned Factor = 2;
+  unsigned Factor = std::max(2, (int)LeafNodes.size());
 
   VectorType *VTy = cast<VectorType>(DI->getOperand(0)->getType());
-  VectorType *ResVTy = cast<VectorType>(DI->getType()->getContainedType(0));
+  VectorType *ResVTy = (LeafNodes.size() > 0) ? cast<VectorType>(LeafNodes.front().second->getType()) :
+                        cast<VectorType>(DI->getType()->getContainedType(0));
 
   if (!isLegalInterleavedAccessType(ResVTy, Factor, LI->getAlign(),
                                     LI->getPointerAddressSpace(),
@@ -21064,6 +21066,27 @@ bool RISCVTargetLowering::lowerDeinterleaveIntrinsicToLoad(IntrinsicInst *DI,
                                            {ResVTy, XLenTy});
     VL = Constant::getAllOnesValue(XLenTy);
     Ops.append(Factor, PoisonValue::get(ResVTy));
+    Ops.append({LI->getPointerOperand(), VL});
+    Value *Vlseg = Builder.CreateCall(VlsegNFunc, Ops);
+    //-----------
+    if (Factor == 2) {
+      DI->replaceAllUsesWith(Vlseg);
+      return true;
+    }
+    unsigned ExtractIndex = 0;
+    while (!LeafNodes.empty()) {
+      ExtractIndex = LeafNodes.front().first;
+      auto CurrentExtract = LeafNodes.front().second;
+      LeafNodes.pop();
+      ExtractValueInst* ExtractValueInst = dyn_cast<llvm::ExtractValueInst>(CurrentExtract);
+      SmallVector<unsigned, 4> NewIndices;
+      for (auto index : ExtractValueInst->indices()) {
+        NewIndices.push_back(index + ExtractIndex);
+      }
+      Value *extrc = Builder.CreateExtractValue(Vlseg, NewIndices);
+      CurrentExtract->replaceAllUsesWith(extrc);
+    }
+    return true;
   }
 
   Ops.append({LI->getPointerOperand(), VL});
@@ -21075,6 +21098,7 @@ bool RISCVTargetLowering::lowerDeinterleaveIntrinsicToLoad(IntrinsicInst *DI,
 }
 
 bool RISCVTargetLowering::lowerInterleaveIntrinsicToStore(IntrinsicInst *II,
+                                                          std::queue<Value*>& LeafNodes,
                                                           StoreInst *SI) const {
   assert(SI->isSimple());
   IRBuilder<> Builder(SI);
@@ -21083,10 +21107,10 @@ bool RISCVTargetLowering::lowerInterleaveIntrinsicToStore(IntrinsicInst *II,
   if (II->getIntrinsicID() != Intrinsic::experimental_vector_interleave2)
     return false;
 
-  unsigned Factor = 2;
+  unsigned Factor = LeafNodes.size();
 
   VectorType *VTy = cast<VectorType>(II->getType());
-  VectorType *InVTy = cast<VectorType>(II->getOperand(0)->getType());
+  VectorType *InVTy = cast<VectorType>(LeafNodes.front()->getType());
 
   if (!isLegalInterleavedAccessType(InVTy, Factor, SI->getAlign(),
                                     SI->getPointerAddressSpace(),
@@ -21112,6 +21136,15 @@ bool RISCVTargetLowering::lowerInterleaveIntrinsicToStore(IntrinsicInst *II,
     VssegNFunc = Intrinsic::getDeclaration(SI->getModule(), IntrIds[Factor - 2],
                                            {InVTy, XLenTy});
     VL = Constant::getAllOnesValue(XLenTy);
+    SmallVector<Value *> Args;
+      while (!LeafNodes.empty()) {
+        Args.push_back(LeafNodes.front());
+        LeafNodes.pop();
+      }
+      Args.push_back(SI->getPointerOperand());
+      Args.push_back(VL);
+      Builder.CreateCall(VssegNFunc, Args);
+      return true;
   }
 
   Builder.CreateCall(VssegNFunc, {II->getOperand(0), II->getOperand(1),
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.h b/llvm/lib/Target/RISCV/RISCVISelLowering.h
index b10da3d40befb7..1f104cf3bc15d5 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.h
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.h
@@ -855,10 +855,12 @@ class RISCVTargetLowering : public TargetLowering {
   bool lowerInterleavedStore(StoreInst *SI, ShuffleVectorInst *SVI,
                              unsigned Factor) const override;
 
-  bool lowerDeinterleaveIntrinsicToLoad(IntrinsicInst *II,
+  bool lowerDeinterleaveIntrinsicToLoad(IntrinsicInst *DI,
+                                        std::queue<std::pair<unsigned, Value*>>& LeafNodes,
                                         LoadInst *LI) const override;
 
   bool lowerInterleaveIntrinsicToStore(IntrinsicInst *II,
+                                       std::queue<Value*>& LeafNodes,
                                        StoreInst *SI) const override;
 
   bool supportKCFIBundles() const override { return true; }
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 2057cab46135ff..41f8c5a72ce1e7 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -154,6 +154,7 @@
 #include <string>
 #include <tuple>
 #include <utility>
+#include <queue>
 
 using namespace llvm;
 
@@ -459,10 +460,23 @@ static Value *interleaveVectors(IRBuilderBase &Builder, ArrayRef<Value *> Vals,
   // Scalable vectors cannot use arbitrary shufflevectors (only splats), so
   // must use intrinsics to interleave.
   if (VecTy->isScalableTy()) {
-    VectorType *WideVecTy = VectorType::getDoubleElementsVectorType(VecTy);
-    return Builder.CreateIntrinsic(
-        WideVecTy, Intrinsic::experimental_vector_interleave2, Vals,
-        /*FMFSource=*/nullptr, Name);
+    SmallVector<Value *> Vecs(Vals);
+    unsigned AllNodesNum = (2*Vals.size()) - 1;
+    // last element in the vec should be the final interleaved result,
+    // so, skip processing last element.
+    AllNodesNum --;
+    // interleave each 2 consecutive nodes, and push result to the vec,
+    // so that we can interleave the interleaved results again if we have
+    // more than 2 vectors to interleave.
+    for (unsigned i = 0; i < AllNodesNum; i +=2) {
+      VectorType *VecTy = cast<VectorType>(Vecs[i]->getType());
+      VectorType *WideVecTy = VectorType::getDoubleElementsVectorType(VecTy);
+      auto InterleavedVec = Builder.CreateIntrinsic(
+        WideVecTy, Intrinsic::experimental_vector_interleave2,
+        {Vecs[i], Vecs[i+1]}, /*FMFSource=*/nullptr, Name);
+      Vecs.push_back(InterleavedVec);
+    }
+    return Vecs[Vecs.size()-1];
   }
 
   // Fixed length. Start by concatenating all vectors into a wide vector.
@@ -2519,7 +2533,7 @@ void InnerLoopVectorizer::vectorizeInterleaveGroup(
                              unsigned Part, Value *MaskForGaps) -> Value * {
     if (VF.isScalable()) {
       assert(!MaskForGaps && "Interleaved groups with gaps are not supported.");
-      assert(InterleaveFactor == 2 &&
+      assert(isPowerOf2_32(InterleaveFactor)  &&
              "Unsupported deinterleave factor for scalable vectors");
       auto *BlockInMaskPart = State.get(BlockInMask, Part);
       SmallVector<Value *, 2> Ops = {BlockInMaskPart, BlockInMaskPart};
@@ -2572,23 +2586,40 @@ void InnerLoopVectorizer::vectorizeInterleaveGroup(
     }
 
     if (VecTy->isScalableTy()) {
-      assert(InterleaveFactor == 2 &&
-             "Unsupported deinterleave factor for scalable vectors");
-
+      assert(isPowerOf2_32(InterleaveFactor)  &&
+            "Unsupported deinterleave factor for scalable vectors");
       for (unsigned Part = 0; Part < UF; ++Part) {
         // Scalable vectors cannot use arbitrary shufflevectors (only splats),
         // so must use intrinsics to deinterleave.
-        Value *DI = Builder.CreateIntrinsic(
-            Intrinsic::experimental_vector_deinterleave2, VecTy, NewLoads[Part],
-            /*FMFSource=*/nullptr, "strided.vec");
+        
+        std::queue<Value *>Queue;
+        Queue.push(NewLoads[Part]);
+        // NonLeaf represents how many times we will do deinterleaving,
+        // think of it as a tree, each node will be deinterleaved, untill we reach to
+        // the leaf nodes which will be the final results of deinterleaving.
+        unsigned NonLeaf = InterleaveFactor - 1;
+        for (unsigned i = 0; i < NonLeaf; i ++) {
+          auto Node = Queue.front();
+          Queue.pop();
+          auto DeinterleaveType = Node->getType();
+          Value *DI = Builder.CreateIntrinsic(
+            Intrinsic::experimental_vector_deinterleave2, DeinterleaveType, Node,
+            /*FMFSource=*/nullptr, "root.strided.vec");
+          Value *StridedVec1 = Builder.CreateExtractValue(DI, 0);
+          Value *Strid...
[truncated]


github-actions bot commented Apr 17, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@fhahn
Contributor

fhahn commented Apr 17, 2024

It looks like the TargetLowering, LV and InterleavedAccessPass changes could be decoupled?

@@ -2519,7 +2533,7 @@ void InnerLoopVectorizer::vectorizeInterleaveGroup(
unsigned Part, Value *MaskForGaps) -> Value * {
if (VF.isScalable()) {
assert(!MaskForGaps && "Interleaved groups with gaps are not supported.");
assert(InterleaveFactor == 2 &&
assert(isPowerOf2_32(InterleaveFactor) &&
"Unsupported deinterleave factor for scalable vectors");
auto *BlockInMaskPart = State.get(BlockInMask, Part);
SmallVector<Value *, 2> Ops = {BlockInMaskPart, BlockInMaskPart};
Contributor

The mask of masked interleaved accesses also requires an interleave tree to generate the correct mask.

Member Author

Could you please give an example of a case that uses masked interleaved accesses?
I have commented out the code that creates the masked load (the call to the CreateGroupMask lambda function) and reran the tests, but all tests ran successfully. It seems that for the interleaved accesses all the loads are aligned, not masked.

// think of it as a tree, each node will be deinterleaved, untill we reach to
// the leaf nodes which will be the final results of deinterleaving.
unsigned NonLeaf = InterleaveFactor - 1;
for (unsigned i = 0; i < NonLeaf; i ++) {
Contributor

i --> I
i ++ --> I++

Comment on lines 2621 to 2622
auto StridedVec = Queue.front();
Queue.pop();
Contributor

Here is an example:
A vector 0 1 2 3 4 5 6 7
If we do deinterleave 4 on the vector, we should get:
member 0: 0 4
member 1: 1 5
member 2: 2 6
member 3: 3 7
But the Queue in your change may look like: 0 4, 2 6, 1 5, 3 7.
Please confirm the Queue is sorted by a correct rank.
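The ordering issue is easy to reproduce by simulating the patch's breadth-first queue of deinterleave2 calls on plain Python lists (an illustrative sketch, not LLVM IR; the helper names are made up):

```python
def deinterleave2(v):
    # Split a vector into its even-indexed and odd-indexed elements,
    # mirroring llvm.experimental.vector.deinterleave2.
    return v[0::2], v[1::2]

def recursive_deinterleave(v, factor):
    # Breadth-first tree of deinterleave2 calls: pop the front node and
    # push its two halves until we have `factor` leaves.
    nodes = [v]
    while len(nodes) < factor:
        evens, odds = deinterleave2(nodes.pop(0))
        nodes += [evens, odds]
    return nodes

leaves = recursive_deinterleave(list(range(8)), 4)
print(leaves)  # [[0, 4], [2, 6], [1, 5], [3, 7]] -- members 1 and 2 swapped
```

The leaves come out in the order member 0, 2, 1, 3 rather than 0, 1, 2, 3, which is exactly the reordering problem raised above.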

@@ -2681,6 +2712,7 @@ void InnerLoopVectorizer::vectorizeInterleaveGroup(

// Interleave all the smaller vectors into one wider vector.
Value *IVec = interleaveVectors(Builder, StoredVecs, "interleaved.vec");
//LLVM_DEBUG(dbgs() << "interleaved vec: "; IVec->dump());
Contributor

Please remove it.

@paulwalker-arm
Collaborator

paulwalker-arm commented Apr 17, 2024

It looks like the TargetLowering, LV and InterleavedAccessPass changes could be decoupled?

I agree. Please can you split this PR into two PRs with the first only concerned with the correct lowering of the new emulated ld4/st4 IR sequence. The second PR then teaches LoopVectorize how to use/generate them.

@davemgreen
Collaborator

Hi - Is there a plan for how to handle ld3? We have seen a lot of issues recently with the canonical shuffle representation for fixed-vector ld2/ld3/ld4, and I was wondering if it made sense to move away from shuffles for fixed-length too.

@paulwalker-arm
Collaborator

Hi - Is there a plan for how to handle ld3? We have seen a lot of issues recently with the canonical shuffle representation for fixed-vector ld2/ld3/ld4, and I was wondering if it made sense to move away from shuffles for fixed-length too.

There is but that will require a new intrinsic. My hope is that rather than having an intrinsic per interleave factor we could model them all using interleave2 and interleave3 (once it's created). This is why we've started with ld4/st4 support to see if there are any pitfalls to this approach.

Personally I'd love us to move to using these intrinsics for all vector types because it will streamline several code paths.

@efriedma-quic
Collaborator

Doing deinterleaving as trees sort of makes sense for high interleaving factors... I've seen loops that benefit from deinterleaving with interleave factors as high as 12. I'm a little concerned the abstraction layers here are going to make cost modeling less accurate, though; ideally, the vectorizer should be able to estimate the cost of an ld4.

@topperc
Collaborator

topperc commented Apr 17, 2024

Hi - Is there a plan for how to handle ld3? We have seen a lot of issues recently with the canonical shuffle representation for fixed-vector ld2/ld3/ld4, and I was wondering if it made sense to move away from shuffles for fixed-length too.

There is but that will require a new intrinsic. My hope is that rather than having an intrinsic per interleave factor we could model them all using interleave2 and interleave3 (once it's created). This is why we've started with ld4/st4 support to see if there are any pitfalls to this approach.

Personally I'd love us to move to using these intrinsics for all vector types because it will streamline several code paths.

RISC-V has interleave loads for up to 8. So I guess we would need interleave5 and interleave7?

@paulwalker-arm
Collaborator

RISC-V has interleave loads for up to 8. So I guess we would need interleave5 and interleave7?

Yes, sorry. I guess I meant "Hopefully we can emulate all required interleave factors by only implementing specific intrinsics for factors that are a prime number"? An alternative proposal is to have intrinsics for all but then lower them to sequences of fewer intrinsics within the InterleavedAccess pass or perhaps even SelectionDAGBuilder. I suppose this really depends on how awkward cost modelling the sequences turns out to be.

@efriedma-quic - Is your concern related to vectorisation or the costing of already vectorised code?

@efriedma-quic
Collaborator

Given the way the pass pipeline is structured, cost modeling in the vectorizer itself tends to be more important than modeling in subsequent passes. I guess maybe it's not a big deal what the vectorizer generates if the vectorizer itself has some way to get the correct numbers.

@davemgreen
Collaborator

The loop vectorizer will produce costs via getInterleavedMemoryOpCost so should be fine as far as I understand. If there are no combines later on (either uncosted in instcombine or costed in vector-combine) that work with vector.interleave/vector.deinterleave then they can break the canonical patterns that the backend is expecting to generate ld2/ld4 from. I'm hoping that if we can move to interleave/deinterleave, that should fix some of the problems we have at the moment.

I have recently been adding costs for the existing shuffles we find for fixed-length vectors, in an attempt to reduce the number of times we break apart the load+shuffle (or store+shuffle) and have to either attempt to repair it or fall back to worse code generation in the backend. I would say that, in general, costing single instructions is fine; two-instruction patterns (like shuffle(load) or store(shuffle)) are doable but start to get unreliable; and three instructions or more become difficult to cost well.

@hassnaaHamdi hassnaaHamdi force-pushed the main branch 2 times, most recently from 0931f25 to ef3a8ea Compare August 22, 2024 04:54
@hassnaaHamdi hassnaaHamdi changed the title [LV][AArch64]: Utilise SVE ld4/st4 instructions via auto-vectorisation [LV]: Teach LV to recursively (de)interleave. Aug 22, 2024
@paulwalker-arm paulwalker-arm requested review from fhahn and ayalz August 22, 2024 12:45
Comment on lines 3561 to 3563
unsigned MaxFactor = TLI->getMaxSupportedInterleaveFactor();
if (VecTy->isScalableTy() &&
(!ST->hasSVE() || !isPowerOf2_32(Factor) || Factor > MaxFactor))
Collaborator

Whilst this works I think it's much clearer to simply say !ST->hasSVE() || (Factor != 2 && Factor != 4).

For what it's worth I don't see getMaxSupportedInterleaveFactor() being a good function because it doesn't provide enough context for the question it is asking (i.e. it assumes the vector type does not matter). The only reason we don't run into trouble is that, other than this function, all other uses are specific to fixed-length vector types.

Comment on lines 2137 to 2794
for (unsigned I = 0, J = InterleaveFactor / 2, K = 0; K < InterleaveFactor;
K++) {
if (K % 2 == 0) {
InterleavingValues[K] = Vals[I];
I++;
} else {
InterleavingValues[K] = Vals[J];
J++;
}
}
Collaborator

Would the following simplification work?

for (unsigned I = 0; I < InterleaveFactor/2; ++I) {
  InterleavingValues[2*I] = Value[I];
  InterleavingValues[2*I+1] = Value[I + InterleaveFactor/2];
}

Simplification aside, does this two-stage algorithm work? Or rather, I'm pretty sure it doesn't work, but I'm unsure if there are intentional restrictions that mean it is only supposed to work for specific factors.

I could be wrong but I think the algorithm works for InterleavingValues==2 and InterleavingValues==4 but fails for InterleavingValues==8. This would be kind of ok given the original code only worked for InterleavingValues==2, but the other changes in this PR (and the new code's complexity) imply you expect the algorithm to support all powers-of-two?

It would be good to know your intent here because then I can either suggest simplifying the code or help fix the algorithm if my observation is valid.
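For what it's worth, the observation can be checked concretely: a single up-front reorder followed by repeated adjacent pairwise interleaves reproduces factors 2 and 4 but not factor 8 (a sketch on plain lists; the helpers are illustrative, not the patch's code):

```python
def interleave2(a, b):
    # Zip two equal-length vectors element-wise, like
    # llvm.experimental.vector.interleave2.
    out = []
    for x, y in zip(a, b):
        out += [x, y]
    return out

def two_stage_interleave(vals):
    # Stage 1: the single reorder suggested above.
    f = len(vals)
    ordered = [None] * f
    for i in range(f // 2):
        ordered[2 * i] = vals[i]
        ordered[2 * i + 1] = vals[i + f // 2]
    # Stage 2: repeatedly interleave adjacent pairs.
    while len(ordered) > 1:
        ordered = [interleave2(ordered[i], ordered[i + 1])
                   for i in range(0, len(ordered), 2)]
    return ordered[0]

# Member k of a factor-f interleave holds elements k, k+f, k+2f, ...
m4 = [list(range(k, 8, 4)) for k in range(4)]
m8 = [list(range(k, 16, 8)) for k in range(8)]
print(two_stage_interleave(m4) == list(range(8)))   # True: factor 4 works
print(two_stage_interleave(m8) == list(range(16)))  # False: factor 8 breaks
```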

Member Author

Yes, my intent is to make it generic.
I think that to make it generic, there will be multiple sorting steps during the interleave/deinterleave, not only at the end. Correct?

Collaborator

Yes I believe so. It'll be a continuous process of "ordering the operands and then interleaving them" until you have only one vector (or continuous "deinterleave and then order the results" until you have the required N vectors).

Member Author

Done
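The agreed scheme of continuously reordering the operands (pairing value I with value I + Midpoint) and interleaving, halving the number of vectors each round until one remains, can be sketched on plain lists (illustrative helpers, assuming a power-of-2 factor):

```python
def interleave2(a, b):
    # Zip two equal-length vectors element-wise, like
    # llvm.experimental.vector.interleave2.
    out = []
    for x, y in zip(a, b):
        out += [x, y]
    return out

def interleave(members):
    # Pair value i with value i + mid each round, halving `mid`
    # until a single fully interleaved vector remains.
    assert len(members) & (len(members) - 1) == 0, "factor must be a power of 2"
    vals = list(members)
    mid = len(vals) // 2
    while mid > 0:
        vals = [interleave2(vals[i], vals[i + mid]) for i in range(mid)]
        mid //= 2
    return vals[0]

# Factor 8: member k holds elements k and k+8 of the original vector.
members = [list(range(k, 16, 8)) for k in range(8)]
print(interleave(members))  # [0, 1, 2, ..., 15]
```

Unlike the single up-front permutation, this reorders at every level of the tree, which is why it also holds for factor 8 and beyond.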

Comment on lines 2147 to 2150
#ifndef NDEBUG
for (Value *Val : InterleavingValues)
assert(Val && "NULL Interleaving Value");
#endif
Collaborator

Does this assert add any value?

From the code it can be seen that InterleavingValues has InterleaveFactor elements, which itself is the size of Vals, and the loop goes from 0:InterleaveFactor. This means the only way InterleavingValues can have a NULL entry is if it came from Vals, which cannot happen because there's already an assert above where the type of each element of Vals is checked (i.e. all the Value* have been dereferenced by this point anyway).

Member Author

Done

@Mel-Chen
Contributor

Would it be possible to directly add a (de)interleave4 intrinsic to achieve this? Such an implementation should be simpler and more maintainable.
Using prime-factor (de)interleave intrinsics to compose non-prime interleave factors would only be effective in saving the number of intrinsics needed when supporting large interleave factors.
For RISCV, the current largest factor is 8. (Not sure if other targets support larger factors.)

@paulwalker-arm
Collaborator

Would it be possible to directly add (de)interleave intrinsic 4 to achieve this?

A significant risk of not being able to identify the larger interleave factors would be the main reason to introduce dedicated intrinsics. However, I'd expect the IR emulating an 8-way interleave to be pretty fixed, so I'd rather wait to see if this is proved incorrect before going down that route.

At the end of the day the code this PR will introduce will be required anyway to lower (de)interleave intrinsics a target does not support, so there shouldn't be much wasted effort.

@hassnaaHamdi
Member Author

hassnaaHamdi commented Dec 10, 2024

Thanks Paul for reviewing the patch.
I'm going to rebase and land the patch in the next few days if there are no further comments.

Contributor

@Mel-Chen Mel-Chen left a comment

Thanks for your contribution. Here are some questions about this patch:

  1. I believe we require InterleavedAccessPass (IAP) support before this patch. Does IAP now have the ability to convert power-of-2 factors?
  2. Do you have lit test cases for masked interleaved accesses and reverse interleaved accesses with a power-of-2 factor?
  3. The last question is about future work. RISCV supports factor 6. If we have interleave3/deinterleave3 intrinsics, can the approach in this patch support factor 6, or will it require a lot of modifications?

// single final interleaved value.
VectorType *InterleaveTy =
cast<VectorType>(InterleavingValues[0]->getType());
for (unsigned Midpoint = Factor / 2; Midpoint > 0; Midpoint /= 2) {
Contributor

Add an assertion confirming that Factor is a power of 2.

Member Author

@hassnaaHamdi hassnaaHamdi Dec 12, 2024

Hi @Mel-Chen
Thanks for looking at the patch.
The assert statement is already added before calling the interleaveVectors(..) function.

About your questions above:

  1. Yes, I have landed a patch adding support to the InterleavedAccessPass for recognising the (de)interleave tree pattern.
  2. Adding them.
  3. If we have (de)interleave3 intrinsics, then we will have to apply the same logic for recursive (de)interleave3, and the extra work needed will be representing the interleave factor as a product of 2s and 3s. So for factor 6, we will do a single iteration of (de)interleave2 then a single iteration of (de)interleave3. The same logic applies to any factor whose prime factorisation contains only 2s and 3s.
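Assuming a (de)interleave3 intrinsic existed, the factor-6 composition described in point 3 can be sanity-checked on plain lists (a hypothetical sketch; as with the power-of-2 tree, a final reordering of the leaves would still be needed):

```python
def deinterleave(v, f):
    # Generic strided split: member k takes elements k, k+f, k+2f, ...
    return [v[k::f] for k in range(f)]

vec = list(range(12))
evens, odds = deinterleave(vec, 2)                       # one deinterleave2 step
leaves = deinterleave(evens, 3) + deinterleave(odds, 3)  # one deinterleave3 step

members = deinterleave(vec, 6)  # what a direct deinterleave6 would give
# The composition recovers the same six members, but in the order
# 0, 2, 4, 1, 3, 5, so a reordering step completes the factor-6 result.
print(leaves == [members[i] for i in (0, 2, 4, 1, 3, 5)])  # True
```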

Member Author

Hi @Mel-Chen
Are you satisfied with the latest changes?

Contributor

The assert statement is already added before calling the interleaveVectors(..) function.

I think we still need an assert before the for loop, as this is a standalone function. This will ensure that no caller inadvertently passes an invalid factor in the future.

@@ -136,3 +136,4 @@ define void @negative_deinterleave4_test(ptr %src) {

ret void
}

Contributor

Please remove this blank line.


@hassnaaHamdi
Member Author

Hi @Mel-Chen
Thanks for your review.
I have resolved your comments; I think the patch is now ready to be landed?

Contributor

@Mel-Chen Mel-Chen left a comment

The IR looks correct.
I temporarily accept this solution, although it will generate Σ(2^i), i = 0 to log2(factor)-1 (i.e. factor-1) interleave2/deinterleave2 intrinsics.
Please wait for other reviewers @fhahn @ayalz for a few days.
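The Σ(2^i) count above is just the number of internal nodes of a full binary tree, so for a power-of-2 factor it always equals factor − 1 (a quick check):

```python
import math

def intrinsic_count(factor):
    # (de)interleave2 calls in the recursive tree: one per internal node.
    return sum(2**i for i in range(int(math.log2(factor))))

for f in (2, 4, 8, 16):
    print(f, intrinsic_count(f))  # 1, 3, 7, 15 -- always f - 1
```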

@hassnaaHamdi hassnaaHamdi merged commit ccfe0de into llvm:main Dec 27, 2024
8 checks passed
@omjavaid
Contributor

I have reverted this change temporarily to fix the buildbots. Please review the change. Thanks!

hassnaaHamdi added a commit that referenced this pull request Jan 17, 2025
This commit relands the changes from "[LV]: Teach LV to recursively (de)interleave. #89018"

Reason for revert:
- The patch exposed a bug in the IA pass; the bug is now fixed and landed by commit #122643
hassnaaHamdi added a commit that referenced this pull request Feb 9, 2025
This patch relands the changes from "[LV]: Teach LV to recursively (de)interleave. #122989"

Reason for revert:
- The patch exposed an assert in the vectorizer related to a VF difference between the legacy cost model and the VPlan-based cost model, caused by an uncalculated cost for a VPInstruction that VPlanTransforms creates as a replacement for an 'or disjoint' instruction. VPlanTransforms makes that change when there are memory interleaving and predicated blocks, but it didn't cause problems before because in most cases the cost difference between the legacy/new models is not noticeable.
- The issue is fixed by #125434

Original patch: #89018
Reviewed-by: paulwalker-arm, Mel-Chen
github-actions bot pushed a commit to arm/arm-toolchain that referenced this pull request Feb 9, 2025
…25094)

Icohedron pushed a commit to Icohedron/llvm-project that referenced this pull request Feb 11, 2025
lukel97 added a commit that referenced this pull request May 26, 2025
This adds [de]interleave intrinsics for factors of 4, 6 and 8, so that every interleaved memory operation supported by the in-tree targets can be represented by a single intrinsic.

For context, [de]interleaves of fixed-length vectors are represented by a series of shufflevectors. The intrinsics are needed for scalable vectors, and we don't currently scalably vectorize all possible factors of interleave groups supported by RISC-V/AArch64.

The underlying reason for this is that higher factors are currently represented by interleaving multiple interleaves themselves, which made sense at the time of the discussion in #89018.

But after trying to integrate these for higher factors on RISC-V, I think we should revisit this design choice:

- Matching these in InterleavedAccessPass is non-trivial: we currently only support factors that are a power of 2, and detecting this requires a good chunk of code.
- The shufflevector masks used for [de]interleaves of fixed-length vectors are much easier to pattern match, as they are strided patterns, but for the intrinsics matching is much more complicated because the structure is a tree.
- Unlike shufflevectors, no optimisation happens on [de]interleave2 intrinsics.
- For non-power-of-2 factors, e.g. 6, there are multiple possible ways a [de]interleave could be represented; see the discussion in #139373.
- We already have intrinsics for factors 2, 3, 5 and 7, so by avoiding 4, 6 and 8 we're not really saving much.

By representing these higher factors as interleaved interleaves, we can in theory support arbitrarily high interleave factors. However, I'm not sure this is actually needed in practice: SVE only has instructions for factors 2, 3 and 4, whilst RVV only supports up to factor 8.

This patch makes it much easier to support scalable interleaved accesses in the loop vectorizer for RISC-V for factors 3, 5, 6 and 7, as the loop vectorizer and InterleavedAccessPass no longer need to construct and match trees of interleaves.

For interleave factors above 8, for which there are no hardware memory operations to match in the InterleavedAccessPass, we can still keep the wide load + recursive interleaving in the loop vectorizer.
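The contrast drawn above can be made concrete with a sketch (illustrative Python, not LLVM code): for fixed-length vectors a factor-F deinterleave is just F strided shufflevector masks, which are trivial to both generate and pattern match, whereas the scalable form is a tree of deinterleave2 calls.

```python
def deinterleave_masks(num_elts, factor):
    # Shufflevector masks for a factor-F fixed-length deinterleave:
    # result k takes elements k, k + factor, k + 2*factor, ... of the
    # wide vector, so each mask is a simple strided pattern.
    assert num_elts % factor == 0
    return [list(range(k, num_elts, factor)) for k in range(factor)]

def is_strided_mask(mask, factor):
    # Matching the fixed-length form is a one-line stride check,
    # with no tree of intrinsic calls to walk.
    return all(m == mask[0] + j * factor for j, m in enumerate(mask))
```

For example, a factor-4 deinterleave of an 8-element vector produces the masks [0, 4], [1, 5], [2, 6], [3, 7], each recognisable by its stride alone.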
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request May 26, 2025
sivan-shani pushed a commit to sivan-shani/llvm-project that referenced this pull request Jun 3, 2025