[AMDGPU] Allow SLP to analyze i8s #91016
Conversation
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-backend-amdgpu

Author: Jeffrey Byrnes (jrbyrnes)

Changes

After #66838, vectors of i8s will be preferable (relative to the scalarized equivalent) as incoming values for PHI nodes. Since this work addresses that concern, I have built it on top of that PR. The changes to AMDGPULateCodeGenPrepare can be ignored here, as this PR is only meant for the other changes. For this work, the changes begin at 2049a61.

This PR is a draft / WIP while I work on test coverage and confirm there are no oddities in our current cost model once i8s are included. Before taking that plunge, I wanted to see if there are any high-level concerns with the approach.

Conceptually, this work vectorizes i8 PHIs by imposing a cost penalty on the scalarized version. I have also reduced the cost of i8 shuffles, since this facilitates more PHI vectorization and the shuffles should at least be matched to perm. After the mentioned PR, we will work around the normal SelectionDAG legalization for vector i8 PHIs, so in this work we also bypass the legalization checks when initially deciding whether or not to vectorize the type (GCNTTIImpl::getNumberOfParts).

Patch is 259.98 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/91016.diff

18 Files Affected:
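Before the diff itself, a minimal sketch of how the new TTI hook is meant to be consulted on the scalar side of SLP's PHI cost comparison. The helper and call site below are hypothetical (the SLPVectorizer hunk falls below the truncation point); only getPHIScalarizationOverhead itself comes from the patch.

```cpp
// Sketch only: hypothetical SLP-side comparison using the new hook.
#include "llvm/Analysis/TargetTransformInfo.h"

using namespace llvm;

static InstructionCost scalarVsVectorPHICost(const TargetTransformInfo &TTI,
                                             Type *ScalarTy, VectorType *VecTy,
                                             InstructionCost ScalarOpsCost,
                                             InstructionCost VectorOpsCost) {
  // Penalize keeping the PHI scalarized: each lane would live in its own
  // register across the block boundary, and the target can now price that.
  InstructionCost ScalarCost =
      ScalarOpsCost + TTI.getPHIScalarizationOverhead(ScalarTy, VecTy);
  // A negative result means the vector form wins.
  return VectorOpsCost - ScalarCost;
}
```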
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 58c69ac939763a..1fccc460ed74ad 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -884,6 +884,10 @@ class TargetTransformInfo {
bool Insert, bool Extract,
TTI::TargetCostKind CostKind) const;
+ /// Estimate the overhead of scalarizing a PHI instruction.
+ InstructionCost getPHIScalarizationOverhead(Type *ScalarTy,
+ VectorType *VTy) const;
+
/// Estimate the overhead of scalarizing an instructions unique
/// non-constant operands. The (potentially vector) types to use for each of
/// argument are passes via Tys.
@@ -1912,6 +1916,9 @@ class TargetTransformInfo::Concept {
getOperandsScalarizationOverhead(ArrayRef<const Value *> Args,
ArrayRef<Type *> Tys,
TargetCostKind CostKind) = 0;
+
+ virtual InstructionCost getPHIScalarizationOverhead(Type *ScalarTy,
+ VectorType *VTy) = 0;
virtual bool supportsEfficientVectorElementLoadStore() = 0;
virtual bool supportsTailCalls() = 0;
virtual bool supportsTailCallFor(const CallBase *CB) = 0;
@@ -2441,6 +2448,12 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
return Impl.getScalarizationOverhead(Ty, DemandedElts, Insert, Extract,
CostKind);
}
+
+ InstructionCost getPHIScalarizationOverhead(Type *ScalarTy,
+ VectorType *VTy) override {
+ return Impl.getPHIScalarizationOverhead(ScalarTy, VTy);
+ }
+
InstructionCost
getOperandsScalarizationOverhead(ArrayRef<const Value *> Args,
ArrayRef<Type *> Tys,
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 5b40e49714069f..66f3b01e0a3cda 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -363,6 +363,10 @@ class TargetTransformInfoImplBase {
return 0;
}
+ InstructionCost getPHIScalarizationOverhead(Type *ScalarTy, VectorType *VTy) {
+ return 0;
+ }
+
InstructionCost
getOperandsScalarizationOverhead(ArrayRef<const Value *> Args,
ArrayRef<Type *> Tys,
diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
index 06a19c75cf873a..4b4101a1e26940 100644
--- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -801,6 +801,10 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
CostKind);
}
+ InstructionCost getPHIScalarizationOverhead(Type *ScalarTy, VectorType *VTy) {
+ return 0;
+ }
+
/// Estimate the overhead of scalarizing an instructions unique
/// non-constant operands. The (potentially vector) types to use for each of
/// argument are passes via Tys.
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 33c899fe889990..e351f3895ad507 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -585,6 +585,12 @@ InstructionCost TargetTransformInfo::getScalarizationOverhead(
CostKind);
}
+InstructionCost
+TargetTransformInfo::getPHIScalarizationOverhead(Type *ScalarTy,
+ VectorType *VTy) const {
+ return TTIImpl->getPHIScalarizationOverhead(ScalarTy, VTy);
+}
+
InstructionCost TargetTransformInfo::getOperandsScalarizationOverhead(
ArrayRef<const Value *> Args, ArrayRef<Type *> Tys,
TTI::TargetCostKind CostKind) const {
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp b/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
index 69fdeaebe0a018..d7d2ebff03b6b7 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
@@ -81,6 +81,88 @@ class AMDGPULateCodeGenPrepare
bool visitLoadInst(LoadInst &LI);
};
+class ConversionCandidateInfo {
+private:
+ // The instruction which defined the original virtual register used across
+ // blocks
+ Instruction *LiveRegDef;
+ // The original type
+ Type *OriginalType;
+ // The desired type
+ Type *NewType;
+ // The instruction sequence that converts the virtual register, to be used
+ // instead of the original
+ Instruction *Converted = nullptr;
+ // The builder used to build the conversion instruction
+ IRBuilder<> ConvertBuilder;
+
+public:
+ // The instruction which defined the original virtual register used across
+ // blocks
+ Instruction *getLiveRegDef() { return LiveRegDef; }
+ // The original type
+ Type *getOriginalType() { return OriginalType; }
+ // The desired type
+ Type *getNewType() { return NewType; }
+ void setNewType(Type *NewType) { this->NewType = NewType; }
+ // The instruction that converts the virtual register, to be used instead of
+ // the original
+ Instruction *getConverted() { return Converted; }
+ void setConverted(Instruction *Converted) { this->Converted = Converted; }
+ // The builder used to build the conversion instruction
+ IRBuilder<> &getConvertBuilder() { return ConvertBuilder; }
+ // Do we have an instruction sequence which converts the original virtual
+ // register
+ bool hasConverted() { return Converted != nullptr; }
+
+ ConversionCandidateInfo(Instruction *LiveRegDef, BasicBlock *InsertBlock,
+ BasicBlock::iterator InsertPt)
+ : LiveRegDef(LiveRegDef), OriginalType(LiveRegDef->getType()),
+ ConvertBuilder(InsertBlock, InsertPt) {}
+ ConversionCandidateInfo(Instruction *LiveRegDef, Type *NewType,
+ BasicBlock *InsertBlock,
+ BasicBlock::iterator InsertPt)
+ : LiveRegDef(LiveRegDef), OriginalType(LiveRegDef->getType()),
+ NewType(NewType), ConvertBuilder(InsertBlock, InsertPt) {}
+};
+
+typedef std::pair<Instruction *, BasicBlock *> IncomingPair;
+typedef std::pair<Instruction *, SmallVector<IncomingPair, 4>> PHIUpdateInfo;
+
+class LiveRegOptimizer {
+private:
+ Module *Mod = nullptr;
+ const DataLayout *DL = nullptr;
+ // The scalar type to convert to
+ Type *ConvertToScalar;
+ // Holds the collection of PHIs with their pending new operands
+ SmallVector<PHIUpdateInfo, 4> PHIUpdater;
+
+public:
+ // Should the def of the instruction be converted if it is live across blocks
+ bool shouldReplaceUses(const Instruction &I);
+ // Convert the virtual register to the compatible vector of legal type
+ void convertToOptType(ConversionCandidateInfo &LR);
+ // Convert the virtual register back to the original type, stripping away
+ // the MSBs in cases where there was an imperfect fit (e.g. v2i32 -> v7i8)
+ void convertFromOptType(ConversionCandidateInfo &LR);
+ // Get a vector of desired scalar type that is compatible with the original
+ // vector. In cases where there is no bitsize equivalent using a legal vector
+ // type, we pad the MSBs (e.g. v7i8 -> v2i32)
+ Type *getCompatibleType(Instruction *InstToConvert);
+ // Find and replace uses of the virtual register in different block with a
+ // newly produced virtual register of legal type
+ bool replaceUses(Instruction &I);
+ // Replace the collected PHIs with newly produced incoming values. Replacement
+ // is only done if we have a replacement for each original incoming value.
+ bool replacePHIs();
+
+ LiveRegOptimizer(Module *Mod) : Mod(Mod) {
+ DL = &Mod->getDataLayout();
+ ConvertToScalar = Type::getInt32Ty(Mod->getContext());
+ }
+};
+
} // end anonymous namespace
bool AMDGPULateCodeGenPrepare::doInitialization(Module &M) {
@@ -102,14 +184,261 @@ bool AMDGPULateCodeGenPrepare::runOnFunction(Function &F) {
AC = &getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);
UA = &getAnalysis<UniformityInfoWrapperPass>().getUniformityInfo();
+ // "Optimize" the virtual regs that cross basic block boundaries. When
+ // building the SelectionDAG, vectors of illegal types that cross basic blocks
+ // will be scalarized and widened, with each scalar living in its
+ // own physical register. To work around this, this optimization converts the
+ // vectors to equivalent vectors of legal type (which are converted back
+ // before uses in subsequent blocks), to pack the bits into fewer physical
+ // registers (used in CopyToReg/CopyFromReg pairs).
+ LiveRegOptimizer LRO(Mod);
+
bool Changed = false;
for (auto &BB : F)
- for (Instruction &I : llvm::make_early_inc_range(BB))
+ for (Instruction &I : make_early_inc_range(BB)) {
Changed |= visit(I);
+ if (!LRO.shouldReplaceUses(I))
+ continue;
+ Changed |= LRO.replaceUses(I);
+ }
+ Changed |= LRO.replacePHIs();
return Changed;
}
+bool LiveRegOptimizer::replaceUses(Instruction &I) {
+ bool MadeChange = false;
+
+ struct ConvertUseInfo {
+ Instruction *Converted;
+ SmallVector<Instruction *, 4> Users;
+ };
+ DenseMap<BasicBlock *, ConvertUseInfo> InsertedConversionMap;
+
+ ConversionCandidateInfo FromCCI(&I, I.getParent(),
+ std::next(I.getIterator()));
+ FromCCI.setNewType(getCompatibleType(FromCCI.getLiveRegDef()));
+ for (auto IUser = I.user_begin(); IUser != I.user_end(); IUser++) {
+
+ Instruction *UserInst = cast<Instruction>(*IUser);
+ if (UserInst->getParent() != I.getParent() || isa<PHINode>(UserInst)) {
+ LLVM_DEBUG(dbgs() << *UserInst << "\n\tUses "
+ << *FromCCI.getOriginalType()
+ << " from previous block. Needs conversion\n");
+ convertToOptType(FromCCI);
+ if (!FromCCI.hasConverted())
+ continue;
+ // If it is a PHI node, just create and collect the new operand. We can
+ // only replace the PHI node once we have converted all the operands
+ if (auto PHI = dyn_cast<PHINode>(UserInst)) {
+ for (unsigned Idx = 0; Idx < PHI->getNumIncomingValues(); Idx++) {
+ Value *IncVal = PHI->getIncomingValue(Idx);
+ if (&I == dyn_cast<Instruction>(IncVal)) {
+ BasicBlock *IncBlock = PHI->getIncomingBlock(Idx);
+ auto PHIOps =
+ find_if(PHIUpdater, [&UserInst](PHIUpdateInfo &Entry) {
+ return Entry.first == UserInst;
+ });
+
+ if (PHIOps == PHIUpdater.end())
+ PHIUpdater.push_back(
+ {UserInst, {{FromCCI.getConverted(), IncBlock}}});
+ else
+ PHIOps->second.push_back({FromCCI.getConverted(), IncBlock});
+
+ break;
+ }
+ }
+ continue;
+ }
+
+ // Do not create multiple conversion sequences if there are multiple
+ // uses in the same block
+ if (InsertedConversionMap.contains(UserInst->getParent())) {
+ InsertedConversionMap[UserInst->getParent()].Users.push_back(UserInst);
+ LLVM_DEBUG(dbgs() << "\tUser already has access to converted def\n");
+ continue;
+ }
+
+ ConversionCandidateInfo ToCCI(FromCCI.getConverted(), I.getType(),
+ UserInst->getParent(),
+
+ UserInst->getParent()->getFirstNonPHIIt());
+ convertFromOptType(ToCCI);
+ assert(ToCCI.hasConverted());
+ InsertedConversionMap[UserInst->getParent()] = {ToCCI.getConverted(),
+ {UserInst}};
+ }
+ }
+
+ // Replace uses of the original def with the converted def in a separate loop that is not dependent upon the
+ // state of the uses
+ for (auto &Entry : InsertedConversionMap) {
+ for (auto &UserInst : Entry.second.Users) {
+ LLVM_DEBUG(dbgs() << *UserInst
+ << "\n\tNow uses: " << *Entry.second.Converted << '\n');
+ UserInst->replaceUsesOfWith(&I, Entry.second.Converted);
+ MadeChange = true;
+ }
+ }
+ return MadeChange;
+}
+
+bool LiveRegOptimizer::replacePHIs() {
+ bool MadeChange = false;
+ for (auto Ele : PHIUpdater) {
+ auto [ThePHIInst, NewPHINodeOps] = Ele;
+ LLVM_DEBUG(dbgs() << "Attempting to replace: " << *ThePHIInst << '\n');
+ // If we have converted all the required operands, then do the replacement
+ if (cast<PHINode>(ThePHIInst)->getNumIncomingValues() ==
+ NewPHINodeOps.size()) {
+ IRBuilder<> Builder(Ele.first);
+ auto NPHI = Builder.CreatePHI(NewPHINodeOps[0].first->getType(),
+ NewPHINodeOps.size());
+ for (auto IncVals : NewPHINodeOps) {
+ NPHI->addIncoming(IncVals.first, IncVals.second);
+ LLVM_DEBUG(dbgs() << " Using: " << *IncVals.first
+ << " For: " << IncVals.second->getName() << '\n');
+ }
+ LLVM_DEBUG(dbgs() << "Sucessfully replaced with " << *NPHI << '\n');
+ ConversionCandidateInfo ToCCI(
+ NPHI, ThePHIInst->getType(), ThePHIInst->getParent(),
+
+ ThePHIInst->getParent()->getFirstNonPHIIt());
+ convertFromOptType(ToCCI);
+ assert(ToCCI.hasConverted());
+ Ele.first->replaceAllUsesWith(ToCCI.getConverted());
+ // The old PHI is no longer used
+ ThePHIInst->eraseFromParent();
+ MadeChange = true;
+ }
+ }
+ return MadeChange;
+}
+
+Type *LiveRegOptimizer::getCompatibleType(Instruction *InstToConvert) {
+ Type *OriginalType = InstToConvert->getType();
+ assert(OriginalType->getScalarSizeInBits() <=
+ ConvertToScalar->getScalarSizeInBits());
+ VectorType *VTy = dyn_cast<VectorType>(OriginalType);
+ if (!VTy)
+ return ConvertToScalar;
+
+ TypeSize OriginalSize = DL->getTypeSizeInBits(VTy);
+ TypeSize ConvertScalarSize = DL->getTypeSizeInBits(ConvertToScalar);
+ unsigned ConvertEltCount =
+ (OriginalSize + ConvertScalarSize - 1) / ConvertScalarSize;
+
+ if (OriginalSize <= ConvertScalarSize)
+ return IntegerType::get(Mod->getContext(), ConvertScalarSize);
+
+ return VectorType::get(Type::getIntNTy(Mod->getContext(), ConvertScalarSize),
+ ElementCount::getFixed(ConvertEltCount));
+}
+
+void LiveRegOptimizer::convertToOptType(ConversionCandidateInfo &LR) {
+ if (LR.hasConverted()) {
+ LLVM_DEBUG(dbgs() << "\tAlready has converted def\n");
+ return;
+ }
+
+ VectorType *VTy = cast<VectorType>(LR.getOriginalType());
+ Type *NewTy = LR.getNewType();
+
+ TypeSize OriginalSize = DL->getTypeSizeInBits(VTy);
+ TypeSize NewSize = DL->getTypeSizeInBits(NewTy);
+
+ auto &Builder = LR.getConvertBuilder();
+ Value *V = cast<Value>(LR.getLiveRegDef());
+ // If there is a bitsize match, we can fit the old vector into a new vector of
+ // desired type
+ if (OriginalSize == NewSize) {
+ LR.setConverted(cast<Instruction>(Builder.CreateBitCast(V, NewTy)));
+ LLVM_DEBUG(dbgs() << "\tConverted def to " << *LR.getConverted()->getType()
+ << '\n');
+ return;
+ }
+
+ // If there is a bitsize mismatch, we must use a wider vector
+ assert(NewSize > OriginalSize);
+ ElementCount ExpandedVecElementCount =
+ ElementCount::getFixed(NewSize / VTy->getScalarSizeInBits());
+
+ SmallVector<int, 8> ShuffleMask;
+ for (unsigned I = 0; I < VTy->getElementCount().getFixedValue(); I++)
+ ShuffleMask.push_back(I);
+
+ for (uint64_t I = VTy->getElementCount().getFixedValue();
+ I < ExpandedVecElementCount.getFixedValue(); I++)
+ ShuffleMask.push_back(VTy->getElementCount().getFixedValue());
+
+ Instruction *ExpandedVec =
+ cast<Instruction>(Builder.CreateShuffleVector(V, ShuffleMask));
+ LR.setConverted(cast<Instruction>(Builder.CreateBitCast(ExpandedVec, NewTy)));
+ LLVM_DEBUG(dbgs() << "\tConverted def to " << *LR.getConverted()->getType()
+ << '\n');
+ return;
+}
+
+void LiveRegOptimizer::convertFromOptType(ConversionCandidateInfo &LRC) {
+ Type *OTy = LRC.getOriginalType();
+ VectorType *NewVTy = cast<VectorType>(LRC.getNewType());
+
+ TypeSize OriginalSize = DL->getTypeSizeInBits(OTy);
+ TypeSize NewSize = DL->getTypeSizeInBits(NewVTy);
+
+ auto &Builder = LRC.getConvertBuilder();
+ Value *V = cast<Value>(LRC.getLiveRegDef());
+ // If there is a bitsize match, we simply convert back to the original type
+ if (OriginalSize == NewSize) {
+ LRC.setConverted(cast<Instruction>(Builder.CreateBitCast(V, NewVTy)));
+ LLVM_DEBUG(dbgs() << "\tProduced for user: " << *LRC.getConverted()
+ << '\n');
+ return;
+ }
+
+ if (!OTy->isVectorTy()) {
+ Instruction *Trunc = cast<Instruction>(Builder.CreateTrunc(
+ LRC.getLiveRegDef(), IntegerType::get(Mod->getContext(), NewSize)));
+ Instruction *Original =
+ cast<Instruction>(Builder.CreateBitCast(Trunc, NewVTy));
+ LRC.setConverted(cast<Instruction>(Original));
+ LLVM_DEBUG(dbgs() << "\tProduced for user: " << *LRC.getConverted()
+ << '\n');
+ return;
+ }
+
+ // If there is a bitsize mismatch, we have used a wider vector and must strip
+ // the MSBs to convert back to the original type
+ assert(OriginalSize > NewSize);
+ ElementCount ExpandedVecElementCount =
+ ElementCount::getFixed(OriginalSize / NewVTy->getScalarSizeInBits());
+ VectorType *ExpandedVT = VectorType::get(
+ Type::getIntNTy(Mod->getContext(), NewVTy->getScalarSizeInBits()),
+ ExpandedVecElementCount);
+ Instruction *Converted =
+ cast<Instruction>(Builder.CreateBitCast(LRC.getLiveRegDef(), ExpandedVT));
+
+ unsigned NarrowElementCount = NewVTy->getElementCount().getFixedValue();
+ SmallVector<int, 8> ShuffleMask(NarrowElementCount);
+ std::iota(ShuffleMask.begin(), ShuffleMask.end(), 0);
+
+ Instruction *NarrowVec =
+ cast<Instruction>(Builder.CreateShuffleVector(Converted, ShuffleMask));
+ LRC.setConverted(cast<Instruction>(NarrowVec));
+ LLVM_DEBUG(dbgs() << "\tProduced for user: " << *LRC.getConverted() << '\n');
+ return;
+}
+
+bool LiveRegOptimizer::shouldReplaceUses(const Instruction &I) {
+ // Vectors of illegal types are copied across blocks in an inefficient manner.
+ // They are scalarized and widened to legal scalars. In such cases, we can do
+ // better by using legal vector types
+ Type *IType = I.getType();
+ return IType->isVectorTy() && IType->getScalarSizeInBits() < 16 &&
+ !I.getType()->getScalarType()->isPointerTy();
+}
+
bool AMDGPULateCodeGenPrepare::canWidenScalarExtLoad(LoadInst &LI) const {
unsigned AS = LI.getPointerAddressSpace();
// Skip non-constant address space.
@@ -119,7 +448,7 @@ bool AMDGPULateCodeGenPrepare::canWidenScalarExtLoad(LoadInst &LI) const {
// Skip non-simple loads.
if (!LI.isSimple())
return false;
- auto *Ty = LI.getType();
+ Type *Ty = LI.getType();
// Skip aggregate types.
if (Ty->isAggregateType())
return false;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index 305a6c8c3b9262..c15481336075e6 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -1150,10 +1150,10 @@ bool GCNPassConfig::addPreISel() {
AMDGPUPassConfig::addPreISel();
if (TM->getOptLevel() > CodeGenOptLevel::None)
- addPass(createAMDGPULateCodeGenPreparePass());
+ addPass(createSinkingPass());
if (TM->getOptLevel() > CodeGenOptLevel::None)
- addPass(createSinkingPass());
+ addPass(createAMDGPULateCodeGenPreparePass());
// Merge divergent exit nodes. StructurizeCFG won't recognize the multi-exit
// regions formed by them.
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
index 84320d296a037b..2fe2d506fa4308 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
@@ -306,6 +306,18 @@ bool GCNTTIImpl::hasBranchDivergence(const Function *F) const {
return !F || !ST->isSingleLaneExecution(*F);
}
+unsigned GCNTTIImpl::getNumberOfParts(Type *Tp) const {
+ if (auto VTy = dyn_cast<FixedVectorType>(...
[truncated]
Intrinsically it makes no sense for us to prefer i8 anything. It's a 32-bit architecture. Everything is a 32-bit register. Vectors of i32 should be the canonically preferred type for anything wide. Why would we want i8?
I think we should prefer vectors of i8 over scalar i8s for PHIs because vectors of i8s for PHIs will be coerced to vectors of i32 to work around the legalization issues. So, in that case, vectorization is profitable.
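The coercion referred to here is what the LiveRegOptimizer hunk above performs. For the exact-bitwidth case it amounts to a pair of bitcasts around the block boundary; a minimal sketch, assuming a <4 x i8> value and IRBuilders positioned in the defining block and at the first non-PHI point of the using block:

```cpp
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Pack a <4 x i8> def into a single i32 in its defining block, so the value
// crosses the block boundary in one 32-bit register (exact-fit case).
static Value *packForBlockBoundary(IRBuilder<> &B, Value *V4I8) {
  return B.CreateBitCast(V4I8, B.getInt32Ty(), "packed");
}

// Unpack the i32 back to <4 x i8> in the user block, restoring the original
// type for the remaining uses.
static Value *unpackAtUse(IRBuilder<> &B, Value *PackedI32) {
  return B.CreateBitCast(PackedI32, FixedVectorType::get(B.getInt8Ty(), 4),
                         "unpacked");
}
```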
Description should be clarified, because "for the vectorizers" and "for codegen" are different questions.
I'm not sure why we need a dedicated phi function. It should essentially be the type legalization cost for any value. Phis aren't a real operation, they are just split to legal registers
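One way to read this suggestion (a sketch of the reviewer's alternative, not code from the PR): price a PHI through the generic type-legalization query rather than a dedicated hook, for example via the target's getCFInstrCost hook. The body below is illustrative; the existing GCN override differs.

```cpp
// Sketch: a PHI is not a real operation, so charge the cost of legalizing the
// incoming value's type (roughly, how many legal pieces it splits into).
InstructionCost GCNTTIImpl::getCFInstrCost(unsigned Opcode,
                                           TTI::TargetCostKind CostKind,
                                           const Instruction *I) {
  if (Opcode == Instruction::PHI && I) {
    std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(I->getType());
    return LT.first;
  }
  return BaseT::getCFInstrCost(Opcode, CostKind, I);
}
```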
Change-Id: Ia995bc646e5f050083bd6277eeabe0b5ab410f47
Force-pushed from 1262897 to 69ecffc
Rebase. Removed getPHIScalarizationOverhead and vectorizing i8 shuffles in favor of #91016; also removed the dependency on #66838. This PR now simply tells SLP to analyze i8s by changing the vectorization factor. The cost model seems okay as is, with perhaps some overestimation of vector cost for some ops. The vectorization in the lit tests is due to the reduced cost of vectorizing versus extracting/inserting for identity vectors.
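A minimal sketch of what such a getNumberOfParts change could look like follows. The diff above is truncated exactly at this function, so the body here is an illustration of the stated intent, not the patch's actual code.

```cpp
// Illustrative body only: report small fixed i8 vectors as a single "part"
// instead of deferring to legalization (which splits/widens them and makes
// SLP's vectorization factor collapse back to scalars). The <= 4 element
// threshold is an assumption for this sketch.
unsigned GCNTTIImpl::getNumberOfParts(Type *Tp) const {
  if (auto *VTy = dyn_cast<FixedVectorType>(Tp))
    if (VTy->getElementType()->isIntegerTy(8) && VTy->getNumElements() <= 4)
      return 1;
  // Otherwise fall back to the legalization-based count, as in the base class.
  std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Tp);
  return LT.first.isValid() ? *LT.first.getValue() : 0;
}
```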
Change-Id: If5ac53d5235ee8c65c53454b209c9f155c17edc4
After #66838, vectors of i8s will be preferable (relative to the scalarized equivalent) as incoming values for PHI nodes, since they enable the legalization workaround and thereby improve codegen.
After the mentioned PR, we will work around the normal SelectionDAG legalization for vector i8 PHIs, so in this work we also bypass the legalization checks when initially deciding whether or not to vectorize the type (GCNTTIImpl::getNumberOfParts). The effect is to have SLP consider i8s for vectorization.
#95328 introduces the cost hook needed to capture the vectorization conditions (where the vectors are used by #66838).
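For illustration, a hypothetical input pattern (not one of the lit tests) where this matters: a loop accumulating four byte lanes keeps four i8 values live across iterations, i.e. four scalar i8 PHIs that SLP can now consider turning into a single <4 x i8> PHI.

```cpp
// Hypothetical example: the four lane accumulators become i8 PHIs in the
// loop header. With i8 vectorization allowed, SLP can keep them as one
// <4 x i8> value instead of four scalar PHIs.
void accumulateLanes(const unsigned char *Src, unsigned char Dst[4], int N) {
  unsigned char L0 = 0, L1 = 0, L2 = 0, L3 = 0;
  for (int I = 0; I < N; ++I) {
    L0 += Src[4 * I + 0];
    L1 += Src[4 * I + 1];
    L2 += Src[4 * I + 2];
    L3 += Src[4 * I + 3];
  }
  Dst[0] = L0;
  Dst[1] = L1;
  Dst[2] = L2;
  Dst[3] = L3;
}
```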