[AMDGPU] Add IR LiveReg type-based optimization #66838
Conversation
@llvm/pr-subscribers-llvm-globalisel @llvm/pr-subscribers-backend-amdgpu

Changes

NOTE: This commit is part of a stack which spans across Phabricator. The PR is meant only for the top of the stack ([AMDGPU] Add IR LiveReg type-based optimization).

As suggested in #66134, this adds the IR-level logic to coerce the type of illegal vectors which have live ranges that span basic blocks.

The issue is that local ISel will emit CopyToReg / CopyFromReg pairs for live ranges spanning basic blocks. For illegal vector types, the DAGBuilder will legalize by scalarizing the vector, then widening each scalar, and passing each scalar via a separate physical register. See https://godbolt.org/z/Y7MhcjGE8 for a demo of the issue.

This feature identifies cases like these and inserts bitcasts between the def of the illegal vector and its uses in different blocks. This avoids the scalarization process and allows the bits to be packed into fewer registers -- for example, we now use 2 VGPRs for a v8i8 instead of 8.

Patch is 637.69 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/66838.diff

18 Files Affected:
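As a quick worked example of the packing the description refers to (illustration only; packedI32Lanes is a made-up helper, and the in-tree calculation lives in LiveRegOptimizer::getCompatibleType in the diff below): the number of 32-bit registers needed is the element count times the element width, rounded up to the next multiple of 32 bits.

  // Illustrative only: how many i32 lanes are needed to carry a small vector.
  #include <cassert>
  #include <cstdio>

  static unsigned packedI32Lanes(unsigned NumElts, unsigned EltBits) {
    assert(EltBits <= 32 && "scalar is wider than the i32 conversion type");
    unsigned TotalBits = NumElts * EltBits;
    return (TotalBits + 31) / 32; // round up for imperfect fits
  }

  int main() {
    std::printf("v8i8  -> <%u x i32>\n", packedI32Lanes(8, 8));  // 2
    std::printf("v7i8  -> <%u x i32>\n", packedI32Lanes(7, 8));  // 2 (padded)
    std::printf("v4i16 -> <%u x i32>\n", packedI32Lanes(4, 16)); // 2
    return 0;
  }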
diff --git a/llvm/include/llvm/CodeGen/ByteProvider.h b/llvm/include/llvm/CodeGen/ByteProvider.h
index 3187b4e68c56f3a..99ae8607c0b2071 100644
--- a/llvm/include/llvm/CodeGen/ByteProvider.h
+++ b/llvm/include/llvm/CodeGen/ByteProvider.h
@@ -32,6 +32,11 @@ template <typename ISelOp> class ByteProvider {
ByteProvider(std::optional<ISelOp> Src, int64_t DestOffset, int64_t SrcOffset)
: Src(Src), DestOffset(DestOffset), SrcOffset(SrcOffset) {}
+ ByteProvider(std::optional<ISelOp> Src, int64_t DestOffset, int64_t SrcOffset,
+ std::optional<bool> IsSigned)
+ : Src(Src), DestOffset(DestOffset), SrcOffset(SrcOffset),
+ IsSigned(IsSigned) {}
+
// TODO -- use constraint in c++20
// Does this type correspond with an operation in selection DAG
template <typename T> class is_op {
@@ -61,6 +66,9 @@ template <typename ISelOp> class ByteProvider {
// DestOffset
int64_t SrcOffset = 0;
+ // Whether or not Src should be treated as signed
+ std::optional<bool> IsSigned;
+
ByteProvider() = default;
static ByteProvider getSrc(std::optional<ISelOp> Val, int64_t ByteOffset,
@@ -70,6 +78,14 @@ template <typename ISelOp> class ByteProvider {
return ByteProvider(Val, ByteOffset, VectorOffset);
}
+ static ByteProvider getSrc(std::optional<ISelOp> Val, int64_t ByteOffset,
+ int64_t VectorOffset,
+ std::optional<bool> IsSigned) {
+ static_assert(is_op<ISelOp>().value,
+ "ByteProviders must contain an operation in selection DAG.");
+ return ByteProvider(Val, ByteOffset, VectorOffset, IsSigned);
+ }
+
static ByteProvider getConstantZero() {
return ByteProvider<ISelOp>(std::nullopt, 0, 0);
}
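The new four-argument getSrc overload above simply threads an optional signedness flag alongside the byte offsets. A minimal usage sketch, assuming an SDValue operand Op that is already known to be wide enough; the helper name srcByteWithSign is hypothetical:

  #include "llvm/CodeGen/ByteProvider.h"
  #include "llvm/CodeGen/SelectionDAGNodes.h"
  #include <optional>
  using namespace llvm;

  // Wrap byte 0 of Op while recording whether its producer chain was seen to
  // sign-extend; std::nullopt would mean "signedness unknown / not relevant".
  static std::optional<ByteProvider<SDValue>>
  srcByteWithSign(SDValue Op, bool SeenSignExtend) {
    if (Op.getValueSizeInBits() < 8)
      return std::nullopt;
    return ByteProvider<SDValue>::getSrc(Op, /*ByteOffset=*/0,
                                         /*VectorOffset=*/0,
                                         /*IsSigned=*/SeenSignExtend);
  }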
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp b/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
index 4cce34bdeabcf44..b50379e98d0f6b5 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
@@ -106,6 +106,7 @@ class AMDGPUCodeGenPrepareImpl
Module *Mod = nullptr;
const DataLayout *DL = nullptr;
bool HasUnsafeFPMath = false;
+ bool UsesGlobalISel = false;
bool HasFP32DenormalFlush = false;
bool FlowChanged = false;
mutable Function *SqrtF32 = nullptr;
@@ -341,6 +342,85 @@ class AMDGPUCodeGenPrepare : public FunctionPass {
StringRef getPassName() const override { return "AMDGPU IR optimizations"; }
};
+class LiveRegConversion {
+private:
+ // The instruction which defined the original virtual register used across
+ // blocks
+ Instruction *LiveRegDef;
+ // The original type
+ Type *OriginalType;
+ // The desired type
+ Type *NewType;
+ // The instruction sequence that converts the virtual register, to be used
+ // instead of the original
+ std::optional<Instruction *> Converted;
+ // The builder used to build the conversion instruction
+ IRBuilder<> ConvertBuilder;
+
+public:
+ // The instruction which defined the original virtual register used across
+ // blocks
+ Instruction *getLiveRegDef() { return LiveRegDef; }
+ // The original type
+ Type *getOriginalType() { return OriginalType; }
+ // The desired type
+ Type *getNewType() { return NewType; }
+ void setNewType(Type *NewType) { this->NewType = NewType; }
+ // The instruction that converts the virtual register, to be used instead of
+ // the original
+ std::optional<Instruction *> &getConverted() { return Converted; }
+ void setConverted(Instruction *Converted) { this->Converted = Converted; }
+ // The builder used to build the conversion instruction
+ IRBuilder<> &getConverBuilder() { return ConvertBuilder; }
+ // Do we have an instruction sequence which converts the original virtual
+ // register
+ bool hasConverted() { return Converted.has_value(); }
+
+ LiveRegConversion(Instruction *LiveRegDef, BasicBlock *InsertBlock,
+ BasicBlock::iterator InsertPt)
+ : LiveRegDef(LiveRegDef), OriginalType(LiveRegDef->getType()),
+ ConvertBuilder(InsertBlock, InsertPt) {}
+ LiveRegConversion(Instruction *LiveRegDef, Type *NewType,
+ BasicBlock *InsertBlock, BasicBlock::iterator InsertPt)
+ : LiveRegDef(LiveRegDef), OriginalType(LiveRegDef->getType()),
+ NewType(NewType), ConvertBuilder(InsertBlock, InsertPt) {}
+};
+
+class LiveRegOptimizer {
+private:
+ Module *Mod = nullptr;
+ // The scalar type to convert to
+ Type *ConvertToScalar;
+ // Holds the collection of PHIs with their pending new operands
+ SmallVector<std::pair<Instruction *,
+ SmallVector<std::pair<Instruction *, BasicBlock *>, 4>>,
+ 4>
+ PHIUpdater;
+
+public:
+ // Should the def of the instruction be converted if it is live across blocks
+ bool shouldReplaceUses(const Instruction &I);
+ // Convert the virtual register to the compatible vector of legal type
+ void convertToOptType(LiveRegConversion &LR);
+ // Convert the virtual register back to the original type, stripping away
+ // the MSBs in cases where there was an imperfect fit (e.g. v2i32 -> v7i8)
+ void convertFromOptType(LiveRegConversion &LR);
+ // Get a vector of desired scalar type that is compatible with the original
+ // vector. In cases where there is no bitsize equivalent using a legal vector
+ // type, we pad the MSBs (e.g. v7i8 -> v2i32)
+ Type *getCompatibleType(Instruction *InstToConvert);
+ // Find and replace uses of the virtual register in different block with a
+ // newly produced virtual register of legal type
+ bool replaceUses(Instruction &I);
+ // Replace the collected PHIs with newly produced incoming values. Replacement
+ // is only done if we have a replacement for each original incoming value.
+ bool replacePHIs();
+
+ LiveRegOptimizer(Module *Mod) : Mod(Mod) {
+ ConvertToScalar = Type::getInt32Ty(Mod->getContext());
+ }
+};
+
} // end anonymous namespace
bool AMDGPUCodeGenPrepareImpl::run(Function &F) {
@@ -358,6 +438,7 @@ bool AMDGPUCodeGenPrepareImpl::run(Function &F) {
Next = std::next(I);
MadeChange |= visit(*I);
+ I->getType();
if (Next != E) { // Control flow changed
BasicBlock *NextInstBB = Next->getParent();
@@ -369,9 +450,269 @@ bool AMDGPUCodeGenPrepareImpl::run(Function &F) {
}
}
}
+
+ // GlobalISel should directly use the values, and do not need to emit
+ // CopyTo/CopyFrom Regs across blocks
+ if (UsesGlobalISel)
+ return MadeChange;
+
+ // "Optimize" the virtual regs that cross basic block boundaries. In such
+ // cases, vectors of illegal types will be scalarized and widened, with each
+ // scalar living in its own physical register. The optimization converts the
+ // vectors to equivalent vectors of legal type (which are converted back
+ // before uses in subsequent blocks), to pack the bits into fewer physical
+ // registers (used in CopyToReg/CopyFromReg pairs).
+ LiveRegOptimizer LRO(Mod);
+ for (auto &BB : F) {
+ for (auto &I : BB) {
+ if (!LRO.shouldReplaceUses(I))
+ continue;
+ MadeChange |= LRO.replaceUses(I);
+ }
+ }
+
+ MadeChange |= LRO.replacePHIs();
+ return MadeChange;
+}
+
+bool LiveRegOptimizer::replaceUses(Instruction &I) {
+ bool MadeChange = false;
+
+ struct ConvertUseInfo {
+ Instruction *Converted;
+ SmallVector<Instruction *, 4> Users;
+ };
+ DenseMap<BasicBlock *, ConvertUseInfo> UseConvertTracker;
+
+ LiveRegConversion FromLRC(
+ &I, I.getParent(),
+ static_cast<BasicBlock::iterator>(std::next(I.getIterator())));
+ FromLRC.setNewType(getCompatibleType(FromLRC.getLiveRegDef()));
+ for (auto IUser = I.user_begin(); IUser != I.user_end(); IUser++) {
+
+ if (auto UserInst = dyn_cast<Instruction>(*IUser)) {
+ if (UserInst->getParent() != I.getParent()) {
+ LLVM_DEBUG(dbgs() << *UserInst << "\n\tUses "
+ << *FromLRC.getOriginalType()
+ << " from previous block. Needs conversion\n");
+ convertToOptType(FromLRC);
+ if (!FromLRC.hasConverted())
+ continue;
+ // If it is a PHI node, just create and collect the new operand. We can
+ // only replace the PHI node once we have converted all the operands
+ if (auto PhiInst = dyn_cast<PHINode>(UserInst)) {
+ for (unsigned Idx = 0; Idx < PhiInst->getNumIncomingValues(); Idx++) {
+ auto IncVal = PhiInst->getIncomingValue(Idx);
+ if (&I == dyn_cast<Instruction>(IncVal)) {
+ auto IncBlock = PhiInst->getIncomingBlock(Idx);
+ auto PHIOps = find_if(
+ PHIUpdater,
+ [&UserInst](
+ std::pair<Instruction *,
+ SmallVector<
+ std::pair<Instruction *, BasicBlock *>, 4>>
+ &Entry) { return Entry.first == UserInst; });
+
+ if (PHIOps == PHIUpdater.end())
+ PHIUpdater.push_back(
+ {UserInst, {{*FromLRC.getConverted(), IncBlock}}});
+ else
+ PHIOps->second.push_back({*FromLRC.getConverted(), IncBlock});
+
+ break;
+ }
+ }
+ continue;
+ }
+
+ // Do not create multiple conversion sequences if there are multiple
+ // uses in the same block
+ if (UseConvertTracker.contains(UserInst->getParent())) {
+ UseConvertTracker[UserInst->getParent()].Users.push_back(UserInst);
+ LLVM_DEBUG(dbgs() << "\tUser already has access to converted def\n");
+ continue;
+ }
+
+ LiveRegConversion ToLRC(*FromLRC.getConverted(), I.getType(),
+ UserInst->getParent(),
+ static_cast<BasicBlock::iterator>(
+ UserInst->getParent()->getFirstNonPHIIt()));
+ convertFromOptType(ToLRC);
+ assert(ToLRC.hasConverted());
+ UseConvertTracker[UserInst->getParent()] = {*ToLRC.getConverted(),
+ {UserInst}};
+ }
+ }
+ }
+
+ // Replace the uses of the original def with the converted value in a
+ // separate loop that is not dependent upon the state of the uses
+ for (auto &Entry : UseConvertTracker) {
+ for (auto &UserInst : Entry.second.Users) {
+ LLVM_DEBUG(dbgs() << *UserInst
+ << "\n\tNow uses: " << *Entry.second.Converted << "\n");
+ UserInst->replaceUsesOfWith(&I, Entry.second.Converted);
+ MadeChange = true;
+ }
+ }
+ return MadeChange;
+}
+
+bool LiveRegOptimizer::replacePHIs() {
+ bool MadeChange = false;
+ for (auto Ele : PHIUpdater) {
+ auto ThePHINode = dyn_cast<PHINode>(Ele.first);
+ assert(ThePHINode);
+ auto NewPHINodeOps = Ele.second;
+ LLVM_DEBUG(dbgs() << "Attempting to replace: " << *ThePHINode << "\n");
+ // If we have converted all the required operands, then do the replacement
+ if (ThePHINode->getNumIncomingValues() == NewPHINodeOps.size()) {
+ IRBuilder<> Builder(Ele.first);
+ auto NPHI = Builder.CreatePHI(NewPHINodeOps[0].first->getType(),
+ NewPHINodeOps.size());
+ for (auto IncVals : NewPHINodeOps) {
+ NPHI->addIncoming(IncVals.first, IncVals.second);
+ LLVM_DEBUG(dbgs() << " Using: " << *IncVals.first
+ << " For: " << IncVals.second->getName() << "\n");
+ }
+ LLVM_DEBUG(dbgs() << "Sucessfully replaced with " << *NPHI << "\n");
+ LiveRegConversion ToLRC(NPHI, ThePHINode->getType(),
+ ThePHINode->getParent(),
+ static_cast<BasicBlock::iterator>(
+ ThePHINode->getParent()->getFirstNonPHIIt()));
+ convertFromOptType(ToLRC);
+ assert(ToLRC.hasConverted());
+ Ele.first->replaceAllUsesWith(*ToLRC.getConverted());
+ // The old PHI is no longer used
+ ThePHINode->eraseFromParent();
+ MadeChange = true;
+ }
+ }
return MadeChange;
}
+Type *LiveRegOptimizer::getCompatibleType(Instruction *InstToConvert) {
+ auto OriginalType = InstToConvert->getType();
+ assert(OriginalType->getScalarSizeInBits() <=
+ ConvertToScalar->getScalarSizeInBits());
+ auto VTy = dyn_cast<VectorType>(OriginalType);
+ if (!VTy)
+ return ConvertToScalar;
+
+ auto OriginalSize =
+ VTy->getScalarSizeInBits() * VTy->getElementCount().getFixedValue();
+ auto ConvertScalarSize = ConvertToScalar->getScalarSizeInBits();
+ auto ConvertEltCount =
+ (OriginalSize + ConvertScalarSize - 1) / ConvertScalarSize;
+
+ return VectorType::get(Type::getIntNTy(Mod->getContext(), ConvertScalarSize),
+ llvm::ElementCount::getFixed(ConvertEltCount));
+}
+
+void LiveRegOptimizer::convertToOptType(LiveRegConversion &LR) {
+ if (LR.hasConverted()) {
+ LLVM_DEBUG(dbgs() << "\tAlready has converted def\n");
+ return;
+ }
+
+ auto VTy = dyn_cast<VectorType>(LR.getOriginalType());
+ assert(VTy);
+ auto NewVTy = dyn_cast<VectorType>(LR.getNewType());
+ assert(NewVTy);
+
+ auto V = static_cast<Value *>(LR.getLiveRegDef());
+ auto OriginalSize =
+ VTy->getScalarSizeInBits() * VTy->getElementCount().getFixedValue();
+ auto NewSize =
+ NewVTy->getScalarSizeInBits() * NewVTy->getElementCount().getFixedValue();
+
+ auto &Builder = LR.getConverBuilder();
+
+ // If there is a bitsize match, we can fit the old vector into a new vector of
+ // desired type
+ if (OriginalSize == NewSize) {
+ LR.setConverted(dyn_cast<Instruction>(Builder.CreateBitCast(V, NewVTy)));
+ LLVM_DEBUG(dbgs() << "\tConverted def to "
+ << *(*LR.getConverted())->getType() << "\n");
+ return;
+ }
+
+ // If there is a bitsize mismatch, we must use a wider vector
+ assert(NewSize > OriginalSize);
+ auto ExpandedVecElementCount =
+ llvm::ElementCount::getFixed(NewSize / VTy->getScalarSizeInBits());
+
+ SmallVector<int, 8> ShuffleMask;
+ for (unsigned I = 0; I < VTy->getElementCount().getFixedValue(); I++)
+ ShuffleMask.push_back(I);
+
+ for (uint64_t I = VTy->getElementCount().getFixedValue();
+ I < ExpandedVecElementCount.getFixedValue(); I++)
+ ShuffleMask.push_back(VTy->getElementCount().getFixedValue());
+
+ auto ExpandedVec =
+ dyn_cast<Instruction>(Builder.CreateShuffleVector(V, ShuffleMask));
+ LR.setConverted(
+ dyn_cast<Instruction>(Builder.CreateBitCast(ExpandedVec, NewVTy)));
+ LLVM_DEBUG(dbgs() << "\tConverted def to " << *(*LR.getConverted())->getType()
+ << "\n");
+ return;
+}
+
+void LiveRegOptimizer::convertFromOptType(LiveRegConversion &LRC) {
+ auto VTy = dyn_cast<VectorType>(LRC.getOriginalType());
+ assert(VTy);
+ auto NewVTy = dyn_cast<VectorType>(LRC.getNewType());
+ assert(NewVTy);
+
+ auto V = static_cast<Value *>(LRC.getLiveRegDef());
+ auto OriginalSize =
+ VTy->getScalarSizeInBits() * VTy->getElementCount().getFixedValue();
+ auto NewSize =
+ NewVTy->getScalarSizeInBits() * NewVTy->getElementCount().getFixedValue();
+
+ auto &Builder = LRC.getConverBuilder();
+
+ // If there is a bitsize match, we simply convert back to the original type
+ if (OriginalSize == NewSize) {
+ LRC.setConverted(dyn_cast<Instruction>(Builder.CreateBitCast(V, NewVTy)));
+ LLVM_DEBUG(dbgs() << "\tProduced for user: " << **LRC.getConverted()
+ << "\n");
+ return;
+ }
+
+ // If there is a bitsize mismatch, we have used a wider vector and must strip
+ // the MSBs to convert back to the original type
+ assert(OriginalSize > NewSize);
+ auto ExpandedVecElementCount = llvm::ElementCount::getFixed(
+ OriginalSize / NewVTy->getScalarSizeInBits());
+ auto ExpandedVT = VectorType::get(
+ Type::getIntNTy(Mod->getContext(), NewVTy->getScalarSizeInBits()),
+ ExpandedVecElementCount);
+ auto Converted = dyn_cast<Instruction>(
+ Builder.CreateBitCast(LRC.getLiveRegDef(), ExpandedVT));
+
+ auto NarrowElementCount = NewVTy->getElementCount().getFixedValue();
+ SmallVector<int, 8> ShuffleMask;
+ for (uint64_t I = 0; I < NarrowElementCount; I++)
+ ShuffleMask.push_back(I);
+
+ auto NarrowVec = dyn_cast<Instruction>(
+ Builder.CreateShuffleVector(Converted, ShuffleMask));
+ LRC.setConverted(dyn_cast<Instruction>(NarrowVec));
+ LLVM_DEBUG(dbgs() << "\tProduced for user: " << **LRC.getConverted() << "\n");
+ return;
+}
+
+bool LiveRegOptimizer::shouldReplaceUses(const Instruction &I) {
+ // Vectors of illegal types are copied across blocks in an efficient manner.
+ // They are scalarized and widened to legal scalars. In such cases, we can do
+ // better by using legal vector types
+ auto IType = I.getType();
+ return IType->isVectorTy() && IType->getScalarSizeInBits() < 16 &&
+ !I.getType()->getScalarType()->isPointerTy();
+}
+
unsigned AMDGPUCodeGenPrepareImpl::getBaseElementBitWidth(const Type *T) const {
assert(needsPromotionToI32(T) && "T does not need promotion to i32");
@@ -2230,6 +2571,7 @@ bool AMDGPUCodeGenPrepare::runOnFunction(Function &F) {
Impl.ST = &TM.getSubtarget<GCNSubtarget>(F);
Impl.AC = &getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);
Impl.UA = &getAnalysis<UniformityInfoWrapperPass>().getUniformityInfo();
+ Impl.UsesGlobalISel = TM.Options.EnableGlobalISel;
auto *DTWP = getAnalysisIfAvailable<DominatorTreeWrapperPass>();
Impl.DT = DTWP ? &DTWP->getDomTree() : nullptr;
Impl.HasUnsafeFPMath = hasUnsafeFPMath(F);
@@ -2250,6 +2592,7 @@ PreservedAnalyses AMDGPUCodeGenPreparePass::run(Function &F,
Impl.UA = &FAM.getResult<UniformityInfoAnalysis>(F);
Impl.DT = FAM.getCachedResult<DominatorTreeAnalysis>(F);
Impl.HasUnsafeFPMath = hasUnsafeFPMath(F);
+ Impl.UsesGlobalISel = TM.Options.EnableGlobalISel;
SIModeRegisterDefaults Mode(F);
Impl.HasFP32DenormalFlush =
Mode.FP32Denormals == DenormalMode::getPreserveSign();
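For the imperfect-fit case that convertToOptType above handles (e.g. <7 x i8>, 56 bits, packed into <2 x i32>, 64 bits), the same padding idea can be sketched as a standalone helper. This is an illustration under the assumption that the legal scalar type is i32; it is not the committed code:

  #include "llvm/ADT/SmallVector.h"
  #include "llvm/IR/IRBuilder.h"
  using namespace llvm;

  // Widen V with a shufflevector so its bit width matches NewVTy, then bitcast.
  // Mask indices past the original element count select from the implicit
  // poison operand, so the padded lanes are undefined; they are stripped again
  // when converting back (see convertFromOptType above).
  static Value *padAndBitCast(IRBuilder<> &B, Value *V, FixedVectorType *NewVTy) {
    auto *VTy = cast<FixedVectorType>(V->getType());
    unsigned OrigElts = VTy->getNumElements();
    unsigned NewBits = NewVTy->getScalarSizeInBits() * NewVTy->getNumElements();
    unsigned WideElts = NewBits / VTy->getScalarSizeInBits(); // e.g. 64 / 8 = 8

    SmallVector<int, 8> Mask;
    for (unsigned I = 0; I < OrigElts; ++I)
      Mask.push_back(I);        // keep the original lanes
    for (unsigned I = OrigElts; I < WideElts; ++I)
      Mask.push_back(OrigElts); // poison padding lanes

    Value *Wide = B.CreateShuffleVector(V, Mask);
    return B.CreateBitCast(Wide, NewVTy);
  }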
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 1c85ec3f9f5212f..403d61f1b836fab 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -10652,23 +10652,27 @@ SDValue SITargetLowering::performAndCombine(SDNode *N,
// performed.
static const std::optional<ByteProvider<SDValue>>
calculateSrcByte(const SDValue Op, uint64_t DestByte, uint64_t SrcIndex = 0,
+ std::optional<bool> IsSigned = std::nullopt,
unsigned Depth = 0) {
// We may need to recursively traverse a series of SRLs
if (Depth >= 6)
return std::nullopt;
- auto ValueSize = Op.getValueSizeInBits();
- if (ValueSize != 8 && ValueSize != 16 && ValueSize != 32)
+ if (Op.getValueSizeInBits() < 8)
return std::nullopt;
switch (Op->getOpcode()) {
case ISD::TRUNCATE: {
- return calculateSrcByte(Op->getOperand(0), DestByte, SrcIndex, Depth + 1);
+ return calculateSrcByte(Op->getOperand(0), DestByte, SrcIndex, IsSigned,
+ Depth + 1);
}
case ISD::SIGN_EXTEND:
case ISD::ZERO_EXTEND:
case ISD::SIGN_EXTEND_INREG: {
+ IsSigned = IsSigned.value_or(false) ||
+ Op->getOpcode() == ISD::SIGN_EXTEND ||
+ Op->getOpcode() == ISD::SIGN_EXTEND_INREG;
SDValue NarrowOp = Op->getOperand(0);
auto NarrowVT = NarrowOp.getValueType();
if (Op->getOpcode() == ISD::SIGN_EXTEND_INREG) {
@@ -10681,7 +10685,8 @@ calculateSrcByte(const SDValue Op, uint64_t DestByte, uint64_t SrcIndex = 0,
if (SrcIndex >= NarrowByteWidth)
return std::nullopt;
- return calculateSrcByte(Op->getOperand(0), DestByte, SrcIndex, Depth + 1);
+ return calculateSrcByte(Op->getOperand(0), DestByte, SrcIndex, IsSigned,
+ Depth + 1);
}
case ISD::SRA:
@@ -10697,12 +10702,38 @@ calculateSrcByte(const SDValue Op, uint64_t DestByte, uint64_t SrcIndex = 0,
SrcIndex += BitShift / 8;
- return calculateSrcByte(Op->getOperand(0), DestByte, SrcIndex, Depth + 1);
+ return calculateSrcByte(Op->getOperand(0), DestByte, SrcIndex, IsSigned,
+ Depth + 1);
}
- default: {
+ case ISD::EXTRACT_VECTOR_ELT: {
+ auto IdxOp = dyn_cast<ConstantSDNode>(Op->getOperand(1));
+ if (!IdxOp)
+ return std::nullopt;
+ auto VecIdx = IdxOp->getZExtValue();
+ auto ScalarSize = Op.getScalarValueS...
[truncated]
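The calculateSrcByte changes in this hunk thread an optional IsSigned flag through the recursion and fold in the extension kind whenever an extend node is walked through. In isolation, that folding step looks roughly like the following; foldSignedness is an illustrative name, not a function in the patch:

  #include "llvm/CodeGen/ISDOpcodes.h"
  #include <optional>

  // Once a sign-extending node has been seen on the path to the source byte,
  // IsSigned stays true; otherwise the current opcode decides it.
  static std::optional<bool> foldSignedness(std::optional<bool> IsSigned,
                                            unsigned Opcode) {
    return IsSigned.value_or(false) || Opcode == llvm::ISD::SIGN_EXTEND ||
           Opcode == llvm::ISD::SIGN_EXTEND_INREG;
  }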
Needs a rebase to show the already pushed dependencies
Needs rebase, most of this stuff was already merged?
Force-pushed 4ee0c89 to 1c502bc
Just a rebase to reflect the current outstanding work -- still need to address the other comments.
Force-pushed 1c502bc to 71526ba
Force-pushed 71526ba to 5efb955
Decoupling from "[AMDGPU]: Accept constant zero bytes in v_perm OrCombine" as that is taking longer than expected, and this has priority. As a result, in exotic cases (e.g. v3i8) we may produce suboptimal codegen, but for the normal case codegen is much improved.
Should include a test where a predecessor block has a repeated successor (i.e. it has multiple incoming switch cases)
How does this compare to the usage of the optimizePhiType transform (and its target hook) in CodeGenPrepare?
Interesting -- I did not see that, thanks. That feature does have the same structure as the one in this PR, but it serves a different purpose: helping fold away bitcasts. It will insert bitcasts if it finds that the bitcasts defining the incoming values and the bitcast uses of the PHI are all of the same type. The target hook is only invoked in special cases: when there are no bitcast defs/uses, or when the bitcasts are defined by a load (for the incoming values) / used by a store (for the PHI uses). It is mainly because of this special entry condition that the hook will not work for the feature in this PR. Beyond that condition, the generic CodeGenPrepare optimizePhiType is also disabled for vector types. Of course we could remove the conditions on the target hook, but that feels like misusing the hook, since the features are conceptually different and we would have a phase-ordering problem (code sinking would fold the newly inserted casts).
The implementation is still a lot simpler, even if the heuristic and placement are different. Would copying the same technique and using the existing ValueToValueMap work?
We should check if PHI splitting is still needed in CGP with this |
Force-pushed 9e78de0 to cbc7922
In the latest version, I've refactored the main algorithm so that the design/style is more consistent with llvm-project/llvm/lib/CodeGen/CodeGenPrepare.cpp (line 6380 at 3684a38).
That said, the linked algorithm is only used to coerce types for certain PHI nodes, whereas this feature coerces types for all problematic values (PHI node or not), so the algorithm needed adjustments. Moreover, this feature needs to calculate the type to convert to, so some of the code from the older version survived (e.g. convertTo/FromOptType). Nevertheless, the new implementation seems more standard and simple, so I thought it should supersede the old version (rather than creating a competing PR). The force-push was to fix the lit changes.
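For readers unfamiliar with that CodeGenPrepare pattern, the shape being described is roughly a worklist plus an original-to-coerced value map, with user operands patched only once a mapping exists. A rough sketch under those assumptions (rewriteUses is an illustrative name, not the committed code):

  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/ADT/DenseMap.h"
  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  // Rewrite the operands of the collected instructions to use their coerced
  // counterparts, once the whole chain has been visited and mapped.
  static void rewriteUses(ArrayRef<Instruction *> Worklist,
                          const DenseMap<Value *, Value *> &ValMap) {
    for (Instruction *I : Worklist)
      for (Use &U : I->operands()) {
        auto It = ValMap.find(U.get());
        if (It != ValMap.end())
          U.set(It->second); // this operand has a coerced replacement
      }
  }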
; SI-NEXT: v_lshlrev_b32_e32 v1, 16, v4
; SI-NEXT: v_or_b32_e32 v2, v2, v0
; SI-NEXT: v_or_b32_e32 v3, v3, v1
; SI-NEXT: v_lshlrev_b32_e32 v1, 16, v3
i16 is not legal for the generic subtarget, so the LiveReg optimization gets triggered. However, when legalizing, the v8i16 loads are scalarized into 8 i32 (ext from i16) loads, which negates the benefit.
Introduce BBUseValMap to correctly track, per BB, the conversions back to the original type for users; also handle zeroinitializer incoming values and return on unhandled incoming values. The latest version passes PSDB.
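A guess at the shape of that per-block map (illustrative only; the real BBUseValMap is defined in the patch, and getOrCreateBlockCopy is a made-up helper): one conversion back to the original type per using block, shared by every user in that block.

  #include "llvm/ADT/DenseMap.h"
  #include "llvm/IR/IRBuilder.h"
  using namespace llvm;

  // Return the block-local copy of Coerced in the original type, creating it
  // after the PHIs on first use of UseBB.
  static Value *getOrCreateBlockCopy(DenseMap<BasicBlock *, Value *> &BBUseValMap,
                                     Value *Coerced, Type *OrigTy,
                                     BasicBlock *UseBB) {
    Value *&Slot = BBUseValMap[UseBB];
    if (!Slot) {
      IRBuilder<> B(UseBB, UseBB->getFirstNonPHIIt());
      Slot = B.CreateBitCast(Coerced, OrigTy);
    }
    return Slot;
  }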
ping
Can you add some IR->IR tests for the basic cases
Any other concerns here?
Thanks for the review @arsenm |
Change-Id: Ia0d11b79b8302e79247fe193ccabc0dad2d359a0
Change-Id: Id3cf508092d5c6321cc980ae168e79d525f558a3
Force-pushed e55ce0f to d90e99a
I've reopened this for the reland; the latest version doesn't have the sanitizer issues flagged in #97138. Since visitLoadInst deletes dead loads, optimizeLiveType may try to analyze freed memory. Simply deferring the deletion until after the loop doesn't work, as the loads marked for removal may be part of a subsequently coerced chain. In order to integrate the loops, these interactions will need to be handled, which is left as a TODO.
This pass, like most combiners, should work bottom-up, not top-down. Would that avoid the issue?
Change-Id: I7a3968d4812c2af790648222f9fe78b5af356937
Yes, we can achieve the loop integration with reverse iteration and integrated deletion -- this is implemented in the latest version. During this implementation, I found what I thought was a regression, but it was actually flagging a bug in the way PHI deletion was handled in the previous version. Previously, when we were unable to convert all incoming values for a coerced PHI, we would flag the PHI for deletion, to be deleted after the loop. However, we did not also update the ValMap, so users would still update their operands to use the coerced PHI, which would then be deleted. The latest version contains the fix for this. This will only occur when we have an unusual incoming value for a PHI (e.g. a vector literal); I assume that is why PSDB didn't notice it.
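One way the bottom-up walk with integrated (but deferred) deletion can be structured, sketched under the assumption that the visitor only marks instructions dead once their uses have been rewritten; visitBottomUp is an illustrative name, not the actual pass entry point:

  #include "llvm/ADT/STLExtras.h"
  #include "llvm/ADT/SmallPtrSet.h"
  #include "llvm/IR/Function.h"
  using namespace llvm;

  static bool visitBottomUp(
      Function &F,
      function_ref<bool(Instruction &, SmallPtrSetImpl<Instruction *> &)> Visit) {
    bool Changed = false;
    SmallPtrSet<Instruction *, 8> DeadInsts;
    // Walk blocks and instructions in reverse so defs are seen after their
    // uses have been handled; nothing is erased during the walk itself.
    for (BasicBlock &BB : llvm::reverse(F))
      for (Instruction &I : llvm::reverse(BB))
        Changed |= Visit(I, DeadInsts);
    // Drop references first so mutually dependent dead instructions can be
    // erased in any order.
    for (Instruction *I : DeadInsts)
      I->dropAllReferences();
    for (Instruction *I : DeadInsts)
      I->eraseFromParent();
    return Changed || !DeadInsts.empty();
  }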
Change-Id: If1bcd6721fea0fa0114dbf019647f109015007a0
Passes PSDB.
I don't know why this is in AMDGPULateCodeGenPrepare instead of regular AMDGPUCodeGenPrepare. IIRC it was an extremely niche issue that caused us to have this late pass in the first place, which I wanted to fix. We can always move it later, though.
Reland: 5da7179
As suggested in #66134, this adds the IR level logic to coerce the type of illegal vectors which have live ranges that span across basic blocks.
The issue is that local ISel will emit CopyToReg / CopyFromReg pairs for live ranges spanning basic blocks. For illegal vector types, the DAGBuilder will legalize by scalarizing the vector, then widening each scalar, and passing each scalar via a separate physical register. See https://godbolt.org/z/Y7MhcjGE8 for a demo of the issue.
This feature identifies cases like these, and inserts bitcasts between the def of the illegal vector and the uses in different blocks. This results in avoiding the scalarization process and an ability to pack the bits into fewer registers -- for example, we now use 2 VGPR for a v8i8 instead of 8.
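To make the closing paragraph concrete, here is a minimal sketch of the insertion it describes, assuming an existing <8 x i8> def DefI and a user UseI in a different block (illustrative only; packAcrossBlocks is a made-up helper, while the committed logic lives in LiveRegOptimizer): bitcast to a legal <2 x i32> right after the def, bitcast back at the top of the using block, and repoint the cross-block user.

  #include "llvm/IR/DerivedTypes.h"
  #include "llvm/IR/IRBuilder.h"
  #include "llvm/IR/Instructions.h"
  #include <iterator>
  using namespace llvm;

  static void packAcrossBlocks(Instruction *DefI, Instruction *UseI) {
    LLVMContext &Ctx = DefI->getContext();
    auto *LegalTy = FixedVectorType::get(Type::getInt32Ty(Ctx), 2); // 64 bits

    // Pack right after the def: <8 x i8> -> <2 x i32>.
    IRBuilder<> DefB(DefI->getParent(), std::next(DefI->getIterator()));
    Value *Packed = DefB.CreateBitCast(DefI, LegalTy);

    // Unpack at the top of the user's block: <2 x i32> -> <8 x i8>.
    IRBuilder<> UseB(UseI->getParent(), UseI->getParent()->getFirstNonPHIIt());
    Value *Unpacked = UseB.CreateBitCast(Packed, DefI->getType());

    // The cross-block user now reads the repacked value, so ISel copies two
    // 32-bit registers across the block boundary instead of eight scalars.
    UseI->replaceUsesOfWith(DefI, Unpacked);
  }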