[AMDGPU] Allow SLP to analyze i8s #91016
Conversation
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-backend-amdgpu

Author: Jeffrey Byrnes (jrbyrnes)

Changes

After #66838, vectors of i8s will be preferable (relative to the scalarized equivalent) as incoming values for PHI nodes. Since this work addresses that concern, I have built it on top of that PR. The changes to AMDGPULateCodeGenPrepare can be ignored here, as this PR is only meant for the other changes. For this work, the changes begin at 2049a61.

This PR is a draft / WIP while I work on test coverage and confirm there are no oddities in our current cost model once i8s are included. Before taking that plunge, I wanted to see if there are any high-level concerns with the approach.

Conceptually, this work vectorizes i8 PHIs by imposing a cost penalty on the scalarized version. I have also reduced the cost of i8 shuffles, since this facilitates more PHI vectorization and the shuffles should at least be matched to perm. After the mentioned PR, we will work around the normal SelectionDAG legalization for vector i8 PHIs, so in this work we also bypass the legalization checks when initially deciding whether or not to vectorize the type (GCNTTIImpl::getNumberOfParts).

Patch is 259.98 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/91016.diff

18 Files Affected:
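Before the diff itself, a minimal sketch of how the new TTI hook is meant to be consulted on the scalar side of SLP's PHI cost comparison. The helper and call site below are hypothetical (the SLPVectorizer hunk falls below the truncation point); only getPHIScalarizationOverhead itself comes from the patch.

```cpp
// Sketch only: hypothetical SLP-side comparison using the new hook.
#include "llvm/Analysis/TargetTransformInfo.h"

using namespace llvm;

static InstructionCost scalarVsVectorPHICost(const TargetTransformInfo &TTI,
                                             Type *ScalarTy, VectorType *VecTy,
                                             InstructionCost ScalarOpsCost,
                                             InstructionCost VectorOpsCost) {
  // Penalize keeping the PHI scalarized: each lane would live in its own
  // register across the block boundary, and the target can now price that.
  InstructionCost ScalarCost =
      ScalarOpsCost + TTI.getPHIScalarizationOverhead(ScalarTy, VecTy);
  // A negative result means the vector form wins.
  return VectorOpsCost - ScalarCost;
}
```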
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 58c69ac939763a..1fccc460ed74ad 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -884,6 +884,10 @@ class TargetTransformInfo {
bool Insert, bool Extract,
TTI::TargetCostKind CostKind) const;
+ /// Estimate the overhead of scalarizing a PHI instruction.
+ InstructionCost getPHIScalarizationOverhead(Type *ScalarTy,
+ VectorType *VTy) const;
+
/// Estimate the overhead of scalarizing an instructions unique
/// non-constant operands. The (potentially vector) types to use for each of
/// argument are passes via Tys.
@@ -1912,6 +1916,9 @@ class TargetTransformInfo::Concept {
getOperandsScalarizationOverhead(ArrayRef<const Value *> Args,
ArrayRef<Type *> Tys,
TargetCostKind CostKind) = 0;
+
+ virtual InstructionCost getPHIScalarizationOverhead(Type *ScalarTy,
+ VectorType *VTy) = 0;
virtual bool supportsEfficientVectorElementLoadStore() = 0;
virtual bool supportsTailCalls() = 0;
virtual bool supportsTailCallFor(const CallBase *CB) = 0;
@@ -2441,6 +2448,12 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
return Impl.getScalarizationOverhead(Ty, DemandedElts, Insert, Extract,
CostKind);
}
+
+ InstructionCost getPHIScalarizationOverhead(Type *ScalarTy,
+ VectorType *VTy) override {
+ return Impl.getPHIScalarizationOverhead(ScalarTy, VTy);
+ }
+
InstructionCost
getOperandsScalarizationOverhead(ArrayRef<const Value *> Args,
ArrayRef<Type *> Tys,
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 5b40e49714069f..66f3b01e0a3cda 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -363,6 +363,10 @@ class TargetTransformInfoImplBase {
return 0;
}
+ InstructionCost getPHIScalarizationOverhead(Type *ScalarTy, VectorType *VTy) {
+ return 0;
+ }
+
InstructionCost
getOperandsScalarizationOverhead(ArrayRef<const Value *> Args,
ArrayRef<Type *> Tys,
diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
index 06a19c75cf873a..4b4101a1e26940 100644
--- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -801,6 +801,10 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
CostKind);
}
+ InstructionCost getPHIScalarizationOverhead(Type *ScalarTy, VectorType *VTy) {
+ return 0;
+ }
+
/// Estimate the overhead of scalarizing an instructions unique
/// non-constant operands. The (potentially vector) types to use for each of
/// argument are passes via Tys.
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 33c899fe889990..e351f3895ad507 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -585,6 +585,12 @@ InstructionCost TargetTransformInfo::getScalarizationOverhead(
CostKind);
}
+InstructionCost
+TargetTransformInfo::getPHIScalarizationOverhead(Type *ScalarTy,
+ VectorType *VTy) const {
+ return TTIImpl->getPHIScalarizationOverhead(ScalarTy, VTy);
+}
+
InstructionCost TargetTransformInfo::getOperandsScalarizationOverhead(
ArrayRef<const Value *> Args, ArrayRef<Type *> Tys,
TTI::TargetCostKind CostKind) const {
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp b/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
index 69fdeaebe0a018..d7d2ebff03b6b7 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
@@ -81,6 +81,88 @@ class AMDGPULateCodeGenPrepare
bool visitLoadInst(LoadInst &LI);
};
+class ConversionCandidateInfo {
+private:
+ // The instruction which defined the original virtual register used across
+ // blocks
+ Instruction *LiveRegDef;
+ // The original type
+ Type *OriginalType;
+ // The desired type
+ Type *NewType;
+ // The instruction sequence that converts the virtual register, to be used
+ // instead of the original
+ Instruction *Converted = nullptr;
+ // The builder used to build the conversion instruction
+ IRBuilder<> ConvertBuilder;
+
+public:
+ // The instruction which defined the original virtual register used across
+ // blocks
+ Instruction *getLiveRegDef() { return LiveRegDef; }
+ // The original type
+ Type *getOriginalType() { return OriginalType; }
+ // The desired type
+ Type *getNewType() { return NewType; }
+ void setNewType(Type *NewType) { this->NewType = NewType; }
+ // The instruction that converts the virtual register, to be used instead of
+ // the original
+ Instruction *getConverted() { return Converted; }
+ void setConverted(Instruction *Converted) { this->Converted = Converted; }
+ // The builder used to build the conversion instruction
+ IRBuilder<> &getConvertBuilder() { return ConvertBuilder; }
+ // Do we have an instruction sequence which converts the original virtual
+ // register
+ bool hasConverted() { return Converted != nullptr; }
+
+ ConversionCandidateInfo(Instruction *LiveRegDef, BasicBlock *InsertBlock,
+ BasicBlock::iterator InsertPt)
+ : LiveRegDef(LiveRegDef), OriginalType(LiveRegDef->getType()),
+ ConvertBuilder(InsertBlock, InsertPt) {}
+ ConversionCandidateInfo(Instruction *LiveRegDef, Type *NewType,
+ BasicBlock *InsertBlock,
+ BasicBlock::iterator InsertPt)
+ : LiveRegDef(LiveRegDef), OriginalType(LiveRegDef->getType()),
+ NewType(NewType), ConvertBuilder(InsertBlock, InsertPt) {}
+};
+
+typedef std::pair<Instruction *, BasicBlock *> IncomingPair;
+typedef std::pair<Instruction *, SmallVector<IncomingPair, 4>> PHIUpdateInfo;
+
+class LiveRegOptimizer {
+private:
+ Module *Mod = nullptr;
+ const DataLayout *DL = nullptr;
+ // The scalar type to convert to
+ Type *ConvertToScalar;
+ // Holds the collection of PHIs with their pending new operands
+ SmallVector<PHIUpdateInfo, 4> PHIUpdater;
+
+public:
+ // Should the def of the instruction be converted if it is live across blocks
+ bool shouldReplaceUses(const Instruction &I);
+ // Convert the virtual register to the compatible vector of legal type
+ void convertToOptType(ConversionCandidateInfo &LR);
+ // Convert the virtual register back to the original type, stripping away
+ // the MSBs in cases where there was an imperfect fit (e.g. v2i32 -> v7i8)
+ void convertFromOptType(ConversionCandidateInfo &LR);
+ // Get a vector of desired scalar type that is compatible with the original
+ // vector. In cases where there is no bitsize equivalent using a legal vector
+ // type, we pad the MSBs (e.g. v7i8 -> v2i32)
+ Type *getCompatibleType(Instruction *InstToConvert);
+ // Find and replace uses of the virtual register in different block with a
+ // newly produced virtual register of legal type
+ bool replaceUses(Instruction &I);
+ // Replace the collected PHIs with newly produced incoming values. Replacement
+ // is only done if we have a replacement for each original incoming value.
+ bool replacePHIs();
+
+ LiveRegOptimizer(Module *Mod) : Mod(Mod) {
+ DL = &Mod->getDataLayout();
+ ConvertToScalar = Type::getInt32Ty(Mod->getContext());
+ }
+};
+
} // end anonymous namespace
bool AMDGPULateCodeGenPrepare::doInitialization(Module &M) {
@@ -102,14 +184,261 @@ bool AMDGPULateCodeGenPrepare::runOnFunction(Function &F) {
AC = &getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);
UA = &getAnalysis<UniformityInfoWrapperPass>().getUniformityInfo();
+ // "Optimize" the virtual regs that cross basic block boundaries. When
+ // building the SelectionDAG, vectors of illegal types that cross basic blocks
+ // will be scalarized and widened, with each scalar living in its
+ // own physical register. To work around this, this optimization converts the
+ // vectors to equivalent vectors of legal type (which are converted back
+ // before uses in subsequent blocks), to pack the bits into fewer physical
+ // registers (used in CopyToReg/CopyFromReg pairs).
+ LiveRegOptimizer LRO(Mod);
+
bool Changed = false;
for (auto &BB : F)
- for (Instruction &I : llvm::make_early_inc_range(BB))
+ for (Instruction &I : make_early_inc_range(BB)) {
Changed |= visit(I);
+ if (!LRO.shouldReplaceUses(I))
+ continue;
+ Changed |= LRO.replaceUses(I);
+ }
+ Changed |= LRO.replacePHIs();
return Changed;
}
+bool LiveRegOptimizer::replaceUses(Instruction &I) {
+ bool MadeChange = false;
+
+ struct ConvertUseInfo {
+ Instruction *Converted;
+ SmallVector<Instruction *, 4> Users;
+ };
+ DenseMap<BasicBlock *, ConvertUseInfo> InsertedConversionMap;
+
+ ConversionCandidateInfo FromCCI(&I, I.getParent(),
+ std::next(I.getIterator()));
+ FromCCI.setNewType(getCompatibleType(FromCCI.getLiveRegDef()));
+ for (auto IUser = I.user_begin(); IUser != I.user_end(); IUser++) {
+
+ Instruction *UserInst = cast<Instruction>(*IUser);
+ if (UserInst->getParent() != I.getParent() || isa<PHINode>(UserInst)) {
+ LLVM_DEBUG(dbgs() << *UserInst << "\n\tUses "
+ << *FromCCI.getOriginalType()
+ << " from previous block. Needs conversion\n");
+ convertToOptType(FromCCI);
+ if (!FromCCI.hasConverted())
+ continue;
+ // If it is a PHI node, just create and collect the new operand. We can
+ // only replace the PHI node once we have converted all the operands
+ if (auto PHI = dyn_cast<PHINode>(UserInst)) {
+ for (unsigned Idx = 0; Idx < PHI->getNumIncomingValues(); Idx++) {
+ Value *IncVal = PHI->getIncomingValue(Idx);
+ if (&I == dyn_cast<Instruction>(IncVal)) {
+ BasicBlock *IncBlock = PHI->getIncomingBlock(Idx);
+ auto PHIOps =
+ find_if(PHIUpdater, [&UserInst](PHIUpdateInfo &Entry) {
+ return Entry.first == UserInst;
+ });
+
+ if (PHIOps == PHIUpdater.end())
+ PHIUpdater.push_back(
+ {UserInst, {{FromCCI.getConverted(), IncBlock}}});
+ else
+ PHIOps->second.push_back({FromCCI.getConverted(), IncBlock});
+
+ break;
+ }
+ }
+ continue;
+ }
+
+ // Do not create multiple conversion sequences if there are multiple
+ // uses in the same block
+ if (InsertedConversionMap.contains(UserInst->getParent())) {
+ InsertedConversionMap[UserInst->getParent()].Users.push_back(UserInst);
+ LLVM_DEBUG(dbgs() << "\tUser already has access to converted def\n");
+ continue;
+ }
+
+ ConversionCandidateInfo ToCCI(FromCCI.getConverted(), I.getType(),
+ UserInst->getParent(),
+
+ UserInst->getParent()->getFirstNonPHIIt());
+ convertFromOptType(ToCCI);
+ assert(ToCCI.hasConverted());
+ InsertedConversionMap[UserInst->getParent()] = {ToCCI.getConverted(),
+ {UserInst}};
+ }
+ }
+
+ // Replace uses of the original def with the converted def in a separate loop that is not dependent upon the
+ // state of the uses
+ for (auto &Entry : InsertedConversionMap) {
+ for (auto &UserInst : Entry.second.Users) {
+ LLVM_DEBUG(dbgs() << *UserInst
+ << "\n\tNow uses: " << *Entry.second.Converted << '\n');
+ UserInst->replaceUsesOfWith(&I, Entry.second.Converted);
+ MadeChange = true;
+ }
+ }
+ return MadeChange;
+}
+
+bool LiveRegOptimizer::replacePHIs() {
+ bool MadeChange = false;
+ for (auto Ele : PHIUpdater) {
+ auto [ThePHIInst, NewPHINodeOps] = Ele;
+ LLVM_DEBUG(dbgs() << "Attempting to replace: " << *ThePHIInst << '\n');
+ // If we have converted all the required operands, then do the replacement
+ if (cast<PHINode>(ThePHIInst)->getNumIncomingValues() ==
+ NewPHINodeOps.size()) {
+ IRBuilder<> Builder(Ele.first);
+ auto NPHI = Builder.CreatePHI(NewPHINodeOps[0].first->getType(),
+ NewPHINodeOps.size());
+ for (auto IncVals : NewPHINodeOps) {
+ NPHI->addIncoming(IncVals.first, IncVals.second);
+ LLVM_DEBUG(dbgs() << " Using: " << *IncVals.first
+ << " For: " << IncVals.second->getName() << '\n');
+ }
+ LLVM_DEBUG(dbgs() << "Sucessfully replaced with " << *NPHI << '\n');
+ ConversionCandidateInfo ToCCI(
+ NPHI, ThePHIInst->getType(), ThePHIInst->getParent(),
+
+ ThePHIInst->getParent()->getFirstNonPHIIt());
+ convertFromOptType(ToCCI);
+ assert(ToCCI.hasConverted());
+ Ele.first->replaceAllUsesWith(ToCCI.getConverted());
+ // The old PHI is no longer used
+ ThePHIInst->eraseFromParent();
+ MadeChange = true;
+ }
+ }
+ return MadeChange;
+}
+
+Type *LiveRegOptimizer::getCompatibleType(Instruction *InstToConvert) {
+ Type *OriginalType = InstToConvert->getType();
+ assert(OriginalType->getScalarSizeInBits() <=
+ ConvertToScalar->getScalarSizeInBits());
+ VectorType *VTy = dyn_cast<VectorType>(OriginalType);
+ if (!VTy)
+ return ConvertToScalar;
+
+ TypeSize OriginalSize = DL->getTypeSizeInBits(VTy);
+ TypeSize ConvertScalarSize = DL->getTypeSizeInBits(ConvertToScalar);
+ unsigned ConvertEltCount =
+ (OriginalSize + ConvertScalarSize - 1) / ConvertScalarSize;
+
+ if (OriginalSize <= ConvertScalarSize)
+ return IntegerType::get(Mod->getContext(), ConvertScalarSize);
+
+ return VectorType::get(Type::getIntNTy(Mod->getContext(), ConvertScalarSize),
+ ElementCount::getFixed(ConvertEltCount));
+}
+
+void LiveRegOptimizer::convertToOptType(ConversionCandidateInfo &LR) {
+ if (LR.hasConverted()) {
+ LLVM_DEBUG(dbgs() << "\tAlready has converted def\n");
+ return;
+ }
+
+ VectorType *VTy = cast<VectorType>(LR.getOriginalType());
+ Type *NewTy = LR.getNewType();
+
+ TypeSize OriginalSize = DL->getTypeSizeInBits(VTy);
+ TypeSize NewSize = DL->getTypeSizeInBits(NewTy);
+
+ auto &Builder = LR.getConvertBuilder();
+ Value *V = cast<Value>(LR.getLiveRegDef());
+ // If there is a bitsize match, we can fit the old vector into a new vector of
+ // desired type
+ if (OriginalSize == NewSize) {
+ LR.setConverted(cast<Instruction>(Builder.CreateBitCast(V, NewTy)));
+ LLVM_DEBUG(dbgs() << "\tConverted def to " << *LR.getConverted()->getType()
+ << '\n');
+ return;
+ }
+
+ // If there is a bitsize mismatch, we must use a wider vector
+ assert(NewSize > OriginalSize);
+ ElementCount ExpandedVecElementCount =
+ ElementCount::getFixed(NewSize / VTy->getScalarSizeInBits());
+
+ SmallVector<int, 8> ShuffleMask;
+ for (unsigned I = 0; I < VTy->getElementCount().getFixedValue(); I++)
+ ShuffleMask.push_back(I);
+
+ for (uint64_t I = VTy->getElementCount().getFixedValue();
+ I < ExpandedVecElementCount.getFixedValue(); I++)
+ ShuffleMask.push_back(VTy->getElementCount().getFixedValue());
+
+ Instruction *ExpandedVec =
+ cast<Instruction>(Builder.CreateShuffleVector(V, ShuffleMask));
+ LR.setConverted(cast<Instruction>(Builder.CreateBitCast(ExpandedVec, NewTy)));
+ LLVM_DEBUG(dbgs() << "\tConverted def to " << *LR.getConverted()->getType()
+ << '\n');
+ return;
+}
+
+void LiveRegOptimizer::convertFromOptType(ConversionCandidateInfo &LRC) {
+ Type *OTy = LRC.getOriginalType();
+ VectorType *NewVTy = cast<VectorType>(LRC.getNewType());
+
+ TypeSize OriginalSize = DL->getTypeSizeInBits(OTy);
+ TypeSize NewSize = DL->getTypeSizeInBits(NewVTy);
+
+ auto &Builder = LRC.getConvertBuilder();
+ Value *V = cast<Value>(LRC.getLiveRegDef());
+ // If there is a bitsize match, we simply convert back to the original type
+ if (OriginalSize == NewSize) {
+ LRC.setConverted(cast<Instruction>(Builder.CreateBitCast(V, NewVTy)));
+ LLVM_DEBUG(dbgs() << "\tProduced for user: " << *LRC.getConverted()
+ << '\n');
+ return;
+ }
+
+ if (!OTy->isVectorTy()) {
+ Instruction *Trunc = cast<Instruction>(Builder.CreateTrunc(
+ LRC.getLiveRegDef(), IntegerType::get(Mod->getContext(), NewSize)));
+ Instruction *Original =
+ cast<Instruction>(Builder.CreateBitCast(Trunc, NewVTy));
+ LRC.setConverted(cast<Instruction>(Original));
+ LLVM_DEBUG(dbgs() << "\tProduced for user: " << *LRC.getConverted()
+ << '\n');
+ return;
+ }
+
+ // If there is a bitsize mismatch, we have used a wider vector and must strip
+ // the MSBs to convert back to the original type
+ assert(OriginalSize > NewSize);
+ ElementCount ExpandedVecElementCount =
+ ElementCount::getFixed(OriginalSize / NewVTy->getScalarSizeInBits());
+ VectorType *ExpandedVT = VectorType::get(
+ Type::getIntNTy(Mod->getContext(), NewVTy->getScalarSizeInBits()),
+ ExpandedVecElementCount);
+ Instruction *Converted =
+ cast<Instruction>(Builder.CreateBitCast(LRC.getLiveRegDef(), ExpandedVT));
+
+ unsigned NarrowElementCount = NewVTy->getElementCount().getFixedValue();
+ SmallVector<int, 8> ShuffleMask(NarrowElementCount);
+ std::iota(ShuffleMask.begin(), ShuffleMask.end(), 0);
+
+ Instruction *NarrowVec =
+ cast<Instruction>(Builder.CreateShuffleVector(Converted, ShuffleMask));
+ LRC.setConverted(cast<Instruction>(NarrowVec));
+ LLVM_DEBUG(dbgs() << "\tProduced for user: " << *LRC.getConverted() << '\n');
+ return;
+}
+
+bool LiveRegOptimizer::shouldReplaceUses(const Instruction &I) {
+ // Vectors of illegal types are copied across blocks in an inefficient manner.
+ // They are scalarized and widened to legal scalars. In such cases, we can do
+ // better by using legal vector types
+ Type *IType = I.getType();
+ return IType->isVectorTy() && IType->getScalarSizeInBits() < 16 &&
+ !I.getType()->getScalarType()->isPointerTy();
+}
+
bool AMDGPULateCodeGenPrepare::canWidenScalarExtLoad(LoadInst &LI) const {
unsigned AS = LI.getPointerAddressSpace();
// Skip non-constant address space.
@@ -119,7 +448,7 @@ bool AMDGPULateCodeGenPrepare::canWidenScalarExtLoad(LoadInst &LI) const {
// Skip non-simple loads.
if (!LI.isSimple())
return false;
- auto *Ty = LI.getType();
+ Type *Ty = LI.getType();
// Skip aggregate types.
if (Ty->isAggregateType())
return false;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index 305a6c8c3b9262..c15481336075e6 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -1150,10 +1150,10 @@ bool GCNPassConfig::addPreISel() {
AMDGPUPassConfig::addPreISel();
if (TM->getOptLevel() > CodeGenOptLevel::None)
- addPass(createAMDGPULateCodeGenPreparePass());
+ addPass(createSinkingPass());
if (TM->getOptLevel() > CodeGenOptLevel::None)
- addPass(createSinkingPass());
+ addPass(createAMDGPULateCodeGenPreparePass());
// Merge divergent exit nodes. StructurizeCFG won't recognize the multi-exit
// regions formed by them.
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
index 84320d296a037b..2fe2d506fa4308 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
@@ -306,6 +306,18 @@ bool GCNTTIImpl::hasBranchDivergence(const Function *F) const {
return !F || !ST->isSingleLaneExecution(*F);
}
+unsigned GCNTTIImpl::getNumberOfParts(Type *Tp) const {
+ if (auto VTy = dyn_cast<FixedVectorType>(...
[truncated]
Intrinsically it makes no sense for us to prefer i8 anything. It's a 32-bit architecture. Everything is a 32-bit register. Vectors of i32 should be the canonically preferred type for anything wide. Why would we want i8?
I think we should prefer vectors of i8 over scalar i8s for PHIs because vectors of i8s for PHIs will be coerced to vectors of i32 to work around the legalization issues. So, in that case, vectorization is profitable.
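The coercion referred to here is what the LiveRegOptimizer hunk above performs. For the exact-bitwidth case it amounts to a pair of bitcasts around the block boundary; a minimal sketch, assuming a <4 x i8> value and IRBuilders positioned in the defining block and at the first non-PHI point of the using block:

```cpp
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Pack a <4 x i8> def into a single i32 in its defining block, so the value
// crosses the block boundary in one 32-bit register (exact-fit case).
static Value *packForBlockBoundary(IRBuilder<> &B, Value *V4I8) {
  return B.CreateBitCast(V4I8, B.getInt32Ty(), "packed");
}

// Unpack the i32 back to <4 x i8> in the user block, restoring the original
// type for the remaining uses.
static Value *unpackAtUse(IRBuilder<> &B, Value *PackedI32) {
  return B.CreateBitCast(PackedI32, FixedVectorType::get(B.getInt8Ty(), 4),
                         "unpacked");
}
```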
Description should be clarified, because "for the vectorizers" and "for codegen" are different questions.
I'm not sure why we need a dedicated phi function. It should essentially be the type legalization cost for any value. Phis aren't a real operation, they are just split to legal registers
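One way to read this suggestion (a sketch of the reviewer's alternative, not code from the PR): price a PHI through the generic type-legalization query rather than a dedicated hook, for example via the target's getCFInstrCost hook. The body below is illustrative; the existing GCN override differs.

```cpp
// Sketch: a PHI is not a real operation, so charge the cost of legalizing the
// incoming value's type (roughly, how many legal pieces it splits into).
InstructionCost GCNTTIImpl::getCFInstrCost(unsigned Opcode,
                                           TTI::TargetCostKind CostKind,
                                           const Instruction *I) {
  if (Opcode == Instruction::PHI && I) {
    std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(I->getType());
    return LT.first;
  }
  return BaseT::getCFInstrCost(Opcode, CostKind, I);
}
```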
Change-Id: Ia995bc646e5f050083bd6277eeabe0b5ab410f47
Force-pushed from 1262897 to 69ecffc
Rebase. Removed getPHIScalarizationOverhead and vectorizing i8 shuffles in favor of #91016; also removed the dependency on #66838. This PR now simply tells SLP to analyze i8s by changing the vectorization factor. The cost model seems okay as is, with perhaps some overestimation of vector cost for some ops. The vectorization in the lit tests is due to the reduced cost of vectorizing versus extracting/inserting for identity vectors.
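A minimal sketch of what such a getNumberOfParts change could look like follows. The diff above is truncated exactly at this function, so the body here is an illustration of the stated intent, not the patch's actual code.

```cpp
// Illustrative body only: report small fixed i8 vectors as a single "part"
// instead of deferring to legalization (which splits/widens them and makes
// SLP's vectorization factor collapse back to scalars). The <= 4 element
// threshold is an assumption for this sketch.
unsigned GCNTTIImpl::getNumberOfParts(Type *Tp) const {
  if (auto *VTy = dyn_cast<FixedVectorType>(Tp))
    if (VTy->getElementType()->isIntegerTy(8) && VTy->getNumElements() <= 4)
      return 1;
  // Otherwise fall back to the legalization-based count, as in the base class.
  std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Tp);
  return LT.first.isValid() ? *LT.first.getValue() : 0;
}
```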
Change-Id: If5ac53d5235ee8c65c53454b209c9f155c17edc4
After #66838, vectors of i8s will be preferable (relative to the scalarized equivalent) as incoming values for PHI nodes, since they enable the legalization workaround and thereby improve codegen.
After the mentioned PR, we will work around the normal SelectionDAG legalization for vector i8 PHIs, so in this work we also bypass the legalization checks when initially deciding whether or not to vectorize the type (GCNTTIImpl::getNumberOfParts). The effect is to have SLP consider i8s for vectorization.
#95328 introduces the cost hook needed to capture the vectorization conditions (where the vectors are used by #66838).
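For illustration, a hypothetical input pattern (not one of the lit tests) where this matters: a loop accumulating four byte lanes keeps four i8 values live across iterations, i.e. four scalar i8 PHIs that SLP can now consider turning into a single <4 x i8> PHI.

```cpp
// Hypothetical example: the four lane accumulators become i8 PHIs in the
// loop header. With i8 vectorization allowed, SLP can keep them as one
// <4 x i8> value instead of four scalar PHIs.
void accumulateLanes(const unsigned char *Src, unsigned char Dst[4], int N) {
  unsigned char L0 = 0, L1 = 0, L2 = 0, L3 = 0;
  for (int I = 0; I < N; ++I) {
    L0 += Src[4 * I + 0];
    L1 += Src[4 * I + 1];
    L2 += Src[4 * I + 2];
    L3 += Src[4 * I + 3];
  }
  Dst[0] = L0;
  Dst[1] = L1;
  Dst[2] = L2;
  Dst[3] = L3;
}
```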