[LV] Don't predicate divs with invariant divisor when folding tail #98904
@@ -1402,7 +1402,8 @@ class LoopVectorizationCostModel {
   /// Returns true if \p I is an instruction that needs to be predicated
   /// at runtime. The result is independent of the predication mechanism.
   /// Superset of instructions that return true for isScalarWithPredication.
-  bool isPredicatedInst(Instruction *I) const;
+  bool isPredicatedInst(Instruction *I, ElementCount VF,
+                        bool IsKnownUniform = false) const;

   /// Return the costs for our two available strategies for lowering a
   /// div/rem operation which requires speculating at least one lane.
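For context (not part of the patch), a minimal sketch of why a div/rem can need predication in the first place: when the division is control-dependent, a vectorizer that flattens the branch must not execute the divide on masked-off lanes, since those lanes may have a zero divisor. The function below is an illustrative stand-in, not LLVM code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar loop: the division is guarded by b[i] != 0, so the
// sdiv is control-dependent. A vectorizer that if-converts the branch must
// predicate (or otherwise guard) the divide -- executing it unconditionally
// could trap on a lane where b[i] == 0.
std::vector<int> guardedDiv(const std::vector<int> &a,
                            const std::vector<int> &b) {
  std::vector<int> out(a.size(), 0);
  for (std::size_t i = 0; i < a.size(); ++i)
    if (b[i] != 0) // guard: divide only on lanes with a nonzero divisor
      out[i] = a[i] / b[i];
  return out;
}
```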
@@ -3637,7 +3638,7 @@ void LoopVectorizationCostModel::collectLoopScalars(ElementCount VF) {

 bool LoopVectorizationCostModel::isScalarWithPredication(
     Instruction *I, ElementCount VF) const {
-  if (!isPredicatedInst(I))
+  if (!isPredicatedInst(I, VF))
     return false;

   // Do we have a non-scalar lowering for this predicated
@@ -3676,7 +3677,9 @@ bool LoopVectorizationCostModel::isScalarWithPredication(
   }
 }

-bool LoopVectorizationCostModel::isPredicatedInst(Instruction *I) const {
+bool LoopVectorizationCostModel::isPredicatedInst(Instruction *I,
+                                                  ElementCount VF,
+                                                  bool IsKnownUniform) const {
   if (!blockNeedsPredicationForAnyReason(I->getParent()))
     return false;
@@ -3710,6 +3713,15 @@ bool LoopVectorizationCostModel::isPredicatedInst(Instruction *I) const {
   case Instruction::SDiv:
   case Instruction::SRem:
   case Instruction::URem:
+    // When folding the tail, at least one of the lanes must execute
+    // unconditionally. If the divisor is loop-invariant no predication is
+    // needed, as predication would not prevent the divide-by-0 on the executed
+    // lane.
+    if (foldTailByMasking() && !Legal->blockNeedsPredication(I->getParent()) &&
+        TheLoop->isLoopInvariant(I->getOperand(1)) &&
+        (IsKnownUniform || isUniformAfterVectorization(I, VF)))
+      return false;

Review comment: This is similar to unpredicating loads and stores above, and should have consistent explanations and implementations. Checking !Legal->blockNeedsPredication(I->getParent()) implies foldTailByMasking(), given that we passed blockNeedsPredicationForAnyReason(I->getParent()) to get here. Why are IsKnownUniform and isUniformAfterVectorization needed? Would Legal->isInvariant(I->getOperand(1)) suffice, as in the load-from-uniform-address case above? A side-effecting instruction under mask could be unmasked provided (a) its side-effect operand(s) are uniform (divisor for div/rem, address for load, address and value for store), and (b) its mask is known to have its first lane set (conceivably extensible to having any lane set?). This is the case for instructions subject only to the fold-tail mask. It would be good to have an explicit API for this.

Author reply: Thanks! I restructured the code and generalized it so the unconditionally-executed but tail-folded cases are all handled at the end. Removed the IsKnownUniform requirement.

   // TODO: We can use the loop-preheader as context point here and get
   // context sensitive reasoning
   return !isSafeToSpeculativelyExecute(I);
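The argument in the new code comment can be illustrated with a small simulation (illustrative only; names and structure are not LLVM APIs): under tail folding, lane 0 of every vector iteration is active, so if the loop-invariant divisor were zero, the original scalar loop would divide by zero as well. Predication therefore buys nothing, and the divide can run unmasked on all lanes.

```cpp
#include <array>
#include <cassert>

// Simplified model of one tail-folded vector iteration with VF = 4.
// Under tail folding at least mask[0] is always true, so an unmasked
// division by a loop-invariant divisor is no less safe than the scalar
// loop: any zero divisor would trap on the active lane there too.
constexpr int VF = 4;

std::array<int, VF> foldedDiv(const std::array<int, VF> &a,
                              int invariantDivisor,
                              const std::array<bool, VF> &mask) {
  std::array<int, VF> out{};
  // Unpredicated divide on every lane: safe iff invariantDivisor != 0,
  // which the scalar loop already requires for its executed iterations.
  for (int l = 0; l < VF; ++l)
    out[l] = a[l] / invariantDivisor;
  // Results of masked-off (tail) lanes are simply discarded.
  for (int l = 0; l < VF; ++l)
    if (!mask[l])
      out[l] = 0;
  return out;
}
```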
@@ -3907,7 +3919,7 @@ void LoopVectorizationCostModel::collectLoopUniforms(ElementCount VF) {
   SetVector<Instruction *> Worklist;

   // Add uniform instructions demanding lane 0 to the worklist. Instructions
-  // that are scalar with predication must not be considered uniform after
+  // that are require predication must not be considered uniform after
   // vectorization, because that would create an erroneous replicating region
   // where only a single instance out of VF should be formed.
   // TODO: optimize such seldom cases if found important, see PR40816.

Review comment (suggested change): the new comment has a typo; "that are require predication" should read "that require predication".

Author reply: (Should have been fixed by earlier PR)
@@ -3917,9 +3929,10 @@ void LoopVectorizationCostModel::collectLoopUniforms(ElementCount VF) {
                       << *I << "\n");
     return;
   }
-  if (isScalarWithPredication(I, VF)) {
-    LLVM_DEBUG(dbgs() << "LV: Found not uniform being ScalarWithPredication: "
-                      << *I << "\n");
+  if (isPredicatedInst(I, VF, true)) {
+    LLVM_DEBUG(
+        dbgs() << "LV: Found not uniform due to requiring predication: " << *I
+               << "\n");
     return;
   }
   LLVM_DEBUG(dbgs() << "LV: Found uniform instruction: " << *I << "\n");

Review comment: suggested an alternative change here.

Author reply: Code is gone in the latest version, thanks!
@@ -5633,7 +5646,7 @@ bool LoopVectorizationCostModel::useEmulatedMaskMemRefHack(Instruction *I,
   // from moving "masked load/store" check from legality to cost model.
   // Masked Load/Gather emulation was previously never allowed.
   // Limited number of Masked Store/Scatter emulation was allowed.
-  assert((isPredicatedInst(I)) &&
+  assert((isPredicatedInst(I, VF)) &&
          "Expecting a scalar emulated instruction");
   return isa<LoadInst>(I) ||
          (isa<StoreInst>(I) &&
@@ -5912,7 +5925,7 @@ LoopVectorizationCostModel::getMemInstScalarizationCost(Instruction *I,
   // If we have a predicated load/store, it will need extra i1 extracts and
   // conditional branches, but may not be executed for each vector lane. Scale
   // the cost by the probability of executing the predicated block.
-  if (isPredicatedInst(I)) {
+  if (isPredicatedInst(I, VF)) {
     Cost /= getReciprocalPredBlockProb();

     // Add the cost of an i1 extract and a branch
@@ -6772,7 +6785,7 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I,
   case Instruction::SDiv:
   case Instruction::URem:
   case Instruction::SRem:
-    if (VF.isVector() && isPredicatedInst(I)) {
+    if (VF.isVector() && isPredicatedInst(I, VF)) {
       const auto [ScalarCost, SafeDivisorCost] = getDivRemSpeculationCost(I, VF);
       return isDivRemScalarWithPredication(ScalarCost, SafeDivisorCost) ?
         ScalarCost : SafeDivisorCost;
@@ -8444,7 +8457,7 @@ bool VPRecipeBuilder::shouldWiden(Instruction *I, VFRange &Range) const {

 VPWidenRecipe *VPRecipeBuilder::tryToWiden(Instruction *I,
                                            ArrayRef<VPValue *> Operands,
-                                           VPBasicBlock *VPBB) {
+                                           VPBasicBlock *VPBB, VFRange &Range) {
   switch (I->getOpcode()) {
   default:
     return nullptr;
@@ -8454,7 +8467,10 @@ VPWidenRecipe *VPRecipeBuilder::tryToWiden(Instruction *I,
   case Instruction::URem: {
     // If not provably safe, use a select to form a safe divisor before widening the
     // div/rem operation itself. Otherwise fall through to general handling below.
-    if (CM.isPredicatedInst(I)) {
+    bool IsPredicated = LoopVectorizationPlanner::getDecisionAndClampRange(
+        [&](ElementCount VF) -> bool { return CM.isPredicatedInst(I, VF); },
+        Range);
+    if (IsPredicated) {
       SmallVector<VPValue *> Ops(Operands.begin(), Operands.end());
       VPValue *Mask = getBlockInMask(I->getParent());
       VPValue *One =
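The safe-divisor strategy this recipe builds — select the divisor against the mask so masked-off lanes divide by 1 — can be sketched as follows. This is an illustrative model of the transform's semantics, not the VPlan recipe code itself.

```cpp
#include <array>
#include <cassert>

// Sketch of the safe-divisor transform: replace the divisor with
// select(mask, b, 1), so lanes whose mask bit is off divide by 1 instead
// of potentially trapping on their (never-used) divisor value.
constexpr int VF = 4;

std::array<int, VF> safeDivisorDiv(const std::array<int, VF> &a,
                                   const std::array<int, VF> &b,
                                   const std::array<bool, VF> &mask) {
  std::array<int, VF> out{};
  for (int l = 0; l < VF; ++l) {
    int safeB = mask[l] ? b[l] : 1; // select(mask, b, 1)
    out[l] = a[l] / safeB;          // never divides by a masked-off lane's b
  }
  return out;
}
```

Note that the masked-off lanes still produce a value (a[l] / 1), but those results are never observed, which is why the divide can be widened unconditionally afterwards.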
@@ -8504,8 +8520,8 @@ VPReplicateRecipe *VPRecipeBuilder::handleReplication(Instruction *I,
       [&](ElementCount VF) { return CM.isUniformAfterVectorization(I, VF); },
       Range);

-  bool IsPredicated = CM.isPredicatedInst(I);
-
+  bool IsPredicated = LoopVectorizationPlanner::getDecisionAndClampRange(
+      [&](ElementCount VF) { return CM.isPredicatedInst(I, VF); }, Range);
   // Even if the instruction is not marked as uniform, there are certain
   // intrinsic calls that can be effectively treated as such, so we check for
   // them here. Conservatively, we only do this for scalable vectors, since
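Since isPredicatedInst is now per-VF, the builder routes it through getDecisionAndClampRange: take the decision for the range's start VF, then clamp the range so every remaining VF agrees with it. A simplified model (plain ints for VFs, half-open power-of-two range; the real API uses ElementCount and VFRange):

```cpp
#include <cassert>
#include <functional>

// Simplified model of LoopVectorizationPlanner::getDecisionAndClampRange:
// evaluate the predicate at Range.Start, then shrink Range.End at the
// first VF whose decision disagrees, so the whole (clamped) range shares
// one decision.
struct VFRange { int Start; int End; }; // VFs in [Start, End), powers of two

bool getDecisionAndClampRange(const std::function<bool(int)> &Predicate,
                              VFRange &Range) {
  bool Decision = Predicate(Range.Start);
  for (int VF = Range.Start * 2; VF < Range.End; VF *= 2)
    if (Predicate(VF) != Decision) {
      Range.End = VF; // first disagreeing VF ends the range
      break;
    }
  return Decision;
}
```

A disagreeing VF is not lost: the planner builds a separate VPlan for the remainder of the original range.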
@@ -8625,7 +8641,7 @@ VPRecipeBuilder::tryToCreateWidenRecipe(Instruction *Instr,
                                 *CI);
   }

-  return tryToWiden(Instr, Operands, VPBB);
+  return tryToWiden(Instr, Operands, VPBB, Range);
 }

 void LoopVectorizationPlanner::buildVPlansWithVPRecipes(ElementCount MinVF,
@@ -9482,6 +9498,8 @@ void VPInterleaveRecipe::execute(VPTransformState &State) {
 void VPReplicateRecipe::execute(VPTransformState &State) {
   Instruction *UI = getUnderlyingInstr();
   if (State.Instance) { // Generate a single instance.
+    assert((State.VF.isScalar() || !isUniform()) &&
+           "uniform recipe shouldn't be predicated");
     assert(!State.VF.isScalable() && "Can't scalarize a scalable vector");
     State.ILV->scalarizeInstruction(UI, this, *State.Instance, State);
     // Insert scalar instance packing it into a vector.
Review comment: Should the case of I being isSafeToSpeculativelyExecute() be handled here at the outset, along with the case of I's block not needing predication for any reason? Legal's isMaskRequired() deals only with loads, stores, and calls (in particular "assume" intrinsics, which get dropped rather than predicated), and is tail-folding aware. Should originally unconditional loads and stores (of invariant values) from invariant addresses be excluded from MaskedOps? I.e., should Legal's blockCanBePredicated() take care of unmasking such loads and stores under tail-folding, alongside its unmasking of loads from safe pointers? Furthermore, is the distinction between CM.isPredicatedInst(I) and Legal.isMaskRequired(I) clear and desired, or could the collection of all MaskedOps/PredicatedInsts be done once? Masks were and still are introduced only where necessary, as a functional requirement rather than a performance preference.

Author reply: Reorganized to combine all checks at the top, thanks! As for unifying the logic, it might be good to do that separately, but it definitely seems worthwhile given that the decisions are independent of cost.
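The structure the discussion converged on — handle all the unconditionally-executed, tail-folded-only cases up front before per-opcode logic — can be sketched as below. This is a hypothetical simplification over boolean inputs, not the actual LLVM implementation; the parameter names are illustrative.

```cpp
#include <cassert>

// Hypothetical sketch of the combined early checks in isPredicatedInst:
// if nothing masks the block, no predication; if the block is masked only
// by the fold-tail mask (lane 0 always executes), speculatable instructions
// and side-effecting ones whose side-effect operands are invariant/uniform
// also need no predication; otherwise conservatively predicate.
bool isPredicatedInstSketch(bool blockNeedsPredicationAnyReason,
                            bool blockNeedsPredicationLegal,
                            bool safeToSpeculate,
                            bool sideEffectOperandsInvariant) {
  if (!blockNeedsPredicationAnyReason)
    return false; // block is never masked
  if (!blockNeedsPredicationLegal &&
      (safeToSpeculate || sideEffectOperandsInvariant))
    return false; // only the fold-tail mask applies: safe to unmask
  return true;
}
```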