[LV] Don't predicate divs with invariant divisor when folding tail #98904

Merged 16 commits on Jul 25, 2024
Changes from 2 commits
48 changes: 33 additions & 15 deletions llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
Expand Up @@ -1402,7 +1402,8 @@ class LoopVectorizationCostModel {
/// Returns true if \p I is an instruction that needs to be predicated
/// at runtime. The result is independent of the predication mechanism.
/// Superset of instructions that return true for isScalarWithPredication.
bool isPredicatedInst(Instruction *I) const;
bool isPredicatedInst(Instruction *I, ElementCount VF,
bool IsKnownUniform = false) const;

/// Return the costs for our two available strategies for lowering a
/// div/rem operation which requires speculating at least one lane.
Expand Down Expand Up @@ -3637,7 +3638,7 @@ void LoopVectorizationCostModel::collectLoopScalars(ElementCount VF) {

bool LoopVectorizationCostModel::isScalarWithPredication(
Instruction *I, ElementCount VF) const {
if (!isPredicatedInst(I))
if (!isPredicatedInst(I, VF))
return false;

// Do we have a non-scalar lowering for this predicated
Expand Down Expand Up @@ -3676,7 +3677,9 @@ bool LoopVectorizationCostModel::isScalarWithPredication(
}
}

bool LoopVectorizationCostModel::isPredicatedInst(Instruction *I) const {
bool LoopVectorizationCostModel::isPredicatedInst(Instruction *I,
ElementCount VF,
bool IsKnownUniform) const {
if (!blockNeedsPredicationForAnyReason(I->getParent()))
return false;
Collaborator
Suggested change
return false;
// If predication is not needed, avoid it.
if (!blockNeedsPredicationForAnyReason(I->getParent())
|| isSafeToSpeculativelyExecute(I))
|| !Legal->isMaskRequired(I)
|| isa<BranchInst, PHINode>(I))
return false;

Should the case of I being isSafeToSpeculativelyExecute() be handled here at the outset, along with the case of I's block not needing predication for any reason?

Legal's isMaskRequired() deals only with loads, stores, and calls (in particular "assume" intrinsics - which get dropped rather than predicated), and is tail-folding aware. Should originally unconditional loads and stores (of invariant values) from invariant addresses be excluded from MaskedOps? I.e., should Legal's blockCanBePredicated() take care of unmasking such loads and stores under tail-folding, alongside its unmasking of loads from safe pointers? Furthermore, is the distinction between CM.isPredicatedInst(I) and Legal.isMaskRequired(I) clear and desired, or could the collection of all MaskedOps/PredicatedInsts be done once? Masks were and still are introduced only where necessary, as a functional requirement rather than a performance preference.

Contributor Author
Reorganized to combine all checks at the top, thanks!

As for unifying the logic, it might be good to do that separately, but it definitely seems worthwhile given that the decisions are independent of cost.


Expand Down Expand Up @@ -3710,6 +3713,15 @@ bool LoopVectorizationCostModel::isPredicatedInst(Instruction *I) const {
case Instruction::SDiv:
case Instruction::SRem:
case Instruction::URem:
// When folding the tail, at least one of the lanes must execute
// unconditionally. If the divisor is loop-invariant no predication is
// needed, as predication would not prevent the divide-by-0 on the executed
// lane.
if (foldTailByMasking() && !Legal->blockNeedsPredication(I->getParent()) &&
TheLoop->isLoopInvariant(I->getOperand(1)) &&
(IsKnownUniform || isUniformAfterVectorization(I, VF)))
return false;
Collaborator
This is similar to unpredicating loads and stores above, and should have consistent explanations and implementations. Checking if !Legal->blockNeedsPredication(I->getParent()) implies foldTailByMasking(), given that we passed blockNeedsPredicationForAnyReason(I->getParent()) to get here. Why are IsKnownUniform and isUniformAfterVectorization needed - would Legal->isInvariant(I->getOperand(1)) suffice, as in the load-from-uniform-address case above?

A side-effecting instruction under mask could be unmasked provided (a) its side-effect operand(s) is uniform (divisor for div/rem, address for load, address and value for store), and (b) its mask is known to have its first lane set (conceivably extensible to having any lane set?). This is the case for instructions subject only to fold-tail mask. Would be good to have an explicit API, say, MaskKnownToExcludeFirstLane().

Would it suffice to have

    if (Legal->isInvariant(I->getOperand(1)) &&
        !Legal->blockNeedsPredication(I->getParent()))
      return false;

as in the load-from-uniform-address case above?

Contributor Author
Thanks, I restructured the code and generalized it so the unconditionally-executed but tail-folded cases are all handled at the end.

Removed the IsKnownUniform requirement.


// TODO: We can use the loop-preheader as context point here and get
// context sensitive reasoning
return !isSafeToSpeculativelyExecute(I);
Expand Down Expand Up @@ -3907,7 +3919,7 @@ void LoopVectorizationCostModel::collectLoopUniforms(ElementCount VF) {
SetVector<Instruction *> Worklist;

// Add uniform instructions demanding lane 0 to the worklist. Instructions
// that are scalar with predication must not be considered uniform after
// that are require predication must not be considered uniform after
Member
Suggested change
// that are require predication must not be considered uniform after
// that require predication must not be considered uniform after

Contributor Author
(This should have been fixed by an earlier PR.)

// vectorization, because that would create an erroneous replicating region
// where only a single instance out of VF should be formed.
// TODO: optimize such seldom cases if found important, see PR40816.
Expand All @@ -3917,9 +3929,10 @@ void LoopVectorizationCostModel::collectLoopUniforms(ElementCount VF) {
<< *I << "\n");
return;
}
if (isScalarWithPredication(I, VF)) {
LLVM_DEBUG(dbgs() << "LV: Found not uniform being ScalarWithPredication: "
<< *I << "\n");
if (isPredicatedInst(I, VF, true)) {
Member
Suggested change
if (isPredicatedInst(I, VF, true)) {
if (isPredicatedInst(I, VF, /*IsKnownUniform=*/true)) {

Contributor Author
Code is gone in the latest version, thanks!

LLVM_DEBUG(
dbgs() << "LV: Found not uniform due to requiring predication: " << *I
<< "\n");
return;
}
LLVM_DEBUG(dbgs() << "LV: Found uniform instruction: " << *I << "\n");
Expand Down Expand Up @@ -5633,7 +5646,7 @@ bool LoopVectorizationCostModel::useEmulatedMaskMemRefHack(Instruction *I,
// from moving "masked load/store" check from legality to cost model.
// Masked Load/Gather emulation was previously never allowed.
// Limited number of Masked Store/Scatter emulation was allowed.
assert((isPredicatedInst(I)) &&
assert((isPredicatedInst(I, VF)) &&
"Expecting a scalar emulated instruction");
return isa<LoadInst>(I) ||
(isa<StoreInst>(I) &&
Expand Down Expand Up @@ -5912,7 +5925,7 @@ LoopVectorizationCostModel::getMemInstScalarizationCost(Instruction *I,
// If we have a predicated load/store, it will need extra i1 extracts and
// conditional branches, but may not be executed for each vector lane. Scale
// the cost by the probability of executing the predicated block.
if (isPredicatedInst(I)) {
if (isPredicatedInst(I, VF)) {
Cost /= getReciprocalPredBlockProb();

// Add the cost of an i1 extract and a branch
Expand Down Expand Up @@ -6772,7 +6785,7 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I,
case Instruction::SDiv:
case Instruction::URem:
case Instruction::SRem:
if (VF.isVector() && isPredicatedInst(I)) {
if (VF.isVector() && isPredicatedInst(I, VF)) {
const auto [ScalarCost, SafeDivisorCost] = getDivRemSpeculationCost(I, VF);
return isDivRemScalarWithPredication(ScalarCost, SafeDivisorCost) ?
ScalarCost : SafeDivisorCost;
Expand Down Expand Up @@ -8444,7 +8457,7 @@ bool VPRecipeBuilder::shouldWiden(Instruction *I, VFRange &Range) const {

VPWidenRecipe *VPRecipeBuilder::tryToWiden(Instruction *I,
ArrayRef<VPValue *> Operands,
VPBasicBlock *VPBB) {
VPBasicBlock *VPBB, VFRange &Range) {
switch (I->getOpcode()) {
default:
return nullptr;
Expand All @@ -8454,7 +8467,10 @@ VPWidenRecipe *VPRecipeBuilder::tryToWiden(Instruction *I,
case Instruction::URem: {
// If not provably safe, use a select to form a safe divisor before widening the
// div/rem operation itself. Otherwise fall through to general handling below.
if (CM.isPredicatedInst(I)) {
bool IsPredicated = LoopVectorizationPlanner::getDecisionAndClampRange(
[&](ElementCount VF) -> bool { return CM.isPredicatedInst(I, VF); },
Range);
if (IsPredicated) {
SmallVector<VPValue *> Ops(Operands.begin(), Operands.end());
VPValue *Mask = getBlockInMask(I->getParent());
VPValue *One =
Expand Down Expand Up @@ -8504,8 +8520,8 @@ VPReplicateRecipe *VPRecipeBuilder::handleReplication(Instruction *I,
[&](ElementCount VF) { return CM.isUniformAfterVectorization(I, VF); },
Range);

bool IsPredicated = CM.isPredicatedInst(I);

bool IsPredicated = LoopVectorizationPlanner::getDecisionAndClampRange(
[&](ElementCount VF) { return CM.isPredicatedInst(I, VF); }, Range);
// Even if the instruction is not marked as uniform, there are certain
// intrinsic calls that can be effectively treated as such, so we check for
// them here. Conservatively, we only do this for scalable vectors, since
Expand Down Expand Up @@ -8625,7 +8641,7 @@ VPRecipeBuilder::tryToCreateWidenRecipe(Instruction *Instr,
*CI);
}

return tryToWiden(Instr, Operands, VPBB);
return tryToWiden(Instr, Operands, VPBB, Range);
}

void LoopVectorizationPlanner::buildVPlansWithVPRecipes(ElementCount MinVF,
Expand Down Expand Up @@ -9482,6 +9498,8 @@ void VPInterleaveRecipe::execute(VPTransformState &State) {
void VPReplicateRecipe::execute(VPTransformState &State) {
Instruction *UI = getUnderlyingInstr();
if (State.Instance) { // Generate a single instance.
assert((State.VF.isScalar() || !isUniform()) &&
"uniform recipe shouldn't be predicated");
assert(!State.VF.isScalable() && "Can't scalarize a scalable vector");
State.ILV->scalarizeInstruction(UI, this, *State.Instance, State);
// Insert scalar instance packing it into a vector.
Expand Down
2 changes: 1 addition & 1 deletion llvm/lib/Transforms/Vectorize/VPRecipeBuilder.h
Expand Up @@ -100,7 +100,7 @@ class VPRecipeBuilder {
/// if it can. The function should only be called if the cost-model indicates
/// that widening should be performed.
VPWidenRecipe *tryToWiden(Instruction *I, ArrayRef<VPValue *> Operands,
VPBasicBlock *VPBB);
VPBasicBlock *VPBB, VFRange &Range);

public:
VPRecipeBuilder(VPlan &Plan, Loop *OrigLoop, const TargetLibraryInfo *TLI,
Expand Down