[VPlan] Unroll VPReplicateRecipe by VF. #142433

Merged Jun 26, 2025 · 14 commits

Changes from 8 commits
1 change: 1 addition & 0 deletions llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7328,6 +7328,7 @@ DenseMap<const SCEV *, Value *> LoopVectorizationPlanner::executePlan(
// cost model is complete for better cost estimates.
VPlanTransforms::runPass(VPlanTransforms::unrollByUF, BestVPlan, BestUF,
OrigLoop->getHeader()->getContext());
VPlanTransforms::runPass(VPlanTransforms::replicateByVF, BestVPlan, BestVF);
VPlanTransforms::runPass(VPlanTransforms::materializeBroadcasts, BestVPlan);
if (hasBranchWeightMD(*OrigLoop->getLoopLatch()->getTerminator()))
VPlanTransforms::runPass(VPlanTransforms::addBranchWeightToMiddleTerminator,
7 changes: 7 additions & 0 deletions llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -261,6 +261,13 @@ Value *VPTransformState::get(const VPValue *Def, const VPLane &Lane) {
return Data.VPV2Scalars[Def][0];
}

// Look through BuildVector to avoid redundant extracts.
Collaborator:

This is done now rather than later to reduce test diff?

Contributor Author:

Yep, without the fold we would have additional inserts/extracts for uses inside replicate regions, with corresponding test changes.
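As an editor's illustration of the fold (VPlan notation approximated, names invented): when a recipe inside a replicate region requests a fixed lane of

    EMIT vp<%v> = buildvector ir<%a>, ir<%b>, ir<%c>, ir<%d>

State.get(vp<%v>, VPLane(2)) now returns the value generated for ir<%c> directly, instead of first materializing the vector through an insertelement chain and then extracting lane 2 from it.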

// TODO: Remove once replicate regions are unrolled explicitly.
if (Lane.getKind() == VPLane::Kind::First && match(Def, m_BuildVector())) {
auto *BuildVector = cast<VPInstruction>(Def);
return get(BuildVector->getOperand(Lane.getKnownLane()), true);
}

assert(hasVectorValue(Def));
Collaborator:

Independent: missing error message.

auto *VecPart = Data.VPV2Vector[Def];
if (!VecPart->getType()->isVectorTy()) {
7 changes: 7 additions & 0 deletions llvm/lib/Transforms/Vectorize/VPlan.h
@@ -936,6 +936,13 @@ class VPInstruction : public VPRecipeWithIRFlags,
BranchOnCount,
BranchOnCond,
Broadcast,
/// Given operands of (the same) struct type, creates a struct of fixed-
/// width vectors each containing a struct field of all operands. The
/// number of operands matches the element count of every vector.
BuildStructVector,
Collaborator:

Lex order has BuildVector after BuildStructVector?

Contributor Author:

reordered, thanks
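A type-level illustration of the semantics (editor's sketch, not part of the patch): with VF = 2, the recipe takes two operands of the same struct type and produces the corresponding struct of fixed-width vectors, one vector per field:

    ; %a, %b : { float, i32 }
    ; buildstructvector %a, %b : { <2 x float>, <2 x i32> }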

/// Creates a fixed-width vector containing all operands. The number of
/// operands matches the vector element count.
BuildVector,
ComputeAnyOfResult,
ComputeFindLastIVResult,
ComputeReductionResult,
2 changes: 2 additions & 0 deletions llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -108,6 +108,8 @@ Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPInstruction *R) {
case VPInstruction::CalculateTripCountMinusVF:
case VPInstruction::CanonicalIVIncrementForPart:
case VPInstruction::AnyOf:
case VPInstruction::BuildStructVector:
case VPInstruction::BuildVector:
return SetResultTyFromOp();
case VPInstruction::FirstActiveLane:
return Type::getIntNTy(Ctx, 64);
14 changes: 14 additions & 0 deletions llvm/lib/Transforms/Vectorize/VPlanPatternMatch.h
@@ -224,6 +224,9 @@ struct Recipe_match {
if ((!matchRecipeAndOpcode<RecipeTys>(R) && ...))
return false;

auto *VPI = dyn_cast<VPInstruction>(R);
if (VPI && VPI->getOpcode() == VPInstruction::BuildVector)
return true;
Collaborator:

Suggested change:
-    auto *VPI = dyn_cast<VPInstruction>(R);
-    if (VPI && VPI->getOpcode() == VPInstruction::BuildVector)
-      return true;
+    // Finally match operands, except for BuildVector which is matched w/o
+    // checking its operands.
+    auto *VPI = dyn_cast<VPInstruction>(R);
+    if (VPI && VPI->getOpcode() == VPInstruction::BuildVector)
+      return true;

Collaborator:

Better to check instead if Ops_t is empty and if so assert that R is BuildVector and early return, or set NumOperands to zero instead of R->getNumOperands() and continue?

Contributor Author:

Moved up to handle the case first, thanks

assert(R->getNumOperands() == std::tuple_size<Ops_t>::value &&
"recipe with matched opcode does not have the expected number of "
"operands");
Collaborator:

Independent nit: worth noting below that "Commutative" checks operands in reverse order, which works best for binary operations.

@@ -263,6 +266,10 @@ struct Recipe_match {
}
};

template <unsigned Opcode, typename... RecipeTys>
using ZeroOpRecipe_match =
Recipe_match<std::tuple<>, Opcode, false, RecipeTys...>;

template <typename Op0_t, unsigned Opcode, typename... RecipeTys>
using UnaryRecipe_match =
Recipe_match<std::tuple<Op0_t>, Opcode, false, RecipeTys...>;
@@ -271,6 +278,9 @@ template <typename Op0_t, unsigned Opcode>
using UnaryVPInstruction_match =
UnaryRecipe_match<Op0_t, Opcode, VPInstruction>;

template <unsigned Opcode>
using ZeroOpVPInstruction_match = ZeroOpRecipe_match<Opcode, VPInstruction>;

template <typename Op0_t, unsigned Opcode>
using AllUnaryRecipe_match =
UnaryRecipe_match<Op0_t, Opcode, VPWidenRecipe, VPReplicateRecipe,
@@ -302,6 +312,10 @@ using AllBinaryRecipe_match =
BinaryRecipe_match<Op0_t, Op1_t, Opcode, Commutative, VPWidenRecipe,
VPReplicateRecipe, VPWidenCastRecipe, VPInstruction>;

inline ZeroOpVPInstruction_match<VPInstruction::BuildVector> m_BuildVector() {
return ZeroOpVPInstruction_match<VPInstruction::BuildVector>();
}

Collaborator:

Suggested change:
+  /// BuildVector matches only its opcode, w/o matching its operands.
   inline ZeroOpVPInstruction_match<VPInstruction::BuildVector> m_BuildVector() {
     return ZeroOpVPInstruction_match<VPInstruction::BuildVector>();
   }

plus some explanation why - number of operands varies?

Contributor Author:

Done thanks

template <unsigned Opcode, typename Op0_t>
inline UnaryVPInstruction_match<Op0_t, Opcode>
m_VPInstruction(const Op0_t &Op0) {
100 changes: 60 additions & 40 deletions llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -495,9 +495,9 @@ Value *VPInstruction::generate(VPTransformState &State) {
}
case Instruction::ExtractElement: {
assert(State.VF.isVector() && "Only extract elements from vectors");
Value *Vec = State.get(getOperand(0));
Value *Idx = State.get(getOperand(1), /*IsScalar=*/true);
return Builder.CreateExtractElement(Vec, Idx, Name);
unsigned IdxToExtract =
cast<ConstantInt>(getOperand(1)->getLiveInIRValue())->getZExtValue();
return State.get(getOperand(0), VPLane(IdxToExtract));
Collaborator:

The fact that the second operand must be constant should be documented and verified (if modeled as a general operand rather than a "flag").

Contributor Author:

I had to update the code here; there is code that may construct ExtractElement with non-constant lane values (early-exit vectorization with forced interleaving, where the index is the first active lane).

}
case Instruction::Freeze: {
Value *Op = State.get(getOperand(0), vputils::onlyFirstLaneUsed(this));
@@ -608,6 +608,32 @@ Value *VPInstruction::generate(VPTransformState &State) {
return Builder.CreateVectorSplat(
State.VF, State.get(getOperand(0), /*IsScalar*/ true), "broadcast");
}
case VPInstruction::BuildStructVector: {
// For struct types, we need to build a new 'wide' struct type, where each
// element is widened, i.e. we crate a struct of vectors .
Collaborator:

Suggested change:
-  // element is widened, i.e. we crate a struct of vectors .
+  // element is widened, i.e., we create a struct of vectors.

Contributor Author:

Fixed, thanks

auto *StructTy =
cast<StructType>(State.TypeAnalysis.inferScalarType(getOperand(0)));
Value *Res = PoisonValue::get(toVectorizedTy(StructTy, State.VF));
for (const auto &[Idx, Op] : enumerate(operands())) {
for (unsigned I = 0; I != StructTy->getNumElements(); I++) {
Collaborator:

Would it be better to interchange the loops, inserting all elements into each vector consecutively and then insert each vector into Res once?

Contributor Author:

Currently this matches the construction order of packScalarIntoVectorizedValue; it would probably be good to only adjust the order once its last user has been removed.

Currently that's still in State.set, but I will double-check whether that code is actually used for struct types (it should only be relevant for recipes in replicate regions).

Value *ScalarValue = Builder.CreateExtractValue(State.get(Op, true), I);
Value *VectorValue = Builder.CreateExtractValue(Res, I);
VectorValue =
Builder.CreateInsertElement(VectorValue, ScalarValue, Idx);
Res = Builder.CreateInsertValue(Res, VectorValue, I);
}
}
Collaborator:

Would be good to clarify that Idx represents an enumeration of vector elements while I represents an enumeration of struct fields. Admittedly StructTy->getNumElements() may also cause confusion. How about LaneIndex and FieldIndex?

Contributor Author:

Done thanks
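To make the nested loops concrete, here is an editor's sketch of roughly the IR this emits for VF = 2 and element type { float, i32 } (value names invented; later folds may clean up the redundant extract/insert pairs):

    ; lane 0 (Idx = 0), field 0 (I = 0):
    %s00 = extractvalue { float, i32 } %lane0, 0
    %v0  = extractvalue { <2 x float>, <2 x i32> } poison, 0
    %v0i = insertelement <2 x float> %v0, float %s00, i64 0
    %r0  = insertvalue { <2 x float>, <2 x i32> } poison, <2 x float> %v0i, 0
    ; lane 0 field 1, then lane 1 fields 0 and 1, follow the same pattern,
    ; threading the updated struct value through each step.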

return Res;
}
case VPInstruction::BuildVector: {
auto *ScalarTy = State.TypeAnalysis.inferScalarType(getOperand(0));
auto NumOfElements = ElementCount::getFixed(getNumOperands());
Value *Res = PoisonValue::get(toVectorizedTy(ScalarTy, NumOfElements));
for (const auto &[Idx, Op] : enumerate(operands()))
Res = State.Builder.CreateInsertElement(Res, State.get(Op, true),
State.Builder.getInt32(Idx));
return Res;
}
case VPInstruction::ReductionStartVector: {
if (State.VF.isScalar())
return State.get(getOperand(0), true);
@@ -894,10 +920,11 @@ void VPInstruction::execute(VPTransformState &State) {
if (!hasResult())
return;
assert(GeneratedValue && "generate must produce a value");
assert(
(GeneratedValue->getType()->isVectorTy() == !GeneratesPerFirstLaneOnly ||
State.VF.isScalar()) &&
"scalar value but not only first lane defined");
assert((((GeneratedValue->getType()->isVectorTy() ||
GeneratedValue->getType()->isStructTy()) ==
!GeneratesPerFirstLaneOnly) ||
State.VF.isScalar()) &&
"scalar value but not only first lane defined");
State.set(this, GeneratedValue,
/*IsScalar*/ GeneratesPerFirstLaneOnly);
}
@@ -911,6 +938,8 @@ bool VPInstruction::opcodeMayReadOrWriteFromMemory() const {
case Instruction::ICmp:
case Instruction::Select:
case VPInstruction::AnyOf:
case VPInstruction::BuildStructVector:
case VPInstruction::BuildVector:
case VPInstruction::CalculateTripCountMinusVF:
case VPInstruction::CanonicalIVIncrementForPart:
case VPInstruction::ExtractLastElement:
@@ -1033,6 +1062,12 @@ void VPInstruction::print(raw_ostream &O, const Twine &Indent,
case VPInstruction::Broadcast:
O << "broadcast";
break;
case VPInstruction::BuildStructVector:
O << "buildstructvector";
break;
case VPInstruction::BuildVector:
O << "buildvector";
break;
case VPInstruction::ExtractLastElement:
O << "extract-last-element";
break;
@@ -2627,45 +2662,30 @@ static void scalarizeInstruction(const Instruction *Instr,

void VPReplicateRecipe::execute(VPTransformState &State) {
Instruction *UI = getUnderlyingInstr();
if (State.Lane) { // Generate a single instance.
assert((State.VF.isScalar() || !isSingleScalar()) &&
"uniform recipe shouldn't be predicated");
assert(!State.VF.isScalable() && "Can't scalarize a scalable vector");
scalarizeInstruction(UI, this, *State.Lane, State);
// Insert scalar instance packing it into a vector.
if (State.VF.isVector() && shouldPack()) {
Value *WideValue;
// If we're constructing lane 0, initialize to start from poison.
if (State.Lane->isFirstLane()) {
assert(!State.VF.isScalable() && "VF is assumed to be non scalable.");
WideValue = PoisonValue::get(VectorType::get(UI->getType(), State.VF));
} else {
WideValue = State.get(this);
}
State.set(this, State.packScalarIntoVectorizedValue(this, WideValue,
*State.Lane));
}
return;
}

if (IsSingleScalar) {
// Uniform within VL means we need to generate lane 0.
if (!State.Lane) {
assert(IsSingleScalar &&
"VPReplicateRecipes outside replicate regions must be unrolled");
Collaborator:

Suggested change:
-           "VPReplicateRecipes outside replicate regions must be unrolled");
+           "VPReplicateRecipes outside replicate regions must have already been unrolled");

Contributor Author:

Updated thanks

scalarizeInstruction(UI, this, VPLane(0), State);
return;
}
Collaborator:

Handle the simpler single-scalar case first and early-return, asserting that one of the above two cases applies:

  if (!State.Lane) {
    assert(IsSingleScalar && "...");
    scalarizeInstruction(UI, this, VPLane(0), State);
    return;
  }

  // Generate a single instance.
  ...

?

Contributor Author:

Done thanks


// A store of a loop varying value to a uniform address only needs the last
// copy of the store.
if (isa<StoreInst>(UI) && vputils::isSingleScalar(getOperand(1))) {
auto Lane = VPLane::getLastLaneForVF(State.VF);
scalarizeInstruction(UI, this, VPLane(Lane), State);
return;
assert((State.VF.isScalar() || !isSingleScalar()) &&
"uniform recipe shouldn't be predicated");
Collaborator:

Suggested change:
           "uniform recipe shouldn't be predicated");
+    assert(!State.VF.isScalable() && "Can't scalarize a scalable vector");

retain the assert here?

Contributor Author:

Done thanks

scalarizeInstruction(UI, this, *State.Lane, State);
// Insert scalar instance packing it into a vector.
if (State.VF.isVector() && shouldPack()) {
Value *WideValue;
// If we're constructing lane 0, initialize to start from poison.
if (State.Lane->isFirstLane()) {
assert(!State.VF.isScalable() && "VF is assumed to be non scalable.");
WideValue = PoisonValue::get(VectorType::get(UI->getType(), State.VF));
} else {
WideValue = State.get(this);
}
Collaborator:

Suggested change:
-      if (State.Lane->isFirstLane()) {
-        assert(!State.VF.isScalable() && "VF is assumed to be non scalable.");
-        WideValue = PoisonValue::get(VectorType::get(UI->getType(), State.VF));
-      } else {
-        WideValue = State.get(this);
-      }
+      if (State.Lane->isFirstLane())
+        WideValue = PoisonValue::get(VectorType::get(UI->getType(), State.VF));
+      else
+        WideValue = State.get(this);

drop it from here?
Can also simplify using a ternary Value *WideValue = State.Lane->isFirstLane() ? ... : ... ;

Contributor Author:

Done thanks

State.set(this, State.packScalarIntoVectorizedValue(this, WideValue,
*State.Lane));
}

// Generate scalar instances for all VF lanes.
const unsigned EndLane = State.VF.getFixedValue();
for (unsigned Lane = 0; Lane < EndLane; ++Lane)
scalarizeInstruction(UI, this, VPLane(Lane), State);
}

bool VPReplicateRecipe::shouldPack() const {
18 changes: 18 additions & 0 deletions llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -1140,6 +1140,24 @@ static void simplifyRecipe(VPRecipeBase &R, VPTypeAnalysis &TypeInfo) {
return;
}

// Look through ExtractLastElement (BuildVector ....).
if (match(&R, m_VPInstruction<VPInstruction::ExtractLastElement>(
m_BuildVector()))) {
auto *BuildVector = cast<VPInstruction>(R.getOperand(0));
Def->replaceAllUsesWith(
BuildVector->getOperand(BuildVector->getNumOperands() - 1));
return;
}

// Look through ExtractPenultimateElement (BuildVector ....).
if (match(&R, m_VPInstruction<VPInstruction::ExtractPenultimateElement>(
m_BuildVector()))) {
auto *BuildVector = cast<VPInstruction>(R.getOperand(0));
Def->replaceAllUsesWith(
BuildVector->getOperand(BuildVector->getNumOperands() - 2));
return;
}

// Some simplifications can only be applied after unrolling. Perform them
// below.
if (!Plan->isUnrolled())
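Editor's sketch of the two folds above, in approximated VPlan debug notation (names invented):

    EMIT vp<%v>    = buildvector ir<%a>, ir<%b>, ir<%c>, ir<%d>
    EMIT vp<%last> = extract-last-element vp<%v>         ; all uses replaced by ir<%d>
    EMIT vp<%pen>  = extract-penultimate-element vp<%v>  ; all uses replaced by ir<%c>

If the buildvector is left without users, it can subsequently be removed as dead.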
6 changes: 6 additions & 0 deletions llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -99,6 +99,12 @@ struct VPlanTransforms {
/// Explicitly unroll \p Plan by \p UF.
static void unrollByUF(VPlan &Plan, unsigned UF, LLVMContext &Ctx);

/// Replace each VPReplicateRecipe outside of any replicate region in \p Plan
/// with \p VF single-scalar recipes.
/// TODO: Also replicate VPReplicateRecipes inside replicate regions, thereby
/// dissolving the latter.
static void replicateByVF(VPlan &Plan, ElementCount VF);

/// Optimize \p Plan based on \p BestVF and \p BestUF. This may restrict the
/// resulting plan to \p BestVF and \p BestUF.
static void optimizeForVFAndUF(VPlan &Plan, ElementCount BestVF,
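To illustrate the contract (editor's sketch; VPlan notation approximated, names invented), a replicating divide with VF = 4

    REPLICATE ir<%d> = sdiv ir<%x>, ir<%y>

becomes VF single-scalar clones, recombined by a BuildVector for any users that still need the vector value:

    REPLICATE ir<%d>.0 = sdiv vp<%x.0>, vp<%y.0>
    REPLICATE ir<%d>.1 = sdiv vp<%x.1>, vp<%y.1>
    REPLICATE ir<%d>.2 = sdiv vp<%x.2>, vp<%y.2>
    REPLICATE ir<%d>.3 = sdiv vp<%x.3>, vp<%y.3>
    EMIT vp<%vec> = buildvector ir<%d>.0, ir<%d>.1, ir<%d>.2, ir<%d>.3

where vp<%x.i>/vp<%y.i> are per-lane extracts of the original operands (or the operands themselves, if single-scalar).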
84 changes: 84 additions & 0 deletions llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp
@@ -15,6 +15,7 @@
#include "VPlan.h"
#include "VPlanAnalysis.h"
#include "VPlanCFG.h"
#include "VPlanHelpers.h"
#include "VPlanPatternMatch.h"
#include "VPlanTransforms.h"
#include "VPlanUtils.h"
@@ -450,3 +451,86 @@ void VPlanTransforms::unrollByUF(VPlan &Plan, unsigned UF, LLVMContext &Ctx) {

VPlanTransforms::removeDeadRecipes(Plan);
}

/// Create a single-scalar clone of \p RepR for lane \p Lane.
static VPReplicateRecipe *cloneForLane(VPlan &Plan, VPBuilder &Builder,
Type *IdxTy, VPReplicateRecipe *RepR,
VPLane Lane) {
// Collect the operands at Lane, creating extracts as needed.
SmallVector<VPValue *> NewOps;
for (VPValue *Op : RepR->operands()) {
if (vputils::isSingleScalar(Op)) {
NewOps.push_back(Op);
continue;
}
if (Lane.getKind() == VPLane::Kind::ScalableLast) {
NewOps.push_back(
Builder.createNaryOp(VPInstruction::ExtractLastElement, {Op}));
continue;
}
// Look through buildvector to avoid unnecessary extracts.
if (match(Op, m_BuildVector())) {
NewOps.push_back(
cast<VPInstruction>(Op)->getOperand(Lane.getKnownLane()));
continue;
}
VPValue *Idx =
Plan.getOrAddLiveIn(ConstantInt::get(IdxTy, Lane.getKnownLane()));
VPValue *Ext = Builder.createNaryOp(Instruction::ExtractElement, {Op, Idx});
NewOps.push_back(Ext);
}

auto *New =
new VPReplicateRecipe(RepR->getUnderlyingInstr(), NewOps,
Collaborator:

Could this be a VPInstruction?

Contributor Author:

Eventually yes, but more work is needed separately to use VPInstruction for single-scalar VPReplicateRecipes: #141429, #140623

/*IsSingleScalar=*/true, /*Mask=*/nullptr, *RepR);
New->insertBefore(RepR);
return New;
}

void VPlanTransforms::replicateByVF(VPlan &Plan, ElementCount VF) {
Type *IdxTy = IntegerType::get(
Plan.getScalarHeader()->getIRBasicBlock()->getContext(), 32);
for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(
vp_depth_first_shallow(Plan.getVectorLoopRegion()->getEntry()))) {
for (VPRecipeBase &R : make_early_inc_range(*VPBB)) {
auto *RepR = dyn_cast<VPReplicateRecipe>(&R);
if (!RepR || RepR->isSingleScalar())
continue;

VPBuilder Builder(RepR);
SmallVector<VPValue *> LaneDefs;
// Stores to invariant addresses need to store the last lane only.
Collaborator:

Hmm, should this be considered RepR->isSingleScalar() too, and/or converted to storing the last lane only independent of replicateByVF(), possibly using ExtractLast?

Contributor Author:

Yes, should be able to move this to convertToSingleScalar.
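Editor's sketch of the case under discussion (notation approximated): with VF = 4 and a single-scalar address ir<%addr>, the replicating store

    REPLICATE store ir<%v>, ir<%addr>

only needs a clone for the last lane, since stores of the earlier lanes to the same address would be overwritten anyway:

    REPLICATE store vp<%v.3>, ir<%addr>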

if (isa<StoreInst>(RepR->getUnderlyingInstr()) &&
vputils::isSingleScalar(RepR->getOperand(1))) {
cloneForLane(Plan, Builder, IdxTy, RepR, VPLane::getLastLaneForVF(VF));
RepR->eraseFromParent();
continue;
}

/// Create single-scalar version of RepR for all lanes.
for (unsigned I = 0; I != VF.getKnownMinValue(); ++I)
LaneDefs.push_back(cloneForLane(Plan, Builder, IdxTy, RepR, VPLane(I)));

if (RepR->getNumUsers() == 0) {
RepR->eraseFromParent();
continue;
}
Collaborator:

Perhaps better to deal with both useless cases first:

Suggested change:
-      if (isa<StoreInst>(RepR->getUnderlyingInstr()) &&
-          vputils::isSingleScalar(RepR->getOperand(1))) {
-        cloneForLane(Plan, Builder, IdxTy, RepR, VPLane::getLastLaneForVF(VF));
-        RepR->eraseFromParent();
-        continue;
-      }
-
-      /// Create single-scalar version of RepR for all lanes.
-      for (unsigned I = 0; I != VF.getKnownMinValue(); ++I)
-        LaneDefs.push_back(cloneForLane(Plan, Builder, IdxTy, RepR, VPLane(I)));
-
-      if (RepR->getNumUsers() == 0) {
-        RepR->eraseFromParent();
-        continue;
-      }
+      if (RepR->getNumUsers() == 0) {
+        if (isa<StoreInst>(RepR->getUnderlyingInstr()) &&
+            vputils::isSingleScalar(RepR->getOperand(1))) {
+          // Stores to invariant addresses need to store the last lane only.
+          cloneForLane(Plan, Builder, IdxTy, RepR, VPLane::getLastLaneForVF(VF));
+        } else {
+          // Create single-scalar version of RepR for all lanes.
+          for (unsigned I = 0; I != VF.getKnownMinValue(); ++I)
+            cloneForLane(Plan, Builder, IdxTy, RepR, VPLane(I));
+        }
+        RepR->eraseFromParent();
+        continue;
+      }
+
+      /// Create and record single-scalar version of RepR for all lanes.
+      SmallVector<VPValue *> LaneDefs;
+      for (unsigned I = 0; I != VF.getKnownMinValue(); ++I)
+        LaneDefs.push_back(cloneForLane(Plan, Builder, IdxTy, RepR, VPLane(I)));

Contributor Author:

Done thanks


/// Users that only demand the first lane can use the definition for lane
/// 0.
RepR->replaceUsesWithIf(LaneDefs[0], [RepR](VPUser &U, unsigned) {
return U.onlyFirstLaneUsed(RepR);
});

// If needed, create a Build(Struct)Vector recipe to insert the scalar
// lane values into a vector.
Collaborator:

So a pair of replicating recipes, one feeding the other, is replaced by VF recipes feeding a BuildVector which VF other recipes extract from, where the extracts are optimized away by cloneForLane(), and the BuildVector possibly by DCE?

Contributor Author:

Yep
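Editor's sketch of that flow (notation approximated, names invented), for VF = 2 with a replicating load feeding a replicating store:

    REPLICATE ir<%l> = load ir<%p>
    REPLICATE store ir<%l>, ir<%q>

After replicateByVF, the store's lane clones pick up the matching load clone directly via the m_BuildVector() look-through in cloneForLane, so no extracts are emitted and the buildvector, if otherwise unused, is removed as dead:

    REPLICATE ir<%l>.0 = load vp<%p.0>
    REPLICATE ir<%l>.1 = load vp<%p.1>
    EMIT vp<%vec> = buildvector ir<%l>.0, ir<%l>.1
    REPLICATE store ir<%l>.0, vp<%q.0>
    REPLICATE store ir<%l>.1, vp<%q.1>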

Type *ResTy = RepR->getUnderlyingInstr()->getType();
VPValue *VecRes = Builder.createNaryOp(
ResTy->isStructTy() ? VPInstruction::BuildStructVector
: VPInstruction::BuildVector,
LaneDefs);
RepR->replaceAllUsesWith(VecRes);
RepR->eraseFromParent();
}
}
}
@@ -393,12 +393,6 @@ define void @test_for_tried_to_force_scalar(ptr noalias %A, ptr noalias %B, ptr
; CHECK-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <12 x float> [[WIDE_VEC]], <12 x float> poison, <4 x i32> <i32 0, i32 3, i32 6, i32 9>
; CHECK-NEXT: [[TMP30:%.*]] = extractelement <4 x float> [[STRIDED_VEC]], i32 3
; CHECK-NEXT: store float [[TMP30]], ptr [[C:%.*]], align 4
; CHECK-NEXT: [[TMP31:%.*]] = extractelement <4 x ptr> [[TMP29]], i32 0
; CHECK-NEXT: [[TMP38:%.*]] = load float, ptr [[TMP31]], align 4
; CHECK-NEXT: [[TMP33:%.*]] = extractelement <4 x ptr> [[TMP29]], i32 1
; CHECK-NEXT: [[TMP32:%.*]] = load float, ptr [[TMP33]], align 4
; CHECK-NEXT: [[TMP35:%.*]] = extractelement <4 x ptr> [[TMP29]], i32 2
; CHECK-NEXT: [[TMP34:%.*]] = load float, ptr [[TMP35]], align 4
; CHECK-NEXT: [[TMP37:%.*]] = extractelement <4 x ptr> [[TMP29]], i32 3
; CHECK-NEXT: [[TMP36:%.*]] = load float, ptr [[TMP37]], align 4
; CHECK-NEXT: store float [[TMP36]], ptr [[B:%.*]], align 4