Skip to content

[LV] Vectorize Epilogues for loops with small VF but high IC #108190

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Nov 17, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
Original file line number Diff line number Diff line change
Expand Up @@ -525,6 +525,12 @@ class LoopVectorizationPlanner {
bool isMoreProfitable(const VectorizationFactor &A,
const VectorizationFactor &B) const;

/// Returns true if the per-lane cost of VectorizationFactor A is lower than
/// that of B in the context of vectorizing a loop with known \p MaxTripCount.
bool isMoreProfitable(const VectorizationFactor &A,
const VectorizationFactor &B,
const unsigned MaxTripCount) const;

/// Determines if we have the infrastructure to vectorize the loop and its
/// epilogue, assuming the main loop is vectorized by \p VF.
bool isCandidateForEpilogueVectorization(const ElementCount VF) const;
Expand Down
48 changes: 35 additions & 13 deletions llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1516,7 +1516,10 @@ class LoopVectorizationCostModel {
/// Returns true if epilogue vectorization is considered profitable, and
/// false otherwise.
/// \p VF is the vectorization factor chosen for the original loop.
bool isEpilogueVectorizationProfitable(const ElementCount VF) const;
/// \p Multiplier is an aditional scaling factor applied to VF before
/// comparing to EpilogueVectorizationMinVF.
bool isEpilogueVectorizationProfitable(const ElementCount VF,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiplier is the UF/IC, right? Might be clearer to call it that to avoid confusion

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only in the non-SVE case, for SVE it's getVScaleForTuning(OrigLoop, TTI).value_or(1). Which this wouldn't be changing just pulling it out as an argument and passing IC for non-SVE.

const unsigned Multiplier) const;

/// Returns the execution time cost of an instruction for a given vector
/// width. Vector width of one means scalar.
Expand Down Expand Up @@ -4289,12 +4292,11 @@ getVScaleForTuning(const Loop *L, const TargetTransformInfo &TTI) {
}

bool LoopVectorizationPlanner::isMoreProfitable(
const VectorizationFactor &A, const VectorizationFactor &B) const {
const VectorizationFactor &A, const VectorizationFactor &B,
const unsigned MaxTripCount) const {
InstructionCost CostA = A.Cost;
InstructionCost CostB = B.Cost;

unsigned MaxTripCount = PSE.getSmallConstantMaxTripCount();

// Improve estimate for the vector width if it is scalable.
unsigned EstimatedWidthA = A.Width.getKnownMinValue();
unsigned EstimatedWidthB = B.Width.getKnownMinValue();
Expand Down Expand Up @@ -4343,6 +4345,12 @@ bool LoopVectorizationPlanner::isMoreProfitable(
return CmpFn(RTCostA, RTCostB);
}

bool LoopVectorizationPlanner::isMoreProfitable(
const VectorizationFactor &A, const VectorizationFactor &B) const {
const unsigned MaxTripCount = PSE.getSmallConstantMaxTripCount();
return LoopVectorizationPlanner::isMoreProfitable(A, B, MaxTripCount);
}

void LoopVectorizationPlanner::emitInvalidCostRemarks(
OptimizationRemarkEmitter *ORE) {
using RecipeVFPair = std::pair<VPRecipeBase *, ElementCount>;
Expand Down Expand Up @@ -4661,7 +4669,7 @@ bool LoopVectorizationPlanner::isCandidateForEpilogueVectorization(
}

bool LoopVectorizationCostModel::isEpilogueVectorizationProfitable(
const ElementCount VF) const {
const ElementCount VF, const unsigned Multiplier) const {
// FIXME: We need a much better cost-model to take different parameters such
// as register pressure, code size increase and cost of extra branches into
// account. For now we apply a very crude heuristic and only consider loops
Expand All @@ -4676,9 +4684,6 @@ bool LoopVectorizationCostModel::isEpilogueVectorizationProfitable(
if (TTI.getMaxInterleaveFactor(VF) <= 1)
return false;

unsigned Multiplier = 1;
if (VF.isScalable())
Multiplier = getVScaleForTuning(TheLoop, TTI).value_or(1);
if ((Multiplier * VF.getKnownMinValue()) >= EpilogueVectorizationMinVF)
return true;
return false;
Expand Down Expand Up @@ -4724,7 +4729,11 @@ VectorizationFactor LoopVectorizationPlanner::selectEpilogueVectorizationFactor(
return Result;
}

if (!CM.isEpilogueVectorizationProfitable(MainLoopVF)) {
unsigned Multiplier = IC;
if (MainLoopVF.isScalable())
Multiplier = getVScaleForTuning(OrigLoop, TTI).value_or(1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this could be Multiplier *= for SVE, so that both the IC and the VScaleForTuning are accounted for.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly, but I didn't really look at SVE and since the remaining part of this change depends on using RemainingIterations for checking profitability, which has an (existing) // TODO: extend to support scalable VFs. I'd rather keep changing behavior for SVE as a follow-up.

Copy link
Contributor

@david-arm david-arm Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change looks incorrect. Previously in isEpilogueVectorizationProfitable we did:

  unsigned Multiplier = 1;
  if (VF.isScalable())
    Multiplier = getVScaleForTuning(TheLoop, TTI).value_or(1);
  if ((Multiplier * VF.getKnownMinValue()) >= EpilogueVectorizationMinVF)
    return true;

i.e. for fixed-width VFs Multiplier = 1, whereas after this change Multiplier = IC. This is either biasing against or in favour of fixed-width VFs, which doesn't seem right. I think in order to match the previous behaviour the code should be:

  unsigned Multiplier = 1;
  if (MainLoopVF.isScalable())
    Multiplier = getVScaleForTuning(OrigLoop, TTI).value_or(1);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @fhahn can you verify this as well? I think it should be one of the following:

  unsigned Multiplier = IC;
  if (MainLoopVF.isScalable())
    Multiplier *= getVScaleForTuning(OrigLoop, TTI).value_or(1);

or

  unsigned Multiplier = 1;
  if (MainLoopVF.isScalable())
    Multiplier = getVScaleForTuning(OrigLoop, TTI).value_or(1);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't say it is incorrect, it just keeps the original behavior for scalable vectors as I think @juliannagele doesn't have access to HW with scalable vectors, which would be needed to evaluate the impact of changing this for scalable vectors.

At this point already picked the VF for the main loop, so the only change is that we consider epilogue vectorization for more cases with fixed vectors.

To avoid regression with fixed vectors, this patch relies on code that checks the number of remaining iterations, which currently doesn't support scalable vectors (look for // TODO: extend to support scalable VFs), which probably should be fixed first. Again, this should probably be done by someone with access to HW supporting scalable vectors.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can volunteer for that, but I am really keen that this lands first. :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this has now landed, and #100385 is hopefully unblocked. Dave put up #116607 for changing this part too.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, okay, didn't realise that, thanks!


if (!CM.isEpilogueVectorizationProfitable(MainLoopVF, Multiplier)) {
LLVM_DEBUG(dbgs() << "LEV: Epilogue vectorization is not profitable for "
"this loop\n");
return Result;
Expand All @@ -4743,16 +4752,20 @@ VectorizationFactor LoopVectorizationPlanner::selectEpilogueVectorizationFactor(
ScalarEvolution &SE = *PSE.getSE();
Type *TCType = Legal->getWidestInductionType();
const SCEV *RemainingIterations = nullptr;
unsigned MaxTripCount = 0;
for (auto &NextVF : ProfitableVFs) {
// Skip candidate VFs without a corresponding VPlan.
if (!hasPlanWithVF(NextVF.Width))
continue;

// Skip candidate VFs with widths >= the estimate runtime VF (scalable
// vectors) or the VF of the main loop (fixed vectors).
// Skip candidate VFs with widths >= the (estimated) runtime VF (scalable
// vectors) or > the VF of the main loop (fixed vectors).
if ((!NextVF.Width.isScalable() && MainLoopVF.isScalable() &&
ElementCount::isKnownGE(NextVF.Width, EstimatedRuntimeVF)) ||
ElementCount::isKnownGE(NextVF.Width, MainLoopVF))
(NextVF.Width.isScalable() &&
ElementCount::isKnownGE(NextVF.Width, MainLoopVF)) ||
(!NextVF.Width.isScalable() && !MainLoopVF.isScalable() &&
ElementCount::isKnownGT(NextVF.Width, MainLoopVF)))
continue;

// If NextVF is greater than the number of remaining iterations, the
Expand All @@ -4766,6 +4779,14 @@ VectorizationFactor LoopVectorizationPlanner::selectEpilogueVectorizationFactor(
"Trip count SCEV must be computable");
RemainingIterations = SE.getURemExpr(
TC, SE.getConstant(TCType, MainLoopVF.getKnownMinValue() * IC));
MaxTripCount = MainLoopVF.getKnownMinValue() * IC - 1;
if (SE.isKnownPredicate(CmpInst::ICMP_ULT, RemainingIterations,
SE.getConstant(TCType, MaxTripCount))) {
MaxTripCount =
SE.getUnsignedRangeMax(RemainingIterations).getZExtValue();
}
LLVM_DEBUG(dbgs() << "LEV: Maximum Trip Count for Epilogue: "
<< MaxTripCount << "\n");
}
if (SE.isKnownPredicate(
CmpInst::ICMP_UGT,
Expand All @@ -4774,7 +4795,8 @@ VectorizationFactor LoopVectorizationPlanner::selectEpilogueVectorizationFactor(
continue;
}

if (Result.Width.isScalar() || isMoreProfitable(NextVF, Result))
if (Result.Width.isScalar() ||
isMoreProfitable(NextVF, Result, MaxTripCount))
Result = NextVF;
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@ define void @test_pr25490(i32 %n, ptr noalias nocapture %a, ptr noalias nocaptur
; CHECK-NEXT: br i1 [[CMP_28]], label [[FOR_COND_CLEANUP:%.*]], label [[ITER_CHECK:%.*]]
; CHECK: iter.check:
; CHECK-NEXT: [[TMP0:%.*]] = zext i32 [[N]] to i64
; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 8
; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 4
; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]]
; CHECK: vector.main.loop.iter.check:
; CHECK-NEXT: [[MIN_ITERS_CHECK1:%.*]] = icmp ult i32 [[N]], 16
Expand Down Expand Up @@ -50,33 +50,33 @@ define void @test_pr25490(i32 %n, ptr noalias nocapture %a, ptr noalias nocaptur
; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N_VEC]], [[TMP0]]
; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]
; CHECK: vec.epilog.iter.check:
; CHECK-NEXT: [[N_VEC_REMAINING:%.*]] = and i64 [[TMP0]], 8
; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK_NOT_NOT:%.*]] = icmp eq i64 [[N_VEC_REMAINING]], 0
; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK_NOT_NOT]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]
; CHECK-NEXT: [[N_VEC_REMAINING:%.*]] = and i64 [[TMP0]], 12
; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp eq i64 [[N_VEC_REMAINING]], 0
; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]
; CHECK: vec.epilog.ph:
; CHECK-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
; CHECK-NEXT: [[N_VEC5:%.*]] = and i64 [[TMP0]], 4294967288
; CHECK-NEXT: [[N_VEC5:%.*]] = and i64 [[TMP0]], 4294967292
; CHECK-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
; CHECK: vec.epilog.vector.body:
; CHECK-NEXT: [[INDEX6:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT10:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[INDEX6]]
; CHECK-NEXT: [[WIDE_LOAD7:%.*]] = load <8 x i8>, ptr [[TMP14]], align 1
; CHECK-NEXT: [[WIDE_LOAD7:%.*]] = load <4 x i8>, ptr [[TMP14]], align 1
; CHECK-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[INDEX6]]
; CHECK-NEXT: [[WIDE_LOAD8:%.*]] = load <8 x i8>, ptr [[TMP15]], align 1
; CHECK-NEXT: [[TMP16:%.*]] = zext <8 x i8> [[WIDE_LOAD8]] to <8 x i16>
; CHECK-NEXT: [[TMP17:%.*]] = zext <8 x i8> [[WIDE_LOAD7]] to <8 x i16>
; CHECK-NEXT: [[TMP18:%.*]] = mul nuw <8 x i16> [[TMP16]], [[TMP17]]
; CHECK-NEXT: [[TMP19:%.*]] = lshr <8 x i16> [[TMP18]], splat (i16 8)
; CHECK-NEXT: [[TMP20:%.*]] = trunc nuw <8 x i16> [[TMP19]] to <8 x i8>
; CHECK-NEXT: store <8 x i8> [[TMP20]], ptr [[TMP15]], align 1
; CHECK-NEXT: [[WIDE_LOAD8:%.*]] = load <4 x i8>, ptr [[TMP15]], align 1
; CHECK-NEXT: [[TMP16:%.*]] = zext <4 x i8> [[WIDE_LOAD8]] to <4 x i16>
; CHECK-NEXT: [[TMP17:%.*]] = zext <4 x i8> [[WIDE_LOAD7]] to <4 x i16>
; CHECK-NEXT: [[TMP18:%.*]] = mul nuw <4 x i16> [[TMP16]], [[TMP17]]
; CHECK-NEXT: [[TMP19:%.*]] = lshr <4 x i16> [[TMP18]], splat (i16 8)
; CHECK-NEXT: [[TMP20:%.*]] = trunc nuw <4 x i16> [[TMP19]] to <4 x i8>
; CHECK-NEXT: store <4 x i8> [[TMP20]], ptr [[TMP15]], align 1
; CHECK-NEXT: [[TMP21:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INDEX6]]
; CHECK-NEXT: [[WIDE_LOAD9:%.*]] = load <8 x i8>, ptr [[TMP21]], align 1
; CHECK-NEXT: [[TMP22:%.*]] = zext <8 x i8> [[WIDE_LOAD9]] to <8 x i16>
; CHECK-NEXT: [[TMP23:%.*]] = mul nuw <8 x i16> [[TMP22]], [[TMP17]]
; CHECK-NEXT: [[TMP24:%.*]] = lshr <8 x i16> [[TMP23]], splat (i16 8)
; CHECK-NEXT: [[TMP25:%.*]] = trunc nuw <8 x i16> [[TMP24]] to <8 x i8>
; CHECK-NEXT: store <8 x i8> [[TMP25]], ptr [[TMP21]], align 1
; CHECK-NEXT: [[INDEX_NEXT10]] = add nuw i64 [[INDEX6]], 8
; CHECK-NEXT: [[WIDE_LOAD9:%.*]] = load <4 x i8>, ptr [[TMP21]], align 1
; CHECK-NEXT: [[TMP22:%.*]] = zext <4 x i8> [[WIDE_LOAD9]] to <4 x i16>
; CHECK-NEXT: [[TMP23:%.*]] = mul nuw <4 x i16> [[TMP22]], [[TMP17]]
; CHECK-NEXT: [[TMP24:%.*]] = lshr <4 x i16> [[TMP23]], splat (i16 8)
; CHECK-NEXT: [[TMP25:%.*]] = trunc nuw <4 x i16> [[TMP24]] to <4 x i8>
; CHECK-NEXT: store <4 x i8> [[TMP25]], ptr [[TMP21]], align 1
; CHECK-NEXT: [[INDEX_NEXT10]] = add nuw i64 [[INDEX6]], 4
; CHECK-NEXT: [[TMP26:%.*]] = icmp eq i64 [[INDEX_NEXT10]], [[N_VEC5]]
; CHECK-NEXT: br i1 [[TMP26]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
; CHECK: vec.epilog.middle.block:
Expand Down
Loading
Loading