[LAA] Be more careful when evaluating AddRecs at symbolic max BTC. #128061

Merged: 8 commits, Jun 23, 2025.
9 changes: 6 additions & 3 deletions llvm/include/llvm/Analysis/LoopAccessAnalysis.h
@@ -901,7 +901,10 @@ LLVM_ABI bool sortPtrAccesses(ArrayRef<Value *> VL, Type *ElemTy,
LLVM_ABI bool isConsecutiveAccess(Value *A, Value *B, const DataLayout &DL,
ScalarEvolution &SE, bool CheckType = true);

/// Calculate Start and End points of memory access.
/// Calculate Start and End points of memory access using exact backedge taken
/// count \p BTC if computable or maximum backedge taken count \p MaxBTC
/// otherwise.
///
/// Let's assume A is the first access and B is a memory access on N-th loop
/// iteration. Then B is calculated as:
/// B = A + Step*N .
@@ -915,8 +918,8 @@ LLVM_ABI bool isConsecutiveAccess(Value *A, Value *B, const DataLayout &DL,
/// There is no conflict when the intervals are disjoint:
/// NoConflict = (P2.Start >= P1.End) || (P1.Start >= P2.End)
LLVM_ABI std::pair<const SCEV *, const SCEV *> getStartAndEndForAccess(
const Loop *Lp, const SCEV *PtrExpr, Type *AccessTy, const SCEV *MaxBECount,
ScalarEvolution *SE,
const Loop *Lp, const SCEV *PtrExpr, Type *AccessTy, const SCEV *BTC,
const SCEV *MaxBTC, ScalarEvolution *SE,
DenseMap<std::pair<const SCEV *, Type *>,
std::pair<const SCEV *, const SCEV *>> *PointerBounds);

7 changes: 6 additions & 1 deletion llvm/lib/Analysis/Loads.cpp
@@ -319,17 +319,22 @@ bool llvm::isDereferenceableAndAlignedInLoop(
const SCEV *MaxBECount =
Predicates ? SE.getPredicatedConstantMaxBackedgeTakenCount(L, *Predicates)
: SE.getConstantMaxBackedgeTakenCount(L);
const SCEV *BECount = Predicates
? SE.getPredicatedBackedgeTakenCount(L, *Predicates)
: SE.getBackedgeTakenCount(L);
if (isa<SCEVCouldNotCompute>(MaxBECount))
return false;

const auto &[AccessStart, AccessEnd] = getStartAndEndForAccess(
L, PtrScev, LI->getType(), MaxBECount, &SE, nullptr);
L, PtrScev, LI->getType(), BECount, MaxBECount, &SE, nullptr);
if (isa<SCEVCouldNotCompute>(AccessStart) ||
isa<SCEVCouldNotCompute>(AccessEnd))
return false;

// Try to get the access size.
const SCEV *PtrDiff = SE.getMinusSCEV(AccessEnd, AccessStart);
if (isa<SCEVCouldNotCompute>(PtrDiff))
return false;
APInt MaxPtrDiff = SE.getUnsignedRangeMax(PtrDiff);

Value *Base = nullptr;
137 changes: 123 additions & 14 deletions llvm/lib/Analysis/LoopAccessAnalysis.cpp
@@ -188,9 +188,90 @@ RuntimeCheckingPtrGroup::RuntimeCheckingPtrGroup(
Members.push_back(Index);
}

/// Returns \p A + \p B, if it is guaranteed not to unsigned wrap. Otherwise
/// return nullptr. \p A and \p B must have the same type.
static const SCEV *addSCEVOverflow(const SCEV *A, const SCEV *B,
Contributor: I think addSCEVNoOverflow / mulSCEVNoOverflow would be a better name for these.

ScalarEvolution &SE) {
if (!SE.willNotOverflow(Instruction::Add, false, A, B))
Contributor: Add /*IsSigned=*/

return nullptr;
return SE.getAddExpr(A, B);
}

/// Returns \p A * \p B, if it is guaranteed not to unsigned wrap. Otherwise
/// return nullptr. \p A and \p B must have the same type.
static const SCEV *mulSCEVOverflow(const SCEV *A, const SCEV *B,
ScalarEvolution &SE) {
if (!SE.willNotOverflow(Instruction::Mul, false, A, B))
Contributor: Perhaps worth adding something in the comments for mulSCEVOverflow and addSCEVOverflow that A and B need to have the same type in order to do the overflow check?

Contributor Author: Added a comment, thanks. They must have the same type for A * B to be valid.

return nullptr;
return SE.getMulExpr(A, B);
}

/// Return true, if evaluating \p AR at \p MaxBTC cannot wrap, because \p AR at
/// \p MaxBTC is guaranteed inbounds of the accessed object.
static bool evaluatePtrAddRecAtMaxBTCWillNotWrap(const SCEVAddRecExpr *AR,
const SCEV *MaxBTC,
const SCEV *EltSize,
ScalarEvolution &SE,
const DataLayout &DL) {
auto *PointerBase = SE.getPointerBase(AR->getStart());
Contributor: This assumes AR is for pointers, but the function name indicates the add rec could be anything. Is it worth making that clear in the function name? I'm not sure what would happen if we called SE.getPointerBase for something that isn't a pointer.

Contributor Author: Yep, it must be a pointer AddRec, and that's guaranteed by the current caller. Updated to evaluatePtrAddRecAtMaxBTCWillNotWrap.

auto *StartPtr = dyn_cast<SCEVUnknown>(PointerBase);
if (!StartPtr)
return false;
bool CheckForNonNull, CheckForFreed;
uint64_t DerefBytes = StartPtr->getValue()->getPointerDereferenceableBytes(
DL, CheckForNonNull, CheckForFreed);

if (CheckForNonNull || CheckForFreed)
return false;

const SCEV *Step = AR->getStepRecurrence(SE);
bool IsKnownNonNegative = SE.isKnownNonNegative(Step);
if (!IsKnownNonNegative && !SE.isKnownNegative(Step))
return false;

Type *WiderTy = SE.getWiderType(MaxBTC->getType(), Step->getType());
Step = SE.getNoopOrSignExtend(Step, WiderTy);
MaxBTC = SE.getNoopOrZeroExtend(MaxBTC, WiderTy);

// For the computations below, make sure they don't unsigned wrap.
if (!SE.isKnownPredicate(CmpInst::ICMP_UGE, AR->getStart(), StartPtr))
Contributor: Is there a hard theoretical limitation that prevents AR being something like (%p - 16)? Or does it just make the later calculations simpler? If it's the latter (sounds like it), then can you add a TODO that we can improve this in future?

Contributor Author: This currently makes sure that AR->getStart() - StartPtr cannot wrap. We cannot check if the result of getMinusSCEV has NUW, as it will form a SCEVAddExpr with a negative operand, which will generally not be NUW.

return false;
const SCEV *StartOffset = SE.getNoopOrZeroExtend(
SE.getMinusSCEV(AR->getStart(), StartPtr), WiderTy);

const SCEV *OffsetAtLastIter =
mulSCEVOverflow(MaxBTC, SE.getAbsExpr(Step, false), SE);
Contributor: /*IsNSW=*/

if (!OffsetAtLastIter)
return false;

const SCEV *OffsetEndBytes = addSCEVOverflow(
OffsetAtLastIter, SE.getNoopOrZeroExtend(EltSize, WiderTy), SE);
Contributor: I guess this isn't really the last accessed byte, which would really be EltSize - 1. Might be clearer to simply call it EndOffset? I realise it makes the subsequent calculations simpler, but it does artificially increase the chance of overflow, because in reality the pointer for the last accessed byte is (AR->getStart() + (MaxBTC * Step) + EltSize - 1). I'm fine with that, but maybe worth a comment?

Contributor Author (fhahn, Jun 18, 2025): Updated the names, thanks. We are comparing the result <= DerefBytes (i.e. the end of the accessed range is at most DerefBytes), so I think it shouldn't pessimize the results. If we subtracted 1, we would check < DerefBytes (i.e. the last accessed byte must be inside the dereferenceable bytes).

if (!OffsetEndBytes)
return false;

if (IsKnownNonNegative) {
// For positive steps, check if
// (AR->getStart() - StartPtr) + (MaxBTC * Step) + EltSize <= DerefBytes,
// while making sure none of the computations unsigned wrap themselves.
const SCEV *EndBytes = addSCEVOverflow(StartOffset, OffsetEndBytes, SE);
if (!EndBytes)
return false;
return SE.isKnownPredicate(CmpInst::ICMP_ULE, EndBytes,
SE.getConstant(WiderTy, DerefBytes));
}

// For negative steps check if
// * StartOffset >= (MaxBTC * Step + EltSize)
// * StartOffset <= DerefBytes.
assert(SE.isKnownNegative(Step) && "must be known negative");
return SE.isKnownPredicate(CmpInst::ICMP_SGE, StartOffset, OffsetEndBytes) &&
SE.isKnownPredicate(CmpInst::ICMP_ULE, StartOffset,
SE.getConstant(WiderTy, DerefBytes));
}

std::pair<const SCEV *, const SCEV *> llvm::getStartAndEndForAccess(
const Loop *Lp, const SCEV *PtrExpr, Type *AccessTy, const SCEV *MaxBECount,
ScalarEvolution *SE,
const Loop *Lp, const SCEV *PtrExpr, Type *AccessTy, const SCEV *BTC,
const SCEV *MaxBTC, ScalarEvolution *SE,
DenseMap<std::pair<const SCEV *, Type *>,
std::pair<const SCEV *, const SCEV *>> *PointerBounds) {
std::pair<const SCEV *, const SCEV *> *PtrBoundsPair;
@@ -206,11 +287,37 @@ std::pair<const SCEV *, const SCEV *> llvm::getStartAndEndForAccess(
const SCEV *ScStart;
const SCEV *ScEnd;

auto &DL = Lp->getHeader()->getDataLayout();
Type *IdxTy = DL.getIndexType(PtrExpr->getType());
const SCEV *EltSizeSCEV = SE->getStoreSizeOfExpr(IdxTy, AccessTy);
if (SE->isLoopInvariant(PtrExpr, Lp)) {
ScStart = ScEnd = PtrExpr;
} else if (auto *AR = dyn_cast<SCEVAddRecExpr>(PtrExpr)) {
ScStart = AR->getStart();
ScEnd = AR->evaluateAtIteration(MaxBECount, *SE);
if (!isa<SCEVCouldNotCompute>(BTC))
Contributor: Not sure why we are passing the exact BTC, and handling the case where it is a could-not-compute. Why not just pass the symbolic max as before, and have the logic below?

Contributor Author: If we can compute the backedge-taken count, we are guaranteed to execute exactly that number of iterations. The symbolic max backedge-taken count is an upper bound, and the loop may exit at any earlier iteration (e.g. because it has an uncountable exit). As per the comment, a computable BTC means we can rely on the fact that the pointers cannot wrap in any iteration. If we instead only have the symbolic max BTC, we may execute fewer iterations than the max, and then only those iterations are guaranteed not to wrap in general, so evaluating at the symbolic max may wrap. One case to consider is when the symbolic max BTC is a SCEVUnknown: we will form a SCEV multiply expression for which we cannot determine whether it wraps (vs. the case when the symbolic BTC is a constant).

Contributor (artagnon, Feb 21, 2025): Thanks for the explanation. My confusion is the following: if we have a computable BTC, isn't Exact = SymbolicMax? If we don't have a computable BTC, Exact = SCEVCouldNotCompute, and SymbolicMax could be a SCEVConstant, a general SCEV expression, a SCEVUnknown, or, in the worst case, SCEVCouldNotCompute. If my reasoning is correct, there is no additional information in Exact over SymbolicMax, and we shouldn't have to pass Exact. In the test cases you have added, isn't SymbolicMax a SCEVConstant = INT_MAX? What does evaluating an AddRec at the INT_MAX iteration wrap to? Not -(EltSize + 1), or evaluating the AddRec at INT_MIN? Perhaps worth adding some SCEV tests for this evaluation, as a separate patch that we can verify? When SymbolicMax is a SCEVUnknown, it means that the iteration count is bounded by some function-argument IR value, right? In this case, Exact will also be the same SCEVUnknown, and if we pass INT_MAX when calling the function, the evaluation will wrap, and this is UB anyway? What happens when SymbolicMax is a SCEVCouldNotCompute? I think this will result in a crash with the current code.

Contributor: Okay, just thinking out loud here: for simplicity, let AR = {0,+,1} and let the symbolic max BTC = INT_MAX. Then we compute AddExpr(0, MulExpr(1, INT_MAX)). I don't think this overflows. Now let AR = {0,+,2}. Then we compute AddExpr(0, MulExpr(2, BinomialCoefficient(INT_MAX, 2))), where the binomial coefficient evaluates to INT_MAX * (INT_MAX - 1) / 2. Naively doing this would overflow even for BTC equal to sqrt(INT_MAX), but it looks like BinomialCoefficient is written carefully, although the final result is truncated (?). In conclusion, it looks like the problem is that evaluateAtIteration does not wrap, but rather truncates the result?

Contributor Author: Thanks for digging into this! One clarification: we evaluate at BTC = UNSIGNED_MAX. So {0,+,1} won't wrap, but adding 1 will (getStartAndEndForAccess computes the first address after the last access). When we have strides larger than 1, the last accessed address will be something like %start + stride * UNSIGNED_MAX, which should wrap to something like %start - %stride. I am not entirely sure whether there may be other wrapping issues with how evaluateAtIteration internally computes the result, but the original end point computed for runtime_checks_with_symbolic_max_btc_neg_1 illustrates that: start == end, due to adding %stride to the result of evaluateAtIteration.

// Evaluating AR at an exact BTC is safe: LAA separately checks that
// accesses cannot wrap in the loop. If evaluating AR at BTC wraps, then
// the loop either triggers UB when executing a memory access with a
// poison pointer or the wrapping/poisoned pointer is not used.
ScEnd = AR->evaluateAtIteration(BTC, *SE);
else {
// Evaluating AR at MaxBTC may wrap and create an expression that is less
// than the start of the AddRec due to wrapping (for example consider
// MaxBTC = -2). If that's the case, set ScEnd to -(EltSize + 1). ScEnd
Contributor: I guess I still don't understand why exact BTC = -2 in the if case above isn't a problem too, although I realise you've tried to explain this to me once already and I do believe that it's an issue. :) It still sounds like there is a deficiency elsewhere in LAA that we're trying to work around here, but I can live with that for now!

Contributor Author: If there is a computable BTC, we must take the backedge exactly BTC times; if the pointer expression would wrap, then the original loop is guaranteed to have UB. If we only have a symbolic max BTC, we take the backedge up to BTC times, but the original loop could exit before we wrap and trigger UB. If we generated start and end pointers based on the symbolic max before the loop, those might wrap and cause incorrect runtime-check results. With a computable BTC, the start/end pointers can only wrap if the loop has UB.
// will get incremented by EltSize before returning, so this effectively
// sets ScEnd to the maximum unsigned value for the type. Note that LAA
// separately checks that accesses cannot wrap, so unsigned max
// represents an upper bound.
if (evaluatePtrAddRecAtMaxBTCWillNotWrap(AR, MaxBTC, EltSizeSCEV, *SE,
DL)) {
ScEnd = AR->evaluateAtIteration(MaxBTC, *SE);
} else {
ScEnd = SE->getAddExpr(
SE->getNegativeSCEV(EltSizeSCEV),
SE->getSCEV(ConstantExpr::getIntToPtr(
ConstantInt::get(EltSizeSCEV->getType(), -1), AR->getType())));
}
}
const SCEV *Step = AR->getStepRecurrence(*SE);

// For expressions with negative step, the upper bound is ScStart and the
@@ -232,9 +339,6 @@ std::pair<const SCEV *, const SCEV *> llvm::getStartAndEndForAccess(
assert(SE->isLoopInvariant(ScEnd, Lp) && "ScEnd needs to be invariant");

// Add the size of the pointed element to ScEnd.
auto &DL = Lp->getHeader()->getDataLayout();
Type *IdxTy = DL.getIndexType(PtrExpr->getType());
const SCEV *EltSizeSCEV = SE->getStoreSizeOfExpr(IdxTy, AccessTy);
ScEnd = SE->getAddExpr(ScEnd, EltSizeSCEV);

std::pair<const SCEV *, const SCEV *> Res = {ScStart, ScEnd};
@@ -250,9 +354,11 @@ void RuntimePointerChecking::insert(Loop *Lp, Value *Ptr, const SCEV *PtrExpr,
unsigned DepSetId, unsigned ASId,
PredicatedScalarEvolution &PSE,
bool NeedsFreeze) {
const SCEV *MaxBECount = PSE.getSymbolicMaxBackedgeTakenCount();
const auto &[ScStart, ScEnd] = getStartAndEndForAccess(
Lp, PtrExpr, AccessTy, MaxBECount, PSE.getSE(), &DC.getPointerBounds());
const SCEV *SymbolicMaxBTC = PSE.getSymbolicMaxBackedgeTakenCount();
const SCEV *BTC = PSE.getBackedgeTakenCount();
const auto &[ScStart, ScEnd] =
getStartAndEndForAccess(Lp, PtrExpr, AccessTy, BTC, SymbolicMaxBTC,
PSE.getSE(), &DC.getPointerBounds());
Comment on lines +357 to +361:
Contributor: Are we changing this because the exact BTC gives better results in some cases?

Contributor Author: It's changed to differentiate the cases where we can and cannot compute the BTC exactly (there may not be a computable BTC for loops with early exits).

assert(!isa<SCEVCouldNotCompute>(ScStart) &&
!isa<SCEVCouldNotCompute>(ScEnd) &&
"must be able to compute both start and end expressions");
@@ -1907,11 +2013,14 @@ MemoryDepChecker::getDependenceDistanceStrideAndSize(
// required for correctness.
if (SE.isLoopInvariant(Src, InnermostLoop) ||
SE.isLoopInvariant(Sink, InnermostLoop)) {
const SCEV *MaxBECount = PSE.getSymbolicMaxBackedgeTakenCount();
const auto &[SrcStart_, SrcEnd_] = getStartAndEndForAccess(
InnermostLoop, Src, ATy, MaxBECount, PSE.getSE(), &PointerBounds);
const auto &[SinkStart_, SinkEnd_] = getStartAndEndForAccess(
InnermostLoop, Sink, BTy, MaxBECount, PSE.getSE(), &PointerBounds);
const SCEV *BTC = PSE.getBackedgeTakenCount();
const SCEV *SymbolicMaxBTC = PSE.getSymbolicMaxBackedgeTakenCount();
const auto &[SrcStart_, SrcEnd_] =
getStartAndEndForAccess(InnermostLoop, Src, ATy, BTC, SymbolicMaxBTC,
PSE.getSE(), &PointerBounds);
const auto &[SinkStart_, SinkEnd_] =
getStartAndEndForAccess(InnermostLoop, Sink, BTy, BTC, SymbolicMaxBTC,
PSE.getSE(), &PointerBounds);
if (!isa<SCEVCouldNotCompute>(SrcStart_) &&
!isa<SCEVCouldNotCompute>(SrcEnd_) &&
!isa<SCEVCouldNotCompute>(SinkStart_) &&
25 changes: 13 additions & 12 deletions llvm/test/Analysis/LoopAccessAnalysis/early-exit-runtime-checks.ll
@@ -72,10 +72,10 @@ define void @all_exits_dominate_latch_countable_exits_at_most_501_iterations_kno
; CHECK-NEXT: %gep.A = getelementptr inbounds i32, ptr %A, i64 %iv
; CHECK-NEXT: Grouped accesses:
; CHECK-NEXT: Group GRP0:
; CHECK-NEXT: (Low: %B High: (2004 + %B))
; CHECK-NEXT: (Low: %B High: inttoptr (i64 -1 to ptr))
; CHECK-NEXT: Member: {%B,+,4}<nuw><%loop.header>
; CHECK-NEXT: Group GRP1:
; CHECK-NEXT: (Low: %A High: (2004 + %A))
; CHECK-NEXT: (Low: %A High: inttoptr (i64 -1 to ptr))
; CHECK-NEXT: Member: {%A,+,4}<nuw><%loop.header>
; CHECK-EMPTY:
; CHECK-NEXT: Non vectorizable stores to invariant address were not found in loop.
@@ -131,10 +131,10 @@ define void @all_exits_dominate_latch_countable_exits_at_most_500_iterations_not
; CHECK-NEXT: %gep.A = getelementptr inbounds i32, ptr %A, i64 %iv
; CHECK-NEXT: Grouped accesses:
; CHECK-NEXT: Group GRP0:
; CHECK-NEXT: (Low: %B High: (2000 + %B))
; CHECK-NEXT: (Low: %B High: inttoptr (i64 -1 to ptr))
; CHECK-NEXT: Member: {%B,+,4}<nuw><%loop.header>
; CHECK-NEXT: Group GRP1:
; CHECK-NEXT: (Low: %A High: (2000 + %A))
; CHECK-NEXT: (Low: %A High: inttoptr (i64 -1 to ptr))
Contributor: I'm probably missing something here, but this seems too conservative, right? The loop absolutely cannot execute more than 500 times even if we take an early exit. Even before, the High of 2000 + %A was a conservative worst case based on not taking an early exit. Is the idea to fix existing bugs for now in this patch and possibly improve upon this later? Or have I just missed some fundamental reasoning?

Contributor Author: Yep, we have a bound on the number of iterations and we are guaranteed not to exceed it. One problematic input could be one where only the first 10 elements of A and B are dereferenceable, and the early exit leaves the loop at iteration 9. If we expand the bounds with B + 2000, this may wrap in that case.

Contributor: Ignoring the fact that the GEPs in the loop explicitly say there is no wrapping (due to inbounds), it sounds like this patch is saying that for early-exit loops 2000 + %A could wrap (hence the need to clamp the value), but for normal loops 2000 + %A cannot wrap (hence we don't need to clamp). I'm just struggling to understand the reasoning behind this. Having said that, I'm happy to accept this limitation for now if it makes life easier and helps to solve the immediate issue. I think it's worth adding a TODO that we can refine this in future; anyone looking at this test might be confused about the discrepancy between normal and early-exit loops that have the same mathematical maximum upper bound for memory accesses. Also, after this patch, if we have an early-exit loop that requires runtime memory checks, there is a very high chance of failing those checks due to the increase in the upper bound to UINT64_MAX, so it's questionable whether there is even any point vectorising loops with runtime memory checks.

Contributor Author: Technically, the GEPs say that the result is poison if it is not inbounds; dereferencing the poison GEP would trigger UB. Say 1000 + %A wraps. When there is no early exit, 2000 + %A would still wrap, but it is guaranteed to trigger UB in the original loop, because the loop must access the memory range %A .. 2000 + %A. If there is an early exit, the original loop may always exit before accessing 1000 + %A, so it would not trigger UB, but we would expand 2000 + %A, which would have wrapped, and the runtime check would be incorrect. Hope that makes sense.

Contributor: I think I understand. It sounds like you're saying that without the early exit we don't really care whether the runtime checks are nonsense, because the entire loop is UB anyway, whereas for early-exit loops the loop may or may not be UB, and so we do care about getting the right runtime checks.

Contributor Author: Yep.
; CHECK-NEXT: Member: {%A,+,4}<nuw><%loop.header>
; CHECK-EMPTY:
; CHECK-NEXT: Non vectorizable stores to invariant address were not found in loop.
@@ -247,10 +247,10 @@ define i32 @all_exits_dominate_latch_countable_exits_at_most_1001_iterations_kno
; CHECK-NEXT: %gep.A = getelementptr inbounds i32, ptr %A, i64 %iv
; CHECK-NEXT: Grouped accesses:
; CHECK-NEXT: Group GRP0:
; CHECK-NEXT: (Low: %B High: (4004 + %B))
; CHECK-NEXT: (Low: %B High: inttoptr (i64 -1 to ptr))
Contributor: This needs a TODO, because we shouldn't be doing this for dereferenceable pointers, right? Surely if it's guaranteed to be dereferenceable, that implies it should not wrap? For example, I'm struggling to see how a C++ object passed by reference to a function could be allocated across a wrapped address space and be legal; I would expect any attempt to actually use the object to trigger undefined behaviour per the C++ specification. I realise there's more to life than C++, this is just one example of course. Also, if I've understood correctly, it will significantly impact @huntergr-arm's work to enable vectorisation of early-exit loops with loads and stores from dereferenceable memory. Again, I'm happy to accept the patch as is, just saying that we should improve this in future if we ever want to make progress with early-exit vectorisation.

Contributor Author: That test was accessing one past the dereferenceable range (see the original bound of 4004 + %B), so I think using -1 is the best we can do here. The name of the test was confusing, as it actually executes 1001 iterations. There's now a variant that actually executes at most 1000 iterations (for which we don't pessimize the bounds), and this test has been renamed to 1001 iterations.
; CHECK-NEXT: Member: {%B,+,4}<nuw><%loop.header>
; CHECK-NEXT: Group GRP1:
; CHECK-NEXT: (Low: %A High: (4004 + %A))
; CHECK-NEXT: (Low: %A High: inttoptr (i64 -1 to ptr))
; CHECK-NEXT: Member: {%A,+,4}<nuw><%loop.header>
; CHECK-EMPTY:
; CHECK-NEXT: Non vectorizable stores to invariant address were not found in loop.
@@ -305,10 +305,10 @@ define i32 @all_exits_dominate_latch_countable_exits_at_most_1000_iterations_not
; CHECK-NEXT: %gep.A = getelementptr inbounds i32, ptr %A, i64 %iv
; CHECK-NEXT: Grouped accesses:
; CHECK-NEXT: Group GRP0:
; CHECK-NEXT: (Low: %B High: (4000 + %B))
; CHECK-NEXT: (Low: %B High: inttoptr (i64 -1 to ptr))
; CHECK-NEXT: Member: {%B,+,4}<nuw><%loop.header>
; CHECK-NEXT: Group GRP1:
; CHECK-NEXT: (Low: %A High: (4000 + %A))
; CHECK-NEXT: (Low: %A High: inttoptr (i64 -1 to ptr))
; CHECK-NEXT: Member: {%A,+,4}<nuw><%loop.header>
; CHECK-EMPTY:
; CHECK-NEXT: Non vectorizable stores to invariant address were not found in loop.
@@ -350,6 +350,7 @@ e.2:
ret i32 2
}


define i32 @not_all_exits_dominate_latch(ptr %A, ptr %B) {
; CHECK-LABEL: 'not_all_exits_dominate_latch'
; CHECK-NEXT: loop.header:
@@ -407,10 +408,10 @@ define i32 @b3_does_not_dominate_latch_known_deref(ptr dereferenceable(4000) %A,
; CHECK-NEXT: %gep.A = getelementptr inbounds i32, ptr %A, i64 %iv
; CHECK-NEXT: Grouped accesses:
; CHECK-NEXT: Group GRP0:
; CHECK-NEXT: (Low: %B High: (4004 + %B))
; CHECK-NEXT: (Low: %B High: inttoptr (i64 -1 to ptr))
; CHECK-NEXT: Member: {%B,+,4}<nuw><%loop.header>
; CHECK-NEXT: Group GRP1:
; CHECK-NEXT: (Low: %A High: (4004 + %A))
; CHECK-NEXT: (Low: %A High: inttoptr (i64 -1 to ptr))
; CHECK-NEXT: Member: {%A,+,4}<nuw><%loop.header>
; CHECK-EMPTY:
; CHECK-NEXT: Non vectorizable stores to invariant address were not found in loop.
@@ -462,10 +463,10 @@ define i32 @b3_does_not_dominate_latch_not_known_deref(ptr %A, ptr %B) {
; CHECK-NEXT: %gep.A = getelementptr inbounds i32, ptr %A, i64 %iv
; CHECK-NEXT: Grouped accesses:
; CHECK-NEXT: Group GRP0:
; CHECK-NEXT: (Low: %B High: (4004 + %B))
; CHECK-NEXT: (Low: %B High: inttoptr (i64 -1 to ptr))
; CHECK-NEXT: Member: {%B,+,4}<nuw><%loop.header>
; CHECK-NEXT: Group GRP1:
; CHECK-NEXT: (Low: %A High: (4004 + %A))
; CHECK-NEXT: (Low: %A High: inttoptr (i64 -1 to ptr))
; CHECK-NEXT: Member: {%A,+,4}<nuw><%loop.header>
; CHECK-EMPTY:
; CHECK-NEXT: Non vectorizable stores to invariant address were not found in loop.