Skip to content

[LAA] Support different strides & non constant dep distances using SCEV. #88039

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Apr 25, 2024
Merged
Show file tree
Hide file tree
Changes from 8 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
122 changes: 80 additions & 42 deletions llvm/lib/Analysis/LoopAccessAnalysis.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1920,20 +1920,21 @@ isLoopVariantIndirectAddress(ArrayRef<const Value *> UnderlyingObjects,
namespace {
struct DepDistanceStrideAndSizeInfo {
const SCEV *Dist;
uint64_t Stride;
uint64_t StrideA;
uint64_t StrideB;
uint64_t TypeByteSize;
bool AIsWrite;
bool BIsWrite;

DepDistanceStrideAndSizeInfo(const SCEV *Dist, uint64_t Stride,
uint64_t TypeByteSize, bool AIsWrite,
bool BIsWrite)
: Dist(Dist), Stride(Stride), TypeByteSize(TypeByteSize),
AIsWrite(AIsWrite), BIsWrite(BIsWrite) {}
DepDistanceStrideAndSizeInfo(const SCEV *Dist, uint64_t StrideA,
uint64_t StrideB, uint64_t TypeByteSize,
bool AIsWrite, bool BIsWrite)
: Dist(Dist), StrideA(StrideA), StrideB(StrideB),
TypeByteSize(TypeByteSize), AIsWrite(AIsWrite), BIsWrite(BIsWrite) {}
};
} // namespace

// Get the dependence distance, stride, type size and whether it is a write for
// Get the dependence distance, strides, type size and whether it is a write for
// the dependence between A and B. Returns a DepType, if we can prove there's
// no dependence or the analysis fails. Outlined to lambda to limit he scope
// of various temporary variables, like A/BPtr, StrideA/BPtr and others.
Expand Down Expand Up @@ -1998,7 +1999,7 @@ getDependenceDistanceStrideAndSize(
// Need accesses with constant stride. We don't want to vectorize
// "A[B[i]] += ..." and similar code or pointer arithmetic that could wrap
// in the address space.
if (!StrideAPtr || !StrideBPtr || StrideAPtr != StrideBPtr) {
if (!StrideAPtr || !StrideBPtr) {
LLVM_DEBUG(dbgs() << "Pointer access with non-constant stride\n");
return MemoryDepChecker::Dependence::Unknown;
}
Expand All @@ -2008,9 +2009,9 @@ getDependenceDistanceStrideAndSize(
DL.getTypeStoreSizeInBits(ATy) == DL.getTypeStoreSizeInBits(BTy);
if (!HasSameSize)
TypeByteSize = 0;
uint64_t Stride = std::abs(StrideAPtr);
return DepDistanceStrideAndSizeInfo(Dist, Stride, TypeByteSize, AIsWrite,
BIsWrite);
return DepDistanceStrideAndSizeInfo(Dist, std::abs(StrideAPtr),
std::abs(StrideBPtr), TypeByteSize,
AIsWrite, BIsWrite);
}

MemoryDepChecker::Dependence::DepType MemoryDepChecker::isDependent(
Expand All @@ -2028,68 +2029,105 @@ MemoryDepChecker::Dependence::DepType MemoryDepChecker::isDependent(
if (std::holds_alternative<Dependence::DepType>(Res))
return std::get<Dependence::DepType>(Res);

const auto &[Dist, Stride, TypeByteSize, AIsWrite, BIsWrite] =
const auto &[Dist, StrideA, StrideB, TypeByteSize, AIsWrite, BIsWrite] =
std::get<DepDistanceStrideAndSizeInfo>(Res);
bool HasSameSize = TypeByteSize > 0;

uint64_t CommonStride = StrideA == StrideB ? StrideA : 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't 0 a valid stride (i.e. indexed by a constant)?

Suggested change
uint64_t CommonStride = StrideA == StrideB ? StrideA : 0;
std::optional<uint64_t> CommonStride = StrideA == StrideB ? StrideA : None;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, both strides need to be != 0 I think (getDependenceDistanceStrideAndSize returns Unknown dependence if any stride is == 0).

But using std::optional makes this future proof (I am planning to also remove that restriction as follow up), updated, thanks!

if (isa<SCEVCouldNotCompute>(Dist)) {
FoundNonConstantDistanceDependence |= CommonStride;
LLVM_DEBUG(dbgs() << "LAA: Dependence because of uncomputable distance.\n");
return Dependence::Unknown;
}

ScalarEvolution &SE = *PSE.getSE();
auto &DL = InnermostLoop->getHeader()->getModule()->getDataLayout();
if (!isa<SCEVCouldNotCompute>(Dist) && HasSameSize &&
if (HasSameSize && CommonStride &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isSafeDependenceDistance could be easily modified as well to support different sizes, but it's unrelated to this patch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I plan to address this separately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up now available: #90036

isSafeDependenceDistance(DL, SE, *(PSE.getBackedgeTakenCount()), *Dist,
Stride, TypeByteSize))
CommonStride, TypeByteSize))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isSafeDependenceDistance doesn't depend on there being a common stride. It could either be the larger one, or a case-distinction of either Dist positive (Src < Sink) or positive (Sink < Src).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment the comment in the function implies that it assumes equal strides which is why I left it as-is for now, to limit the scope of the initial patch.

It tries to prove |Dist| > BackedgeTakenCount * Stride, so using the maximum should conservatively be correct.

A similar reasoning can be applied at the end of the function too, which also is currently guarded by having a common stride. I am also planning on addressing that separately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a TODO?

return Dependence::NoDep;

const SCEVConstant *C = dyn_cast<SCEVConstant>(Dist);
if (!C) {
LLVM_DEBUG(dbgs() << "LAA: Dependence because of non-constant distance\n");
FoundNonConstantDistanceDependence = true;
return Dependence::Unknown;
}

const APInt &Val = C->getAPInt();
int64_t Distance = Val.getSExtValue();

// Attempt to prove strided accesses independent.
if (std::abs(Distance) > 0 && Stride > 1 && HasSameSize &&
areStridedAccessesIndependent(std::abs(Distance), Stride, TypeByteSize)) {
LLVM_DEBUG(dbgs() << "LAA: Strided accesses are independent\n");
return Dependence::NoDep;
if (C) {
const APInt &Val = C->getAPInt();
int64_t Distance = Val.getSExtValue();

if (std::abs(Distance) > 0 && CommonStride > 1 && HasSameSize &&
areStridedAccessesIndependent(std::abs(Distance), CommonStride,
TypeByteSize)) {
LLVM_DEBUG(dbgs() << "LAA: Strided accesses are independent\n");
return Dependence::NoDep;
}

// Write to the same location with the same size.
if (C->isZero()) {
if (HasSameSize)
return Dependence::Forward;
LLVM_DEBUG(
dbgs()
<< "LAA: Zero dependence difference but different type sizes\n");
return Dependence::Unknown;
}
}

// Negative distances are not plausible dependencies.
if (Val.isNegative()) {
if (SE.isKnownNonPositive(Dist)) {
if (!SE.isKnownNegative(Dist) && !HasSameSize) {
LLVM_DEBUG(dbgs() << "LAA: possibly zero dependence difference but "
"different type sizes\n");
return Dependence::Unknown;
}
Copy link
Member

@Meinersbur Meinersbur Apr 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My impression is that most of these return Unknown; should fall through instead to give other checks a chance to match. If there is an invariant after a certain point (e.g. "type size is equal") it should be rejected there. It avoids having to maintain several conditions e.g. updating FoundNonConstantDistanceDependence which seems to be missing here as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've merged the code from above that handles C == 0 with the code here. The main reason we need to update FoundNonConstantDistanceDependence in multiple places now is that there now are multiple exit paths that return Unknown with non-constant dependences.

The latest version of the patch limits setting FoundNonConstantDistanceDependence to 3 times, with the one before couldPreventStoreLoadForward potentially being removed as a follow-up by making couldPreventStoreLoadForward work with non-constant distances.


bool IsTrueDataDependence = (AIsWrite && !BIsWrite);
// There is no need to update MaxSafeVectorWidthInBits after call to
// couldPreventStoreLoadForward, even if it changed MinDepDistBytes,
// since a forward dependency will allow vectorization using any width.
if (IsTrueDataDependence && EnableForwardingConflictDetection &&
(!HasSameSize || couldPreventStoreLoadForward(Val.abs().getZExtValue(),
TypeByteSize))) {
LLVM_DEBUG(dbgs() << "LAA: Forward but may prevent st->ld forwarding\n");
return Dependence::ForwardButPreventsForwarding;
if (IsTrueDataDependence && EnableForwardingConflictDetection) {
if (!C) {
FoundNonConstantDistanceDependence |= CommonStride;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no common stride (CommonStride == 0) then any of the two could still be non-constant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the reason for only setting it for CommonStride == 0 is to only set it in cases where we would set it in the original code as well; if there was no common stride, we would have returned Unknown much earlier, without setting FoundNonConstantDistanceDependence.

I added a TODO to relax this requirement, but would like to keep it in the patch to limit the impact.

return Dependence::Unknown;
}
if (!HasSameSize ||
couldPreventStoreLoadForward(C->getAPInt().abs().getZExtValue(),
TypeByteSize)) {
LLVM_DEBUG(
dbgs() << "LAA: Forward but may prevent st->ld forwarding\n");
return Dependence::ForwardButPreventsForwarding;
}
}

LLVM_DEBUG(dbgs() << "LAA: Dependence is negative\n");
return Dependence::Forward;
}

// Write to the same location with the same size.
if (Val == 0) {
if (HasSameSize)
return Dependence::Forward;
LLVM_DEBUG(
dbgs() << "LAA: Zero dependence difference but different type sizes\n");
if (!SE.isKnownPositive(Dist)) {
FoundNonConstantDistanceDependence |= !C && CommonStride;
return Dependence::Unknown;
}

assert(Val.isStrictlyPositive() && "Expect a positive value");

if (!HasSameSize) {
LLVM_DEBUG(dbgs() << "LAA: ReadWrite-Write positive dependency with "
"different type sizes\n");
FoundNonConstantDistanceDependence |= !C && CommonStride;
return Dependence::Unknown;
}

if (!C) {
LLVM_DEBUG(dbgs() << "LAA: Dependence because of non-constant distance\n");
FoundNonConstantDistanceDependence |= CommonStride;
return Dependence::Unknown;
}

// The logic below currently only supports StrideA == StrideB, i.e. there's a
// common stride.
if (!CommonStride)
return Dependence::Unknown;
Comment on lines +2141 to +2142
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many checks above also test (CommonStride > 1). Move them below this check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could only find the one guarding areStridedAccessesIndependent and I don't think it can be moved here, as it tries to identify independent strided accesses early on to return NoDep. Kept the order roughly the same for now.


const APInt &Val = C->getAPInt();
int64_t Distance = Val.getSExtValue();

// Bail out early if passed-in parameters make vectorization not feasible.
unsigned ForcedFactor = (VectorizerParams::VectorizationFactor ?
VectorizerParams::VectorizationFactor : 1);
Expand Down Expand Up @@ -2125,7 +2163,7 @@ MemoryDepChecker::Dependence::DepType MemoryDepChecker::isDependent(
// the minimum distance needed is 28, which is greater than distance. It is
// not safe to do vectorization.
uint64_t MinDistanceNeeded =
TypeByteSize * Stride * (MinNumIter - 1) + TypeByteSize;
TypeByteSize * CommonStride * (MinNumIter - 1) + TypeByteSize;
if (MinDistanceNeeded > static_cast<uint64_t>(Distance)) {
LLVM_DEBUG(dbgs() << "LAA: Failure because of positive distance "
<< Distance << '\n');
Expand Down Expand Up @@ -2174,7 +2212,7 @@ MemoryDepChecker::Dependence::DepType MemoryDepChecker::isDependent(

// An update to MinDepDistBytes requires an update to MaxSafeVectorWidthInBits
// since there is a backwards dependency.
uint64_t MaxVF = MinDepDistBytes / (TypeByteSize * Stride);
uint64_t MaxVF = MinDepDistBytes / (TypeByteSize * CommonStride);
LLVM_DEBUG(dbgs() << "LAA: Positive distance " << Val.getSExtValue()
<< " with max VF = " << MaxVF << '\n');
uint64_t MaxVFInBits = MaxVF * TypeByteSize * 8;
Expand Down
4 changes: 3 additions & 1 deletion llvm/lib/Transforms/Scalar/LoopLoadElimination.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -126,8 +126,10 @@ struct StoreToLoadForwardingCandidate {

// We don't need to check non-wrapping here because forward/backward
// dependence wouldn't be valid if these weren't monotonic accesses.
auto *Dist = cast<SCEVConstant>(
auto *Dist = dyn_cast<SCEVConstant>(
PSE.getSE()->getMinusSCEV(StorePtrSCEV, LoadPtrSCEV));
if (!Dist)
return false;
Comment on lines +129 to +132
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this related? LAA doesn't seem to be involved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the pass uses LoopAccessInfo in findStoreToLoadDependences to collect dependences. Before this patch, all identified dependences would have a constant distance which is why this change is needed

const APInt &Val = Dist->getAPInt();
return Val == TypeByteSize * StrideLoad;
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -8,10 +8,9 @@ declare void @llvm.assume(i1)
define void @different_non_constant_strides_known_forward(ptr %A) {
; CHECK-LABEL: 'different_non_constant_strides_known_forward'
; CHECK-NEXT: loop:
; CHECK-NEXT: Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
; CHECK-NEXT: Unknown data dependence.
; CHECK-NEXT: Memory dependences are safe
; CHECK-NEXT: Dependences:
; CHECK-NEXT: Unknown:
; CHECK-NEXT: Forward:
; CHECK-NEXT: %l = load i32, ptr %gep.mul.2, align 4 ->
; CHECK-NEXT: store i32 %add, ptr %gep, align 4
; CHECK-EMPTY:
Expand Down Expand Up @@ -45,10 +44,9 @@ exit:
define void @different_non_constant_strides_known_forward_min_distance_3(ptr %A) {
; CHECK-LABEL: 'different_non_constant_strides_known_forward_min_distance_3'
; CHECK-NEXT: loop:
; CHECK-NEXT: Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
; CHECK-NEXT: Unknown data dependence.
; CHECK-NEXT: Memory dependences are safe
; CHECK-NEXT: Dependences:
; CHECK-NEXT: Unknown:
; CHECK-NEXT: Forward:
; CHECK-NEXT: %l = load i32, ptr %gep.mul.2, align 4 ->
; CHECK-NEXT: store i32 %add, ptr %gep, align 4
; CHECK-EMPTY:
Expand Down
102 changes: 92 additions & 10 deletions llvm/test/Transforms/LoopVectorize/global_alias.ll
Original file line number Diff line number Diff line change
Expand Up @@ -491,22 +491,101 @@ for.end: ; preds = %for.body
ret i32 %1
}

; /// Different objects, swapped induction, alias at the end
; int noAlias15 (int a) {
; int i;
; for (i=0; i<SIZE; i++)
; Foo.A[i] = Foo.B[SIZE-i-1] + a;
; return Foo.A[a];
; }
; CHECK-LABEL: define i32 @noAlias15(
; CHECK: vector.memcheck:
; CHECK-NEXT: br i1 false, label %scalar.ph, label %vector.ph
; CHECK: add nsw <4 x i32>
; CHECK: ret

define i32 @noAlias15(i32 %a) nounwind {
entry:
br label %for.body

for.body: ; preds = %entry, %for.body
%i.05 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
%sub1 = sub nuw nsw i32 99, %i.05
%arrayidx = getelementptr inbounds %struct.anon, ptr @Foo, i32 0, i32 2, i32 %sub1
%0 = load i32, ptr %arrayidx, align 4
%add = add nsw i32 %0, %a
%arrayidx2 = getelementptr inbounds [100 x i32], ptr @Foo, i32 0, i32 %i.05
store i32 %add, ptr %arrayidx2, align 4
%inc = add nuw nsw i32 %i.05, 1
%exitcond.not = icmp eq i32 %inc, 100
br i1 %exitcond.not, label %for.end, label %for.body

for.end: ; preds = %for.body
%arrayidx3 = getelementptr inbounds [100 x i32], ptr @Foo, i32 0, i32 %a
%1 = load i32, ptr %arrayidx3, align 4
ret i32 %1
}

; /// Different objects, swapped induction, alias at the beginning
; int noAlias16 (int a) {
; int i;
; for (i=0; i<SIZE; i++)
; Foo.A[SIZE-i-1] = Foo.B[i] + a;
; return Foo.A[a];
; }
; CHECK-LABEL: define i32 @noAlias16(
; CHECK: entry:
; CHECK-NEXT: br i1 false, label %scalar.ph, label %vector.ph

; CHECK: add nsw <4 x i32>
; CHECK: ret

define i32 @noAlias16(i32 %a) nounwind {
entry:
br label %for.body

for.body: ; preds = %entry, %for.body
%i.05 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
%arrayidx = getelementptr inbounds %struct.anon, ptr @Foo, i32 0, i32 2, i32 %i.05
%0 = load i32, ptr %arrayidx, align 4
%add = add nsw i32 %0, %a
%sub1 = sub nuw nsw i32 99, %i.05
%arrayidx2 = getelementptr inbounds [100 x i32], ptr @Foo, i32 0, i32 %sub1
store i32 %add, ptr %arrayidx2, align 4
%inc = add nuw nsw i32 %i.05, 1
%exitcond.not = icmp eq i32 %inc, 100
br i1 %exitcond.not, label %for.end, label %for.body

for.end: ; preds = %for.body
%arrayidx3 = getelementptr inbounds [100 x i32], ptr @Foo, i32 0, i32 %a
%1 = load i32, ptr %arrayidx3, align 4
ret i32 %1
}


;; === Now, the tests that we could vectorize with induction changes or run-time checks ===


; /// Different objects, swapped induction, alias at the end
; int mayAlias01 (int a) {
; int mayAlias01 (int a, int N) {
; int i;
; for (i=0; i<SIZE; i++)
; for (i=0; i<N; i++)
; Foo.A[i] = Foo.B[SIZE-i-1] + a;
; return Foo.A[a];
; }
; CHECK-LABEL: define i32 @mayAlias01(
; CHECK-NOT: add nsw <4 x i32>
; CHECK: vector.memcheck:
; CHECK-NEXT: [[MUL:%.+]] = shl i32 %N, 2
; CHECK-NEXT: [[SCEVGEP0:%.+]] = getelementptr i8, ptr @Foo, i32 [[MUL]]
; CHECK-NEXT: [[SUB:%.+]] = sub i32 804, [[MUL]]
; CHECK-NEXT: [[SCEVGEP1:%.+]] = getelementptr i8, ptr @Foo, i32 [[SUB]]
; CHECK-NEXT: [[BOUND:%.+]] = icmp ult ptr [[SCEVGEP1]], [[SCEVGEP0]]
; CHECK-NEXT: br i1 [[BOUND]], label %scalar.ph, label %vector.ph

; CHECK: add nsw <4 x i32>
; CHECK: ret

define i32 @mayAlias01(i32 %a) nounwind {
define i32 @mayAlias01(i32 %a, i32 %N) nounwind {
entry:
br label %for.body

Expand All @@ -519,7 +598,7 @@ for.body: ; preds = %entry, %for.body
%arrayidx2 = getelementptr inbounds [100 x i32], ptr @Foo, i32 0, i32 %i.05
store i32 %add, ptr %arrayidx2, align 4
%inc = add nuw nsw i32 %i.05, 1
%exitcond.not = icmp eq i32 %inc, 100
%exitcond.not = icmp eq i32 %inc, %N
br i1 %exitcond.not, label %for.end, label %for.body

for.end: ; preds = %for.body
Expand All @@ -529,17 +608,20 @@ for.end: ; preds = %for.body
}

; /// Different objects, swapped induction, alias at the beginning
; int mayAlias02 (int a) {
; int mayAlias02 (int a, int N) {
; int i;
; for (i=0; i<SIZE; i++)
; for (i=0; i<N; i++)
; Foo.A[SIZE-i-1] = Foo.B[i] + a;
; return Foo.A[a];
; }
; CHECK-LABEL: define i32 @mayAlias02(
; CHECK-NOT: add nsw <4 x i32>
; CHECK: vector.memcheck:
; CHECK-NEXT: br i1 false, label %scalar.ph, label %vector.ph

; CHECK: add nsw <4 x i32>
; CHECK: ret

define i32 @mayAlias02(i32 %a) nounwind {
define i32 @mayAlias02(i32 %a, i32 %N) nounwind {
entry:
br label %for.body

Expand All @@ -552,7 +634,7 @@ for.body: ; preds = %entry, %for.body
%arrayidx2 = getelementptr inbounds [100 x i32], ptr @Foo, i32 0, i32 %sub1
store i32 %add, ptr %arrayidx2, align 4
%inc = add nuw nsw i32 %i.05, 1
%exitcond.not = icmp eq i32 %inc, 100
%exitcond.not = icmp eq i32 %inc, %N
br i1 %exitcond.not, label %for.end, label %for.body

for.end: ; preds = %for.body
Expand Down
Loading