Skip to content

[LAA] Scale strides using type-size (NFC) #124529

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Feb 20, 2025
Merged

Conversation

artagnon
Copy link
Contributor

@artagnon artagnon commented Jan 27, 2025

Change getDependenceDistanceStrideAndSize to scale strides by TypeByteSize, scaling the returned CommonStride and MaxStride. Even though there is a seemingly-functional change of setting CommonStride when scaled strides are equal, it ends up being a non-functional change due to aggressive HasSameSize checking.

@llvmbot llvmbot added the llvm:analysis Includes value tracking, cost tables and constant folding label Jan 27, 2025
@llvmbot
Copy link
Member

llvmbot commented Jan 27, 2025

@llvm/pr-subscribers-llvm-analysis

Author: Ramkumar Ramachandra (artagnon)

Changes

Change getDependenceDistanceStrideAndSize to scale strides by TypeByteSize, scaling the returned CommonStride and MaxStride. We now correctly detect that there is indeed a common stride, when we failed to previously due to differing type sizes.


Full diff: https://github.com/llvm/llvm-project/pull/124529.diff

4 Files Affected:

  • (modified) llvm/include/llvm/Analysis/LoopAccessAnalysis.h (+1-1)
  • (modified) llvm/lib/Analysis/LoopAccessAnalysis.cpp (+35-37)
  • (modified) llvm/test/Analysis/LoopAccessAnalysis/forward-loop-carried.ll (-4)
  • (modified) llvm/test/Analysis/LoopAccessAnalysis/stride-access-dependence.ll (+5-33)
diff --git a/llvm/include/llvm/Analysis/LoopAccessAnalysis.h b/llvm/include/llvm/Analysis/LoopAccessAnalysis.h
index 31374a128856c7..3059cfcb9bbccd 100644
--- a/llvm/include/llvm/Analysis/LoopAccessAnalysis.h
+++ b/llvm/include/llvm/Analysis/LoopAccessAnalysis.h
@@ -370,7 +370,7 @@ class MemoryDepChecker {
     /// Strides could either be scaled (in bytes, taking the size of the
     /// underlying type into account), or unscaled (in indexing units; unscaled
     /// stride = scaled stride / size of underlying type). Here, strides are
-    /// unscaled.
+    /// scaled.
     uint64_t MaxStride;
     std::optional<uint64_t> CommonStride;
 
diff --git a/llvm/lib/Analysis/LoopAccessAnalysis.cpp b/llvm/lib/Analysis/LoopAccessAnalysis.cpp
index 2a68979add666d..571e63a3ccb46c 100644
--- a/llvm/lib/Analysis/LoopAccessAnalysis.cpp
+++ b/llvm/lib/Analysis/LoopAccessAnalysis.cpp
@@ -1799,8 +1799,7 @@ void MemoryDepChecker::mergeInStatus(VectorizationSafetyStatus S) {
 ///     }
 static bool isSafeDependenceDistance(const DataLayout &DL, ScalarEvolution &SE,
                                      const SCEV &MaxBTC, const SCEV &Dist,
-                                     uint64_t MaxStride,
-                                     uint64_t TypeByteSize) {
+                                     uint64_t MaxStride) {
 
   // If we can prove that
   //      (**) |Dist| > MaxBTC * Step
@@ -1819,8 +1818,7 @@ static bool isSafeDependenceDistance(const DataLayout &DL, ScalarEvolution &SE,
   // will be executed only if LoopCount >= VF, proving distance >= LoopCount
   // also guarantees that distance >= VF.
   //
-  const uint64_t ByteStride = MaxStride * TypeByteSize;
-  const SCEV *Step = SE.getConstant(MaxBTC.getType(), ByteStride);
+  const SCEV *Step = SE.getConstant(MaxBTC.getType(), MaxStride);
   const SCEV *Product = SE.getMulExpr(&MaxBTC, Step);
 
   const SCEV *CastedDist = &Dist;
@@ -1854,24 +1852,16 @@ static bool isSafeDependenceDistance(const DataLayout &DL, ScalarEvolution &SE,
 /// bytes.
 ///
 /// \returns true if they are independent.
-static bool areStridedAccessesIndependent(uint64_t Distance, uint64_t Stride,
-                                          uint64_t TypeByteSize) {
+static bool areStridedAccessesIndependent(uint64_t Distance, uint64_t Stride) {
   assert(Stride > 1 && "The stride must be greater than 1");
-  assert(TypeByteSize > 0 && "The type size in byte must be non-zero");
   assert(Distance > 0 && "The distance must be non-zero");
 
-  // Skip if the distance is not multiple of type byte size.
-  if (Distance % TypeByteSize)
-    return false;
-
-  uint64_t ScaledDist = Distance / TypeByteSize;
-
-  // No dependence if the scaled distance is not multiple of the stride.
+  // No dependence if the distance is not multiple of the stride.
   // E.g.
   //      for (i = 0; i < 1024 ; i += 4)
   //        A[i+2] = A[i] + 1;
   //
-  // Two accesses in memory (scaled distance is 2, stride is 4):
+  // Two accesses in memory (distance is 2, stride is 4):
   //     | A[0] |      |      |      | A[4] |      |      |      |
   //     |      |      | A[2] |      |      |      | A[6] |      |
   //
@@ -1879,10 +1869,10 @@ static bool areStridedAccessesIndependent(uint64_t Distance, uint64_t Stride,
   //      for (i = 0; i < 1024 ; i += 3)
   //        A[i+4] = A[i] + 1;
   //
-  // Two accesses in memory (scaled distance is 4, stride is 3):
+  // Two accesses in memory (distance is 4, stride is 3):
   //     | A[0] |      |      | A[3] |      |      | A[6] |      |      |
   //     |      |      |      |      | A[4] |      |      | A[7] |      |
-  return ScaledDist % Stride;
+  return Distance % Stride;
 }
 
 std::variant<MemoryDepChecker::Dependence::DepType,
@@ -1990,25 +1980,34 @@ MemoryDepChecker::getDependenceDistanceStrideAndSize(
     return MemoryDepChecker::Dependence::Unknown;
   }
 
-  uint64_t TypeByteSize = DL.getTypeAllocSize(ATy);
-  bool HasSameSize =
-      DL.getTypeStoreSizeInBits(ATy) == DL.getTypeStoreSizeInBits(BTy);
-  if (!HasSameSize)
-    TypeByteSize = 0;
+  TypeSize AStoreSz = DL.getTypeStoreSize(ATy),
+           BStoreSz = DL.getTypeStoreSize(BTy);
+
+  // Fail early if either store size is scalable.
+  if (AStoreSz.isScalable() || BStoreSz.isScalable())
+    return MemoryDepChecker::Dependence::Unknown;
+
+  // The TypeByteSize is used to scale Distance and VF. In these contexts, the
+  // only size that matters is the size of the Sink. If store sizes are not the
+  // same, set TypeByteSize to zero, so we can check it in the caller.
+  uint64_t ASz = alignTo(AStoreSz, DL.getABITypeAlign(ATy)),
+           TypeByteSize = AStoreSz == BStoreSz
+                              ? alignTo(BStoreSz, DL.getABITypeAlign(BTy))
+                              : 0;
 
-  StrideAPtrInt = std::abs(StrideAPtrInt);
-  StrideBPtrInt = std::abs(StrideBPtrInt);
+  uint64_t StrideAScaled = std::abs(StrideAPtrInt) * ASz;
+  uint64_t StrideBScaled = std::abs(StrideBPtrInt) * TypeByteSize;
 
-  uint64_t MaxStride = std::max(StrideAPtrInt, StrideBPtrInt);
+  uint64_t MaxStride = std::max(StrideAScaled, StrideBScaled);
 
   std::optional<uint64_t> CommonStride;
-  if (StrideAPtrInt == StrideBPtrInt)
-    CommonStride = StrideAPtrInt;
+  if (StrideAScaled == StrideBScaled)
+    CommonStride = StrideAScaled;
 
   // TODO: Historically, we don't retry with runtime checks unless the
   // (unscaled) strides are the same. Fix this once the condition for runtime
   // checks in isDependent is fixed.
-  bool ShouldRetryWithRuntimeCheck = CommonStride.has_value();
+  bool ShouldRetryWithRuntimeCheck = StrideAPtrInt == StrideBPtrInt;
 
   return DepDistanceStrideAndSizeInfo(Dist, MaxStride, CommonStride,
                                       ShouldRetryWithRuntimeCheck, TypeByteSize,
@@ -2048,9 +2047,9 @@ MemoryDepChecker::isDependent(const MemAccessInfo &A, unsigned AIdx,
   // upper bound of the number of iterations), the accesses are independet, i.e.
   // they are far enough appart that accesses won't access the same location
   // across all loop ierations.
-  if (HasSameSize && isSafeDependenceDistance(
-                         DL, SE, *(PSE.getSymbolicMaxBackedgeTakenCount()),
-                         *Dist, MaxStride, TypeByteSize))
+  if (HasSameSize &&
+      isSafeDependenceDistance(
+          DL, SE, *(PSE.getSymbolicMaxBackedgeTakenCount()), *Dist, MaxStride))
     return Dependence::NoDep;
 
   const SCEVConstant *ConstDist = dyn_cast<SCEVConstant>(Dist);
@@ -2062,7 +2061,7 @@ MemoryDepChecker::isDependent(const MemAccessInfo &A, unsigned AIdx,
     // If the distance between accesses and their strides are known constants,
     // check whether the accesses interlace each other.
     if (Distance > 0 && CommonStride && CommonStride > 1 && HasSameSize &&
-        areStridedAccessesIndependent(Distance, *CommonStride, TypeByteSize)) {
+        areStridedAccessesIndependent(Distance, *CommonStride)) {
       LLVM_DEBUG(dbgs() << "LAA: Strided accesses are independent\n");
       return Dependence::NoDep;
     }
@@ -2154,8 +2153,8 @@ MemoryDepChecker::isDependent(const MemAccessInfo &A, unsigned AIdx,
 
   // It's not vectorizable if the distance is smaller than the minimum distance
   // needed for a vectroized/unrolled version. Vectorizing one iteration in
-  // front needs TypeByteSize * Stride. Vectorizing the last iteration needs
-  // TypeByteSize (No need to plus the last gap distance).
+  // front needs CommonStride. Vectorizing the last iteration needs TypeByteSize
+  // (No need to plus the last gap distance).
   //
   // E.g. Assume one char is 1 byte in memory and one int is 4 bytes.
   //      foo(int *A) {
@@ -2182,8 +2181,7 @@ MemoryDepChecker::isDependent(const MemAccessInfo &A, unsigned AIdx,
   // We know that Dist is positive, but it may not be constant. Use the signed
   // minimum for computations below, as this ensures we compute the closest
   // possible dependence distance.
-  uint64_t MinDistanceNeeded =
-      TypeByteSize * *CommonStride * (MinNumIter - 1) + TypeByteSize;
+  uint64_t MinDistanceNeeded = *CommonStride * (MinNumIter - 1) + TypeByteSize;
   if (MinDistanceNeeded > static_cast<uint64_t>(MinDistance)) {
     if (!ConstDist) {
       // For non-constant distances, we checked the lower bound of the
@@ -2239,7 +2237,7 @@ MemoryDepChecker::isDependent(const MemAccessInfo &A, unsigned AIdx,
 
   // An update to MinDepDistBytes requires an update to MaxSafeVectorWidthInBits
   // since there is a backwards dependency.
-  uint64_t MaxVF = MinDepDistBytes / (TypeByteSize * *CommonStride);
+  uint64_t MaxVF = MinDepDistBytes / *CommonStride;
   LLVM_DEBUG(dbgs() << "LAA: Positive min distance " << MinDistance
                     << " with max VF = " << MaxVF << '\n');
 
diff --git a/llvm/test/Analysis/LoopAccessAnalysis/forward-loop-carried.ll b/llvm/test/Analysis/LoopAccessAnalysis/forward-loop-carried.ll
index adfd19923e921c..455b23c4a8637c 100644
--- a/llvm/test/Analysis/LoopAccessAnalysis/forward-loop-carried.ll
+++ b/llvm/test/Analysis/LoopAccessAnalysis/forward-loop-carried.ll
@@ -62,10 +62,6 @@ define void @forward_different_access_sizes(ptr readnone %end, ptr %start) {
 ; CHECK-NEXT:      Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
 ; CHECK-NEXT:  Forward loop carried data dependence that prevents store-to-load forwarding.
 ; CHECK-NEXT:      Dependences:
-; CHECK-NEXT:        Forward:
-; CHECK-NEXT:            %l = load i24, ptr %gep.1, align 1 ->
-; CHECK-NEXT:            store i24 %l, ptr %ptr.iv, align 1
-; CHECK-EMPTY:
 ; CHECK-NEXT:        ForwardButPreventsForwarding:
 ; CHECK-NEXT:            store i32 0, ptr %gep.2, align 4 ->
 ; CHECK-NEXT:            %l = load i24, ptr %gep.1, align 1
diff --git a/llvm/test/Analysis/LoopAccessAnalysis/stride-access-dependence.ll b/llvm/test/Analysis/LoopAccessAnalysis/stride-access-dependence.ll
index ef19e173b65994..cac29a0a2cb10f 100644
--- a/llvm/test/Analysis/LoopAccessAnalysis/stride-access-dependence.ll
+++ b/llvm/test/Analysis/LoopAccessAnalysis/stride-access-dependence.ll
@@ -416,13 +416,8 @@ for.body:                                         ; preds = %entry, %for.body
 define void @vectorizable_unscaled_Read_Write(ptr nocapture %A) {
 ; CHECK-LABEL: 'vectorizable_unscaled_Read_Write'
 ; CHECK-NEXT:    for.body:
-; CHECK-NEXT:      Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
-; CHECK-NEXT:  Backward loop carried data dependence that prevents store-to-load forwarding.
+; CHECK-NEXT:      Memory dependences are safe
 ; CHECK-NEXT:      Dependences:
-; CHECK-NEXT:        BackwardVectorizableButPreventsForwarding:
-; CHECK-NEXT:            %0 = load i32, ptr %arrayidx, align 4 ->
-; CHECK-NEXT:            store i32 %add, ptr %arrayidx2, align 4
-; CHECK-EMPTY:
 ; CHECK-NEXT:      Run-time memory checks:
 ; CHECK-NEXT:      Grouped accesses:
 ; CHECK-EMPTY:
@@ -464,12 +459,8 @@ for.body:                                         ; preds = %entry, %for.body
 define i32 @vectorizable_unscaled_Write_Read(ptr nocapture %A) {
 ; CHECK-LABEL: 'vectorizable_unscaled_Write_Read'
 ; CHECK-NEXT:    for.body:
-; CHECK-NEXT:      Memory dependences are safe with a maximum safe vector width of 64 bits
+; CHECK-NEXT:      Memory dependences are safe
 ; CHECK-NEXT:      Dependences:
-; CHECK-NEXT:        BackwardVectorizable:
-; CHECK-NEXT:            store i32 %0, ptr %arrayidx, align 4 ->
-; CHECK-NEXT:            %1 = load i32, ptr %arrayidx2, align 4
-; CHECK-EMPTY:
 ; CHECK-NEXT:      Run-time memory checks:
 ; CHECK-NEXT:      Grouped accesses:
 ; CHECK-EMPTY:
@@ -508,13 +499,8 @@ for.body:                                         ; preds = %entry, %for.body
 define void @unsafe_unscaled_Read_Write(ptr nocapture %A) {
 ; CHECK-LABEL: 'unsafe_unscaled_Read_Write'
 ; CHECK-NEXT:    for.body:
-; CHECK-NEXT:      Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
-; CHECK-NEXT:  Backward loop carried data dependence.
+; CHECK-NEXT:      Memory dependences are safe
 ; CHECK-NEXT:      Dependences:
-; CHECK-NEXT:        Backward:
-; CHECK-NEXT:            %0 = load i32, ptr %arrayidx, align 4 ->
-; CHECK-NEXT:            store i32 %add, ptr %arrayidx2, align 4
-; CHECK-EMPTY:
 ; CHECK-NEXT:      Run-time memory checks:
 ; CHECK-NEXT:      Grouped accesses:
 ; CHECK-EMPTY:
@@ -551,13 +537,8 @@ for.body:                                         ; preds = %entry, %for.body
 define void @unsafe_unscaled_Read_Write2(ptr nocapture %A) {
 ; CHECK-LABEL: 'unsafe_unscaled_Read_Write2'
 ; CHECK-NEXT:    for.body:
-; CHECK-NEXT:      Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
-; CHECK-NEXT:  Backward loop carried data dependence.
+; CHECK-NEXT:      Memory dependences are safe
 ; CHECK-NEXT:      Dependences:
-; CHECK-NEXT:        Backward:
-; CHECK-NEXT:            %0 = load i32, ptr %arrayidx, align 4 ->
-; CHECK-NEXT:            store i32 %add, ptr %arrayidx2, align 4
-; CHECK-EMPTY:
 ; CHECK-NEXT:      Run-time memory checks:
 ; CHECK-NEXT:      Grouped accesses:
 ; CHECK-EMPTY:
@@ -602,17 +583,8 @@ for.body:                                         ; preds = %entry, %for.body
 define void @interleaved_stores(ptr nocapture %A) {
 ; CHECK-LABEL: 'interleaved_stores'
 ; CHECK-NEXT:    for.body:
-; CHECK-NEXT:      Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
-; CHECK-NEXT:  Backward loop carried data dependence.
+; CHECK-NEXT:      Memory dependences are safe
 ; CHECK-NEXT:      Dependences:
-; CHECK-NEXT:        Backward:
-; CHECK-NEXT:            store i32 %2, ptr %arrayidx5, align 4 ->
-; CHECK-NEXT:            store i32 %2, ptr %arrayidx9, align 4
-; CHECK-EMPTY:
-; CHECK-NEXT:        Backward:
-; CHECK-NEXT:            store i32 %0, ptr %arrayidx2, align 4 ->
-; CHECK-NEXT:            store i32 %2, ptr %arrayidx5, align 4
-; CHECK-EMPTY:
 ; CHECK-NEXT:      Run-time memory checks:
 ; CHECK-NEXT:      Grouped accesses:
 ; CHECK-EMPTY:

@artagnon artagnon changed the title LAA: scale strides using type-size LAA: scale strides using type-size (NFC) Jan 27, 2025
Copy link
Contributor

@fhahn fhahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a step back, are there any planned follow-ups that make it easier to work with scaled strides?

@artagnon
Copy link
Contributor Author

artagnon commented Jan 31, 2025

Taking a step back, are there any planned follow-ups that make it easier to work with scaled strides?

Yes, I plan to iterate on the aggressive HasSameSize check, weakening it. A version of that patch is #122318, but I fear that it might be incorrect without scaling strides.

@artagnon
Copy link
Contributor Author

artagnon commented Feb 6, 2025

Gentle ping.

@artagnon artagnon requested a review from Meinersbur February 14, 2025 14:27
@artagnon
Copy link
Contributor Author

Gentle ping.

Change getDependenceDistanceStrideAndSize to scale strides by
TypeByteSize, scaling the returned CommonStride and MaxStride. We now
correctly detect that there is indeed a common stride, when we failed to
previously due to differing type sizes.
@artagnon artagnon changed the title LAA: scale strides using type-size (NFC) [LAA] Scale strides using type-size (NFC) Feb 20, 2025
Copy link
Member

@Meinersbur Meinersbur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@artagnon artagnon merged commit 6eba277 into llvm:main Feb 20, 2025
8 checks passed
@artagnon artagnon deleted the laa-scaled-stride branch February 20, 2025 15:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
llvm:analysis Includes value tracking, cost tables and constant folding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants