
LAA: generalize strides over unequal type sizes #108088

Closed
wants to merge 1 commit from the laa-stride-typesz branch

Conversation

artagnon
Contributor

@artagnon artagnon commented Sep 10, 2024

getDependenceDistanceStrideAndSize currently returns a non-zero TypeByteSize only if the type-sizes of the source and sink are equal. The non-zero TypeByteSize is then used by isDependent to scale the strides, the distance between the accesses, and the VF. This restriction is very artificial, as the stride sizes can be scaled by the respective type-sizes in advance, freeing isDependent of this responsibility, and removing the ugly special-case of zero-TypeByteSize.
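
For illustration only (the loop and the variable names below are made up, not taken from the patch or its tests), here is a minimal C++ sketch of the scaling idea: once each element stride is multiplied by its own type size, the two accesses can be compared in bytes, and isDependent no longer needs a TypeByteSize that is only meaningful when both sizes match.

#include <cstdint>
#include <cstdlib>

int main() {
  // Source reads i16 elements with element stride 2; sink writes i32
  // elements with element stride 1, e.g.:
  //   for (int i = 0; i < n; ++i)
  //     B32[i] = A16[2 * i];
  int64_t StrideAElems = 2, StrideBElems = 1; // element strides (assumed)
  uint64_t ASize = 2, BSize = 4;              // sizeof(i16), sizeof(i32)

  int64_t StrideABytes = StrideAElems * ASize; // 4 bytes per iteration
  int64_t StrideBBytes = StrideBElems * BSize; // 4 bytes per iteration

  // After scaling, both accesses advance by the same number of bytes per
  // iteration, so a common byte stride exists even though the element
  // strides and the type sizes differ.
  return std::abs(StrideABytes) == std::abs(StrideBBytes) ? 0 : 1;
}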

@artagnon artagnon requested review from nikic and fhahn September 10, 2024 21:02
@llvmbot llvmbot added the llvm:analysis and llvm:transforms labels Sep 10, 2024
@llvmbot
Member

llvmbot commented Sep 10, 2024

@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-llvm-transforms

Author: Ramkumar Ramachandra (artagnon)

Changes

getDependenceDistanceStrideAndSize currently returns a non-zero TypeByteSize only if the type-sizes of the source and sink are equal. The non-zero TypeByteSize is then used by isDependent to scale the strides, the distance between the accesses, and the VF. This restriction is very artificial, as the stride sizes can be scaled by the respective type-sizes in advance, freeing isDependent of this responsibility, and removing the ugly special-case of zero-TypeByteSize. The patch also has the side-effect of fixing the long-standing TODO of requesting runtime-checks when the strides are unequal.

The test impact of this patch is that several false dependencies are eliminated, and several unknown dependencies now come with runtime-checks instead.


Patch is 41.03 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/108088.diff

8 Files Affected:

  • (modified) llvm/lib/Analysis/LoopAccessAnalysis.cpp (+63-80)
  • (modified) llvm/test/Analysis/LoopAccessAnalysis/depend_diff_types.ll (+1-9)
  • (modified) llvm/test/Analysis/LoopAccessAnalysis/different-strides-safe-dep-due-to-backedge-taken-count.ll (+14-8)
  • (modified) llvm/test/Analysis/LoopAccessAnalysis/forward-loop-carried.ll (-4)
  • (modified) llvm/test/Analysis/LoopAccessAnalysis/non-constant-strides-backward.ll (+72-36)
  • (modified) llvm/test/Analysis/LoopAccessAnalysis/positive-dependence-distance-different-access-sizes.ll (+16-29)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/interleave-allocsize-not-equal-typesize.ll (+37-12)
  • (modified) llvm/test/Transforms/LoopVectorize/memory-dep-remarks.ll (+8-22)
diff --git a/llvm/lib/Analysis/LoopAccessAnalysis.cpp b/llvm/lib/Analysis/LoopAccessAnalysis.cpp
index 980f142f113265..46fbf6c0e0e71f 100644
--- a/llvm/lib/Analysis/LoopAccessAnalysis.cpp
+++ b/llvm/lib/Analysis/LoopAccessAnalysis.cpp
@@ -1805,8 +1805,7 @@ void MemoryDepChecker::mergeInStatus(VectorizationSafetyStatus S) {
 ///     }
 static bool isSafeDependenceDistance(const DataLayout &DL, ScalarEvolution &SE,
                                      const SCEV &MaxBTC, const SCEV &Dist,
-                                     uint64_t MaxStride,
-                                     uint64_t TypeByteSize) {
+                                     uint64_t MaxStride) {
 
   // If we can prove that
   //      (**) |Dist| > MaxBTC * Step
@@ -1825,8 +1824,7 @@ static bool isSafeDependenceDistance(const DataLayout &DL, ScalarEvolution &SE,
   // will be executed only if LoopCount >= VF, proving distance >= LoopCount
   // also guarantees that distance >= VF.
   //
-  const uint64_t ByteStride = MaxStride * TypeByteSize;
-  const SCEV *Step = SE.getConstant(MaxBTC.getType(), ByteStride);
+  const SCEV *Step = SE.getConstant(MaxBTC.getType(), MaxStride);
   const SCEV *Product = SE.getMulExpr(&MaxBTC, Step);
 
   const SCEV *CastedDist = &Dist;
@@ -1870,9 +1868,7 @@ static bool areStridedAccessesIndependent(uint64_t Distance, uint64_t Stride,
   if (Distance % TypeByteSize)
     return false;
 
-  uint64_t ScaledDist = Distance / TypeByteSize;
-
-  // No dependence if the scaled distance is not multiple of the stride.
+  // No dependence if the distance is not multiple of the stride.
   // E.g.
   //      for (i = 0; i < 1024 ; i += 4)
   //        A[i+2] = A[i] + 1;
@@ -1888,7 +1884,7 @@ static bool areStridedAccessesIndependent(uint64_t Distance, uint64_t Stride,
   // Two accesses in memory (scaled distance is 4, stride is 3):
   //     | A[0] |      |      | A[3] |      |      | A[6] |      |      |
   //     |      |      |      |      | A[4] |      |      | A[7] |      |
-  return ScaledDist % Stride;
+  return Distance % Stride;
 }
 
 std::variant<MemoryDepChecker::Dependence::DepType,
@@ -1927,6 +1923,7 @@ MemoryDepChecker::getDependenceDistanceStrideAndSize(
   if (StrideAPtr && *StrideAPtr < 0) {
     std::swap(Src, Sink);
     std::swap(AInst, BInst);
+    std::swap(ATy, BTy);
     std::swap(StrideAPtr, StrideBPtr);
   }
 
@@ -1970,31 +1967,51 @@ MemoryDepChecker::getDependenceDistanceStrideAndSize(
     return MemoryDepChecker::Dependence::IndirectUnsafe;
   }
 
-  int64_t StrideAPtrInt = *StrideAPtr;
-  int64_t StrideBPtrInt = *StrideBPtr;
-  LLVM_DEBUG(dbgs() << "LAA:  Src induction step: " << StrideAPtrInt
-                    << " Sink induction step: " << StrideBPtrInt << "\n");
+  LLVM_DEBUG(dbgs() << "LAA:  Src induction step: " << *StrideAPtr
+                    << " Sink induction step: " << *StrideBPtr << "\n");
+
+  // Note that store size is different from alloc size, which is dependent on
+  // store size. We use the former for checking illegal cases, and the latter
+  // for scaling strides.
+  TypeSize AStoreSz = DL.getTypeStoreSizeInBits(ATy),
+           BStoreSz = DL.getTypeStoreSizeInBits(BTy);
+
+  // When the distance is zero, we're reading/writing the same memory location:
+  // check that the store sizes are equal. Otherwise, fail with an unknown
+  // dependence for which we should not generate runtime checks.
+  if (Dist->isZero() && AStoreSz != BStoreSz)
+    return MemoryDepChecker::Dependence::Unknown;
+
+  // We can't get a uint64_t for the AllocSize if either of the store sizes
+  // are scalable.
+  if (AStoreSz.isScalable() || BStoreSz.isScalable())
+    return MemoryDepChecker::Dependence::Unknown;
+
+  // The TypeByteSize is used to scale Distance and VF. In these contexts, the
+  // only size that matters is the size of the Sink.
+  uint64_t ASz = DL.getTypeAllocSize(ATy),
+           TypeByteSize = DL.getTypeAllocSize(BTy);
+
+  // We scale the strides by the alloc-type-sizes, so we can check that the
+  // common distance is equal when ASz != BSz.
+  int64_t StrideAScaled = *StrideAPtr * ASz;
+  int64_t StrideBScaled = *StrideBPtr * TypeByteSize;
+
   // At least Src or Sink are loop invariant and the other is strided or
   // invariant. We can generate a runtime check to disambiguate the accesses.
-  if (StrideAPtrInt == 0 || StrideBPtrInt == 0)
+  if (!StrideAScaled || !StrideBScaled)
     return MemoryDepChecker::Dependence::Unknown;
 
   // Both Src and Sink have a constant stride, check if they are in the same
   // direction.
-  if ((StrideAPtrInt > 0 && StrideBPtrInt < 0) ||
-      (StrideAPtrInt < 0 && StrideBPtrInt > 0)) {
+  if (StrideAScaled > 0 != StrideBScaled > 0) {
     LLVM_DEBUG(
         dbgs() << "Pointer access with strides in different directions\n");
     return MemoryDepChecker::Dependence::Unknown;
   }
 
-  uint64_t TypeByteSize = DL.getTypeAllocSize(ATy);
-  bool HasSameSize =
-      DL.getTypeStoreSizeInBits(ATy) == DL.getTypeStoreSizeInBits(BTy);
-  if (!HasSameSize)
-    TypeByteSize = 0;
-  return DepDistanceStrideAndSizeInfo(Dist, std::abs(StrideAPtrInt),
-                                      std::abs(StrideBPtrInt), TypeByteSize,
+  return DepDistanceStrideAndSizeInfo(Dist, std::abs(StrideAScaled),
+                                      std::abs(StrideBScaled), TypeByteSize,
                                       AIsWrite, BIsWrite);
 }
 
@@ -2012,15 +2029,12 @@ MemoryDepChecker::isDependent(const MemAccessInfo &A, unsigned AIdx,
 
   auto &[Dist, StrideA, StrideB, TypeByteSize, AIsWrite, BIsWrite] =
       std::get<DepDistanceStrideAndSizeInfo>(Res);
-  bool HasSameSize = TypeByteSize > 0;
 
   std::optional<uint64_t> CommonStride =
       StrideA == StrideB ? std::make_optional(StrideA) : std::nullopt;
   if (isa<SCEVCouldNotCompute>(Dist)) {
-    // TODO: Relax requirement that there is a common stride to retry with
-    // non-constant distance dependencies.
-    FoundNonConstantDistanceDependence |= CommonStride.has_value();
     LLVM_DEBUG(dbgs() << "LAA: Dependence because of uncomputable distance.\n");
+    FoundNonConstantDistanceDependence = true;
     return Dependence::Unknown;
   }
 
@@ -2033,24 +2047,20 @@ MemoryDepChecker::isDependent(const MemAccessInfo &A, unsigned AIdx,
   // upper bound of the number of iterations), the accesses are independet, i.e.
   // they are far enough appart that accesses won't access the same location
   // across all loop ierations.
-  if (HasSameSize && isSafeDependenceDistance(
-                         DL, SE, *(PSE.getSymbolicMaxBackedgeTakenCount()),
-                         *Dist, MaxStride, TypeByteSize))
+  if (isSafeDependenceDistance(
+          DL, SE, *(PSE.getSymbolicMaxBackedgeTakenCount()), *Dist, MaxStride))
     return Dependence::NoDep;
 
-  const SCEVConstant *C = dyn_cast<SCEVConstant>(Dist);
+  const SCEVConstant *ConstDist = dyn_cast<SCEVConstant>(Dist);
 
   // Attempt to prove strided accesses independent.
-  if (C) {
-    const APInt &Val = C->getAPInt();
-    int64_t Distance = Val.getSExtValue();
+  if (ConstDist) {
+    int64_t Distance = std::abs(ConstDist->getAPInt().getSExtValue());
 
     // If the distance between accesses and their strides are known constants,
     // check whether the accesses interlace each other.
-    if (std::abs(Distance) > 0 && CommonStride && *CommonStride > 1 &&
-        HasSameSize &&
-        areStridedAccessesIndependent(std::abs(Distance), *CommonStride,
-                                      TypeByteSize)) {
+    if (Distance > 0 && CommonStride && CommonStride > 1 &&
+        areStridedAccessesIndependent(Distance, *CommonStride, TypeByteSize)) {
       LLVM_DEBUG(dbgs() << "LAA: Strided accesses are independent\n");
       return Dependence::NoDep;
     }
@@ -2063,15 +2073,9 @@ MemoryDepChecker::isDependent(const MemAccessInfo &A, unsigned AIdx,
 
   // Negative distances are not plausible dependencies.
   if (SE.isKnownNonPositive(Dist)) {
-    if (SE.isKnownNonNegative(Dist)) {
-      if (HasSameSize) {
-        // Write to the same location with the same size.
-        return Dependence::Forward;
-      }
-      LLVM_DEBUG(dbgs() << "LAA: possibly zero dependence difference but "
-                           "different type sizes\n");
-      return Dependence::Unknown;
-    }
+    if (SE.isKnownNonNegative(Dist))
+      // Write to the same location.
+      return Dependence::Forward;
 
     bool IsTrueDataDependence = (AIsWrite && !BIsWrite);
     // Check if the first access writes to a location that is read in a later
@@ -2083,17 +2087,12 @@ MemoryDepChecker::isDependent(const MemAccessInfo &A, unsigned AIdx,
     // forward dependency will allow vectorization using any width.
 
     if (IsTrueDataDependence && EnableForwardingConflictDetection) {
-      if (!C) {
-        // TODO: FoundNonConstantDistanceDependence is used as a necessary
-        // condition to consider retrying with runtime checks. Historically, we
-        // did not set it when strides were different but there is no inherent
-        // reason to.
-        FoundNonConstantDistanceDependence |= CommonStride.has_value();
+      if (!ConstDist) {
+        FoundNonConstantDistanceDependence = true;
         return Dependence::Unknown;
       }
-      if (!HasSameSize ||
-          couldPreventStoreLoadForward(C->getAPInt().abs().getZExtValue(),
-                                       TypeByteSize)) {
+      if (couldPreventStoreLoadForward(
+              ConstDist->getAPInt().abs().getZExtValue(), TypeByteSize)) {
         LLVM_DEBUG(
             dbgs() << "LAA: Forward but may prevent st->ld forwarding\n");
         return Dependence::ForwardButPreventsForwarding;
@@ -2107,27 +2106,12 @@ MemoryDepChecker::isDependent(const MemAccessInfo &A, unsigned AIdx,
   int64_t MinDistance = SE.getSignedRangeMin(Dist).getSExtValue();
   // Below we only handle strictly positive distances.
   if (MinDistance <= 0) {
-    FoundNonConstantDistanceDependence |= CommonStride.has_value();
+    FoundNonConstantDistanceDependence = true;
     return Dependence::Unknown;
   }
 
-  if (!isa<SCEVConstant>(Dist)) {
-    // Previously this case would be treated as Unknown, possibly setting
-    // FoundNonConstantDistanceDependence to force re-trying with runtime
-    // checks. Until the TODO below is addressed, set it here to preserve
-    // original behavior w.r.t. re-trying with runtime checks.
-    // TODO: FoundNonConstantDistanceDependence is used as a necessary
-    // condition to consider retrying with runtime checks. Historically, we
-    // did not set it when strides were different but there is no inherent
-    // reason to.
-    FoundNonConstantDistanceDependence |= CommonStride.has_value();
-  }
-
-  if (!HasSameSize) {
-    LLVM_DEBUG(dbgs() << "LAA: ReadWrite-Write positive dependency with "
-                         "different type sizes\n");
-    return Dependence::Unknown;
-  }
+  if (!ConstDist)
+    FoundNonConstantDistanceDependence = true;
 
   if (!CommonStride)
     return Dependence::Unknown;
@@ -2142,8 +2126,8 @@ MemoryDepChecker::isDependent(const MemAccessInfo &A, unsigned AIdx,
 
   // It's not vectorizable if the distance is smaller than the minimum distance
   // needed for a vectroized/unrolled version. Vectorizing one iteration in
-  // front needs TypeByteSize * Stride. Vectorizing the last iteration needs
-  // TypeByteSize (No need to plus the last gap distance).
+  // front needs CommonStride. Vectorizing the last iteration needs TypeByteSize
+  // (No need to plus the last gap distance).
   //
   // E.g. Assume one char is 1 byte in memory and one int is 4 bytes.
   //      foo(int *A) {
@@ -2170,10 +2154,9 @@ MemoryDepChecker::isDependent(const MemAccessInfo &A, unsigned AIdx,
   // We know that Dist is positive, but it may not be constant. Use the signed
   // minimum for computations below, as this ensures we compute the closest
   // possible dependence distance.
-  uint64_t MinDistanceNeeded =
-      TypeByteSize * *CommonStride * (MinNumIter - 1) + TypeByteSize;
+  uint64_t MinDistanceNeeded = *CommonStride * (MinNumIter - 1) + TypeByteSize;
   if (MinDistanceNeeded > static_cast<uint64_t>(MinDistance)) {
-    if (!isa<SCEVConstant>(Dist)) {
+    if (!ConstDist) {
       // For non-constant distances, we checked the lower bound of the
       // dependence distance and the distance may be larger at runtime (and safe
       // for vectorization). Classify it as Unknown, so we re-try with runtime
@@ -2228,12 +2211,12 @@ MemoryDepChecker::isDependent(const MemAccessInfo &A, unsigned AIdx,
 
   // An update to MinDepDistBytes requires an update to MaxSafeVectorWidthInBits
   // since there is a backwards dependency.
-  uint64_t MaxVF = MinDepDistBytes / (TypeByteSize * *CommonStride);
+  uint64_t MaxVF = MinDepDistBytes / *CommonStride;
   LLVM_DEBUG(dbgs() << "LAA: Positive min distance " << MinDistance
                     << " with max VF = " << MaxVF << '\n');
 
   uint64_t MaxVFInBits = MaxVF * TypeByteSize * 8;
-  if (!isa<SCEVConstant>(Dist) && MaxVFInBits < MaxTargetVectorWidthInBits) {
+  if (!ConstDist && MaxVFInBits < MaxTargetVectorWidthInBits) {
     // For non-constant distances, we checked the lower bound of the dependence
     // distance and the distance may be larger at runtime (and safe for
     // vectorization). Classify it as Unknown, so we re-try with runtime checks.
diff --git a/llvm/test/Analysis/LoopAccessAnalysis/depend_diff_types.ll b/llvm/test/Analysis/LoopAccessAnalysis/depend_diff_types.ll
index 81d8b01fe7fb72..809b15b2004952 100644
--- a/llvm/test/Analysis/LoopAccessAnalysis/depend_diff_types.ll
+++ b/llvm/test/Analysis/LoopAccessAnalysis/depend_diff_types.ll
@@ -130,16 +130,8 @@ define void @neg_dist_dep_type_size_equivalence(ptr nocapture %vec, i64 %n) {
 ; CHECK-LABEL: 'neg_dist_dep_type_size_equivalence'
 ; CHECK-NEXT:    loop:
 ; CHECK-NEXT:      Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
-; CHECK-NEXT:  Unknown data dependence.
+; CHECK-NEXT:  Backward loop carried data dependence that prevents store-to-load forwarding.
 ; CHECK-NEXT:      Dependences:
-; CHECK-NEXT:        Unknown:
-; CHECK-NEXT:            %ld.f64 = load double, ptr %gep.iv, align 8 ->
-; CHECK-NEXT:            store i32 %ld.i64.i32, ptr %gep.iv.n.i64, align 8
-; CHECK-EMPTY:
-; CHECK-NEXT:        Unknown:
-; CHECK-NEXT:            %ld.i64 = load i64, ptr %gep.iv, align 8 ->
-; CHECK-NEXT:            store i32 %ld.i64.i32, ptr %gep.iv.n.i64, align 8
-; CHECK-EMPTY:
 ; CHECK-NEXT:        BackwardVectorizableButPreventsForwarding:
 ; CHECK-NEXT:            %ld.f64 = load double, ptr %gep.iv, align 8 ->
 ; CHECK-NEXT:            store double %val, ptr %gep.iv.101.i64, align 8
diff --git a/llvm/test/Analysis/LoopAccessAnalysis/different-strides-safe-dep-due-to-backedge-taken-count.ll b/llvm/test/Analysis/LoopAccessAnalysis/different-strides-safe-dep-due-to-backedge-taken-count.ll
index 8c7df4bdf5a5a8..2fd4d5ae06094e 100644
--- a/llvm/test/Analysis/LoopAccessAnalysis/different-strides-safe-dep-due-to-backedge-taken-count.ll
+++ b/llvm/test/Analysis/LoopAccessAnalysis/different-strides-safe-dep-due-to-backedge-taken-count.ll
@@ -106,18 +106,24 @@ exit:
   ret void
 }
 
-define void @unknown_dep_not_known_safe_due_to_backedge_taken_count(ptr %A) {
-; CHECK-LABEL: 'unknown_dep_not_known_safe_due_to_backedge_taken_count'
+define void @unknown_dep_safe_with_rtchecks(ptr %A) {
+; CHECK-LABEL: 'unknown_dep_safe_with_rtchecks'
 ; CHECK-NEXT:    loop:
-; CHECK-NEXT:      Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
-; CHECK-NEXT:  Unknown data dependence.
+; CHECK-NEXT:      Memory dependences are safe with run-time checks
 ; CHECK-NEXT:      Dependences:
-; CHECK-NEXT:        Unknown:
-; CHECK-NEXT:            %l = load i32, ptr %gep, align 4 ->
-; CHECK-NEXT:            store i32 %add, ptr %gep.mul.2, align 4
-; CHECK-EMPTY:
 ; CHECK-NEXT:      Run-time memory checks:
+; CHECK-NEXT:      Check 0:
+; CHECK-NEXT:        Comparing group ([[GRP1:0x[0-9a-f]+]]):
+; CHECK-NEXT:          %gep.mul.2 = getelementptr inbounds i32, ptr %A.510, i64 %iv.mul.2
+; CHECK-NEXT:        Against group ([[GRP2:0x[0-9a-f]+]]):
+; CHECK-NEXT:          %gep = getelementptr inbounds i32, ptr %A, i64 %iv
 ; CHECK-NEXT:      Grouped accesses:
+; CHECK-NEXT:        Group [[GRP1]]:
+; CHECK-NEXT:          (Low: (2040 + %A)<nuw> High: (4084 + %A))
+; CHECK-NEXT:            Member: {(2040 + %A)<nuw>,+,8}<nuw><%loop>
+; CHECK-NEXT:        Group [[GRP2]]:
+; CHECK-NEXT:          (Low: %A High: (1024 + %A))
+; CHECK-NEXT:            Member: {%A,+,4}<nuw><%loop>
 ; CHECK-EMPTY:
 ; CHECK-NEXT:      Non vectorizable stores to invariant address were not found in loop.
 ; CHECK-NEXT:      SCEV assumptions:
diff --git a/llvm/test/Analysis/LoopAccessAnalysis/forward-loop-carried.ll b/llvm/test/Analysis/LoopAccessAnalysis/forward-loop-carried.ll
index adfd19923e921c..7837c20f003e24 100644
--- a/llvm/test/Analysis/LoopAccessAnalysis/forward-loop-carried.ll
+++ b/llvm/test/Analysis/LoopAccessAnalysis/forward-loop-carried.ll
@@ -70,10 +70,6 @@ define void @forward_different_access_sizes(ptr readnone %end, ptr %start) {
 ; CHECK-NEXT:            store i32 0, ptr %gep.2, align 4 ->
 ; CHECK-NEXT:            %l = load i24, ptr %gep.1, align 1
 ; CHECK-EMPTY:
-; CHECK-NEXT:        Forward:
-; CHECK-NEXT:            store i32 0, ptr %gep.2, align 4 ->
-; CHECK-NEXT:            store i24 %l, ptr %ptr.iv, align 1
-; CHECK-EMPTY:
 ; CHECK-NEXT:      Run-time memory checks:
 ; CHECK-NEXT:      Grouped accesses:
 ; CHECK-EMPTY:
diff --git a/llvm/test/Analysis/LoopAccessAnalysis/non-constant-strides-backward.ll b/llvm/test/Analysis/LoopAccessAnalysis/non-constant-strides-backward.ll
index 416742a94e0d36..d0e717c22cece5 100644
--- a/llvm/test/Analysis/LoopAccessAnalysis/non-constant-strides-backward.ll
+++ b/llvm/test/Analysis/LoopAccessAnalysis/non-constant-strides-backward.ll
@@ -8,15 +8,21 @@ declare void @llvm.assume(i1)
 define void @different_non_constant_strides_known_backward(ptr %A) {
 ; CHECK-LABEL: 'different_non_constant_strides_known_backward'
 ; CHECK-NEXT:    loop:
-; CHECK-NEXT:      Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
-; CHECK-NEXT:  Unknown data dependence.
+; CHECK-NEXT:      Memory dependences are safe with run-time checks
 ; CHECK-NEXT:      Dependences:
-; CHECK-NEXT:        Unknown:
-; CHECK-NEXT:            %l = load i32, ptr %gep, align 4 ->
-; CHECK-NEXT:            store i32 %add, ptr %gep.mul.2, align 4
-; CHECK-EMPTY:
 ; CHECK-NEXT:      Run-time memory checks:
+; CHECK-NEXT:      Check 0:
+; CHECK-NEXT:        Comparing group ([[GRP1:0x[0-9a-f]+]]):
+; CHECK-NEXT:          %gep.mul.2 = getelementptr inbounds i32, ptr %A, i64 %iv.mul.2
+; CHECK-NEXT:        Against group ([[GRP2:0x[0-9a-f]+]]):
+; CHECK-NEXT:          %gep = getelementptr inbounds i32, ptr %A, i64 %iv
 ; CHECK-NEXT:      Grouped accesses:
+; CHECK-NEXT:        Group [[GRP1]]:
+; CHECK-NEXT:          (Low: %A High: (2044 + %A))
+; CHECK-NEXT:            Member: {%A,+,8}<nuw><%loop>
+; CHECK-NEXT:        Group [[GRP2]]:
+; CHECK-NEXT:          (Low: %A High: (1024 + %A))
+; CHECK-NEXT:            Member: {%A,+,4}<nuw><%loop>
 ; CHECK-EMPTY:
 ; CHECK-NEXT:      Non vectorizable stores to invariant address were not found in loop.
 ; CHECK-NEXT:      SCEV assumptions:
@@ -45,15 +51,21 @@ exit:
 define void @different_non_constant_strides_known_backward_distance_larger_than_trip_count(ptr %A) {
 ; CHECK-LABEL: 'different_non_constant_strides_known_backward_distance_larger_than_trip_count'
 ; CHECK-NEXT:    loop:
-; CHECK-NEXT:      Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
-; CHECK-NEXT:  Unknown data dependence.
+; CHECK-NEXT:      Memory dependences are safe with run-time checks
 ; CHECK-NEXT:      Dependences:
-; CHECK-NEXT:        Unknown:
-; CHECK-NEXT:            %l = load i32, ptr %gep, align 4...
[truncated]

@artagnon artagnon requested a review from arsenm September 11, 2024 09:43
@artagnon
Contributor Author

Gentle ping.

@artagnon
Contributor Author

artagnon commented Oct 2, 2024

Gentle ping.

@arsenm arsenm requested review from preames and huntergr-arm October 2, 2024 10:40
Contributor

@fhahn fhahn left a comment

It looks like there are 2 distinct changes in the patch

  • improved reasoning with non-matching strides (change in llvm/test/Analysis/LoopAccessAnalysis/depend_diff_types.ll?)
  • enabling runtime checks for cases where there's no common stride.

Could this be split up? The first should hopefully be a strict improvement, while the latter is problematic because in many cases the runtime checks will always be false.

There are a few patches that should help avoid generating RT checks in most cases where we can prove that they will always be false, including #91962 and #91196.
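
To make the "always false" concern concrete: the sketch below (plain C++, not LAA code; the bounds mirror the {%A,+,4} and {%A,+,8} groups in the non-constant-strides-backward.ll output above) checks whether the two access ranges can ever be disjoint. Because both groups start at the same base pointer, the overlap check can never succeed, so emitting it only adds run-time cost.

#include <cstdint>
#include <cstdio>

struct Range { uint64_t Low, High; };

// Half-open ranges [Low, High) are disjoint iff one ends before the other
// begins.
static bool disjoint(Range X, Range Y) {
  return X.High <= Y.Low || Y.High <= X.Low;
}

int main() {
  const uint64_t A = 0x1000, TripCount = 256;
  // {%A,+,8} accessing i32: covers [A, A + 8*(TripCount-1) + 4)
  Range G1{A, A + 8 * (TripCount - 1) + 4};
  // {%A,+,4} accessing i32: covers [A, A + 4*(TripCount-1) + 4)
  Range G2{A, A + 4 * (TripCount - 1) + 4};
  // Both ranges begin at A, so they are never disjoint for TripCount > 0 and
  // the generated runtime check would always fail.
  std::printf("disjoint: %s\n", disjoint(G1, G2) ? "yes" : "no");
  return 0;
}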

@artagnon
Contributor Author

artagnon commented Oct 3, 2024

It looks like there are 2 distinct changes in the patch

  • improved reasoning with non-matching strides (change in llvm/test/Analysis/LoopAccessAnalysis/depend_diff_types.ll?)
  • enabling runtime checks for cases where there's no common stride.

Could this be split up? The first should hopefully be a strict improvement, while the latter is problematic because in many cases the runtime checks will always be false.

I initially just had the first change, but since it scales the strides, checking for a common stride results in regressions. That was the reason I decided to enable runtime checks when there is no common stride. Will re-investigate whether it can be split up.
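
A small illustration of that interaction (values are made up; this is not LAA code): two accesses with the same element stride but different element sizes no longer share a common stride once the strides are expressed in bytes, so the old common-stride paths stop firing unless runtime checks are allowed to pick up the slack.

#include <cstdint>
#include <optional>

int main() {
  int64_t StrideElems = 1;            // same element stride for src and sink
  uint64_t SrcSize = 4, SinkSize = 8; // e.g. an i32 load and a double store

  int64_t SrcBytes = StrideElems * SrcSize;   // 4
  int64_t SinkBytes = StrideElems * SinkSize; // 8

  // Before scaling, the accesses shared a common (element) stride of 1;
  // after scaling there is no common byte stride, so the analysis has to
  // return Unknown or fall back to runtime checks.
  std::optional<uint64_t> CommonStride =
      SrcBytes == SinkBytes ? std::make_optional<uint64_t>(SrcBytes)
                            : std::nullopt;
  return CommonStride ? 0 : 1;
}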

@artagnon
Contributor Author

artagnon commented Oct 3, 2024


Thanks, just reverted second change and had a look at the test changes once again: the regressions seem to be non-serious.

EDIT: I think this patch needs more work.

@artagnon artagnon marked this pull request as draft October 8, 2024 18:16
@artagnon artagnon marked this pull request as ready for review October 14, 2024 14:16
@artagnon
Contributor Author

I just finished re-working the patch. The test impact is now quite minimal, and clearly positive. Kindly have a look.

@artagnon
Contributor Author

Gentle ping.

@artagnon artagnon requested a review from david-arm October 22, 2024 20:39
@artagnon
Contributor Author

Gentle ping.

2 similar comments
@artagnon
Contributor Author

Gentle ping.

@artagnon
Contributor Author

Gentle ping.

getDependenceDistanceStrideAndSize currently returns a non-zero
TypeByteSize only if the type-sizes of the source and sink are equal.
The non-zero TypeByteSize is then used by isDependent to scale the
strides, the distance between the accesses, and the VF. This restriction
is very artificial, as the stride sizes can be scaled by the respective
type-sizes in advance, freeing isDependent of this responsibility, and
removing the ugly special-case of zero-TypeByteSize. The patch also has
the side-effect of fixing the long-standing TODO of requesting
runtime-checks when the strides are unequal.

The test impact of this patch is that several false dependencies are
eliminated, and several unknown dependencies now come with
runtime-checks instead.
@artagnon
Contributor Author

artagnon commented Dec 9, 2024

Gentle ping.

1 similar comment
@artagnon
Contributor Author

Gentle ping.

Contributor

@david-arm david-arm left a comment

Hi, I've not done a thorough review yet, but I've left some comments so far. Hope they are useful!

@@ -1799,8 +1799,7 @@ void MemoryDepChecker::mergeInStatus(VectorizationSafetyStatus S) {
/// }
static bool isSafeDependenceDistance(const DataLayout &DL, ScalarEvolution &SE,
const SCEV &MaxBTC, const SCEV &Dist,
uint64_t MaxStride,

I think the comment above needs updating now that you've removed TypeByteSize and MaxStride is now in bytes.

@@ -1882,7 +1878,7 @@ static bool areStridedAccessesIndependent(uint64_t Distance, uint64_t Stride,
// Two accesses in memory (scaled distance is 4, stride is 3):
// | A[0] | | | A[3] | | | A[6] | | |
// | | | | | A[4] | | | A[7] | |
return ScaledDist % Stride;
return Distance % Stride;

It looks like you've changed the definition of Stride to be in bytes, not multiples of the type size. Can you add something to the comment to reflect this?
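
A quick worked example of the check under both definitions, reusing the loop from the comment in areStridedAccessesIndependent (i32 accesses A[i] and A[i+2] with i += 4); the code is illustrative, not the LAA implementation:

#include <cassert>
#include <cstdint>

int main() {
  const uint64_t TypeByteSize = 4;      // i32
  const uint64_t DistanceBytes = 2 * 4; // A[i+2] vs A[i]: 8 bytes
  const uint64_t StrideElems = 4;       // i += 4
  const uint64_t StrideBytes = StrideElems * TypeByteSize; // 16 bytes

  // Old form: scale the distance down to elements and compare against the
  // element stride (after checking Distance % TypeByteSize == 0).
  bool OldIndependent = (DistanceBytes / TypeByteSize) % StrideElems != 0;
  // New form: Distance and Stride are both already in bytes.
  bool NewIndependent = DistanceBytes % StrideBytes != 0;

  assert(OldIndependent && NewIndependent); // both conclude "independent"
  return 0;
}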

@@ -1921,6 +1917,7 @@ MemoryDepChecker::getDependenceDistanceStrideAndSize(
if (StrideAPtr && *StrideAPtr < 0) {
std::swap(Src, Sink);
std::swap(AInst, BInst);
std::swap(ATy, BTy);

This looks like a bug fix and unrelated to this patch - can you commit this in a separate PR please?

LLVM_DEBUG(dbgs() << "LAA: Src induction step: " << *StrideAPtr
<< " Sink induction step: " << *StrideBPtr << "\n");

// Note that store size is different from alloc size, which is dependent on

I'm not sure if I understand the comment "store size is different from alloc size, which is dependent on store size". Do you mean that alloc size could be greater than the store size?
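
For reference, the distinction in question, shown with a simplified model of DataLayout's rounding rather than calls into LLVM (the i24 example comes from forward-loop-carried.ll above; the 4-byte ABI alignment is an assumption): the store size rounds the bit width up to whole bytes, while the alloc size additionally rounds up to the ABI alignment, so the alloc size can indeed be larger.

#include <cstdint>
#include <cstdio>

static uint64_t storeSizeBytes(uint64_t Bits) { return (Bits + 7) / 8; }

static uint64_t allocSizeBytes(uint64_t Bits, uint64_t ABIAlign) {
  uint64_t Store = storeSizeBytes(Bits);
  return (Store + ABIAlign - 1) / ABIAlign * ABIAlign;
}

int main() {
  // i24 with a 4-byte ABI alignment: store size 3, alloc size 4.
  std::printf("i24: store=%llu alloc=%llu\n",
              (unsigned long long)storeSizeBytes(24),
              (unsigned long long)allocSizeBytes(24, 4));
  // i32: both are 4, so the distinction only matters for "odd" widths.
  std::printf("i32: store=%llu alloc=%llu\n",
              (unsigned long long)storeSizeBytes(32),
              (unsigned long long)allocSizeBytes(32, 4));
  return 0;
}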

<< " Sink induction step: " << *StrideBPtr << "\n");

// Note that store size is different from alloc size, which is dependent on
// store size. We use the former for checking illegal cases, and the latter

By former I assume you mean store size? I think it's clearer to write out exactly what you're referring to, i.e. `We use the store size for checking illegal cases, and the alloc size for scaling strides.`

bool ShouldRetryWithRuntimeCheck =
std::abs(*StrideAPtr) == std::abs(*StrideBPtr);

return DepDistanceStrideAndSizeInfo(Dist, MaxStride, CommonStride,

It looks like you could reduce the complexity of this patch with an initial NFC refactoring PR that changes the constructor of DepDistanceStrideAndSizeInfo to take the MaxStride, CommonStride and ShouldRetryWithRuntimeCheck parameters. Essentially, it's just pushing some of the work into constructing the DepDistanceStrideAndSizeInfo object rather than in MemoryDepChecker::isDependent, which seems a sensible thing to do given we could call isDependent many times on the same object.
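
One possible shape of that NFC refactor, as a simplified stand-in rather than the actual LAA declarations (the SCEV type is opaque here and the field order is an assumption based on the constructor call quoted above):

#include <cstdint>
#include <optional>

struct SCEV; // opaque stand-in for llvm::SCEV

struct DepDistanceStrideAndSizeInfo {
  const SCEV *Dist;
  uint64_t MaxStride;                   // max of the two byte strides
  std::optional<uint64_t> CommonStride; // set only if the byte strides match
  bool ShouldRetryWithRuntimeCheck;     // |StrideA| == |StrideB| (unscaled)
  uint64_t TypeByteSize;
  bool AIsWrite, BIsWrite;

  DepDistanceStrideAndSizeInfo(const SCEV *Dist, uint64_t MaxStride,
                               std::optional<uint64_t> CommonStride,
                               bool ShouldRetryWithRuntimeCheck,
                               uint64_t TypeByteSize, bool AIsWrite,
                               bool BIsWrite)
      : Dist(Dist), MaxStride(MaxStride), CommonStride(CommonStride),
        ShouldRetryWithRuntimeCheck(ShouldRetryWithRuntimeCheck),
        TypeByteSize(TypeByteSize), AIsWrite(AIsWrite), BIsWrite(BIsWrite) {}
};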

@@ -2168,8 +2185,7 @@ MemoryDepChecker::isDependent(const MemAccessInfo &A, unsigned AIdx,
// We know that Dist is positive, but it may not be constant. Use the signed
// minimum for computations below, as this ensures we compute the closest
// possible dependence distance.
uint64_t MinDistanceNeeded =
TypeByteSize * *CommonStride * (MinNumIter - 1) + TypeByteSize;

Changing CommonStride to be scaled seems like an improvement that could be split off as NFC, reducing the diff?
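
As a sanity check on the equivalence behind this suggestion (illustrative numbers, not LAA code): with the stride pre-scaled to bytes, MinDistanceNeeded works out to the same value as the old TypeByteSize * CommonStride form.

#include <cassert>
#include <cstdint>

int main() {
  const uint64_t TypeByteSize = 4; // i32
  const uint64_t CommonStrideElems = 2;
  const uint64_t CommonStrideBytes = CommonStrideElems * TypeByteSize; // 8
  const uint64_t MinNumIter = 4;

  uint64_t Old =
      TypeByteSize * CommonStrideElems * (MinNumIter - 1) + TypeByteSize;
  uint64_t New = CommonStrideBytes * (MinNumIter - 1) + TypeByteSize;
  assert(Old == New); // both are 28 bytes
  return 0;
}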

FoundNonConstantDistanceDependence |= CommonStride.has_value();
// TODO: Relax requirement that there is a common unscaled stride to retry
// with non-constant distance dependencies.
FoundNonConstantDistanceDependence |= ShouldRetryWithRuntimeCheck;

Can we apply the renaming as an NFC patch first, so this reduces the diff here?

@nikic nikic removed their request for review January 14, 2025 19:41
@artagnon artagnon closed this Mar 8, 2025
@artagnon artagnon deleted the laa-stride-typesz branch March 8, 2025 13:08