[LoopVectorizer][AArch64] Move getMinTripCountTailFoldingThreshold later. #132170

davemgreen · 2025-03-20T09:12:11Z

This moves the checks of MinTripCountTailFoldingThreshold later, during the
calculation of whether to tail fold. This allows it to check beforehand whether
tail predication is required, either for scalable or fixed-width vectors.

This option is only specified for AArch64, where it returns the minimum of 5.
This patch aims to allow the vectorization of TC=4 loops, preventing them from
performing slower when SVE is present.

llvmbot · 2025-03-20T09:12:46Z

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-vectorizers

Author: David Green (davemgreen)

Changes

This moves the checks of MinTripCountTailFoldingThreshold later, during the
calculation of whether to tail fold. This allows it to check beforehand whether
tail predication is required, either for scalable or fixed-width vectors.

This option is only specified for AArch64, where it returns the minimum of 5.
This patch aims to allow the vectorization of TC=4 loops, preventing them from
performing slower when SVE is present.

Patch is 29.08 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/132170.diff

2 Files Affected:

(modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+39-26)
(modified) llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll (+244-32)

diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 584da9b7baef5..8789273cd40f7 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -4087,11 +4087,8 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
       MaxPowerOf2RuntimeVF = std::nullopt; // Stick with tail-folding for now.
   }
 
-  if (MaxPowerOf2RuntimeVF && *MaxPowerOf2RuntimeVF > 0) {
-    assert((UserVF.isNonZero() || isPowerOf2_32(*MaxPowerOf2RuntimeVF)) &&
-           "MaxFixedVF must be a power of 2");
-    unsigned MaxVFtimesIC =
-        UserIC ? *MaxPowerOf2RuntimeVF * UserIC : *MaxPowerOf2RuntimeVF;
+  auto IsKnownModTripCountZero = [this, &UserIC](unsigned MaxVF) {
+    unsigned MaxVFtimesIC = UserIC ? MaxVF * UserIC : MaxVF;
     ScalarEvolution *SE = PSE.getSE();
     // Currently only loops with countable exits are vectorized, but calling
     // getSymbolicMaxBackedgeTakenCount allows enablement work for loops with
@@ -4105,13 +4102,40 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
     const SCEV *Rem = SE->getURemExpr(
         SE->applyLoopGuards(ExitCount, TheLoop),
         SE->getConstant(BackedgeTakenCount->getType(), MaxVFtimesIC));
-    if (Rem->isZero()) {
+    return Rem->isZero();
+  };
+
+  if (MaxPowerOf2RuntimeVF && *MaxPowerOf2RuntimeVF > 0) {
+    assert((UserVF.isNonZero() || isPowerOf2_32(*MaxPowerOf2RuntimeVF)) &&
+           "MaxFixedVF must be a power of 2");
+    if (IsKnownModTripCountZero(*MaxPowerOf2RuntimeVF)) {
       // Accept MaxFixedVF if we do not have a tail.
       LLVM_DEBUG(dbgs() << "LV: No tail will remain for any chosen VF.\n");
       return MaxFactors;
     }
   }
 
+  if (MaxTC && MaxTC <= TTI.getMinTripCountTailFoldingThreshold()) {
+    if (MaxPowerOf2RuntimeVF && *MaxPowerOf2RuntimeVF > 0) {
+      // If we have a low-trip-count, and the fixed-width VF is known to divide
+      // the trip count but the scalable factor does not, use the fixed-width
+      // factor in preference to allow the generation of a non-predicated loop.
+      if (ScalarEpilogueStatus == CM_ScalarEpilogueNotAllowedLowTripLoop &&
+          IsKnownModTripCountZero(MaxFactors.FixedVF.getFixedValue())) {
+        LLVM_DEBUG(dbgs() << "LV: Picking a fixed-width so that no tail will "
+                             "remain for any chosen VF.\n");
+        MaxFactors.ScalableVF = ElementCount::getScalable(0);
+        return MaxFactors;
+      }
+    }
+
+    reportVectorizationFailure(
+        "The trip count is below the minial threshold value.",
+        "loop trip count is too low, avoiding vectorization", "LowTripCount",
+        ORE, TheLoop);
+    return FixedScalableVFPair::getNone();
+  }
+
   // If we don't know the precise trip count, or if the trip count that we
   // found modulo the vectorization factor is not zero, try to fold the tail
   // by masking.
@@ -10606,26 +10630,15 @@ bool LoopVectorizePass::processLoop(Loop *L) {
     if (Hints.getForce() == LoopVectorizeHints::FK_Enabled)
       LLVM_DEBUG(dbgs() << " But vectorizing was explicitly forced.\n");
     else {
-      if (*ExpectedTC > TTI->getMinTripCountTailFoldingThreshold()) {
-        LLVM_DEBUG(dbgs() << "\n");
-        // Predicate tail-folded loops are efficient even when the loop
-        // iteration count is low. However, setting the epilogue policy to
-        // `CM_ScalarEpilogueNotAllowedLowTripLoop` prevents vectorizing loops
-        // with runtime checks. It's more effective to let
-        // `isOutsideLoopWorkProfitable` determine if vectorization is
-        // beneficial for the loop.
-        if (SEL != CM_ScalarEpilogueNotNeededUsePredicate)
-          SEL = CM_ScalarEpilogueNotAllowedLowTripLoop;
-      } else {
-        LLVM_DEBUG(dbgs() << " But the target considers the trip count too "
-                             "small to consider vectorizing.\n");
-        reportVectorizationFailure(
-            "The trip count is below the minial threshold value.",
-            "loop trip count is too low, avoiding vectorization",
-            "LowTripCount", ORE, L);
-        Hints.emitRemarkWithHints();
-        return false;
-      }
+      LLVM_DEBUG(dbgs() << "\n");
+      // Predicate tail-folded loops are efficient even when the loop
+      // iteration count is low. However, setting the epilogue policy to
+      // `CM_ScalarEpilogueNotAllowedLowTripLoop` prevents vectorizing loops
+      // with runtime checks. It's more effective to let
+      // `isOutsideLoopWorkProfitable` determine if vectorization is
+      // beneficial for the loop.
+      if (SEL != CM_ScalarEpilogueNotNeededUsePredicate)
+        SEL = CM_ScalarEpilogueNotAllowedLowTripLoop;
     }
   }
 
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll b/llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll
index 9b7e41aa98db6..2ffac8b12b887 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll
@@ -1,8 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
 ; REQUIRES: asserts
-; RUN: opt -S < %s -p loop-vectorize -debug-only=loop-vectorize -mattr=+sve 2>%t | FileCheck %s --check-prefixes=CHECK,CHECK-VS1
+; RUN: opt -S < %s -p "loop-vectorize,simplifycfg" -debug-only=loop-vectorize -mattr=+sve 2>%t | FileCheck %s --check-prefixes=CHECK,CHECK-VS1
 ; RUN: cat %t | FileCheck %s --check-prefixes=DEBUG,DEBUG-VS1
-; RUN: opt -S < %s -p loop-vectorize -debug-only=loop-vectorize -mcpu=neoverse-v1 -sve-tail-folding=disabled 2>%t | FileCheck %s --check-prefixes=CHECK,CHECK-VS2
+; RUN: opt -S < %s -p "loop-vectorize,simplifycfg" -debug-only=loop-vectorize -mcpu=neoverse-v1 -sve-tail-folding=disabled 2>%t | FileCheck %s --check-prefixes=CHECK,CHECK-VS2
 ; RUN: cat %t | FileCheck %s --check-prefixes=DEBUG,DEBUG-VS2
 
 target triple = "aarch64-unknown-linux-gnu"
@@ -18,7 +18,7 @@ target triple = "aarch64-unknown-linux-gnu"
 
 ; DEBUG-LABEL: LV: Checking a loop in 'trip_count_too_small'
 ; DEBUG: LV: Found a loop with a very small trip count. This loop is worth vectorizing only if no scalar iteration overheads are incurred.
-; DEBUG: LV: Not vectorizing: The trip count is below the minial threshold value..
+; DEBUG: LV: Not vectorizing: Runtime SCEV check is required with -Os/-Oz.
 
 ; DEBUG-LABEL: LV: Checking a loop in 'too_many_runtime_checks'
 ; DEBUG: LV: Found trip count: 0
@@ -91,7 +91,7 @@ define void @low_vf_ic_is_better(ptr nocapture noundef %p, i32 %tc, i16 noundef
 ; CHECK-VS1-NEXT:    br i1 [[TMP25]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
 ; CHECK-VS1:       [[MIDDLE_BLOCK]]:
 ; CHECK-VS1-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[TMP3]], [[N_VEC]]
-; CHECK-VS1-NEXT:    br i1 [[CMP_N]], label %[[WHILE_END_LOOPEXIT:.*]], label %[[VEC_EPILOG_ITER_CHECK:.*]]
+; CHECK-VS1-NEXT:    br i1 [[CMP_N]], label %[[WHILE_END]], label %[[VEC_EPILOG_ITER_CHECK:.*]]
 ; CHECK-VS1:       [[VEC_EPILOG_ITER_CHECK]]:
 ; CHECK-VS1-NEXT:    [[IND_END4:%.*]] = add i64 [[TMP0]], [[N_VEC]]
 ; CHECK-VS1-NEXT:    [[N_VEC_REMAINING:%.*]] = sub i64 [[TMP3]], [[N_VEC]]
@@ -125,7 +125,7 @@ define void @low_vf_ic_is_better(ptr nocapture noundef %p, i32 %tc, i16 noundef
 ; CHECK-VS1-NEXT:    br i1 [[TMP36]], label %[[VEC_EPILOG_MIDDLE_BLOCK:.*]], label %[[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
 ; CHECK-VS1:       [[VEC_EPILOG_MIDDLE_BLOCK]]:
 ; CHECK-VS1-NEXT:    [[CMP_N10:%.*]] = icmp eq i64 [[TMP3]], [[N_VEC3]]
-; CHECK-VS1-NEXT:    br i1 [[CMP_N10]], label %[[WHILE_END_LOOPEXIT]], label %[[VEC_EPILOG_SCALAR_PH]]
+; CHECK-VS1-NEXT:    br i1 [[CMP_N10]], label %[[WHILE_END]], label %[[VEC_EPILOG_SCALAR_PH]]
 ; CHECK-VS1:       [[VEC_EPILOG_SCALAR_PH]]:
 ; CHECK-VS1-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[TMP39]], %[[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[IND_END4]], %[[VEC_EPILOG_ITER_CHECK]] ], [ [[TMP0]], %[[VECTOR_SCEVCHECK]] ], [ [[TMP0]], %[[ITER_CHECK]] ]
 ; CHECK-VS1-NEXT:    br label %[[WHILE_BODY:.*]]
@@ -138,9 +138,7 @@ define void @low_vf_ic_is_better(ptr nocapture noundef %p, i32 %tc, i16 noundef
 ; CHECK-VS1-NEXT:    store i8 [[ADD]], ptr [[ARRAYIDX]], align 1
 ; CHECK-VS1-NEXT:    [[TMP38:%.*]] = and i64 [[IV_NEXT]], 4294967295
 ; CHECK-VS1-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[TMP38]], 19
-; CHECK-VS1-NEXT:    br i1 [[EXITCOND_NOT]], label %[[WHILE_END_LOOPEXIT]], label %[[WHILE_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
-; CHECK-VS1:       [[WHILE_END_LOOPEXIT]]:
-; CHECK-VS1-NEXT:    br label %[[WHILE_END]]
+; CHECK-VS1-NEXT:    br i1 [[EXITCOND_NOT]], label %[[WHILE_END]], label %[[WHILE_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
 ; CHECK-VS1:       [[WHILE_END]]:
 ; CHECK-VS1-NEXT:    ret void
 ;
@@ -199,7 +197,7 @@ define void @low_vf_ic_is_better(ptr nocapture noundef %p, i32 %tc, i16 noundef
 ; CHECK-VS2-NEXT:    br i1 [[TMP25]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
 ; CHECK-VS2:       [[MIDDLE_BLOCK]]:
 ; CHECK-VS2-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[TMP3]], [[N_VEC]]
-; CHECK-VS2-NEXT:    br i1 [[CMP_N]], label %[[WHILE_END_LOOPEXIT:.*]], label %[[VEC_EPILOG_ITER_CHECK:.*]]
+; CHECK-VS2-NEXT:    br i1 [[CMP_N]], label %[[WHILE_END]], label %[[VEC_EPILOG_ITER_CHECK:.*]]
 ; CHECK-VS2:       [[VEC_EPILOG_ITER_CHECK]]:
 ; CHECK-VS2-NEXT:    [[IND_END4:%.*]] = add i64 [[TMP0]], [[N_VEC]]
 ; CHECK-VS2-NEXT:    [[N_VEC_REMAINING:%.*]] = sub i64 [[TMP3]], [[N_VEC]]
@@ -233,7 +231,7 @@ define void @low_vf_ic_is_better(ptr nocapture noundef %p, i32 %tc, i16 noundef
 ; CHECK-VS2-NEXT:    br i1 [[TMP36]], label %[[VEC_EPILOG_MIDDLE_BLOCK:.*]], label %[[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
 ; CHECK-VS2:       [[VEC_EPILOG_MIDDLE_BLOCK]]:
 ; CHECK-VS2-NEXT:    [[CMP_N10:%.*]] = icmp eq i64 [[TMP3]], [[N_VEC3]]
-; CHECK-VS2-NEXT:    br i1 [[CMP_N10]], label %[[WHILE_END_LOOPEXIT]], label %[[VEC_EPILOG_SCALAR_PH]]
+; CHECK-VS2-NEXT:    br i1 [[CMP_N10]], label %[[WHILE_END]], label %[[VEC_EPILOG_SCALAR_PH]]
 ; CHECK-VS2:       [[VEC_EPILOG_SCALAR_PH]]:
 ; CHECK-VS2-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[TMP39]], %[[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[IND_END4]], %[[VEC_EPILOG_ITER_CHECK]] ], [ [[TMP0]], %[[VECTOR_SCEVCHECK]] ], [ [[TMP0]], %[[ITER_CHECK]] ]
 ; CHECK-VS2-NEXT:    br label %[[WHILE_BODY:.*]]
@@ -246,9 +244,7 @@ define void @low_vf_ic_is_better(ptr nocapture noundef %p, i32 %tc, i16 noundef
 ; CHECK-VS2-NEXT:    store i8 [[ADD]], ptr [[ARRAYIDX]], align 1
 ; CHECK-VS2-NEXT:    [[TMP38:%.*]] = and i64 [[IV_NEXT]], 4294967295
 ; CHECK-VS2-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[TMP38]], 19
-; CHECK-VS2-NEXT:    br i1 [[EXITCOND_NOT]], label %[[WHILE_END_LOOPEXIT]], label %[[WHILE_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
-; CHECK-VS2:       [[WHILE_END_LOOPEXIT]]:
-; CHECK-VS2-NEXT:    br label %[[WHILE_END]]
+; CHECK-VS2-NEXT:    br i1 [[EXITCOND_NOT]], label %[[WHILE_END]], label %[[WHILE_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
 ; CHECK-VS2:       [[WHILE_END]]:
 ; CHECK-VS2-NEXT:    ret void
 ;
@@ -297,9 +293,7 @@ define void @trip_count_too_small(ptr nocapture noundef %p, i32 noundef %tc, i16
 ; CHECK-NEXT:    store i8 [[ADD]], ptr [[ARRAYIDX]], align 1
 ; CHECK-NEXT:    [[TMP44:%.*]] = and i64 [[INDVARS_IV_NEXT]], 4294967295
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[TMP44]], 3
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[WHILE_END_LOOPEXIT:.*]], label %[[WHILE_BODY]]
-; CHECK:       [[WHILE_END_LOOPEXIT]]:
-; CHECK-NEXT:    br label %[[WHILE_END]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[WHILE_END]], label %[[WHILE_BODY]]
 ; CHECK:       [[WHILE_END]]:
 ; CHECK-NEXT:    ret void
 ;
@@ -357,9 +351,7 @@ define void @too_many_runtime_checks(ptr nocapture noundef %p, ptr nocapture nou
 ; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
 ; CHECK-NEXT:    [[TMP64:%.*]] = and i64 [[INDVARS_IV_NEXT]], 4294967295
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[TMP64]], 16
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[WHILE_END_LOOPEXIT:.*]], label %[[WHILE_BODY]]
-; CHECK:       [[WHILE_END_LOOPEXIT]]:
-; CHECK-NEXT:    br label %[[WHILE_END]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[WHILE_END]], label %[[WHILE_BODY]]
 ; CHECK:       [[WHILE_END]]:
 ; CHECK-NEXT:    ret void
 ;
@@ -410,8 +402,6 @@ define void @overflow_indvar_known_false(ptr nocapture noundef %p, i32 noundef %
 ; CHECK-NEXT:    [[TMP19:%.*]] = add i32 [[TC]], 1
 ; CHECK-NEXT:    [[TMP20:%.*]] = zext i32 [[TMP19]] to i64
 ; CHECK-NEXT:    [[TMP1:%.*]] = sub i64 1028, [[TMP20]]
-; CHECK-NEXT:    br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_SCEVCHECK:.*]]
-; CHECK:       [[VECTOR_SCEVCHECK]]:
 ; CHECK-NEXT:    [[TMP21:%.*]] = add i32 [[TC]], 1
 ; CHECK-NEXT:    [[TMP22:%.*]] = zext i32 [[TMP21]] to i64
 ; CHECK-NEXT:    [[TMP23:%.*]] = sub i64 1027, [[TMP22]]
@@ -420,7 +410,7 @@ define void @overflow_indvar_known_false(ptr nocapture noundef %p, i32 noundef %
 ; CHECK-NEXT:    [[TMP26:%.*]] = icmp ult i32 [[TMP25]], [[TMP21]]
 ; CHECK-NEXT:    [[TMP27:%.*]] = icmp ugt i64 [[TMP23]], 4294967295
 ; CHECK-NEXT:    [[TMP28:%.*]] = or i1 [[TMP26]], [[TMP27]]
-; CHECK-NEXT:    br i1 [[TMP28]], label %[[SCALAR_PH]], label %[[VECTOR_PH:.*]]
+; CHECK-NEXT:    br i1 [[TMP28]], label %[[WHILE_BODY:.*]], label %[[VECTOR_PH:.*]]
 ; CHECK:       [[VECTOR_PH]]:
 ; CHECK-NEXT:    [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-NEXT:    [[TMP3:%.*]] = mul i64 [[TMP2]], 16
@@ -449,14 +439,9 @@ define void @overflow_indvar_known_false(ptr nocapture noundef %p, i32 noundef %
 ; CHECK-NEXT:    [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[INDEX_NEXT]], i64 [[TMP1]])
 ; CHECK-NEXT:    [[TMP16:%.*]] = xor <vscale x 16 x i1> [[ACTIVE_LANE_MASK_NEXT]], splat (i1 true)
 ; CHECK-NEXT:    [[TMP17:%.*]] = extractelement <vscale x 16 x i1> [[TMP16]], i32 0
-; CHECK-NEXT:    br i1 [[TMP17]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
-; CHECK:       [[MIDDLE_BLOCK]]:
-; CHECK-NEXT:    br i1 true, label %[[WHILE_END_LOOPEXIT:.*]], label %[[SCALAR_PH]]
-; CHECK:       [[SCALAR_PH]]:
-; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[IND_END]], %[[MIDDLE_BLOCK]] ], [ [[TMP0]], %[[WHILE_PREHEADER]] ], [ [[TMP0]], %[[VECTOR_SCEVCHECK]] ]
-; CHECK-NEXT:    br label %[[WHILE_BODY:.*]]
+; CHECK-NEXT:    br i1 [[TMP17]], label %[[WHILE_END]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
 ; CHECK:       [[WHILE_BODY]]:
-; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[WHILE_BODY]] ]
+; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], %[[WHILE_BODY]] ], [ [[TMP0]], %[[WHILE_PREHEADER]] ]
 ; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
 ; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds nuw i8, ptr [[V]], i64 [[INDVARS_IV]]
 ; CHECK-NEXT:    [[TMP18:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
@@ -464,9 +449,7 @@ define void @overflow_indvar_known_false(ptr nocapture noundef %p, i32 noundef %
 ; CHECK-NEXT:    store i8 [[ADD]], ptr [[ARRAYIDX]], align 1
 ; CHECK-NEXT:    [[TMP29:%.*]] = and i64 [[INDVARS_IV_NEXT]], 4294967295
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[TMP29]], 1027
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[WHILE_END_LOOPEXIT]], label %[[WHILE_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
-; CHECK:       [[WHILE_END_LOOPEXIT]]:
-; CHECK-NEXT:    br label %[[WHILE_END]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[WHILE_END]], label %[[WHILE_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
 ; CHECK:       [[WHILE_END]]:
 ; CHECK-NEXT:    ret void
 ;
@@ -495,6 +478,235 @@ while.end:
   ret void
 }
 
+; This has a trip-count of 4, and should vectorize with vf==4.
+define i32 @tc4(ptr noundef readonly captures(none) %tmp) vscale_range(1,16) {
+; CHECK-LABEL: define i32 @tc4(
+; CHECK-SAME: ptr noundef readonly captures(none) [[TMP:%.*]]) #[[ATTR1]] {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[ARRAYIDX2:%.*]] = getelementptr inbounds nuw i8, ptr [[TMP]], i64 16
+; CHECK-NEXT:    [[ARRAYIDX11:%.*]] = getelementptr inbounds nuw i8, ptr [[TMP]], i64 32
+; CHECK-NEXT:    [[ARRAYIDX14:%.*]] = getelementptr inbounds nuw i8, ptr [[TMP]], i64 48
+; CHECK-NEXT:    [[ARRAYIDX30:%.*]] = getelementptr inbounds nuw i8, ptr [[TMP]], i64 64
+; CHECK-NEXT:    [[ARRAYIDX33:%.*]] = getelementptr inbounds nuw i8, ptr [[TMP]], i64 80
+; CHECK-NEXT:    [[ARRAYIDX46:%.*]] = getelementptr inbounds nuw i8, ptr [[TMP]], i64 96
+; CHECK-NEXT:    [[ARRAYIDX49:%.*]] = getelementptr inbounds nuw i8, ptr [[TMP]], i64 112
+; CHECK-NEXT:    [[TMP0:%.*]] = add i64 0, 0
+; CHECK-NEXT:    [[TMP1:%.*]] = getelementptr inbounds nuw [4 x i32], ptr [[TMP]], i64 0, i64 [[TMP0]]
+; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr inbounds nuw i32, ptr [[TMP1]], i32 0
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[TMP2]], align 4
+; CHECK-NEXT:    [[TMP3:%.*]] = getelementptr inbounds nuw [4 x i32], ptr [[ARRAYIDX2]], i64 0, i64 [[TMP0]]
+; CHECK-NEXT:    [[TMP4:%.*]] = getelementptr inbounds nuw i32, ptr [[TMP3]], i32 0
+; CHECK-NEXT:    [[WIDE_LOAD1:%.*]] = load <4 x i32>, ptr [[TMP4]], align 4
+; CHECK-NEXT:    [[TMP5:%.*]] = add <4 x i32> [[WIDE_LOAD1]], [[WIDE_LOAD]]
+; CHECK-NEXT:    [[TMP6:%.*]] = sub <4 x i32> [[WIDE_LOAD]], [[WIDE_LOAD1]]
+; CHECK-NEXT:    [[TMP7:%.*]] = getelementptr inbounds nuw [4 x i32], ptr [[ARRAYIDX11]], i64 0, i64 [[TMP0]]
+; CHECK-NEXT:    [[TMP8:%.*]] = getelementptr inbounds nuw i32, ptr [[TMP7]], i32 0
+; CHECK-NEXT:    [[WIDE_LOAD2:%.*]] = load <4 x i32>, ptr [[TMP8]], align 4
+; CHECK-NEXT:    [[TMP9:%.*]] = getelementptr inbounds nuw [4 x i32], ptr [[ARRAYIDX14]], i64 0, i64 [[TMP0]]
+; CHECK-NEXT:    [[TMP10:%.*]] = getelementptr inbounds nuw i32, ptr [[TMP9]], i32 0
+; CHECK-NEXT:    [[WIDE_LOAD3:%.*]] = load <4 x i32>, ptr [[TMP10]], align 4
+; CHECK-NEXT:    [[TMP11:%.*]] = add <4 x i32> [[WIDE_LOAD3]], [[WIDE_LOAD2]]
+; CHECK-NEXT:    [[TMP12:%.*]] = sub <4 x i32> [[WIDE_LOAD2]], [[WIDE_LOAD3]]
+; CHECK-NEXT:    [[TMP13:%.*]] = add <4 x i32> [[TMP11]], [[TMP5]]
+; CHECK-NEXT:    [[TMP14:%.*]] = sub <4 x i32> [[TMP5]], [[TMP11]]
+; CHECK-NEXT:    [[TMP15:%.*]] = add <4 x i32> [[TMP12]], [[TMP6]]
+; CHECK-NEXT:    [[TMP16:%.*]] = sub <4 x i32> [[TMP6]], [[TMP12]]
+; CHECK-NEXT:    [[TMP17:%.*]] = getelementptr inbounds nuw [4 x i32], ptr [[ARRAYIDX30]], i64 0, i64 [[TMP0]]
+; CHECK-NEXT:    [[TMP18:%.*]] = getelementptr inbounds nuw i32, ptr [[TMP17]], i32 0
+; CHECK-NEXT:    [[WIDE_LOAD4:%.*]] = load <4 x i32>, ptr [[TMP18]], align 4
+; CHECK-NEXT:    [[TMP19:%.*]] = getelementptr inbounds nuw [4 x i32], ptr [[ARRAYIDX33]], i64 0, i64 [[TMP0]]
+; CHECK-NEXT:    [[TMP20:%.*]] = getelementptr inbounds nuw i32, ptr [[TMP19]], i32 0
+; CHECK-NEXT:    [[WIDE_LOAD5:%.*]] = load <4 x i32>, ptr [[TMP20]], align 4
+; CHECK-NEXT:    [[TMP21:%.*]] = add <4 x i32> [[WIDE_LOAD5]], [[WIDE_LOAD4]]
+; CHECK-NEXT:    [[TMP22:%.*]] = sub <4 x i32> [[WIDE_LOAD4]], [[WIDE_LOAD5]]
+; CHECK-NEXT:    [[TMP23:%.*]] = getelementptr inbounds nuw [4 x i32], ptr [[ARRAYIDX46]], i64 0, i64 [[TMP0]]
+; CHECK-NEXT:    [[TMP24:%.*]] = getelementptr inbounds nuw i32, ptr [[TMP23]], i32 0
+; CHECK-NEXT:    [[WIDE_LOAD6:%.*]] = load <4 x i32>, ptr [[TMP24]], align 4
+; CHECK-NEXT:    [[TMP25:%.*]] = getelementptr inbounds nuw [4 x i32], ptr [[ARRAYIDX49]], i64 0, i64 [[TMP0]]
+; CHECK-NEXT:    [[TMP26:%.*]] = getelementptr inbounds nuw i32, ptr [[TMP25]], i32 0
+; CHECK-NEXT:    [[WIDE_LOAD7:%.*]] = load <4 x i32>, ptr [[TMP26]], align 4
+; CHECK-NEXT:    [[TMP27:%.*]] = add <4 x i32> [[WIDE_LOAD7]], [[WIDE_LOAD6]]
+; CHECK-NEXT:    [[TMP28:%.*]] = sub <4 x i32> [[WIDE_LOAD6]], [[WIDE_LOAD7]]
+; CHECK-NEXT:    [[TMP29:%.*]] = add <4 x i32> [[TMP27]], [[TMP21]]
+; CHECK-NEXT:    [[TMP30:%.*]] = sub <4 x i32> [[TMP21]], [[TMP27]]
+; CHECK-NEXT:    [[TMP31:%.*]] = add <4 x i32> [[TMP28]], [[TMP22]]
+; CHECK-NEXT:    [[TMP32:%.*]] = sub <4 x i32> [[TMP22]], [[TMP28]]
+; CHECK-NEXT:    [[TMP33:%.*]] = add <4 x i32> [[TMP29]], [[TMP13]]
+; CHECK-NEXT:    [[TMP34:%.*]] = lshr <4 x i32> [[TMP33]], splat (i32 15)
+; CHECK-NE...
[truncated]

david-arm

Looks like a sensible improvement! I had a few comments ...

david-arm · 2025-03-20T14:20:07Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+        return MaxFactors;
+      }
+    }
+


Can you re-add the debug output that we had before, i.e. something like:

LLVM_DEBUG(dbgs() << "The target considers the trip count too " "small to consider vectorizing with tail-folding.\n");

reportVectorizationFailure will print the message it is reporting to dbgs() too. It didn't seem necessary to print the same info twice.

It will print

LV: Not vectorizing: The trip count is below the minial threshold value.

OK yeah fair enough, but I think we're still missing the original Hints.emitRemarkWithHints(); call? I think LoopVectorizationCostModel does have access to a Hints pointer.

It looks like it gets emitted in one of the parent functions. None of the other returns from this function use
emitRemarkWithHints, which was the reason I didn't more it over.

reportVectorizationFailure will emit the -Rpass=analysis output. The -Rpass=missed remark will be emitted by the VF now being scalar

remark: the cost-model indicates that vectorization is not beneficial [-Rpass-missed=loop-vectorize]

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

david-arm · 2025-03-20T14:36:09Z

llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll

@@ -1,8 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
 ; REQUIRES: asserts
-; RUN: opt -S < %s -p loop-vectorize -debug-only=loop-vectorize -mattr=+sve 2>%t | FileCheck %s --check-prefixes=CHECK,CHECK-VS1
+; RUN: opt -S < %s -p "loop-vectorize,simplifycfg" -debug-only=loop-vectorize -mattr=+sve 2>%t | FileCheck %s --check-prefixes=CHECK,CHECK-VS1


Do you actually need the simplifycfg pass here? If the trip count is 4 and the VF is 4, I'd hope that during vplan construction we optimise the branch in the latch to always exit the loop. Perhaps that might be sufficient to show it working as expected?

Same question here - is the simplifycfg really needed to show the desired effect? I tried removing "simplifycfg" and the output is:

vector.body: ; preds = %vector.body, %vector.ph %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ] %vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %2, %vector.body ] %0 = getelementptr inbounds nuw [4 x i32], ptr %tmp, i64 0, i64 %index %1 = getelementptr inbounds nuw i32, ptr %0, i32 0 %wide.load = load <4 x i32>, ptr %1, align 4 %2 = add <4 x i32> %vec.phi, %wide.load %index.next = add nuw i64 %index, 4 br i1 true, label %middle.block, label %vector.body, !llvm.loop !7 middle.block: ; preds = %vector.body %3 = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %2) br i1 true, label %for.cond.cleanup, label %scalar.ph scalar.ph: ; preds = %entry, %middle.block %bc.resume.val = phi i64 [ 4, %middle.block ], [ 0, %entry ] %bc.merge.rdx = phi i32 [ %3, %middle.block ], [ 0, %entry ] br label %for.body for.cond.cleanup: ; preds = %middle.block, %for.body %add.lcssa = phi i32 [ %add, %for.body ], [ %3, %middle.block ] ret i32 %add.lcssa

which is sufficient to show it working as expected, i.e. the vector.body latch unconditionally jumps to middle.block, which unconditionally jumps to for.cond.cleanup. Generally for loop-vectorise tests it's preferred to only test the loop-vectorize pass because simplifycfg may hide latent issues.

It was added to clean up the test output and make it clearer what was changing / important. It shows how much better the code can be if it gets unrolled. Considering this is a heuristic change and not a correctness change that sounded like the better trade-off. It is similar in concept to filter-out-after from #129739 but it doesn't remove any of the output code, just shows how it simplifies after constant branches are folded away.

My understanding of patches over the last little while, these might be optimizations that vplan learns to make itself eventually, and the simplify-cfg will not be useful. It will itself remove the unnecessary branches and the tests will be simpler again. I don't have a strong opinion in the meantime so am happy to remove it.

Thanks, I know that @fhahn has a PR in progress for removing loop regions. In my opinion, one of the benefits of testing purely the loop vectoriser output here is that when such vplan changes happen it should be more obvious what effect that has on these tests. But perhaps this is just personal preference!

llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

…ation. NFC This also adds simplifycfg to make it clear what it the result of vectorizing the loops in the test.

…ter. This moves the checks of MinTripCountTailFoldingThreshold later, during the calculation of whether to tail fold. This allows it to check beforehand whether tail predication is required, either for scalable or fixed-width vectors. This option is only specified for AArch64, where it returns the minimum of 5. This patch aims to allow the vectorization of TC=4 loops, preventing them from performing slower when SVE is present.

fhahn · 2025-03-26T11:27:05Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

-           "MaxFixedVF must be a power of 2");
-    unsigned MaxVFtimesIC =
-        UserIC ? *MaxPowerOf2RuntimeVF * UserIC : *MaxPowerOf2RuntimeVF;
+  auto ScalarEpilogueNeeded = [this, &UserIC](unsigned MaxVF) {


The function returns true if no scalar epilogue is needed, so the name should be adjusted?

Suggested change

auto ScalarEpilogueNeeded = [this, &UserIC](unsigned MaxVF) {

auto NoScalarEpilogueNeeded = [this, &UserIC](unsigned MaxVF) {

david-arm

LGTM!

fhahn

LG, just left a few suggestions for the tests for the future

llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll

Post commit cleanup from #132170

davemgreen requested review from fhahn, sdesmalen-arm, david-arm and ElvisWang123 March 20, 2025 09:12

llvmbot added vectorizers llvm:transforms labels Mar 20, 2025

david-arm reviewed Mar 20, 2025

View reviewed changes

fhahn reviewed Mar 20, 2025

View reviewed changes

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp Outdated Show resolved Hide resolved

davemgreen force-pushed the gh-lv-fixmintripcount branch from 178df78 to c3f5000 Compare March 24, 2025 08:06

davemgreen added 4 commits March 26, 2025 11:11

[AArch64][LoopVectorizer] Add a test for VF=4 low-trip-count vectoriz…

6fb8134

…ation. NFC This also adds simplifycfg to make it clear what it the result of vectorizing the loops in the test.

Address comments

9ad5ec3

Remove simplify-cfg

b06ca2e

davemgreen force-pushed the gh-lv-fixmintripcount branch from c3f5000 to b06ca2e Compare March 26, 2025 11:18

fhahn reviewed Mar 26, 2025

View reviewed changes

ScalarEpilogueNeeded -> NoScalarEpilogueNeeded

76bf30d

david-arm approved these changes Mar 26, 2025

View reviewed changes

davemgreen merged commit de1c2f2 into llvm:main Mar 26, 2025
11 checks passed

davemgreen deleted the gh-lv-fixmintripcount branch March 26, 2025 19:35

fhahn reviewed Mar 26, 2025

View reviewed changes

davemgreen added a commit that referenced this pull request Mar 28, 2025

[LV][AArch64] Test cleanup of low_trip_count_predicates.ll. NFC

70f083f

Post commit cleanup from #132170

	auto ScalarEpilogueNeeded = [this, &UserIC](unsigned MaxVF) {
	auto NoScalarEpilogueNeeded = [this, &UserIC](unsigned MaxVF) {

[LoopVectorizer][AArch64] Move getMinTripCountTailFoldingThreshold later. #132170

[LoopVectorizer][AArch64] Move getMinTripCountTailFoldingThreshold later. #132170

Uh oh!

Conversation

davemgreen commented Mar 20, 2025

Uh oh!

llvmbot commented Mar 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

david-arm left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

david-arm left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

fhahn left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

llvmbot commented Mar 20, 2025 •

edited

Loading