[ARM] Disable UpperBound loop unrolling for MVE tail predicated loops. #69709
For MVE tail-predicated loops, better code can be generated by keeping the loop whole than by unrolling it to an upper bound. Upper-bound unrolling requires expanding the active lane masks, and it can be difficult to generate good code for the expanded masks. This patch disables UpperBound unrolling when we find an active_lane_mask in the loop.
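To make the cost of "expanding" a lane mask concrete, here is a minimal standalone sketch of what `llvm.get.active.lane.mask(base, n)` computes per lane, following the LangRef definition (lane `i` is active iff `base + i < n`, unsigned, with no overflow in the addition). This is an illustrative model only and not LLVM code; the helper names `activeLaneMask` and `countActive` are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Illustrative model only (hypothetical helper, not part of LLVM).
// Lane i of the result is active iff (base + i) < n, compared as unsigned
// integers. The sum is widened to 64 bits so that base + i cannot wrap.
template <std::size_t Lanes>
std::array<bool, Lanes> activeLaneMask(std::uint32_t base, std::uint32_t n) {
  std::array<bool, Lanes> mask{};
  for (std::size_t i = 0; i < Lanes; ++i) {
    mask[i] = static_cast<std::uint64_t>(base) + i < n;
  }
  return mask;
}

// Hypothetical helper: number of active lanes in a mask.
template <std::size_t Lanes>
std::size_t countActive(const std::array<bool, Lanes> &mask) {
  std::size_t count = 0;
  for (bool lane : mask) {
    count += lane ? 1 : 0;
  }
  return count;
}
```

In the test added by this patch the element count is `blockSize & 15`, so an 8-lane loop runs at most twice. With 13 elements, for example, the first iteration (`base = 0`) has all 8 lanes active and the second (`base = 8`) has 5. If the loop is unrolled to its upper bound, each copy of the body needs its own mask of this kind materialised, whereas a tail-predicated hardware loop handles the partial final iteration implicitly.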
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-backend-arm
Author: David Green (davemgreen)
Changes
For MVE tail-predicated loops, better code can be generated by keeping the loop whole than by unrolling it to an upper bound. Upper-bound unrolling requires expanding the active lane masks, and it can be difficult to generate good code for the expanded masks. This patch disables UpperBound unrolling when we find an active_lane_mask in the loop.
Full diff: https://github.com/llvm/llvm-project/pull/69709.diff
2 Files Affected:
diff --git a/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp b/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
index e0d112c4a7eddb5..1dee7a3ccb6d8d9 100644
--- a/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
+++ b/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
@@ -2430,9 +2430,15 @@ ARMTTIImpl::getPreferredTailFoldingStyle(bool IVUpdateMayOverflow) const {
void ARMTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
TTI::UnrollingPreferences &UP,
OptimizationRemarkEmitter *ORE) {
- // Enable Upper bound unrolling universally, not dependant upon the conditions
- // below.
- UP.UpperBound = true;
+ // Enable Upper bound unrolling universally, providing that we do not see an
+ // active lane mask, which will be better kept as a loop to become tail
+ // predicated than to be conditionally unrolled.
+ UP.UpperBound =
+ !ST->hasMVEIntegerOps() || !any_of(*L->getHeader(), [](Instruction &I) {
+ return isa<IntrinsicInst>(I) &&
+ cast<IntrinsicInst>(I).getIntrinsicID() ==
+ Intrinsic::get_active_lane_mask;
+ });
// Only currently enable these preferences for M-Class cores.
if (!ST->isMClass())
diff --git a/llvm/test/Transforms/LoopUnroll/ARM/mve-upperbound.ll b/llvm/test/Transforms/LoopUnroll/ARM/mve-upperbound.ll
new file mode 100644
index 000000000000000..2bb6f05b91b1ab2
--- /dev/null
+++ b/llvm/test/Transforms/LoopUnroll/ARM/mve-upperbound.ll
@@ -0,0 +1,79 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt -passes=loop-unroll -S -mtriple thumbv8.1m.main-none-eabi -mattr=+mve %s | FileCheck %s
+
+; The vector loop here is better kept as a loop than conditionally unrolled,
+; letting it transform into a tail predicted loop.
+
+define void @unroll_upper(ptr noundef %pSrc, ptr nocapture noundef writeonly %pDst, i32 noundef %blockSize) {
+; CHECK-LABEL: @unroll_upper(
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[CMP_NOT23:%.*]] = icmp ult i32 [[BLOCKSIZE:%.*]], 16
+; CHECK-NEXT: [[AND:%.*]] = and i32 [[BLOCKSIZE]], 15
+; CHECK-NEXT: [[CMP6_NOT28:%.*]] = icmp eq i32 [[AND]], 0
+; CHECK-NEXT: br i1 [[CMP6_NOT28]], label [[WHILE_END12:%.*]], label [[VECTOR_MEMCHECK:%.*]]
+; CHECK: vector.memcheck:
+; CHECK-NEXT: [[SCEVGEP:%.*]] = getelementptr i8, ptr [[PDST:%.*]], i32 [[AND]]
+; CHECK-NEXT: [[TMP0:%.*]] = shl nuw nsw i32 [[AND]], 1
+; CHECK-NEXT: [[SCEVGEP32:%.*]] = getelementptr i8, ptr [[PSRC:%.*]], i32 [[TMP0]]
+; CHECK-NEXT: [[BOUND0:%.*]] = icmp ult ptr [[PDST]], [[SCEVGEP32]]
+; CHECK-NEXT: [[BOUND1:%.*]] = icmp ult ptr [[PSRC]], [[SCEVGEP]]
+; CHECK-NEXT: [[FOUND_CONFLICT:%.*]] = and i1 [[BOUND0]], [[BOUND1]]
+; CHECK-NEXT: [[N_RND_UP:%.*]] = add nuw nsw i32 [[AND]], 7
+; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], 24
+; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
+; CHECK: vector.body:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_MEMCHECK]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[PDST]], i32 [[INDEX]]
+; CHECK-NEXT: [[TMP1:%.*]] = shl i32 [[INDEX]], 1
+; CHECK-NEXT: [[NEXT_GEP37:%.*]] = getelementptr i8, ptr [[PSRC]], i32 [[TMP1]]
+; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32 [[INDEX]], i32 [[AND]])
+; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <8 x i16> @llvm.masked.load.v8i16.p0(ptr [[NEXT_GEP37]], i32 2, <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i16> poison)
+; CHECK-NEXT: [[TMP2:%.*]] = lshr <8 x i16> [[WIDE_MASKED_LOAD]], <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
+; CHECK-NEXT: [[TMP3:%.*]] = trunc <8 x i16> [[TMP2]] to <8 x i8>
+; CHECK-NEXT: call void @llvm.masked.store.v8i8.p0(<8 x i8> [[TMP3]], ptr [[NEXT_GEP]], i32 1, <8 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 8
+; CHECK-NEXT: [[TMP4:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT: br i1 [[TMP4]], label [[WHILE_END12_LOOPEXIT:%.*]], label [[VECTOR_BODY]]
+; CHECK: while.end12.loopexit:
+; CHECK-NEXT: br label [[WHILE_END12]]
+; CHECK: while.end12:
+; CHECK-NEXT: ret void
+;
+entry:
+ %cmp.not23 = icmp ult i32 %blockSize, 16
+ %and = and i32 %blockSize, 15
+ %cmp6.not28 = icmp eq i32 %and, 0
+ br i1 %cmp6.not28, label %while.end12, label %vector.memcheck
+
+vector.memcheck: ; preds = %entry
+ %scevgep = getelementptr i8, ptr %pDst, i32 %and
+ %0 = shl nuw nsw i32 %and, 1
+ %scevgep32 = getelementptr i8, ptr %pSrc, i32 %0
+ %bound0 = icmp ult ptr %pDst, %scevgep32
+ %bound1 = icmp ult ptr %pSrc, %scevgep
+ %found.conflict = and i1 %bound0, %bound1
+ %n.rnd.up = add nuw nsw i32 %and, 7
+ %n.vec = and i32 %n.rnd.up, 24
+ br label %vector.body
+
+vector.body: ; preds = %vector.body, %vector.memcheck
+ %index = phi i32 [ 0, %vector.memcheck ], [ %index.next, %vector.body ]
+ %next.gep = getelementptr i8, ptr %pDst, i32 %index
+ %1 = shl i32 %index, 1
+ %next.gep37 = getelementptr i8, ptr %pSrc, i32 %1
+ %active.lane.mask = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32 %index, i32 %and)
+ %wide.masked.load = call <8 x i16> @llvm.masked.load.v8i16.p0(ptr %next.gep37, i32 2, <8 x i1> %active.lane.mask, <8 x i16> poison)
+ %2 = lshr <8 x i16> %wide.masked.load, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
+ %3 = trunc <8 x i16> %2 to <8 x i8>
+ call void @llvm.masked.store.v8i8.p0(<8 x i8> %3, ptr %next.gep, i32 1, <8 x i1> %active.lane.mask)
+ %index.next = add i32 %index, 8
+ %4 = icmp eq i32 %index.next, %n.vec
+ br i1 %4, label %while.end12, label %vector.body
+
+while.end12: ; preds = %vector.body, %entry
+ ret void
+}
+
+declare <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32, i32)
+declare <8 x i16> @llvm.masked.load.v8i16.p0(ptr nocapture, i32 immarg, <8 x i1>, <8 x i16>)
+declare void @llvm.masked.store.v8i8.p0(<8 x i8>, ptr nocapture, i32 immarg, <8 x i1>)
Nice one. I wonder if a test to make sure your current test actually gets tail-predicated would be good, or if that's outside the scope of this patch.
Local branch amd-gfx d648e11 Revert "[AMDGPU] Try to fix the block prologs broken by RA inserted instructions (llvm#69924)" Remote branch main 75b3c3d [ARM] Disable UpperBound loop unrolling for MVE tail predicated loops. (llvm#69709) Change-Id: I5ed179024ddce969c97745bd3947ac42772629c0