
[ARM] Disable UpperBound loop unrolling for MVE tail predicated loops. #69709


Merged: 1 commit merged into llvm:main from gh-mve-upperbound on Oct 31, 2023

Conversation

davemgreen
Collaborator

For MVE tail predicated loops, better code can be generated by keeping the loop whole than by unrolling it to an upper bound, which requires the expansion of active lane masks that can be difficult to generate good code for. This patch disables UpperBound unrolling when we find an active_lane_mask in the loop.

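For context, the loops affected are small remainder loops whose trip count has a compile-time upper bound, which is what makes the unroller consider UpperBound (conditional) unrolling in the first place. A minimal source-level sketch of such a loop, based on the IR test added below (the function name and exact shape are an assumption for illustration, not taken from the patch):

// Hypothetical source resembling llvm/test/Transforms/LoopUnroll/ARM/mve-upperbound.ll:
// process the remaining (blockSize & 15) elements, narrowing the top byte of each
// 16-bit input into an 8-bit output. The trip count is at most 15, so the vectorized
// loop (VF = 8) runs at most twice and has a known upper bound.
void narrow_remainder(const unsigned short *pSrc, unsigned char *pDst,
                      unsigned blockSize) {
  for (unsigned i = 0; i < (blockSize & 15u); ++i)
    pDst[i] = (unsigned char)(pSrc[i] >> 8);
}

Keeping the vectorized form of this loop whole lets the ARM backend turn it into an MVE tail-predicated loop, whereas conditionally unrolling it by the upper bound leaves standalone active lane masks that, as the description notes, are difficult to generate good code for.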
@llvmbot
Member

llvmbot commented Oct 20, 2023

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-arm

Author: David Green (davemgreen)

Changes

For MVE tail predicated loops, better code can be generated by keeping the loop whole than by unrolling it to an upper bound, which requires the expansion of active lane masks that can be difficult to generate good code for. This patch disables UpperBound unrolling when we find an active_lane_mask in the loop.


Full diff: https://github.com/llvm/llvm-project/pull/69709.diff

2 Files Affected:

  • (modified) llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp (+9-3)
  • (added) llvm/test/Transforms/LoopUnroll/ARM/mve-upperbound.ll (+79)
diff --git a/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp b/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
index e0d112c4a7eddb5..1dee7a3ccb6d8d9 100644
--- a/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
+++ b/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
@@ -2430,9 +2430,15 @@ ARMTTIImpl::getPreferredTailFoldingStyle(bool IVUpdateMayOverflow) const {
 void ARMTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
                                          TTI::UnrollingPreferences &UP,
                                          OptimizationRemarkEmitter *ORE) {
-  // Enable Upper bound unrolling universally, not dependant upon the conditions
-  // below.
-  UP.UpperBound = true;
+  // Enable Upper bound unrolling universally, providing that we do not see an
+  // active lane mask, which will be better kept as a loop to become tail
+  // predicated than to be conditionally unrolled.
+  UP.UpperBound =
+      !ST->hasMVEIntegerOps() || !any_of(*L->getHeader(), [](Instruction &I) {
+        return isa<IntrinsicInst>(I) &&
+               cast<IntrinsicInst>(I).getIntrinsicID() ==
+                   Intrinsic::get_active_lane_mask;
+      });
 
   // Only currently enable these preferences for M-Class cores.
   if (!ST->isMClass())
diff --git a/llvm/test/Transforms/LoopUnroll/ARM/mve-upperbound.ll b/llvm/test/Transforms/LoopUnroll/ARM/mve-upperbound.ll
new file mode 100644
index 000000000000000..2bb6f05b91b1ab2
--- /dev/null
+++ b/llvm/test/Transforms/LoopUnroll/ARM/mve-upperbound.ll
@@ -0,0 +1,79 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt -passes=loop-unroll -S -mtriple thumbv8.1m.main-none-eabi -mattr=+mve %s | FileCheck %s
+
+; The vector loop here is better kept as a loop than conditionally unrolled,
+; letting it transform into a tail predicted loop.
+
+define void @unroll_upper(ptr noundef %pSrc, ptr nocapture noundef writeonly %pDst, i32 noundef %blockSize) {
+; CHECK-LABEL: @unroll_upper(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CMP_NOT23:%.*]] = icmp ult i32 [[BLOCKSIZE:%.*]], 16
+; CHECK-NEXT:    [[AND:%.*]] = and i32 [[BLOCKSIZE]], 15
+; CHECK-NEXT:    [[CMP6_NOT28:%.*]] = icmp eq i32 [[AND]], 0
+; CHECK-NEXT:    br i1 [[CMP6_NOT28]], label [[WHILE_END12:%.*]], label [[VECTOR_MEMCHECK:%.*]]
+; CHECK:       vector.memcheck:
+; CHECK-NEXT:    [[SCEVGEP:%.*]] = getelementptr i8, ptr [[PDST:%.*]], i32 [[AND]]
+; CHECK-NEXT:    [[TMP0:%.*]] = shl nuw nsw i32 [[AND]], 1
+; CHECK-NEXT:    [[SCEVGEP32:%.*]] = getelementptr i8, ptr [[PSRC:%.*]], i32 [[TMP0]]
+; CHECK-NEXT:    [[BOUND0:%.*]] = icmp ult ptr [[PDST]], [[SCEVGEP32]]
+; CHECK-NEXT:    [[BOUND1:%.*]] = icmp ult ptr [[PSRC]], [[SCEVGEP]]
+; CHECK-NEXT:    [[FOUND_CONFLICT:%.*]] = and i1 [[BOUND0]], [[BOUND1]]
+; CHECK-NEXT:    [[N_RND_UP:%.*]] = add nuw nsw i32 [[AND]], 7
+; CHECK-NEXT:    [[N_VEC:%.*]] = and i32 [[N_RND_UP]], 24
+; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+; CHECK:       vector.body:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_MEMCHECK]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[PDST]], i32 [[INDEX]]
+; CHECK-NEXT:    [[TMP1:%.*]] = shl i32 [[INDEX]], 1
+; CHECK-NEXT:    [[NEXT_GEP37:%.*]] = getelementptr i8, ptr [[PSRC]], i32 [[TMP1]]
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32 [[INDEX]], i32 [[AND]])
+; CHECK-NEXT:    [[WIDE_MASKED_LOAD:%.*]] = call <8 x i16> @llvm.masked.load.v8i16.p0(ptr [[NEXT_GEP37]], i32 2, <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i16> poison)
+; CHECK-NEXT:    [[TMP2:%.*]] = lshr <8 x i16> [[WIDE_MASKED_LOAD]], <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
+; CHECK-NEXT:    [[TMP3:%.*]] = trunc <8 x i16> [[TMP2]] to <8 x i8>
+; CHECK-NEXT:    call void @llvm.masked.store.v8i8.p0(<8 x i8> [[TMP3]], ptr [[NEXT_GEP]], i32 1, <8 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-NEXT:    [[INDEX_NEXT]] = add i32 [[INDEX]], 8
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP4]], label [[WHILE_END12_LOOPEXIT:%.*]], label [[VECTOR_BODY]]
+; CHECK:       while.end12.loopexit:
+; CHECK-NEXT:    br label [[WHILE_END12]]
+; CHECK:       while.end12:
+; CHECK-NEXT:    ret void
+;
+entry:
+  %cmp.not23 = icmp ult i32 %blockSize, 16
+  %and = and i32 %blockSize, 15
+  %cmp6.not28 = icmp eq i32 %and, 0
+  br i1 %cmp6.not28, label %while.end12, label %vector.memcheck
+
+vector.memcheck:                                  ; preds = %entry
+  %scevgep = getelementptr i8, ptr %pDst, i32 %and
+  %0 = shl nuw nsw i32 %and, 1
+  %scevgep32 = getelementptr i8, ptr %pSrc, i32 %0
+  %bound0 = icmp ult ptr %pDst, %scevgep32
+  %bound1 = icmp ult ptr %pSrc, %scevgep
+  %found.conflict = and i1 %bound0, %bound1
+  %n.rnd.up = add nuw nsw i32 %and, 7
+  %n.vec = and i32 %n.rnd.up, 24
+  br label %vector.body
+
+vector.body:                                      ; preds = %vector.body, %vector.memcheck
+  %index = phi i32 [ 0, %vector.memcheck ], [ %index.next, %vector.body ]
+  %next.gep = getelementptr i8, ptr %pDst, i32 %index
+  %1 = shl i32 %index, 1
+  %next.gep37 = getelementptr i8, ptr %pSrc, i32 %1
+  %active.lane.mask = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32 %index, i32 %and)
+  %wide.masked.load = call <8 x i16> @llvm.masked.load.v8i16.p0(ptr %next.gep37, i32 2, <8 x i1> %active.lane.mask, <8 x i16> poison)
+  %2 = lshr <8 x i16> %wide.masked.load, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
+  %3 = trunc <8 x i16> %2 to <8 x i8>
+  call void @llvm.masked.store.v8i8.p0(<8 x i8> %3, ptr %next.gep, i32 1, <8 x i1> %active.lane.mask)
+  %index.next = add i32 %index, 8
+  %4 = icmp eq i32 %index.next, %n.vec
+  br i1 %4, label %while.end12, label %vector.body
+
+while.end12:                                      ; preds = %vector.body, %entry
+  ret void
+}
+
+declare <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32, i32)
+declare <8 x i16> @llvm.masked.load.v8i16.p0(ptr nocapture, i32 immarg, <8 x i1>, <8 x i16>)
+declare void @llvm.masked.store.v8i8.p0(<8 x i8>, ptr nocapture, i32 immarg, <8 x i1>)

Collaborator

@SamTebbs33 SamTebbs33 left a comment


Nice one. I wonder if a test to make sure your current test actually gets tail-predicated would be good, or if that's outside the scope of this patch.

@davemgreen davemgreen merged commit 75b3c3d into llvm:main Oct 31, 2023
@davemgreen davemgreen deleted the gh-mve-upperbound branch October 31, 2023 09:51
Guzhu-AMD pushed a commit to GPUOpen-Drivers/llvm-project that referenced this pull request Nov 2, 2023
Local branch amd-gfx d648e11 Revert "[AMDGPU] Try to fix the block prologs broken by RA inserted instructions (llvm#69924)"
Remote branch main 75b3c3d [ARM] Disable UpperBound loop unrolling for MVE tail predicated loops. (llvm#69709)

Change-Id: I5ed179024ddce969c97745bd3947ac42772629c0