
AMDGPU: Reduce shl64 to shl32 if shift range is [63-32] #125574


Merged (12 commits) on Feb 13, 2025

Conversation

@LU-JOHN (Contributor) commented Feb 3, 2025

Reduce:

DST = shl i64 X, Y

where Y is known to be in the range [32, 63], to:

DST = [0, shl i32 X, (Y - 32)]   (low word, high word)

Alive2 analysis:

https://alive2.llvm.org/ce/z/w_u5je
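
For intuition, the equivalence can be sketched in plain C++ (illustrative only, not the code in this patch; the function name is made up). When the shift amount is in [32, 63], the low 32 bits of the result are always zero, and the high 32 bits are the low word of X shifted left by Y - 32:

    #include <cassert>
    #include <cstdint>

    // Sketch of the reduction: one 32-bit shift replaces the 64-bit shift.
    uint64_t shl64_reduced(uint64_t x, uint64_t y) {
      assert(y >= 32 && y <= 63); // precondition established via known bits
      uint32_t hi = static_cast<uint32_t>(x) << (y - 32);
      // Models the insertelement/bitcast pair in the patch: the result is
      // the two-element vector [0, hi] reinterpreted as an i64.
      return static_cast<uint64_t>(hi) << 32;
    }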

@LU-JOHN requested a review from nikic as a code owner on February 3, 2025 20:34
@LU-JOHN (Contributor, Author) commented Feb 3, 2025

This PR addresses: #63848

github-actions bot commented Feb 3, 2025

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot added backend:AMDGPU, llvm:instcombine, and llvm:transforms labels on Feb 3, 2025
@llvmbot (Member) commented Feb 3, 2025

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-amdgpu

Author: None (LU-JOHN)

Changes

Reduce:

DST = shl i64 X, Y

where Y is known to be in the range [32, 63], to:

DST = [0, shl i32 X, (Y - 32)]   (low word, high word)


Full diff: https://github.com/llvm/llvm-project/pull/125574.diff

6 Files Affected:

  • (modified) llvm/include/llvm/Transforms/InstCombine/InstCombiner.h (+2)
  • (modified) llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp (+30)
  • (modified) llvm/lib/Transforms/InstCombine/InstructionCombining.cpp (+9)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pown.ll (+9-8)
  • (added) llvm/test/Transforms/InstCombine/shl64-reduce.ll (+48)
diff --git a/llvm/include/llvm/Transforms/InstCombine/InstCombiner.h b/llvm/include/llvm/Transforms/InstCombine/InstCombiner.h
index fa6b60cba15aaf..dfd275b020ed75 100644
--- a/llvm/include/llvm/Transforms/InstCombine/InstCombiner.h
+++ b/llvm/include/llvm/Transforms/InstCombine/InstCombiner.h
@@ -521,6 +521,8 @@ class LLVM_LIBRARY_VISIBILITY InstCombiner {
                              bool AllowMultipleUsers = false) = 0;
 
   bool isValidAddrSpaceCast(unsigned FromAS, unsigned ToAS) const;
+
+  bool shouldReduceShl64ToShl32();
 };
 
 } // namespace llvm
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp b/llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
index 7ef95800975dba..3ced23671f11a8 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
@@ -1032,6 +1032,32 @@ static bool setShiftFlags(BinaryOperator &I, const SimplifyQuery &Q) {
   return Changed;
 }
 
+static Instruction *transformClampedShift64(BinaryOperator &I,
+                                            const SimplifyQuery &Q,
+                                            InstCombiner::BuilderTy &Builder) {
+  Value *Op0 = I.getOperand(0), *Op1 = I.getOperand(1);
+  Type *I32Type = Type::getInt32Ty(I.getContext());
+  Type *I64Type = Type::getInt64Ty(I.getContext());
+
+  if (I.getType() == I64Type) {
+    KnownBits KnownAmt = computeKnownBits(Op1, /* Depth */ 0, Q);
+    if (KnownAmt.getMinValue().uge(32)) {
+      Value *TruncVal = Builder.CreateTrunc(Op0, I32Type);
+      Value *TruncShiftAmt = Builder.CreateTrunc(Op1, I32Type);
+      Value *AdjustedShiftAmt =
+          Builder.CreateSub(TruncShiftAmt, ConstantInt::get(I32Type, 32));
+      Value *Shl32 = Builder.CreateShl(TruncVal, AdjustedShiftAmt);
+      Value *VResult =
+          Builder.CreateVectorSplat(2, ConstantInt::get(I32Type, 0));
+
+      VResult = Builder.CreateInsertElement(VResult, Shl32,
+                                            ConstantInt::get(I32Type, 1));
+      return CastInst::Create(Instruction::BitCast, VResult, I64Type);
+    }
+  }
+  return nullptr;
+}
+
 Instruction *InstCombinerImpl::visitShl(BinaryOperator &I) {
   const SimplifyQuery Q = SQ.getWithInstruction(&I);
 
@@ -1266,6 +1292,10 @@ Instruction *InstCombinerImpl::visitShl(BinaryOperator &I) {
     }
   }
 
+  if (this->shouldReduceShl64ToShl32())
+    if (Instruction *V = transformClampedShift64(I, Q, Builder))
+      return V;
+
   return nullptr;
 }
 
diff --git a/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp b/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
index 5621511570b581..d356741fcdf21e 100644
--- a/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
@@ -194,6 +194,15 @@ bool InstCombiner::isValidAddrSpaceCast(unsigned FromAS, unsigned ToAS) const {
   return TTIForTargetIntrinsicsOnly.isValidAddrSpaceCast(FromAS, ToAS);
 }
 
+bool InstCombiner::shouldReduceShl64ToShl32() {
+  InstructionCost costShl32 = TTIForTargetIntrinsicsOnly.getArithmeticInstrCost(
+      Instruction::Shl, Builder.getInt32Ty(), TTI::TCK_Latency);
+  InstructionCost costShl64 = TTIForTargetIntrinsicsOnly.getArithmeticInstrCost(
+      Instruction::Shl, Builder.getInt64Ty(), TTI::TCK_Latency);
+
+  return costShl32 < costShl64;
+}
+
 Value *InstCombinerImpl::EmitGEPOffset(GEPOperator *GEP, bool RewriteGEP) {
   if (!RewriteGEP)
     return llvm::emitGEPOffset(&Builder, DL, GEP);
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll
index ab2363860af9de..84ac4af6584677 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll
@@ -174,7 +174,7 @@ define double @test_pow_fast_f64__integral_y(double %x, i32 %y.i) {
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    s_swappc_b64 s[30:31], s[16:17]
 ; CHECK-NEXT:    v_lshlrev_b32_e32 v2, 31, v41
-; CHECK-NEXT:    v_and_b32_e32 v2, v2, v42
+; CHECK-NEXT:    v_and_b32_e32 v2, v42, v2
 ; CHECK-NEXT:    buffer_load_dword v42, off, s[0:3], s33 ; 4-byte Folded Reload
 ; CHECK-NEXT:    buffer_load_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Reload
 ; CHECK-NEXT:    buffer_load_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Reload
@@ -458,7 +458,7 @@ define double @test_pown_fast_f64(double %x, i32 %y) {
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    s_swappc_b64 s[30:31], s[16:17]
 ; CHECK-NEXT:    v_lshlrev_b32_e32 v2, 31, v41
-; CHECK-NEXT:    v_and_b32_e32 v2, v2, v42
+; CHECK-NEXT:    v_and_b32_e32 v2, v42, v2
 ; CHECK-NEXT:    buffer_load_dword v42, off, s[0:3], s33 ; 4-byte Folded Reload
 ; CHECK-NEXT:    buffer_load_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Reload
 ; CHECK-NEXT:    buffer_load_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Reload
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pown.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pown.ll
index f9c359bc114ed3..5155e42fef3cbf 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pown.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pown.ll
@@ -720,14 +720,15 @@ define double @test_pown_afn_nnan_ninf_f64(double %x, i32 %y) {
 ; CHECK-NEXT:    [[POWNI2F:%.*]] = sitofp i32 [[Y]] to double
 ; CHECK-NEXT:    [[__YLOGX:%.*]] = fmul nnan ninf afn double [[__LOG2]], [[POWNI2F]]
 ; CHECK-NEXT:    [[__EXP2:%.*]] = call nnan ninf afn double @_Z4exp2d(double [[__YLOGX]])
-; CHECK-NEXT:    [[__YTOU:%.*]] = zext i32 [[Y]] to i64
-; CHECK-NEXT:    [[__YEVEN:%.*]] = shl i64 [[__YTOU]], 63
-; CHECK-NEXT:    [[TMP0:%.*]] = bitcast double [[X]] to i64
-; CHECK-NEXT:    [[__POW_SIGN:%.*]] = and i64 [[__YEVEN]], [[TMP0]]
-; CHECK-NEXT:    [[TMP1:%.*]] = bitcast double [[__EXP2]] to i64
-; CHECK-NEXT:    [[TMP2:%.*]] = or i64 [[__POW_SIGN]], [[TMP1]]
-; CHECK-NEXT:    [[TMP3:%.*]] = bitcast i64 [[TMP2]] to double
-; CHECK-NEXT:    ret double [[TMP3]]
+; CHECK-NEXT:    [[TMP0:%.*]] = shl i32 [[Y]], 31
+; CHECK-NEXT:    [[TMP1:%.*]] = insertelement <2 x i32> <i32 0, i32 poison>, i32 [[TMP0]], i64 1
+; CHECK-NEXT:    [[__YEVEN:%.*]] = bitcast <2 x i32> [[TMP1]] to i64
+; CHECK-NEXT:    [[TMP2:%.*]] = bitcast double [[X]] to i64
+; CHECK-NEXT:    [[__POW_SIGN:%.*]] = and i64 [[TMP2]], [[__YEVEN]]
+; CHECK-NEXT:    [[TMP3:%.*]] = bitcast double [[__EXP2]] to i64
+; CHECK-NEXT:    [[TMP4:%.*]] = or i64 [[__POW_SIGN]], [[TMP3]]
+; CHECK-NEXT:    [[TMP5:%.*]] = bitcast i64 [[TMP4]] to double
+; CHECK-NEXT:    ret double [[TMP5]]
 ;
 entry:
   %call = tail call nnan ninf afn double @_Z4powndi(double %x, i32 %y)
diff --git a/llvm/test/Transforms/InstCombine/shl64-reduce.ll b/llvm/test/Transforms/InstCombine/shl64-reduce.ll
new file mode 100644
index 00000000000000..00f6c82fae9ad0
--- /dev/null
+++ b/llvm/test/Transforms/InstCombine/shl64-reduce.ll
@@ -0,0 +1,48 @@
+;; Test reduction of:
+;;
+;;   DST = shl i64 X, Y
+;;
+;; where Y is in the range [63-32] to:
+;;
+;;   DST = [shl i32 X, (Y - 32), 0]
+
+; RUN: opt < %s -passes=instcombine -S | FileCheck %s
+
+
+target triple = "amdgcn-amd-amdhsa"
+
+; Test reduction where range information comes from meta-data
+define i64 @func_range(i64 noundef %arg0, ptr %arg1.ptr) {
+  %shift.amt = load i64, ptr %arg1.ptr, !range !0
+  %shl = shl i64 %arg0, %shift.amt
+  ret i64 %shl
+
+; CHECK:  define i64 @func_range(i64 noundef %arg0, ptr %arg1.ptr) {
+; CHECK:  %shift.amt = load i64, ptr %arg1.ptr, align 8, !range !0
+; CHECK:  %1 = trunc i64 %arg0 to i32
+; CHECK:  %2 = trunc nuw nsw i64 %shift.amt to i32
+; CHECK:  %3 = add nsw i32 %2, -32
+; CHECK:  %4 = shl i32 %1, %3
+; CHECK:  %5 = insertelement <2 x i32> <i32 0, i32 poison>, i32 %4, i64 1
+; CHECK:  %shl = bitcast <2 x i32> %5 to i64
+; CHECK:  ret i64 %shl
+
+}
+!0 = !{i64 32, i64 64}
+
+; FIXME: This case should be reduced too, but computeKnownBits() cannot
+;        determine the range.  Match current results for now.
+define i64 @func_max(i64 noundef %arg0, i64 noundef %arg1) {
+  %max = call i64 @llvm.umax.i64(i64 %arg1, i64 32)
+  %min = call i64 @llvm.umin.i64(i64 %max,  i64 63)  
+  %shl = shl i64 %arg0, %min
+  ret i64 %shl
+
+; CHECK:  define i64 @func_max(i64 noundef %arg0, i64 noundef %arg1) {
+; CHECK:    %max = call i64 @llvm.umax.i64(i64 %arg1, i64 32)
+; CHECK:    %min = call i64 @llvm.umin.i64(i64 %max,  i64 63)
+; CHECK:    %shl = shl i64 %arg0, %min
+; CHECK:    ret i64 %shl
+}
+  
+

@nikic (Contributor) previously requested changes on Feb 3, 2025

@LU-JOHN (Contributor, Author) commented Feb 3, 2025

> Please read https://llvm.org/docs/InstCombineContributorGuide.html and in particular https://llvm.org/docs/InstCombineContributorGuide.html#canonicalization-and-target-independence.
>
> This transform should be in the backend.

Thanks, I'll move it to the backend.

@LU-JOHN marked this pull request as draft on February 4, 2025 15:40
@LU-JOHN marked this pull request as ready for review on February 5, 2025 17:05
@nikic requested a review from arsenm on February 5, 2025 17:10
@nikic dismissed their stale review on February 5, 2025 17:10

No longer in InstCombine

}

; This case must not be reduced because the known minimum, 16, is not in range.
define i64 @shl_or16(i64 noundef %arg0, ptr %arg1.ptr) {
arsenm (Contributor) commented:

This should test more variants. Should test with SGPR inputs, and the same scenario scaled up with 64 vectors.

Also a negative test with this scaled down to i16

LU-JOHN (Contributor, Author) replied:

Added variations with vector and SGPR inputs.

> Also a negative test with this scaled down to i16

What input should be scaled down to i16? The shl uses 64-bit inputs.

arsenm (Contributor) replied:
I mean a 32-bit shift that is reducible to 16-bit. Everything just half sized. We should do that, but it's trickier because we don't want to force vector usage in scalar contexts
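
For concreteness, the half-sized pattern described here would look like the following plain C++ sketch (hypothetical, not code from this PR; it only illustrates the scaled-down i32 -> i16 reduction):

    #include <cassert>
    #include <cstdint>

    // Hypothetical i32 -> i16 analogue: for shift amounts in [16, 31], the
    // low 16 bits of the result are zero and the high 16 bits are the low
    // half of x shifted by (y - 16).
    uint32_t shl32_reduced(uint32_t x, uint32_t y) {
      assert(y >= 16 && y <= 31);
      uint16_t hi = static_cast<uint16_t>(static_cast<uint16_t>(x) << (y - 16));
      return static_cast<uint32_t>(hi) << 16;
    }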


; test inreg

define i64 @shl_or16_inreg(i64 noundef %arg0, i64 inreg %shift_amt) {
arsenm (Contributor) commented:
I usually add s and v to indicate the operand types. Inreg doesn't tell me much. Also, the most interesting case requires testing both inputs are inreg so the whole computation is scalar. This is also using a vector return type

LU-JOHN (Contributor, Author) replied:

Used inreg for both inputs. Changed function name to use "sgpr" suffix. Could not find a way to return result in SGPR.

@LU-JOHN (Contributor, Author) commented Feb 7, 2025

> I mean a 32-bit shift that is reducible to 16-bit. Everything just half sized. We should do that, but it's trickier because we don't want to force vector usage in scalar contexts

Should I generalize the reduction to include shl32 -> shl16?

@arsenm (Contributor) commented Feb 7, 2025

> Should I generalize the reduction to include shl32 -> shl16?

I would prepare to generalize it, but do the reduction as a separate step

@LU-JOHN (Contributor, Author) commented Feb 7, 2025

>> Should I generalize the reduction to include shl32 -> shl16?
>
> I would prepare to generalize it, but do the reduction as a separate step

Then should we also add half-sized testing:

> I mean a 32-bit shift that is reducible to 16-bit. Everything just half sized. We should do that, but it's trickier because we don't want to force vector usage in scalar contexts

in a separate step?

@arsenm (Contributor) commented Feb 10, 2025

>> I mean a 32-bit shift that is reducible to 16-bit. Everything just half sized. We should do that, but it's trickier because we don't want to force vector usage in scalar contexts
>
> in a separate step?

Yes

@arsenm changed the title from "Reduce shl64 to shl32 if shift range is [63-32]" to "AMDGPU: Reduce shl64 to shl32 if shift range is [63-32]" on Feb 13, 2025
@arsenm removed the llvm:instcombine and llvm:transforms labels on Feb 13, 2025
@bcahoon merged commit 5decab1 into llvm:main on Feb 13, 2025
6 of 8 checks passed

@LU-JOHN Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

joaosaffran pushed a commit to joaosaffran/llvm-project that referenced this pull request Feb 14, 2025
Reduce:

   DST = shl i64 X, Y

where Y is known to be in the range [32, 63], to:

   DST = [0, shl i32 X, (Y - 32)]


Alive2 analysis:

https://alive2.llvm.org/ce/z/w_u5je

---------

Signed-off-by: John Lu <[email protected]>
sivan-shani pushed a commit to sivan-shani/llvm-project that referenced this pull request Feb 24, 2025
Reduce:

   DST = shl i64 X, Y

where Y is known to be in the range [32, 63], to:

   DST = [0, shl i32 X, (Y - 32)]


Alive2 analysis:

https://alive2.llvm.org/ce/z/w_u5je

---------

Signed-off-by: John Lu <[email protected]>