AMDGPU: Reduce shl64 to shl32 if shift range is [63-32] #125574
Conversation
This PR addresses: #63848
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-backend-amdgpu

Author: None (LU-JOHN)

Changes

Reduce:

  DST = shl i64 X, Y

where Y is in the range [63-32], to:

  DST = [shl i32 X, (Y - 32), 0]

Full diff: https://github.com/llvm/llvm-project/pull/125574.diff

6 Files Affected:
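For illustration, a standalone before/after sketch of the rewrite (hand-written for this summary; the function names and the !range encoding of the shift-amount bound are assumptions, mirroring the new test added below):

; Before: a 64-bit shift whose amount is known to lie in [32, 63].
define i64 @shl64_before(i64 %x, ptr %amt.ptr) {
  %y = load i64, ptr %amt.ptr, !range !0
  %shl = shl i64 %x, %y
  ret i64 %shl
}

; After: a single 32-bit shift; the low half of the result is zero and
; the high half is (trunc X) << (Y - 32).
define i64 @shl64_after(i64 %x, ptr %amt.ptr) {
  %y = load i64, ptr %amt.ptr, !range !0
  %x.lo = trunc i64 %x to i32
  %amt = trunc i64 %y to i32
  %amt.adj = sub i32 %amt, 32
  %hi = shl i32 %x.lo, %amt.adj
  %vec = insertelement <2 x i32> <i32 0, i32 poison>, i32 %hi, i64 1
  %res = bitcast <2 x i32> %vec to i64
  ret i64 %res
}

!0 = !{i64 32, i64 64}

On a little-endian target such as amdgcn, element 1 of the <2 x i32> occupies the high 32 bits of the i64, so the bitcast reassembles X << Y using only a 32-bit shift.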
diff --git a/llvm/include/llvm/Transforms/InstCombine/InstCombiner.h b/llvm/include/llvm/Transforms/InstCombine/InstCombiner.h
index fa6b60cba15aaf..dfd275b020ed75 100644
--- a/llvm/include/llvm/Transforms/InstCombine/InstCombiner.h
+++ b/llvm/include/llvm/Transforms/InstCombine/InstCombiner.h
@@ -521,6 +521,8 @@ class LLVM_LIBRARY_VISIBILITY InstCombiner {
bool AllowMultipleUsers = false) = 0;
bool isValidAddrSpaceCast(unsigned FromAS, unsigned ToAS) const;
+
+ bool shouldReduceShl64ToShl32();
};
} // namespace llvm
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp b/llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
index 7ef95800975dba..3ced23671f11a8 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
@@ -1032,6 +1032,32 @@ static bool setShiftFlags(BinaryOperator &I, const SimplifyQuery &Q) {
return Changed;
}
+static Instruction *transformClampedShift64(BinaryOperator &I,
+ const SimplifyQuery &Q,
+ InstCombiner::BuilderTy &Builder) {
+ Value *Op0 = I.getOperand(0), *Op1 = I.getOperand(1);
+ Type *I32Type = Type::getInt32Ty(I.getContext());
+ Type *I64Type = Type::getInt64Ty(I.getContext());
+
+ if (I.getType() == I64Type) {
+ KnownBits KnownAmt = computeKnownBits(Op1, /* Depth */ 0, Q);
+ if (KnownAmt.getMinValue().uge(32)) {
+ Value *TruncVal = Builder.CreateTrunc(Op0, I32Type);
+ Value *TruncShiftAmt = Builder.CreateTrunc(Op1, I32Type);
+ Value *AdjustedShiftAmt =
+ Builder.CreateSub(TruncShiftAmt, ConstantInt::get(I32Type, 32));
+ Value *Shl32 = Builder.CreateShl(TruncVal, AdjustedShiftAmt);
+ Value *VResult =
+ Builder.CreateVectorSplat(2, ConstantInt::get(I32Type, 0));
+
+ VResult = Builder.CreateInsertElement(VResult, Shl32,
+ ConstantInt::get(I32Type, 1));
+ return CastInst::Create(Instruction::BitCast, VResult, I64Type);
+ }
+ }
+ return nullptr;
+}
+
Instruction *InstCombinerImpl::visitShl(BinaryOperator &I) {
const SimplifyQuery Q = SQ.getWithInstruction(&I);
@@ -1266,6 +1292,10 @@ Instruction *InstCombinerImpl::visitShl(BinaryOperator &I) {
}
}
+ if (this->shouldReduceShl64ToShl32())
+ if (Instruction *V = transformClampedShift64(I, Q, Builder))
+ return V;
+
return nullptr;
}
diff --git a/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp b/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
index 5621511570b581..d356741fcdf21e 100644
--- a/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
@@ -194,6 +194,15 @@ bool InstCombiner::isValidAddrSpaceCast(unsigned FromAS, unsigned ToAS) const {
return TTIForTargetIntrinsicsOnly.isValidAddrSpaceCast(FromAS, ToAS);
}
+bool InstCombiner::shouldReduceShl64ToShl32() {
+ InstructionCost costShl32 = TTIForTargetIntrinsicsOnly.getArithmeticInstrCost(
+ Instruction::Shl, Builder.getInt32Ty(), TTI::TCK_Latency);
+ InstructionCost costShl64 = TTIForTargetIntrinsicsOnly.getArithmeticInstrCost(
+ Instruction::Shl, Builder.getInt64Ty(), TTI::TCK_Latency);
+
+ return costShl32 < costShl64;
+}
+
Value *InstCombinerImpl::EmitGEPOffset(GEPOperator *GEP, bool RewriteGEP) {
if (!RewriteGEP)
return llvm::emitGEPOffset(&Builder, DL, GEP);
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll
index ab2363860af9de..84ac4af6584677 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll
@@ -174,7 +174,7 @@ define double @test_pow_fast_f64__integral_y(double %x, i32 %y.i) {
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: s_swappc_b64 s[30:31], s[16:17]
; CHECK-NEXT: v_lshlrev_b32_e32 v2, 31, v41
-; CHECK-NEXT: v_and_b32_e32 v2, v2, v42
+; CHECK-NEXT: v_and_b32_e32 v2, v42, v2
; CHECK-NEXT: buffer_load_dword v42, off, s[0:3], s33 ; 4-byte Folded Reload
; CHECK-NEXT: buffer_load_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Reload
; CHECK-NEXT: buffer_load_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Reload
@@ -458,7 +458,7 @@ define double @test_pown_fast_f64(double %x, i32 %y) {
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: s_swappc_b64 s[30:31], s[16:17]
; CHECK-NEXT: v_lshlrev_b32_e32 v2, 31, v41
-; CHECK-NEXT: v_and_b32_e32 v2, v2, v42
+; CHECK-NEXT: v_and_b32_e32 v2, v42, v2
; CHECK-NEXT: buffer_load_dword v42, off, s[0:3], s33 ; 4-byte Folded Reload
; CHECK-NEXT: buffer_load_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Reload
; CHECK-NEXT: buffer_load_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Reload
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pown.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pown.ll
index f9c359bc114ed3..5155e42fef3cbf 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pown.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pown.ll
@@ -720,14 +720,15 @@ define double @test_pown_afn_nnan_ninf_f64(double %x, i32 %y) {
; CHECK-NEXT: [[POWNI2F:%.*]] = sitofp i32 [[Y]] to double
; CHECK-NEXT: [[__YLOGX:%.*]] = fmul nnan ninf afn double [[__LOG2]], [[POWNI2F]]
; CHECK-NEXT: [[__EXP2:%.*]] = call nnan ninf afn double @_Z4exp2d(double [[__YLOGX]])
-; CHECK-NEXT: [[__YTOU:%.*]] = zext i32 [[Y]] to i64
-; CHECK-NEXT: [[__YEVEN:%.*]] = shl i64 [[__YTOU]], 63
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast double [[X]] to i64
-; CHECK-NEXT: [[__POW_SIGN:%.*]] = and i64 [[__YEVEN]], [[TMP0]]
-; CHECK-NEXT: [[TMP1:%.*]] = bitcast double [[__EXP2]] to i64
-; CHECK-NEXT: [[TMP2:%.*]] = or i64 [[__POW_SIGN]], [[TMP1]]
-; CHECK-NEXT: [[TMP3:%.*]] = bitcast i64 [[TMP2]] to double
-; CHECK-NEXT: ret double [[TMP3]]
+; CHECK-NEXT: [[TMP0:%.*]] = shl i32 [[Y]], 31
+; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x i32> <i32 0, i32 poison>, i32 [[TMP0]], i64 1
+; CHECK-NEXT: [[__YEVEN:%.*]] = bitcast <2 x i32> [[TMP1]] to i64
+; CHECK-NEXT: [[TMP2:%.*]] = bitcast double [[X]] to i64
+; CHECK-NEXT: [[__POW_SIGN:%.*]] = and i64 [[TMP2]], [[__YEVEN]]
+; CHECK-NEXT: [[TMP3:%.*]] = bitcast double [[__EXP2]] to i64
+; CHECK-NEXT: [[TMP4:%.*]] = or i64 [[__POW_SIGN]], [[TMP3]]
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast i64 [[TMP4]] to double
+; CHECK-NEXT: ret double [[TMP5]]
;
entry:
%call = tail call nnan ninf afn double @_Z4powndi(double %x, i32 %y)
diff --git a/llvm/test/Transforms/InstCombine/shl64-reduce.ll b/llvm/test/Transforms/InstCombine/shl64-reduce.ll
new file mode 100644
index 00000000000000..00f6c82fae9ad0
--- /dev/null
+++ b/llvm/test/Transforms/InstCombine/shl64-reduce.ll
@@ -0,0 +1,48 @@
+;; Test reduction of:
+;;
+;; DST = shl i64 X, Y
+;;
+;; where Y is in the range [63-32] to:
+;;
+;; DST = [shl i32 X, (Y - 32), 0]
+
+; RUN: opt < %s -passes=instcombine -S | FileCheck %s
+
+
+target triple = "amdgcn-amd-amdhsa"
+
+; Test reduction where range information comes from meta-data
+define i64 @func_range(i64 noundef %arg0, ptr %arg1.ptr) {
+ %shift.amt = load i64, ptr %arg1.ptr, !range !0
+ %shl = shl i64 %arg0, %shift.amt
+ ret i64 %shl
+
+; CHECK: define i64 @func_range(i64 noundef %arg0, ptr %arg1.ptr) {
+; CHECK: %shift.amt = load i64, ptr %arg1.ptr, align 8, !range !0
+; CHECK: %1 = trunc i64 %arg0 to i32
+; CHECK: %2 = trunc nuw nsw i64 %shift.amt to i32
+; CHECK: %3 = add nsw i32 %2, -32
+; CHECK: %4 = shl i32 %1, %3
+; CHECK: %5 = insertelement <2 x i32> <i32 0, i32 poison>, i32 %4, i64 1
+; CHECK: %shl = bitcast <2 x i32> %5 to i64
+; CHECK: ret i64 %shl
+
+}
+!0 = !{i64 32, i64 64}
+
+; FIXME: This case should be reduced too, but computeKnownBits() cannot
+; determine the range. Match current results for now.
+define i64 @func_max(i64 noundef %arg0, i64 noundef %arg1) {
+ %max = call i64 @llvm.umax.i64(i64 %arg1, i64 32)
+ %min = call i64 @llvm.umin.i64(i64 %max, i64 63)
+ %shl = shl i64 %arg0, %min
+ ret i64 %shl
+
+; CHECK: define i64 @func_max(i64 noundef %arg0, i64 noundef %arg1) {
+; CHECK: %max = call i64 @llvm.umax.i64(i64 %arg1, i64 32)
+; CHECK: %min = call i64 @llvm.umin.i64(i64 %max, i64 63)
+; CHECK: %shl = shl i64 %arg0, %min
+; CHECK: ret i64 %shl
+}
+
+
Please read https://llvm.org/docs/InstCombineContributorGuide.html and in particular https://llvm.org/docs/InstCombineContributorGuide.html#canonicalization-and-target-independence.
This transform should be in the backend.
Thanks, I'll move it to the backend.
; This case must not be reduced because the known minimum, 16, is not in range.
define i64 @shl_or16(i64 noundef %arg0, ptr %arg1.ptr) {
This should test more variants. It should test with SGPR inputs, and the same scenario scaled up to vectors of i64.
Also a negative test with this scaled down to i16.
Added variations with vector and SGPR inputs.

> Also a negative test with this scaled down to i16

What input should be scaled down to i16? The shl uses 64-bit inputs.
I mean a 32-bit shift that is reducible to 16-bit, everything just half sized. We should do that, but it's trickier because we don't want to force vector usage in scalar contexts.
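A hypothetical half-sized analogue (not implemented by this PR; the names and !range bounds are invented for illustration). The would-be output makes the concern concrete, since it forces a <2 x i16> vector even in otherwise scalar code:

; Input: a 32-bit shift whose amount is known to lie in [16, 31].
define i32 @shl32_reducible(i32 %x, ptr %amt.ptr) {
  %y = load i32, ptr %amt.ptr, !range !1
  %shl = shl i32 %x, %y
  ret i32 %shl
}

; Would-be output: a 16-bit shift into the high half of a <2 x i16>.
define i32 @shl32_reduced(i32 %x, ptr %amt.ptr) {
  %y = load i32, ptr %amt.ptr, !range !1
  %x.lo = trunc i32 %x to i16
  %amt = trunc i32 %y to i16
  %amt.adj = sub i16 %amt, 16
  %hi = shl i16 %x.lo, %amt.adj
  %vec = insertelement <2 x i16> <i16 0, i16 poison>, i16 %hi, i64 1
  %res = bitcast <2 x i16> %vec to i32
  ret i32 %res
}

!1 = !{i32 16, i32 32}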
; test inreg
define i64 @shl_or16_inreg(i64 noundef %arg0, i64 inreg %shift_amt) {
I usually add s and v to indicate the operand types; inreg doesn't tell me much. Also, the most interesting case requires testing that both inputs are inreg so the whole computation is scalar. This is also using a vector return type.
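For concreteness, a minimal all-SGPR variant in the spirit of this request might look like the following (the name and the or-based bound are illustrative, not necessarily what the test landed as; or'ing the amount with 32 gives computeKnownBits a known minimum of 32, and amounts of 64 or more would make the original shl poison anyway):

define i64 @shl_sgpr(i64 inreg %arg0, i64 inreg %arg1) {
  %shift.amt = or i64 %arg1, 32    ; known minimum shift amount is 32
  %shl = shl i64 %arg0, %shift.amt
  ret i64 %shl
}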
Used inreg for both inputs. Changed the function name to use an "sgpr" suffix. Could not find a way to return the result in an SGPR.

Should I generalize the reduction to include shl32 -> shl16?
I would prepare to generalize it, but do the reduction as a separate step.
Then should we also add the half-sized testing in a separate step?
Yes
@LU-JOHN Congratulations on having your first Pull Request (PR) merged into the LLVM Project! Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR. Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues. How to do this, and the rest of the post-merge process, is covered in detail here. If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again. If you don't get any reports, no action is required from you. Your changes are working as expected, well done!
Reduce:
DST = shl i64 X, Y
where Y is in the range [63-32] to:
DST = [0, shl i32 X, (Y & 31)]
Alive2 analysis:
https://alive2.llvm.org/ce/z/w_u5je
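For Y in [32, 63], Y & 31 equals Y - 32, so the masked form in the commit message computes the same high word as the subtract form in the diff above. A hand-written sketch of the masked variant (names and !range encoding are assumptions, as in the earlier examples):

define i64 @shl64_masked(i64 %x, ptr %amt.ptr) {
  %y = load i64, ptr %amt.ptr, !range !0
  %x.lo = trunc i64 %x to i32
  %amt = trunc i64 %y to i32
  %amt.mask = and i32 %amt, 31   ; equals %amt - 32 when %amt is in [32, 63]
  %hi = shl i32 %x.lo, %amt.mask
  %vec = insertelement <2 x i32> <i32 0, i32 poison>, i32 %hi, i64 1
  %res = bitcast <2 x i32> %vec to i64
  ret i64 %res
}

!0 = !{i64 32, i64 64}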