[AMDGPU] Implement hasAndNot for scalar bitwise AND-NOT operations. #112647
Conversation
@llvm/pr-subscribers-backend-amdgpu

Author: Harrison Hao (harrisonGPU)

Changes: From #112550. This PR implements the hasAndNot hook for AMDGPU. For example, the condition (X & Y) == Y can be optimized to (~X & Y) == 0 if hasAndNot returns true.

Full diff: https://github.com/llvm/llvm-project/pull/112647.diff (2 files affected):
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 0f65df0763cc83..b746b94a60be21 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -3721,6 +3721,14 @@ SDValue AMDGPUTargetLowering::LowerSIGN_EXTEND_INREG(SDValue Op,
return DAG.getBuildVector(VT, DL, Args);
}
+bool AMDGPUTargetLowering::hasAndNot(SDValue Op) const {
+ if (Op->isDivergent())
+ return false;
+
+ EVT VT = Op.getValueType();
+ return VT == MVT::i32 || VT == MVT::i64;
+}
+
//===----------------------------------------------------------------------===//
// Custom DAG optimizations
//===----------------------------------------------------------------------===//
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
index b2fd31cb2346eb..1289458570358b 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
@@ -99,6 +99,8 @@ class AMDGPUTargetLowering : public TargetLowering {
SDValue LowerSIGN_EXTEND_INREG(SDValue Op, SelectionDAG &DAG) const;
+ bool hasAndNot(SDValue Y) const override;
+
protected:
bool shouldCombineMemoryType(EVT VT) const;
SDValue performLoadCombine(SDNode *N, DAGCombinerInfo &DCI) const;
test

Okay, thanks Shilei. :)
Needs tests (most of the effort on this patch is adding the requisite tests).
We need coverage with all the combinations of SGPR / VGPR inputs in i1 / i8 / i16 / i32 / i64 for the basic and-not pattern.
Additionally we need some tests for the optimizations enabled by this hook.
define amdgpu_ps float @xor3_vgpr_b(i32 inreg %a, i32 %b, i32 inreg %c) {

Use arguments with inreg to get sample SGPR inputs; otherwise they are VGPRs.
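As a hedged sketch of what the requested SGPR/VGPR coverage might look like (function names and comments are illustrative assumptions, not taken from the patch), two tests exercising the same and-not pattern with uniform and divergent inputs:

```llvm
; Illustrative only: names and expectations are assumptions, not from the PR.
; inreg (SGPR, uniform) inputs should be able to select s_andn2_b32.
define i32 @s_andn2_i32(i32 inreg %x, i32 inreg %y) {
  %not.y = xor i32 %y, -1
  %r = and i32 %x, %not.y
  ret i32 %r
}

; Plain (VGPR, divergent) inputs: hasAndNot returns false here, so a
; separate not/and sequence on VALU instructions is expected instead.
define i32 @v_andn2_i32(i32 %x, i32 %y) {
  %not.y = xor i32 %y, -1
  %r = and i32 %x, %not.y
  ret i32 %r
}
```

The same pair would be repeated for i1, i8, i16, and i64 to cover the types the reviewer lists.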
bool SITargetLowering::hasAndNot(SDValue Op) const {
  if (Op->isDivergent())
    return false;
Comment why this is the set of cases
Done.
Shouldn't really need to consider the machine opcode case
Sorry for the late update on this PR. Last month, I was still thinking about this patch and forgot to push it to the origin branch. I'm still considering this issue, because some lit tests show an increase in the number of instructions, while others show a decrease. So, I'm not yet sure whether it impacts performance.
Most of the work of this change is avoiding the regressions
Thanks! Do you have any further suggestions? Also, do you think it's ready to be merged now? :-)
Okay, I need to consider how to add tests for this patch. :)

✅ With the latest revision this PR passed the C/C++ code formatter.

@@ -6822,6 +6822,81 @@ static unsigned getExtOpcodeForPromotedOp(SDValue Op) {
  }
}

SDValue SITargetLowering::combineAnd(SDValue Op,
You should not be implementing any new combine in this patch. The point of this is to enable the existing optimizations. If it's worth adding something else, it would be a separate PR
Okay, but I think hasAndNot can be used by an AND combine for this scenario: (and LHS, (or Y, ~Z)). X86 handles it this way, so I tried to implement it. :)
I don't know why x86 has that in the backend but it belongs in a generic combine
(Force-pushed 8dba2b5 to ea08a49.)
Do I need to add cases for i1, i8, and i16? I think the hasAndNot function does not affect the i1, i8, or i16 cases.
llvm/test/CodeGen/AMDGPU/andorn2.ll
Outdated
@@ -25,6 +25,28 @@ entry:
  ret void
}

; GCN-LABEL: {{^}}scalar_andn2_i32_one_sgpr
; GCN: s_andn2_b32
define amdgpu_kernel void @scalar_andn2_i32_one_sgpr(
Better to use a shader calling convention and use the return value. The style of tests with an output pointer and a kernel is older and predates function support; kernels have a lot of extra noise in the output.
Okay, I updated it. :)
llvm/test/CodeGen/AMDGPU/andorn2.ll
Outdated
@@ -25,6 +25,24 @@ entry:
  ret void
}

; GCN-LABEL: {{^}}scalar_andn2_i32_one_sgpr
; GCN: s_andn2_b32
define i32 @scalar_andn2_i32_one_sgpr(i32 inreg %a, i32 inreg %b) {
These tests only show the and-not pattern to begin with; they do not show the transforms enabled by this hook.
I have removed it.
(Force-pushed 5e64801 to f46d274.)
@@ -0,0 +1,764 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -verify-machineinstrs < %s | FileCheck --check-prefix=GCN %s
Suggested change:
- ; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -verify-machineinstrs < %s | FileCheck --check-prefix=GCN %s
+ ; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 < %s | FileCheck --check-prefix=GCN %s
Should test with a few sub targets
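A hedged sketch of what multi-subtarget RUN lines might look like; the specific -mcpu values and check prefixes here are illustrative assumptions, not choices made in the PR:

```llvm
; Illustrative only: subtargets and prefixes are assumptions.
; RUN: llc -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck --check-prefixes=GCN,GFX9 %s
; RUN: llc -mtriple=amdgcn -mcpu=gfx1030 < %s | FileCheck --check-prefixes=GCN,GFX10 %s
; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 < %s | FileCheck --check-prefixes=GCN,GFX11 %s
```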
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -verify-machineinstrs < %s | FileCheck --check-prefix=GCN %s

define i32 @out32(i32 inreg %x, i32 inreg %y, i32 inreg %mask) {
Not sure what "out32" refers to. Need SGPR and VGPR versions of the tests (which usually use s_ and v_ prefixes)
I have a question. In hasAndNot we have:
if (Op->isDivergent())
  return false;
Values in VGPRs are divergent, so the check returns false right away; the hook only triggers on inreg values. With that in mind, do we really need a VGPR test case?
  %n1 = and i32 %n0, %mask
  %r = xor i32 %n1, %z
  ret i32 %r
}
Also test vector types
@@ -1,3 +1,4 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
Shouldn't convert a test to generated checks in a functional change
// Return false if the operation is divergent, as AND-NOT is a scalar-only
// instruction.
First part of the comment is describing the mechanics, not the reason
Also it would be better to precommit the tests separately so the diff is obvious here.

Okay, I will try to separate it! Thanks.
(Force-pushed 6791fea to dba6155.)
    return false;

  EVT VT = Op.getValueType();
  return VT == MVT::i32 || VT == MVT::i64;
I'm not sure we need to check types here. How about just return !Op->isDivergent();? If the types are not legal they will get legalized, but that should not affect the decision of whether to form an and-not pattern.
If it's a different type and then is legalized, there will be intermediate instructions that break the and not pattern
This PR implements the hasAndNot function for AMDGPU in the TargetLowering class, enabling LLVM to recognize and optimize bitwise AND-NOT operations for scalar 32-bit and 64-bit integer types (i32 and i64).

For example, the condition (X & Y) == Y can be optimized to (~X & Y) == 0 if hasAndNot returns true.

Closes #112550