[AMDGPU] Implement hasAndNot for scalar bitwise AND-NOT operations. #112647

Open
wants to merge 12 commits into main from harrison/amdgpu

Conversation


@harrisonGPU harrisonGPU commented Oct 17, 2024

This PR implements the hasAndNot function for AMDGPU in the TargetLowering class, enabling LLVM to recognize and optimize bitwise AND-NOT operations for scalar 32-bit and 64-bit integer types (i32 and i64).

For example:

if ((X & Y) == Y) {
  // Perform action if all bits set in Y are also set in X
}

In such cases, the condition (X & Y) == Y can be optimized to (~X & Y) == 0 if hasAndNot returns true.
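As a standalone sanity check (plain C++, not part of the patch; the function names are ours), the two forms agree for all inputs:

```cpp
#include <cassert>
#include <cstdint>

// (x & y) == y holds exactly when every bit set in y is also set in x.
bool subsetViaAnd(uint32_t x, uint32_t y) { return (x & y) == y; }

// Equivalent form: no bit of y may survive in ~x.
bool subsetViaAndNot(uint32_t x, uint32_t y) { return (~x & y) == 0; }
```

The rewritten form stays at one ALU op plus a compare only because s_andn2_b32 fuses the NOT and the AND into a single scalar instruction, which is exactly what hasAndNot reports to the combiner.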

Closes #112550


llvmbot commented Oct 17, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Harrison Hao (harrisonGPU)

Changes

From #112550


Full diff: https://github.com/llvm/llvm-project/pull/112647.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp (+8)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h (+2)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 0f65df0763cc83..b746b94a60be21 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -3721,6 +3721,14 @@ SDValue AMDGPUTargetLowering::LowerSIGN_EXTEND_INREG(SDValue Op,
   return DAG.getBuildVector(VT, DL, Args);
 }
 
+bool AMDGPUTargetLowering::hasAndNot(SDValue Op) const {
+  if (Op->isDivergent())
+    return false;
+
+  EVT VT = Op.getValueType();
+  return VT == MVT::i32 || VT == MVT::i64;
+}
+
 //===----------------------------------------------------------------------===//
 // Custom DAG optimizations
 //===----------------------------------------------------------------------===//
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
index b2fd31cb2346eb..1289458570358b 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
@@ -99,6 +99,8 @@ class AMDGPUTargetLowering : public TargetLowering {
 
   SDValue LowerSIGN_EXTEND_INREG(SDValue Op, SelectionDAG &DAG) const;
 
+  bool hasAndNot(SDValue Y) const override;
+
 protected:
   bool shouldCombineMemoryType(EVT VT) const;
   SDValue performLoadCombine(SDNode *N, DAGCombinerInfo &DCI) const;

@shiltian
Contributor

test

@harrisonGPU
Contributor Author

test

Okay, thanks Shilei. :)

Contributor

@arsenm arsenm left a comment

Needs tests (most of the effort on this patch is adding the requisite tests).

We need coverage with all the combinations of SGPR / VGPR inputs in i1 / i8 / i16 / i32 / i64 for the basic and-not pattern.

Additionally we need some tests for the optimizations enabled by this hook.

define amdgpu_ps float @xor3_vgpr_b(i32 inreg %a, i32 %b, i32 inreg %c) {
is one example.

Use arguments with inreg to get sample SGPR inputs, otherwise VGPR

bool SITargetLowering::hasAndNot(SDValue Op) const {
if (Op->isDivergent())
return false;

Contributor

Comment why this is the set of cases

Contributor Author

Done.

Contributor

Shouldn't really need to consider the machine opcode case

Contributor Author

Sorry for the late update on this PR. Last month, I was still thinking about this patch and forgot to push it to the origin branch. I'm still considering this issue, because some lit tests show an increase in the number of instructions, while others show a decrease. So, I'm not yet sure whether it impacts performance.

Contributor

Most of the work of this change is avoiding the regressions

Contributor Author

Thanks! Do you have any further suggestions? Also, do you think it's ready to be merged now? :-)

@harrisonGPU
Contributor Author

harrisonGPU commented Oct 17, 2024

Needs tests (most of the effort on this patch is adding the requisite tests).

We need coverage with all the combinations of SGPR / VGPR inputs in i1 / i8 / i16 / i32 / i64 for the basic and-not pattern.

Additionally we need some tests for the optimizations enabled by this hook.

define amdgpu_ps float @xor3_vgpr_b(i32 inreg %a, i32 %b, i32 inreg %c) {

is one example.
Use arguments with inreg to get sample SGPR inputs, otherwise VGPR

Okay, I need to consider how to add tests for this patch. :)

@harrisonGPU harrisonGPU marked this pull request as draft October 17, 2024 13:01

github-actions bot commented Oct 18, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@@ -6822,6 +6822,81 @@ static unsigned getExtOpcodeForPromotedOp(SDValue Op) {
}
}

SDValue SITargetLowering::combineAnd(SDValue Op,
Contributor

You should not be implementing any new combine in this patch. The point of this is to enable the existing optimizations. If it's worth adding something else, it would be a separate PR

Contributor Author

@harrisonGPU harrisonGPU Oct 18, 2024

Okay, but I think hasAndNot can be used for the AND operation in this scenario: (and LHS, (or Y, ~Z)). X86 handles it this way, so I tried to implement it. :)
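The algebra behind that shape can be sketched in standalone C++ (illustrative only; whether this matches the exact x86 combine is an assumption on our part):

```cpp
#include <cassert>
#include <cstdint>

// and(LHS, or(Y, ~Z)) distributes into an OR whose second term is an
// and-not (LHS & ~Z), which is why hasAndNot gets consulted for this shape.
uint32_t combined(uint32_t lhs, uint32_t y, uint32_t z) {
  return lhs & (y | ~z);
}
uint32_t distributed(uint32_t lhs, uint32_t y, uint32_t z) {
  return (lhs & y) | (lhs & ~z);
}
```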

Contributor

I don't know why x86 has that in the backend but it belongs in a generic combine

@harrisonGPU harrisonGPU marked this pull request as ready for review October 18, 2024 05:38
@harrisonGPU harrisonGPU marked this pull request as draft October 18, 2024 07:14
@harrisonGPU harrisonGPU marked this pull request as ready for review October 18, 2024 10:32
@harrisonGPU
Contributor Author

Needs tests (most of the effort on this patch is adding the requisite tests).

We need coverage with all the combinations of SGPR / VGPR inputs in i1 / i8 / i16 / i32 / i64 for the basic and-not pattern.

Additionally we need some tests for the optimizations enabled by this hook.

define amdgpu_ps float @xor3_vgpr_b(i32 inreg %a, i32 %b, i32 inreg %c) {

is one example.
Use arguments with inreg to get sample SGPR inputs, otherwise VGPR

Do I need to add cases for i1, i8, and i16? I think the hasAndNot function does not affect the i1, i8, or i16 cases.

@@ -25,6 +25,28 @@ entry:
ret void
}

; GCN-LABEL: {{^}}scalar_andn2_i32_one_sgpr
; GCN: s_andn2_b32
define amdgpu_kernel void @scalar_andn2_i32_one_sgpr(
Contributor

better to use a shader calling convention and use the return value. the style of tests with the output pointer and a kernel is older and predates function support. kernels have a lot of extra noise in the output

Contributor Author

Okay, I updated it. :)

bool SITargetLowering::hasAndNot(SDValue Op) const {
if (Op->isDivergent())
return false;

Contributor

Shouldn't really need to consider the machine opcode case

@@ -25,6 +25,24 @@ entry:
ret void
}

; GCN-LABEL: {{^}}scalar_andn2_i32_one_sgpr
; GCN: s_andn2_b32
define i32 @scalar_andn2_i32_one_sgpr(i32 inreg %a, i32 inreg %b) {
Contributor

These tests only show the and-not pattern to begin with; they do not show the transforms enabled by this hook.

Contributor Author

I have removed it. Sorry for the late update on this PR. Last month, I was still thinking about this patch and forgot to push it to the origin branch. I'm still considering this issue, because some lit tests show an increase in the number of instructions, while others show a decrease. So, I'm not yet sure whether it impacts performance.

@harrisonGPU harrisonGPU force-pushed the harrison/amdgpu branch 2 times, most recently from 5e64801 to f46d274 Compare May 13, 2025 12:39
@@ -0,0 +1,764 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -verify-machineinstrs < %s | FileCheck --check-prefix=GCN %s
Contributor

Suggested change
; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -verify-machineinstrs < %s | FileCheck --check-prefix=GCN %s
; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 < %s | FileCheck --check-prefix=GCN %s

Should test with a few sub targets

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -verify-machineinstrs < %s | FileCheck --check-prefix=GCN %s

define i32 @out32(i32 inreg %x, i32 inreg %y, i32 inreg %mask) {
Contributor

Not sure what "out32" refers to. Need SGPR and VGPR versions of the tests (which usually use s_ and v_ prefixes)

Contributor Author

I have a question. In hasAndNot we have:

  if (Op->isDivergent())
    return false;

Values in VGPRs are divergent, so the check returns false right away; the hook only fires for uniform (inreg) values. With that in mind, do we really need a VGPR test case?

%n1 = and i32 %n0, %mask
%r = xor i32 %n1, %z
ret i32 %r
}
Contributor

Also test vector types

@@ -1,3 +1,4 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
Contributor

Shouldn't convert a test to generated checks in a functional change

Comment on lines 17566 to 17583
// Return false if the operation is divergent, as AND-NOT is a scalar-only
// instruction.
Contributor

First part of the comment is describing the mechanics, not the reason

@arsenm
Contributor

arsenm commented May 13, 2025

Also it would be better to precommit the tests separately so the diff is obvious here

@harrisonGPU
Contributor Author

Also it would be better to precommit the tests separately so the diff is obvious here

Okay, I will try to separate it! Thanks.

return false;

EVT VT = Op.getValueType();
return VT == MVT::i32 || VT == MVT::i64;
Contributor

I'm not sure we need to check types here. How about just return !Op->isDivergent();? If the types are not legal they will get legalized, but that should not affect the decision of whether to form an and-not pattern.

Contributor

If it's a different type and then is legalized, there will be intermediate instructions that break the and not pattern
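For what it's worth, the identity itself is width-independent, as a standalone C++ check shows (illustrative, not part of the patch); the regression risk discussed above comes purely from the extension/truncation instructions legalization inserts around illegal types:

```cpp
#include <cassert>
#include <cstdint>

// The same subset test at 16 bits. C++ integer promotion widens the
// operands to int here, loosely mirroring how an illegal i16 would be
// legalized to i32 in the DAG; the result is unchanged either way.
bool viaAnd16(uint16_t x, uint16_t y) { return (x & y) == y; }
bool viaAndNot16(uint16_t x, uint16_t y) {
  return static_cast<uint16_t>(~x & y) == 0; // cast models truncation
}
```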

@arsenm arsenm requested a review from Pierre-vh May 21, 2025 10:05

Successfully merging this pull request may close these issues.

AMDGPU should implement TargetLowering::hasAndNot
7 participants