
[AMDGPU] Implement hasAndNot for scalar bitwise AND-NOT operations. #112647


Open · wants to merge 12 commits into main
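For context (a reader's sketch, not part of the PR description): TargetLowering::hasAndNot tells the DAG combiner whether the target has a cheap and-not, so masked-merge style patterns are worth rewriting into that form. On AMDGPU this is only profitable for uniform values, which can use the scalar s_andn2_b32 / s_andn2_b64 instructions. The IR below is illustrative only (the function name is made up and is not taken from the PR's tests); it shows the kind of uniform masked merge that the bfi_int.ll diffs further down exercise.

```llvm
; Illustrative only: ((a ^ b) & mask) ^ b is a masked merge, equivalent to
; (a & mask) | (b & ~mask). With hasAndNot returning true for uniform i32
; values, the combiner can unfold it into and/and-not/or, which selects to
; s_and_b32 + s_andn2_b32 + s_or_b32 instead of an xor/and/xor chain.
define amdgpu_kernel void @illustrative_bitselect(ptr addrspace(1) %out, i32 %a, i32 %b, i32 %mask) {
entry:
  %xor0 = xor i32 %a, %b
  %and = and i32 %xor0, %mask
  %sel = xor i32 %and, %b
  store i32 %sel, ptr addrspace(1) %out
  ret void
}
```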
10 changes: 10 additions & 0 deletions llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -17577,3 +17577,13 @@ SITargetLowering::lowerIdempotentRMWIntoFencedLoad(AtomicRMWInst *AI) const {
AI->eraseFromParent();
return LI;
}

bool SITargetLowering::hasAndNot(SDValue Op) const {
// AND-NOT is only valid on uniform (SGPR) values; divergent values live in
// VGPRs.
if (Op->isDivergent())
return false;

Contributor:

Comment why this is the set of cases

Contributor Author:

Done.

Contributor:

Shouldn't really need to consider the machine opcode case

Contributor Author:

Sorry for the late update on this PR. Last month I was still thinking about this patch and forgot to push it to the origin branch. I'm still considering this issue, because some lit tests show an increase in instruction count while others show a decrease, so I'm not yet sure whether it impacts performance.

Contributor:

Most of the work of this change is avoiding the regressions

Contributor Author:

Thanks! Do you have any further suggestions? Also, do you think it's ready to be merged now? :-)

EVT VT = Op.getValueType();
return VT == MVT::i32 || VT == MVT::i64;
Contributor:

I'm not sure we need to check types here. How about just return !Op->isDivergent();? If the types are not legal they will get legalized, but that should not affect the decision of whether to form an and-not pattern.

Contributor:

If it's a different type that is then legalized, there will be intermediate instructions that break the and-not pattern.

}
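To make the legalization concern in the thread above concrete, here is a minimal sketch (not taken from the PR or its tests; the function name and the choice of i16 are assumptions for illustration). i32 and i64 map directly onto s_andn2_b32 and s_andn2_b64, but an illegal type is promoted first, and the extra nodes that promotion inserts can sit between the xor and the and.

```llvm
; Hypothetical example, not part of the PR's test suite: a uniform and-not
; over i16. Type legalization promotes the i16 operations to i32 and inserts
; extension/masking nodes between the xor and the and, so the combiner may no
; longer see a clean and-not pattern to map onto s_andn2_b32.
define amdgpu_ps float @illustrative_andnot_i16(i16 inreg %x, i16 inreg %y) {
entry:
  %not.y = xor i16 %y, -1
  %r = and i16 %x, %not.y
  %ext = zext i16 %r to i32
  %cast = bitcast i32 %ext to float
  ret float %cast
}
```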
1 change: 1 addition & 0 deletions llvm/lib/Target/AMDGPU/SIISelLowering.h
@@ -611,6 +611,7 @@ class SITargetLowering final : public AMDGPUTargetLowering {

MachineMemOperand::Flags
getTargetMMOFlags(const Instruction &I) const override;
bool hasAndNot(SDValue Op) const override;
};

// Returns true if argument is a boolean value which is not serialized into
109 changes: 67 additions & 42 deletions llvm/test/CodeGen/AMDGPU/bfi_int.ll
@@ -135,9 +135,9 @@ define amdgpu_kernel void @s_bfi_sha256_ch(ptr addrspace(1) %out, i32 %x, i32 %y
; GFX7-NEXT: s_mov_b32 s7, 0xf000
; GFX7-NEXT: s_mov_b32 s6, -1
; GFX7-NEXT: s_waitcnt lgkmcnt(0)
; GFX7-NEXT: s_xor_b32 s1, s1, s2
; GFX7-NEXT: s_and_b32 s0, s0, s1
; GFX7-NEXT: s_xor_b32 s0, s2, s0
; GFX7-NEXT: s_andn2_b32 s2, s2, s0
; GFX7-NEXT: s_and_b32 s0, s1, s0
; GFX7-NEXT: s_or_b32 s0, s0, s2
; GFX7-NEXT: v_mov_b32_e32 v0, s0
; GFX7-NEXT: buffer_store_dword v0, off, s[4:7], 0
; GFX7-NEXT: s_endpgm
@@ -147,9 +147,9 @@ define amdgpu_kernel void @s_bfi_sha256_ch(ptr addrspace(1) %out, i32 %x, i32 %y
; GFX8-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x2c
; GFX8-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x24
; GFX8-NEXT: s_waitcnt lgkmcnt(0)
; GFX8-NEXT: s_xor_b32 s1, s1, s2
; GFX8-NEXT: s_and_b32 s0, s0, s1
; GFX8-NEXT: s_xor_b32 s0, s2, s0
; GFX8-NEXT: s_andn2_b32 s2, s2, s0
; GFX8-NEXT: s_and_b32 s0, s1, s0
; GFX8-NEXT: s_or_b32 s0, s0, s2
; GFX8-NEXT: v_mov_b32_e32 v0, s4
; GFX8-NEXT: v_mov_b32_e32 v1, s5
; GFX8-NEXT: v_mov_b32_e32 v2, s0
@@ -163,9 +163,9 @@ define amdgpu_kernel void @s_bfi_sha256_ch(ptr addrspace(1) %out, i32 %x, i32 %y
; GFX10-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x24
; GFX10-NEXT: v_mov_b32_e32 v0, 0
; GFX10-NEXT: s_waitcnt lgkmcnt(0)
; GFX10-NEXT: s_xor_b32 s1, s1, s2
; GFX10-NEXT: s_and_b32 s0, s0, s1
; GFX10-NEXT: s_xor_b32 s0, s2, s0
; GFX10-NEXT: s_andn2_b32 s2, s2, s0
; GFX10-NEXT: s_and_b32 s0, s1, s0
; GFX10-NEXT: s_or_b32 s0, s0, s2
; GFX10-NEXT: v_mov_b32_e32 v1, s0
; GFX10-NEXT: global_store_dword v0, v1, s[4:5]
; GFX10-NEXT: s_endpgm
@@ -317,19 +317,26 @@ entry:
define amdgpu_ps float @s_s_v_bfi_sha256_ch(i32 inreg %x, i32 inreg %y, i32 %z) {
; GFX7-LABEL: s_s_v_bfi_sha256_ch:
; GFX7: ; %bb.0: ; %entry
; GFX7-NEXT: v_mov_b32_e32 v1, s0
; GFX7-NEXT: v_bfi_b32 v0, v1, s1, v0
; GFX7-NEXT: s_not_b32 s1, s1
; GFX7-NEXT: v_or_b32_e32 v0, s0, v0
; GFX7-NEXT: s_nand_b32 s0, s1, s0
; GFX7-NEXT: v_and_b32_e32 v0, s0, v0
; GFX7-NEXT: ; return to shader part epilog
;
; GFX8-LABEL: s_s_v_bfi_sha256_ch:
; GFX8: ; %bb.0: ; %entry
; GFX8-NEXT: v_mov_b32_e32 v1, s0
; GFX8-NEXT: v_bfi_b32 v0, v1, s1, v0
; GFX8-NEXT: s_not_b32 s1, s1
; GFX8-NEXT: v_or_b32_e32 v0, s0, v0
; GFX8-NEXT: s_nand_b32 s0, s1, s0
; GFX8-NEXT: v_and_b32_e32 v0, s0, v0
; GFX8-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: s_s_v_bfi_sha256_ch:
; GFX10: ; %bb.0: ; %entry
; GFX10-NEXT: v_bfi_b32 v0, s0, s1, v0
; GFX10-NEXT: v_or_b32_e32 v0, s0, v0
; GFX10-NEXT: s_not_b32 s1, s1
; GFX10-NEXT: s_nand_b32 s0, s1, s0
; GFX10-NEXT: v_and_b32_e32 v0, s0, v0
; GFX10-NEXT: ; return to shader part epilog
;
; GFX8-GISEL-LABEL: s_s_v_bfi_sha256_ch:
@@ -350,30 +357,40 @@ entry:
ret float %cast
}

define amdgpu_ps float @s_v_v_bfi_sha256_ch(i32 inreg %x, i32 %y, i32 %z) {
define amdgpu_ps float @s_v_v_bfi_sha256_ch(i32 inreg %x, i32 inreg %y, i32 %z) {
; GFX7-LABEL: s_v_v_bfi_sha256_ch:
; GFX7: ; %bb.0: ; %entry
; GFX7-NEXT: v_bfi_b32 v0, s0, v0, v1
; GFX7-NEXT: s_not_b32 s1, s1
; GFX7-NEXT: v_or_b32_e32 v0, s0, v0
; GFX7-NEXT: s_nand_b32 s0, s1, s0
; GFX7-NEXT: v_and_b32_e32 v0, s0, v0
; GFX7-NEXT: ; return to shader part epilog
;
; GFX8-LABEL: s_v_v_bfi_sha256_ch:
; GFX8: ; %bb.0: ; %entry
; GFX8-NEXT: v_bfi_b32 v0, s0, v0, v1
; GFX8-NEXT: s_not_b32 s1, s1
; GFX8-NEXT: v_or_b32_e32 v0, s0, v0
; GFX8-NEXT: s_nand_b32 s0, s1, s0
; GFX8-NEXT: v_and_b32_e32 v0, s0, v0
; GFX8-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: s_v_v_bfi_sha256_ch:
; GFX10: ; %bb.0: ; %entry
; GFX10-NEXT: v_bfi_b32 v0, s0, v0, v1
; GFX10-NEXT: v_or_b32_e32 v0, s0, v0
; GFX10-NEXT: s_not_b32 s1, s1
; GFX10-NEXT: s_nand_b32 s0, s1, s0
; GFX10-NEXT: v_and_b32_e32 v0, s0, v0
; GFX10-NEXT: ; return to shader part epilog
;
; GFX8-GISEL-LABEL: s_v_v_bfi_sha256_ch:
; GFX8-GISEL: ; %bb.0: ; %entry
; GFX8-GISEL-NEXT: v_bfi_b32 v0, s0, v0, v1
; GFX8-GISEL-NEXT: v_mov_b32_e32 v1, s0
; GFX8-GISEL-NEXT: v_bfi_b32 v0, v1, s1, v0
; GFX8-GISEL-NEXT: ; return to shader part epilog
;
; GFX10-GISEL-LABEL: s_v_v_bfi_sha256_ch:
; GFX10-GISEL: ; %bb.0: ; %entry
; GFX10-GISEL-NEXT: v_bfi_b32 v0, s0, v0, v1
; GFX10-GISEL-NEXT: v_bfi_b32 v0, s0, s1, v0
; GFX10-GISEL-NEXT: ; return to shader part epilog
entry:
%xor0 = xor i32 %y, %z
@@ -1008,24 +1025,32 @@ define amdgpu_ps <2 x float> @v_s_s_bitselect_i64_pat_1(i64 %a, i64 inreg %b, i6
define amdgpu_ps <2 x float> @s_s_v_bitselect_i64_pat_1(i64 inreg %a, i64 inreg %b, i64 %mask) {
; GFX7-LABEL: s_s_v_bitselect_i64_pat_1:
; GFX7: ; %bb.0:
; GFX7-NEXT: v_mov_b32_e32 v2, s1
; GFX7-NEXT: v_bfi_b32 v1, s3, v2, v1
; GFX7-NEXT: v_mov_b32_e32 v2, s0
; GFX7-NEXT: v_bfi_b32 v0, s2, v2, v0
; GFX7-NEXT: s_not_b64 s[0:1], s[0:1]
; GFX7-NEXT: v_or_b32_e32 v1, s3, v1
; GFX7-NEXT: v_or_b32_e32 v0, s2, v0
; GFX7-NEXT: s_nand_b64 s[0:1], s[0:1], s[2:3]
; GFX7-NEXT: v_and_b32_e32 v1, s1, v1
; GFX7-NEXT: v_and_b32_e32 v0, s0, v0
; GFX7-NEXT: ; return to shader part epilog
;
; GFX8-LABEL: s_s_v_bitselect_i64_pat_1:
; GFX8: ; %bb.0:
; GFX8-NEXT: v_mov_b32_e32 v2, s1
; GFX8-NEXT: v_bfi_b32 v1, s3, v2, v1
; GFX8-NEXT: v_mov_b32_e32 v2, s0
; GFX8-NEXT: v_bfi_b32 v0, s2, v2, v0
; GFX8-NEXT: s_not_b64 s[0:1], s[0:1]
; GFX8-NEXT: v_or_b32_e32 v1, s3, v1
; GFX8-NEXT: v_or_b32_e32 v0, s2, v0
; GFX8-NEXT: s_nand_b64 s[0:1], s[0:1], s[2:3]
; GFX8-NEXT: v_and_b32_e32 v1, s1, v1
; GFX8-NEXT: v_and_b32_e32 v0, s0, v0
; GFX8-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: s_s_v_bitselect_i64_pat_1:
; GFX10: ; %bb.0:
; GFX10-NEXT: v_bfi_b32 v0, s2, s0, v0
; GFX10-NEXT: v_bfi_b32 v1, s3, s1, v1
; GFX10-NEXT: v_or_b32_e32 v1, s3, v1
; GFX10-NEXT: v_or_b32_e32 v0, s2, v0
; GFX10-NEXT: s_not_b64 s[0:1], s[0:1]
; GFX10-NEXT: s_nand_b64 s[0:1], s[0:1], s[2:3]
; GFX10-NEXT: v_and_b32_e32 v0, s0, v0
; GFX10-NEXT: v_and_b32_e32 v1, s1, v1
; GFX10-NEXT: ; return to shader part epilog
;
; GFX8-GISEL-LABEL: s_s_v_bitselect_i64_pat_1:
@@ -1495,9 +1520,9 @@ define amdgpu_kernel void @s_bitselect_i64_pat_1(i64 %a, i64 %b, i64 %mask) {
; GFX7-NEXT: s_mov_b32 s7, 0xf000
; GFX7-NEXT: s_mov_b32 s6, -1
; GFX7-NEXT: s_waitcnt lgkmcnt(0)
; GFX7-NEXT: s_xor_b64 s[0:1], s[0:1], s[4:5]
; GFX7-NEXT: s_and_b64 s[0:1], s[0:1], s[2:3]
; GFX7-NEXT: s_xor_b64 s[0:1], s[0:1], s[4:5]
; GFX7-NEXT: s_andn2_b64 s[4:5], s[4:5], s[2:3]
; GFX7-NEXT: s_or_b64 s[0:1], s[0:1], s[4:5]
; GFX7-NEXT: s_add_u32 s0, s0, 10
; GFX7-NEXT: s_addc_u32 s1, s1, 0
; GFX7-NEXT: v_mov_b32_e32 v0, s0
@@ -1510,9 +1535,9 @@ define amdgpu_kernel void @s_bitselect_i64_pat_1(i64 %a, i64 %b, i64 %mask) {
; GFX8-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x24
; GFX8-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x34
; GFX8-NEXT: s_waitcnt lgkmcnt(0)
; GFX8-NEXT: s_xor_b64 s[0:1], s[0:1], s[4:5]
; GFX8-NEXT: s_and_b64 s[0:1], s[0:1], s[2:3]
; GFX8-NEXT: s_xor_b64 s[0:1], s[0:1], s[4:5]
; GFX8-NEXT: s_andn2_b64 s[4:5], s[4:5], s[2:3]
; GFX8-NEXT: s_or_b64 s[0:1], s[0:1], s[4:5]
; GFX8-NEXT: s_add_u32 s0, s0, 10
; GFX8-NEXT: s_addc_u32 s1, s1, 0
; GFX8-NEXT: v_mov_b32_e32 v0, s0
@@ -1526,9 +1551,9 @@ define amdgpu_kernel void @s_bitselect_i64_pat_1(i64 %a, i64 %b, i64 %mask) {
; GFX10-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x24
; GFX10-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x34
; GFX10-NEXT: s_waitcnt lgkmcnt(0)
; GFX10-NEXT: s_xor_b64 s[0:1], s[0:1], s[4:5]
; GFX10-NEXT: s_and_b64 s[0:1], s[0:1], s[2:3]
; GFX10-NEXT: s_xor_b64 s[0:1], s[0:1], s[4:5]
; GFX10-NEXT: s_andn2_b64 s[4:5], s[4:5], s[2:3]
; GFX10-NEXT: s_or_b64 s[0:1], s[0:1], s[4:5]
; GFX10-NEXT: s_add_u32 s0, s0, 10
; GFX10-NEXT: s_addc_u32 s1, s1, 0
; GFX10-NEXT: v_mov_b32_e32 v0, s0
@@ -1583,9 +1608,9 @@ define amdgpu_kernel void @s_bitselect_i64_pat_2(i64 %a, i64 %b, i64 %mask) {
; GFX7-NEXT: s_mov_b32 s7, 0xf000
; GFX7-NEXT: s_mov_b32 s6, -1
; GFX7-NEXT: s_waitcnt lgkmcnt(0)
; GFX7-NEXT: s_xor_b64 s[0:1], s[0:1], s[4:5]
; GFX7-NEXT: s_and_b64 s[0:1], s[0:1], s[2:3]
; GFX7-NEXT: s_xor_b64 s[0:1], s[0:1], s[4:5]
; GFX7-NEXT: s_andn2_b64 s[4:5], s[4:5], s[2:3]
; GFX7-NEXT: s_or_b64 s[0:1], s[0:1], s[4:5]
; GFX7-NEXT: s_add_u32 s0, s0, 10
; GFX7-NEXT: s_addc_u32 s1, s1, 0
; GFX7-NEXT: v_mov_b32_e32 v0, s0
@@ -1598,9 +1623,9 @@ define amdgpu_kernel void @s_bitselect_i64_pat_2(i64 %a, i64 %b, i64 %mask) {
; GFX8-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x24
; GFX8-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x34
; GFX8-NEXT: s_waitcnt lgkmcnt(0)
; GFX8-NEXT: s_xor_b64 s[0:1], s[0:1], s[4:5]
; GFX8-NEXT: s_and_b64 s[0:1], s[0:1], s[2:3]
; GFX8-NEXT: s_xor_b64 s[0:1], s[0:1], s[4:5]
; GFX8-NEXT: s_andn2_b64 s[4:5], s[4:5], s[2:3]
; GFX8-NEXT: s_or_b64 s[0:1], s[0:1], s[4:5]
; GFX8-NEXT: s_add_u32 s0, s0, 10
; GFX8-NEXT: s_addc_u32 s1, s1, 0
; GFX8-NEXT: v_mov_b32_e32 v0, s0
@@ -1614,9 +1639,9 @@ define amdgpu_kernel void @s_bitselect_i64_pat_2(i64 %a, i64 %b, i64 %mask) {
; GFX10-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x24
; GFX10-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x34
; GFX10-NEXT: s_waitcnt lgkmcnt(0)
; GFX10-NEXT: s_xor_b64 s[0:1], s[0:1], s[4:5]
; GFX10-NEXT: s_and_b64 s[0:1], s[0:1], s[2:3]
; GFX10-NEXT: s_xor_b64 s[0:1], s[0:1], s[4:5]
; GFX10-NEXT: s_andn2_b64 s[4:5], s[4:5], s[2:3]
; GFX10-NEXT: s_or_b64 s[0:1], s[0:1], s[4:5]
; GFX10-NEXT: s_add_u32 s0, s0, 10
; GFX10-NEXT: s_addc_u32 s1, s1, 0
; GFX10-NEXT: v_mov_b32_e32 v0, s0
17 changes: 9 additions & 8 deletions llvm/test/CodeGen/AMDGPU/commute-compares.ll
@@ -541,19 +541,20 @@ define amdgpu_kernel void @commute_sgt_neg1_i64(ptr addrspace(1) %out, ptr addrs
; GCN-LABEL: commute_sgt_neg1_i64:
; GCN: ; %bb.0:
; GCN-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x9
; GCN-NEXT: s_mov_b32 s7, 0xf000
; GCN-NEXT: s_mov_b32 s6, 0
; GCN-NEXT: v_lshlrev_b32_e32 v1, 3, v0
; GCN-NEXT: s_mov_b32 s6, 0
; GCN-NEXT: s_mov_b32 s7, 0xf000
; GCN-NEXT: v_mov_b32_e32 v2, 0
; GCN-NEXT: s_mov_b64 s[10:11], s[6:7]
; GCN-NEXT: s_waitcnt lgkmcnt(0)
; GCN-NEXT: s_mov_b64 s[4:5], s[2:3]
; GCN-NEXT: buffer_load_dwordx2 v[3:4], v[1:2], s[4:7], 0 addr64
; GCN-NEXT: s_mov_b64 s[8:9], s[2:3]
; GCN-NEXT: buffer_load_dword v3, v[1:2], s[8:11], 0 addr64 offset:4
; GCN-NEXT: s_mov_b64 s[4:5], s[0:1]
; GCN-NEXT: v_lshlrev_b32_e32 v1, 2, v0
; GCN-NEXT: s_mov_b64 s[2:3], s[6:7]
; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: v_cmp_lt_i64_e32 vcc, -1, v[3:4]
; GCN-NEXT: v_cndmask_b32_e64 v0, 0, -1, vcc
; GCN-NEXT: buffer_store_dword v0, v[1:2], s[0:3], 0 addr64
; GCN-NEXT: v_ashrrev_i32_e32 v0, 31, v3
; GCN-NEXT: v_not_b32_e32 v0, v0
; GCN-NEXT: buffer_store_dword v0, v[1:2], s[4:7], 0 addr64
; GCN-NEXT: s_endpgm
%tid = call i32 @llvm.amdgcn.workitem.id.x() #0
%gep.in = getelementptr i64, ptr addrspace(1) %in, i32 %tid