Skip to content

[AMDGPU] Fix SDWA commuting #106920

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 4, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -2788,6 +2788,9 @@ MachineInstr *SIInstrInfo::commuteInstructionImpl(MachineInstr &MI, bool NewMI,
swapSourceModifiers(MI, Src0, AMDGPU::OpName::src0_modifiers,
Src1, AMDGPU::OpName::src1_modifiers);

swapSourceModifiers(MI, Src0, AMDGPU::OpName::src0_sel, Src1,
AMDGPU::OpName::src1_sel);

CommutedMI->setDesc(get(CommutedOpcode));
}

Expand Down
13 changes: 8 additions & 5 deletions llvm/lib/Target/AMDGPU/VOP2Instructions.td
Original file line number Diff line number Diff line change
Expand Up @@ -167,9 +167,11 @@ multiclass VOP2Inst_e64<string opName,
}

multiclass VOP2Inst_sdwa<string opName,
VOPProfile P> {
VOPProfile P,
string revOp = opName> {
if P.HasExtSDWA then
def _sdwa : VOP2_SDWA_Pseudo <opName, P>;
def _sdwa : VOP2_SDWA_Pseudo <opName, P>,
Commutable_REV<revOp#"_sdwa", !eq(revOp, opName)>;
}

multiclass VOP2Inst<string opName,
Expand All @@ -178,7 +180,7 @@ multiclass VOP2Inst<string opName,
string revOp = opName> :
VOP2Inst_e32<opName, P, node, revOp>,
VOP2Inst_e64<opName, P, node, revOp>,
VOP2Inst_sdwa<opName, P> {
VOP2Inst_sdwa<opName, P, revOp> {
if P.HasExtDPP then
def _dpp : VOP2_DPP_Pseudo <opName, P>;
}
Expand Down Expand Up @@ -225,7 +227,7 @@ multiclass VOP2Inst_VOPD<string opName,
string revOp = opName> :
VOP2Inst_e32_VOPD<opName, P, VOPDOp, VOPDName, node, revOp>,
VOP2Inst_e64<opName, P, node, revOp>,
VOP2Inst_sdwa<opName, P> {
VOP2Inst_sdwa<opName, P, revOp> {
if P.HasExtDPP then
def _dpp : VOP2_DPP_Pseudo <opName, P>;
}
Expand All @@ -243,7 +245,8 @@ multiclass VOP2bInst <string opName,
}

if P.HasExtSDWA then
def _sdwa : VOP2_SDWA_Pseudo <opName, P> {
def _sdwa : VOP2_SDWA_Pseudo <opName, P>,
Commutable_REV<revOp#"_sdwa", !eq(revOp, opName)> {
let AsmMatchConverter = "cvtSdwaVOP2b";
}
if P.HasExtDPP then
Expand Down
17 changes: 17 additions & 0 deletions llvm/test/CodeGen/AMDGPU/commute-op-sel.mir
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
# RUN: llc -mtriple=amdgcn -mcpu=gfx1030 -run-pass=machine-cse -verify-machineinstrs %s -o - 2>&1 | FileCheck --check-prefix=GCN %s

# GCN-LABEL: name: test_machine_cse_op_sel
# GCN: %2:vgpr_32 = V_ADD_NC_U16_e64 0, %0, 0, %1, 1, 0, implicit $mode, implicit $exec
# GCN: %3:vgpr_32 = V_ADD_NC_U16_e64 0, %1, 0, %0, 1, 0, implicit $mode, implicit $exec
Comment on lines +4 to +5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No commute happened, and the modifiers aren't set?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we do not define VOP3 instructions as commutable. Even if both op_sel are 0, the two V_ADD_NC_U16_e64 are not merged.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a bug that should be fixed

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be done with separate PR? This PR is for SDWA commute issue which causes rocFFT to fail. Whereas VOP3 not commute is missed performance opportunity. The current issue is kind of urgent.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I just noticed you have approved this PR. I will commit it first and will look into VOP3 commuting opportunity later.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

open an issue to track this #111205

# GCN: DS_WRITE2_B32_gfx9 undef %4:vgpr_32, %2, %3, 0, 1, 0, implicit $exec
---
name: test_machine_cse_op_sel
body: |
bb.0:
%0:vgpr_32 = IMPLICIT_DEF
%1:vgpr_32 = IMPLICIT_DEF
%2:vgpr_32 = V_ADD_NC_U16_e64 0, %0, 0, %1, 1, 0, implicit $mode, implicit $exec
%3:vgpr_32 = V_ADD_NC_U16_e64 0, %1, 0, %0, 1, 0, implicit $mode, implicit $exec
DS_WRITE2_B32_gfx9 undef %4:vgpr_32, %2, %3, 0, 1, 0, implicit $exec
...

46 changes: 46 additions & 0 deletions llvm/test/CodeGen/AMDGPU/sdwa-commute.ll
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1030 < %s | FileCheck %s

%ret_struct = type { half, half }

define void @extracted_values(ptr %ret_struct, ptr addrspace(3) %arg0, ptr addrspace(3) %arg1, ptr addrspace(3) %arg2, ptr addrspace(3) %arg3) {
; CHECK-LABEL: extracted_values:
; CHECK: ; %bb.0: ; %entry
; CHECK-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; CHECK-NEXT: ds_read_b32 v3, v3
; CHECK-NEXT: ds_read_b32 v4, v4
; CHECK-NEXT: ds_read_b32 v2, v2
; CHECK-NEXT: ds_read_b32 v5, v5
; CHECK-NEXT: s_waitcnt lgkmcnt(2)
; CHECK-NEXT: v_sub_f16_sdwa v6, v3, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; CHECK-NEXT: v_sub_f16_sdwa v3, v4, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: v_sub_f16_sdwa v7, v2, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; CHECK-NEXT: v_sub_f16_sdwa v2, v5, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; CHECK-NEXT: v_add_f16_e32 v4, v6, v7
; CHECK-NEXT: v_add_f16_e32 v2, v3, v2
; CHECK-NEXT: flat_store_short v[0:1], v4
; CHECK-NEXT: flat_store_short v[0:1], v2 offset:2
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: s_setpc_b64 s[30:31]
entry:
%tmp0 = load <2 x half>, ptr addrspace(3) %arg1, align 4
%tmp1 = extractelement <2 x half> %tmp0, i64 1
%tmp2 = load <2 x half>, ptr addrspace(3) %arg2, align 4
%tmp3 = extractelement <2 x half> %tmp2, i64 1
%tmp4 = fsub contract half %tmp1, %tmp3
%tmp5 = load <2 x half>, ptr addrspace(3) %arg0, align 4
%tmp6 = extractelement <2 x half> %tmp5, i64 1
%tmp7 = load <2 x half>, ptr addrspace(3) %arg3, align 4
%tmp8 = extractelement <2 x half> %tmp7, i64 1
%tmp9 = fsub contract half %tmp6, %tmp8
%tmp10 = fadd contract half %tmp4, %tmp9
%tmp11 = fsub contract half %tmp3, %tmp1
%tmp12 = fsub contract half %tmp8, %tmp6
%tmp13 = fadd contract half %tmp11, %tmp12
%field_ptr = getelementptr %ret_struct, ptr %ret_struct, i32 0, i32 0
store half %tmp10, ptr %field_ptr, align 2
%field_ptr1 = getelementptr %ret_struct, ptr %ret_struct, i32 0, i32 1
store half %tmp13, ptr %field_ptr1, align 2
ret void
}
57 changes: 57 additions & 0 deletions llvm/test/CodeGen/AMDGPU/sdwa-cse.mir
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
# RUN: llc -mtriple=amdgcn -mcpu=gfx1030 -run-pass=machine-cse -verify-machineinstrs %s -o - 2>&1 | FileCheck --check-prefix=GCN %s

# GCN-LABEL: name: test_machine_cse_subtraction_sdwa_f16_no_merge
# GCN: %2:vgpr_32 = contract nofpexcept V_SUB_F16_sdwa 0, %0.sub0, 0, %1.sub0, 0, 0, 6, 0, 5, 5, implicit $mode, implicit $exec
# GCN: %3:vgpr_32 = contract nofpexcept V_SUB_F16_sdwa 0, %1.sub0, 0, %0.sub0, 0, 0, 6, 0, 5, 5, implicit $mode, implicit $exec
# GCN: %5:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %2, %4, implicit $mode, implicit $exec
# GCN: %6:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %3, %4, implicit $mode, implicit $exec
# GCN: DS_WRITE2_B32_gfx9 undef %7:vgpr_32, %5, %6, 0, 1, 0, implicit $exec
---
name: test_machine_cse_subtraction_sdwa_f16_no_merge
body: |
bb.0:
%0:vreg_64 = IMPLICIT_DEF
%1:vreg_64 = IMPLICIT_DEF
%2:vgpr_32 = contract nofpexcept V_SUB_F16_sdwa 0, %0.sub0, 0, %1.sub0, 0, 0, 6, 0, 5, 5, implicit $mode, implicit $exec
%3:vgpr_32 = contract nofpexcept V_SUB_F16_sdwa 0, %1.sub0, 0, %0.sub0, 0, 0, 6, 0, 5, 5, implicit $mode, implicit $exec
%4:vgpr_32 = IMPLICIT_DEF
%5:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %2, %4, implicit $mode, implicit $exec
%6:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %3, %4, implicit $mode, implicit $exec
DS_WRITE2_B32_gfx9 undef %7:vgpr_32, %5, %6, 0, 1, 0, implicit $exec
...

# GCN-LABEL: name: test_machine_cse_subtraction_sdwa_f16_merge_same_src_sel
# GCN: %2:vgpr_32 = contract nofpexcept V_SUB_F16_sdwa 0, %0.sub0, 0, %1.sub0, 0, 0, 6, 0, 5, 5, implicit $mode, implicit $exec
# GCN: %5:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %2, %4, implicit $mode, implicit $exec
# GCN: DS_WRITE2_B32_gfx9 undef %7:vgpr_32, %5, %5, 0, 1, 0, implicit $exec
---
name: test_machine_cse_subtraction_sdwa_f16_merge_same_src_sel
body: |
bb.0:
%0:vreg_64 = IMPLICIT_DEF
%1:vreg_64 = IMPLICIT_DEF
%2:vgpr_32 = contract nofpexcept V_SUB_F16_sdwa 0, %0.sub0, 0, %1.sub0, 0, 0, 6, 0, 5, 5, implicit $mode, implicit $exec
%3:vgpr_32 = contract nofpexcept V_SUBREV_F16_sdwa 0, %1.sub0, 0, %0.sub0, 0, 0, 6, 0, 5, 5, implicit $mode, implicit $exec
%4:vgpr_32 = IMPLICIT_DEF
%5:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %2, %4, implicit $mode, implicit $exec
%6:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %3, %4, implicit $mode, implicit $exec
DS_WRITE2_B32_gfx9 undef %7:vgpr_32, %5, %6, 0, 1, 0, implicit $exec
...

# GCN-LABEL: name: test_machine_cse_subtraction_sdwa_f16_merge_diff_src_sel
# GCN: %2:vgpr_32 = contract nofpexcept V_SUB_F16_sdwa 0, %0.sub0, 0, %1.sub0, 0, 0, 6, 0, 6, 5, implicit $mode, implicit $exec
# GCN: %5:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %2, %4, implicit $mode, implicit $exec
# GCN: DS_WRITE2_B32_gfx9 undef %7:vgpr_32, %5, %5, 0, 1, 0, implicit $exec
---
name: test_machine_cse_subtraction_sdwa_f16_merge_diff_src_sel
body: |
bb.0:
%0:vreg_64 = IMPLICIT_DEF
%1:vreg_64 = IMPLICIT_DEF
%2:vgpr_32 = contract nofpexcept V_SUB_F16_sdwa 0, %0.sub0, 0, %1.sub0, 0, 0, 6, 0, 6, 5, implicit $mode, implicit $exec
%3:vgpr_32 = contract nofpexcept V_SUBREV_F16_sdwa 0, %1.sub0, 0, %0.sub0, 0, 0, 6, 0, 5, 6, implicit $mode, implicit $exec
%4:vgpr_32 = IMPLICIT_DEF
%5:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %2, %4, implicit $mode, implicit $exec
%6:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %3, %4, implicit $mode, implicit $exec
DS_WRITE2_B32_gfx9 undef %7:vgpr_32, %5, %6, 0, 1, 0, implicit $exec
...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test an op_sel case too

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since op_sel is not in SDWA instructions, added another test commute-op-sel.mir

Loading