Skip to content

[AMDGPU][GlobalISel] Align selectVOP3PMadMixModsImpl with the SelectionDAG counterpart #110168

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 8, 2024

Conversation

shiltian
Copy link
Contributor

@shiltian shiltian commented Sep 26, 2024

The current selectVOP3PMadMixModsImpl can produce V_MAD_FIX_F32 instruction
that violates constant bus restriction, while its SelectionDAG counterpart
doesn't. The culprit is in the copy stripping while the SelectionDAG version
only has a bitcast stripping. This PR simply aligns the two version.

Copy link
Contributor Author

shiltian commented Sep 26, 2024

@llvmbot
Copy link
Member

llvmbot commented Sep 26, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Shilei Tian (shiltian)

Changes

The current selectVOP3PMadMixModsImpl can produce V_MAD_FIX_F32 instruction
that violates constant bus restriction, while its SelectionDAG counterpart
doesn't. The culprit is in the copy stripping while the SelectionDAG version
only has a bitcast stripping. This PR simply aligns the two version.


Full diff: https://github.com/llvm/llvm-project/pull/110168.diff

4 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp (+5-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-add-ext-mul.ll (+15-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fdiv.f16.ll (+5-5)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/madmix-constant-bus-violation.mir (+110)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index febf0711c7d574..c3df962d448f94 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -5312,26 +5312,20 @@ AMDGPUInstructionSelector::selectVOP3PMadMixModsImpl(MachineOperand &Root,
     // Only change Src if src modifier could be gained. In such cases new Src
     // could be sgpr but this does not violate constant bus restriction for
     // instruction that is being selected.
-    // Note: Src is not changed when there is only a simple sgpr to vgpr copy
-    // since this could violate constant bus restriction.
-    Register PeekSrc = stripCopy(Src, *MRI);
+    Src = stripBitCast(Src, *MRI);
 
     const auto CheckAbsNeg = [&]() {
       // Be careful about folding modifiers if we already have an abs. fneg is
       // applied last, so we don't want to apply an earlier fneg.
       if ((Mods & SISrcMods::ABS) == 0) {
         unsigned ModsTmp;
-        std::tie(PeekSrc, ModsTmp) = selectVOP3ModsImpl(PeekSrc);
+        std::tie(Src, ModsTmp) = selectVOP3ModsImpl(Src);
 
-        if ((ModsTmp & SISrcMods::NEG) != 0) {
+        if ((ModsTmp & SISrcMods::NEG) != 0)
           Mods ^= SISrcMods::NEG;
-          Src = PeekSrc;
-        }
 
-        if ((ModsTmp & SISrcMods::ABS) != 0) {
+        if ((ModsTmp & SISrcMods::ABS) != 0)
           Mods |= SISrcMods::ABS;
-          Src = PeekSrc;
-        }
       }
     };
 
@@ -5344,8 +5338,7 @@ AMDGPUInstructionSelector::selectVOP3PMadMixModsImpl(MachineOperand &Root,
 
     Mods |= SISrcMods::OP_SEL_1;
 
-    if (isExtractHiElt(*MRI, PeekSrc, PeekSrc)) {
-      Src = PeekSrc;
+    if (isExtractHiElt(*MRI, Src, Src)) {
       Mods |= SISrcMods::OP_SEL_0;
       CheckAbsNeg();
     }
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-add-ext-mul.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-add-ext-mul.ll
index 4ebe1c499a1769..4d603f7487754a 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-add-ext-mul.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-add-ext-mul.ll
@@ -73,10 +73,14 @@ define amdgpu_vs <5 x float> @test_5xf16_5xf32_add_ext_mul(<5 x half> inreg %x,
 ; GFX10-FAST-DENORM-NEXT:    v_mov_b32_e32 v2, s8
 ; GFX10-FAST-DENORM-NEXT:    v_mov_b32_e32 v3, s9
 ; GFX10-FAST-DENORM-NEXT:    v_mov_b32_e32 v4, s10
+; GFX10-FAST-DENORM-NEXT:    s_lshr_b32 s11, s0, 16
+; GFX10-FAST-DENORM-NEXT:    s_lshr_b32 s12, s1, 16
+; GFX10-FAST-DENORM-NEXT:    s_lshr_b32 s13, s3, 16
+; GFX10-FAST-DENORM-NEXT:    s_lshr_b32 s14, s4, 16
 ; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v0, s0, s3, v0 op_sel_hi:[1,1,0]
-; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v1, s0, s3, v1 op_sel:[1,1,0] op_sel_hi:[1,1,0]
+; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v1, s11, s13, v1 op_sel_hi:[1,1,0]
 ; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v2, s1, s4, v2 op_sel_hi:[1,1,0]
-; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v3, s1, s4, v3 op_sel:[1,1,0] op_sel_hi:[1,1,0]
+; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v3, s12, s14, v3 op_sel_hi:[1,1,0]
 ; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v4, s2, s5, v4 op_sel_hi:[1,1,0]
 ; GFX10-FAST-DENORM-NEXT:    ; return to shader part epilog
 .entry:
@@ -117,12 +121,18 @@ define amdgpu_vs <6 x float> @test_6xf16_6xf32_add_ext_mul_rhs(<6 x half> inreg
 ; GFX10-FAST-DENORM-NEXT:    v_mov_b32_e32 v3, s9
 ; GFX10-FAST-DENORM-NEXT:    v_mov_b32_e32 v4, s10
 ; GFX10-FAST-DENORM-NEXT:    v_mov_b32_e32 v5, s11
+; GFX10-FAST-DENORM-NEXT:    s_lshr_b32 s12, s0, 16
+; GFX10-FAST-DENORM-NEXT:    s_lshr_b32 s13, s1, 16
+; GFX10-FAST-DENORM-NEXT:    s_lshr_b32 s6, s2, 16
+; GFX10-FAST-DENORM-NEXT:    s_lshr_b32 s14, s3, 16
+; GFX10-FAST-DENORM-NEXT:    s_lshr_b32 s15, s4, 16
+; GFX10-FAST-DENORM-NEXT:    s_lshr_b32 s16, s5, 16
 ; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v0, s0, s3, v0 op_sel_hi:[1,1,0]
-; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v1, s0, s3, v1 op_sel:[1,1,0] op_sel_hi:[1,1,0]
+; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v1, s12, s14, v1 op_sel_hi:[1,1,0]
 ; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v2, s1, s4, v2 op_sel_hi:[1,1,0]
-; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v3, s1, s4, v3 op_sel:[1,1,0] op_sel_hi:[1,1,0]
+; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v3, s13, s15, v3 op_sel_hi:[1,1,0]
 ; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v4, s2, s5, v4 op_sel_hi:[1,1,0]
-; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v5, s2, s5, v5 op_sel:[1,1,0] op_sel_hi:[1,1,0]
+; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v5, s6, s16, v5 op_sel_hi:[1,1,0]
 ; GFX10-FAST-DENORM-NEXT:    ; return to shader part epilog
 .entry:
     %a = fmul fast <6 x half> %x, %y
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/fdiv.f16.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/fdiv.f16.ll
index 89cd18ad9be70b..1a98285230b2cd 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/fdiv.f16.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/fdiv.f16.ll
@@ -2555,9 +2555,9 @@ define amdgpu_ps i32 @s_fdiv_v2f16(i32 inreg %a.arg, i32 inreg %b.arg) {
 ; GFX9-FLUSH-NEXT:    v_rcp_f32_e32 v1, v1
 ; GFX9-FLUSH-NEXT:    v_mad_mixlo_f16 v0, s0, v0, 0 op_sel_hi:[1,0,0]
 ; GFX9-FLUSH-NEXT:    v_div_fixup_f16 v0, v0, v2, s0
-; GFX9-FLUSH-NEXT:    v_mad_mixlo_f16 v1, s0, v1, 0 op_sel:[1,0,0] op_sel_hi:[1,0,0]
-; GFX9-FLUSH-NEXT:    v_mov_b32_e32 v2, s3
-; GFX9-FLUSH-NEXT:    v_div_fixup_f16 v1, v1, s2, v2
+; GFX9-FLUSH-NEXT:    v_mov_b32_e32 v2, s2
+; GFX9-FLUSH-NEXT:    v_mad_mixlo_f16 v1, s3, v1, 0 op_sel_hi:[1,0,0]
+; GFX9-FLUSH-NEXT:    v_div_fixup_f16 v1, v1, v2, s3
 ; GFX9-FLUSH-NEXT:    v_pack_b32_f16 v0, v0, v1
 ; GFX9-FLUSH-NEXT:    v_readfirstlane_b32 s0, v0
 ; GFX9-FLUSH-NEXT:    ; return to shader part epilog
@@ -2571,7 +2571,7 @@ define amdgpu_ps i32 @s_fdiv_v2f16(i32 inreg %a.arg, i32 inreg %b.arg) {
 ; GFX10-NEXT:    v_rcp_f32_e32 v0, v0
 ; GFX10-NEXT:    v_rcp_f32_e32 v1, v1
 ; GFX10-NEXT:    v_fma_mixlo_f16 v0, s0, v0, 0 op_sel_hi:[1,0,0]
-; GFX10-NEXT:    v_fma_mixlo_f16 v1, s0, v1, 0 op_sel:[1,0,0] op_sel_hi:[1,0,0]
+; GFX10-NEXT:    v_fma_mixlo_f16 v1, s3, v1, 0 op_sel_hi:[1,0,0]
 ; GFX10-NEXT:    v_div_fixup_f16 v0, v0, s1, s0
 ; GFX10-NEXT:    v_div_fixup_f16 v1, v1, s2, s3
 ; GFX10-NEXT:    v_pack_b32_f16 v0, v0, v1
@@ -2588,7 +2588,7 @@ define amdgpu_ps i32 @s_fdiv_v2f16(i32 inreg %a.arg, i32 inreg %b.arg) {
 ; GFX11-NEXT:    v_rcp_f32_e32 v1, v1
 ; GFX11-NEXT:    s_waitcnt_depctr 0xfff
 ; GFX11-NEXT:    v_fma_mixlo_f16 v0, s0, v0, 0 op_sel_hi:[1,0,0]
-; GFX11-NEXT:    v_fma_mixlo_f16 v1, s0, v1, 0 op_sel:[1,0,0] op_sel_hi:[1,0,0]
+; GFX11-NEXT:    v_fma_mixlo_f16 v1, s3, v1, 0 op_sel_hi:[1,0,0]
 ; GFX11-NEXT:    v_div_fixup_f16 v0, v0, s1, s0
 ; GFX11-NEXT:    v_div_fixup_f16 v1, v1, s2, s3
 ; GFX11-NEXT:    v_pack_b32_f16 v0, v0, v1
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/madmix-constant-bus-violation.mir b/llvm/test/CodeGen/AMDGPU/GlobalISel/madmix-constant-bus-violation.mir
new file mode 100644
index 00000000000000..ba9cfb8e1d68ed
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/madmix-constant-bus-violation.mir
@@ -0,0 +1,110 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -global-isel=1 -mtriple=amdgcn -mcpu=gfx900 -run-pass=instruction-select,machineverifier -o - %s | FileCheck -check-prefixes=GFX9 %s
+
+---
+name: s_fdiv_v2f16
+legalized: true
+regBankSelected: true
+machineFunctionInfo:
+  mode:
+    fp32-output-denormals: false
+    fp32-input-denormals: false
+body: |
+  bb.0:
+    ; GFX9-LABEL: name: s_fdiv_v2f16
+    ; GFX9: [[COPY:%[0-9]+]]:sreg_32 = COPY $sgpr0
+    ; GFX9-NEXT: [[COPY1:%[0-9]+]]:sreg_32 = COPY $sgpr1
+    ; GFX9-NEXT: [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 16
+    ; GFX9-NEXT: [[S_LSHR_B32_:%[0-9]+]]:sreg_32 = S_LSHR_B32 [[COPY]], [[S_MOV_B32_]], implicit-def dead $scc
+    ; GFX9-NEXT: [[S_LSHR_B32_1:%[0-9]+]]:sreg_32 = S_LSHR_B32 [[COPY1]], [[S_MOV_B32_]], implicit-def dead $scc
+    ; GFX9-NEXT: [[COPY2:%[0-9]+]]:vgpr_32 = COPY [[COPY]]
+    ; GFX9-NEXT: [[V_CVT_F32_F16_e64_:%[0-9]+]]:vgpr_32 = nofpexcept V_CVT_F32_F16_e64 0, [[COPY2]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[COPY3:%[0-9]+]]:vgpr_32 = COPY [[COPY1]]
+    ; GFX9-NEXT: [[V_CVT_F32_F16_e64_1:%[0-9]+]]:vgpr_32 = nofpexcept V_CVT_F32_F16_e64 0, [[COPY3]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_RCP_F32_e64_:%[0-9]+]]:vgpr_32 = nofpexcept V_RCP_F32_e64 0, [[V_CVT_F32_F16_e64_1]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_MUL_F32_e64_:%[0-9]+]]:vgpr_32 = nofpexcept V_MUL_F32_e64 0, [[V_CVT_F32_F16_e64_]], 0, [[V_RCP_F32_e64_]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_MAD_MIX_F32_:%[0-9]+]]:vgpr_32 = V_MAD_MIX_F32 9, [[COPY3]], 0, [[V_MUL_F32_e64_]], 8, [[COPY2]], 0, 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_MAC_F32_e64_:%[0-9]+]]:vgpr_32 = V_MAC_F32_e64 0, [[V_MAD_MIX_F32_]], 0, [[V_RCP_F32_e64_]], 0, [[V_MUL_F32_e64_]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_MAD_MIX_F32_1:%[0-9]+]]:vgpr_32 = V_MAD_MIX_F32 9, [[COPY3]], 0, [[V_MAC_F32_e64_]], 8, [[COPY2]], 0, 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_MUL_F32_e64_1:%[0-9]+]]:vgpr_32 = nofpexcept V_MUL_F32_e64 0, [[V_MAD_MIX_F32_1]], 0, [[V_RCP_F32_e64_]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 -8388608
+    ; GFX9-NEXT: [[COPY4:%[0-9]+]]:vgpr_32 = COPY [[S_MOV_B32_1]]
+    ; GFX9-NEXT: [[V_AND_B32_e64_:%[0-9]+]]:vgpr_32 = V_AND_B32_e64 [[V_MUL_F32_e64_1]], [[COPY4]], implicit $exec
+    ; GFX9-NEXT: [[V_ADD_F32_e64_:%[0-9]+]]:vgpr_32 = nofpexcept V_ADD_F32_e64 0, [[V_AND_B32_e64_]], 0, [[V_MAC_F32_e64_]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_CVT_F16_F32_e64_:%[0-9]+]]:vgpr_32 = nofpexcept V_CVT_F16_F32_e64 0, [[V_ADD_F32_e64_]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[COPY1]]
+    ; GFX9-NEXT: [[COPY6:%[0-9]+]]:vgpr_32 = COPY [[COPY]]
+    ; GFX9-NEXT: [[V_DIV_FIXUP_F16_gfx9_e64_:%[0-9]+]]:vgpr_32 = nofpexcept V_DIV_FIXUP_F16_gfx9_e64 0, [[V_CVT_F16_F32_e64_]], 0, [[COPY5]], 0, [[COPY6]], 0, 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[COPY7:%[0-9]+]]:vgpr_32 = COPY [[S_LSHR_B32_]]
+    ; GFX9-NEXT: [[V_CVT_F32_F16_e64_2:%[0-9]+]]:vgpr_32 = nofpexcept V_CVT_F32_F16_e64 0, [[COPY7]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[S_LSHR_B32_1]]
+    ; GFX9-NEXT: [[V_CVT_F32_F16_e64_3:%[0-9]+]]:vgpr_32 = nofpexcept V_CVT_F32_F16_e64 0, [[COPY8]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_RCP_F32_e64_1:%[0-9]+]]:vgpr_32 = nofpexcept V_RCP_F32_e64 0, [[V_CVT_F32_F16_e64_3]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_MUL_F32_e64_2:%[0-9]+]]:vgpr_32 = nofpexcept V_MUL_F32_e64 0, [[V_CVT_F32_F16_e64_2]], 0, [[V_RCP_F32_e64_1]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_MAD_MIX_F32_2:%[0-9]+]]:vgpr_32 = V_MAD_MIX_F32 9, [[COPY8]], 0, [[V_MUL_F32_e64_2]], 8, [[COPY7]], 0, 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_MAC_F32_e64_1:%[0-9]+]]:vgpr_32 = V_MAC_F32_e64 0, [[V_MAD_MIX_F32_2]], 0, [[V_RCP_F32_e64_1]], 0, [[V_MUL_F32_e64_2]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_MAD_MIX_F32_3:%[0-9]+]]:vgpr_32 = V_MAD_MIX_F32 9, [[COPY8]], 0, [[V_MAC_F32_e64_1]], 8, [[COPY7]], 0, 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_MUL_F32_e64_3:%[0-9]+]]:vgpr_32 = nofpexcept V_MUL_F32_e64 0, [[V_MAD_MIX_F32_3]], 0, [[V_RCP_F32_e64_1]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[COPY9:%[0-9]+]]:vgpr_32 = COPY [[S_MOV_B32_1]]
+    ; GFX9-NEXT: [[V_AND_B32_e64_1:%[0-9]+]]:vgpr_32 = V_AND_B32_e64 [[V_MUL_F32_e64_3]], [[COPY9]], implicit $exec
+    ; GFX9-NEXT: [[V_ADD_F32_e64_1:%[0-9]+]]:vgpr_32 = nofpexcept V_ADD_F32_e64 0, [[V_AND_B32_e64_1]], 0, [[V_MAC_F32_e64_1]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_CVT_F16_F32_e64_1:%[0-9]+]]:vgpr_32 = nofpexcept V_CVT_F16_F32_e64 0, [[V_ADD_F32_e64_1]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[COPY10:%[0-9]+]]:vgpr_32 = COPY [[S_LSHR_B32_1]]
+    ; GFX9-NEXT: [[COPY11:%[0-9]+]]:vgpr_32 = COPY [[S_LSHR_B32_]]
+    ; GFX9-NEXT: [[V_DIV_FIXUP_F16_gfx9_e64_1:%[0-9]+]]:vgpr_32 = nofpexcept V_DIV_FIXUP_F16_gfx9_e64 0, [[V_CVT_F16_F32_e64_1]], 0, [[COPY10]], 0, [[COPY11]], 0, 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_PACK_B32_F16_e64_:%[0-9]+]]:vgpr_32 = nofpexcept V_PACK_B32_F16_e64 0, [[V_DIV_FIXUP_F16_gfx9_e64_]], 0, [[V_DIV_FIXUP_F16_gfx9_e64_1]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32 = V_READFIRSTLANE_B32 [[V_PACK_B32_F16_e64_]], implicit $exec
+    ; GFX9-NEXT: $sgpr0 = COPY [[V_READFIRSTLANE_B32_]]
+    ; GFX9-NEXT: SI_RETURN_TO_EPILOG implicit $sgpr0
+    %0:sgpr(s32) = COPY $sgpr0
+    %1:sgpr(s32) = COPY $sgpr1
+    %2:sgpr(s16) = G_TRUNC %0:sgpr(s32)
+    %3:sgpr(s32) = G_CONSTANT i32 16
+    %4:sgpr(s32) = G_LSHR %0:sgpr, %3:sgpr(s32)
+    %5:sgpr(s16) = G_TRUNC %4:sgpr(s32)
+    %6:sgpr(s16) = G_TRUNC %1:sgpr(s32)
+    %7:sgpr(s32) = G_LSHR %1:sgpr, %3:sgpr(s32)
+    %8:sgpr(s16) = G_TRUNC %7:sgpr(s32)
+    %9:vgpr(s16) = COPY %2:sgpr(s16)
+    %10:vgpr(s32) = G_FPEXT %9:vgpr(s16)
+    %11:vgpr(s16) = COPY %6:sgpr(s16)
+    %12:vgpr(s32) = G_FPEXT %11:vgpr(s16)
+    %13:vgpr(s32) = G_FNEG %12:vgpr
+    %14:vgpr(s32) = G_INTRINSIC intrinsic(@llvm.amdgcn.rcp), %12:vgpr(s32)
+    %15:vgpr(s32) = G_FMUL %10:vgpr, %14:vgpr
+    %16:vgpr(s32) = G_FMAD %13:vgpr, %15:vgpr, %10:vgpr
+    %17:vgpr(s32) = G_FMAD %16:vgpr, %14:vgpr, %15:vgpr
+    %18:vgpr(s32) = G_FMAD %13:vgpr, %17:vgpr, %10:vgpr
+    %19:vgpr(s32) = G_FMUL %18:vgpr, %14:vgpr
+    %20:sgpr(s32) = G_CONSTANT i32 -8388608
+    %21:vgpr(s32) = COPY %20:sgpr(s32)
+    %22:vgpr(s32) = G_AND %19:vgpr, %21:vgpr
+    %23:vgpr(s32) = G_FADD %22:vgpr, %17:vgpr
+    %24:vgpr(s16) = G_FPTRUNC %23:vgpr(s32)
+    %25:vgpr(s16) = COPY %6:sgpr(s16)
+    %26:vgpr(s16) = COPY %2:sgpr(s16)
+    %27:vgpr(s16) = G_INTRINSIC intrinsic(@llvm.amdgcn.div.fixup), %24:vgpr(s16), %25:vgpr(s16), %26:vgpr(s16)
+    %28:vgpr(s16) = COPY %5:sgpr(s16)
+    %29:vgpr(s32) = G_FPEXT %28:vgpr(s16)
+    %30:vgpr(s16) = COPY %8:sgpr(s16)
+    %31:vgpr(s32) = G_FPEXT %30:vgpr(s16)
+    %32:vgpr(s32) = G_FNEG %31:vgpr
+    %33:vgpr(s32) = G_INTRINSIC intrinsic(@llvm.amdgcn.rcp), %31:vgpr(s32)
+    %34:vgpr(s32) = G_FMUL %29:vgpr, %33:vgpr
+    %35:vgpr(s32) = G_FMAD %32:vgpr, %34:vgpr, %29:vgpr
+    %36:vgpr(s32) = G_FMAD %35:vgpr, %33:vgpr, %34:vgpr
+    %37:vgpr(s32) = G_FMAD %32:vgpr, %36:vgpr, %29:vgpr
+    %38:vgpr(s32) = G_FMUL %37:vgpr, %33:vgpr
+    %39:vgpr(s32) = COPY %20:sgpr(s32)
+    %40:vgpr(s32) = G_AND %38:vgpr, %39:vgpr
+    %41:vgpr(s32) = G_FADD %40:vgpr, %36:vgpr
+    %42:vgpr(s16) = G_FPTRUNC %41:vgpr(s32)
+    %43:vgpr(s16) = COPY %8:sgpr(s16)
+    %44:vgpr(s16) = COPY %5:sgpr(s16)
+    %45:vgpr(s16) = G_INTRINSIC intrinsic(@llvm.amdgcn.div.fixup), %42:vgpr(s16), %43:vgpr(s16), %44:vgpr(s16)
+    %46:vgpr(<2 x s16>) = G_BUILD_VECTOR %27:vgpr(s16), %45:vgpr(s16)
+    %47:vgpr(s32) = G_BITCAST %46:vgpr(<2 x s16>)
+    %48:sgpr(s32) = G_INTRINSIC_CONVERGENT intrinsic(@llvm.amdgcn.readfirstlane), %47:vgpr(s32)
+    $sgpr0 = COPY %48:sgpr(s32)
+    SI_RETURN_TO_EPILOG implicit $sgpr0
+...

@llvmbot
Copy link
Member

llvmbot commented Sep 26, 2024

@llvm/pr-subscribers-llvm-globalisel

Author: Shilei Tian (shiltian)

Changes

The current selectVOP3PMadMixModsImpl can produce V_MAD_FIX_F32 instruction
that violates constant bus restriction, while its SelectionDAG counterpart
doesn't. The culprit is in the copy stripping while the SelectionDAG version
only has a bitcast stripping. This PR simply aligns the two version.


Full diff: https://github.com/llvm/llvm-project/pull/110168.diff

4 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp (+5-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-add-ext-mul.ll (+15-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fdiv.f16.ll (+5-5)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/madmix-constant-bus-violation.mir (+110)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index febf0711c7d574..c3df962d448f94 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -5312,26 +5312,20 @@ AMDGPUInstructionSelector::selectVOP3PMadMixModsImpl(MachineOperand &Root,
     // Only change Src if src modifier could be gained. In such cases new Src
     // could be sgpr but this does not violate constant bus restriction for
     // instruction that is being selected.
-    // Note: Src is not changed when there is only a simple sgpr to vgpr copy
-    // since this could violate constant bus restriction.
-    Register PeekSrc = stripCopy(Src, *MRI);
+    Src = stripBitCast(Src, *MRI);
 
     const auto CheckAbsNeg = [&]() {
       // Be careful about folding modifiers if we already have an abs. fneg is
       // applied last, so we don't want to apply an earlier fneg.
       if ((Mods & SISrcMods::ABS) == 0) {
         unsigned ModsTmp;
-        std::tie(PeekSrc, ModsTmp) = selectVOP3ModsImpl(PeekSrc);
+        std::tie(Src, ModsTmp) = selectVOP3ModsImpl(Src);
 
-        if ((ModsTmp & SISrcMods::NEG) != 0) {
+        if ((ModsTmp & SISrcMods::NEG) != 0)
           Mods ^= SISrcMods::NEG;
-          Src = PeekSrc;
-        }
 
-        if ((ModsTmp & SISrcMods::ABS) != 0) {
+        if ((ModsTmp & SISrcMods::ABS) != 0)
           Mods |= SISrcMods::ABS;
-          Src = PeekSrc;
-        }
       }
     };
 
@@ -5344,8 +5338,7 @@ AMDGPUInstructionSelector::selectVOP3PMadMixModsImpl(MachineOperand &Root,
 
     Mods |= SISrcMods::OP_SEL_1;
 
-    if (isExtractHiElt(*MRI, PeekSrc, PeekSrc)) {
-      Src = PeekSrc;
+    if (isExtractHiElt(*MRI, Src, Src)) {
       Mods |= SISrcMods::OP_SEL_0;
       CheckAbsNeg();
     }
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-add-ext-mul.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-add-ext-mul.ll
index 4ebe1c499a1769..4d603f7487754a 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-add-ext-mul.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-add-ext-mul.ll
@@ -73,10 +73,14 @@ define amdgpu_vs <5 x float> @test_5xf16_5xf32_add_ext_mul(<5 x half> inreg %x,
 ; GFX10-FAST-DENORM-NEXT:    v_mov_b32_e32 v2, s8
 ; GFX10-FAST-DENORM-NEXT:    v_mov_b32_e32 v3, s9
 ; GFX10-FAST-DENORM-NEXT:    v_mov_b32_e32 v4, s10
+; GFX10-FAST-DENORM-NEXT:    s_lshr_b32 s11, s0, 16
+; GFX10-FAST-DENORM-NEXT:    s_lshr_b32 s12, s1, 16
+; GFX10-FAST-DENORM-NEXT:    s_lshr_b32 s13, s3, 16
+; GFX10-FAST-DENORM-NEXT:    s_lshr_b32 s14, s4, 16
 ; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v0, s0, s3, v0 op_sel_hi:[1,1,0]
-; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v1, s0, s3, v1 op_sel:[1,1,0] op_sel_hi:[1,1,0]
+; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v1, s11, s13, v1 op_sel_hi:[1,1,0]
 ; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v2, s1, s4, v2 op_sel_hi:[1,1,0]
-; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v3, s1, s4, v3 op_sel:[1,1,0] op_sel_hi:[1,1,0]
+; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v3, s12, s14, v3 op_sel_hi:[1,1,0]
 ; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v4, s2, s5, v4 op_sel_hi:[1,1,0]
 ; GFX10-FAST-DENORM-NEXT:    ; return to shader part epilog
 .entry:
@@ -117,12 +121,18 @@ define amdgpu_vs <6 x float> @test_6xf16_6xf32_add_ext_mul_rhs(<6 x half> inreg
 ; GFX10-FAST-DENORM-NEXT:    v_mov_b32_e32 v3, s9
 ; GFX10-FAST-DENORM-NEXT:    v_mov_b32_e32 v4, s10
 ; GFX10-FAST-DENORM-NEXT:    v_mov_b32_e32 v5, s11
+; GFX10-FAST-DENORM-NEXT:    s_lshr_b32 s12, s0, 16
+; GFX10-FAST-DENORM-NEXT:    s_lshr_b32 s13, s1, 16
+; GFX10-FAST-DENORM-NEXT:    s_lshr_b32 s6, s2, 16
+; GFX10-FAST-DENORM-NEXT:    s_lshr_b32 s14, s3, 16
+; GFX10-FAST-DENORM-NEXT:    s_lshr_b32 s15, s4, 16
+; GFX10-FAST-DENORM-NEXT:    s_lshr_b32 s16, s5, 16
 ; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v0, s0, s3, v0 op_sel_hi:[1,1,0]
-; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v1, s0, s3, v1 op_sel:[1,1,0] op_sel_hi:[1,1,0]
+; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v1, s12, s14, v1 op_sel_hi:[1,1,0]
 ; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v2, s1, s4, v2 op_sel_hi:[1,1,0]
-; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v3, s1, s4, v3 op_sel:[1,1,0] op_sel_hi:[1,1,0]
+; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v3, s13, s15, v3 op_sel_hi:[1,1,0]
 ; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v4, s2, s5, v4 op_sel_hi:[1,1,0]
-; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v5, s2, s5, v5 op_sel:[1,1,0] op_sel_hi:[1,1,0]
+; GFX10-FAST-DENORM-NEXT:    v_fma_mix_f32 v5, s6, s16, v5 op_sel_hi:[1,1,0]
 ; GFX10-FAST-DENORM-NEXT:    ; return to shader part epilog
 .entry:
     %a = fmul fast <6 x half> %x, %y
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/fdiv.f16.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/fdiv.f16.ll
index 89cd18ad9be70b..1a98285230b2cd 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/fdiv.f16.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/fdiv.f16.ll
@@ -2555,9 +2555,9 @@ define amdgpu_ps i32 @s_fdiv_v2f16(i32 inreg %a.arg, i32 inreg %b.arg) {
 ; GFX9-FLUSH-NEXT:    v_rcp_f32_e32 v1, v1
 ; GFX9-FLUSH-NEXT:    v_mad_mixlo_f16 v0, s0, v0, 0 op_sel_hi:[1,0,0]
 ; GFX9-FLUSH-NEXT:    v_div_fixup_f16 v0, v0, v2, s0
-; GFX9-FLUSH-NEXT:    v_mad_mixlo_f16 v1, s0, v1, 0 op_sel:[1,0,0] op_sel_hi:[1,0,0]
-; GFX9-FLUSH-NEXT:    v_mov_b32_e32 v2, s3
-; GFX9-FLUSH-NEXT:    v_div_fixup_f16 v1, v1, s2, v2
+; GFX9-FLUSH-NEXT:    v_mov_b32_e32 v2, s2
+; GFX9-FLUSH-NEXT:    v_mad_mixlo_f16 v1, s3, v1, 0 op_sel_hi:[1,0,0]
+; GFX9-FLUSH-NEXT:    v_div_fixup_f16 v1, v1, v2, s3
 ; GFX9-FLUSH-NEXT:    v_pack_b32_f16 v0, v0, v1
 ; GFX9-FLUSH-NEXT:    v_readfirstlane_b32 s0, v0
 ; GFX9-FLUSH-NEXT:    ; return to shader part epilog
@@ -2571,7 +2571,7 @@ define amdgpu_ps i32 @s_fdiv_v2f16(i32 inreg %a.arg, i32 inreg %b.arg) {
 ; GFX10-NEXT:    v_rcp_f32_e32 v0, v0
 ; GFX10-NEXT:    v_rcp_f32_e32 v1, v1
 ; GFX10-NEXT:    v_fma_mixlo_f16 v0, s0, v0, 0 op_sel_hi:[1,0,0]
-; GFX10-NEXT:    v_fma_mixlo_f16 v1, s0, v1, 0 op_sel:[1,0,0] op_sel_hi:[1,0,0]
+; GFX10-NEXT:    v_fma_mixlo_f16 v1, s3, v1, 0 op_sel_hi:[1,0,0]
 ; GFX10-NEXT:    v_div_fixup_f16 v0, v0, s1, s0
 ; GFX10-NEXT:    v_div_fixup_f16 v1, v1, s2, s3
 ; GFX10-NEXT:    v_pack_b32_f16 v0, v0, v1
@@ -2588,7 +2588,7 @@ define amdgpu_ps i32 @s_fdiv_v2f16(i32 inreg %a.arg, i32 inreg %b.arg) {
 ; GFX11-NEXT:    v_rcp_f32_e32 v1, v1
 ; GFX11-NEXT:    s_waitcnt_depctr 0xfff
 ; GFX11-NEXT:    v_fma_mixlo_f16 v0, s0, v0, 0 op_sel_hi:[1,0,0]
-; GFX11-NEXT:    v_fma_mixlo_f16 v1, s0, v1, 0 op_sel:[1,0,0] op_sel_hi:[1,0,0]
+; GFX11-NEXT:    v_fma_mixlo_f16 v1, s3, v1, 0 op_sel_hi:[1,0,0]
 ; GFX11-NEXT:    v_div_fixup_f16 v0, v0, s1, s0
 ; GFX11-NEXT:    v_div_fixup_f16 v1, v1, s2, s3
 ; GFX11-NEXT:    v_pack_b32_f16 v0, v0, v1
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/madmix-constant-bus-violation.mir b/llvm/test/CodeGen/AMDGPU/GlobalISel/madmix-constant-bus-violation.mir
new file mode 100644
index 00000000000000..ba9cfb8e1d68ed
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/madmix-constant-bus-violation.mir
@@ -0,0 +1,110 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -global-isel=1 -mtriple=amdgcn -mcpu=gfx900 -run-pass=instruction-select,machineverifier -o - %s | FileCheck -check-prefixes=GFX9 %s
+
+---
+name: s_fdiv_v2f16
+legalized: true
+regBankSelected: true
+machineFunctionInfo:
+  mode:
+    fp32-output-denormals: false
+    fp32-input-denormals: false
+body: |
+  bb.0:
+    ; GFX9-LABEL: name: s_fdiv_v2f16
+    ; GFX9: [[COPY:%[0-9]+]]:sreg_32 = COPY $sgpr0
+    ; GFX9-NEXT: [[COPY1:%[0-9]+]]:sreg_32 = COPY $sgpr1
+    ; GFX9-NEXT: [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 16
+    ; GFX9-NEXT: [[S_LSHR_B32_:%[0-9]+]]:sreg_32 = S_LSHR_B32 [[COPY]], [[S_MOV_B32_]], implicit-def dead $scc
+    ; GFX9-NEXT: [[S_LSHR_B32_1:%[0-9]+]]:sreg_32 = S_LSHR_B32 [[COPY1]], [[S_MOV_B32_]], implicit-def dead $scc
+    ; GFX9-NEXT: [[COPY2:%[0-9]+]]:vgpr_32 = COPY [[COPY]]
+    ; GFX9-NEXT: [[V_CVT_F32_F16_e64_:%[0-9]+]]:vgpr_32 = nofpexcept V_CVT_F32_F16_e64 0, [[COPY2]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[COPY3:%[0-9]+]]:vgpr_32 = COPY [[COPY1]]
+    ; GFX9-NEXT: [[V_CVT_F32_F16_e64_1:%[0-9]+]]:vgpr_32 = nofpexcept V_CVT_F32_F16_e64 0, [[COPY3]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_RCP_F32_e64_:%[0-9]+]]:vgpr_32 = nofpexcept V_RCP_F32_e64 0, [[V_CVT_F32_F16_e64_1]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_MUL_F32_e64_:%[0-9]+]]:vgpr_32 = nofpexcept V_MUL_F32_e64 0, [[V_CVT_F32_F16_e64_]], 0, [[V_RCP_F32_e64_]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_MAD_MIX_F32_:%[0-9]+]]:vgpr_32 = V_MAD_MIX_F32 9, [[COPY3]], 0, [[V_MUL_F32_e64_]], 8, [[COPY2]], 0, 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_MAC_F32_e64_:%[0-9]+]]:vgpr_32 = V_MAC_F32_e64 0, [[V_MAD_MIX_F32_]], 0, [[V_RCP_F32_e64_]], 0, [[V_MUL_F32_e64_]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_MAD_MIX_F32_1:%[0-9]+]]:vgpr_32 = V_MAD_MIX_F32 9, [[COPY3]], 0, [[V_MAC_F32_e64_]], 8, [[COPY2]], 0, 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_MUL_F32_e64_1:%[0-9]+]]:vgpr_32 = nofpexcept V_MUL_F32_e64 0, [[V_MAD_MIX_F32_1]], 0, [[V_RCP_F32_e64_]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 -8388608
+    ; GFX9-NEXT: [[COPY4:%[0-9]+]]:vgpr_32 = COPY [[S_MOV_B32_1]]
+    ; GFX9-NEXT: [[V_AND_B32_e64_:%[0-9]+]]:vgpr_32 = V_AND_B32_e64 [[V_MUL_F32_e64_1]], [[COPY4]], implicit $exec
+    ; GFX9-NEXT: [[V_ADD_F32_e64_:%[0-9]+]]:vgpr_32 = nofpexcept V_ADD_F32_e64 0, [[V_AND_B32_e64_]], 0, [[V_MAC_F32_e64_]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_CVT_F16_F32_e64_:%[0-9]+]]:vgpr_32 = nofpexcept V_CVT_F16_F32_e64 0, [[V_ADD_F32_e64_]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[COPY1]]
+    ; GFX9-NEXT: [[COPY6:%[0-9]+]]:vgpr_32 = COPY [[COPY]]
+    ; GFX9-NEXT: [[V_DIV_FIXUP_F16_gfx9_e64_:%[0-9]+]]:vgpr_32 = nofpexcept V_DIV_FIXUP_F16_gfx9_e64 0, [[V_CVT_F16_F32_e64_]], 0, [[COPY5]], 0, [[COPY6]], 0, 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[COPY7:%[0-9]+]]:vgpr_32 = COPY [[S_LSHR_B32_]]
+    ; GFX9-NEXT: [[V_CVT_F32_F16_e64_2:%[0-9]+]]:vgpr_32 = nofpexcept V_CVT_F32_F16_e64 0, [[COPY7]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[S_LSHR_B32_1]]
+    ; GFX9-NEXT: [[V_CVT_F32_F16_e64_3:%[0-9]+]]:vgpr_32 = nofpexcept V_CVT_F32_F16_e64 0, [[COPY8]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_RCP_F32_e64_1:%[0-9]+]]:vgpr_32 = nofpexcept V_RCP_F32_e64 0, [[V_CVT_F32_F16_e64_3]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_MUL_F32_e64_2:%[0-9]+]]:vgpr_32 = nofpexcept V_MUL_F32_e64 0, [[V_CVT_F32_F16_e64_2]], 0, [[V_RCP_F32_e64_1]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_MAD_MIX_F32_2:%[0-9]+]]:vgpr_32 = V_MAD_MIX_F32 9, [[COPY8]], 0, [[V_MUL_F32_e64_2]], 8, [[COPY7]], 0, 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_MAC_F32_e64_1:%[0-9]+]]:vgpr_32 = V_MAC_F32_e64 0, [[V_MAD_MIX_F32_2]], 0, [[V_RCP_F32_e64_1]], 0, [[V_MUL_F32_e64_2]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_MAD_MIX_F32_3:%[0-9]+]]:vgpr_32 = V_MAD_MIX_F32 9, [[COPY8]], 0, [[V_MAC_F32_e64_1]], 8, [[COPY7]], 0, 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_MUL_F32_e64_3:%[0-9]+]]:vgpr_32 = nofpexcept V_MUL_F32_e64 0, [[V_MAD_MIX_F32_3]], 0, [[V_RCP_F32_e64_1]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[COPY9:%[0-9]+]]:vgpr_32 = COPY [[S_MOV_B32_1]]
+    ; GFX9-NEXT: [[V_AND_B32_e64_1:%[0-9]+]]:vgpr_32 = V_AND_B32_e64 [[V_MUL_F32_e64_3]], [[COPY9]], implicit $exec
+    ; GFX9-NEXT: [[V_ADD_F32_e64_1:%[0-9]+]]:vgpr_32 = nofpexcept V_ADD_F32_e64 0, [[V_AND_B32_e64_1]], 0, [[V_MAC_F32_e64_1]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_CVT_F16_F32_e64_1:%[0-9]+]]:vgpr_32 = nofpexcept V_CVT_F16_F32_e64 0, [[V_ADD_F32_e64_1]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[COPY10:%[0-9]+]]:vgpr_32 = COPY [[S_LSHR_B32_1]]
+    ; GFX9-NEXT: [[COPY11:%[0-9]+]]:vgpr_32 = COPY [[S_LSHR_B32_]]
+    ; GFX9-NEXT: [[V_DIV_FIXUP_F16_gfx9_e64_1:%[0-9]+]]:vgpr_32 = nofpexcept V_DIV_FIXUP_F16_gfx9_e64 0, [[V_CVT_F16_F32_e64_1]], 0, [[COPY10]], 0, [[COPY11]], 0, 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_PACK_B32_F16_e64_:%[0-9]+]]:vgpr_32 = nofpexcept V_PACK_B32_F16_e64 0, [[V_DIV_FIXUP_F16_gfx9_e64_]], 0, [[V_DIV_FIXUP_F16_gfx9_e64_1]], 0, 0, implicit $mode, implicit $exec
+    ; GFX9-NEXT: [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32 = V_READFIRSTLANE_B32 [[V_PACK_B32_F16_e64_]], implicit $exec
+    ; GFX9-NEXT: $sgpr0 = COPY [[V_READFIRSTLANE_B32_]]
+    ; GFX9-NEXT: SI_RETURN_TO_EPILOG implicit $sgpr0
+    %0:sgpr(s32) = COPY $sgpr0
+    %1:sgpr(s32) = COPY $sgpr1
+    %2:sgpr(s16) = G_TRUNC %0:sgpr(s32)
+    %3:sgpr(s32) = G_CONSTANT i32 16
+    %4:sgpr(s32) = G_LSHR %0:sgpr, %3:sgpr(s32)
+    %5:sgpr(s16) = G_TRUNC %4:sgpr(s32)
+    %6:sgpr(s16) = G_TRUNC %1:sgpr(s32)
+    %7:sgpr(s32) = G_LSHR %1:sgpr, %3:sgpr(s32)
+    %8:sgpr(s16) = G_TRUNC %7:sgpr(s32)
+    %9:vgpr(s16) = COPY %2:sgpr(s16)
+    %10:vgpr(s32) = G_FPEXT %9:vgpr(s16)
+    %11:vgpr(s16) = COPY %6:sgpr(s16)
+    %12:vgpr(s32) = G_FPEXT %11:vgpr(s16)
+    %13:vgpr(s32) = G_FNEG %12:vgpr
+    %14:vgpr(s32) = G_INTRINSIC intrinsic(@llvm.amdgcn.rcp), %12:vgpr(s32)
+    %15:vgpr(s32) = G_FMUL %10:vgpr, %14:vgpr
+    %16:vgpr(s32) = G_FMAD %13:vgpr, %15:vgpr, %10:vgpr
+    %17:vgpr(s32) = G_FMAD %16:vgpr, %14:vgpr, %15:vgpr
+    %18:vgpr(s32) = G_FMAD %13:vgpr, %17:vgpr, %10:vgpr
+    %19:vgpr(s32) = G_FMUL %18:vgpr, %14:vgpr
+    %20:sgpr(s32) = G_CONSTANT i32 -8388608
+    %21:vgpr(s32) = COPY %20:sgpr(s32)
+    %22:vgpr(s32) = G_AND %19:vgpr, %21:vgpr
+    %23:vgpr(s32) = G_FADD %22:vgpr, %17:vgpr
+    %24:vgpr(s16) = G_FPTRUNC %23:vgpr(s32)
+    %25:vgpr(s16) = COPY %6:sgpr(s16)
+    %26:vgpr(s16) = COPY %2:sgpr(s16)
+    %27:vgpr(s16) = G_INTRINSIC intrinsic(@llvm.amdgcn.div.fixup), %24:vgpr(s16), %25:vgpr(s16), %26:vgpr(s16)
+    %28:vgpr(s16) = COPY %5:sgpr(s16)
+    %29:vgpr(s32) = G_FPEXT %28:vgpr(s16)
+    %30:vgpr(s16) = COPY %8:sgpr(s16)
+    %31:vgpr(s32) = G_FPEXT %30:vgpr(s16)
+    %32:vgpr(s32) = G_FNEG %31:vgpr
+    %33:vgpr(s32) = G_INTRINSIC intrinsic(@llvm.amdgcn.rcp), %31:vgpr(s32)
+    %34:vgpr(s32) = G_FMUL %29:vgpr, %33:vgpr
+    %35:vgpr(s32) = G_FMAD %32:vgpr, %34:vgpr, %29:vgpr
+    %36:vgpr(s32) = G_FMAD %35:vgpr, %33:vgpr, %34:vgpr
+    %37:vgpr(s32) = G_FMAD %32:vgpr, %36:vgpr, %29:vgpr
+    %38:vgpr(s32) = G_FMUL %37:vgpr, %33:vgpr
+    %39:vgpr(s32) = COPY %20:sgpr(s32)
+    %40:vgpr(s32) = G_AND %38:vgpr, %39:vgpr
+    %41:vgpr(s32) = G_FADD %40:vgpr, %36:vgpr
+    %42:vgpr(s16) = G_FPTRUNC %41:vgpr(s32)
+    %43:vgpr(s16) = COPY %8:sgpr(s16)
+    %44:vgpr(s16) = COPY %5:sgpr(s16)
+    %45:vgpr(s16) = G_INTRINSIC intrinsic(@llvm.amdgcn.div.fixup), %42:vgpr(s16), %43:vgpr(s16), %44:vgpr(s16)
+    %46:vgpr(<2 x s16>) = G_BUILD_VECTOR %27:vgpr(s16), %45:vgpr(s16)
+    %47:vgpr(s32) = G_BITCAST %46:vgpr(<2 x s16>)
+    %48:sgpr(s32) = G_INTRINSIC_CONVERGENT intrinsic(@llvm.amdgcn.readfirstlane), %47:vgpr(s32)
+    $sgpr0 = COPY %48:sgpr(s32)
+    SI_RETURN_TO_EPILOG implicit $sgpr0
+...

@shiltian
Copy link
Contributor Author

I understand that this seems to revert some optimization, so I'm happy to hear about better approaches.

@shiltian shiltian requested a review from rampitec September 26, 2024 20:46
@shiltian shiltian changed the title [AMDGPU][GlobalISel] Make selectVOP3PMadMixModsImpl same as the SelectionDAG counterpart [AMDGPU][GlobalISel] Align selectVOP3PMadMixModsImpl with the SelectionDAG counterpart Sep 26, 2024
@shiltian shiltian force-pushed the users/shiltian/fmad_mix-constant-bus-violation branch from 4e5ee8a to 7f665f0 Compare September 26, 2024 20:53
@petar-avramovic
Copy link
Collaborator

IMO, the easiest approach is to force vgpr reg class on all operands when selecting vop3p instructions. Do this with constrainOperandRegClass, to vgpr reg class you can get from getVGPRClassForBitWidth. This will insert copy from sgpr to vgpr for all operands.
SIFoldOperands can eliminate some of the copies by replacing vgpr with sgpr so you end up with same result in the end (SIFoldOperands knows about constant bus restriction).

@shiltian shiltian force-pushed the users/shiltian/fmad_mix-constant-bus-violation branch 2 times, most recently from 8131471 to b7e1fa3 Compare September 27, 2024 15:06
@shiltian
Copy link
Contributor Author

IMO, the easiest approach is to force vgpr reg class on all operands when selecting vop3p instructions.

IIUC, this PR made a right move by removing the copy stripping.

@petar-avramovic
Copy link
Collaborator

IMO, the easiest approach is to force vgpr reg class on all operands when selecting vop3p instructions.

IIUC, this PR made a right move by removing the copy stripping.

What I meant was to constrainOperandRegClass to vgpr after adding sgpr operand to MIB. This way there is no need to avoid copy stripping and final asm code stays the same. Didn't try to implement it, so it may not be that easy.

@shiltian
Copy link
Contributor Author

I added the following code into selectVOP3PMadMixModsExt and selectVOP3PMadMixMods. Not sure if this is what you suggested, but it's not able to fix the issue.

diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index febf0711c7d5..d95e0ee944bf 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -5366,6 +5366,13 @@ AMDGPUInstructionSelector::selectVOP3PMadMixModsExt(
   if (!Matched)
     return {};
 
+  const TargetRegisterClass *SrcRC =
+      TRI.getConstrainedRegClassForOperand(Root, *MRI);
+  const TargetRegisterClass *VGPRSrc =
+      TRI.getVGPRClassForBitWidth(AMDGPU::getRegBitWidth(SrcRC->getID()));
+  MachineInstr *I = Root.getParent();
+  Src = constrainOperandRegClass(*MF, TRI, *MRI, TII, RBI, *I, *VGPRSrc, Root);
+
   return {{
       [=](MachineInstrBuilder &MIB) { MIB.addReg(Src); },
       [=](MachineInstrBuilder &MIB) { MIB.addImm(Mods); } // src_mods
@@ -5379,6 +5386,13 @@ AMDGPUInstructionSelector::selectVOP3PMadMixMods(MachineOperand &Root) const {
   bool Matched;
   std::tie(Src, Mods) = selectVOP3PMadMixModsImpl(Root, Matched);
 
+  const TargetRegisterClass *SrcRC =
+      TRI.getConstrainedRegClassForOperand(Root, *MRI);
+  const TargetRegisterClass *VGPRSrc =
+      TRI.getVGPRClassForBitWidth(AMDGPU::getRegBitWidth(SrcRC->getID()));
+  MachineInstr *I = Root.getParent();
+  Src = constrainOperandRegClass(*MF, TRI, *MRI, TII, RBI, *I, *VGPRSrc, Root);
+
   return {{
       [=](MachineInstrBuilder &MIB) { MIB.addReg(Src); },
       [=](MachineInstrBuilder &MIB) { MIB.addImm(Mods); } // src_mods

@shiltian
Copy link
Contributor Author

Overall we can make it align with the SelectionDAG counterpart, unblock the PR depending on it, and then improve it using what @petar-avramovic suggested on both side.

Copy link
Contributor

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I trust this. It might be better to directly check the constant bus restriction, similar to how ThreeOpFrag does

@petar-avramovic
Copy link
Collaborator

Hi @shiltian, I did not look into the patch carefully, can you provide LLVM IR test that violates constant bus restriction first.

What I had in mind was that you somehow end up with sgpr register after selectVOP3PMadMixModsImpl

If you want to use existing functions you could

  • add it to machine instruction first (without checking anything).
  • then constrain its register class to vgpr.
    expected result is that if sgpr was added to instruction, constrainOperandRegClass(...,*VGPRSrc, SrcOp) would replace it by inserting a copy to vgpr.
    The constrainSelectedInstRegOperands allows both sgpr and vgpr for that operand. We have to explicitly constrain all operands to vgpr. Other option is to do this in tablegen. End result should be the same.
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index febf0711c7d5..9a81e7609dac 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -5356,6 +5356,13 @@ AMDGPUInstructionSelector::selectVOP3PMadMixModsImpl(MachineOperand &Root,
   return {Src, Mods};
 }
 
+MachineOperand &getLastOp(MachineInstrBuilder &MIB) {
+  unsigned Idx = MIB->getNumOperands() - 1;
+  while (MIB->getOperand(Idx).isImplicit())
+    --Idx;
+  return MIB->getOperand(Idx);
+}
+
 InstructionSelector::ComplexRendererFns
 AMDGPUInstructionSelector::selectVOP3PMadMixModsExt(
     MachineOperand &Root) const {
@@ -5367,7 +5374,14 @@ AMDGPUInstructionSelector::selectVOP3PMadMixModsExt(
     return {};
 
   return {{
-      [=](MachineInstrBuilder &MIB) { MIB.addReg(Src); },
+      [=](MachineInstrBuilder &MIB) {
+        MIB.addReg(Src);
+        const TargetRegisterClass *VGPRSrc = &AMDGPU::VGPR_32RegClass;
+        MachineOperand &SrcOp = getLastOp(MIB);
+        constrainOperandRegClass(*MF, TRI, *MRI, TII, RBI, *MIB, *VGPRSrc,
+                                 SrcOp);
+      },
       [=](MachineInstrBuilder &MIB) { MIB.addImm(Mods); } // src_mods
   }};
 }

I added snippet that I would assume works, but don't have test for it.

@shiltian
Copy link
Contributor Author

I'm not sure I trust this.

Which part?

It might be better to directly check the constant bus restriction, similar to how ThreeOpFrag does

I tried this as you suggested but because the pattern here is a ComplexPattern (or GIComplexOperandMatcher), it doesn't have support for checking predicate, GISelPredicateCode as what ThreeOpFrag is used.

@shiltian shiltian force-pushed the users/shiltian/fmad_mix-constant-bus-violation branch 3 times, most recently from 38b55c0 to 9f1f15e Compare September 30, 2024 18:26
Copy link
Contributor

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should have end to end IR test, including bitcasted cases

@shiltian shiltian force-pushed the users/shiltian/fmad_mix-constant-bus-violation branch from 9f1f15e to 494e9fa Compare October 2, 2024 00:59
@shiltian
Copy link
Contributor Author

shiltian commented Oct 2, 2024

Thanks @petar-avramovic . It indeed works. It might be better to do it in TableGen so I will probably make a follow-up patch. For now, I'd like to get the issue resolved such that another PR can be merged as well.

@@ -73,10 +73,14 @@ define amdgpu_vs <5 x float> @test_5xf16_5xf32_add_ext_mul(<5 x half> inreg %x,
; GFX10-FAST-DENORM-NEXT: v_mov_b32_e32 v2, s8
; GFX10-FAST-DENORM-NEXT: v_mov_b32_e32 v3, s9
; GFX10-FAST-DENORM-NEXT: v_mov_b32_e32 v4, s10
; GFX10-FAST-DENORM-NEXT: s_lshr_b32 s11, s0, 16
; GFX10-FAST-DENORM-NEXT: s_lshr_b32 s12, s1, 16
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This regressed gfx10 which has the more relaxed constant bus limit

Copy link
Contributor Author

@shiltian shiltian Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The v_fma_mix_f32 instruction still uses two SGPRs. The only issue here is, it can't see "through" the copies such that it can't directly use the source operands of the shifts via op_sel. I think the ISel version also has this limitation (apparently because here we "copied" the logic).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still should fix this in a follow up

…ectionDAG counterpart

The current `selectVOP3PMadMixModsImpl` can produce `V_MAD_FIX_F32` instruction
that violates constant bus restriction, while its `SelectionDAG` counterpart
doesn't. The culprit is in the copy stripping while the SelectionDAG version
only has a bitcast stripping. This PR simply aligns the two version.
@shiltian shiltian force-pushed the users/shiltian/fmad_mix-constant-bus-violation branch from 494e9fa to c5680dd Compare October 4, 2024 04:36
@@ -0,0 +1,42 @@
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
# RUN: llc -mtriple=amdgcn -mcpu=gfx908 -run-pass=instruction-select,machineverifier -o - %s | FileCheck -check-prefixes=GFX9 %s
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we have end-to-end IR tests, can I just remove this two MIR tests?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should keep both probably

@shiltian
Copy link
Contributor Author

shiltian commented Oct 8, 2024

bump

@shiltian shiltian merged commit 48ac846 into main Oct 8, 2024
7 of 9 checks passed
@shiltian shiltian deleted the users/shiltian/fmad_mix-constant-bus-violation branch October 8, 2024 13:41
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Dec 2, 2024
…ctionDAG` counterpart (llvm#110168)

The current `selectVOP3PMadMixModsImpl` can produce `V_MAD_FIX_F32`
instruction
that violates constant bus restriction, while its `SelectionDAG`
counterpart
doesn't. The culprit is in the copy stripping while the `SelectionDAG`
version
only has a bitcast stripping. This PR simply aligns the two version.

(cherry picked from commit 48ac846)
Change-Id: I33ca6f072d443b9571675d7bee864d7fb3c4e1cc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants