Skip to content

[AMDGPU] Fix FMA Combine #119217

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Dec 10, 2024
Merged

[AMDGPU] Fix FMA Combine #119217

merged 2 commits into from
Dec 10, 2024

Conversation

piotrAMD
Copy link
Collaborator

@piotrAMD piotrAMD commented Dec 9, 2024

Update the check in the FMA combine to check dot10-insts instead of dot7-insts.

The target of the combine, v_dot2_f32_f16, is available only if dot10-insts target feature is enabled.

The issue probably dates back to the change that split out dot10-insts out of dot7-insts.

As far as I can see, this does not affect any current targets, but if a future target has dot7-insts, but not dot10-insts that would cause a crash ("cannot select") for the input ir in the test.

Update the check in the FMA combine to check dot10-insts instead
of dot7-insts.

The target of the combine, v_dot2_f32_f16, is available only
if dot10-insts target feature is enabled.

The issue probably dates back to the change that split out dot10-insts
out of dot7-insts.

As far as I can see, this does not affect any current targets, but
if a future target has dot7-insts, but not dot10-insts that would
cause a crash ("cannot select") for the input ir in the test.
@llvmbot
Copy link
Member

llvmbot commented Dec 9, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Piotr Sobczak (piotrAMD)

Changes

Update the check in the FMA combine to check dot10-insts instead of dot7-insts.

The target of the combine, v_dot2_f32_f16, is available only if dot10-insts target feature is enabled.

The issue probably dates back to the change that split out dot10-insts out of dot7-insts.

As far as I can see, this does not affect any current targets, but if a future target has dot7-insts, but not dot10-insts that would cause a crash ("cannot select") for the input ir in the test.


Full diff: https://github.com/llvm/llvm-project/pull/119217.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+1-1)
  • (added) llvm/test/CodeGen/AMDGPU/fma_combine.ll (+70)
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index f89fe8faa600ba..8dfebd36a962e1 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -14696,7 +14696,7 @@ SDValue SITargetLowering::performFMACombine(SDNode *N,
   EVT VT = N->getValueType(0);
   SDLoc SL(N);
 
-  if (!Subtarget->hasDot7Insts() || VT != MVT::f32)
+  if (!Subtarget->hasDot10Insts() || VT != MVT::f32)
     return SDValue();
 
   // FMA((F32)S0.x, (F32)S1. x, FMA((F32)S0.y, (F32)S1.y, (F32)z)) ->
diff --git a/llvm/test/CodeGen/AMDGPU/fma_combine.ll b/llvm/test/CodeGen/AMDGPU/fma_combine.ll
new file mode 100644
index 00000000000000..0414018862d071
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/fma_combine.ll
@@ -0,0 +1,70 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr="+dot7-insts,-dot10-insts,-dot5-insts" < %s | FileCheck %s -check-prefix=DISABLED
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr="+dot7-insts,+dot10-insts,-dot5-insts" < %s | FileCheck %s -check-prefix=ENABLED
+
+; Test that FMACombine is disabled for a target without dot10-insts, and enabled for a target with dot10-insts.
+
+define amdgpu_kernel void @func() {
+; DISABLED-LABEL: func:
+; DISABLED:       ; %bb.0: ; %bb
+; DISABLED-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, 0
+; DISABLED-NEXT:    v_mov_b32_e32 v2, 0
+; DISABLED-NEXT:    s_mov_b32 vcc_lo, exec_lo
+; DISABLED-NEXT:  .LBB0_1: ; %main
+; DISABLED-NEXT:    ; =>This Inner Loop Header: Depth=1
+; DISABLED-NEXT:    ds_load_b128 v[3:6], v1
+; DISABLED-NEXT:    ds_store_b32 v1, v0
+; DISABLED-NEXT:    s_waitcnt lgkmcnt(1)
+; DISABLED-NEXT:    v_fma_mix_f32 v2, v4, v3, v2 op_sel_hi:[1,1,0]
+; DISABLED-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; DISABLED-NEXT:    v_fma_mix_f32 v2, v4, v3, v2 op_sel:[1,1,0] op_sel_hi:[1,1,0]
+; DISABLED-NEXT:    v_add_f32_e32 v2, 0, v2
+; DISABLED-NEXT:    s_cbranch_vccnz .LBB0_1
+; DISABLED-NEXT:  ; %bb.2: ; %DummyReturnBlock
+; DISABLED-NEXT:    s_endpgm
+;
+; ENABLED-LABEL: func:
+; ENABLED:       ; %bb.0: ; %bb
+; ENABLED-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, 0
+; ENABLED-NEXT:    v_mov_b32_e32 v2, 0
+; ENABLED-NEXT:    s_mov_b32 vcc_lo, exec_lo
+; ENABLED-NEXT:  .LBB0_1: ; %main
+; ENABLED-NEXT:    ; =>This Inner Loop Header: Depth=1
+; ENABLED-NEXT:    ds_load_b128 v[3:6], v1
+; ENABLED-NEXT:    ds_store_b32 v1, v0
+; ENABLED-NEXT:    s_waitcnt lgkmcnt(1)
+; ENABLED-NEXT:    v_dot2_f32_f16 v2, v4, v3, v2
+; ENABLED-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; ENABLED-NEXT:    v_add_f32_e32 v2, 0, v2
+; ENABLED-NEXT:    s_cbranch_vccnz .LBB0_1
+; ENABLED-NEXT:  ; %bb.2: ; %DummyReturnBlock
+; ENABLED-NEXT:    s_endpgm
+bb:
+  br label %main
+
+main:                        ; preds = %main, %bb
+  %.sroa.3 = phi float [ 0.000000e+00, %bb ], [ %i17, %main ]
+  %.sroa.4 = phi float [ 0.000000e+00, %bb ], [ %.sroa.3, %main ]
+  %i = load <8 x half>, ptr addrspace(3) null, align 16
+  %i1 = bitcast <8 x half> %i to i128
+  %i2 = trunc i128 %i1 to i32
+  %i3 = bitcast i32 %i2 to <2 x half>
+  %.sroa.1 = extractelement <2 x half> %i3, i64 0
+  %.sroa.2 = extractelement <2 x half> %i3, i64 1
+  %i4 = shufflevector <8 x half> %i, <8 x half> zeroinitializer, <2 x i32> <i32 2, i32 3>
+  %i5 = fpext <2 x half> %i4 to <2 x float>
+  %i6 = fpext half %.sroa.1 to float
+  %i7 = fpext half %.sroa.2 to float
+  %i8 = extractelement <2 x float> %i5, i64 0
+  %i9 = fmul contract float %i8, %i6
+  %i10 = fadd contract float %.sroa.3, %i9
+  %i11 = extractelement <2 x float> %i5, i64 1
+  %i12 = fmul contract float %i11, %i7
+  %i13 = fadd contract float %i12, %i10
+  %i14 = insertelement <2 x float> zeroinitializer, float %i13, i64 0
+  %i15 = insertelement <2 x float> %i14, float %.sroa.4, i64 1
+  %i16 = fadd <2 x float> %i15, zeroinitializer
+  store <2 x half> zeroinitializer, ptr addrspace(3) null, align 4
+  %i17 = extractelement <2 x float> %i16, i64 0
+  br label %main
+}

@piotrAMD
Copy link
Collaborator Author

piotrAMD commented Dec 9, 2024

I haven't actually checked that, but I think this may be related to https://reviews.llvm.org/D142507 that split out dot10-insts out of dot7-insts.

Copy link
Contributor

@jayfoad jayfoad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code change looks good.

@@ -14696,7 +14696,7 @@ SDValue SITargetLowering::performFMACombine(SDNode *N,
EVT VT = N->getValueType(0);
SDLoc SL(N);

if (!Subtarget->hasDot7Insts() || VT != MVT::f32)
if (!Subtarget->hasDot10Insts() || VT != MVT::f32)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still have no idea what these numbers mean

@piotrAMD piotrAMD merged commit a2d086a into llvm:main Dec 10, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants