[AMDGPU] Fix sign confusion in performMulLoHiCombine #106977
Conversation
This is a backport of #105831.
@llvm/pr-subscribers-backend-amdgpu

Author: Jay Foad (jayfoad)

Changes

SMUL_LOHI and UMUL_LOHI are different operations because the high part of the result is different, so it is not OK to optimize the signed version to MUL_U24/MULHI_U24 or the unsigned version to MUL_I24/MULHI_I24.

Full diff: https://github.com/llvm/llvm-project/pull/106977.diff

2 Files Affected:
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
llvm/test/CodeGen/AMDGPU/mul_int24.ll
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 39ae7c96cf7729..a71c9453d968dd 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -4349,6 +4349,7 @@ AMDGPUTargetLowering::performMulLoHiCombine(SDNode *N,
SelectionDAG &DAG = DCI.DAG;
SDLoc DL(N);
+ bool Signed = N->getOpcode() == ISD::SMUL_LOHI;
SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);
@@ -4363,20 +4364,25 @@ AMDGPUTargetLowering::performMulLoHiCombine(SDNode *N,
// Try to use two fast 24-bit multiplies (one for each half of the result)
// instead of one slow extending multiply.
- unsigned LoOpcode, HiOpcode;
- if (Subtarget->hasMulU24() && isU24(N0, DAG) && isU24(N1, DAG)) {
- N0 = DAG.getZExtOrTrunc(N0, DL, MVT::i32);
- N1 = DAG.getZExtOrTrunc(N1, DL, MVT::i32);
- LoOpcode = AMDGPUISD::MUL_U24;
- HiOpcode = AMDGPUISD::MULHI_U24;
- } else if (Subtarget->hasMulI24() && isI24(N0, DAG) && isI24(N1, DAG)) {
- N0 = DAG.getSExtOrTrunc(N0, DL, MVT::i32);
- N1 = DAG.getSExtOrTrunc(N1, DL, MVT::i32);
- LoOpcode = AMDGPUISD::MUL_I24;
- HiOpcode = AMDGPUISD::MULHI_I24;
+ unsigned LoOpcode = 0;
+ unsigned HiOpcode = 0;
+ if (Signed) {
+ if (Subtarget->hasMulI24() && isI24(N0, DAG) && isI24(N1, DAG)) {
+ N0 = DAG.getSExtOrTrunc(N0, DL, MVT::i32);
+ N1 = DAG.getSExtOrTrunc(N1, DL, MVT::i32);
+ LoOpcode = AMDGPUISD::MUL_I24;
+ HiOpcode = AMDGPUISD::MULHI_I24;
+ }
} else {
- return SDValue();
+ if (Subtarget->hasMulU24() && isU24(N0, DAG) && isU24(N1, DAG)) {
+ N0 = DAG.getZExtOrTrunc(N0, DL, MVT::i32);
+ N1 = DAG.getZExtOrTrunc(N1, DL, MVT::i32);
+ LoOpcode = AMDGPUISD::MUL_U24;
+ HiOpcode = AMDGPUISD::MULHI_U24;
+ }
}
+ if (!LoOpcode)
+ return SDValue();
SDValue Lo = DAG.getNode(LoOpcode, DL, MVT::i32, N0, N1);
SDValue Hi = DAG.getNode(HiOpcode, DL, MVT::i32, N0, N1);
diff --git a/llvm/test/CodeGen/AMDGPU/mul_int24.ll b/llvm/test/CodeGen/AMDGPU/mul_int24.ll
index be77a10380c49b..8f4c48fae6fb31 100644
--- a/llvm/test/CodeGen/AMDGPU/mul_int24.ll
+++ b/llvm/test/CodeGen/AMDGPU/mul_int24.ll
@@ -813,4 +813,102 @@ bb7:
ret void
}
+
+define amdgpu_kernel void @test_umul_i24(ptr addrspace(1) %out, i32 %arg) {
+; SI-LABEL: test_umul_i24:
+; SI: ; %bb.0:
+; SI-NEXT: s_load_dword s1, s[2:3], 0xb
+; SI-NEXT: v_mov_b32_e32 v0, 0xff803fe1
+; SI-NEXT: s_mov_b32 s0, 0
+; SI-NEXT: s_mov_b32 s3, 0xf000
+; SI-NEXT: s_waitcnt lgkmcnt(0)
+; SI-NEXT: s_lshr_b32 s1, s1, 9
+; SI-NEXT: v_mul_hi_u32 v0, s1, v0
+; SI-NEXT: s_mul_i32 s1, s1, 0xff803fe1
+; SI-NEXT: v_alignbit_b32 v0, v0, s1, 1
+; SI-NEXT: s_mov_b32 s2, -1
+; SI-NEXT: s_mov_b32 s1, s0
+; SI-NEXT: buffer_store_dword v0, off, s[0:3], 0
+; SI-NEXT: s_endpgm
+;
+; VI-LABEL: test_umul_i24:
+; VI: ; %bb.0:
+; VI-NEXT: s_load_dword s0, s[2:3], 0x2c
+; VI-NEXT: v_mov_b32_e32 v0, 0xff803fe1
+; VI-NEXT: s_mov_b32 s3, 0xf000
+; VI-NEXT: s_mov_b32 s2, -1
+; VI-NEXT: s_waitcnt lgkmcnt(0)
+; VI-NEXT: s_lshr_b32 s0, s0, 9
+; VI-NEXT: v_mad_u64_u32 v[0:1], s[0:1], s0, v0, 0
+; VI-NEXT: s_mov_b32 s0, 0
+; VI-NEXT: s_mov_b32 s1, s0
+; VI-NEXT: v_alignbit_b32 v0, v1, v0, 1
+; VI-NEXT: s_nop 1
+; VI-NEXT: buffer_store_dword v0, off, s[0:3], 0
+; VI-NEXT: s_endpgm
+;
+; GFX9-LABEL: test_umul_i24:
+; GFX9: ; %bb.0:
+; GFX9-NEXT: s_load_dword s1, s[2:3], 0x2c
+; GFX9-NEXT: s_mov_b32 s0, 0
+; GFX9-NEXT: s_mov_b32 s3, 0xf000
+; GFX9-NEXT: s_mov_b32 s2, -1
+; GFX9-NEXT: s_waitcnt lgkmcnt(0)
+; GFX9-NEXT: s_lshr_b32 s1, s1, 9
+; GFX9-NEXT: s_mul_hi_u32 s4, s1, 0xff803fe1
+; GFX9-NEXT: s_mul_i32 s1, s1, 0xff803fe1
+; GFX9-NEXT: v_mov_b32_e32 v0, s1
+; GFX9-NEXT: v_alignbit_b32 v0, s4, v0, 1
+; GFX9-NEXT: s_mov_b32 s1, s0
+; GFX9-NEXT: buffer_store_dword v0, off, s[0:3], 0
+; GFX9-NEXT: s_endpgm
+;
+; EG-LABEL: test_umul_i24:
+; EG: ; %bb.0:
+; EG-NEXT: ALU 8, @4, KC0[CB0:0-32], KC1[]
+; EG-NEXT: MEM_RAT_CACHELESS STORE_RAW T0.X, T1.X, 1
+; EG-NEXT: CF_END
+; EG-NEXT: PAD
+; EG-NEXT: ALU clause starting at 4:
+; EG-NEXT: LSHR * T0.W, KC0[2].Z, literal.x,
+; EG-NEXT: 9(1.261169e-44), 0(0.000000e+00)
+; EG-NEXT: MULHI * T0.X, PV.W, literal.x,
+; EG-NEXT: -8372255(nan), 0(0.000000e+00)
+; EG-NEXT: MULLO_INT * T0.Y, T0.W, literal.x,
+; EG-NEXT: -8372255(nan), 0(0.000000e+00)
+; EG-NEXT: BIT_ALIGN_INT T0.X, T0.X, PS, 1,
+; EG-NEXT: MOV * T1.X, literal.x,
+; EG-NEXT: 0(0.000000e+00), 0(0.000000e+00)
+;
+; CM-LABEL: test_umul_i24:
+; CM: ; %bb.0:
+; CM-NEXT: ALU 14, @4, KC0[CB0:0-32], KC1[]
+; CM-NEXT: MEM_RAT_CACHELESS STORE_DWORD T0.X, T1.X
+; CM-NEXT: CF_END
+; CM-NEXT: PAD
+; CM-NEXT: ALU clause starting at 4:
+; CM-NEXT: LSHR * T0.W, KC0[2].Z, literal.x,
+; CM-NEXT: 9(1.261169e-44), 0(0.000000e+00)
+; CM-NEXT: MULHI T0.X, T0.W, literal.x,
+; CM-NEXT: MULHI T0.Y (MASKED), T0.W, literal.x,
+; CM-NEXT: MULHI T0.Z (MASKED), T0.W, literal.x,
+; CM-NEXT: MULHI * T0.W (MASKED), T0.W, literal.x,
+; CM-NEXT: -8372255(nan), 0(0.000000e+00)
+; CM-NEXT: MULLO_INT T0.X (MASKED), T0.W, literal.x,
+; CM-NEXT: MULLO_INT T0.Y, T0.W, literal.x,
+; CM-NEXT: MULLO_INT T0.Z (MASKED), T0.W, literal.x,
+; CM-NEXT: MULLO_INT * T0.W (MASKED), T0.W, literal.x,
+; CM-NEXT: -8372255(nan), 0(0.000000e+00)
+; CM-NEXT: BIT_ALIGN_INT * T0.X, T0.X, PV.Y, 1,
+; CM-NEXT: MOV * T1.X, literal.x,
+; CM-NEXT: 0(0.000000e+00), 0(0.000000e+00)
+ %i = lshr i32 %arg, 9
+ %i1 = zext i32 %i to i64
+ %i2 = mul i64 %i1, 4286595041
+ %i3 = lshr i64 %i2, 1
+ %i4 = trunc i64 %i3 to i32
+ store i32 %i4, ptr addrspace(1) null, align 4
+ ret void
+}
+
attributes #0 = { nounwind }
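To make the sign difference concrete, here is a minimal standalone C++ sketch (an illustration only, not code from the patch; the constant 0xff803fe1 is borrowed from the new test above, the other operand is arbitrary). It compares the unsigned 32 x 32 -> 64 product that UMUL_LOHI defines with the signed product that MUL_I24/MULHI_I24 would compute if the combine ignored signedness:

// Standalone illustration, not part of the patch.
#include <cinttypes>
#include <cstdint>
#include <cstdio>

int main() {
  uint32_t A = 123456u;     // fits in 24 bits, signed or unsigned
  uint32_t B = 0xff803fe1u; // fits in 24 signed bits (-8372255), not in 24 unsigned bits

  // What UMUL_LOHI defines: the unsigned 32 x 32 -> 64 product.
  uint64_t U = (uint64_t)A * (uint64_t)B;
  // What MUL_I24/MULHI_I24 compute: the signed 32 x 32 -> 64 product.
  int64_t S = (int64_t)(int32_t)A * (int64_t)(int32_t)B;

  printf("lo: unsigned=0x%08" PRIx32 "  signed=0x%08" PRIx32 "\n",
         (uint32_t)U, (uint32_t)(uint64_t)S);
  printf("hi: unsigned=0x%08" PRIx32 "  signed=0x%08" PRIx32 "\n",
         (uint32_t)(U >> 32), (uint32_t)((uint64_t)S >> 32));
  return 0;
}

The low 32 bits of the two products agree, but the high 32 bits do not, so a UMUL_LOHI folded to MUL_I24/MULHI_I24 (the old behaviour for operands that only pass the isI24 check) returns the wrong high half. The combine now chooses the opcode pair from the signedness of the original MUL_LOHI node.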
Hi, since we are wrapping up LLVM 19.1.0 we are very strict about the fixes we pick at this point. Can you please respond to the following questions to help me understand whether this has to be included in the final release? Is this PR a fix for a regression or a critical issue? What is the risk of accepting this into the release branch? What is the risk of NOT accepting this into the release branch?
No, I believe it has been broken for about 3 years (since d7e03df), but it was only reported to me recently. I guess this means it is not appropriate for 19.1.0.
Cc @marekolsak FYI.
ROCm should consider it a critical issue because it fixes int64 multiplication by a constant. It's less critical for other LLVM users like Mesa.
I am unsure what ROCm is in this case and what it would mean for LLVM users if this fix is not included.
GPU compute stack. Current ROCm releases are based on the LLVM release branches (in the past they were more random branch points).
SMUL_LOHI and UMUL_LOHI are different operations because the high part of the result is different, so it is not OK to optimize the signed version to MUL_U24/MULHI_U24 or the unsigned version to MUL_I24/MULHI_I24.
Force-pushed from 04226ba to 93998af.
@jayfoad (or anyone else): if you would like to add a note about this fix to the release notes (completely optional), please reply to this comment with a one- or two-sentence description of the fix. When you are done, please add the release:note label to this PR.
Radeon Open Compute, which is (very roughly) AMD's CUDA. |
SMUL_LOHI and UMUL_LOHI are different operations because the high part of the result is different, so it is not OK to optimize the signed version to MUL_U24/MULHI_U24 or the unsigned version to MUL_I24/MULHI_I24.