AMDGPU: Make v2f32 -> v2f16 legal when target supports v_cvt_pk_f16_f32 #139956

changpeng · 2025-05-14T20:20:44Z

If targets support v_cvt_pk_f16_f32 instruction, v2f32 -> v2f16 should be legal. However, SelectionDAG does not allow us to specify the source type in the legalization rules. To workaround this, we make FP_ROUND Custom for v2f16 then set up v2f32 -> v2f16 to be legal during custom lowering.

Fixes: SWDEV-532608 -- expected v_cvt_pk_f16_f32 was not generated.

If targets support v_cvt_pk_f16_f32 instruction, v2f32 -> v2f16 should be legal. However, SelectionDAG does not allow us to specify the source type in the legalization rules. To workaround this, we make FP_ROUND Custom for v2f16 then set up v2f32 -> v2f16 to be legal during custom lowering. Fixes: SWDEV-532608 -- expected v_cvt_pk_f16_f32 was not generated.

llvmbot · 2025-05-14T20:21:18Z

@llvm/pr-subscribers-backend-amdgpu

Author: Changpeng Fang (changpeng)

Changes

If targets support v_cvt_pk_f16_f32 instruction, v2f32 -> v2f16 should be legal. However, SelectionDAG does not allow us to specify the source type in the legalization rules. To workaround this, we make FP_ROUND Custom for v2f16 then set up v2f32 -> v2f16 to be legal during custom lowering.

Fixes: SWDEV-532608 -- expected v_cvt_pk_f16_f32 was not generated.

Full diff: https://github.com/llvm/llvm-project/pull/139956.diff

2 Files Affected:

(modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+10-1)
(renamed) llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.math.ll (+14)

diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 5cd6561914364..70f9485c3e5b4 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -919,6 +919,9 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
     setOperationAction(ISD::BUILD_VECTOR, MVT::v2bf16, Legal);
   }
 
+  if (Subtarget->hasCvtPkF16F32Inst())
+    setOperationAction(ISD::FP_ROUND, MVT::v2f16, Custom);
+
   setTargetDAGCombine({ISD::ADD,
                        ISD::UADDO_CARRY,
                        ISD::SUB,
@@ -6899,10 +6902,16 @@ SDValue SITargetLowering::getFPExtOrFPRound(SelectionDAG &DAG, SDValue Op,
 SDValue SITargetLowering::lowerFP_ROUND(SDValue Op, SelectionDAG &DAG) const {
   SDValue Src = Op.getOperand(0);
   EVT SrcVT = Src.getValueType();
+  EVT DstVT = Op.getValueType();
+
+  if (DstVT == MVT::v2f16) {
+    assert(Subtarget->hasCvtPkF16F32Inst() && "support v_cvt_pk_f16_f32");
+    return SrcVT == MVT::v2f32 ? Op : SDValue();
+  }
+
   if (SrcVT.getScalarType() != MVT::f64)
     return Op;
 
-  EVT DstVT = Op.getValueType();
   SDLoc DL(Op);
   if (DstVT == MVT::f16) {
     // TODO: Handle strictfp
diff --git a/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.path.ll b/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.math.ll
similarity index 93%
rename from llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.path.ll
rename to llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.math.ll
index 5a5e39489d888..e5815e96fbe33 100644
--- a/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.path.ll
+++ b/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.math.ll
@@ -12,6 +12,20 @@ define <2 x half> @v_test_cvt_v2f32_v2f16(<2 x float> %src) {
   ret <2 x half> %res
 }
 
+define half @fptrunc_v2f32_v2f16_then_extract(<2 x float> %src) {
+; GFX950-LABEL: fptrunc_v2f32_v2f16_then_extract:
+; GFX950:       ; %bb.0:
+; GFX950-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX950-NEXT:    v_cvt_pk_f16_f32 v0, v0, v1
+; GFX950-NEXT:    v_add_f16_sdwa v0, v0, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD
+; GFX950-NEXT:    s_setpc_b64 s[30:31]
+  %vec_half = fptrunc <2 x float> %src to <2 x half>
+  %first = extractelement <2 x half> %vec_half, i64 1
+  %second = extractelement <2 x half> %vec_half, i64 0
+  %res = fadd half %first, %second
+  ret half %res
+}
+
 define <2 x half> @v_test_cvt_v2f64_v2f16(<2 x double> %src) {
 ; GFX950-SDAG-LABEL: v_test_cvt_v2f64_v2f16:
 ; GFX950-SDAG:       ; %bb.0:

shiltian · 2025-05-14T20:27:03Z

However, SelectionDAG does not allow us to specify the source type in the legalization rules.

I find it sometimes confusing when I look at those code. The comment of setOperationAction says:

  /// Indicate that the specified operation does not work with the specified
  /// type and indicate what to do about it. Note that VT may refer to either
  /// the type of a result or that of an operand of Op.

Then in what cases it refers to a destination type and in what cases it refers to the source operand type?

changpeng · 2025-05-14T20:59:41Z

Then in what cases it refers to a destination type and in what cases it refers to the source operand type?

It is indeed very confusion. I was thinking it is hard to specify operand types (you may also have multiple operands).
I am not sure how does it work in details, but I am seeing:
setOperationAction(ISD::STORE, MVT::v2f16, Promote);

arsenm · 2025-05-14T21:35:33Z

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

+  if (DstVT == MVT::v2f16) {
+    assert(Subtarget->hasCvtPkF16F32Inst() && "support v_cvt_pk_f16_f32");


The same should apply to all vector types.

The same should apply to all vector types.

I don't think so. You can not make v3f32 and v4f32 legal because there is no corresponding instructions. Instead, we have to "Expand" (split) for wider vector types. It is another subject how to split the wider vector to take advantage of the packed conversion.

arsenm · 2025-05-14T21:37:40Z

llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.math.ll

+  %res = fadd half %first, %second
+  ret half %res
+}
+


Are we really missing test coverage for this and wider vector types? Can you add that? In particular we need to make sure we get 2 x pk instructions in the 4 x case, and 1 pk and one scalar op in the 3 x case

No. This change only takes care of v2f16. It is beyond this work to test wider vectors.
Additional work is needed for wider vector splitting (in order to generate the packed conversions)

It is not beyond this to test wider vectors. Please add the missing tests in a follow up

The fadd use here should also be unnecessary

It is not beyond this to test wider vectors. Please add the missing tests in a follow up

Do you want to simply add the wider vector tests, or also add the logic to split the wider vector wisely to form the packed conversion?

The fadd use here should also be unnecessary

Then what to do after extracts? I think it is the extracts that prevent instcombining. We need two uses of the extracts?

If the extract use is interesting, should be another test in addition to the base one with a direct return or store of the vector value

If the extract use is interesting, should be another test in addition to the base one with a direct return or store of the vector value

The base test was already there in the same file. In the base test, the expected v_cvt_pk_f16_f32 will always be generated. It is the extract use case that inspired this patch to make v2f32->v2f16 lega.

…32 (llvm#139956) (llvm#2186) If targets support v_cvt_pk_f16_f32 instruction, v2f32 -> v2f16 should be legal. However, SelectionDAG does not allow us to specify the source type in the legalization rules. To workaround this, we make FP_ROUND Custom for v2f16 then set up v2f32 -> v2f16 to be legal during custom lowering. Fixes: SWDEV-532608 -- expected v_cvt_pk_f16_f32 was not generated. Cherry-pick f01f082 from upstreaam to mainline.

Pulls in llvm/llvm-project#139956 to fix an regression on AMD backend. Also to pick up llvm/llvm-project#140183.

GFx950+ supports v_cvt_pk_f16_f32. However current implementation of vector fptrunc lowering fully scalarizes the vector, and the scalar conversions may not always be combined to generate the packed one. We made v2f32 -> v2f16 legal in llvm#139956. This work is an extension to handle wider vectors. Instead of fully scalarization, we split the vector to packs (v2f32 -> v2f16) to ensure the packed conversion can always been generated.

GFx950+ supports v_cvt_pk_f16_f32. However current implementation of vector fptrunc lowering fully scalarizes the vector, and the scalar conversions may not always be combined to generate the packed one. We made v2f32 -> v2f16 legal in llvm#139956. This work is an extension to handle wider vectors. Instead of fully scalarization, we split the vector to packs (v2f32 -> v2f16) to ensure the packed conversion can always been generated. Fix clang-format

GFx950+ supports v_cvt_pk_f16_f32. However current implementation of vector fptrunc lowering fully scalarizes the vector, and the scalar conversions may not always be combined to generate the packed one. We made v2f32 -> v2f16 legal in llvm#139956. This work is an extension to handle wider vectors. Instead of fully scalarization, we split the vector to packs (v2f32 -> v2f16) to ensure the packed conversion can always been generated. NOTE: 1) re-implement the helper function the sam way as splitUnaryVectorOp; 2) Use fewerElementsIf instead of custom lowering on the global isel path 3) add checks for 3x and 16x vectors

GFx950+ supports v_cvt_pk_f16_f32. However current implementation of vector fptrunc lowering fully scalarizes the vector, and the scalar conversions may not always be combined to generate the packed one. We made v2f32 -> v2f16 legal in llvm#139956. This work is an extension to handle wider vectors. Instead of fully scalarization, we split the vector to packs (v2f32 -> v2f16) to ensure the packed conversion can always been generated. NOTE: minor changes

GFx950+ supports v_cvt_pk_f16_f32. However current implementation of vector fptrunc lowering fully scalarizes the vector, and the scalar conversions may not always be combined to generate the packed one. We made v2f32 -> v2f16 legal in llvm#139956. This work is an extension to handle wider vectors. Instead of fully scalarization, we split the vector to packs (v2f32 -> v2f16) to ensure the packed conversion can always been generated. NOTE: Use .clampMaxNumElements(0, S16, 2)

The latest asics support v_cvt_pk_f16_f32 instruction. However current implementation of vector fptrunc lowering fully scalarizes the vectors, and the scalar conversions may not always be combined to generate the packed one. We made v2f32 -> v2f16 legal in #139956. This work is an extension to handle wider vectors. Instead of fully scalarization, we split the vector to packs (v2f32 -> v2f16) to ensure the packed conversion can always been generated.

The latest asics support v_cvt_pk_f16_f32 instruction. However current implementation of vector fptrunc lowering fully scalarizes the vectors, and the scalar conversions may not always be combined to generate the packed one. We made v2f32 -> v2f16 legal in llvm/llvm-project#139956. This work is an extension to handle wider vectors. Instead of fully scalarization, we split the vector to packs (v2f32 -> v2f16) to ensure the packed conversion can always been generated.

The latest asics support v_cvt_pk_f16_f32 instruction. However current implementation of vector fptrunc lowering fully scalarizes the vectors, and the scalar conversions may not always be combined to generate the packed one. We made v2f32 -> v2f16 legal in llvm#139956. This work is an extension to handle wider vectors. Instead of fully scalarization, we split the vector to packs (v2f32 -> v2f16) to ensure the packed conversion can always been generated.

…32 (llvm#139956) (llvm#2478) If targets support v_cvt_pk_f16_f32 instruction, v2f32 -> v2f16 should be legal. However, SelectionDAG does not allow us to specify the source type in the legalization rules. To workaround this, we make FP_ROUND Custom for v2f16 then set up v2f32 -> v2f16 to be legal during custom lowering. Cherry-pick llvm#139956 to mainline (after bulk promotion) Fixes: SWDEV-532608 -- expected v_cvt_pk_f16_f32 was not generated.

llvmbot added the backend:AMDGPU label May 14, 2025

changpeng requested review from rampitec and arsenm May 14, 2025 20:22

rampitec approved these changes May 14, 2025

View reviewed changes

arsenm reviewed May 14, 2025

View reviewed changes

changpeng merged commit f01f082 into llvm:main May 15, 2025
13 checks passed

changpeng deleted the bf16 branch May 15, 2025 22:09

antiagainst mentioned this pull request May 20, 2025

[BACKEND] bump to llvm/llvm-project@e12cbd8 triton-lang/triton#6880

Merged

ThomasRaoux pushed a commit to triton-lang/triton that referenced this pull request May 23, 2025

[BACKEND] bump to llvm/llvm-project@e12cbd8 (#6880)

75fe113

Pulls in llvm/llvm-project#139956 to fix an regression on AMD backend. Also to pick up llvm/llvm-project#140183.

zwu-2025 pushed a commit to zwu-2025/triton that referenced this pull request May 27, 2025

[BACKEND] bump to llvm/llvm-project@e12cbd8 (triton-lang#6880)

2d76300

Pulls in llvm/llvm-project#139956 to fix an regression on AMD backend. Also to pick up llvm/llvm-project#140183.

changpeng mentioned this pull request May 29, 2025

AMDGPU: Custom lower fptrunc vectors for f32 -> f16 #141883

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

AMDGPU: Make v2f32 -> v2f16 legal when target supports v_cvt_pk_f16_f32 #139956

AMDGPU: Make v2f32 -> v2f16 legal when target supports v_cvt_pk_f16_f32 #139956

Uh oh!

changpeng commented May 14, 2025

Uh oh!

llvmbot commented May 14, 2025

Uh oh!

shiltian commented May 14, 2025

Uh oh!

changpeng commented May 14, 2025

Uh oh!

arsenm May 14, 2025

Uh oh!

changpeng May 14, 2025

Uh oh!

arsenm May 14, 2025

Uh oh!

changpeng May 14, 2025

Uh oh!

arsenm May 16, 2025

Uh oh!

arsenm May 16, 2025

Uh oh!

changpeng May 16, 2025 •

edited

Loading

Uh oh!

changpeng May 16, 2025 •

edited

Loading

Uh oh!

arsenm May 21, 2025

Uh oh!

changpeng May 21, 2025

Uh oh!

Uh oh!

Uh oh!

		if (DstVT == MVT::v2f16) {
		assert(Subtarget->hasCvtPkF16F32Inst() && "support v_cvt_pk_f16_f32");

AMDGPU: Make v2f32 -> v2f16 legal when target supports v_cvt_pk_f16_f32 #139956

AMDGPU: Make v2f32 -> v2f16 legal when target supports v_cvt_pk_f16_f32 #139956

Uh oh!

Conversation

changpeng commented May 14, 2025

Uh oh!

llvmbot commented May 14, 2025

Uh oh!

shiltian commented May 14, 2025

Uh oh!

changpeng commented May 14, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

changpeng May 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

changpeng May 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

changpeng May 16, 2025 •

edited

Loading

changpeng May 16, 2025 •

edited

Loading