Skip to content

AMDGPU: Make v2f32 -> v2f16 legal when target supports v_cvt_pk_f16_f32 #139956

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 15, 2025

Conversation

changpeng
Copy link
Contributor

If targets support v_cvt_pk_f16_f32 instruction, v2f32 -> v2f16 should be legal. However, SelectionDAG does not allow us to specify the source type in the legalization rules. To workaround this, we make FP_ROUND Custom for v2f16 then set up v2f32 -> v2f16 to be legal during custom lowering.

Fixes: SWDEV-532608 -- expected v_cvt_pk_f16_f32 was not generated.

  If targets support v_cvt_pk_f16_f32 instruction, v2f32 -> v2f16 should be
legal. However, SelectionDAG does not allow us to specify the source type in
the legalization rules. To workaround this, we make FP_ROUND Custom for v2f16
then set up v2f32 -> v2f16 to be legal during custom lowering.

Fixes: SWDEV-532608 -- expected v_cvt_pk_f16_f32 was not generated.
@llvmbot
Copy link
Member

llvmbot commented May 14, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Changpeng Fang (changpeng)

Changes

If targets support v_cvt_pk_f16_f32 instruction, v2f32 -> v2f16 should be legal. However, SelectionDAG does not allow us to specify the source type in the legalization rules. To workaround this, we make FP_ROUND Custom for v2f16 then set up v2f32 -> v2f16 to be legal during custom lowering.

Fixes: SWDEV-532608 -- expected v_cvt_pk_f16_f32 was not generated.


Full diff: https://github.com/llvm/llvm-project/pull/139956.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+10-1)
  • (renamed) llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.math.ll (+14)
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 5cd6561914364..70f9485c3e5b4 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -919,6 +919,9 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
     setOperationAction(ISD::BUILD_VECTOR, MVT::v2bf16, Legal);
   }
 
+  if (Subtarget->hasCvtPkF16F32Inst())
+    setOperationAction(ISD::FP_ROUND, MVT::v2f16, Custom);
+
   setTargetDAGCombine({ISD::ADD,
                        ISD::UADDO_CARRY,
                        ISD::SUB,
@@ -6899,10 +6902,16 @@ SDValue SITargetLowering::getFPExtOrFPRound(SelectionDAG &DAG, SDValue Op,
 SDValue SITargetLowering::lowerFP_ROUND(SDValue Op, SelectionDAG &DAG) const {
   SDValue Src = Op.getOperand(0);
   EVT SrcVT = Src.getValueType();
+  EVT DstVT = Op.getValueType();
+
+  if (DstVT == MVT::v2f16) {
+    assert(Subtarget->hasCvtPkF16F32Inst() && "support v_cvt_pk_f16_f32");
+    return SrcVT == MVT::v2f32 ? Op : SDValue();
+  }
+
   if (SrcVT.getScalarType() != MVT::f64)
     return Op;
 
-  EVT DstVT = Op.getValueType();
   SDLoc DL(Op);
   if (DstVT == MVT::f16) {
     // TODO: Handle strictfp
diff --git a/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.path.ll b/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.math.ll
similarity index 93%
rename from llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.path.ll
rename to llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.math.ll
index 5a5e39489d888..e5815e96fbe33 100644
--- a/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.path.ll
+++ b/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.math.ll
@@ -12,6 +12,20 @@ define <2 x half> @v_test_cvt_v2f32_v2f16(<2 x float> %src) {
   ret <2 x half> %res
 }
 
+define half @fptrunc_v2f32_v2f16_then_extract(<2 x float> %src) {
+; GFX950-LABEL: fptrunc_v2f32_v2f16_then_extract:
+; GFX950:       ; %bb.0:
+; GFX950-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX950-NEXT:    v_cvt_pk_f16_f32 v0, v0, v1
+; GFX950-NEXT:    v_add_f16_sdwa v0, v0, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD
+; GFX950-NEXT:    s_setpc_b64 s[30:31]
+  %vec_half = fptrunc <2 x float> %src to <2 x half>
+  %first = extractelement <2 x half> %vec_half, i64 1
+  %second = extractelement <2 x half> %vec_half, i64 0
+  %res = fadd half %first, %second
+  ret half %res
+}
+
 define <2 x half> @v_test_cvt_v2f64_v2f16(<2 x double> %src) {
 ; GFX950-SDAG-LABEL: v_test_cvt_v2f64_v2f16:
 ; GFX950-SDAG:       ; %bb.0:

@changpeng changpeng requested review from rampitec and arsenm May 14, 2025 20:22
@shiltian
Copy link
Contributor

However, SelectionDAG does not allow us to specify the source type in the legalization rules.

I find it sometimes confusing when I look at those code. The comment of setOperationAction says:

  /// Indicate that the specified operation does not work with the specified
  /// type and indicate what to do about it. Note that VT may refer to either
  /// the type of a result or that of an operand of Op.

Then in what cases it refers to a destination type and in what cases it refers to the source operand type?

@changpeng
Copy link
Contributor Author

Then in what cases it refers to a destination type and in what cases it refers to the source operand type?

It is indeed very confusion. I was thinking it is hard to specify operand types (you may also have multiple operands).
I am not sure how does it work in details, but I am seeing:
setOperationAction(ISD::STORE, MVT::v2f16, Promote);

Comment on lines +6907 to +6908
if (DstVT == MVT::v2f16) {
assert(Subtarget->hasCvtPkF16F32Inst() && "support v_cvt_pk_f16_f32");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same should apply to all vector types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same should apply to all vector types.

I don't think so. You can not make v3f32 and v4f32 legal because there is no corresponding instructions. Instead, we have to "Expand" (split) for wider vector types. It is another subject how to split the wider vector to take advantage of the packed conversion.

%res = fadd half %first, %second
ret half %res
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we really missing test coverage for this and wider vector types? Can you add that? In particular we need to make sure we get 2 x pk instructions in the 4 x case, and 1 pk and one scalar op in the 3 x case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. This change only takes care of v2f16. It is beyond this work to test wider vectors.
Additional work is needed for wider vector splitting (in order to generate the packed conversions)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not beyond this to test wider vectors. Please add the missing tests in a follow up

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fadd use here should also be unnecessary

Copy link
Contributor Author

@changpeng changpeng May 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not beyond this to test wider vectors. Please add the missing tests in a follow up

Do you want to simply add the wider vector tests, or also add the logic to split the wider vector wisely to form the packed conversion?

Copy link
Contributor Author

@changpeng changpeng May 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fadd use here should also be unnecessary

Then what to do after extracts? I think it is the extracts that prevent instcombining. We need two uses of the extracts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the extract use is interesting, should be another test in addition to the base one with a direct return or store of the vector value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the extract use is interesting, should be another test in addition to the base one with a direct return or store of the vector value

The base test was already there in the same file. In the base test, the expected v_cvt_pk_f16_f32 will always be generated. It is the extract use case that inspired this patch to make v2f32->v2f16 lega.

@changpeng changpeng merged commit f01f082 into llvm:main May 15, 2025
13 checks passed
@changpeng changpeng deleted the bf16 branch May 15, 2025 22:09
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request May 20, 2025
…32 (llvm#139956) (llvm#2186)

If targets support v_cvt_pk_f16_f32 instruction, v2f32 -> v2f16 should
be legal. However, SelectionDAG does not allow us to specify the source
type in the legalization rules. To workaround this, we make FP_ROUND
Custom for v2f16 then set up v2f32 -> v2f16 to be legal during custom
lowering.

Fixes: SWDEV-532608 -- expected v_cvt_pk_f16_f32 was not generated.

Cherry-pick f01f082 from upstreaam to mainline.
ThomasRaoux pushed a commit to triton-lang/triton that referenced this pull request May 23, 2025
Pulls in llvm/llvm-project#139956 to fix an
regression on AMD backend. Also to pick up
llvm/llvm-project#140183.
zwu-2025 pushed a commit to zwu-2025/triton that referenced this pull request May 27, 2025
changpeng added a commit to changpeng/llvm-project that referenced this pull request May 29, 2025
  GFx950+ supports v_cvt_pk_f16_f32. However current implementation
of vector fptrunc lowering fully scalarizes the vector, and the scalar
conversions may not always be combined to generate the packed one.

  We made v2f32 -> v2f16 legal in llvm#139956.
This work is an extension to handle wider vectors. Instead of fully scalarization,
we split the vector to packs (v2f32 -> v2f16) to ensure the packed conversion can always
been generated.
changpeng added a commit to changpeng/llvm-project that referenced this pull request May 29, 2025
  GFx950+ supports v_cvt_pk_f16_f32. However current implementation
of vector fptrunc lowering fully scalarizes the vector, and the scalar
conversions may not always be combined to generate the packed one.
  We made v2f32 -> v2f16 legal in llvm#139956.
This work is an extension to handle wider vectors. Instead of fully scalarization,
we split the vector to packs (v2f32 -> v2f16) to ensure the packed conversion can always
been generated.

 Fix clang-format
changpeng added a commit to changpeng/llvm-project that referenced this pull request May 29, 2025
  GFx950+ supports v_cvt_pk_f16_f32. However current implementation
of vector fptrunc lowering fully scalarizes the vector, and the scalar
conversions may not always be combined to generate the packed one.
  We made v2f32 -> v2f16 legal in llvm#139956.
This work is an extension to handle wider vectors. Instead of fully scalarization,
we split the vector to packs (v2f32 -> v2f16) to ensure the packed conversion can always
been generated.

NOTE:
 1) re-implement the helper function the sam way as splitUnaryVectorOp;
 2) Use fewerElementsIf instead of custom lowering on the global isel path
 3) add checks for 3x and 16x vectors
changpeng added a commit to changpeng/llvm-project that referenced this pull request May 29, 2025
  GFx950+ supports v_cvt_pk_f16_f32. However current implementation
of vector fptrunc lowering fully scalarizes the vector, and the scalar
conversions may not always be combined to generate the packed one.
  We made v2f32 -> v2f16 legal in llvm#139956.
This work is an extension to handle wider vectors. Instead of fully scalarization,
we split the vector to packs (v2f32 -> v2f16) to ensure the packed conversion can always
been generated.

NOTE:
     minor changes
changpeng added a commit to changpeng/llvm-project that referenced this pull request May 29, 2025
  GFx950+ supports v_cvt_pk_f16_f32. However current implementation
of vector fptrunc lowering fully scalarizes the vector, and the scalar
conversions may not always be combined to generate the packed one.
  We made v2f32 -> v2f16 legal in llvm#139956.
This work is an extension to handle wider vectors. Instead of fully scalarization,
we split the vector to packs (v2f32 -> v2f16) to ensure the packed conversion can always
been generated.

NOTE: Use .clampMaxNumElements(0, S16, 2)
changpeng added a commit that referenced this pull request Jun 6, 2025
The latest asics support v_cvt_pk_f16_f32 instruction. However current
implementation of vector fptrunc lowering fully scalarizes the vectors,
and the scalar conversions may not always be combined to generate the
packed one.
We made v2f32 -> v2f16 legal in
#139956. This work is an
extension to handle wider vectors. Instead of fully scalarization, we
split the vector to packs (v2f32 -> v2f16) to ensure the packed
conversion can always been generated.
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Jun 6, 2025
The latest asics support v_cvt_pk_f16_f32 instruction. However current
implementation of vector fptrunc lowering fully scalarizes the vectors,
and the scalar conversions may not always be combined to generate the
packed one.
We made v2f32 -> v2f16 legal in
llvm/llvm-project#139956. This work is an
extension to handle wider vectors. Instead of fully scalarization, we
split the vector to packs (v2f32 -> v2f16) to ensure the packed
conversion can always been generated.
rorth pushed a commit to rorth/llvm-project that referenced this pull request Jun 11, 2025
The latest asics support v_cvt_pk_f16_f32 instruction. However current
implementation of vector fptrunc lowering fully scalarizes the vectors,
and the scalar conversions may not always be combined to generate the
packed one.
We made v2f32 -> v2f16 legal in
llvm#139956. This work is an
extension to handle wider vectors. Instead of fully scalarization, we
split the vector to packs (v2f32 -> v2f16) to ensure the packed
conversion can always been generated.
DhruvSrivastavaX pushed a commit to DhruvSrivastavaX/lldb-for-aix that referenced this pull request Jun 12, 2025
The latest asics support v_cvt_pk_f16_f32 instruction. However current
implementation of vector fptrunc lowering fully scalarizes the vectors,
and the scalar conversions may not always be combined to generate the
packed one.
We made v2f32 -> v2f16 legal in
llvm#139956. This work is an
extension to handle wider vectors. Instead of fully scalarization, we
split the vector to packs (v2f32 -> v2f16) to ensure the packed
conversion can always been generated.
tomtor pushed a commit to tomtor/llvm-project that referenced this pull request Jun 14, 2025
The latest asics support v_cvt_pk_f16_f32 instruction. However current
implementation of vector fptrunc lowering fully scalarizes the vectors,
and the scalar conversions may not always be combined to generate the
packed one.
We made v2f32 -> v2f16 legal in
llvm#139956. This work is an
extension to handle wider vectors. Instead of fully scalarization, we
split the vector to packs (v2f32 -> v2f16) to ensure the packed
conversion can always been generated.
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Jun 18, 2025
…32 (llvm#139956) (llvm#2478)

If targets support v_cvt_pk_f16_f32 instruction, v2f32 -> v2f16 should
be legal. However, SelectionDAG does not allow us to specify the source
type in the legalization rules. To workaround this, we make FP_ROUND
Custom for v2f16 then set up v2f32 -> v2f16 to be legal during custom
lowering.

Cherry-pick llvm#139956 to mainline
(after bulk promotion)

Fixes: SWDEV-532608 -- expected v_cvt_pk_f16_f32 was not generated.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants