-
Notifications
You must be signed in to change notification settings - Fork 14.3k
AMDGPU: Make v2f32 -> v2f16 legal when target supports v_cvt_pk_f16_f32 #139956
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
If targets support v_cvt_pk_f16_f32 instruction, v2f32 -> v2f16 should be legal. However, SelectionDAG does not allow us to specify the source type in the legalization rules. To workaround this, we make FP_ROUND Custom for v2f16 then set up v2f32 -> v2f16 to be legal during custom lowering. Fixes: SWDEV-532608 -- expected v_cvt_pk_f16_f32 was not generated.
@llvm/pr-subscribers-backend-amdgpu Author: Changpeng Fang (changpeng) ChangesIf targets support v_cvt_pk_f16_f32 instruction, v2f32 -> v2f16 should be legal. However, SelectionDAG does not allow us to specify the source type in the legalization rules. To workaround this, we make FP_ROUND Custom for v2f16 then set up v2f32 -> v2f16 to be legal during custom lowering. Fixes: SWDEV-532608 -- expected v_cvt_pk_f16_f32 was not generated. Full diff: https://github.com/llvm/llvm-project/pull/139956.diff 2 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 5cd6561914364..70f9485c3e5b4 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -919,6 +919,9 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
setOperationAction(ISD::BUILD_VECTOR, MVT::v2bf16, Legal);
}
+ if (Subtarget->hasCvtPkF16F32Inst())
+ setOperationAction(ISD::FP_ROUND, MVT::v2f16, Custom);
+
setTargetDAGCombine({ISD::ADD,
ISD::UADDO_CARRY,
ISD::SUB,
@@ -6899,10 +6902,16 @@ SDValue SITargetLowering::getFPExtOrFPRound(SelectionDAG &DAG, SDValue Op,
SDValue SITargetLowering::lowerFP_ROUND(SDValue Op, SelectionDAG &DAG) const {
SDValue Src = Op.getOperand(0);
EVT SrcVT = Src.getValueType();
+ EVT DstVT = Op.getValueType();
+
+ if (DstVT == MVT::v2f16) {
+ assert(Subtarget->hasCvtPkF16F32Inst() && "support v_cvt_pk_f16_f32");
+ return SrcVT == MVT::v2f32 ? Op : SDValue();
+ }
+
if (SrcVT.getScalarType() != MVT::f64)
return Op;
- EVT DstVT = Op.getValueType();
SDLoc DL(Op);
if (DstVT == MVT::f16) {
// TODO: Handle strictfp
diff --git a/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.path.ll b/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.math.ll
similarity index 93%
rename from llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.path.ll
rename to llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.math.ll
index 5a5e39489d888..e5815e96fbe33 100644
--- a/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.path.ll
+++ b/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.no.fast.math.ll
@@ -12,6 +12,20 @@ define <2 x half> @v_test_cvt_v2f32_v2f16(<2 x float> %src) {
ret <2 x half> %res
}
+define half @fptrunc_v2f32_v2f16_then_extract(<2 x float> %src) {
+; GFX950-LABEL: fptrunc_v2f32_v2f16_then_extract:
+; GFX950: ; %bb.0:
+; GFX950-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX950-NEXT: v_cvt_pk_f16_f32 v0, v0, v1
+; GFX950-NEXT: v_add_f16_sdwa v0, v0, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD
+; GFX950-NEXT: s_setpc_b64 s[30:31]
+ %vec_half = fptrunc <2 x float> %src to <2 x half>
+ %first = extractelement <2 x half> %vec_half, i64 1
+ %second = extractelement <2 x half> %vec_half, i64 0
+ %res = fadd half %first, %second
+ ret half %res
+}
+
define <2 x half> @v_test_cvt_v2f64_v2f16(<2 x double> %src) {
; GFX950-SDAG-LABEL: v_test_cvt_v2f64_v2f16:
; GFX950-SDAG: ; %bb.0:
|
I find it sometimes confusing when I look at those code. The comment of
Then in what cases it refers to a destination type and in what cases it refers to the source operand type? |
It is indeed very confusion. I was thinking it is hard to specify operand types (you may also have multiple operands). |
if (DstVT == MVT::v2f16) { | ||
assert(Subtarget->hasCvtPkF16F32Inst() && "support v_cvt_pk_f16_f32"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same should apply to all vector types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same should apply to all vector types.
I don't think so. You can not make v3f32 and v4f32 legal because there is no corresponding instructions. Instead, we have to "Expand" (split) for wider vector types. It is another subject how to split the wider vector to take advantage of the packed conversion.
%res = fadd half %first, %second | ||
ret half %res | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we really missing test coverage for this and wider vector types? Can you add that? In particular we need to make sure we get 2 x pk instructions in the 4 x case, and 1 pk and one scalar op in the 3 x case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. This change only takes care of v2f16. It is beyond this work to test wider vectors.
Additional work is needed for wider vector splitting (in order to generate the packed conversions)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not beyond this to test wider vectors. Please add the missing tests in a follow up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fadd use here should also be unnecessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not beyond this to test wider vectors. Please add the missing tests in a follow up
Do you want to simply add the wider vector tests, or also add the logic to split the wider vector wisely to form the packed conversion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fadd use here should also be unnecessary
Then what to do after extracts? I think it is the extracts that prevent instcombining. We need two uses of the extracts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the extract use is interesting, should be another test in addition to the base one with a direct return or store of the vector value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the extract use is interesting, should be another test in addition to the base one with a direct return or store of the vector value
The base test was already there in the same file. In the base test, the expected v_cvt_pk_f16_f32 will always be generated. It is the extract use case that inspired this patch to make v2f32->v2f16 lega.
…32 (llvm#139956) (llvm#2186) If targets support v_cvt_pk_f16_f32 instruction, v2f32 -> v2f16 should be legal. However, SelectionDAG does not allow us to specify the source type in the legalization rules. To workaround this, we make FP_ROUND Custom for v2f16 then set up v2f32 -> v2f16 to be legal during custom lowering. Fixes: SWDEV-532608 -- expected v_cvt_pk_f16_f32 was not generated. Cherry-pick f01f082 from upstreaam to mainline.
Pulls in llvm/llvm-project#139956 to fix an regression on AMD backend. Also to pick up llvm/llvm-project#140183.
Pulls in llvm/llvm-project#139956 to fix an regression on AMD backend. Also to pick up llvm/llvm-project#140183.
GFx950+ supports v_cvt_pk_f16_f32. However current implementation of vector fptrunc lowering fully scalarizes the vector, and the scalar conversions may not always be combined to generate the packed one. We made v2f32 -> v2f16 legal in llvm#139956. This work is an extension to handle wider vectors. Instead of fully scalarization, we split the vector to packs (v2f32 -> v2f16) to ensure the packed conversion can always been generated.
GFx950+ supports v_cvt_pk_f16_f32. However current implementation of vector fptrunc lowering fully scalarizes the vector, and the scalar conversions may not always be combined to generate the packed one. We made v2f32 -> v2f16 legal in llvm#139956. This work is an extension to handle wider vectors. Instead of fully scalarization, we split the vector to packs (v2f32 -> v2f16) to ensure the packed conversion can always been generated. Fix clang-format
GFx950+ supports v_cvt_pk_f16_f32. However current implementation of vector fptrunc lowering fully scalarizes the vector, and the scalar conversions may not always be combined to generate the packed one. We made v2f32 -> v2f16 legal in llvm#139956. This work is an extension to handle wider vectors. Instead of fully scalarization, we split the vector to packs (v2f32 -> v2f16) to ensure the packed conversion can always been generated. NOTE: 1) re-implement the helper function the sam way as splitUnaryVectorOp; 2) Use fewerElementsIf instead of custom lowering on the global isel path 3) add checks for 3x and 16x vectors
GFx950+ supports v_cvt_pk_f16_f32. However current implementation of vector fptrunc lowering fully scalarizes the vector, and the scalar conversions may not always be combined to generate the packed one. We made v2f32 -> v2f16 legal in llvm#139956. This work is an extension to handle wider vectors. Instead of fully scalarization, we split the vector to packs (v2f32 -> v2f16) to ensure the packed conversion can always been generated. NOTE: minor changes
GFx950+ supports v_cvt_pk_f16_f32. However current implementation of vector fptrunc lowering fully scalarizes the vector, and the scalar conversions may not always be combined to generate the packed one. We made v2f32 -> v2f16 legal in llvm#139956. This work is an extension to handle wider vectors. Instead of fully scalarization, we split the vector to packs (v2f32 -> v2f16) to ensure the packed conversion can always been generated. NOTE: Use .clampMaxNumElements(0, S16, 2)
The latest asics support v_cvt_pk_f16_f32 instruction. However current implementation of vector fptrunc lowering fully scalarizes the vectors, and the scalar conversions may not always be combined to generate the packed one. We made v2f32 -> v2f16 legal in #139956. This work is an extension to handle wider vectors. Instead of fully scalarization, we split the vector to packs (v2f32 -> v2f16) to ensure the packed conversion can always been generated.
The latest asics support v_cvt_pk_f16_f32 instruction. However current implementation of vector fptrunc lowering fully scalarizes the vectors, and the scalar conversions may not always be combined to generate the packed one. We made v2f32 -> v2f16 legal in llvm/llvm-project#139956. This work is an extension to handle wider vectors. Instead of fully scalarization, we split the vector to packs (v2f32 -> v2f16) to ensure the packed conversion can always been generated.
The latest asics support v_cvt_pk_f16_f32 instruction. However current implementation of vector fptrunc lowering fully scalarizes the vectors, and the scalar conversions may not always be combined to generate the packed one. We made v2f32 -> v2f16 legal in llvm#139956. This work is an extension to handle wider vectors. Instead of fully scalarization, we split the vector to packs (v2f32 -> v2f16) to ensure the packed conversion can always been generated.
The latest asics support v_cvt_pk_f16_f32 instruction. However current implementation of vector fptrunc lowering fully scalarizes the vectors, and the scalar conversions may not always be combined to generate the packed one. We made v2f32 -> v2f16 legal in llvm#139956. This work is an extension to handle wider vectors. Instead of fully scalarization, we split the vector to packs (v2f32 -> v2f16) to ensure the packed conversion can always been generated.
The latest asics support v_cvt_pk_f16_f32 instruction. However current implementation of vector fptrunc lowering fully scalarizes the vectors, and the scalar conversions may not always be combined to generate the packed one. We made v2f32 -> v2f16 legal in llvm#139956. This work is an extension to handle wider vectors. Instead of fully scalarization, we split the vector to packs (v2f32 -> v2f16) to ensure the packed conversion can always been generated.
…32 (llvm#139956) (llvm#2478) If targets support v_cvt_pk_f16_f32 instruction, v2f32 -> v2f16 should be legal. However, SelectionDAG does not allow us to specify the source type in the legalization rules. To workaround this, we make FP_ROUND Custom for v2f16 then set up v2f32 -> v2f16 to be legal during custom lowering. Cherry-pick llvm#139956 to mainline (after bulk promotion) Fixes: SWDEV-532608 -- expected v_cvt_pk_f16_f32 was not generated.
If targets support v_cvt_pk_f16_f32 instruction, v2f32 -> v2f16 should be legal. However, SelectionDAG does not allow us to specify the source type in the legalization rules. To workaround this, we make FP_ROUND Custom for v2f16 then set up v2f32 -> v2f16 to be legal during custom lowering.
Fixes: SWDEV-532608 -- expected v_cvt_pk_f16_f32 was not generated.