AMDGPU: Make v2f64 -> v2f16 conversion Legal only when unsafe fast math is set #134738
Conversation
On targets that support the v_cvt_pk_f16_f32 instruction, if we make v2f64 -> v2f16 Legal, we will generate the following sequence of instructions:
v_cvt_f32_f64_e32 v1, s[6:7]
v_cvt_f32_f64_e32 v2, s[4:5]
v_cvt_pk_f16_f32 v1, v2, v1
While this is acceptable when the unsafe fast math flag is set, it may otherwise return imprecise results due to double rounding. This patch makes the v2f64 -> v2f16 conversion Legal only when unsafe fast math is set.
Fixes: SWDEV-523856
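To make the double-rounding hazard concrete, here is a small standalone demonstration. It is my own example, not part of the patch, and it assumes a host compiler with IEEE-754 float/double and _Float16 support (e.g. recent Clang on x86-64):

#include <cstdio>

int main() {
  // 1.0 + 2^-11 lies exactly halfway between the f16 neighbors 1.0 and
  // 1.0 + 2^-10; the extra 2^-30 means the correctly rounded f16 result
  // is the upper neighbor, but that nudge is lost when rounding to f32.
  double d = 1.0 + 0x1p-11 + 0x1p-30;

  _Float16 direct = (_Float16)d;  // one rounding: 1.0009765625
  float f = (float)d;             // 2^-30 is below f32 precision: 1 + 2^-11
  _Float16 twostep = (_Float16)f; // exact halfway case, ties-to-even: 1.0

  printf("direct:  %g\n", (double)direct);  // prints 1.00098
  printf("twostep: %g\n", (double)twostep); // prints 1
  return 0;
}

This mirrors the sequence above: v_cvt_f32_f64 followed by v_cvt_pk_f16_f32 rounds twice, so the packed form can disagree with a correctly rounded single f64 -> f16 conversion.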
@llvm/pr-subscribers-backend-amdgpu
Author: Changpeng Fang (changpeng)
Changes
On targets that support the v_cvt_pk_f16_f32 instruction, if we make v2f64 -> v2f16 Legal, we will generate the following sequence of instructions:
v_cvt_f32_f64_e32 v1, s[6:7]
v_cvt_f32_f64_e32 v2, s[4:5]
v_cvt_pk_f16_f32 v1, v2, v1
Fixes: SWDEV-523856
Full diff: https://github.com/llvm/llvm-project/pull/134738.diff
3 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index 275d0193452a5..4f264c1d6c794 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -1053,10 +1053,11 @@ AMDGPULegalizerInfo::AMDGPULegalizerInfo(const GCNSubtarget &ST_,
}
auto &FPTruncActions = getActionDefinitionsBuilder(G_FPTRUNC);
- if (ST.hasCvtPkF16F32Inst())
- FPTruncActions.legalFor(
- {{S32, S64}, {S16, S32}, {V2S16, V2S32}, {V2S16, V2S64}});
- else
+ if (ST.hasCvtPkF16F32Inst()) {
+ FPTruncActions.legalFor({{S32, S64}, {S16, S32}, {V2S16, V2S32}});
+ if (TM.Options.UnsafeFPMath)
+ FPTruncActions.legalFor({V2S16, V2S64});
+ } else
FPTruncActions.legalFor({{S32, S64}, {S16, S32}});
FPTruncActions.scalarize(0).lower();
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 356040da95672..8e49fa4323f5d 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -915,7 +915,7 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
setOperationAction(ISD::BUILD_VECTOR, MVT::v2bf16, Legal);
}
- if (Subtarget->hasCvtPkF16F32Inst()) {
+ if (Subtarget->hasCvtPkF16F32Inst() && TM.Options.UnsafeFPMath) {
setOperationAction(ISD::FP_ROUND, MVT::v2f16, Legal);
}
diff --git a/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.fpmath.ll b/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.fpmath.ll
new file mode 100644
index 0000000000000..7ef31ef6387ae
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.fpmath.ll
@@ -0,0 +1,199 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn -mcpu=gfx950 -global-isel=0 < %s | FileCheck -check-prefixes=GFX950,GFX950-SAFE-SDAG %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx950 -global-isel=1 < %s | FileCheck -check-prefixes=GFX950,GFX950-SAFE-GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx950 -global-isel=0 -enable-unsafe-fp-math < %s | FileCheck -check-prefixes=GFX950,GFX950-UNSAFE %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx950 -global-isel=1 -enable-unsafe-fp-math < %s | FileCheck -check-prefixes=GFX950,GFX950-UNSAFE %s
+
+define amdgpu_ps <2 x half> @v_test_cvt_v2f32_v2f16(<2 x float> %src) {
+; GFX950-LABEL: v_test_cvt_v2f32_v2f16:
+; GFX950: ; %bb.0:
+; GFX950-NEXT: v_cvt_pk_f16_f32 v0, v0, v1
+; GFX950-NEXT: ; return to shader part epilog
+ %res = fptrunc <2 x float> %src to <2 x half>
+ ret <2 x half> %res
+}
+
+define amdgpu_ps <2 x half> @v_test_cvt_v2f64_v2f16(<2 x double> %src) {
+; GFX950-SAFE-SDAG-LABEL: v_test_cvt_v2f64_v2f16:
+; GFX950-SAFE-SDAG: ; %bb.0:
+; GFX950-SAFE-SDAG-NEXT: s_movk_i32 s0, 0x1ff
+; GFX950-SAFE-SDAG-NEXT: v_and_or_b32 v0, v1, s0, v0
+; GFX950-SAFE-SDAG-NEXT: v_cmp_ne_u32_e32 vcc, 0, v0
+; GFX950-SAFE-SDAG-NEXT: v_lshrrev_b32_e32 v4, 8, v1
+; GFX950-SAFE-SDAG-NEXT: s_movk_i32 s1, 0xffe
+; GFX950-SAFE-SDAG-NEXT: v_cndmask_b32_e64 v0, 0, 1, vcc
+; GFX950-SAFE-SDAG-NEXT: v_bfe_u32 v5, v1, 20, 11
+; GFX950-SAFE-SDAG-NEXT: v_and_or_b32 v0, v4, s1, v0
+; GFX950-SAFE-SDAG-NEXT: v_sub_u32_e32 v6, 0x3f1, v5
+; GFX950-SAFE-SDAG-NEXT: v_or_b32_e32 v4, 0x1000, v0
+; GFX950-SAFE-SDAG-NEXT: v_med3_i32 v6, v6, 0, 13
+; GFX950-SAFE-SDAG-NEXT: v_lshrrev_b32_e32 v7, v6, v4
+; GFX950-SAFE-SDAG-NEXT: v_lshlrev_b32_e32 v6, v6, v7
+; GFX950-SAFE-SDAG-NEXT: v_cmp_ne_u32_e32 vcc, v6, v4
+; GFX950-SAFE-SDAG-NEXT: v_add_u32_e32 v5, 0xfffffc10, v5
+; GFX950-SAFE-SDAG-NEXT: v_lshl_or_b32 v6, v5, 12, v0
+; GFX950-SAFE-SDAG-NEXT: v_cndmask_b32_e64 v4, 0, 1, vcc
+; GFX950-SAFE-SDAG-NEXT: v_or_b32_e32 v4, v7, v4
+; GFX950-SAFE-SDAG-NEXT: v_cmp_gt_i32_e32 vcc, 1, v5
+; GFX950-SAFE-SDAG-NEXT: s_movk_i32 s2, 0x40f
+; GFX950-SAFE-SDAG-NEXT: v_lshrrev_b32_e32 v1, 16, v1
+; GFX950-SAFE-SDAG-NEXT: v_cndmask_b32_e32 v4, v6, v4, vcc
+; GFX950-SAFE-SDAG-NEXT: v_and_b32_e32 v6, 7, v4
+; GFX950-SAFE-SDAG-NEXT: v_cmp_lt_i32_e32 vcc, 5, v6
+; GFX950-SAFE-SDAG-NEXT: v_lshrrev_b32_e32 v4, 2, v4
+; GFX950-SAFE-SDAG-NEXT: s_mov_b32 s3, 0x8000
+; GFX950-SAFE-SDAG-NEXT: v_cndmask_b32_e64 v7, 0, 1, vcc
+; GFX950-SAFE-SDAG-NEXT: v_cmp_eq_u32_e32 vcc, 3, v6
+; GFX950-SAFE-SDAG-NEXT: s_nop 1
+; GFX950-SAFE-SDAG-NEXT: v_cndmask_b32_e64 v6, 0, 1, vcc
+; GFX950-SAFE-SDAG-NEXT: v_or_b32_e32 v6, v6, v7
+; GFX950-SAFE-SDAG-NEXT: v_add_u32_e32 v4, v4, v6
+; GFX950-SAFE-SDAG-NEXT: v_mov_b32_e32 v6, 0x7c00
+; GFX950-SAFE-SDAG-NEXT: v_cmp_gt_i32_e32 vcc, 31, v5
+; GFX950-SAFE-SDAG-NEXT: v_mov_b32_e32 v7, 0x7e00
+; GFX950-SAFE-SDAG-NEXT: s_nop 0
+; GFX950-SAFE-SDAG-NEXT: v_cndmask_b32_e32 v4, v6, v4, vcc
+; GFX950-SAFE-SDAG-NEXT: v_cmp_ne_u32_e32 vcc, 0, v0
+; GFX950-SAFE-SDAG-NEXT: s_nop 1
+; GFX950-SAFE-SDAG-NEXT: v_cndmask_b32_e32 v0, v6, v7, vcc
+; GFX950-SAFE-SDAG-NEXT: v_cmp_eq_u32_e32 vcc, s2, v5
+; GFX950-SAFE-SDAG-NEXT: s_nop 1
+; GFX950-SAFE-SDAG-NEXT: v_cndmask_b32_e32 v0, v4, v0, vcc
+; GFX950-SAFE-SDAG-NEXT: v_and_or_b32 v0, v1, s3, v0
+; GFX950-SAFE-SDAG-NEXT: v_and_or_b32 v1, v3, s0, v2
+; GFX950-SAFE-SDAG-NEXT: v_cmp_ne_u32_e32 vcc, 0, v1
+; GFX950-SAFE-SDAG-NEXT: v_lshrrev_b32_e32 v2, 8, v3
+; GFX950-SAFE-SDAG-NEXT: v_bfe_u32 v4, v3, 20, 11
+; GFX950-SAFE-SDAG-NEXT: v_cndmask_b32_e64 v1, 0, 1, vcc
+; GFX950-SAFE-SDAG-NEXT: v_and_or_b32 v1, v2, s1, v1
+; GFX950-SAFE-SDAG-NEXT: v_sub_u32_e32 v5, 0x3f1, v4
+; GFX950-SAFE-SDAG-NEXT: v_or_b32_e32 v2, 0x1000, v1
+; GFX950-SAFE-SDAG-NEXT: v_med3_i32 v5, v5, 0, 13
+; GFX950-SAFE-SDAG-NEXT: v_lshrrev_b32_e32 v8, v5, v2
+; GFX950-SAFE-SDAG-NEXT: v_lshlrev_b32_e32 v5, v5, v8
+; GFX950-SAFE-SDAG-NEXT: v_cmp_ne_u32_e32 vcc, v5, v2
+; GFX950-SAFE-SDAG-NEXT: v_add_u32_e32 v4, 0xfffffc10, v4
+; GFX950-SAFE-SDAG-NEXT: v_lshl_or_b32 v5, v4, 12, v1
+; GFX950-SAFE-SDAG-NEXT: v_cndmask_b32_e64 v2, 0, 1, vcc
+; GFX950-SAFE-SDAG-NEXT: v_or_b32_e32 v2, v8, v2
+; GFX950-SAFE-SDAG-NEXT: v_cmp_gt_i32_e32 vcc, 1, v4
+; GFX950-SAFE-SDAG-NEXT: s_mov_b32 s0, 0x5040100
+; GFX950-SAFE-SDAG-NEXT: s_nop 0
+; GFX950-SAFE-SDAG-NEXT: v_cndmask_b32_e32 v2, v5, v2, vcc
+; GFX950-SAFE-SDAG-NEXT: v_and_b32_e32 v5, 7, v2
+; GFX950-SAFE-SDAG-NEXT: v_cmp_lt_i32_e32 vcc, 5, v5
+; GFX950-SAFE-SDAG-NEXT: v_lshrrev_b32_e32 v2, 2, v2
+; GFX950-SAFE-SDAG-NEXT: s_nop 0
+; GFX950-SAFE-SDAG-NEXT: v_cndmask_b32_e64 v8, 0, 1, vcc
+; GFX950-SAFE-SDAG-NEXT: v_cmp_eq_u32_e32 vcc, 3, v5
+; GFX950-SAFE-SDAG-NEXT: s_nop 1
+; GFX950-SAFE-SDAG-NEXT: v_cndmask_b32_e64 v5, 0, 1, vcc
+; GFX950-SAFE-SDAG-NEXT: v_or_b32_e32 v5, v5, v8
+; GFX950-SAFE-SDAG-NEXT: v_add_u32_e32 v2, v2, v5
+; GFX950-SAFE-SDAG-NEXT: v_cmp_gt_i32_e32 vcc, 31, v4
+; GFX950-SAFE-SDAG-NEXT: s_nop 1
+; GFX950-SAFE-SDAG-NEXT: v_cndmask_b32_e32 v2, v6, v2, vcc
+; GFX950-SAFE-SDAG-NEXT: v_cmp_ne_u32_e32 vcc, 0, v1
+; GFX950-SAFE-SDAG-NEXT: s_nop 1
+; GFX950-SAFE-SDAG-NEXT: v_cndmask_b32_e32 v1, v6, v7, vcc
+; GFX950-SAFE-SDAG-NEXT: v_cmp_eq_u32_e32 vcc, s2, v4
+; GFX950-SAFE-SDAG-NEXT: s_nop 1
+; GFX950-SAFE-SDAG-NEXT: v_cndmask_b32_e32 v1, v2, v1, vcc
+; GFX950-SAFE-SDAG-NEXT: v_lshrrev_b32_e32 v2, 16, v3
+; GFX950-SAFE-SDAG-NEXT: v_and_or_b32 v1, v2, s3, v1
+; GFX950-SAFE-SDAG-NEXT: v_perm_b32 v0, v1, v0, s0
+; GFX950-SAFE-SDAG-NEXT: ; return to shader part epilog
+;
+; GFX950-SAFE-GISEL-LABEL: v_test_cvt_v2f64_v2f16:
+; GFX950-SAFE-GISEL: ; %bb.0:
+; GFX950-SAFE-GISEL-NEXT: v_mov_b32_e32 v7, 0x1ff
+; GFX950-SAFE-GISEL-NEXT: v_and_or_b32 v0, v1, v7, v0
+; GFX950-SAFE-GISEL-NEXT: v_bfe_u32 v4, v1, 20, 11
+; GFX950-SAFE-GISEL-NEXT: v_cmp_ne_u32_e32 vcc, 0, v0
+; GFX950-SAFE-GISEL-NEXT: v_add_u32_e32 v4, 0xfffffc10, v4
+; GFX950-SAFE-GISEL-NEXT: v_lshrrev_b32_e32 v5, 8, v1
+; GFX950-SAFE-GISEL-NEXT: v_mov_b32_e32 v6, 0xffe
+; GFX950-SAFE-GISEL-NEXT: v_cndmask_b32_e64 v0, 0, 1, vcc
+; GFX950-SAFE-GISEL-NEXT: v_and_or_b32 v0, v5, v6, v0
+; GFX950-SAFE-GISEL-NEXT: v_sub_u32_e32 v10, 1, v4
+; GFX950-SAFE-GISEL-NEXT: v_cmp_ne_u32_e32 vcc, 0, v0
+; GFX950-SAFE-GISEL-NEXT: v_lshl_or_b32 v9, v4, 12, v0
+; GFX950-SAFE-GISEL-NEXT: v_med3_i32 v10, v10, 0, 13
+; GFX950-SAFE-GISEL-NEXT: v_or_b32_e32 v0, 0x1000, v0
+; GFX950-SAFE-GISEL-NEXT: v_lshrrev_b32_e32 v11, v10, v0
+; GFX950-SAFE-GISEL-NEXT: v_lshlrev_b32_e32 v10, v10, v11
+; GFX950-SAFE-GISEL-NEXT: v_cndmask_b32_e64 v5, 0, 1, vcc
+; GFX950-SAFE-GISEL-NEXT: v_cmp_ne_u32_e32 vcc, v10, v0
+; GFX950-SAFE-GISEL-NEXT: v_mov_b32_e32 v8, 0x7c00
+; GFX950-SAFE-GISEL-NEXT: v_lshl_or_b32 v5, v5, 9, v8
+; GFX950-SAFE-GISEL-NEXT: v_cndmask_b32_e64 v0, 0, 1, vcc
+; GFX950-SAFE-GISEL-NEXT: v_or_b32_e32 v0, v11, v0
+; GFX950-SAFE-GISEL-NEXT: v_cmp_gt_i32_e32 vcc, 1, v4
+; GFX950-SAFE-GISEL-NEXT: v_lshrrev_b32_e32 v1, 16, v1
+; GFX950-SAFE-GISEL-NEXT: v_and_or_b32 v2, v3, v7, v2
+; GFX950-SAFE-GISEL-NEXT: v_cndmask_b32_e32 v0, v9, v0, vcc
+; GFX950-SAFE-GISEL-NEXT: v_and_b32_e32 v9, 7, v0
+; GFX950-SAFE-GISEL-NEXT: v_cmp_eq_u32_e32 vcc, 3, v9
+; GFX950-SAFE-GISEL-NEXT: v_cmp_lt_i32_e64 s[0:1], 5, v9
+; GFX950-SAFE-GISEL-NEXT: s_or_b64 s[0:1], vcc, s[0:1]
+; GFX950-SAFE-GISEL-NEXT: v_lshrrev_b32_e32 v0, 2, v0
+; GFX950-SAFE-GISEL-NEXT: v_cndmask_b32_e64 v9, 0, 1, s[0:1]
+; GFX950-SAFE-GISEL-NEXT: v_add_u32_e32 v0, v0, v9
+; GFX950-SAFE-GISEL-NEXT: v_cmp_lt_i32_e32 vcc, 30, v4
+; GFX950-SAFE-GISEL-NEXT: v_mov_b32_e32 v9, 0x40f
+; GFX950-SAFE-GISEL-NEXT: s_nop 0
+; GFX950-SAFE-GISEL-NEXT: v_cndmask_b32_e32 v0, v0, v8, vcc
+; GFX950-SAFE-GISEL-NEXT: v_cmp_eq_u32_e32 vcc, v4, v9
+; GFX950-SAFE-GISEL-NEXT: v_mov_b32_e32 v4, 0x8000
+; GFX950-SAFE-GISEL-NEXT: s_nop 0
+; GFX950-SAFE-GISEL-NEXT: v_cndmask_b32_e32 v0, v0, v5, vcc
+; GFX950-SAFE-GISEL-NEXT: v_and_or_b32 v0, v1, v4, v0
+; GFX950-SAFE-GISEL-NEXT: v_bfe_u32 v1, v3, 20, 11
+; GFX950-SAFE-GISEL-NEXT: v_cmp_ne_u32_e32 vcc, 0, v2
+; GFX950-SAFE-GISEL-NEXT: v_add_u32_e32 v1, 0xfffffc10, v1
+; GFX950-SAFE-GISEL-NEXT: v_lshrrev_b32_e32 v5, 8, v3
+; GFX950-SAFE-GISEL-NEXT: v_cndmask_b32_e64 v2, 0, 1, vcc
+; GFX950-SAFE-GISEL-NEXT: v_and_or_b32 v2, v5, v6, v2
+; GFX950-SAFE-GISEL-NEXT: v_sub_u32_e32 v7, 1, v1
+; GFX950-SAFE-GISEL-NEXT: v_cmp_ne_u32_e32 vcc, 0, v2
+; GFX950-SAFE-GISEL-NEXT: v_lshl_or_b32 v6, v1, 12, v2
+; GFX950-SAFE-GISEL-NEXT: v_med3_i32 v7, v7, 0, 13
+; GFX950-SAFE-GISEL-NEXT: v_or_b32_e32 v2, 0x1000, v2
+; GFX950-SAFE-GISEL-NEXT: v_lshrrev_b32_e32 v10, v7, v2
+; GFX950-SAFE-GISEL-NEXT: v_lshlrev_b32_e32 v7, v7, v10
+; GFX950-SAFE-GISEL-NEXT: v_cndmask_b32_e64 v5, 0, 1, vcc
+; GFX950-SAFE-GISEL-NEXT: v_cmp_ne_u32_e32 vcc, v7, v2
+; GFX950-SAFE-GISEL-NEXT: v_lshl_or_b32 v5, v5, 9, v8
+; GFX950-SAFE-GISEL-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX950-SAFE-GISEL-NEXT: v_cndmask_b32_e64 v2, 0, 1, vcc
+; GFX950-SAFE-GISEL-NEXT: v_or_b32_e32 v2, v10, v2
+; GFX950-SAFE-GISEL-NEXT: v_cmp_gt_i32_e32 vcc, 1, v1
+; GFX950-SAFE-GISEL-NEXT: s_nop 1
+; GFX950-SAFE-GISEL-NEXT: v_cndmask_b32_e32 v2, v6, v2, vcc
+; GFX950-SAFE-GISEL-NEXT: v_and_b32_e32 v6, 7, v2
+; GFX950-SAFE-GISEL-NEXT: v_cmp_eq_u32_e32 vcc, 3, v6
+; GFX950-SAFE-GISEL-NEXT: v_cmp_lt_i32_e64 s[0:1], 5, v6
+; GFX950-SAFE-GISEL-NEXT: s_or_b64 s[0:1], vcc, s[0:1]
+; GFX950-SAFE-GISEL-NEXT: v_lshrrev_b32_e32 v2, 2, v2
+; GFX950-SAFE-GISEL-NEXT: v_cndmask_b32_e64 v6, 0, 1, s[0:1]
+; GFX950-SAFE-GISEL-NEXT: v_add_u32_e32 v2, v2, v6
+; GFX950-SAFE-GISEL-NEXT: v_cmp_lt_i32_e32 vcc, 30, v1
+; GFX950-SAFE-GISEL-NEXT: s_nop 1
+; GFX950-SAFE-GISEL-NEXT: v_cndmask_b32_e32 v2, v2, v8, vcc
+; GFX950-SAFE-GISEL-NEXT: v_cmp_eq_u32_e32 vcc, v1, v9
+; GFX950-SAFE-GISEL-NEXT: s_nop 1
+; GFX950-SAFE-GISEL-NEXT: v_cndmask_b32_e32 v1, v2, v5, vcc
+; GFX950-SAFE-GISEL-NEXT: v_lshrrev_b32_e32 v2, 16, v3
+; GFX950-SAFE-GISEL-NEXT: v_and_or_b32 v1, v2, v4, v1
+; GFX950-SAFE-GISEL-NEXT: v_lshl_or_b32 v0, v1, 16, v0
+; GFX950-SAFE-GISEL-NEXT: ; return to shader part epilog
+;
+; GFX950-UNSAFE-LABEL: v_test_cvt_v2f64_v2f16:
+; GFX950-UNSAFE: ; %bb.0:
+; GFX950-UNSAFE-NEXT: v_cvt_f32_f64_e32 v2, v[2:3]
+; GFX950-UNSAFE-NEXT: v_cvt_f32_f64_e32 v0, v[0:1]
+; GFX950-UNSAFE-NEXT: v_cvt_pk_f16_f32 v0, v0, v2
+; GFX950-UNSAFE-NEXT: ; return to shader part epilog
+ %res = fptrunc <2 x double> %src to <2 x half>
+ ret <2 x half> %res
+}
@@ -915,7 +915,7 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
   setOperationAction(ISD::BUILD_VECTOR, MVT::v2bf16, Legal);
 }

-  if (Subtarget->hasCvtPkF16F32Inst()) {
+  if (Subtarget->hasCvtPkF16F32Inst() && TM.Options.UnsafeFPMath) {
   setOperationAction(ISD::FP_ROUND, MVT::v2f16, Legal);
It disables it not just from v2f64, but also from v2f32, right? How is that lowered then?
We have a pattern to generate v_cvt_pk_f16_f32:
def : GCNPat<(v2f16 (fpround v2f32:$src)),
Although we also have a pattern for v2f64 -> v2f16, we split the vector and then Custom-lower f64 -> f16 first:
def : GCNPat<(v2f16 (fpround v2f64:$src)),
But the pattern shall not work for an operation which is not legal.
My observation is that if we do not Expand/Custom/Promote, the pattern will be picked up whether or not we specify Legal. So it doesn't matter whether you specify Legal or not; the pattern for v2f32 -> v2f16 will always be picked up.
This is similar to f32 -> f16: even though we did not specify Legal, v_cvt_f16_f32 will still be selected.
Default is Legal?
No.
I think Legal is the default:
https://llvm.org/docs/WritingAnLLVMBackend.html#legal
The reason we sometimes need to explicitly specify Legal is that we need to specify different Actions based on different hardware features.
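For what it's worth, here is a sketch of why that holds on the SelectionDAG side. This is my reading of TargetLoweringBase, stated as an assumption rather than a quote: the per-(opcode, type) action table is zero-initialized, and Legal is enumerator 0, so any action that is never explicitly set reads back as Legal. (GlobalISel's LegalizerInfo has no such implicit default.)

#include "llvm/CodeGen/TargetLowering.h"

// Compile-time check of the assumption above: the zero value held by an
// unset OpActions entry is LegalizeAction::Legal.
static_assert(llvm::TargetLoweringBase::Legal == 0,
              "the zero-filled default action is Legal");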
I am still learning. I think even if we specify Custom lowering but actually do nothing, like:
if (SrcVT != MVT::f64)
  return Op; // f32 is legal for fp_round
this is essentially "Legal", which means either there is a native instruction or a pattern to match to native instructions. The compiler will throw an error if there is nothing to select.
Should not further spread use of the global options. FP casts have fast math flags; make use of those. Note this needs to be in custom lowering: legality cannot be use-context dependent.
; RUN: llc -mtriple=amdgcn -mcpu=gfx950 -global-isel=0 -enable-unsafe-fp-math < %s | FileCheck -check-prefixes=GFX950,GFX950-UNSAFE %s
; RUN: llc -mtriple=amdgcn -mcpu=gfx950 -global-isel=1 -enable-unsafe-fp-math < %s | FileCheck -check-prefixes=GFX950,GFX950-UNSAFE %s

define amdgpu_ps <2 x half> @v_test_cvt_v2f32_v2f16(<2 x float> %src) { |
Don't need the shader conventions. Also test the scalar case and wider vectors.
Removed the shader conventions, and added tests for other types.
Custom lower FP_ROUND for v2f16. Also add test cases for scalar and wider vectors
I understand your proposal to avoid use of global options. But here we need to be consistent. Options.UnsafeFPMath has been used in all other places to expand/lower f64 -> f16. We should not use a different approach only here.
Or do you prefer a combination of fast math flag and global option, like this?
This is because traditionally the FP casts did not have fast math flags. They do now, and we should migrate to only using them.
You can do that, but it still needs to be a contextual-use transformation, not a modification of the legality rules.
I do not fully understand what you meant by "modify the legality rules", and why we should not modify them. Here is the example in the custom lowering for FP_TO_FP16: if (getTargetMachine().Options.UnsafeFPMath) { Do you mean this is to change from Custom to Expand, and is not allowed? Thanks.
How do I check whether the instruction has the "fast" flag?
On the SelectionDAG path, FP_ROUND still does not have fast math flags. I still think adding fast math flags for FP_ROUND is beyond the scope of this patch. (We also need to propagate the flags when we do transformations like splitting the vector.)
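For reference on the "how to check the flag" question, here is a minimal sketch of how the afn flag is usually queried on each path. These are real LLVM APIs, but whether FP_ROUND actually carries the flag through to this point is exactly the unresolved issue in this thread:

#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/SelectionDAGNodes.h"
#include "llvm/Target/TargetMachine.h"
using namespace llvm;

// SelectionDAG path: fast-math flags live on the SDNode.
static bool allowInaccurate(SDValue Op, const TargetMachine &TM) {
  return Op->getFlags().hasApproximateFuncs() || TM.Options.UnsafeFPMath;
}

// GlobalISel path: the flag is carried on the MachineInstr; this is the
// combination the patch itself ends up using.
static bool allowInaccurate(const MachineInstr &MI, const TargetMachine &TM) {
  return MI.getFlag(MachineInstr::FmAfn) || TM.Options.UnsafeFPMath;
}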
…th is set Custom lowering v2f64 -> v2f16.
You can test this locally with the following command:
git-clang-format --diff HEAD~1 HEAD --extensions h,cpp -- llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h llvm/lib/Target/AMDGPU/SIISelLowering.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index ee3f5dd4e..bf46aef0c 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -1055,7 +1055,7 @@ AMDGPULegalizerInfo::AMDGPULegalizerInfo(const GCNSubtarget &ST_,
auto &FPTruncActions = getActionDefinitionsBuilder(G_FPTRUNC);
if (ST.hasCvtPkF16F32Inst()) {
FPTruncActions.legalFor({{S32, S64}, {S16, S32}, {V2S16, V2S32}})
- .customFor({V2S16, V2S64});
+ .customFor({V2S16, V2S64});
} else
FPTruncActions.legalFor({{S32, S64}, {S16, S32}});
FPTruncActions.scalarize(0).lower();
@@ -2750,8 +2750,8 @@ bool AMDGPULegalizerInfo::legalizeFPTrunc(LegalizerHelper &Helper,
// still used here to be consistent, especially when the fast math flag is
// not working for FP_ROUND on the SelectDAG path at this moment.
MachineFunction &MF = Helper.MIRBuilder.getMF();
- bool AllowInaccurateFPTRUNC = MI.getFlag(MachineInstr::FmAfn) ||
- MF.getTarget().Options.UnsafeFPMath;
+ bool AllowInaccurateFPTRUNC =
+ MI.getFlag(MachineInstr::FmAfn) || MF.getTarget().Options.UnsafeFPMath;
if (AllowInaccurateFPTRUNC) {
// Use the tablegen pattern to select native instructions.
Custom lowering for v2f64 -> v2f16 on both the SelectionDAG and GlobalISel paths now. The fast math flag for fp_round is not working at all and needs additional work beyond this patch.
else
if (ST.hasCvtPkF16F32Inst()) {
  FPTruncActions.legalFor({{S32, S64}, {S16, S32}, {V2S16, V2S32}})
      .customFor({V2S16, V2S64});
It's suspect that we need custom handling for the v2 case, but not the regular scalar
Previously only v2f16 was made Legal in order to generate v_cvt_pk_f16_f32, I think.
// TODO: We should only use fast math flag. But the global option is
// still used here to be consistent, especially when the fast math flag is
// not working for FP_ROUND on the SelectDAG path at this moment.
MachineFunction &MF = Helper.MIRBuilder.getMF();
bool AllowInaccurateFPTRUNC = MI.getFlag(MachineInstr::FmAfn) ||
                              MF.getTarget().Options.UnsafeFPMath;

if (AllowInaccurateFPTRUNC) {
  // Use the tablegen pattern to select native instructions.
  return true;
}

Register DstReg = MI.getOperand(0).getReg();
LLT DstTy = MRI.getType(DstReg);

// Scalarize the vector and fall through to lower f64 -> f16.
return Helper.fewerElementsVector(MI, 0, DstTy.getElementType()) ==
       LegalizerHelper::Legalized;
This seems like it should be a combine and not a modification of legalization rules
I am looking at instcombine to generate v_cvt_pk_f16_f32 instructions. Should we implement it in AMDGPU, since it is target dependent? Thanks.
@@ -0,0 +1,1748 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5 |
Test belongs in the existing fptrunc.f16.ll test
The existing fptrunc.f16.ll only checks when unsafe fpmath is set. This patch actually fixes the codegen for v2f16 when unsafe fpmath is not set. Maybe we should not include the scalar and wider cases in this file?
// Use the tablegen pattern to select native instructions.
return true;
}
|
This handling logic doesn't make sense. The decision to scalarize or not has nothing to do with the FP properties? The tablegen pattern also handles the build_vector case, which is the MIR I would expect to select from. Can we just drop the direct vector handling, and always scalarize?
The tablegen patterns should also be guarded with the fast checks if we're going to have them.
The logic generates instructions based on the pattern, but this is only correct when unsafe fpmath is set.
Yes, we may always scalarize. Should we do instcombine to generate v_cvt_pk_f16_f32?
@@ -6893,6 +6893,19 @@ SDValue SITargetLowering::lowerFP_ROUND(SDValue Op, SelectionDAG &DAG) const {
  return Op;

EVT DstVT = Op.getValueType();

if (DstVT == MVT::v2f16) {
Same comments as the globalisel path. The vectoriness shouldn't matter; we don't need to directly consume the operation.
Okay, we can do it as in #135659, which essentially scalarizes v2f16. We may do instcombine to generate the vector cvt as long as the target supports it. Thanks.
On targets that support v_cvt_pk_f16_f32 instruction, if we make v2f64 -> v2f16 Legal, we will generate the following sequence of instructions: v_cvt_f32_f64_e32 v1, s[6:7] v_cvt_f32_f64_e32 v2, s[4:5] v_cvt_pk_f16_f32 v1, v2, v1 It possibly returns imprecise results due to double rounding. This patch fixes the issue by not setting the conversion Legal. While we may still expect the above sequence of code when unsafe fpmath is set, I hope llvm#134738 can address that performance concern. Fixes: SWDEV-523856
…135659) On targets that support v_cvt_pk_f16_f32 instruction, if we make v2f64 -> v2f16 Legal, we will generate the following sequence of instructions: v_cvt_f32_f64_e32 v1, s[6:7] v_cvt_f32_f64_e32 v2, s[4:5] v_cvt_pk_f16_f32 v1, v2, v1 It possibly returns imprecise results due to double rounding. This patch fixes the issue by not setting the conversion Legal. While we may still expect the above sequence of code when unsafe fpmath is set, I hope #134738 can address that performance concern. Fixes: SWDEV-523856
…nversion (#135659) On targets that support v_cvt_pk_f16_f32 instruction, if we make v2f64 -> v2f16 Legal, we will generate the following sequence of instructions: v_cvt_f32_f64_e32 v1, s[6:7] v_cvt_f32_f64_e32 v2, s[4:5] v_cvt_pk_f16_f32 v1, v2, v1 It possibly returns imprecise results due to double rounding. This patch fixes the issue by not setting the conversion Legal. While we may still expect the above sequence of code when unsafe fpmath is set, I hope llvm/llvm-project#134738 can address that performance concern. Fixes: SWDEV-523856
…lvm#135659) (llvm#1791) On targets that support v_cvt_pk_f16_f32 instruction, if we make v2f64 -> v2f16 Legal, we will generate the following sequence of instructions: v_cvt_f32_f64_e32 v1, s[6:7] v_cvt_f32_f64_e32 v2, s[4:5] v_cvt_pk_f16_f32 v1, v2, v1 It possibly returns imprecise results due to double rounding. This patch fixes the issue by not setting the conversion Legal. While we may still expect the above sequence of code when unsafe fpmath is set, I hope llvm#134738 can address that performance concern. Cherry-pick: llvm#135659 to amd-mainline Fixes: SWDEV-523856
…lvm#135659) On targets that support v_cvt_pk_f16_f32 instruction, if we make v2f64 -> v2f16 Legal, we will generate the following sequence of instructions: v_cvt_f32_f64_e32 v1, s[6:7] v_cvt_f32_f64_e32 v2, s[4:5] v_cvt_pk_f16_f32 v1, v2, v1 It possibly returns imprecise results due to double rounding. This patch fixes the issue by not setting the conversion Legal. While we may still expect the above sequence of code when unsafe fpmath is set, I hope llvm#134738 can address that performance concern. Fixes: SWDEV-523856
On targets that support v_cvt_pk_f16_f32 instruction, if we make v2f64 -> v2f16 Legal, we will generate the following sequence of instructions:
v_cvt_f32_f64_e32 v1, s[6:7]
v_cvt_f32_f64_e32 v2, s[4:5]
v_cvt_pk_f16_f32 v1, v2, v1
While it may be acceptable when unsafe fast math is set, it possibly returns imprecise results due to double rounding. This patch makes the v2f64 -> v2f16 conversion Legal only when unsafe fast math is set.
Fixes: SWDEV-523856