
AMDGPU: Make v2f64 -> v2f16 conversion Legal only when unsafe fast math is set #134738


Open
changpeng wants to merge 3 commits into main

Conversation

changpeng (Contributor)

On targets that support the v_cvt_pk_f16_f32 instruction, if we make v2f64 -> v2f16 Legal, we will generate the following sequence of instructions:
v_cvt_f32_f64_e32 v1, s[6:7]
v_cvt_f32_f64_e32 v2, s[4:5]
v_cvt_pk_f16_f32 v1, v2, v1
While this is acceptable when unsafe fast math is set, it may return imprecise results due to double rounding. This patch makes the v2f64 -> v2f16 conversion Legal only when unsafe fast math is set.

Fixes: SWDEV-523856
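
To see the double rounding concretely: a double just above the midpoint of two adjacent half values can land exactly on that midpoint after the intermediate f64 -> f32 rounding, and the f32 -> f16 tie then rounds the other way. A minimal standalone reproducer (a sketch, assuming a compiler with the _Float16 extension such as recent Clang; not part of this patch):

#include <cstdio>

int main() {
  // 1 + 2^-11 + 2^-40 lies just above the midpoint between the adjacent
  // half values 1.0 and 1.0 + 2^-10.
  double x = 1.0 + 0x1p-11 + 0x1p-40;
  _Float16 direct = (_Float16)x;         // one rounding: 1.0009765625
  _Float16 via_f32 = (_Float16)(float)x; // f64->f32 rounds to exactly the
                                         // midpoint; the f32->f16 tie then
                                         // rounds to even: 1.0
  printf("direct:  %a\n", (double)direct);
  printf("via f32: %a\n", (double)via_f32);
  return 0;
}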

  On targets that support v_cvt_pk_f16_f32 instruction, if we make v2f64 -> v2f16
Legal, we will generate the following sequence of instructions:
   v_cvt_f32_f64_e32 v1, s[6:7]
   v_cvt_f32_f64_e32 v2, s[4:5]
   v_cvt_pk_f16_f32 v1, v2, v1
While it is okay when unsafe fast math flag is set, it may return imprecise results
due to double rounding. This patch makes v2f64 -> v2f16 conversion Legal only when
unsafe fast math is set.

Fixes: SWDEV-523856
@llvmbot (Member)

llvmbot commented Apr 7, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Changpeng Fang (changpeng)

Changes

On targets that support the v_cvt_pk_f16_f32 instruction, if we make v2f64 -> v2f16 Legal, we will generate the following sequence of instructions:
v_cvt_f32_f64_e32 v1, s[6:7]
v_cvt_f32_f64_e32 v2, s[4:5]
v_cvt_pk_f16_f32 v1, v2, v1
While this is acceptable when unsafe fast math is set, it may return imprecise results due to double rounding. This patch makes the v2f64 -> v2f16 conversion Legal only when unsafe fast math is set.

Fixes: SWDEV-523856


Full diff: https://github.com/llvm/llvm-project/pull/134738.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp (+5-4)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+1-1)
  • (added) llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.fpmath.ll (+199)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index 275d0193452a5..4f264c1d6c794 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -1053,10 +1053,11 @@ AMDGPULegalizerInfo::AMDGPULegalizerInfo(const GCNSubtarget &ST_,
   }
 
   auto &FPTruncActions = getActionDefinitionsBuilder(G_FPTRUNC);
-  if (ST.hasCvtPkF16F32Inst())
-    FPTruncActions.legalFor(
-        {{S32, S64}, {S16, S32}, {V2S16, V2S32}, {V2S16, V2S64}});
-  else
+  if (ST.hasCvtPkF16F32Inst()) {
+    FPTruncActions.legalFor({{S32, S64}, {S16, S32}, {V2S16, V2S32}});
+    if (TM.Options.UnsafeFPMath)
+      FPTruncActions.legalFor({V2S16, V2S64});
+  } else
     FPTruncActions.legalFor({{S32, S64}, {S16, S32}});
   FPTruncActions.scalarize(0).lower();
 
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 356040da95672..8e49fa4323f5d 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -915,7 +915,7 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
     setOperationAction(ISD::BUILD_VECTOR, MVT::v2bf16, Legal);
   }
 
-  if (Subtarget->hasCvtPkF16F32Inst()) {
+  if (Subtarget->hasCvtPkF16F32Inst() && TM.Options.UnsafeFPMath) {
     setOperationAction(ISD::FP_ROUND, MVT::v2f16, Legal);
   }
 
diff --git a/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.fpmath.ll b/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.fpmath.ll
new file mode 100644
index 0000000000000..7ef31ef6387ae
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.fpmath.ll
@@ -0,0 +1,199 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn -mcpu=gfx950 -global-isel=0 < %s | FileCheck -check-prefixes=GFX950,GFX950-SAFE-SDAG %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx950 -global-isel=1 < %s | FileCheck -check-prefixes=GFX950,GFX950-SAFE-GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx950 -global-isel=0 -enable-unsafe-fp-math < %s | FileCheck -check-prefixes=GFX950,GFX950-UNSAFE %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx950 -global-isel=1 -enable-unsafe-fp-math < %s | FileCheck -check-prefixes=GFX950,GFX950-UNSAFE %s
+
+define amdgpu_ps <2 x half> @v_test_cvt_v2f32_v2f16(<2 x float> %src) {
+; GFX950-LABEL: v_test_cvt_v2f32_v2f16:
+; GFX950:       ; %bb.0:
+; GFX950-NEXT:    v_cvt_pk_f16_f32 v0, v0, v1
+; GFX950-NEXT:    ; return to shader part epilog
+  %res = fptrunc <2 x float> %src to <2 x half>
+  ret <2 x half> %res
+}
+
+define amdgpu_ps <2 x half> @v_test_cvt_v2f64_v2f16(<2 x double> %src) {
+; GFX950-SAFE-SDAG-LABEL: v_test_cvt_v2f64_v2f16:
+; GFX950-SAFE-SDAG:       ; %bb.0:
+; GFX950-SAFE-SDAG-NEXT:    s_movk_i32 s0, 0x1ff
+; GFX950-SAFE-SDAG-NEXT:    v_and_or_b32 v0, v1, s0, v0
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
+; GFX950-SAFE-SDAG-NEXT:    v_lshrrev_b32_e32 v4, 8, v1
+; GFX950-SAFE-SDAG-NEXT:    s_movk_i32 s1, 0xffe
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e64 v0, 0, 1, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_bfe_u32 v5, v1, 20, 11
+; GFX950-SAFE-SDAG-NEXT:    v_and_or_b32 v0, v4, s1, v0
+; GFX950-SAFE-SDAG-NEXT:    v_sub_u32_e32 v6, 0x3f1, v5
+; GFX950-SAFE-SDAG-NEXT:    v_or_b32_e32 v4, 0x1000, v0
+; GFX950-SAFE-SDAG-NEXT:    v_med3_i32 v6, v6, 0, 13
+; GFX950-SAFE-SDAG-NEXT:    v_lshrrev_b32_e32 v7, v6, v4
+; GFX950-SAFE-SDAG-NEXT:    v_lshlrev_b32_e32 v6, v6, v7
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_ne_u32_e32 vcc, v6, v4
+; GFX950-SAFE-SDAG-NEXT:    v_add_u32_e32 v5, 0xfffffc10, v5
+; GFX950-SAFE-SDAG-NEXT:    v_lshl_or_b32 v6, v5, 12, v0
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e64 v4, 0, 1, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_or_b32_e32 v4, v7, v4
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_gt_i32_e32 vcc, 1, v5
+; GFX950-SAFE-SDAG-NEXT:    s_movk_i32 s2, 0x40f
+; GFX950-SAFE-SDAG-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e32 v4, v6, v4, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_and_b32_e32 v6, 7, v4
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_lt_i32_e32 vcc, 5, v6
+; GFX950-SAFE-SDAG-NEXT:    v_lshrrev_b32_e32 v4, 2, v4
+; GFX950-SAFE-SDAG-NEXT:    s_mov_b32 s3, 0x8000
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e64 v7, 0, 1, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_eq_u32_e32 vcc, 3, v6
+; GFX950-SAFE-SDAG-NEXT:    s_nop 1
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e64 v6, 0, 1, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_or_b32_e32 v6, v6, v7
+; GFX950-SAFE-SDAG-NEXT:    v_add_u32_e32 v4, v4, v6
+; GFX950-SAFE-SDAG-NEXT:    v_mov_b32_e32 v6, 0x7c00
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_gt_i32_e32 vcc, 31, v5
+; GFX950-SAFE-SDAG-NEXT:    v_mov_b32_e32 v7, 0x7e00
+; GFX950-SAFE-SDAG-NEXT:    s_nop 0
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e32 v4, v6, v4, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
+; GFX950-SAFE-SDAG-NEXT:    s_nop 1
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e32 v0, v6, v7, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_eq_u32_e32 vcc, s2, v5
+; GFX950-SAFE-SDAG-NEXT:    s_nop 1
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e32 v0, v4, v0, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_and_or_b32 v0, v1, s3, v0
+; GFX950-SAFE-SDAG-NEXT:    v_and_or_b32 v1, v3, s0, v2
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v1
+; GFX950-SAFE-SDAG-NEXT:    v_lshrrev_b32_e32 v2, 8, v3
+; GFX950-SAFE-SDAG-NEXT:    v_bfe_u32 v4, v3, 20, 11
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e64 v1, 0, 1, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_and_or_b32 v1, v2, s1, v1
+; GFX950-SAFE-SDAG-NEXT:    v_sub_u32_e32 v5, 0x3f1, v4
+; GFX950-SAFE-SDAG-NEXT:    v_or_b32_e32 v2, 0x1000, v1
+; GFX950-SAFE-SDAG-NEXT:    v_med3_i32 v5, v5, 0, 13
+; GFX950-SAFE-SDAG-NEXT:    v_lshrrev_b32_e32 v8, v5, v2
+; GFX950-SAFE-SDAG-NEXT:    v_lshlrev_b32_e32 v5, v5, v8
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_ne_u32_e32 vcc, v5, v2
+; GFX950-SAFE-SDAG-NEXT:    v_add_u32_e32 v4, 0xfffffc10, v4
+; GFX950-SAFE-SDAG-NEXT:    v_lshl_or_b32 v5, v4, 12, v1
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e64 v2, 0, 1, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_or_b32_e32 v2, v8, v2
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_gt_i32_e32 vcc, 1, v4
+; GFX950-SAFE-SDAG-NEXT:    s_mov_b32 s0, 0x5040100
+; GFX950-SAFE-SDAG-NEXT:    s_nop 0
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e32 v2, v5, v2, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_and_b32_e32 v5, 7, v2
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_lt_i32_e32 vcc, 5, v5
+; GFX950-SAFE-SDAG-NEXT:    v_lshrrev_b32_e32 v2, 2, v2
+; GFX950-SAFE-SDAG-NEXT:    s_nop 0
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e64 v8, 0, 1, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_eq_u32_e32 vcc, 3, v5
+; GFX950-SAFE-SDAG-NEXT:    s_nop 1
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e64 v5, 0, 1, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_or_b32_e32 v5, v5, v8
+; GFX950-SAFE-SDAG-NEXT:    v_add_u32_e32 v2, v2, v5
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_gt_i32_e32 vcc, 31, v4
+; GFX950-SAFE-SDAG-NEXT:    s_nop 1
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e32 v2, v6, v2, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v1
+; GFX950-SAFE-SDAG-NEXT:    s_nop 1
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e32 v1, v6, v7, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_eq_u32_e32 vcc, s2, v4
+; GFX950-SAFE-SDAG-NEXT:    s_nop 1
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e32 v1, v2, v1, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_lshrrev_b32_e32 v2, 16, v3
+; GFX950-SAFE-SDAG-NEXT:    v_and_or_b32 v1, v2, s3, v1
+; GFX950-SAFE-SDAG-NEXT:    v_perm_b32 v0, v1, v0, s0
+; GFX950-SAFE-SDAG-NEXT:    ; return to shader part epilog
+;
+; GFX950-SAFE-GISEL-LABEL: v_test_cvt_v2f64_v2f16:
+; GFX950-SAFE-GISEL:       ; %bb.0:
+; GFX950-SAFE-GISEL-NEXT:    v_mov_b32_e32 v7, 0x1ff
+; GFX950-SAFE-GISEL-NEXT:    v_and_or_b32 v0, v1, v7, v0
+; GFX950-SAFE-GISEL-NEXT:    v_bfe_u32 v4, v1, 20, 11
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
+; GFX950-SAFE-GISEL-NEXT:    v_add_u32_e32 v4, 0xfffffc10, v4
+; GFX950-SAFE-GISEL-NEXT:    v_lshrrev_b32_e32 v5, 8, v1
+; GFX950-SAFE-GISEL-NEXT:    v_mov_b32_e32 v6, 0xffe
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e64 v0, 0, 1, vcc
+; GFX950-SAFE-GISEL-NEXT:    v_and_or_b32 v0, v5, v6, v0
+; GFX950-SAFE-GISEL-NEXT:    v_sub_u32_e32 v10, 1, v4
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
+; GFX950-SAFE-GISEL-NEXT:    v_lshl_or_b32 v9, v4, 12, v0
+; GFX950-SAFE-GISEL-NEXT:    v_med3_i32 v10, v10, 0, 13
+; GFX950-SAFE-GISEL-NEXT:    v_or_b32_e32 v0, 0x1000, v0
+; GFX950-SAFE-GISEL-NEXT:    v_lshrrev_b32_e32 v11, v10, v0
+; GFX950-SAFE-GISEL-NEXT:    v_lshlrev_b32_e32 v10, v10, v11
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e64 v5, 0, 1, vcc
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_ne_u32_e32 vcc, v10, v0
+; GFX950-SAFE-GISEL-NEXT:    v_mov_b32_e32 v8, 0x7c00
+; GFX950-SAFE-GISEL-NEXT:    v_lshl_or_b32 v5, v5, 9, v8
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e64 v0, 0, 1, vcc
+; GFX950-SAFE-GISEL-NEXT:    v_or_b32_e32 v0, v11, v0
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_gt_i32_e32 vcc, 1, v4
+; GFX950-SAFE-GISEL-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
+; GFX950-SAFE-GISEL-NEXT:    v_and_or_b32 v2, v3, v7, v2
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e32 v0, v9, v0, vcc
+; GFX950-SAFE-GISEL-NEXT:    v_and_b32_e32 v9, 7, v0
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_eq_u32_e32 vcc, 3, v9
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_lt_i32_e64 s[0:1], 5, v9
+; GFX950-SAFE-GISEL-NEXT:    s_or_b64 s[0:1], vcc, s[0:1]
+; GFX950-SAFE-GISEL-NEXT:    v_lshrrev_b32_e32 v0, 2, v0
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e64 v9, 0, 1, s[0:1]
+; GFX950-SAFE-GISEL-NEXT:    v_add_u32_e32 v0, v0, v9
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_lt_i32_e32 vcc, 30, v4
+; GFX950-SAFE-GISEL-NEXT:    v_mov_b32_e32 v9, 0x40f
+; GFX950-SAFE-GISEL-NEXT:    s_nop 0
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e32 v0, v0, v8, vcc
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_eq_u32_e32 vcc, v4, v9
+; GFX950-SAFE-GISEL-NEXT:    v_mov_b32_e32 v4, 0x8000
+; GFX950-SAFE-GISEL-NEXT:    s_nop 0
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e32 v0, v0, v5, vcc
+; GFX950-SAFE-GISEL-NEXT:    v_and_or_b32 v0, v1, v4, v0
+; GFX950-SAFE-GISEL-NEXT:    v_bfe_u32 v1, v3, 20, 11
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v2
+; GFX950-SAFE-GISEL-NEXT:    v_add_u32_e32 v1, 0xfffffc10, v1
+; GFX950-SAFE-GISEL-NEXT:    v_lshrrev_b32_e32 v5, 8, v3
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e64 v2, 0, 1, vcc
+; GFX950-SAFE-GISEL-NEXT:    v_and_or_b32 v2, v5, v6, v2
+; GFX950-SAFE-GISEL-NEXT:    v_sub_u32_e32 v7, 1, v1
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v2
+; GFX950-SAFE-GISEL-NEXT:    v_lshl_or_b32 v6, v1, 12, v2
+; GFX950-SAFE-GISEL-NEXT:    v_med3_i32 v7, v7, 0, 13
+; GFX950-SAFE-GISEL-NEXT:    v_or_b32_e32 v2, 0x1000, v2
+; GFX950-SAFE-GISEL-NEXT:    v_lshrrev_b32_e32 v10, v7, v2
+; GFX950-SAFE-GISEL-NEXT:    v_lshlrev_b32_e32 v7, v7, v10
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e64 v5, 0, 1, vcc
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_ne_u32_e32 vcc, v7, v2
+; GFX950-SAFE-GISEL-NEXT:    v_lshl_or_b32 v5, v5, 9, v8
+; GFX950-SAFE-GISEL-NEXT:    v_and_b32_e32 v0, 0xffff, v0
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e64 v2, 0, 1, vcc
+; GFX950-SAFE-GISEL-NEXT:    v_or_b32_e32 v2, v10, v2
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_gt_i32_e32 vcc, 1, v1
+; GFX950-SAFE-GISEL-NEXT:    s_nop 1
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e32 v2, v6, v2, vcc
+; GFX950-SAFE-GISEL-NEXT:    v_and_b32_e32 v6, 7, v2
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_eq_u32_e32 vcc, 3, v6
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_lt_i32_e64 s[0:1], 5, v6
+; GFX950-SAFE-GISEL-NEXT:    s_or_b64 s[0:1], vcc, s[0:1]
+; GFX950-SAFE-GISEL-NEXT:    v_lshrrev_b32_e32 v2, 2, v2
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e64 v6, 0, 1, s[0:1]
+; GFX950-SAFE-GISEL-NEXT:    v_add_u32_e32 v2, v2, v6
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_lt_i32_e32 vcc, 30, v1
+; GFX950-SAFE-GISEL-NEXT:    s_nop 1
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e32 v2, v2, v8, vcc
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_eq_u32_e32 vcc, v1, v9
+; GFX950-SAFE-GISEL-NEXT:    s_nop 1
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e32 v1, v2, v5, vcc
+; GFX950-SAFE-GISEL-NEXT:    v_lshrrev_b32_e32 v2, 16, v3
+; GFX950-SAFE-GISEL-NEXT:    v_and_or_b32 v1, v2, v4, v1
+; GFX950-SAFE-GISEL-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
+; GFX950-SAFE-GISEL-NEXT:    ; return to shader part epilog
+;
+; GFX950-UNSAFE-LABEL: v_test_cvt_v2f64_v2f16:
+; GFX950-UNSAFE:       ; %bb.0:
+; GFX950-UNSAFE-NEXT:    v_cvt_f32_f64_e32 v2, v[2:3]
+; GFX950-UNSAFE-NEXT:    v_cvt_f32_f64_e32 v0, v[0:1]
+; GFX950-UNSAFE-NEXT:    v_cvt_pk_f16_f32 v0, v0, v2
+; GFX950-UNSAFE-NEXT:    ; return to shader part epilog
+  %res = fptrunc <2 x double> %src to <2 x half>
+  ret <2 x half> %res
+}
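
For reference, the CHECK lines in the new test are produced by the script named in its header; a typical regeneration command (paths assumed from a standard llvm-project checkout with a build/ directory) is:

  llvm/utils/update_llc_test_checks.py --llc-binary build/bin/llc \
      llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.fpmath.ll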

@@ -915,7 +915,7 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
     setOperationAction(ISD::BUILD_VECTOR, MVT::v2bf16, Legal);
   }

-  if (Subtarget->hasCvtPkF16F32Inst()) {
+  if (Subtarget->hasCvtPkF16F32Inst() && TM.Options.UnsafeFPMath) {
     setOperationAction(ISD::FP_ROUND, MVT::v2f16, Legal);
Collaborator

It disables it not just for v2f64 but also for v2f32, right? How is that lowered then?

Contributor Author

It disables it not just for v2f64 but also for v2f32, right? How is that lowered then?
We have a pattern to generate v_cvt_pk_f16_f32:

def : GCNPat<(v2f16 (fpround v2f32:$src)),

We also have a pattern for v2f64 -> v2f16, but we will split the vector and then Custom lower f64 -> f16 first.

def : GCNPat<(v2f16 (fpround v2f64:$src)),

Collaborator

But the pattern should not work for an operation that is not legal.

Contributor Author

But the pattern should not work for an operation that is not legal.

My observation is that, if we do not specify Expand/Custom/Promote, the pattern will be picked up regardless of whether we specify Legal. So it does not matter whether you specify Legal or not; the pattern for v2f32 -> v2f16 will always be picked up.

This is similar to f32 -> f16: even though we did not specify Legal, v_cvt_f16_f32 will still be selected.

Contributor Author

Default is Legal?

Collaborator

No.

Contributor Author

I think Legal is default:
https://llvm.org/docs/WritingAnLLVMBackend.html#legal

The reason we sometimes need to explicitly specify Legal is that we need to set different actions based on different hardware features.
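
A minimal sketch of that idiom (hypothetical feature predicate, not the actual SIISelLowering code): since the default action is Legal, a target only calls setOperationAction to override it, or to pick different actions on the two sides of a feature check.

  // The default action is Legal, so the first branch is technically
  // redundant, but it makes the per-feature split explicit.
  if (Subtarget->hasSomeFeature())                          // hypothetical
    setOperationAction(ISD::FP_ROUND, MVT::v2f16, Legal);   // explicit Legal
  else
    setOperationAction(ISD::FP_ROUND, MVT::v2f16, Custom);  // override default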

Contributor Author

I am still learning. I think even if we specify Custom lowering but actually do nothing, like:
if (SrcVT != MVT::f64)
  return Op; // f32 is legal for fp_round
this is essentially "Legal": either there is a native instruction, or a pattern to match to native instructions. The compiler will throw an error if there is nothing to select.

@arsenm (Contributor) left a comment

Should not further spread use of the global options. FP casts have fast math flags; make use of those. Note this needs to be in custom lowering: legality cannot be use-context dependent.

; RUN: llc -mtriple=amdgcn -mcpu=gfx950 -global-isel=0 -enable-unsafe-fp-math < %s | FileCheck -check-prefixes=GFX950,GFX950-UNSAFE %s
; RUN: llc -mtriple=amdgcn -mcpu=gfx950 -global-isel=1 -enable-unsafe-fp-math < %s | FileCheck -check-prefixes=GFX950,GFX950-UNSAFE %s

define amdgpu_ps <2 x half> @v_test_cvt_v2f32_v2f16(<2 x float> %src) {
@arsenm (Contributor) commented Apr 8, 2025

Don't need the shader conventions. Also test the scalar case and wider vectors.

Contributor Author

Don't need the shader conventions. Also test the scalar case and wider vectors.

Removed the shader conventions, and added tests for other types.

  Custom lower FP_ROUND for v2f16. Also add test cases for scalar and wider vectors

@changpeng changpeng requested a review from bcahoon April 9, 2025 21:34
@changpeng (Contributor Author)

Should not further spread use of the global options. FP casts have fast math flags; make use of those. Note this needs to be in custom lowering: legality cannot be use-context dependent.

I understand your proposal to avoid using global options. But here we need to be consistent: Options.UnsafeFPMath is used in all other places to expand/lower f64 -> f16. We should not use a different approach only here.

Or do you prefer a combination of fast math flag and global option, like this?
const SDNodeFlags Flags = Op->getFlags();
bool AllowInaccurateFP_ROUND =
Flags.hasApproximateFuncs() || DAG.getTarget().Options.UnsafeFPMath;

@arsenm (Contributor)

arsenm commented Apr 10, 2025

I understand your proposal to avoid using global options. But here we need to be consistent: Options.UnsafeFPMath is used in all other places to expand/lower f64 -> f16. We should not use a different approach only here.

This is because traditionally the FP casts did not have fast math flags. They do now, and we should migrate to only using them.

@arsenm (Contributor)

arsenm commented Apr 10, 2025

Or do you prefer a combination of fast math flag and global option, like this? const SDNodeFlags Flags = Op->getFlags(); bool AllowInaccurateFP_ROUND = Flags.hasApproximateFuncs() || DAG.getTarget().Options.UnsafeFPMath;

You can do that, but it still needs to be a contextual-use transformation, not a modification of the legality rules.

@changpeng (Contributor Author)

Or do you prefer a combination of fast math flag and global option, like this? const SDNodeFlags Flags = Op->getFlags(); bool AllowInaccurateFP_ROUND = Flags.hasApproximateFuncs() || DAG.getTarget().Options.UnsafeFPMath;

You can do that, but it still needs to be a contextual-use transformation, not a modification of the legality rules.

I do not fully understand what you meant by "modify the legality rules", and why we should not do so.

Here is the example in custom lowering for FP_TO_FP16:

if (getTargetMachine().Options.UnsafeFPMath) {
  // There is a generic expand for FP_TO_FP16 with unsafe fast math.
  return SDValue();
}

Do you mean this changes Custom to Expand, and that is not allowed? Thanks.

@changpeng (Contributor Author)

I understand your proposal to avoid using global options. But here we need to be consistent: Options.UnsafeFPMath is used in all other places to expand/lower f64 -> f16. We should not use a different approach only here.

This is because traditionally the FP casts did not have fast math flags. They do now, and we should migrate to only using them.
Do you mean the fptrunc instruction can have the "fast" flag now?
%res = fptrunc fast <2 x double> %src to <2 x half>

How do we check that the instruction has the "fast" flag?
Flags.hasApproximateFuncs() does not work when we have "fast". Thanks.
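
For reference, a sketch of how the afn bit can be queried on each path (these APIs exist in current LLVM; whether the bit is actually populated for FP_ROUND is exactly the problem discussed here):

  // SelectionDAG node: IR "fast" implies all individual flags, including
  // afn, but only if the flags were propagated onto this node.
  const SDNodeFlags Flags = Op->getFlags();
  bool Afn = Flags.hasApproximateFuncs();

  // GlobalISel / MIR instruction:
  bool AfnMI = MI.getFlag(MachineInstr::FmAfn);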

@changpeng (Contributor Author)

I understand your proposal to avoid using global options. But here we need to be consistent: Options.UnsafeFPMath is used in all other places to expand/lower f64 -> f16. We should not use a different approach only here.

This is because traditionally the FP casts did not have fast math flags. They do now, and we should migrate to only using them.

On the SelectionDAG path, FP_ROUND still does not have fast math flags. I still think adding fast math flags for FP_ROUND is beyond the scope of this patch. (We would also need to propagate the flags when we do transformations like splitting the vector.)


⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff HEAD~1 HEAD --extensions h,cpp -- llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h llvm/lib/Target/AMDGPU/SIISelLowering.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index ee3f5dd4e..bf46aef0c 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -1055,7 +1055,7 @@ AMDGPULegalizerInfo::AMDGPULegalizerInfo(const GCNSubtarget &ST_,
   auto &FPTruncActions = getActionDefinitionsBuilder(G_FPTRUNC);
   if (ST.hasCvtPkF16F32Inst()) {
     FPTruncActions.legalFor({{S32, S64}, {S16, S32}, {V2S16, V2S32}})
-                  .customFor({V2S16, V2S64});
+        .customFor({V2S16, V2S64});
   } else
     FPTruncActions.legalFor({{S32, S64}, {S16, S32}});
   FPTruncActions.scalarize(0).lower();
@@ -2750,8 +2750,8 @@ bool AMDGPULegalizerInfo::legalizeFPTrunc(LegalizerHelper &Helper,
   // still used here to be consistent, especially when the fast math flag is
   // not working for FP_ROUND on the SelectDAG path at this moment.
   MachineFunction &MF = Helper.MIRBuilder.getMF();
-  bool AllowInaccurateFPTRUNC = MI.getFlag(MachineInstr::FmAfn) ||
-                                MF.getTarget().Options.UnsafeFPMath;
+  bool AllowInaccurateFPTRUNC =
+      MI.getFlag(MachineInstr::FmAfn) || MF.getTarget().Options.UnsafeFPMath;
 
   if (AllowInaccurateFPTRUNC) {
     // Use the tablegen pattern to select native instructions.

@changpeng changpeng requested a review from arsenm April 11, 2025 20:52
@changpeng (Contributor Author)

Custom lowering of v2f64 -> v2f16 is now done on both the SelectionDAG and GlobalISel paths. The fast math flag for fp_round is not working at all and needs additional work beyond this patch.

  if (ST.hasCvtPkF16F32Inst()) {
    FPTruncActions.legalFor({{S32, S64}, {S16, S32}, {V2S16, V2S32}})
                  .customFor({V2S16, V2S64});
Contributor

It's suspect that we need custom handling for the v2 case, but not the regular scalar

Contributor Author

It's suspect that we need custom handling for the v2 case, but not the regular scalar

Previously only v2f16 was made Legal in order to generate v_cvt_pk_f16_f32, I think.

Comment on lines +2749 to +2766
  // TODO: We should only use fast math flag. But the global option is
  // still used here to be consistent, especially when the fast math flag is
  // not working for FP_ROUND on the SelectDAG path at this moment.
  MachineFunction &MF = Helper.MIRBuilder.getMF();
  bool AllowInaccurateFPTRUNC = MI.getFlag(MachineInstr::FmAfn) ||
                                MF.getTarget().Options.UnsafeFPMath;

  if (AllowInaccurateFPTRUNC) {
    // Use the tablegen pattern to select native instructions.
    return true;
  }

  Register DstReg = MI.getOperand(0).getReg();
  LLT DstTy = MRI.getType(DstReg);

  // Scalarize the vector and fall through to lower f64 -> f16.
  return Helper.fewerElementsVector(MI, 0, DstTy.getElementType()) ==
         LegalizerHelper::Legalized;
Contributor

This seems like it should be a combine and not a modification of legalization rules

Contributor Author

This seems like it should be a combine and not a modification of legalization rules

I am looking at an instcombine to generate v_cvt_pk_f16_f32 instructions. Should we implement it in AMDGPU since it is target dependent? Thanks.

@@ -0,0 +1,1748 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
Contributor

Test belongs in the existing fptrunc.f16.ll test

Contributor Author

Test belongs in the existing fptrunc.f16.ll test

The existing fptrunc.f16.ll only checks the case where unsafe fpmath is set. This patch actually fixes the codegen when unsafe fpmath is not set, and only for v2f16. Maybe we should not include the scalar and wider-vector tests in this file?

    // Use the tablegen pattern to select native instructions.
    return true;
  }

Contributor

This handling logic doesn't make sense. The decision to scalarize or not has nothing to do with the FP properties? The tablegen pattern also handles the build_vector case, which is the MIR I would expect to select from. Can we just drop the direct vector handling, and always scalarize?

The tablegen patterns should also be guarded with the fast checks if we're going to have them.

Contributor Author

This handling logic doesn't make sense. The decision to scalarize or not has nothing to do with the FP properties? The tablegen pattern also handles the build_vector case, which is the MIR I would expect to select from. Can we just drop the direct vector handling, and always scalarize?

The tablegen patterns should also be guarded with the fast checks if we're going to have them.

The logic generates instructions based on the pattern, but this is only correct when unsafe fpmath is set.
Yes, we may always scalarize. Should we do an instcombine to generate v_cvt_pk_f16_f32?
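
For illustration, the scalarized shape such a combine would have to recognize and re-pack looks roughly like this (illustrative IR, not the output of any particular pass):

  %lo = fptrunc afn double %a to half
  %hi = fptrunc afn double %b to half
  %v = insertelement <2 x half> poison, half %lo, i64 0
  %res = insertelement <2 x half> %v, half %hi, i64 1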

@@ -6893,6 +6893,19 @@ SDValue SITargetLowering::lowerFP_ROUND(SDValue Op, SelectionDAG &DAG) const {
    return Op;

  EVT DstVT = Op.getValueType();

  if (DstVT == MVT::v2f16) {
Contributor

Same comments as on the globalisel path. The vectoriness shouldn't matter; we don't need to directly consume the operation.

@changpeng (Contributor Author) commented Apr 17, 2025

Same comments as on the globalisel path. The vectoriness shouldn't matter; we don't need to directly consume the operation.

Okay, we can do as in #135659, which essentially scalarizes v2f16. We may do an instcombine to generate the vector cvt as long as the target supports it. Thanks.

changpeng added a commit to changpeng/llvm-project that referenced this pull request Apr 14, 2025
  On targets that support v_cvt_pk_f16_f32 instruction, if we make v2f64 -> v2f16
Legal, we will generate the following sequence of instructions:
  v_cvt_f32_f64_e32 v1, s[6:7]
  v_cvt_f32_f64_e32 v2, s[4:5]
  v_cvt_pk_f16_f32 v1, v2, v1
It possibly returns imprecise results due to double rounding. This patch fixes the issue
by not setting the conversion Legal. While we may still expect the above sequence of code
when unsafe fpmath is set, I hope llvm#134738 can
address that performance concern.

Fixes: SWDEV-523856
changpeng added a commit that referenced this pull request Apr 17, 2025
…135659)

llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Apr 17, 2025
…nversion (#135659)

searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Apr 24, 2025
…lvm#135659) (llvm#1791)

IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
…lvm#135659)

jrbyrnes pushed a commit to jrbyrnes/llvm-project that referenced this pull request May 27, 2025
…lvm#135659)
