
AMDGPU: Make v2f64 -> v2f16 conversion Legal only when unsafe fast math is set #134738


Open
changpeng wants to merge 3 commits into main

Conversation

changpeng (Contributor)

On targets that support the v_cvt_pk_f16_f32 instruction, if we make v2f64 -> v2f16 Legal, we will generate the following sequence of instructions:
v_cvt_f32_f64_e32 v1, s[6:7]
v_cvt_f32_f64_e32 v2, s[4:5]
v_cvt_pk_f16_f32 v1, v2, v1
While this is acceptable when unsafe fast math is set, it may return imprecise results due to double rounding. This patch makes the v2f64 -> v2f16 conversion Legal only when unsafe fast math is set.

Fixes: SWDEV-523856
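
To see the double rounding concretely: a double just above the midpoint of two adjacent half values can land exactly on that midpoint after the intermediate f64 -> f32 rounding, and the f32 -> f16 tie then rounds the other way. A minimal standalone reproducer (a sketch, assuming a compiler with the _Float16 extension such as recent Clang; not part of this patch):

#include <cstdio>

int main() {
  // 1 + 2^-11 + 2^-40 lies just above the midpoint between the adjacent
  // half values 1.0 and 1.0 + 2^-10.
  double x = 1.0 + 0x1p-11 + 0x1p-40;
  _Float16 direct = (_Float16)x;         // one rounding: 1.0009765625
  _Float16 via_f32 = (_Float16)(float)x; // f64->f32 rounds to exactly the
                                         // midpoint; the f32->f16 tie then
                                         // rounds to even: 1.0
  printf("direct:  %a\n", (double)direct);
  printf("via f32: %a\n", (double)via_f32);
  return 0;
}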

  On targets that support v_cvt_pk_f16_f32 instruction, if we make v2f64 -> v2f16
Legal, we will generate the following sequence of instructions:
   v_cvt_f32_f64_e32 v1, s[6:7]
   v_cvt_f32_f64_e32 v2, s[4:5]
   v_cvt_pk_f16_f32 v1, v2, v1
While it is okay when unsafe fast math flag is set, it may return imprecise results
due to double rounding. This patch makes v2f64 -> v2f16 conversion Legal only when
unsafe fast math is set.

Fixes: SWDEV-523856
@llvmbot (Member)

llvmbot commented Apr 7, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Changpeng Fang (changpeng)

Changes

On targets that support the v_cvt_pk_f16_f32 instruction, if we make v2f64 -> v2f16 Legal, we will generate the following sequence of instructions:
v_cvt_f32_f64_e32 v1, s[6:7]
v_cvt_f32_f64_e32 v2, s[4:5]
v_cvt_pk_f16_f32 v1, v2, v1
While this is acceptable when unsafe fast math is set, it may return imprecise results due to double rounding. This patch makes the v2f64 -> v2f16 conversion Legal only when unsafe fast math is set.

Fixes: SWDEV-523856


Full diff: https://github.com/llvm/llvm-project/pull/134738.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp (+5-4)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+1-1)
  • (added) llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.fpmath.ll (+199)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index 275d0193452a5..4f264c1d6c794 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -1053,10 +1053,11 @@ AMDGPULegalizerInfo::AMDGPULegalizerInfo(const GCNSubtarget &ST_,
   }
 
   auto &FPTruncActions = getActionDefinitionsBuilder(G_FPTRUNC);
-  if (ST.hasCvtPkF16F32Inst())
-    FPTruncActions.legalFor(
-        {{S32, S64}, {S16, S32}, {V2S16, V2S32}, {V2S16, V2S64}});
-  else
+  if (ST.hasCvtPkF16F32Inst()) {
+    FPTruncActions.legalFor({{S32, S64}, {S16, S32}, {V2S16, V2S32}});
+    if (TM.Options.UnsafeFPMath)
+      FPTruncActions.legalFor({V2S16, V2S64});
+  } else
     FPTruncActions.legalFor({{S32, S64}, {S16, S32}});
   FPTruncActions.scalarize(0).lower();
 
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 356040da95672..8e49fa4323f5d 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -915,7 +915,7 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
     setOperationAction(ISD::BUILD_VECTOR, MVT::v2bf16, Legal);
   }
 
-  if (Subtarget->hasCvtPkF16F32Inst()) {
+  if (Subtarget->hasCvtPkF16F32Inst() && TM.Options.UnsafeFPMath) {
     setOperationAction(ISD::FP_ROUND, MVT::v2f16, Legal);
   }
 
diff --git a/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.fpmath.ll b/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.fpmath.ll
new file mode 100644
index 0000000000000..7ef31ef6387ae
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.fpmath.ll
@@ -0,0 +1,199 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn -mcpu=gfx950 -global-isel=0 < %s | FileCheck -check-prefixes=GFX950,GFX950-SAFE-SDAG %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx950 -global-isel=1 < %s | FileCheck -check-prefixes=GFX950,GFX950-SAFE-GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx950 -global-isel=0 -enable-unsafe-fp-math < %s | FileCheck -check-prefixes=GFX950,GFX950-UNSAFE %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx950 -global-isel=1 -enable-unsafe-fp-math < %s | FileCheck -check-prefixes=GFX950,GFX950-UNSAFE %s
+
+define amdgpu_ps <2 x half> @v_test_cvt_v2f32_v2f16(<2 x float> %src) {
+; GFX950-LABEL: v_test_cvt_v2f32_v2f16:
+; GFX950:       ; %bb.0:
+; GFX950-NEXT:    v_cvt_pk_f16_f32 v0, v0, v1
+; GFX950-NEXT:    ; return to shader part epilog
+  %res = fptrunc <2 x float> %src to <2 x half>
+  ret <2 x half> %res
+}
+
+define amdgpu_ps <2 x half> @v_test_cvt_v2f64_v2f16(<2 x double> %src) {
+; GFX950-SAFE-SDAG-LABEL: v_test_cvt_v2f64_v2f16:
+; GFX950-SAFE-SDAG:       ; %bb.0:
+; GFX950-SAFE-SDAG-NEXT:    s_movk_i32 s0, 0x1ff
+; GFX950-SAFE-SDAG-NEXT:    v_and_or_b32 v0, v1, s0, v0
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
+; GFX950-SAFE-SDAG-NEXT:    v_lshrrev_b32_e32 v4, 8, v1
+; GFX950-SAFE-SDAG-NEXT:    s_movk_i32 s1, 0xffe
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e64 v0, 0, 1, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_bfe_u32 v5, v1, 20, 11
+; GFX950-SAFE-SDAG-NEXT:    v_and_or_b32 v0, v4, s1, v0
+; GFX950-SAFE-SDAG-NEXT:    v_sub_u32_e32 v6, 0x3f1, v5
+; GFX950-SAFE-SDAG-NEXT:    v_or_b32_e32 v4, 0x1000, v0
+; GFX950-SAFE-SDAG-NEXT:    v_med3_i32 v6, v6, 0, 13
+; GFX950-SAFE-SDAG-NEXT:    v_lshrrev_b32_e32 v7, v6, v4
+; GFX950-SAFE-SDAG-NEXT:    v_lshlrev_b32_e32 v6, v6, v7
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_ne_u32_e32 vcc, v6, v4
+; GFX950-SAFE-SDAG-NEXT:    v_add_u32_e32 v5, 0xfffffc10, v5
+; GFX950-SAFE-SDAG-NEXT:    v_lshl_or_b32 v6, v5, 12, v0
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e64 v4, 0, 1, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_or_b32_e32 v4, v7, v4
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_gt_i32_e32 vcc, 1, v5
+; GFX950-SAFE-SDAG-NEXT:    s_movk_i32 s2, 0x40f
+; GFX950-SAFE-SDAG-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e32 v4, v6, v4, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_and_b32_e32 v6, 7, v4
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_lt_i32_e32 vcc, 5, v6
+; GFX950-SAFE-SDAG-NEXT:    v_lshrrev_b32_e32 v4, 2, v4
+; GFX950-SAFE-SDAG-NEXT:    s_mov_b32 s3, 0x8000
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e64 v7, 0, 1, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_eq_u32_e32 vcc, 3, v6
+; GFX950-SAFE-SDAG-NEXT:    s_nop 1
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e64 v6, 0, 1, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_or_b32_e32 v6, v6, v7
+; GFX950-SAFE-SDAG-NEXT:    v_add_u32_e32 v4, v4, v6
+; GFX950-SAFE-SDAG-NEXT:    v_mov_b32_e32 v6, 0x7c00
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_gt_i32_e32 vcc, 31, v5
+; GFX950-SAFE-SDAG-NEXT:    v_mov_b32_e32 v7, 0x7e00
+; GFX950-SAFE-SDAG-NEXT:    s_nop 0
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e32 v4, v6, v4, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
+; GFX950-SAFE-SDAG-NEXT:    s_nop 1
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e32 v0, v6, v7, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_eq_u32_e32 vcc, s2, v5
+; GFX950-SAFE-SDAG-NEXT:    s_nop 1
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e32 v0, v4, v0, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_and_or_b32 v0, v1, s3, v0
+; GFX950-SAFE-SDAG-NEXT:    v_and_or_b32 v1, v3, s0, v2
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v1
+; GFX950-SAFE-SDAG-NEXT:    v_lshrrev_b32_e32 v2, 8, v3
+; GFX950-SAFE-SDAG-NEXT:    v_bfe_u32 v4, v3, 20, 11
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e64 v1, 0, 1, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_and_or_b32 v1, v2, s1, v1
+; GFX950-SAFE-SDAG-NEXT:    v_sub_u32_e32 v5, 0x3f1, v4
+; GFX950-SAFE-SDAG-NEXT:    v_or_b32_e32 v2, 0x1000, v1
+; GFX950-SAFE-SDAG-NEXT:    v_med3_i32 v5, v5, 0, 13
+; GFX950-SAFE-SDAG-NEXT:    v_lshrrev_b32_e32 v8, v5, v2
+; GFX950-SAFE-SDAG-NEXT:    v_lshlrev_b32_e32 v5, v5, v8
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_ne_u32_e32 vcc, v5, v2
+; GFX950-SAFE-SDAG-NEXT:    v_add_u32_e32 v4, 0xfffffc10, v4
+; GFX950-SAFE-SDAG-NEXT:    v_lshl_or_b32 v5, v4, 12, v1
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e64 v2, 0, 1, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_or_b32_e32 v2, v8, v2
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_gt_i32_e32 vcc, 1, v4
+; GFX950-SAFE-SDAG-NEXT:    s_mov_b32 s0, 0x5040100
+; GFX950-SAFE-SDAG-NEXT:    s_nop 0
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e32 v2, v5, v2, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_and_b32_e32 v5, 7, v2
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_lt_i32_e32 vcc, 5, v5
+; GFX950-SAFE-SDAG-NEXT:    v_lshrrev_b32_e32 v2, 2, v2
+; GFX950-SAFE-SDAG-NEXT:    s_nop 0
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e64 v8, 0, 1, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_eq_u32_e32 vcc, 3, v5
+; GFX950-SAFE-SDAG-NEXT:    s_nop 1
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e64 v5, 0, 1, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_or_b32_e32 v5, v5, v8
+; GFX950-SAFE-SDAG-NEXT:    v_add_u32_e32 v2, v2, v5
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_gt_i32_e32 vcc, 31, v4
+; GFX950-SAFE-SDAG-NEXT:    s_nop 1
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e32 v2, v6, v2, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v1
+; GFX950-SAFE-SDAG-NEXT:    s_nop 1
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e32 v1, v6, v7, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_cmp_eq_u32_e32 vcc, s2, v4
+; GFX950-SAFE-SDAG-NEXT:    s_nop 1
+; GFX950-SAFE-SDAG-NEXT:    v_cndmask_b32_e32 v1, v2, v1, vcc
+; GFX950-SAFE-SDAG-NEXT:    v_lshrrev_b32_e32 v2, 16, v3
+; GFX950-SAFE-SDAG-NEXT:    v_and_or_b32 v1, v2, s3, v1
+; GFX950-SAFE-SDAG-NEXT:    v_perm_b32 v0, v1, v0, s0
+; GFX950-SAFE-SDAG-NEXT:    ; return to shader part epilog
+;
+; GFX950-SAFE-GISEL-LABEL: v_test_cvt_v2f64_v2f16:
+; GFX950-SAFE-GISEL:       ; %bb.0:
+; GFX950-SAFE-GISEL-NEXT:    v_mov_b32_e32 v7, 0x1ff
+; GFX950-SAFE-GISEL-NEXT:    v_and_or_b32 v0, v1, v7, v0
+; GFX950-SAFE-GISEL-NEXT:    v_bfe_u32 v4, v1, 20, 11
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
+; GFX950-SAFE-GISEL-NEXT:    v_add_u32_e32 v4, 0xfffffc10, v4
+; GFX950-SAFE-GISEL-NEXT:    v_lshrrev_b32_e32 v5, 8, v1
+; GFX950-SAFE-GISEL-NEXT:    v_mov_b32_e32 v6, 0xffe
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e64 v0, 0, 1, vcc
+; GFX950-SAFE-GISEL-NEXT:    v_and_or_b32 v0, v5, v6, v0
+; GFX950-SAFE-GISEL-NEXT:    v_sub_u32_e32 v10, 1, v4
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
+; GFX950-SAFE-GISEL-NEXT:    v_lshl_or_b32 v9, v4, 12, v0
+; GFX950-SAFE-GISEL-NEXT:    v_med3_i32 v10, v10, 0, 13
+; GFX950-SAFE-GISEL-NEXT:    v_or_b32_e32 v0, 0x1000, v0
+; GFX950-SAFE-GISEL-NEXT:    v_lshrrev_b32_e32 v11, v10, v0
+; GFX950-SAFE-GISEL-NEXT:    v_lshlrev_b32_e32 v10, v10, v11
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e64 v5, 0, 1, vcc
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_ne_u32_e32 vcc, v10, v0
+; GFX950-SAFE-GISEL-NEXT:    v_mov_b32_e32 v8, 0x7c00
+; GFX950-SAFE-GISEL-NEXT:    v_lshl_or_b32 v5, v5, 9, v8
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e64 v0, 0, 1, vcc
+; GFX950-SAFE-GISEL-NEXT:    v_or_b32_e32 v0, v11, v0
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_gt_i32_e32 vcc, 1, v4
+; GFX950-SAFE-GISEL-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
+; GFX950-SAFE-GISEL-NEXT:    v_and_or_b32 v2, v3, v7, v2
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e32 v0, v9, v0, vcc
+; GFX950-SAFE-GISEL-NEXT:    v_and_b32_e32 v9, 7, v0
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_eq_u32_e32 vcc, 3, v9
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_lt_i32_e64 s[0:1], 5, v9
+; GFX950-SAFE-GISEL-NEXT:    s_or_b64 s[0:1], vcc, s[0:1]
+; GFX950-SAFE-GISEL-NEXT:    v_lshrrev_b32_e32 v0, 2, v0
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e64 v9, 0, 1, s[0:1]
+; GFX950-SAFE-GISEL-NEXT:    v_add_u32_e32 v0, v0, v9
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_lt_i32_e32 vcc, 30, v4
+; GFX950-SAFE-GISEL-NEXT:    v_mov_b32_e32 v9, 0x40f
+; GFX950-SAFE-GISEL-NEXT:    s_nop 0
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e32 v0, v0, v8, vcc
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_eq_u32_e32 vcc, v4, v9
+; GFX950-SAFE-GISEL-NEXT:    v_mov_b32_e32 v4, 0x8000
+; GFX950-SAFE-GISEL-NEXT:    s_nop 0
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e32 v0, v0, v5, vcc
+; GFX950-SAFE-GISEL-NEXT:    v_and_or_b32 v0, v1, v4, v0
+; GFX950-SAFE-GISEL-NEXT:    v_bfe_u32 v1, v3, 20, 11
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v2
+; GFX950-SAFE-GISEL-NEXT:    v_add_u32_e32 v1, 0xfffffc10, v1
+; GFX950-SAFE-GISEL-NEXT:    v_lshrrev_b32_e32 v5, 8, v3
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e64 v2, 0, 1, vcc
+; GFX950-SAFE-GISEL-NEXT:    v_and_or_b32 v2, v5, v6, v2
+; GFX950-SAFE-GISEL-NEXT:    v_sub_u32_e32 v7, 1, v1
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v2
+; GFX950-SAFE-GISEL-NEXT:    v_lshl_or_b32 v6, v1, 12, v2
+; GFX950-SAFE-GISEL-NEXT:    v_med3_i32 v7, v7, 0, 13
+; GFX950-SAFE-GISEL-NEXT:    v_or_b32_e32 v2, 0x1000, v2
+; GFX950-SAFE-GISEL-NEXT:    v_lshrrev_b32_e32 v10, v7, v2
+; GFX950-SAFE-GISEL-NEXT:    v_lshlrev_b32_e32 v7, v7, v10
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e64 v5, 0, 1, vcc
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_ne_u32_e32 vcc, v7, v2
+; GFX950-SAFE-GISEL-NEXT:    v_lshl_or_b32 v5, v5, 9, v8
+; GFX950-SAFE-GISEL-NEXT:    v_and_b32_e32 v0, 0xffff, v0
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e64 v2, 0, 1, vcc
+; GFX950-SAFE-GISEL-NEXT:    v_or_b32_e32 v2, v10, v2
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_gt_i32_e32 vcc, 1, v1
+; GFX950-SAFE-GISEL-NEXT:    s_nop 1
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e32 v2, v6, v2, vcc
+; GFX950-SAFE-GISEL-NEXT:    v_and_b32_e32 v6, 7, v2
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_eq_u32_e32 vcc, 3, v6
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_lt_i32_e64 s[0:1], 5, v6
+; GFX950-SAFE-GISEL-NEXT:    s_or_b64 s[0:1], vcc, s[0:1]
+; GFX950-SAFE-GISEL-NEXT:    v_lshrrev_b32_e32 v2, 2, v2
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e64 v6, 0, 1, s[0:1]
+; GFX950-SAFE-GISEL-NEXT:    v_add_u32_e32 v2, v2, v6
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_lt_i32_e32 vcc, 30, v1
+; GFX950-SAFE-GISEL-NEXT:    s_nop 1
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e32 v2, v2, v8, vcc
+; GFX950-SAFE-GISEL-NEXT:    v_cmp_eq_u32_e32 vcc, v1, v9
+; GFX950-SAFE-GISEL-NEXT:    s_nop 1
+; GFX950-SAFE-GISEL-NEXT:    v_cndmask_b32_e32 v1, v2, v5, vcc
+; GFX950-SAFE-GISEL-NEXT:    v_lshrrev_b32_e32 v2, 16, v3
+; GFX950-SAFE-GISEL-NEXT:    v_and_or_b32 v1, v2, v4, v1
+; GFX950-SAFE-GISEL-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
+; GFX950-SAFE-GISEL-NEXT:    ; return to shader part epilog
+;
+; GFX950-UNSAFE-LABEL: v_test_cvt_v2f64_v2f16:
+; GFX950-UNSAFE:       ; %bb.0:
+; GFX950-UNSAFE-NEXT:    v_cvt_f32_f64_e32 v2, v[2:3]
+; GFX950-UNSAFE-NEXT:    v_cvt_f32_f64_e32 v0, v[0:1]
+; GFX950-UNSAFE-NEXT:    v_cvt_pk_f16_f32 v0, v0, v2
+; GFX950-UNSAFE-NEXT:    ; return to shader part epilog
+  %res = fptrunc <2 x double> %src to <2 x half>
+  ret <2 x half> %res
+}
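
For reference, the CHECK lines in the new test are produced by the script named in its header; a typical regeneration command (paths assumed from a standard llvm-project checkout with a build/ directory) is:

  llvm/utils/update_llc_test_checks.py --llc-binary build/bin/llc \
      llvm/test/CodeGen/AMDGPU/fptrunc.v2f16.fpmath.ll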

@@ -915,7 +915,7 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
     setOperationAction(ISD::BUILD_VECTOR, MVT::v2bf16, Legal);
   }

-  if (Subtarget->hasCvtPkF16F32Inst()) {
+  if (Subtarget->hasCvtPkF16F32Inst() && TM.Options.UnsafeFPMath) {
     setOperationAction(ISD::FP_ROUND, MVT::v2f16, Legal);
Collaborator

It disables it not just for v2f64 but also for v2f32, right? How is that lowered then?

Contributor Author

It disables it not just for v2f64 but also for v2f32, right? How is that lowered then?
We have a pattern to generate v_cvt_pk_f16_f32:

def : GCNPat<(v2f16 (fpround v2f32:$src)),

We also have a pattern for v2f64 -> v2f16, but we will split the vector and then Custom lower f64 -> f16 first.

def : GCNPat<(v2f16 (fpround v2f64:$src)),

Collaborator

But the pattern should not work for an operation that is not legal.

Contributor Author

But the pattern should not work for an operation that is not legal.

My observation is that, if we do not specify Expand/Custom/Promote, the pattern will be picked up regardless of whether we specify Legal. So it does not matter whether you specify Legal or not; the pattern for v2f32 -> v2f16 will always be picked up.

This is similar to f32 -> f16: even though we did not specify Legal, v_cvt_f16_f32 will still be selected.

Contributor Author

Default is Legal?

Collaborator

No.

Contributor Author

I think Legal is default:
https://llvm.org/docs/WritingAnLLVMBackend.html#legal

The reason we sometimes need to explicitly specify Legal is that we need to set different actions based on different hardware features.
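
A minimal sketch of that idiom (hypothetical feature predicate, not the actual SIISelLowering code): since the default action is Legal, a target only calls setOperationAction to override it, or to pick different actions on the two sides of a feature check.

  // The default action is Legal, so the first branch is technically
  // redundant, but it makes the per-feature split explicit.
  if (Subtarget->hasSomeFeature())                          // hypothetical
    setOperationAction(ISD::FP_ROUND, MVT::v2f16, Legal);   // explicit Legal
  else
    setOperationAction(ISD::FP_ROUND, MVT::v2f16, Custom);  // override default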

Contributor Author

I am still learning. I think even if we specify Custom lowering but actually do nothing, like:
if (SrcVT != MVT::f64)
  return Op; // f32 is legal for fp_round
this is essentially "Legal": either there is a native instruction, or a pattern to match to native instructions. The compiler will throw an error if there is nothing to select.

@arsenm (Contributor) left a comment

Should not further spread use of the global options. FP casts have fast math flags; make use of those. Note this needs to be in custom lowering: legality cannot be use-context dependent.

; RUN: llc -mtriple=amdgcn -mcpu=gfx950 -global-isel=0 -enable-unsafe-fp-math < %s | FileCheck -check-prefixes=GFX950,GFX950-UNSAFE %s
; RUN: llc -mtriple=amdgcn -mcpu=gfx950 -global-isel=1 -enable-unsafe-fp-math < %s | FileCheck -check-prefixes=GFX950,GFX950-UNSAFE %s

define amdgpu_ps <2 x half> @v_test_cvt_v2f32_v2f16(<2 x float> %src) {
@arsenm (Contributor) commented Apr 8, 2025

Don't need the shader conventions. Also test the scalar case and wider vectors.

Contributor Author

Don't need the shader conventions. Also test the scalar case and wider vectors.

Removed the shader conventions, and added tests for other types.

  Custom lower FP_ROUND for v2f16. Also add test cases for scalar and wider vectors

@changpeng changpeng requested a review from bcahoon April 9, 2025 21:34
@changpeng (Contributor Author)

Should not further spread use of the global options. FP casts have fast math flags; make use of those. Note this needs to be in custom lowering: legality cannot be use-context dependent.

I understand your proposal to avoid using global options. But here we need to be consistent: Options.UnsafeFPMath is used in all other places to expand/lower f64 -> f16. We should not use a different approach only here.

Or do you prefer a combination of fast math flag and global option, like this?
const SDNodeFlags Flags = Op->getFlags();
bool AllowInaccurateFP_ROUND =
Flags.hasApproximateFuncs() || DAG.getTarget().Options.UnsafeFPMath;

@arsenm (Contributor)

arsenm commented Apr 10, 2025

I understand your proposal to avoid using global options. But here we need to be consistent: Options.UnsafeFPMath is used in all other places to expand/lower f64 -> f16. We should not use a different approach only here.

This is because traditionally the FP casts did not have fast math flags. They do now, and we should migrate to only using them.

@arsenm (Contributor)

arsenm commented Apr 10, 2025

Or do you prefer a combination of fast math flag and global option, like this? const SDNodeFlags Flags = Op->getFlags(); bool AllowInaccurateFP_ROUND = Flags.hasApproximateFuncs() || DAG.getTarget().Options.UnsafeFPMath;

You can do that, but it still needs to be a contextual-use transformation, not a modification of the legality rules.

@changpeng (Contributor Author)

Or do you prefer a combination of fast math flag and global option, like this? const SDNodeFlags Flags = Op->getFlags(); bool AllowInaccurateFP_ROUND = Flags.hasApproximateFuncs() || DAG.getTarget().Options.UnsafeFPMath;

You can do that, but it still needs to be a contextual-use transformation, not a modification of the legality rules.

I do not fully understand what you meant by "modify the legality rules", and why we should not do so.

Here is the example in custom lowering for FP_TO_FP16:

if (getTargetMachine().Options.UnsafeFPMath) {
  // There is a generic expand for FP_TO_FP16 with unsafe fast math.
  return SDValue();
}

Do you mean this changes Custom to Expand, and that is not allowed? Thanks.

@changpeng (Contributor Author)

I understand your proposal to avoid using global options. But here we need to be consistent: Options.UnsafeFPMath is used in all other places to expand/lower f64 -> f16. We should not use a different approach only here.

This is because traditionally the FP casts did not have fast math flags. They do now, and we should migrate to only using them.
Do you mean the fptrunc instruction can have the "fast" flag now?
%res = fptrunc fast <2 x double> %src to <2 x half>

How do we check that the instruction has the "fast" flag?
Flags.hasApproximateFuncs() does not work when we have "fast". Thanks.
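
For reference, a sketch of how the afn bit can be queried on each path (these APIs exist in current LLVM; whether the bit is actually populated for FP_ROUND is exactly the problem discussed here):

  // SelectionDAG node: IR "fast" implies all individual flags, including
  // afn, but only if the flags were propagated onto this node.
  const SDNodeFlags Flags = Op->getFlags();
  bool Afn = Flags.hasApproximateFuncs();

  // GlobalISel / MIR instruction:
  bool AfnMI = MI.getFlag(MachineInstr::FmAfn);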

@changpeng (Contributor Author)

I understand your proposal to avoid using global options. But here we need to be consistent: Options.UnsafeFPMath is used in all other places to expand/lower f64 -> f16. We should not use a different approach only here.

This is because traditionally the FP casts did not have fast math flags. They do now, and we should migrate to only using them.

On the SelectionDAG path, FP_ROUND still does not have fast math flags. I still think adding fast math flags for FP_ROUND is beyond the scope of this patch. (We would also need to propagate the flags when we do transformations like splitting the vector.)


⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff HEAD~1 HEAD --extensions h,cpp -- llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h llvm/lib/Target/AMDGPU/SIISelLowering.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index ee3f5dd4e..bf46aef0c 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -1055,7 +1055,7 @@ AMDGPULegalizerInfo::AMDGPULegalizerInfo(const GCNSubtarget &ST_,
   auto &FPTruncActions = getActionDefinitionsBuilder(G_FPTRUNC);
   if (ST.hasCvtPkF16F32Inst()) {
     FPTruncActions.legalFor({{S32, S64}, {S16, S32}, {V2S16, V2S32}})
-                  .customFor({V2S16, V2S64});
+        .customFor({V2S16, V2S64});
   } else
     FPTruncActions.legalFor({{S32, S64}, {S16, S32}});
   FPTruncActions.scalarize(0).lower();
@@ -2750,8 +2750,8 @@ bool AMDGPULegalizerInfo::legalizeFPTrunc(LegalizerHelper &Helper,
   // still used here to be consistent, especially when the fast math flag is
   // not working for FP_ROUND on the SelectDAG path at this moment.
   MachineFunction &MF = Helper.MIRBuilder.getMF();
-  bool AllowInaccurateFPTRUNC = MI.getFlag(MachineInstr::FmAfn) ||
-                                MF.getTarget().Options.UnsafeFPMath;
+  bool AllowInaccurateFPTRUNC =
+      MI.getFlag(MachineInstr::FmAfn) || MF.getTarget().Options.UnsafeFPMath;
 
   if (AllowInaccurateFPTRUNC) {
     // Use the tablegen pattern to select native instructions.

@changpeng changpeng requested a review from arsenm April 11, 2025 20:52
@changpeng (Contributor Author)

Custom lowering of v2f64 -> v2f16 is now done on both the SelectionDAG and GlobalISel paths. The fast math flag for fp_round is not working at all and needs additional work beyond this patch.

  if (ST.hasCvtPkF16F32Inst()) {
    FPTruncActions.legalFor({{S32, S64}, {S16, S32}, {V2S16, V2S32}})
                  .customFor({V2S16, V2S64});
Contributor

It's suspect that we need custom handling for the v2 case, but not the regular scalar

Contributor Author

It's suspect that we need custom handling for the v2 case, but not the regular scalar

Previously only v2f16 was made Legal in order to generate v_cvt_pk_f16_f32, I think.

Comment on lines +2749 to +2766
  // TODO: We should only use fast math flag. But the global option is
  // still used here to be consistent, especially when the fast math flag is
  // not working for FP_ROUND on the SelectDAG path at this moment.
  MachineFunction &MF = Helper.MIRBuilder.getMF();
  bool AllowInaccurateFPTRUNC = MI.getFlag(MachineInstr::FmAfn) ||
                                MF.getTarget().Options.UnsafeFPMath;

  if (AllowInaccurateFPTRUNC) {
    // Use the tablegen pattern to select native instructions.
    return true;
  }

  Register DstReg = MI.getOperand(0).getReg();
  LLT DstTy = MRI.getType(DstReg);

  // Scalarize the vector and fall through to lower f64 -> f16.
  return Helper.fewerElementsVector(MI, 0, DstTy.getElementType()) ==
         LegalizerHelper::Legalized;
Contributor

This seems like it should be a combine and not a modification of legalization rules

Contributor Author

This seems like it should be a combine and not a modification of legalization rules

I am looking at an instcombine to generate v_cvt_pk_f16_f32 instructions. Should we implement it in AMDGPU since it is target dependent? Thanks.

@@ -0,0 +1,1748 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
Contributor

Test belongs in the existing fptrunc.f16.ll test

Contributor Author

Test belongs in the existing fptrunc.f16.ll test

The existing fptrunc.f16.ll only checks the case where unsafe fpmath is set. This patch actually fixes the codegen when unsafe fpmath is not set, and only for v2f16. Maybe we should not include the scalar and wider-vector tests in this file?

    // Use the tablegen pattern to select native instructions.
    return true;
  }

Contributor

This handling logic doesn't make sense. The decision to scalarize or not has nothing to do with the FP properties? The tablegen pattern also handles the build_vector case, which is the MIR I would expect to select from. Can we just drop the direct vector handling, and always scalarize?

The tablegen patterns should also be guarded with the fast checks if we're going to have them.

Contributor Author

This handling logic doesn't make sense. The decision to scalarize or not has nothing to do with the FP properties? The tablegen pattern also handles the build_vector case, which is the MIR I would expect to select from. Can we just drop the direct vector handling, and always scalarize?

The tablegen patterns should also be guarded with the fast checks if we're going to have them.

The logic generates instructions based on the pattern, but this is only correct when unsafe fpmath is set.
Yes, we may always scalarize. Should we do an instcombine to generate v_cvt_pk_f16_f32?
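
For illustration, the scalarized shape such a combine would have to recognize and re-pack looks roughly like this (illustrative IR, not the output of any particular pass):

  %lo = fptrunc afn double %a to half
  %hi = fptrunc afn double %b to half
  %v = insertelement <2 x half> poison, half %lo, i64 0
  %res = insertelement <2 x half> %v, half %hi, i64 1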

@@ -6893,6 +6893,19 @@ SDValue SITargetLowering::lowerFP_ROUND(SDValue Op, SelectionDAG &DAG) const {
    return Op;

  EVT DstVT = Op.getValueType();

  if (DstVT == MVT::v2f16) {
Contributor

Same comments as on the globalisel path. The vectoriness shouldn't matter; we don't need to directly consume the operation.

@changpeng (Contributor Author) commented Apr 17, 2025

Same comments as on the globalisel path. The vectoriness shouldn't matter; we don't need to directly consume the operation.

Okay, we can do as in #135659, which essentially scalarizes v2f16. We may do an instcombine to generate the vector cvt as long as the target supports it. Thanks.

changpeng added a commit to changpeng/llvm-project that referenced this pull request Apr 14, 2025
  On targets that support v_cvt_pk_f16_f32 instruction, if we make v2f64 -> v2f16
Legal, we will generate the following sequence of instructions:
  v_cvt_f32_f64_e32 v1, s[6:7]
  v_cvt_f32_f64_e32 v2, s[4:5]
  v_cvt_pk_f16_f32 v1, v2, v1
It possibly returns imprecise results due to double rounding. This patch fixes the issue
by not setting the conversion Legal. While we may still expect the above sequence of code
when unsafe fpmath is set, I hope llvm#134738 can
address that performance concern.

Fixes: SWDEV-523856
changpeng added a commit that referenced this pull request Apr 17, 2025
…135659)

llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Apr 17, 2025
…nversion (#135659)

searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Apr 24, 2025
…lvm#135659) (llvm#1791)

IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
…lvm#135659)

jrbyrnes pushed a commit to jrbyrnes/llvm-project that referenced this pull request May 27, 2025
…lvm#135659)
