AMDGPU: Add round-to-odd rounding during f64 to bf16 conversion #133995


Merged · 4 commits · Apr 5, 2025

Conversation

changpeng
Contributor

@changpeng changpeng commented Apr 1, 2025

f64 -> bf16 conversion can be lowered to f64 -> f32 followed by f32 -> bf16:
v_cvt_f32_f64_e32 v0, v[0:1]
v_cvt_pk_bf16_f32 v0, v0, s0
Both conversion instructions do round-to-nearest-even rounding, so we have a double-rounding problem that can produce incorrect results in some data ranges. We need to use round-to-odd rounding for the f64 -> f32 step to avoid double rounding.

NOTE: we have the same issue with f64 -> f16 conversion. Round-to-odd rounding for it will be added in a separate patch, which fixes SWDEV-523856.

This upstreams gfx950 change 9dd15535ec. In f64 -> bf16 conversion,
the following instruction sequence performs double rounding and generates
incorrect results in some data ranges:
   v_cvt_f32_f64_e32 v0, v[0:1]
   v_cvt_pk_bf16_f32 v0, v0, s0
We need round-to-odd rounding in the f64 -> f32 step.

Here is the description from commit 9dd15535ec to gfx950:

  In f64 to bf16 conversion, we do f64 -> f32, followed by f32 -> bf16.
We should do round-to-odd rounding in f64 -> f32, and then round-to-nearest-even
rounding in f32 -> bf16.
  Refer to llvm#82399 for details.
On hardware with f32 -> bf16 instructions, round-to-odd rounding was missing
in the f64 -> f32 step. This patch adds that rounding via custom lowering
of FP_ROUND.

NOTE: we have the same issue with f64 -> f16 conversion. Round-to-odd
rounding for it will be added in a separate patch, which fixes SWDEV-523856.

github-actions bot commented Apr 1, 2025

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff HEAD~1 HEAD --extensions cpp -- llvm/lib/Target/AMDGPU/SIISelLowering.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 21f8c7cfe..e8d6b76a4 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -6910,8 +6910,8 @@ SDValue SITargetLowering::lowerFP_ROUND(SDValue Op, SelectionDAG &DAG) const {
 
   // Round-inexact-to-odd f64 to f32, then do the final rounding using the
   // hardware f32 -> bf16 instruction.
-  EVT F32VT = SrcVT.isVector() ? SrcVT.changeVectorElementType(MVT::f32) :
-                                 MVT::f32;
+  EVT F32VT =
+      SrcVT.isVector() ? SrcVT.changeVectorElementType(MVT::f32) : MVT::f32;
   SDValue Rod = expandRoundInexactToOdd(F32VT, Src, DL, DAG);
   return DAG.getNode(ISD::FP_ROUND, DL, DstVT, Rod,
                      DAG.getTargetConstant(0, DL, MVT::i32));

Contributor

@arsenm arsenm left a comment


Can you clean up the description to refer to just the changes here, without referring to the out of tree commit context

// hardware f32 -> bf16 instruction.
EVT F32VT = SrcVT.isVector() ? SrcVT.changeVectorElementType(MVT::f32) :
MVT::f32;
SDValue Rod = expandRoundInexactToOdd(F32VT, Src, DL, DAG);
Contributor


Do we need to custom lower here at all? This seems like the default action. Can you just return the original node?

Contributor Author


No. If we return the original node, it will generate the two conversions and have the double rounding issue.
Also, if we setOperationAction to "Expand", it will do the Round-inexact-to-odd, but won't generate the desired v_cvt_pk_bf16_f32 instruction for f32->bf16.

Contributor


Then the default expansion is buggy? We shouldn't have unique bf16 legalization needs

Contributor Author


The default expansion for bf16 is not optimal, at least. I can take a look in a follow-up patch.

@changpeng
Contributor Author

Can you clean up the description to refer to just the changes here, without referring to the out of tree commit context

Okay. Thanks.

@llvmbot
Member

llvmbot commented Apr 2, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Changpeng Fang (changpeng)

Changes

f64 -> bf16 conversion can be lowered to f64 -> f32 followed by f32 -> bf16:
v_cvt_f32_f64_e32 v0, v[0:1]
v_cvt_pk_bf16_f32 v0, v0, s0
Both conversion instructions do round-to-nearest-even rounding, so we have a double-rounding problem that can produce incorrect results in some data ranges. We need to use round-to-odd rounding for the f64 -> f32 step to avoid double rounding.

NOTE: we have the same issue with f64 -> f16 conversion. Round-to-odd rounding for it will be added in a separate patch, which fixes SWDEV-523856.


Full diff: https://github.com/llvm/llvm-project/pull/133995.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+22-13)
  • (modified) llvm/test/CodeGen/AMDGPU/bf16-conversions.ll (+66-6)
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 96c113cc5d24c..e9aef744abcd9 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -911,8 +911,7 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
   setOperationAction(ISD::MUL, MVT::i1, Promote);
 
   if (Subtarget->hasBF16ConversionInsts()) {
-    setOperationAction(ISD::FP_ROUND, MVT::v2bf16, Legal);
-    setOperationAction(ISD::FP_ROUND, MVT::bf16, Legal);
+    setOperationAction(ISD::FP_ROUND, {MVT::bf16, MVT::v2bf16}, Custom);
     setOperationAction(ISD::BUILD_VECTOR, MVT::v2bf16, Legal);
   }
 
@@ -6888,23 +6887,33 @@ SDValue SITargetLowering::getFPExtOrFPRound(SelectionDAG &DAG, SDValue Op,
 }
 
 SDValue SITargetLowering::lowerFP_ROUND(SDValue Op, SelectionDAG &DAG) const {
-  assert(Op.getValueType() == MVT::f16 &&
-         "Do not know how to custom lower FP_ROUND for non-f16 type");
-
   SDValue Src = Op.getOperand(0);
   EVT SrcVT = Src.getValueType();
-  if (SrcVT != MVT::f64)
-    return Op;
-
-  // TODO: Handle strictfp
-  if (Op.getOpcode() != ISD::FP_ROUND)
+  if (SrcVT.getScalarType() != MVT::f64)
     return Op;
 
+  EVT DstVT = Op.getValueType();
   SDLoc DL(Op);
+  if (DstVT == MVT::f16) {
+    // TODO: Handle strictfp
+    if (Op.getOpcode() != ISD::FP_ROUND)
+      return Op;
+
+    SDValue FpToFp16 = DAG.getNode(ISD::FP_TO_FP16, DL, MVT::i32, Src);
+    SDValue Trunc = DAG.getNode(ISD::TRUNCATE, DL, MVT::i16, FpToFp16);
+    return DAG.getNode(ISD::BITCAST, DL, MVT::f16, Trunc);
+  }
+
+  assert(DstVT.getScalarType() == MVT::bf16 &&
+          "custom lower FP_ROUND for f16 or bf16");
+  assert(Subtarget->hasBF16ConversionInsts() && "f32 -> bf16 is legal");
 
-  SDValue FpToFp16 = DAG.getNode(ISD::FP_TO_FP16, DL, MVT::i32, Src);
-  SDValue Trunc = DAG.getNode(ISD::TRUNCATE, DL, MVT::i16, FpToFp16);
-  return DAG.getNode(ISD::BITCAST, DL, MVT::f16, Trunc);
+  // Round-inexact-to-odd f64 to f32, then do the final rounding using the
+  // hardware f32 -> bf16 instruction.
+  EVT F32VT = SrcVT.isVector() ? SrcVT.changeVectorElementType(MVT::f32) :
+                                 MVT::f32;
+  SDValue Rod = expandRoundInexactToOdd(F32VT, Src, DL, DAG);
+  return getFPExtOrFPRound(DAG, Rod, DL, DstVT);
 }
 
 SDValue SITargetLowering::lowerFMINNUM_FMAXNUM(SDValue Op,
diff --git a/llvm/test/CodeGen/AMDGPU/bf16-conversions.ll b/llvm/test/CodeGen/AMDGPU/bf16-conversions.ll
index 4c01e583713a7..3be911ab9e7f4 100644
--- a/llvm/test/CodeGen/AMDGPU/bf16-conversions.ll
+++ b/llvm/test/CodeGen/AMDGPU/bf16-conversions.ll
@@ -153,9 +153,34 @@ define amdgpu_ps float @v_test_cvt_v2f64_v2bf16_v(<2 x double> %src) {
 ;
 ; GFX-950-LABEL: v_test_cvt_v2f64_v2bf16_v:
 ; GFX-950:       ; %bb.0:
-; GFX-950-NEXT:    v_cvt_f32_f64_e32 v2, v[2:3]
-; GFX-950-NEXT:    v_cvt_f32_f64_e32 v0, v[0:1]
-; GFX-950-NEXT:    v_cvt_pk_bf16_f32 v0, v0, v2
+; GFX-950-NEXT:    v_mov_b32_e32 v4, v3
+; GFX-950-NEXT:    v_and_b32_e32 v3, 0x7fffffff, v4
+; GFX-950-NEXT:    v_mov_b32_e32 v5, v1
+; GFX-950-NEXT:    v_cvt_f32_f64_e32 v1, v[2:3]
+; GFX-950-NEXT:    v_cvt_f64_f32_e32 v[6:7], v1
+; GFX-950-NEXT:    v_and_b32_e32 v8, 1, v1
+; GFX-950-NEXT:    v_cmp_gt_f64_e64 s[2:3], v[2:3], v[6:7]
+; GFX-950-NEXT:    v_cmp_nlg_f64_e32 vcc, v[2:3], v[6:7]
+; GFX-950-NEXT:    v_cmp_eq_u32_e64 s[0:1], 1, v8
+; GFX-950-NEXT:    v_cndmask_b32_e64 v2, -1, 1, s[2:3]
+; GFX-950-NEXT:    v_add_u32_e32 v2, v1, v2
+; GFX-950-NEXT:    s_or_b64 vcc, vcc, s[0:1]
+; GFX-950-NEXT:    v_cndmask_b32_e32 v1, v2, v1, vcc
+; GFX-950-NEXT:    s_brev_b32 s4, 1
+; GFX-950-NEXT:    v_and_or_b32 v4, v4, s4, v1
+; GFX-950-NEXT:    v_and_b32_e32 v1, 0x7fffffff, v5
+; GFX-950-NEXT:    v_cvt_f32_f64_e32 v6, v[0:1]
+; GFX-950-NEXT:    v_cvt_f64_f32_e32 v[2:3], v6
+; GFX-950-NEXT:    v_and_b32_e32 v7, 1, v6
+; GFX-950-NEXT:    v_cmp_gt_f64_e64 s[2:3], v[0:1], v[2:3]
+; GFX-950-NEXT:    v_cmp_nlg_f64_e32 vcc, v[0:1], v[2:3]
+; GFX-950-NEXT:    v_cmp_eq_u32_e64 s[0:1], 1, v7
+; GFX-950-NEXT:    v_cndmask_b32_e64 v0, -1, 1, s[2:3]
+; GFX-950-NEXT:    v_add_u32_e32 v0, v6, v0
+; GFX-950-NEXT:    s_or_b64 vcc, vcc, s[0:1]
+; GFX-950-NEXT:    v_cndmask_b32_e32 v0, v0, v6, vcc
+; GFX-950-NEXT:    v_and_or_b32 v0, v5, s4, v0
+; GFX-950-NEXT:    v_cvt_pk_bf16_f32 v0, v0, v4
 ; GFX-950-NEXT:    ; return to shader part epilog
   %res = fptrunc <2 x double> %src to <2 x bfloat>
   %cast = bitcast <2 x bfloat> %res to float
@@ -347,7 +372,18 @@ define amdgpu_ps void @fptrunc_f64_to_bf16(double %a, ptr %out) {
 ;
 ; GFX-950-LABEL: fptrunc_f64_to_bf16:
 ; GFX-950:       ; %bb.0: ; %entry
-; GFX-950-NEXT:    v_cvt_f32_f64_e32 v0, v[0:1]
+; GFX-950-NEXT:    v_cvt_f32_f64_e64 v6, |v[0:1]|
+; GFX-950-NEXT:    v_cvt_f64_f32_e32 v[4:5], v6
+; GFX-950-NEXT:    v_and_b32_e32 v7, 1, v6
+; GFX-950-NEXT:    v_cmp_gt_f64_e64 s[2:3], |v[0:1]|, v[4:5]
+; GFX-950-NEXT:    v_cmp_nlg_f64_e64 s[0:1], |v[0:1]|, v[4:5]
+; GFX-950-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v7
+; GFX-950-NEXT:    v_cndmask_b32_e64 v0, -1, 1, s[2:3]
+; GFX-950-NEXT:    v_add_u32_e32 v0, v6, v0
+; GFX-950-NEXT:    s_or_b64 vcc, s[0:1], vcc
+; GFX-950-NEXT:    v_cndmask_b32_e32 v0, v0, v6, vcc
+; GFX-950-NEXT:    s_brev_b32 s0, 1
+; GFX-950-NEXT:    v_and_or_b32 v0, v1, s0, v0
 ; GFX-950-NEXT:    v_cvt_pk_bf16_f32 v0, v0, s0
 ; GFX-950-NEXT:    flat_store_short v[2:3], v0
 ; GFX-950-NEXT:    s_endpgm
@@ -385,7 +421,19 @@ define amdgpu_ps void @fptrunc_f64_to_bf16_neg(double %a, ptr %out) {
 ;
 ; GFX-950-LABEL: fptrunc_f64_to_bf16_neg:
 ; GFX-950:       ; %bb.0: ; %entry
-; GFX-950-NEXT:    v_cvt_f32_f64_e64 v0, -v[0:1]
+; GFX-950-NEXT:    v_cvt_f32_f64_e64 v7, |v[0:1]|
+; GFX-950-NEXT:    v_cvt_f64_f32_e32 v[4:5], v7
+; GFX-950-NEXT:    v_and_b32_e32 v8, 1, v7
+; GFX-950-NEXT:    v_cmp_gt_f64_e64 s[2:3], |v[0:1]|, v[4:5]
+; GFX-950-NEXT:    v_cmp_nlg_f64_e64 s[0:1], |v[0:1]|, v[4:5]
+; GFX-950-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v8
+; GFX-950-NEXT:    v_cndmask_b32_e64 v0, -1, 1, s[2:3]
+; GFX-950-NEXT:    v_add_u32_e32 v0, v7, v0
+; GFX-950-NEXT:    s_or_b64 vcc, s[0:1], vcc
+; GFX-950-NEXT:    s_brev_b32 s4, 1
+; GFX-950-NEXT:    v_xor_b32_e32 v6, 0x80000000, v1
+; GFX-950-NEXT:    v_cndmask_b32_e32 v0, v0, v7, vcc
+; GFX-950-NEXT:    v_and_or_b32 v0, v6, s4, v0
 ; GFX-950-NEXT:    v_cvt_pk_bf16_f32 v0, v0, s0
 ; GFX-950-NEXT:    flat_store_short v[2:3], v0
 ; GFX-950-NEXT:    s_endpgm
@@ -424,7 +472,19 @@ define amdgpu_ps void @fptrunc_f64_to_bf16_abs(double %a, ptr %out) {
 ;
 ; GFX-950-LABEL: fptrunc_f64_to_bf16_abs:
 ; GFX-950:       ; %bb.0: ; %entry
-; GFX-950-NEXT:    v_cvt_f32_f64_e64 v0, |v[0:1]|
+; GFX-950-NEXT:    v_cvt_f32_f64_e64 v7, |v[0:1]|
+; GFX-950-NEXT:    v_cvt_f64_f32_e32 v[4:5], v7
+; GFX-950-NEXT:    v_and_b32_e32 v8, 1, v7
+; GFX-950-NEXT:    v_cmp_gt_f64_e64 s[2:3], |v[0:1]|, v[4:5]
+; GFX-950-NEXT:    v_cmp_nlg_f64_e64 s[0:1], |v[0:1]|, v[4:5]
+; GFX-950-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v8
+; GFX-950-NEXT:    v_cndmask_b32_e64 v0, -1, 1, s[2:3]
+; GFX-950-NEXT:    v_add_u32_e32 v0, v7, v0
+; GFX-950-NEXT:    s_or_b64 vcc, s[0:1], vcc
+; GFX-950-NEXT:    v_and_b32_e32 v6, 0x7fffffff, v1
+; GFX-950-NEXT:    v_cndmask_b32_e32 v0, v0, v7, vcc
+; GFX-950-NEXT:    s_brev_b32 s0, 1
+; GFX-950-NEXT:    v_and_or_b32 v0, v6, s0, v0
 ; GFX-950-NEXT:    v_cvt_pk_bf16_f32 v0, v0, s0
 ; GFX-950-NEXT:    flat_store_short v[2:3], v0
 ; GFX-950-NEXT:    s_endpgm

@jayfoad
Contributor

jayfoad commented Apr 2, 2025

Should also delete the incorrect isel patterns in VOP3Instructions.td?

  def : GCNPat<(v2bf16 (bf16_fpround v2f64:$src)),
               (V_CVT_PK_BF16_F32_e64 0, (V_CVT_F32_F64_e64 0, (EXTRACT_SUBREG VReg_128:$src, sub0_sub1)),
                                      0, (V_CVT_F32_F64_e64 0, (EXTRACT_SUBREG VReg_128:$src, sub2_sub3)))>;
...
  def : GCNPat<(bf16 (bf16_fpround (f64 (VOP3Mods f64:$src0, i32:$src0_modifiers)))),
               (V_CVT_PK_BF16_F32_e64 0, (f32 (V_CVT_F32_F64_e64 $src0_modifiers, $src0)), 0, (f32 (IMPLICIT_DEF)))>;

@changpeng
Contributor Author

changpeng commented Apr 2, 2025

GCNPat<(v2bf16 (bf16_fpround v2f64:$src))

Right, we should remove these incorrect patterns. The work is done.

@changpeng
Contributor Author

Ping. Are there any further suggestions or comments?
This is the patch from the downstream branches, and the logic is needed to fix the double rounding issue.
Thanks.

Contributor

@arsenm arsenm left a comment


LGTM but we shouldn't really need anything custom here

@changpeng changpeng merged commit f47034c into llvm:main Apr 5, 2025
10 of 11 checks passed
@llvm-ci
Collaborator

llvm-ci commented Apr 5, 2025

LLVM Buildbot has detected a new failure on builder lldb-x86_64-debian running on lldb-x86_64-debian while building llvm at step 6 "test".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/162/builds/19592

Here is the relevant piece of the build log for reference:
Step 6 (test) failure: build (failure)
...
PASS: lldb-api :: python_api/formatters/TestFormattersSBAPI.py (308 of 2823)
PASS: lldb-api :: types/TestFloatTypesExpr.py (309 of 2823)
PASS: lldb-api :: lang/cpp/unique-types3/TestUniqueTypes3.py (310 of 2823)
PASS: lldb-api :: functionalities/vtable/TestVTableValue.py (311 of 2823)
PASS: lldb-api :: tools/lldb-dap/breakpoint/TestDAP_setExceptionBreakpoints.py (312 of 2823)
PASS: lldb-api :: functionalities/thread/thread_exit/TestThreadExit.py (313 of 2823)
PASS: lldb-shell :: Minidump/fb-dump.test (314 of 2823)
PASS: lldb-api :: lang/cpp/preferred_name/TestPreferredName.py (315 of 2823)
PASS: lldb-api :: commands/expression/call-function/TestCallStopAndContinue.py (316 of 2823)
UNRESOLVED: lldb-api :: driver/batch_mode/TestBatchMode.py (317 of 2823)
******************** TEST 'lldb-api :: driver/batch_mode/TestBatchMode.py' FAILED ********************
Script:
--
/usr/bin/python3 /home/worker/2.0.1/lldb-x86_64-debian/llvm-project/lldb/test/API/dotest.py -u CXXFLAGS -u CFLAGS --env LLVM_LIBS_DIR=/home/worker/2.0.1/lldb-x86_64-debian/build/./lib --env LLVM_INCLUDE_DIR=/home/worker/2.0.1/lldb-x86_64-debian/build/include --env LLVM_TOOLS_DIR=/home/worker/2.0.1/lldb-x86_64-debian/build/./bin --arch x86_64 --build-dir /home/worker/2.0.1/lldb-x86_64-debian/build/lldb-test-build.noindex --lldb-module-cache-dir /home/worker/2.0.1/lldb-x86_64-debian/build/lldb-test-build.noindex/module-cache-lldb/lldb-api --clang-module-cache-dir /home/worker/2.0.1/lldb-x86_64-debian/build/lldb-test-build.noindex/module-cache-clang/lldb-api --executable /home/worker/2.0.1/lldb-x86_64-debian/build/./bin/lldb --compiler /home/worker/2.0.1/lldb-x86_64-debian/build/./bin/clang --dsymutil /home/worker/2.0.1/lldb-x86_64-debian/build/./bin/dsymutil --make /usr/bin/gmake --llvm-tools-dir /home/worker/2.0.1/lldb-x86_64-debian/build/./bin --lldb-obj-root /home/worker/2.0.1/lldb-x86_64-debian/build/tools/lldb --lldb-libs-dir /home/worker/2.0.1/lldb-x86_64-debian/build/./lib -t /home/worker/2.0.1/lldb-x86_64-debian/llvm-project/lldb/test/API/driver/batch_mode -p TestBatchMode.py
--
Exit Code: 1

Command Output (stdout):
--
lldb version 21.0.0git (https://github.com/llvm/llvm-project.git revision f47034cbe5c02b748742c733cf453b3b907687e5)
  clang revision f47034cbe5c02b748742c733cf453b3b907687e5
  llvm revision f47034cbe5c02b748742c733cf453b3b907687e5
(lldb) settings clear -all
(lldb) settings set symbols.enable-external-lookup false
(lldb) settings set target.inherit-tcc true
(lldb) settings set target.disable-aslr false
(lldb) settings set target.detach-on-error false
(lldb) settings set target.auto-apply-fixits false
(lldb) settings set plugin.process.gdb-remote.packet-timeout 60
(lldb) settings set symbols.clang-modules-cache-path "/home/worker/2.0.1/lldb-x86_64-debian/build/lldb-test-build.noindex/module-cache-lldb/lldb-api"
(lldb) settings set use-color false
(lldb) settings set show-statusline false
(lldb) target create "/home/worker/2.0.1/lldb-x86_64-debian/build/lldb-test-build.noindex/driver/batch_mode/TestBatchMode.test_batch_mode_attach_exit/a.out"
Current executable set to '/home/worker/2.0.1/lldb-x86_64-debian/build/lldb-test-build.noindex/driver/batch_mode/TestBatchMode.test_batch_mode_attach_exit/a.out' (x86_64).
(lldb) process attach -p 1733134
Process 1733134 stopped
* thread #1, name = 'a.out', stop reason = signal SIGSTOP
    frame #0: 0x00007efc3faee443 libc.so.6`clock_nanosleep + 35
libc.so.6`clock_nanosleep:
->  0x7efc3faee443 <+35>: negl   %eax
    0x7efc3faee445 <+37>: retq   
    0x7efc3faee446 <+38>: nopw   %cs:(%rax,%rax)
    0x7efc3faee450 <+48>: subq   $0x28, %rsp
(lldb) breakpoint set --file 'main.c' -p 'Stop here to unset keep_waiting' -N keep_waiting
Breakpoint 1: where = a.out`main + 221 at main.c:29:13, address = 0x0000556cbc76925d
(lldb) continue
Process 1733134 resuming

@changpeng
Contributor Author

LGTM but we shouldn't really need anything custom here

Thanks. It seems we have a lot of work to do with f64->f16 (Custom) and f64->bf16 (Expand): we do not actually take advantage of the existing f32->f16 (or f32->bf16 for gfx950+) hardware instructions, and we generate a much longer sequence of code (without the unsafe fast-math flag set).

@changpeng
Contributor Author

GCNPat<(v2bf16 (bf16_fpround v2f64:$src))

Right, we should remove these incorrect patterns. The work is done

Just realized that we may still need these patterns for unsafe fast math. I will take a further look and may put them back. Thanks.
