[NVPTX] Add float to tf32 conversion intrinsics #121507

Merged: durga4github merged 1 commit into llvm:main from durgadossr/nvptx_cvt_tf32 on Jan 13, 2025

@durga4github durga4github commented Jan 2, 2025

This patch adds the missing variants of float to tf32 conversion
intrinsics, with their corresponding lit tests.

PTX Spec link:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt

llvmbot commented Jan 2, 2025

@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-backend-nvptx

Author: Durgadoss R (durga4github)

Changes

This patch adds an intrinsic to convert float to tf32.

  • This intrinsic uses flags for rounding, saturation modes, and relu. The backend looks through these flags and lowers them to the appropriate instruction.
  • Docs have been updated to describe the usage of flag arguments.
  • Lit tests are added for all the combinations.
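
As a rough illustration of how the three flags combine, the standalone sketch below mirrors the checks in `SelectCvtFloatToTF32` and the suffix printing in `NVPTXInstPrinter` from the diff in this PR. The enum encodings are copied from the `NVVMIntrinsicFlags.h` hunk; the helper names `checkCvtFlags` and `cvtMnemonic` are hypothetical and exist only for this example, and the code does not depend on LLVM.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encodings copied from the NVVMIntrinsicFlags.h hunk of this patch.
enum class FPRoundingMode : uint8_t {
  NONE = 0,
  ROUND_RZ = 1,
  ROUND_RN = 2,
  ROUND_RP = 3,
  ROUND_RM = 4,
  ROUND_RNA = 5,
};
enum class SaturationMode : uint8_t { NONE = 0, SATFINITE = 1 };

// Hypothetical helper (not an LLVM API): returns an empty string when the
// flag combination passes the same checks as SelectCvtFloatToTF32, and a
// diagnostic message otherwise.
std::string checkCvtFlags(FPRoundingMode Rnd, SaturationMode Sat, bool Relu,
                          unsigned SmVersion, unsigned PTXVersion) {
  if (SmVersion < 80 || PTXVersion < 70)
    return "TF32 destination format requires at least sm80";
  const bool IsSatFinite = Sat == SaturationMode::SATFINITE;
  if (IsSatFinite && PTXVersion < 81)
    return "satfinite modifier requires PTX version 8.1 or higher";
  switch (Rnd) {
  case FPRoundingMode::ROUND_RNA:
    return Relu ? "relu not supported with rna rounding mode" : "";
  case FPRoundingMode::ROUND_RN:
  case FPRoundingMode::ROUND_RZ:
    if (SmVersion < 90)
      return "rn/rz rounding modes require at least sm90";
    if (IsSatFinite)
      return "satfinite not supported with rn/rz rounding modes";
    return "";
  default:
    return "Invalid FP rounding mode";
  }
}

// Hypothetical helper: builds the PTX mnemonic by appending the suffixes in
// the same order as the instruction's asm string (rounding, saturation, relu).
std::string cvtMnemonic(FPRoundingMode Rnd, SaturationMode Sat, bool Relu) {
  std::string S = "cvt";
  switch (Rnd) {
  case FPRoundingMode::NONE:      break;
  case FPRoundingMode::ROUND_RZ:  S += ".rz";  break;
  case FPRoundingMode::ROUND_RN:  S += ".rn";  break;
  case FPRoundingMode::ROUND_RP:  S += ".rp";  break;
  case FPRoundingMode::ROUND_RM:  S += ".rm";  break;
  case FPRoundingMode::ROUND_RNA: S += ".rna"; break;
  default: assert(false && "unknown rounding mode");
  }
  if (Sat == SaturationMode::SATFINITE)
    S += ".satfinite";
  if (Relu)
    S += ".relu";
  return S + ".tf32.f32";
}
```

For instance, the flag triple `(i8 5, i8 1, i1 0)` used in the `convert-sm89.ll` test corresponds to `cvtMnemonic(FPRoundingMode::ROUND_RNA, SaturationMode::SATFINITE, false)`, which yields `cvt.rna.satfinite.tf32.f32`. Unlike this sketch, the backend reports invalid combinations through `report_fatal_error` rather than returning a message.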

TODO: We already have an intrinsic 'llvm.nvvm.f2tf32.rna' which caters only to one variant of the PTX instruction. Once this change lands, I will submit a follow-up PR to auto-upgrade it to use the generic variant.

PTX Spec link:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt


Full diff: https://github.com/llvm/llvm-project/pull/121507.diff

12 Files Affected:

  • (modified) llvm/docs/NVPTXUsage.rst (+60)
  • (modified) llvm/include/llvm/IR/IntrinsicsNVVM.td (+9)
  • (modified) llvm/include/llvm/IR/NVVMIntrinsicFlags.h (+16)
  • (modified) llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.cpp (+53)
  • (modified) llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.h (+6)
  • (modified) llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp (+46)
  • (modified) llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h (+1)
  • (modified) llvm/lib/Target/NVPTX/NVPTXIntrinsics.td (+16)
  • (modified) llvm/lib/Target/NVPTX/NVPTXSubtarget.h (+1)
  • (modified) llvm/test/CodeGen/NVPTX/convert-sm80.ll (+17)
  • (modified) llvm/test/CodeGen/NVPTX/convert-sm89.ll (+9)
  • (added) llvm/test/CodeGen/NVPTX/convert-sm90.ll (+65)
diff --git a/llvm/docs/NVPTXUsage.rst b/llvm/docs/NVPTXUsage.rst
index 313e84f3722a95..f6d5d27b8850c9 100644
--- a/llvm/docs/NVPTXUsage.rst
+++ b/llvm/docs/NVPTXUsage.rst
@@ -462,6 +462,66 @@ to left-shift the found bit into the most-significant bit position, otherwise
 the result is the shift amount needed to right-shift the found bit into the
 least-significant bit position. 0xffffffff is returned if no 1 bit is found.
 
+Conversion Intrinsics (for cvt.* PTX instructions)
+--------------------------------------------------
+
+'``llvm.nvvm.cvt.float.to.tf32``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare i32 @llvm.nvvm.cvt.float.to.tf32(float %f1, i8 %flag_fp_rnd_mode, i8 %flag_sat_mode, i1 %flag_relu)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.cvt.float.to.tf32``' intrinsic lowers to
+the ``cvt.*.tf32.f32`` set of PTX instructions.
+
+* The first argument is the input float to be converted to TF32.
+  This is followed by three flag arguments encoding the rounding mode,
+  saturation mode, and the relu modifier respectively.
+
+* The second argument (denoted by ``i8 %flag_fp_rnd_mode``) denotes
+  the floating-point rounding modes supported for this instruction.
+  This must be a compile-time constant and the encoding is as below:
+
+  ========== ==============
+  Enum Value  Rounding Mode
+  ========== ==============
+  ``0``       NONE
+  ``1``       ROUND_RZ
+  ``2``       ROUND_RN
+  ``3``       ROUND_RP
+  ``4``       ROUND_RM
+  ``5``       ROUND_RNA
+  ========== ==============
+
+  The valid rounding modes are ``RNA, RN and RZ``.
+
+* The third argument (denoted by ``i8 %flag_sat_mode``) denotes the
+  saturation modifier for this intrinsic. As of now, it can either
+  be None or Satfinite, according to the enumeration below:
+
+  ========== ================
+  Enum Value  Saturation Mode
+  ========== ================
+  ``0``       NONE
+  ``1``       SATFINITE
+  ========== ================
+
+* The last argument (denoted by ``i1 %flag_relu``) when set, generates
+  the ``.relu`` variant of the instruction.
+
+* Invalid values for the compile-time flag arguments may lead
+  to error(s) during Codegen.
+
+For more information, refer to the PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cvt>`_.
+
 TMA family of Intrinsics
 ------------------------
 
diff --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td
index fd07d131ce15b2..870378bda44b0a 100644
--- a/llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -1466,6 +1466,15 @@ let TargetPrefix = "nvvm" in {
   def int_nvvm_e5m2x2_to_f16x2_rn_relu : ClangBuiltin<"__nvvm_e5m2x2_to_f16x2_rn_relu">,
       Intrinsic<[llvm_v2f16_ty], [llvm_i16_ty], [IntrNoMem, IntrNoCallback]>;
 
+// Convert Float to TF32
+def int_nvvm_cvt_float_to_tf32 : Intrinsic<[llvm_i32_ty],
+    [llvm_float_ty, // Input float
+     llvm_i8_ty,    // Flag for Rounding Modes
+     llvm_i8_ty,    // Flag for Saturation Modes
+     llvm_i1_ty],   // Flag for relu
+    [IntrNoMem, IntrNoCallback,
+     ImmArg<ArgIndex<1>>, ImmArg<ArgIndex<2>>, ImmArg<ArgIndex<3>>]>;
+
 // FNS
 
   def int_nvvm_fns : ClangBuiltin<"__nvvm_fns">,
diff --git a/llvm/include/llvm/IR/NVVMIntrinsicFlags.h b/llvm/include/llvm/IR/NVVMIntrinsicFlags.h
index dfb6e857b3a6ad..3dfa58313e3b60 100644
--- a/llvm/include/llvm/IR/NVVMIntrinsicFlags.h
+++ b/llvm/include/llvm/IR/NVVMIntrinsicFlags.h
@@ -34,6 +34,22 @@ enum class TMAReductionOp : uint8_t {
   XOR = 7,
 };
 
+// Rounding Modes for floating point types
+enum class FPRoundingMode : uint8_t {
+  NONE = 0,
+  ROUND_RZ = 1,  // roundTowardZero
+  ROUND_RN = 2,  // roundToNearest-TiesToEven
+  ROUND_RP = 3,  // roundTowardPositiveInf
+  ROUND_RM = 4,  // roundTowardNegativeInf
+  ROUND_RNA = 5, // roundToNearest-TiesAwayFromZero
+};
+
+// Saturation Modes
+enum class SaturationMode : uint8_t {
+  NONE = 0,
+  SATFINITE = 1,
+};
+
 } // namespace nvvm
 } // namespace llvm
 #endif // LLVM_IR_NVVMINTRINSICFLAGS_H
diff --git a/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.cpp b/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.cpp
index 65e1893d3f3bdf..06dc60da9e6462 100644
--- a/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.cpp
+++ b/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.cpp
@@ -453,3 +453,56 @@ void NVPTXInstPrinter::printTmaReductionMode(const MCInst *MI, int OpNum,
   llvm_unreachable(
       "Invalid Reduction Op in printCpAsyncBulkTensorReductionMode");
 }
+
+void NVPTXInstPrinter::printFPRoundingMode(const MCInst *MI, int OpNum,
+                                           raw_ostream &O,
+                                           const char *Modifier) {
+  const MCOperand &MO = MI->getOperand(OpNum);
+  using Mode = nvvm::FPRoundingMode;
+
+  switch (static_cast<Mode>(MO.getImm())) {
+  case Mode::NONE:
+    O << "";
+    return;
+  case Mode::ROUND_RN:
+    O << ".rn";
+    return;
+  case Mode::ROUND_RNA:
+    O << ".rna";
+    return;
+  case Mode::ROUND_RZ:
+    O << ".rz";
+    return;
+  case Mode::ROUND_RP:
+    O << ".rp";
+    return;
+  case Mode::ROUND_RM:
+    O << ".rm";
+    return;
+  }
+  llvm_unreachable("Invalid mode in printFPRoundingMode");
+}
+
+void NVPTXInstPrinter::printSaturationMode(const MCInst *MI, int OpNum,
+                                           raw_ostream &O,
+                                           const char *Modifier) {
+  const MCOperand &MO = MI->getOperand(OpNum);
+  using Mode = nvvm::SaturationMode;
+
+  switch (static_cast<Mode>(MO.getImm())) {
+  case Mode::NONE:
+    O << "";
+    return;
+  case Mode::SATFINITE:
+    O << ".satfinite";
+    return;
+  }
+  llvm_unreachable("Invalid mode in printSaturationMode");
+}
+
+void NVPTXInstPrinter::printReluModifier(const MCInst *MI, int OpNum,
+                                         raw_ostream &O, const char *Modifier) {
+  const MCOperand &MO = MI->getOperand(OpNum);
+  if (MO.getImm())
+    O << ".relu";
+}
diff --git a/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.h b/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.h
index 2b19386ef17fe5..7c3be27751ca14 100644
--- a/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.h
+++ b/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.h
@@ -56,6 +56,12 @@ class NVPTXInstPrinter : public MCInstPrinter {
                      const char *Modifier = nullptr);
   void printTmaReductionMode(const MCInst *MI, int OpNum, raw_ostream &O,
                              const char *Modifier = nullptr);
+  void printFPRoundingMode(const MCInst *MI, int OpNum, raw_ostream &O,
+                           const char *Modifier = nullptr);
+  void printSaturationMode(const MCInst *MI, int OpNum, raw_ostream &O,
+                           const char *Modifier = nullptr);
+  void printReluModifier(const MCInst *MI, int OpNum, raw_ostream &O,
+                         const char *Modifier = nullptr);
 };
 
 }
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
index c51729e224bf54..82ac658c4a4570 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
@@ -728,7 +728,53 @@ bool NVPTXDAGToDAGISel::tryIntrinsicNoChain(SDNode *N) {
   case Intrinsic::nvvm_texsurf_handle_internal:
     SelectTexSurfHandle(N);
     return true;
+  case Intrinsic::nvvm_cvt_float_to_tf32:
+    SelectCvtFloatToTF32(N);
+    return true;
+  }
+}
+
+void NVPTXDAGToDAGISel::SelectCvtFloatToTF32(SDNode *N) {
+  // 0 - IID
+  // 1 - Input Float
+  // 2 - Rounding Mode
+  // 3 - Saturation Mode
+  // 4 - Relu Flag
+  uint64_t Rnd = N->getConstantOperandVal(2);
+  uint64_t Sat = N->getConstantOperandVal(3);
+  bool IsRelu = N->getConstantOperandVal(4) == 1;
+
+  if (!Subtarget->hasTF32Math())
+    report_fatal_error("TF32 destination format requires at least sm80");
+
+  using SatMode = nvvm::SaturationMode;
+  bool IsSatFinite = static_cast<SatMode>(Sat) == SatMode::SATFINITE;
+  if (IsSatFinite && Subtarget->getPTXVersion() < 81)
+    report_fatal_error("satfinite modifier requires PTX version 8.1 or higher");
+
+  using RndMode = nvvm::FPRoundingMode;
+  switch (static_cast<RndMode>(Rnd)) {
+  case RndMode::ROUND_RNA:
+    if (IsRelu)
+      report_fatal_error("relu not supported with rna rounding mode");
+    break;
+  case RndMode::ROUND_RN:
+  case RndMode::ROUND_RZ: {
+    if (Subtarget->getSmVersion() < 90)
+      report_fatal_error("rn/rz rounding modes require at least sm90");
+    if (IsSatFinite)
+      report_fatal_error("satfinite not supported with rn/rz rounding modes");
+    break;
+  }
+  default:
+    report_fatal_error("Invalid FP rounding mode in SelectCvtFloatToTF32");
   }
+
+  SDLoc DL(N);
+  SDValue Ops[] = {N->getOperand(1), getI32Imm(Rnd, DL), getI32Imm(Sat, DL),
+                   getI32Imm(IsRelu, DL)};
+  ReplaceNode(N, CurDAG->getMachineNode(NVPTX::cvt_float_to_tf32, DL,
+                                        N->getVTList(), Ops));
 }
 
 void NVPTXDAGToDAGISel::SelectTexSurfHandle(SDNode *N) {
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
index c307f28fcc6c0a..3e22ef5bab9931 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
@@ -73,6 +73,7 @@ class LLVM_LIBRARY_VISIBILITY NVPTXDAGToDAGISel : public SelectionDAGISel {
   bool tryIntrinsicChain(SDNode *N);
   bool tryIntrinsicVoid(SDNode *N);
   void SelectTexSurfHandle(SDNode *N);
+  void SelectCvtFloatToTF32(SDNode *N);
   bool tryLoad(SDNode *N);
   bool tryLoadVector(SDNode *N);
   bool tryLDGLDU(SDNode *N);
diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index 8ede1ec4f20dc9..3274c1ef4260db 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -1802,6 +1802,22 @@ def : Pat<(int_nvvm_e5m2x2_to_f16x2_rn Int16Regs:$a),
 def : Pat<(int_nvvm_e5m2x2_to_f16x2_rn_relu Int16Regs:$a),
           (CVT_f16x2_e5m2x2 $a, CvtRN_RELU)>;
 
+def FPRoundingMode : Operand<i32> {
+  let PrintMethod = "printFPRoundingMode";
+}
+
+def SatMode : Operand<i32> {
+  let PrintMethod = "printSaturationMode";
+}
+
+def ReluFlag : Operand<i32> {
+  let PrintMethod = "printReluModifier";
+}
+
+def cvt_float_to_tf32 : NVPTXInst<(outs Int32Regs:$dest),
+    (ins Float32Regs:$a, FPRoundingMode:$rnd, SatMode:$sat, ReluFlag:$relu),
+    "cvt${rnd:rnd}${sat:sat}${relu:relu}.tf32.f32 \t$dest, $a;", []>;
+
 //
 // FNS
 //
diff --git a/llvm/lib/Target/NVPTX/NVPTXSubtarget.h b/llvm/lib/Target/NVPTX/NVPTXSubtarget.h
index 7555a2368ec963..9f0b437bd32dc5 100644
--- a/llvm/lib/Target/NVPTX/NVPTXSubtarget.h
+++ b/llvm/lib/Target/NVPTX/NVPTXSubtarget.h
@@ -83,6 +83,7 @@ class NVPTXSubtarget : public NVPTXGenSubtargetInfo {
   bool hasFP16Math() const { return SmVersion >= 53; }
   bool hasBF16Math() const { return SmVersion >= 80; }
   bool allowFP16Math() const;
+  bool hasTF32Math() const { return SmVersion >= 80 && PTXVersion >= 70; }
   bool hasMaskOperator() const { return PTXVersion >= 71; }
   bool hasNoReturn() const { return SmVersion >= 30 && PTXVersion >= 64; }
   // Does SM & PTX support memory orderings (weak and atomic: relaxed, acquire,
diff --git a/llvm/test/CodeGen/NVPTX/convert-sm80.ll b/llvm/test/CodeGen/NVPTX/convert-sm80.ll
index aebc28b1cfea3e..d54fa73d306ffa 100644
--- a/llvm/test/CodeGen/NVPTX/convert-sm80.ll
+++ b/llvm/test/CodeGen/NVPTX/convert-sm80.ll
@@ -261,3 +261,20 @@ define <2 x half> @fold_ff2f16x2(float %lo, float %hi) {
   %v1 = insertelement <2 x half> %v0, half %hih, i64 1
   ret <2 x half> %v1
 }
+
+declare i32 @llvm.nvvm.cvt.float.to.tf32(float %f1, i8, i8, i1)
+
+define i32 @cvt_rna_tf32_f32_flags(float %f1) {
+; CHECK-LABEL: cvt_rna_tf32_f32_flags(
+; CHECK:       {
+; CHECK-NEXT:    .reg .b32 %r<2>;
+; CHECK-NEXT:    .reg .f32 %f<2>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.f32 %f1, [cvt_rna_tf32_f32_flags_param_0];
+; CHECK-NEXT:    cvt.rna.tf32.f32 %r1, %f1;
+; CHECK-NEXT:    st.param.b32 [func_retval0], %r1;
+; CHECK-NEXT:    ret;
+  %val = call i32 @llvm.nvvm.cvt.float.to.tf32(float %f1, i8 5, i8 0, i1 0)
+  ret i32 %val
+}
diff --git a/llvm/test/CodeGen/NVPTX/convert-sm89.ll b/llvm/test/CodeGen/NVPTX/convert-sm89.ll
index 5d0576aebbe089..3ff58b95348095 100644
--- a/llvm/test/CodeGen/NVPTX/convert-sm89.ll
+++ b/llvm/test/CodeGen/NVPTX/convert-sm89.ll
@@ -84,3 +84,12 @@ define <2 x half> @cvt_rn_relu_f16x2_e5m2x2(i16 %in) {
   %val = call <2 x half> @llvm.nvvm.e5m2x2.to.f16x2.rn.relu(i16 %in);
   ret <2 x half> %val
 }
+
+declare i32 @llvm.nvvm.cvt.float.to.tf32(float %f1, i8, i8, i1)
+
+; CHECK-LABEL: cvt_rna_satfinite_tf32_f32
+define i32 @cvt_rna_satfinite_tf32_f32(float %f1) {
+; CHECK: cvt.rna.satfinite.tf32.f32
+  %val = call i32 @llvm.nvvm.cvt.float.to.tf32(float %f1, i8 5, i8 1, i1 0)
+  ret i32 %val
+}
diff --git a/llvm/test/CodeGen/NVPTX/convert-sm90.ll b/llvm/test/CodeGen/NVPTX/convert-sm90.ll
new file mode 100644
index 00000000000000..8f932005830250
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/convert-sm90.ll
@@ -0,0 +1,65 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx78| FileCheck --check-prefixes=CHECK %s
+; RUN: %if ptxas-12.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx78| %ptxas-verify -arch=sm_90 %}
+
+declare i32 @llvm.nvvm.cvt.float.to.tf32(float %f1, i8, i8, i1)
+
+define i32 @cvt_rn_tf32_f32(float %f1) {
+; CHECK-LABEL: cvt_rn_tf32_f32(
+; CHECK:       {
+; CHECK-NEXT:    .reg .b32 %r<2>;
+; CHECK-NEXT:    .reg .f32 %f<2>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.f32 %f1, [cvt_rn_tf32_f32_param_0];
+; CHECK-NEXT:    cvt.rn.tf32.f32 %r1, %f1;
+; CHECK-NEXT:    st.param.b32 [func_retval0], %r1;
+; CHECK-NEXT:    ret;
+  %val = call i32 @llvm.nvvm.cvt.float.to.tf32(float %f1, i8 2, i8 0, i1 0)
+  ret i32 %val
+}
+
+define i32 @cvt_rn_relu_tf32_f32(float %f1) {
+; CHECK-LABEL: cvt_rn_relu_tf32_f32(
+; CHECK:       {
+; CHECK-NEXT:    .reg .b32 %r<2>;
+; CHECK-NEXT:    .reg .f32 %f<2>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.f32 %f1, [cvt_rn_relu_tf32_f32_param_0];
+; CHECK-NEXT:    cvt.rn.relu.tf32.f32 %r1, %f1;
+; CHECK-NEXT:    st.param.b32 [func_retval0], %r1;
+; CHECK-NEXT:    ret;
+  %val = call i32 @llvm.nvvm.cvt.float.to.tf32(float %f1, i8 2, i8 0, i1 1)
+  ret i32 %val
+}
+
+define i32 @cvt_rz_tf32_f32(float %f1) {
+; CHECK-LABEL: cvt_rz_tf32_f32(
+; CHECK:       {
+; CHECK-NEXT:    .reg .b32 %r<2>;
+; CHECK-NEXT:    .reg .f32 %f<2>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.f32 %f1, [cvt_rz_tf32_f32_param_0];
+; CHECK-NEXT:    cvt.rz.tf32.f32 %r1, %f1;
+; CHECK-NEXT:    st.param.b32 [func_retval0], %r1;
+; CHECK-NEXT:    ret;
+  %val = call i32 @llvm.nvvm.cvt.float.to.tf32(float %f1, i8 1, i8 0, i1 0)
+  ret i32 %val
+}
+
+define i32 @cvt_rz_relu_tf32_f32(float %f1) {
+; CHECK-LABEL: cvt_rz_relu_tf32_f32(
+; CHECK:       {
+; CHECK-NEXT:    .reg .b32 %r<2>;
+; CHECK-NEXT:    .reg .f32 %f<2>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.f32 %f1, [cvt_rz_relu_tf32_f32_param_0];
+; CHECK-NEXT:    cvt.rz.relu.tf32.f32 %r1, %f1;
+; CHECK-NEXT:    st.param.b32 [func_retval0], %r1;
+; CHECK-NEXT:    ret;
+  %val = call i32 @llvm.nvvm.cvt.float.to.tf32(float %f1, i8 1, i8 0, i1 1)
+  ret i32 %val
+}

@durga4github
Contributor Author

@Artem-B , could you please help with a review?

@durga4github
Contributor Author

cc: @LewisCrawford

@durga4github durga4github force-pushed the durgadossr/nvptx_cvt_tf32 branch from b435f83 to c75c744 Compare January 3, 2025 19:14

@AlexMaclean AlexMaclean left a comment

Thanks for the updates! Barring some minor nits, this looks good to me. The approaches we adopt here for representing rounding modes etc. will probably be used in many other future intrinsics, so let's make sure we set good standards now. Curious for the thoughts of @Artem-B and @andykaylor on this.

@durga4github durga4github force-pushed the durgadossr/nvptx_cvt_tf32 branch 2 times, most recently from fcbecc6 to 00059f6 Compare January 7, 2025 14:48
@durga4github durga4github requested a review from Artem-B January 8, 2025 06:21
@durga4github durga4github force-pushed the durgadossr/nvptx_cvt_tf32 branch from 00059f6 to bf34c2e Compare January 9, 2025 14:09
@durga4github durga4github changed the title [NVPTX] Add float to tf32 conversion intrinsic [NVPTX] Add float to tf32 conversion intrinsics Jan 9, 2025
@durga4github durga4github force-pushed the durgadossr/nvptx_cvt_tf32 branch from bf34c2e to def2900 Compare January 10, 2025 12:01
This patch adds the missing variants of float to tf32
conversion intrinsics. Lit tests are added for all the
intrinsics.

PTX Spec link:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt

Signed-off-by: Durgadoss R <[email protected]>
@durga4github durga4github force-pushed the durgadossr/nvptx_cvt_tf32 branch from def2900 to 58b66bc Compare January 10, 2025 17:42
@durga4github durga4github merged commit 7e2eb0f into llvm:main Jan 13, 2025
8 checks passed
@durga4github durga4github deleted the durgadossr/nvptx_cvt_tf32 branch January 13, 2025 10:48
llvm-ci commented Jan 13, 2025

LLVM Buildbot has detected a new failure on builder clang-m68k-linux-cross running on suse-gary-m68k-cross while building llvm at step 4 "build stage 1".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/27/builds/4561

Here is the relevant piece of the build log for reference:
Step 4 (build stage 1) failure: 'ninja' (failure)
...
                 from /var/lib/buildbot/workers/suse-gary-m68k-cross/clang-m68k-linux-cross/llvm/clang/lib/CodeGen/CodeGenAction.cpp:30:
/var/lib/buildbot/workers/suse-gary-m68k-cross/clang-m68k-linux-cross/llvm/clang/include/clang/Sema/Sema.h:464:7: warning: ‘clang::Sema’ declared with greater visibility than the type of its field ‘clang::Sema::UnusedFileScopedDecls’ [-Wattributes]
  464 | class Sema final : public SemaBase {
      |       ^~~~
/var/lib/buildbot/workers/suse-gary-m68k-cross/clang-m68k-linux-cross/llvm/clang/include/clang/Sema/Sema.h:464:7: warning: ‘clang::Sema’ declared with greater visibility than the type of its field ‘clang::Sema::TentativeDefinitions’ [-Wattributes]
/var/lib/buildbot/workers/suse-gary-m68k-cross/clang-m68k-linux-cross/llvm/clang/include/clang/Sema/Sema.h:464:7: warning: ‘clang::Sema’ declared with greater visibility than the type of its field ‘clang::Sema::ExtVectorDecls’ [-Wattributes]
/var/lib/buildbot/workers/suse-gary-m68k-cross/clang-m68k-linux-cross/llvm/clang/include/clang/Sema/Sema.h:464:7: warning: ‘clang::Sema’ declared with greater visibility than the type of its field ‘clang::Sema::DelegatingCtorDecls’ [-Wattributes]
[1011/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/CodeGenFunction.cpp.o
[1012/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/Targets/ARC.cpp.o
[1013/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/CGBuiltin.cpp.o
FAILED: tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/CGBuiltin.cpp.o 
/usr/bin/c++ -DCLANG_EXPORTS -DGTEST_HAS_RTTI=0 -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/var/lib/buildbot/workers/suse-gary-m68k-cross/clang-m68k-linux-cross/stage1/tools/clang/lib/CodeGen -I/var/lib/buildbot/workers/suse-gary-m68k-cross/clang-m68k-linux-cross/llvm/clang/lib/CodeGen -I/var/lib/buildbot/workers/suse-gary-m68k-cross/clang-m68k-linux-cross/llvm/clang/include -I/var/lib/buildbot/workers/suse-gary-m68k-cross/clang-m68k-linux-cross/stage1/tools/clang/include -I/var/lib/buildbot/workers/suse-gary-m68k-cross/clang-m68k-linux-cross/stage1/include -I/var/lib/buildbot/workers/suse-gary-m68k-cross/clang-m68k-linux-cross/llvm/llvm/include -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -fno-lifetime-dse -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wno-missing-field-initializers -pedantic -Wno-long-long -Wimplicit-fallthrough -Wno-maybe-uninitialized -Wno-nonnull -Wno-class-memaccess -Wno-redundant-move -Wno-pessimizing-move -Wno-noexcept-type -Wdelete-non-virtual-dtor -Wsuggest-override -Wno-comment -Wno-misleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -fno-common -Woverloaded-virtual -fno-strict-aliasing -O3 -DNDEBUG -std=c++17  -fno-exceptions -funwind-tables -fno-rtti -UNDEBUG -MD -MT tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/CGBuiltin.cpp.o -MF tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/CGBuiltin.cpp.o.d -o tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/CGBuiltin.cpp.o -c /var/lib/buildbot/workers/suse-gary-m68k-cross/clang-m68k-linux-cross/llvm/clang/lib/CodeGen/CGBuiltin.cpp
c++: fatal error: Killed signal terminated program cc1plus
compilation terminated.
[1014/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/ItaniumCXXABI.cpp.o
[1015/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/Targets/BPF.cpp.o
[1016/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/MicrosoftCXXABI.cpp.o
[1017/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/Targets/AArch64.cpp.o
[1018/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/Targets/AMDGPU.cpp.o
[1019/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/Targets/MSP430.cpp.o
[1020/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/TargetInfo.cpp.o
[1021/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/CoverageMappingGen.cpp.o
[1022/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/Targets/M68k.cpp.o
[1023/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/Targets/LoongArch.cpp.o
[1024/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/Targets/CSKY.cpp.o
[1025/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/Targets/ARM.cpp.o
[1026/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/Targets/Lanai.cpp.o
[1027/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/Targets/DirectX.cpp.o
[1028/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/Targets/PNaCl.cpp.o
[1029/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/Targets/AVR.cpp.o
[1030/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/Targets/Mips.cpp.o
[1031/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/Targets/Hexagon.cpp.o
[1032/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/CGStmtOpenMP.cpp.o
[1033/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/Targets/NVPTX.cpp.o
[1034/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/Targets/SPIR.cpp.o
[1035/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/Targets/RISCV.cpp.o
[1036/1313] Building CXX object lib/Transforms/Vectorize/CMakeFiles/LLVMVectorize.dir/SLPVectorizer.cpp.o
[1037/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/Targets/Sparc.cpp.o
[1038/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/ObjectFilePCHContainerWriter.cpp.o
[1039/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/CodeGenPGO.cpp.o
[1040/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/Targets/PPC.cpp.o
[1041/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/Targets/SystemZ.cpp.o
[1042/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/CGOpenMPRuntime.cpp.o
[1043/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/Targets/TCE.cpp.o
[1044/1313] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/CodeGenModule.cpp.o
[1045/1313] Building CXX object lib/Passes/CMakeFiles/LLVMPasses.dir/PassBuilder.cpp.o
[1046/1313] Building CXX object lib/Target/X86/CMakeFiles/LLVMX86CodeGen.dir/X86ISelLowering.cpp.o
/var/lib/buildbot/workers/suse-gary-m68k-cross/clang-m68k-linux-cross/llvm/llvm/lib/Target/X86/X86ISelLowering.cpp: In function ‘llvm::SDValue combineTargetShuffle(llvm::SDValue, const llvm::SDLoc&, llvm::SelectionDAG&, llvm::TargetLowering::DAGCombinerInfo&, const llvm::X86Subtarget&)’:
/var/lib/buildbot/workers/suse-gary-m68k-cross/clang-m68k-linux-cross/llvm/llvm/lib/Target/X86/X86ISelLowering.cpp:42434: warning: suggest parentheses around ‘-’ in operand of ‘&’ [-Wparentheses]

kazutakahirata pushed a commit to kazutakahirata/llvm-project that referenced this pull request Jan 13, 2025
This patch adds the missing variants of float to tf32 conversion
intrinsics, with their corresponding lit tests.

PTX Spec link:

https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt

Signed-off-by: Durgadoss R <[email protected]>
durga4github added a commit to durga4github/llvm-project that referenced this pull request Jan 16, 2025
PR llvm#121507 added 'cvt' intrinsics to convert
float to tf32, with the valid set of rounding and
saturation modes. This PR adds an NVVM Dialect Op
for the same.
* lit tests are added to verify the lowering to
  intrinsics.
* Negative tests are also added to check the
  error-handling of invalid combinations.

Signed-off-by: Durgadoss R <[email protected]>
durga4github added a commit that referenced this pull request Jan 17, 2025
PR #121507 added 'cvt' intrinsics to convert
float to tf32, with the valid set of rounding and
saturation modes. This PR adds an NVVM Dialect Op
for the same.
* lit tests are added to verify the lowering to intrinsics.
* Negative tests are also added to check the error-handling of invalid
combinations.

Signed-off-by: Durgadoss R <[email protected]>
durga4github added a commit to durga4github/llvm-project that referenced this pull request Jan 24, 2025
This patch adds the set of f32->tf32 cvt intrinsics
introduced in sm100 with ptx8.6. This builds
on top of the recent PR llvm#121507.

Tests are verified with a 12.8 ptxas executable.

Also update the lit.cfg.py to include the latest
PTXAS_EXE versions.

PTX ISA link:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt

Signed-off-by: Durgadoss R <[email protected]>
durga4github added a commit to durga4github/llvm-project that referenced this pull request Jan 25, 2025
This patch adds the set of f32->tf32 cvt intrinsics
introduced in sm100 with ptx8.6. This builds
on top of the recent PR llvm#121507.

Tests are verified with a 12.8 ptxas executable.

PTX ISA link:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt

Signed-off-by: Durgadoss R <[email protected]>
durga4github added a commit that referenced this pull request Jan 27, 2025
This patch adds the set of f32 -> tf32 cvt intrinsics introduced
in sm100 with ptx8.6. This builds on top of the recent PR #121507.

Tests are verified with a 12.8 ptxas executable.

PTX ISA link:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt

Signed-off-by: Durgadoss R <[email protected]>