
Reland "[NVPTX] Support copysign PTX instruction" #108125


Merged: 3 commits merged into llvm:main on Sep 12, 2024

Conversation

@AlexMaclean (Member) commented Sep 11, 2024

Lower fcopysign SDNodes into copysign PTX instructions where possible. See PTX ISA: 9.7.3.2. Floating Point Instructions: copysign.

Copysign SDNodes with mismatched types are expanded as before, since the PTX instruction requires the types to match.
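For reference, a minimal C sketch (not from this patch) of what the generic `Expand` legalization produces for `fcopysign`: clear the sign bit of the magnitude operand and OR in the sign bit of the sign operand. This integer mask-and-combine sequence is what a single `copysign.f32`/`copysign.f64` PTX instruction replaces when the types match.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Bitwise expansion of copysign for f32: keep the magnitude's exponent and
 * mantissa bits, take only the sign bit from the sign operand. */
static float copysign_expand_f32(float mag, float sgn) {
    uint32_t m, s;
    memcpy(&m, &mag, sizeof m);
    memcpy(&s, &sgn, sizeof s);
    uint32_t r = (m & 0x7fffffffu) | (s & 0x80000000u);
    float out;
    memcpy(&out, &r, sizeof out);
    return out;
}
```

The f64 case is identical with 64-bit masks; the hardware instruction collapses the register moves, masks, and OR into one operation.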

Commit: Lower `fcopysign` SDNodes into `copysign` PTX instructions where possible. See [PTX ISA: 9.7.3.2. Floating Point Instructions: copysign](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#floating-point-instructions-copysign).
@AlexMaclean AlexMaclean self-assigned this Sep 11, 2024
@llvmbot added the backend:NVPTX and llvm:SelectionDAG labels Sep 11, 2024
@llvmbot (Member) commented Sep 11, 2024

@llvm/pr-subscribers-backend-nvptx

@llvm/pr-subscribers-llvm-selectiondag

Author: Alex MacLean (AlexMaclean)

Changes

Lower fcopysign SDNodes into copysign PTX instructions where possible. See PTX ISA: 9.7.3.2. Floating Point Instructions: copysign.

Add a new TLI function supportMismatchCopysign to prevent SelectionDAG folding that causes the types of FCOPYSIGN SDNodes to no longer be matched. This allows for simpler target code and appears to be potentially useful for several other targets as well.
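To see why mismatched-type `FCOPYSIGN` is cheap to handle either way (the point debated in the review below), note that only the sign bit of the sign operand is consumed, and floating-point truncation/extension always preserves the sign bit. A hedged C sketch, not from the patch:

```c
#include <stdint.h>
#include <string.h>

/* Transfer the sign of a double onto a float magnitude directly from the
 * double's top bit. Because fptrunc/fpext preserve the sign bit, this is
 * equivalent to first converting the sign operand to float and then doing
 * a same-type copysign. */
static float copysign_f32_from_f64(float mag, double sgn) {
    uint64_t s;
    uint32_t m;
    memcpy(&s, &sgn, sizeof s);
    memcpy(&m, &mag, sizeof m);
    m = (m & 0x7fffffffu) | ((uint32_t)(s >> 63) << 31);
    float out;
    memcpy(&out, &m, sizeof out);
    return out;
}
```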


Full diff: https://github.com/llvm/llvm-project/pull/108125.diff

7 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/TargetLowering.h (+4)
  • (modified) llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp (+10-6)
  • (modified) llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp (+2-2)
  • (modified) llvm/lib/Target/NVPTX/NVPTXISelLowering.h (+2)
  • (modified) llvm/lib/Target/NVPTX/NVPTXIntrinsics.td (+14)
  • (added) llvm/test/CodeGen/NVPTX/copysign.ll (+112)
  • (modified) llvm/test/CodeGen/NVPTX/math-intrins.ll (+6-13)
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index e17d68d2690c86..a1d50585521f2f 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -4401,6 +4401,10 @@ class TargetLowering : public TargetLoweringBase {
   /// Return true if the target supports ptrauth operand bundles.
   virtual bool supportPtrAuthBundles() const { return false; }
 
+  /// Return true if the target supports FCOPYSIGN nodes with operands of
+  /// mismatched types.
+  virtual bool supportMismatchCopysign() const { return true; }
+
   /// Perform necessary initialization to handle a subset of CSRs explicitly
   /// via copies. This function is called at the beginning of instruction
   /// selection.
diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index bb907633e1f824..ccc2510ba82341 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -17686,7 +17686,11 @@ SDValue DAGCombiner::visitFSQRT(SDNode *N) {
 /// copysign(x, fp_extend(y)) -> copysign(x, y)
 /// copysign(x, fp_round(y)) -> copysign(x, y)
 /// Operands to the functions are the type of X and Y respectively.
-static inline bool CanCombineFCOPYSIGN_EXTEND_ROUND(EVT XTy, EVT YTy) {
+static inline bool CanCombineFCOPYSIGN_EXTEND_ROUND(EVT XTy, EVT YTy,
+                                                    const TargetLowering &TLI) {
+  if (!TLI.supportMismatchCopysign())
+    return false;
+
   // Always fold no-op FP casts.
   if (XTy == YTy)
     return true;
@@ -17701,14 +17705,15 @@ static inline bool CanCombineFCOPYSIGN_EXTEND_ROUND(EVT XTy, EVT YTy) {
   return !YTy.isVector() || EnableVectorFCopySignExtendRound;
 }
 
-static inline bool CanCombineFCOPYSIGN_EXTEND_ROUND(SDNode *N) {
+static inline bool CanCombineFCOPYSIGN_EXTEND_ROUND(SDNode *N,
+                                                    const TargetLowering &TLI) {
   SDValue N1 = N->getOperand(1);
   if (N1.getOpcode() != ISD::FP_EXTEND &&
       N1.getOpcode() != ISD::FP_ROUND)
     return false;
   EVT N1VT = N1->getValueType(0);
   EVT N1Op0VT = N1->getOperand(0).getValueType();
-  return CanCombineFCOPYSIGN_EXTEND_ROUND(N1VT, N1Op0VT);
+  return CanCombineFCOPYSIGN_EXTEND_ROUND(N1VT, N1Op0VT, TLI);
 }
 
 SDValue DAGCombiner::visitFCOPYSIGN(SDNode *N) {
@@ -17752,7 +17757,7 @@ SDValue DAGCombiner::visitFCOPYSIGN(SDNode *N) {
 
   // copysign(x, fp_extend(y)) -> copysign(x, y)
   // copysign(x, fp_round(y)) -> copysign(x, y)
-  if (CanCombineFCOPYSIGN_EXTEND_ROUND(N))
+  if (CanCombineFCOPYSIGN_EXTEND_ROUND(N, TLI))
     return DAG.getNode(ISD::FCOPYSIGN, DL, VT, N0, N1.getOperand(0));
 
   // We only take the sign bit from the sign operand.
@@ -18105,8 +18110,7 @@ SDValue DAGCombiner::visitFP_ROUND(SDNode *N) {
   // eliminate the fp_round on Y.  The second step requires an additional
   // predicate to match the implementation above.
   if (N0.getOpcode() == ISD::FCOPYSIGN && N0->hasOneUse() &&
-      CanCombineFCOPYSIGN_EXTEND_ROUND(VT,
-                                       N0.getValueType())) {
+      CanCombineFCOPYSIGN_EXTEND_ROUND(VT, N0.getValueType(), TLI)) {
     SDValue Tmp = DAG.getNode(ISD::FP_ROUND, SDLoc(N0), VT,
                               N0.getOperand(0), N1);
     AddToWorklist(Tmp.getNode());
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
index 5c5766a8b23455..3816e099537199 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -838,8 +838,8 @@ NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,
   setOperationAction(ISD::FCOPYSIGN, MVT::v2f16, Expand);
   setOperationAction(ISD::FCOPYSIGN, MVT::bf16, Expand);
   setOperationAction(ISD::FCOPYSIGN, MVT::v2bf16, Expand);
-  setOperationAction(ISD::FCOPYSIGN, MVT::f32, Expand);
-  setOperationAction(ISD::FCOPYSIGN, MVT::f64, Expand);
+  setOperationAction(ISD::FCOPYSIGN, MVT::f32, Legal);
+  setOperationAction(ISD::FCOPYSIGN, MVT::f64, Legal);
 
   // These map to corresponding instructions for f32/f64. f16 must be
   // promoted to f32. v2f16 is expanded to f16, which is then promoted
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
index 32e6b044b0de1f..4ef4a0cf064afa 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
@@ -596,6 +596,8 @@ class NVPTXTargetLowering : public TargetLowering {
   // instruction, so we say that ctlz is cheap to speculate.
   bool isCheapToSpeculateCtlz(Type *Ty) const override { return true; }
 
+  bool supportMismatchCopysign() const override { return false; }
+
   AtomicExpansionKind shouldCastAtomicLoadInIR(LoadInst *LI) const override {
     return AtomicExpansionKind::None;
   }
diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index 0c883093dd0a54..e8e8548120131e 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -977,6 +977,20 @@ def INT_NVVM_FABS_F : F_MATH_1<"abs.f32 \t$dst, $src0;", Float32Regs,
 def INT_NVVM_FABS_D : F_MATH_1<"abs.f64 \t$dst, $src0;", Float64Regs,
   Float64Regs, int_nvvm_fabs_d>;
 
+//
+// copysign
+//
+
+def COPYSIGN_F :
+    NVPTXInst<(outs Float32Regs:$dst), (ins Float32Regs:$src0, Float32Regs:$src1),
+              "copysign.f32 \t$dst, $src0, $src1;",
+              [(set Float32Regs:$dst, (fcopysign Float32Regs:$src1, Float32Regs:$src0))]>;
+
+def COPYSIGN_D :
+    NVPTXInst<(outs Float64Regs:$dst), (ins Float64Regs:$src0, Float64Regs:$src1),
+              "copysign.f64 \t$dst, $src0, $src1;",
+              [(set Float64Regs:$dst, (fcopysign Float64Regs:$src1, Float64Regs:$src0))]>;
+
 //
 // Abs, Neg bf16, bf16x2
 //
diff --git a/llvm/test/CodeGen/NVPTX/copysign.ll b/llvm/test/CodeGen/NVPTX/copysign.ll
new file mode 100644
index 00000000000000..0dfc1cae6133af
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/copysign.ll
@@ -0,0 +1,112 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_20 -verify-machineinstrs | FileCheck %s
+; RUN: %if ptxas %{ llc < %s -march=nvptx64 -mcpu=sm_20 -verify-machineinstrs | %ptxas-verify %}
+
+target triple = "nvptx64-nvidia-cuda"
+target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
+
+define float @fcopysign_f_f(float %a, float %b) {
+; CHECK-LABEL: fcopysign_f_f(
+; CHECK:       {
+; CHECK-NEXT:    .reg .f32 %f<4>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.f32 %f1, [fcopysign_f_f_param_0];
+; CHECK-NEXT:    ld.param.f32 %f2, [fcopysign_f_f_param_1];
+; CHECK-NEXT:    copysign.f32 %f3, %f2, %f1;
+; CHECK-NEXT:    st.param.f32 [func_retval0+0], %f3;
+; CHECK-NEXT:    ret;
+  %val = call float @llvm.copysign.f32(float %a, float %b)
+  ret float %val
+}
+
+define double @fcopysign_d_d(double %a, double %b) {
+; CHECK-LABEL: fcopysign_d_d(
+; CHECK:       {
+; CHECK-NEXT:    .reg .f64 %fd<4>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.f64 %fd1, [fcopysign_d_d_param_0];
+; CHECK-NEXT:    ld.param.f64 %fd2, [fcopysign_d_d_param_1];
+; CHECK-NEXT:    copysign.f64 %fd3, %fd2, %fd1;
+; CHECK-NEXT:    st.param.f64 [func_retval0+0], %fd3;
+; CHECK-NEXT:    ret;
+  %val = call double @llvm.copysign.f64(double %a, double %b)
+  ret double %val
+}
+
+define float @fcopysign_f_d(float %a, double %b) {
+; CHECK-LABEL: fcopysign_f_d(
+; CHECK:       {
+; CHECK-NEXT:    .reg .f32 %f<4>;
+; CHECK-NEXT:    .reg .f64 %fd<2>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.f32 %f1, [fcopysign_f_d_param_0];
+; CHECK-NEXT:    ld.param.f64 %fd1, [fcopysign_f_d_param_1];
+; CHECK-NEXT:    cvt.rn.f32.f64 %f2, %fd1;
+; CHECK-NEXT:    copysign.f32 %f3, %f2, %f1;
+; CHECK-NEXT:    st.param.f32 [func_retval0+0], %f3;
+; CHECK-NEXT:    ret;
+  %c = fptrunc double %b to float
+  %val = call float @llvm.copysign.f32(float %a, float %c)
+  ret float %val
+}
+
+define float @fcopysign_f_h(float %a, half %b) {
+; CHECK-LABEL: fcopysign_f_h(
+; CHECK:       {
+; CHECK-NEXT:    .reg .b16 %rs<2>;
+; CHECK-NEXT:    .reg .f32 %f<4>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.f32 %f1, [fcopysign_f_h_param_0];
+; CHECK-NEXT:    ld.param.b16 %rs1, [fcopysign_f_h_param_1];
+; CHECK-NEXT:    cvt.f32.f16 %f2, %rs1;
+; CHECK-NEXT:    copysign.f32 %f3, %f2, %f1;
+; CHECK-NEXT:    st.param.f32 [func_retval0+0], %f3;
+; CHECK-NEXT:    ret;
+  %c = fpext half %b to float
+  %val = call float @llvm.copysign.f32(float %a, float %c)
+  ret float %val
+}
+
+define double @fcopysign_d_f(double %a, float %b) {
+; CHECK-LABEL: fcopysign_d_f(
+; CHECK:       {
+; CHECK-NEXT:    .reg .f32 %f<2>;
+; CHECK-NEXT:    .reg .f64 %fd<4>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.f64 %fd1, [fcopysign_d_f_param_0];
+; CHECK-NEXT:    ld.param.f32 %f1, [fcopysign_d_f_param_1];
+; CHECK-NEXT:    cvt.f64.f32 %fd2, %f1;
+; CHECK-NEXT:    copysign.f64 %fd3, %fd2, %fd1;
+; CHECK-NEXT:    st.param.f64 [func_retval0+0], %fd3;
+; CHECK-NEXT:    ret;
+  %c = fpext float %b to double
+  %val = call double @llvm.copysign.f64(double %a, double %c)
+  ret double %val
+}
+
+define double @fcopysign_d_h(double %a, half %b) {
+; CHECK-LABEL: fcopysign_d_h(
+; CHECK:       {
+; CHECK-NEXT:    .reg .b16 %rs<2>;
+; CHECK-NEXT:    .reg .f64 %fd<4>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.f64 %fd1, [fcopysign_d_h_param_0];
+; CHECK-NEXT:    ld.param.b16 %rs1, [fcopysign_d_h_param_1];
+; CHECK-NEXT:    cvt.f64.f16 %fd2, %rs1;
+; CHECK-NEXT:    copysign.f64 %fd3, %fd2, %fd1;
+; CHECK-NEXT:    st.param.f64 [func_retval0+0], %fd3;
+; CHECK-NEXT:    ret;
+  %c = fpext half %b to double
+  %val = call double @llvm.copysign.f64(double %a, double %c)
+  ret double %val
+}
+
+
+declare float @llvm.copysign.f32(float, float)
+declare double @llvm.copysign.f64(double, double)
diff --git a/llvm/test/CodeGen/NVPTX/math-intrins.ll b/llvm/test/CodeGen/NVPTX/math-intrins.ll
index fcc4ec6e4017f7..bdd6c914384601 100644
--- a/llvm/test/CodeGen/NVPTX/math-intrins.ll
+++ b/llvm/test/CodeGen/NVPTX/math-intrins.ll
@@ -195,9 +195,8 @@ define double @round_double(double %a) {
 ; check the use of 0.5 to implement round
 ; CHECK-LABEL: round_double(
 ; CHECK:       {
-; CHECK-NEXT:    .reg .pred %p<4>;
-; CHECK-NEXT:    .reg .b64 %rd<4>;
-; CHECK-NEXT:    .reg .f64 %fd<10>;
+; CHECK-NEXT:    .reg .pred %p<3>;
+; CHECK-NEXT:    .reg .f64 %fd<8>;
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  // %bb.0:
 ; CHECK-NEXT:    ld.param.f64 %fd1, [round_double_param_0];
@@ -206,16 +205,10 @@ define double @round_double(double %a) {
 ; CHECK-NEXT:    add.rn.f64 %fd3, %fd2, 0d3FE0000000000000;
 ; CHECK-NEXT:    cvt.rzi.f64.f64 %fd4, %fd3;
 ; CHECK-NEXT:    selp.f64 %fd5, 0d0000000000000000, %fd4, %p1;
-; CHECK-NEXT:    abs.f64 %fd6, %fd5;
-; CHECK-NEXT:    neg.f64 %fd7, %fd6;
-; CHECK-NEXT:    mov.b64 %rd1, %fd1;
-; CHECK-NEXT:    shr.u64 %rd2, %rd1, 63;
-; CHECK-NEXT:    and.b64 %rd3, %rd2, 1;
-; CHECK-NEXT:    setp.eq.b64 %p2, %rd3, 1;
-; CHECK-NEXT:    selp.f64 %fd8, %fd7, %fd6, %p2;
-; CHECK-NEXT:    setp.gt.f64 %p3, %fd2, 0d4330000000000000;
-; CHECK-NEXT:    selp.f64 %fd9, %fd1, %fd8, %p3;
-; CHECK-NEXT:    st.param.f64 [func_retval0+0], %fd9;
+; CHECK-NEXT:    copysign.f64 %fd6, %fd1, %fd5;
+; CHECK-NEXT:    setp.gt.f64 %p2, %fd2, 0d4330000000000000;
+; CHECK-NEXT:    selp.f64 %fd7, %fd1, %fd6, %p2;
+; CHECK-NEXT:    st.param.f64 [func_retval0+0], %fd7;
 ; CHECK-NEXT:    ret;
   %b = call double @llvm.round.f64(double %a)
   ret double %b

@@ -4401,6 +4401,10 @@ class TargetLowering : public TargetLoweringBase {
/// Return true if the target supports ptrauth operand bundles.
virtual bool supportPtrAuthBundles() const { return false; }

/// Return true if the target supports FCOPYSIGN nodes with operands of
/// mismatched types.
virtual bool supportMismatchCopysign() const { return true; }
Contributor commented:
-1 on adding this. It's not hard to match the mismatched cases. Also, this should be expressible in terms of the legality of the type combination.

Member Author (@AlexMaclean) replied:

Fair enough. To be honest, I think it might be best to change the type of FCOPYSIGN, since it seems like most targets do not benefit from the mismatched types and must instead add additional logic to re-insert the proper extension/truncation. That being said, for the purposes of this change I've removed this API and instead added custom lowering to a node with matching types.

Contributor replied:

This is mostly yet another instance of SDAG being bad with legality rules that aren't expressive enough. GlobalISel fixes this by having 2 type parameters for copysign.

@AlexMaclean force-pushed the dev/amaclean/reland-copysign branch from 0c0f08a to 17b39fd on September 11, 2024 15:34
@AlexMaclean AlexMaclean merged commit e42f473 into llvm:main Sep 12, 2024
8 checks passed