[NVPTX] designate fabs and fneg as free #121513

Merged

Conversation

AlexMaclean
Member

No description provided.

@llvmbot
Member

llvmbot commented Jan 2, 2025

@llvm/pr-subscribers-backend-nvptx

Author: Alex MacLean (AlexMaclean)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/121513.diff

3 Files Affected:

  • (modified) llvm/lib/Target/NVPTX/NVPTXISelLowering.h (+3)
  • (modified) llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll (+4-4)
  • (added) llvm/test/CodeGen/NVPTX/fabs-fneg-free.ll (+34)
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
index 4a98fe21b81dc6..c9b7e874556990 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
@@ -261,6 +261,9 @@ class NVPTXTargetLowering : public TargetLowering {
     return true;
   }
 
+  bool isFAbsFree(EVT VT) const override { return true; }
+  bool isFNegFree(EVT VT) const override { return true; }
+
 private:
   const NVPTXSubtarget &STI; // cache the subtarget here
   SDValue getParamSymbol(SelectionDAG &DAG, int idx, EVT) const;
diff --git a/llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll b/llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll
index 03cdeb9683abae..8be3a66b7f4836 100644
--- a/llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll
+++ b/llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll
@@ -182,8 +182,8 @@ define <2 x bfloat> @test_fneg(<2 x bfloat> %a) #0 {
 ; CHECK-NEXT:    .reg .b32 %r<3>;
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  // %bb.0:
-; CHECK-NEXT:    ld.param.u32 %r1, [test_fneg_param_0];
-; CHECK-NEXT:    xor.b32 %r2, %r1, -2147450880;
+; CHECK-NEXT:    ld.param.b32 %r1, [test_fneg_param_0];
+; CHECK-NEXT:    neg.bf16x2 %r2, %r1;
 ; CHECK-NEXT:    st.param.b32 [func_retval0], %r2;
 ; CHECK-NEXT:    ret;
   %r = fneg <2 x bfloat> %a
@@ -532,8 +532,8 @@ define <2 x bfloat> @test_fabs(<2 x bfloat> %a) #0 {
 ; CHECK-NEXT:    .reg .b32 %r<3>;
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  // %bb.0:
-; CHECK-NEXT:    ld.param.u32 %r1, [test_fabs_param_0];
-; CHECK-NEXT:    and.b32 %r2, %r1, 2147450879;
+; CHECK-NEXT:    ld.param.b32 %r1, [test_fabs_param_0];
+; CHECK-NEXT:    abs.bf16x2 %r2, %r1;
 ; CHECK-NEXT:    st.param.b32 [func_retval0], %r2;
 ; CHECK-NEXT:    ret;
   %r = call <2 x bfloat> @llvm.fabs.f16(<2 x bfloat> %a)
diff --git a/llvm/test/CodeGen/NVPTX/fabs-fneg-free.ll b/llvm/test/CodeGen/NVPTX/fabs-fneg-free.ll
new file mode 100644
index 00000000000000..9031f33939f2fe
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/fabs-fneg-free.ll
@@ -0,0 +1,34 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_35 -verify-machineinstrs | FileCheck %s
+; RUN: %if ptxas %{ llc < %s -mtriple=nvptx64 -mcpu=sm_35 | %ptxas-verify %}
+target triple = "nvptx64-nvidia-cuda"
+
+define float @fabs_free(i32 %in) {
+; CHECK-LABEL: fabs_free(
+; CHECK:       {
+; CHECK-NEXT:    .reg .f32 %f<3>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.f32 %f1, [fabs_free_param_0];
+; CHECK-NEXT:    abs.f32 %f2, %f1;
+; CHECK-NEXT:    st.param.f32 [func_retval0], %f2;
+; CHECK-NEXT:    ret;
+  %b = bitcast i32 %in to float
+  %f = call float @llvm.fabs.f32(float %b)
+  ret float %f
+}
+
+define float @fneg_free(i32 %in) {
+; CHECK-LABEL: fneg_free(
+; CHECK:       {
+; CHECK-NEXT:    .reg .f32 %f<3>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.f32 %f1, [fneg_free_param_0];
+; CHECK-NEXT:    neg.f32 %f2, %f1;
+; CHECK-NEXT:    st.param.f32 [func_retval0], %f2;
+; CHECK-NEXT:    ret;
+  %b = bitcast i32 %in to float
+  %f = fneg float %b
+  ret float %f
+}
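The magic constants being replaced in the bf16x2 test above make the underlying trick visible: in IEEE-754, fabs and fneg are pure sign-bit operations, which is why a target can plausibly declare them "free". Here is a minimal Python sketch of the same bit manipulation for a single f32, plus a check that the packed-bf16 masks in the diff are the two-lane version of the same idea (helper names are illustrative, not from the patch):

```python
import struct

def f32_bits(x: float) -> int:
    """Reinterpret a float as its 32-bit IEEE-754 pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b: int) -> float:
    """Reinterpret a 32-bit pattern as a float."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def fneg(x: float) -> float:
    # fneg flips only the sign bit: xor with 0x80000000 for f32
    return bits_f32(f32_bits(x) ^ 0x80000000)

def fabs(x: float) -> float:
    # fabs clears only the sign bit: and with 0x7FFFFFFF for f32
    return bits_f32(f32_bits(x) & 0x7FFFFFFF)

assert fneg(1.5) == -1.5
assert fabs(-2.25) == 2.25

# The bf16x2 constants in the test diff are the same masks applied to two
# 16-bit lanes packed into one 32-bit register:
assert 2147450879 == 0x7FFF7FFF                 # and-mask: fabs on both lanes
assert -2147450880 & 0xFFFFFFFF == 0x80008000   # xor-mask: fneg on both lanes
```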

@AlexMaclean AlexMaclean requested a review from kalxr January 2, 2025 18:36
@AlexMaclean
Member Author

@Artem-B @frasercrmck ping for review

Comment on lines +264 to +265
bool isFAbsFree(EVT VT) const override { return true; }
bool isFNegFree(EVT VT) const override { return true; }
Member

What do we get in SASS for those? I do not see any FABS/FNEG instructions in https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#hopper-instruction-set

If ptxas ends up lowering fabs/fneg to the logical ops, we may as well let LLVM do it which may let it optimize some of that away better than ptxas.

Member Author

The SASS emitted will be dependent on lots of decisions in ptxas and I'm not able to say anything definitive or non-public about how PTX instructions will be emitted in SASS. However, here is a simple compiler explorer program which shows a case where these instructions appear to be free in terms of SASS instruction count: https://cuda.godbolt.org/z/MGeb71jjz

Member

My general rule of thumb is that it's likely beneficial to use a specific PTX instruction if there's a matching h/w instruction, but to let LLVM handle it if it turns into generic code that LLVM can produce itself. In this case I'm not convinced that folding some cases into FADD with a negated argument counts, as LLVM would be able to do the folding into fsub, which, I'm pretty sure, would result in the same FADD a, -b at the SASS level.

This patch generates slightly more concise and readable PTX, with the downside of shifting some potential optimizations from LLVM to ptxas. Overall it looks like a wash to me.

There's also a question of whether the "free" applies to all types equally. E.g. for bf16x2 the sign folding no longer happens: https://cuda.godbolt.org/z/e15cG6jdM

Member Author

My general rule of thumb is that it's likely beneficial to use a specific PTX instruction if there's a matching h/w instruction, but to let LLVM handle it if it turns into generic code that LLVM can produce itself.

I'm not sure this is the right rule of thumb. I tend to think we should use the most expressive/specific PTX instruction even if we suspect it may just be syntactic sugar. ptxas may have specific heuristics/peepholes which apply very unevenly and depend on which instruction is used. Even if in theory doing the expansion in LLVM should not matter, in many cases it can substantially impact performance.

In this case I'm not convinced that folding some cases into "FADD" with a negated argument counts

This particular example is simple and contrived and intended only to highlight the possibility that sometimes ptxas may be able to fold floating point abs and neg operands into a single instruction. There are many other similar cases. While this isn't true in all cases, it can happen enough that I think it makes sense to treat these operations as free.

There's also a question of whether the "free" applies to all types equally. E.g. for bf16x2 the sign folding no longer happens: https://cuda.godbolt.org/z/e15cG6jdM

Yes, it may not apply to all types equally, still I think it is simplest and clearest to prefer fabs and fneg in all cases due to readability and the possibility of unknown ptxas optimizations.

Member

I'm not sure this is the right rule of thumb.

All models are wrong, but some are still useful. :-) I'm all for refining mine.

we should use the most expressive/specific ptx instruction even if we suspect it may just be syntactic sugar. ptxas may have specific heuristics/peepholes which apply very unevenly and are dependent on which instruction is used. Even if in theory doing the expansion in LLVM should not matter, in many cases it can substantially impact perf.

It can be argued both ways. I've seen both cases that support your assertions and cases where ptxas did not do a particularly great job. Granted, both LLVM and ptxas change over time, so there will never be one true answer suitable for all use cases all the time.

If I have to choose between code I can observe and improve vs. some kind of black box that requires blind trust, I'm biased towards the former, unless there's strong evidence that whatever the black box does is objectively better (e.g. it has h/w support for exactly this operation). Without that, it boils down to who can optimize things better, LLVM or ptxas. LLVM has the benefit of access to more information about the code; ptxas knows more about the hardware. Which one wins depends on the details. Most of the time both are good enough, so the result is a wash.

Yes, it may not apply to all types equally, still I think it is simplest and clearest to prefer fabs and fneg in all cases due to readability and the possibility of unknown ptxas optimizations.

Readability is a cosmetic factor. Unlike with high-level languages, approximately nobody cares about PTX readability. It's nice to have, but it's not something to lose performance for.

"Unknown ptxas optimizations" are not a very strong argument, IMO. For generic operations like abs/neg, I doubt that ptxas or any other compiler has some magic sauce that will make a whole lot of difference. In the absence of specific evidence of a SASS-level advantage, my preference is to handle it where we control the source code.

That said, I do not see much of a downside to this change either, so for the sake of saving time for both of us I'll stamp the change.

Member Author

Yeah, you make some good points. The main reason I think this is worth doing is that in many cases abs and neg are essentially free due to ptxas optimizations (here is a slightly larger example highlighting this: https://cuda.godbolt.org/z/86G9E6fMd).

In any case, thanks for the flexibility on this one. I'll go ahead and merge.

Member

GPU instructions with predicate math, operand negation, and modulus are fun:

FSETP.NEU.AND P0, PT, |R3|, -R0, PT 

Too bad that we have no way to access those features in a deterministic way via PTX, and have to second-guess what would get ptxas to produce good SASS. :-(

@AlexMaclean AlexMaclean merged commit 45d4698 into llvm:main Jan 8, 2025
10 checks passed
@d0k
Member

d0k commented Jan 9, 2025

This looks like a nice improvement, but I'm afraid it's built on a wrong assumption. PTX abs and neg are not equivalent to LLVM's fabs and fneg since the latter explicitly preserve NaN bits while PTX instructions do not.

LLVM uses that assumption internally so even if the user doesn't care about NaN payloads in general, it can miscompile things. https://godbolt.org/z/Y9Pr9jGxe
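To make the semantic gap concrete: LangRef defines fneg and fabs as operating only on the sign bit, so a NaN's payload must survive them, whereas a hardware neg/abs instruction is typically permitted to return a canonical quiet NaN. A minimal sketch of the IR-level semantics on raw f32 bit patterns (helper names are illustrative):

```python
# LLVM's fneg and fabs are defined purely on the bit pattern: they touch
# only the sign bit and must leave a NaN's payload intact. Model that on
# raw 32-bit IEEE-754 patterns.
def fneg_bits(b: int) -> int:
    return b ^ 0x80000000   # flip the sign bit, nothing else

def fabs_bits(b: int) -> int:
    return b & 0x7FFFFFFF   # clear the sign bit, nothing else

qnan = 0x7FC00123           # quiet NaN carrying payload 0x123
assert fneg_bits(qnan) == 0xFFC00123       # payload survives an IR-level fneg
assert fabs_bits(0xFFC00123) == 0x7FC00123

# A NaN-canonicalizing hardware neg/abs, by contrast, may return the
# canonical quiet NaN (0x7FC00000), dropping the payload bits entirely.
```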

d0k added a commit that referenced this pull request Jan 9, 2025
This reverts commit 45d4698.

NVPTX fabs & fneg are incompatible with LLVM's semantics as LLVM
guarantees the payload of NaNs to stay the same while PTX mangles NaNs.

The bad patterns are still in the NVPTX backend and should probably be
removed, since this change only exposed the bad behavior.
@AlexMaclean
Member Author

This looks like a nice improvement, but I'm afraid it's built on a wrong assumption. PTX abs and neg are not equivalent to LLVM's fabs and fneg since the latter explicitly preserve NaN bits while PTX instructions do not.

LLVM uses that assumption internally so even if the user doesn't care about NaN payloads in general, it can miscompile things. https://godbolt.org/z/Y9Pr9jGxe

This isn't new, nor is it necessarily a problem we need to fix. For as long as the NVPTX backend has existed, we've lowered LLVM fabs and fneg to PTX abs and neg even though the semantics don't quite match. This is unfortunate but likely the least bad option, as lowering them with and/xor would significantly harm performance in most cases and fix correctness only for some rare cases that no one seems to really care about.

@d0k Is this change causing an issue in a real program?

@d0k
Member

d0k commented Jan 9, 2025

This looks like a nice improvement, but I'm afraid it's built on a wrong assumption. PTX abs and neg are not equivalent to LLVM's fabs and fneg since the latter explicitly preserve NaN bits while PTX instructions do not.
LLVM uses that assumption internally so even if the user doesn't care about NaN payloads in general, it can miscompile things. https://godbolt.org/z/Y9Pr9jGxe

This isn't new, nor is it necessarily a problem we need to fix. For as long as the NVPTX backend has existed we've lowered llvm fabs and fneg to PTX abs and neg even though the semantics don't quite match. This is unfortunate but likely the least bad option, as lowering them with and / xor would significantly harm performance in most cases and fix correctness only for some rare cases that no one seems to really care about.

@d0k Is this change causing an issue in a real program?

It makes https://github.com/openxla/xla/blob/main/xla/service/gpu/tests/gpu_cub_sort_test.cc fail on A100; that test is derived from a real model that depends on the sorting order of NaN. It's a case of an xor getting turned into a neg. This is obviously an edge case, but an LLVM backend is not at liberty to redefine IR semantics as it chooses. LangRef is very clear about how these instructions should behave. Dealing with mismatches between IR semantics and ISA semantics is what fast-math flags and intrinsics are for.
