[NVPTX] designate fabs and fneg as free #121513

Merged

Conversation

AlexMaclean
Member

No description provided.

@llvmbot
Member

llvmbot commented Jan 2, 2025

@llvm/pr-subscribers-backend-nvptx

Author: Alex MacLean (AlexMaclean)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/121513.diff

3 Files Affected:

  • (modified) llvm/lib/Target/NVPTX/NVPTXISelLowering.h (+3)
  • (modified) llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll (+4-4)
  • (added) llvm/test/CodeGen/NVPTX/fabs-fneg-free.ll (+34)
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
index 4a98fe21b81dc6..c9b7e874556990 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
@@ -261,6 +261,9 @@ class NVPTXTargetLowering : public TargetLowering {
     return true;
   }
 
+  bool isFAbsFree(EVT VT) const override { return true; }
+  bool isFNegFree(EVT VT) const override { return true; }
+
 private:
   const NVPTXSubtarget &STI; // cache the subtarget here
   SDValue getParamSymbol(SelectionDAG &DAG, int idx, EVT) const;
diff --git a/llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll b/llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll
index 03cdeb9683abae..8be3a66b7f4836 100644
--- a/llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll
+++ b/llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll
@@ -182,8 +182,8 @@ define <2 x bfloat> @test_fneg(<2 x bfloat> %a) #0 {
 ; CHECK-NEXT:    .reg .b32 %r<3>;
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  // %bb.0:
-; CHECK-NEXT:    ld.param.u32 %r1, [test_fneg_param_0];
-; CHECK-NEXT:    xor.b32 %r2, %r1, -2147450880;
+; CHECK-NEXT:    ld.param.b32 %r1, [test_fneg_param_0];
+; CHECK-NEXT:    neg.bf16x2 %r2, %r1;
 ; CHECK-NEXT:    st.param.b32 [func_retval0], %r2;
 ; CHECK-NEXT:    ret;
   %r = fneg <2 x bfloat> %a
@@ -532,8 +532,8 @@ define <2 x bfloat> @test_fabs(<2 x bfloat> %a) #0 {
 ; CHECK-NEXT:    .reg .b32 %r<3>;
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  // %bb.0:
-; CHECK-NEXT:    ld.param.u32 %r1, [test_fabs_param_0];
-; CHECK-NEXT:    and.b32 %r2, %r1, 2147450879;
+; CHECK-NEXT:    ld.param.b32 %r1, [test_fabs_param_0];
+; CHECK-NEXT:    abs.bf16x2 %r2, %r1;
 ; CHECK-NEXT:    st.param.b32 [func_retval0], %r2;
 ; CHECK-NEXT:    ret;
   %r = call <2 x bfloat> @llvm.fabs.f16(<2 x bfloat> %a)
diff --git a/llvm/test/CodeGen/NVPTX/fabs-fneg-free.ll b/llvm/test/CodeGen/NVPTX/fabs-fneg-free.ll
new file mode 100644
index 00000000000000..9031f33939f2fe
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/fabs-fneg-free.ll
@@ -0,0 +1,34 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_35 -verify-machineinstrs | FileCheck %s
+; RUN: %if ptxas %{ llc < %s -mtriple=nvptx64 -mcpu=sm_35 | %ptxas-verify %}
+target triple = "nvptx64-nvidia-cuda"
+
+define float @fabs_free(i32 %in) {
+; CHECK-LABEL: fabs_free(
+; CHECK:       {
+; CHECK-NEXT:    .reg .f32 %f<3>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.f32 %f1, [fabs_free_param_0];
+; CHECK-NEXT:    abs.f32 %f2, %f1;
+; CHECK-NEXT:    st.param.f32 [func_retval0], %f2;
+; CHECK-NEXT:    ret;
+  %b = bitcast i32 %in to float
+  %f = call float @llvm.fabs.f32(float %b)
+  ret float %f
+}
+
+define float @fneg_free(i32 %in) {
+; CHECK-LABEL: fneg_free(
+; CHECK:       {
+; CHECK-NEXT:    .reg .f32 %f<3>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.f32 %f1, [fneg_free_param_0];
+; CHECK-NEXT:    neg.f32 %f2, %f1;
+; CHECK-NEXT:    st.param.f32 [func_retval0], %f2;
+; CHECK-NEXT:    ret;
+  %b = bitcast i32 %in to float
+  %f = fneg float %b
+  ret float %f
+}
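The magic constants being replaced in the bf16x2 test above make the underlying trick visible: in IEEE-754, fabs and fneg are pure sign-bit operations, which is why a target can plausibly declare them "free". Here is a minimal Python sketch of the same bit manipulation for a single f32, plus a check that the packed-bf16 masks in the diff are the two-lane version of the same idea (helper names are illustrative, not from the patch):

```python
import struct

def f32_bits(x: float) -> int:
    """Reinterpret a float as its 32-bit IEEE-754 pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b: int) -> float:
    """Reinterpret a 32-bit pattern as a float."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def fneg(x: float) -> float:
    # fneg flips only the sign bit: xor with 0x80000000 for f32
    return bits_f32(f32_bits(x) ^ 0x80000000)

def fabs(x: float) -> float:
    # fabs clears only the sign bit: and with 0x7FFFFFFF for f32
    return bits_f32(f32_bits(x) & 0x7FFFFFFF)

assert fneg(1.5) == -1.5
assert fabs(-2.25) == 2.25

# The bf16x2 constants in the test diff are the same masks applied to two
# 16-bit lanes packed into one 32-bit register:
assert 2147450879 == 0x7FFF7FFF                 # and-mask: fabs on both lanes
assert -2147450880 & 0xFFFFFFFF == 0x80008000   # xor-mask: fneg on both lanes
```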

@AlexMaclean AlexMaclean requested a review from kalxr January 2, 2025 18:36
@AlexMaclean
Member Author

@Artem-B @frasercrmck ping for review

Comment on lines +264 to +265
bool isFAbsFree(EVT VT) const override { return true; }
bool isFNegFree(EVT VT) const override { return true; }
Member

What do we get in SASS for those? I do not see any FABS/FNEG instructions in https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#hopper-instruction-set

If ptxas ends up lowering fabs/fneg to the logical ops, we may as well let LLVM do it which may let it optimize some of that away better than ptxas.

Member Author

The SASS emitted will be dependent on lots of decisions in ptxas and I'm not able to say anything definitive or non-public about how PTX instructions will be emitted in SASS. However, here is a simple compiler explorer program which shows a case where these instructions appear to be free in terms of SASS instruction count: https://cuda.godbolt.org/z/MGeb71jjz

Member

My general rule of thumb is that it's likely beneficial to use a specific PTX instruction if there's a matching h/w instruction, but to let LLVM handle it if it turns into generic code that LLVM can produce itself. In this case I'm not convinced that folding some cases into FADD with a negated argument counts, as LLVM would be able to do the folding into fsub, which, I'm pretty sure, would result in the same FADD a, -b at the SASS level.

This patch generates slightly more concise and readable PTX, with the downside of shifting some potential optimizations from LLVM to ptxas. Overall it looks like a wash to me.

There's also a question of whether the "free" applies to all types equally. E.g. for bf16x2 the sign folding no longer happens: https://cuda.godbolt.org/z/e15cG6jdM

Member Author

My general rule of thumb is that it's likely beneficial to use a specific PTX instruction if there's a matching h/w instruction, but to let LLVM handle it if it turns into generic code that LLVM can produce itself.

I'm not sure this is the right rule of thumb. I tend to think we should use the most expressive/specific PTX instruction even if we suspect it may just be syntactic sugar. ptxas may have specific heuristics/peepholes which apply very unevenly and depend on which instruction is used. Even if in theory doing the expansion in LLVM should not matter, in many cases it can substantially impact performance.

In this case I'm not convinced that folding some cases into "FADD" with a negated argument counts

This particular example is simple and contrived and intended only to highlight the possibility that sometimes ptxas may be able to fold floating point abs and neg operands into a single instruction. There are many other similar cases. While this isn't true in all cases, it can happen enough that I think it makes sense to treat these operations as free.

There's also a question of whether the "free" applies to all types equally. E.g. for bf16x2 the sign folding no longer happens: https://cuda.godbolt.org/z/e15cG6jdM

Yes, it may not apply to all types equally, still I think it is simplest and clearest to prefer fabs and fneg in all cases due to readability and the possibility of unknown ptxas optimizations.

Member

I'm not sure this is the right rule of thumb.

All models are wrong, but some are still useful. :-) I'm all for refining mine.

we should use the most expressive/specific ptx instruction even if we suspect it may just be syntactic sugar. ptxas may have specific heuristics/peepholes which apply very unevenly and are dependent on which instruction is used. Even if in theory doing the expansion in LLVM should not matter, in many cases it can substantially impact perf.

It can be argued both ways. I've seen both cases that support your assertions and cases where ptxas did not do a particularly great job. Granted, both LLVM and ptxas change over time, so there will never be one true answer suitable for all use cases all the time.

If I have to choose between code I can observe and improve vs. some kind of black box that requires blind trust, I'm biased towards the former, unless there's strong evidence that whatever the black box does is objectively better (e.g. it has h/w support for exactly this operation). Without that, it boils down to who can optimize things better, LLVM or ptxas. LLVM has the benefit of access to more information about the code; ptxas knows more about the hardware. Which one wins depends on the details. Most of the time both are good enough, so the result is a wash.

Yes, it may not apply to all types equally, still I think it is simplest and clearest to prefer fabs and fneg in all cases due to readability and the possibility of unknown ptxas optimizations.

Readability is a cosmetic factor. Unlike with high-level languages, approximately nobody cares about PTX readability. It's nice to have, but it's not something to lose performance for.

"Unknown ptxas optimizations" are not a very strong argument, IMO. For generic operations like abs/neg, I doubt that ptxas or any other compiler has some magic sauce that will make a whole lot of difference. In the absence of specific evidence of a SASS-level advantage, my preference is to handle it where we control the source code.

That said, I do not see much of a downside to this change either, so for the sake of saving time for both of us I'll stamp the change.

Member Author

Yeah, you make some good points. The main reason I think this is worth doing is that in many cases abs and neg are essentially free due to ptxas optimizations (here is a slightly larger example highlighting this: https://cuda.godbolt.org/z/86G9E6fMd).

In any case, thanks for the flexibility on this one. I'll go ahead and merge.

Member

GPU instructions with predicate math, operand negation, and modulus are fun:

FSETP.NEU.AND P0, PT, |R3|, -R0, PT 

Too bad that we have no way to access those features in a deterministic way via PTX, and have to second-guess what would get ptxas to produce good SASS. :-(

@AlexMaclean AlexMaclean merged commit 45d4698 into llvm:main Jan 8, 2025
10 checks passed
@d0k
Member

d0k commented Jan 9, 2025

This looks like a nice improvement, but I'm afraid it's built on a wrong assumption. PTX abs and neg are not equivalent to LLVM's fabs and fneg since the latter explicitly preserve NaN bits while PTX instructions do not.

LLVM uses that assumption internally so even if the user doesn't care about NaN payloads in general, it can miscompile things. https://godbolt.org/z/Y9Pr9jGxe
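To make the semantic gap concrete: LangRef defines fneg and fabs as operating only on the sign bit, so a NaN's payload must survive them, whereas a hardware neg/abs instruction is typically permitted to return a canonical quiet NaN. A minimal sketch of the IR-level semantics on raw f32 bit patterns (helper names are illustrative):

```python
# LLVM's fneg and fabs are defined purely on the bit pattern: they touch
# only the sign bit and must leave a NaN's payload intact. Model that on
# raw 32-bit IEEE-754 patterns.
def fneg_bits(b: int) -> int:
    return b ^ 0x80000000   # flip the sign bit, nothing else

def fabs_bits(b: int) -> int:
    return b & 0x7FFFFFFF   # clear the sign bit, nothing else

qnan = 0x7FC00123           # quiet NaN carrying payload 0x123
assert fneg_bits(qnan) == 0xFFC00123       # payload survives an IR-level fneg
assert fabs_bits(0xFFC00123) == 0x7FC00123

# A NaN-canonicalizing hardware neg/abs, by contrast, may return the
# canonical quiet NaN (0x7FC00000), dropping the payload bits entirely.
```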

d0k added a commit that referenced this pull request Jan 9, 2025
This reverts commit 45d4698.

NVPTX fabs & fneg are incompatible with LLVM's semantics as LLVM
guarantees the payload of NaNs to stay the same while PTX mangles NaNs.

The bad patterns are still in the NVPTX backend and should probably be
removed, since this change only exposed the bad behavior.
@AlexMaclean
Member Author

This looks like a nice improvement, but I'm afraid it's built on a wrong assumption. PTX abs and neg are not equivalent to LLVM's fabs and fneg since the latter explicitly preserve NaN bits while PTX instructions do not.

LLVM uses that assumption internally so even if the user doesn't care about NaN payloads in general, it can miscompile things. https://godbolt.org/z/Y9Pr9jGxe

This isn't new, nor is it necessarily a problem we need to fix. For as long as the NVPTX backend has existed, we've lowered LLVM fabs and fneg to PTX abs and neg even though the semantics don't quite match. This is unfortunate but likely the least bad option, as lowering them with and/xor would significantly harm performance in most cases and fix correctness only for some rare cases that no one seems to really care about.

@d0k Is this change causing an issue in a real program?

@d0k
Member

d0k commented Jan 9, 2025

This looks like a nice improvement, but I'm afraid it's built on a wrong assumption. PTX abs and neg are not equivalent to LLVM's fabs and fneg since the latter explicitly preserve NaN bits while PTX instructions do not.
LLVM uses that assumption internally so even if the user doesn't care about NaN payloads in general, it can miscompile things. https://godbolt.org/z/Y9Pr9jGxe

This isn't new, nor is it necessarily a problem we need to fix. For as long as the NVPTX backend has existed we've lowered llvm fabs and fneg to PTX abs and neg even though the semantics don't quite match. This is unfortunate but likely the least bad option, as lowering them with and / xor would significantly harm performance in most cases and fix correctness only for some rare cases that no one seems to really care about.

@d0k Is this change causing an issue in a real program?

It makes https://github.com/openxla/xla/blob/main/xla/service/gpu/tests/gpu_cub_sort_test.cc fail on A100; that test is derived from a real model that depends on the sorting order of NaN. It's a case of an xor getting turned into a neg. This is obviously an edge case, but an LLVM backend is not at liberty to redefine IR semantics as it chooses. LangRef is very clear about how these instructions should behave. Dealing with mismatches between IR semantics and ISA semantics is what fast-math flags and intrinsics are for.
