[NVPTX] designate fabs and fneg as free #121513
Conversation
@llvm/pr-subscribers-backend-nvptx

Author: Alex MacLean (AlexMaclean)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/121513.diff

3 Files Affected:
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
index 4a98fe21b81dc6..c9b7e874556990 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
@@ -261,6 +261,9 @@ class NVPTXTargetLowering : public TargetLowering {
return true;
}
+ bool isFAbsFree(EVT VT) const override { return true; }
+ bool isFNegFree(EVT VT) const override { return true; }
+
private:
const NVPTXSubtarget &STI; // cache the subtarget here
SDValue getParamSymbol(SelectionDAG &DAG, int idx, EVT) const;
diff --git a/llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll b/llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll
index 03cdeb9683abae..8be3a66b7f4836 100644
--- a/llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll
+++ b/llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll
@@ -182,8 +182,8 @@ define <2 x bfloat> @test_fneg(<2 x bfloat> %a) #0 {
; CHECK-NEXT: .reg .b32 %r<3>;
; CHECK-EMPTY:
; CHECK-NEXT: // %bb.0:
-; CHECK-NEXT: ld.param.u32 %r1, [test_fneg_param_0];
-; CHECK-NEXT: xor.b32 %r2, %r1, -2147450880;
+; CHECK-NEXT: ld.param.b32 %r1, [test_fneg_param_0];
+; CHECK-NEXT: neg.bf16x2 %r2, %r1;
; CHECK-NEXT: st.param.b32 [func_retval0], %r2;
; CHECK-NEXT: ret;
%r = fneg <2 x bfloat> %a
@@ -532,8 +532,8 @@ define <2 x bfloat> @test_fabs(<2 x bfloat> %a) #0 {
; CHECK-NEXT: .reg .b32 %r<3>;
; CHECK-EMPTY:
; CHECK-NEXT: // %bb.0:
-; CHECK-NEXT: ld.param.u32 %r1, [test_fabs_param_0];
-; CHECK-NEXT: and.b32 %r2, %r1, 2147450879;
+; CHECK-NEXT: ld.param.b32 %r1, [test_fabs_param_0];
+; CHECK-NEXT: abs.bf16x2 %r2, %r1;
; CHECK-NEXT: st.param.b32 [func_retval0], %r2;
; CHECK-NEXT: ret;
%r = call <2 x bfloat> @llvm.fabs.f16(<2 x bfloat> %a)
diff --git a/llvm/test/CodeGen/NVPTX/fabs-fneg-free.ll b/llvm/test/CodeGen/NVPTX/fabs-fneg-free.ll
new file mode 100644
index 00000000000000..9031f33939f2fe
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/fabs-fneg-free.ll
@@ -0,0 +1,34 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_35 -verify-machineinstrs | FileCheck %s
+; RUN: %if ptxas %{ llc < %s -mtriple=nvptx64 -mcpu=sm_35 | %ptxas-verify %}
+target triple = "nvptx64-nvidia-cuda"
+
+define float @fabs_free(i32 %in) {
+; CHECK-LABEL: fabs_free(
+; CHECK: {
+; CHECK-NEXT: .reg .f32 %f<3>;
+; CHECK-EMPTY:
+; CHECK-NEXT: // %bb.0:
+; CHECK-NEXT: ld.param.f32 %f1, [fabs_free_param_0];
+; CHECK-NEXT: abs.f32 %f2, %f1;
+; CHECK-NEXT: st.param.f32 [func_retval0], %f2;
+; CHECK-NEXT: ret;
+ %b = bitcast i32 %in to float
+ %f = call float @llvm.fabs.f32(float %b)
+ ret float %f
+}
+
+define float @fneg_free(i32 %in) {
+; CHECK-LABEL: fneg_free(
+; CHECK: {
+; CHECK-NEXT: .reg .f32 %f<3>;
+; CHECK-EMPTY:
+; CHECK-NEXT: // %bb.0:
+; CHECK-NEXT: ld.param.f32 %f1, [fneg_free_param_0];
+; CHECK-NEXT: neg.f32 %f2, %f1;
+; CHECK-NEXT: st.param.f32 [func_retval0], %f2;
+; CHECK-NEXT: ret;
+ %b = bitcast i32 %in to float
+ %f = fneg float %b
+ ret float %f
+}
@Artem-B @frasercrmck ping for review
bool isFAbsFree(EVT VT) const override { return true; }
bool isFNegFree(EVT VT) const override { return true; }
What do we get in SASS for those? I do not see any FABS/FNEG instructions in https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#hopper-instruction-set
If ptxas ends up lowering fabs/fneg to the logical ops, we may as well let LLVM do it, since LLVM may be able to optimize some of that away better than ptxas.
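For reference, the "logical ops" expansion in question is plain sign-bit manipulation. Here is a minimal illustrative sketch in Python (the constants in the packed bf16x2 diff above, -2147450880 and 2147450879, are the per-half equivalents 0x80008000 and 0x7FFF7FFF):

```python
import struct

def f32_bits(x: float) -> int:
    # Reinterpret a float as its IEEE-754 binary32 bit pattern.
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b: int) -> float:
    return struct.unpack("<f", struct.pack("<I", b))[0]

def fneg_bits(x: float) -> float:
    # fneg as a logical op: flip the sign bit (PTX: xor.b32 r, r, 0x80000000).
    return bits_f32(f32_bits(x) ^ 0x80000000)

def fabs_bits(x: float) -> float:
    # fabs as a logical op: clear the sign bit (PTX: and.b32 r, r, 0x7fffffff).
    return bits_f32(f32_bits(x) & 0x7FFFFFFF)

print(fneg_bits(1.5))    # -1.5
print(fabs_bits(-2.25))  # 2.25
```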
The SASS emitted will depend on many decisions in ptxas, and I'm not able to say anything definitive or non-public about how PTX instructions will be lowered to SASS. However, here is a simple Compiler Explorer program showing a case where these instructions appear to be free in terms of SASS instruction count: https://cuda.godbolt.org/z/MGeb71jjz
My general rule of thumb is that it's likely beneficial to use a specific PTX instruction if there's a matching h/w instruction, but to let LLVM handle it if it turns into generic code that LLVM can generate itself. In this case I'm not convinced that folding some cases into "FADD" with a negated argument counts, as LLVM would be able to do the folding into fsub, which, I'm pretty sure, would result in the same FADD a, -b at the SASS level.
This patch generates slightly more concise and readable PTX, with the downside of shifting some potential optimizations from LLVM to ptxas. Overall it looks like a wash to me.
There's also a question of whether the "free" applies to all types equally. E.g. for bf16x2 the sign folding no longer happens: https://cuda.godbolt.org/z/e15cG6jdM
My general rule of thumb is that it's likely beneficial to use specific PTX instruction, if there's a matching h/w instruction, but let LLVM handle it if it turns into generic code that LLVM can do itself.
I'm not sure this is the right rule of thumb. I tend to think we should use the most expressive/specific PTX instruction even if we suspect it may just be syntactic sugar. ptxas may have specific heuristics/peepholes which apply very unevenly and depend on which instruction is used. Even if, in theory, doing the expansion in LLVM should not matter, in many cases it can substantially impact performance.
In this case I'm not convinced that folding some cases into "FADD" with a negated argument counts
This particular example is simple and contrived, intended only to highlight the possibility that ptxas may sometimes be able to fold floating-point abs and neg operands into a single instruction. There are many other similar cases. While this isn't true in all cases, it happens often enough that I think it makes sense to treat these operations as free.
There's also a question of whether the "free" applies to all types equally. E.g. for bf16x2 the sign folding no longer happens: https://cuda.godbolt.org/z/e15cG6jdM
Yes, it may not apply to all types equally; still, I think it is simplest and clearest to prefer fabs and fneg in all cases, due to readability and the possibility of unknown ptxas optimizations.
I'm not sure this is the right rule of thumb.
All models are wrong, but some are still useful. :-) I'm all for refining mine.
we should use the most expressive/specific ptx instruction even if we suspect it may just be syntactic sugar. ptxas may have specific heuristics/peepholes which apply very unevenly and are dependent on which instruction is used. Even if in theory doing the expansion in LLVM should not matter, in many cases it can substantially impact perf.
It can be argued both ways. I've seen cases that support your assertions, and cases where ptxas did not do a particularly great job. Granted, both LLVM and ptxas change over time, so there will never be one true answer suitable for all use cases all the time.
If I have to choose between code I can observe and improve, vs. some kind of black box that requires blind trust, I'm biased towards the former, unless there's strong evidence that whatever the black box does is objectively better (e.g. there's h/w support for exactly this operation). Without that, it boils down to who can optimize things better: LLVM or ptxas. LLVM has the benefit of access to more information about the code; ptxas knows more about the hardware. Which one wins depends on the details. Most of the time both are good enough, so the result is a wash.
Yes, it may not apply to all types equally, still I think it is simplest and clearest to prefer fabs and fneg in all cases due to readability and the possibility of unknown ptxas optimizations.
Readability is a cosmetic factor. Unlike with high-level languages, approximately nobody cares about PTX readability. It's nice to have, but it's not something to lose performance over.
"Unknown ptxas optimizations" are not a very strong argument, IMO. For generic operations like abs/neg, I doubt that ptxas or any other compiler has some magic sauce that makes a whole lot of difference. In the absence of specific evidence of a SASS-level advantage, my preference is to handle it somewhere where we control the source code.
That said, I do not see much of a downside to this change either, so for the sake of saving time for both of us I'll stamp the change.
Yea, you make some good points. The main reason I think this is worth doing is that in many cases abs and neg are essentially free due to ptxas optimizations (here is a slightly larger example highlighting this: https://cuda.godbolt.org/z/86G9E6fMd).
In any case, thanks for the flexibility on this one. I'll go ahead and merge.
GPU instructions with predicate math, operand negation, and modulus are fun:
FSETP.NEU.AND P0, PT, |R3|, -R0, PT
Too bad that we have no way to access those features in a deterministic way via PTX, and have to second-guess what would get ptxas to produce good SASS. :-(
This looks like a nice improvement, but I'm afraid it's built on a wrong assumption: LLVM guarantees that fneg and fabs preserve NaN payloads, while PTX does not. LLVM uses that assumption internally, so even if the user doesn't care about NaN payloads in general, it can miscompile things. https://godbolt.org/z/Y9Pr9jGxe
This reverts commit 45d4698. NVPTX fabs & fneg are incompatible with LLVM's semantics: LLVM guarantees that the payload of a NaN stays the same, while PTX mangles NaNs. The bad patterns are still in the NVPTX backend and should probably be removed, since this change only exposed the bad behavior.
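To make the payload issue concrete, here is a bit-level sketch (integers only, no floating point involved) of what LLVM's fneg is defined to do: flip the sign bit and nothing else. That is exactly the property a canonicalizing hardware neg breaks.

```python
# LLVM defines fneg as a pure sign-bit flip, so a NaN payload must survive it.
QNAN_WITH_PAYLOAD = 0x7FC00123   # binary32 quiet NaN carrying payload 0x123
SIGN_BIT          = 0x80000000
MANTISSA_MASK     = 0x007FFFFF   # low 23 bits of a binary32 value

negated = QNAN_WITH_PAYLOAD ^ SIGN_BIT
assert negated == 0xFFC00123                                      # sign flipped
assert negated & MANTISSA_MASK == QNAN_WITH_PAYLOAD & MANTISSA_MASK  # payload intact

# A hardware neg that canonicalizes NaNs would instead produce something like
# 0x7FC00000, losing the payload -- the mismatch that motivated this revert.
print(hex(negated))  # 0xffc00123
```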
This isn't new, nor is it necessarily a problem we need to fix. For as long as the NVPTX backend has existed, we've lowered llvm.fabs and fneg this way.

@d0k Is this change causing an issue in a real program?
It makes https://github.com/openxla/xla/blob/main/xla/service/gpu/tests/gpu_cub_sort_test.cc fail on A100. That test is derived from a real model that depends on the sorting order of NaN.