[NVPTX] designate fabs and fneg as free #121513

Merged
3 changes: 3 additions & 0 deletions llvm/lib/Target/NVPTX/NVPTXISelLowering.h
@@ -261,6 +261,9 @@ class NVPTXTargetLowering : public TargetLowering {
return true;
}

+  bool isFAbsFree(EVT VT) const override { return true; }
+  bool isFNegFree(EVT VT) const override { return true; }
Comment on lines +264 to +265
Member

What do we get in SASS for those? I do not see any FABS/FNEG instructions in https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#hopper-instruction-set

If ptxas ends up lowering fabs/fneg to the logical ops, we may as well let LLVM do it which may let it optimize some of that away better than ptxas.

Member Author

The SASS emitted will be dependent on lots of decisions in ptxas and I'm not able to say anything definitive or non-public about how PTX instructions will be emitted in SASS. However, here is a simple compiler explorer program which shows a case where these instructions appear to be free in terms of SASS instruction count: https://cuda.godbolt.org/z/MGeb71jjz

Member

My general rule of thumb is that it's likely beneficial to use a specific PTX instruction if there's a matching h/w instruction, but to let LLVM handle it if it turns into generic code that LLVM can generate itself. In this case I'm not convinced that folding some cases into FADD with a negated argument counts, as LLVM would be able to do the folding into fsub, which, I'm pretty sure, would result in the same FADD a, -b at the SASS level.

This patch generates slightly more concise and readable PTX, with the downside of shifting some potential optimizations from LLVM to ptxas. Overall it looks like a wash to me.

There's also a question of whether the "free" applies to all types equally. E.g. for bf16x2 the sign folding no longer happens: https://cuda.godbolt.org/z/e15cG6jdM

Member Author

> My general rule of thumb is that it's likely beneficial to use specific PTX instruction, if there's a matching h/w instruction, but let LLVM handle it if it turns into generic code that LLVM can do itself.

I'm not sure this is the right rule of thumb. I tend to think we should use the most expressive/specific ptx instruction even if we suspect it may just be syntactic sugar. ptxas may have specific heuristics/peepholes which apply very unevenly and are dependent on which instruction is used. Even if in theory doing the expansion in LLVM should not matter, in many cases it can substantially impact perf.

> In this case I'm not convinced that folding some cases into "FADD" with a negated argument counts

This particular example is simple and contrived and intended only to highlight the possibility that sometimes ptxas may be able to fold floating point abs and neg operands into a single instruction. There are many other similar cases. While this isn't true in all cases, it can happen enough that I think it makes sense to treat these operations as free.

> There's also a question of whether the "free" applies to all types equally. E.g. for bf16x2 the sign folding no longer happens: https://cuda.godbolt.org/z/e15cG6jdM

Yes, it may not apply to all types equally, still I think it is simplest and clearest to prefer fabs and fneg in all cases due to readability and the possibility of unknown ptxas optimizations.

Member

> I'm not sure this is the right rule of thumb.

All models are wrong, but some are still useful. :-) I'm all for refining mine.

> we should use the most expressive/specific ptx instruction even if we suspect it may just be syntactic sugar. ptxas may have specific heuristics/peepholes which apply very unevenly and are dependent on which instruction is used. Even if in theory doing the expansion in LLVM should not matter, in many cases it can substantially impact perf.

It can be argued both ways. I've seen both the cases that support your assertions, and the cases where ptxas did not do a particularly great job. Granted, both LLVM and ptxas change over time, so there will never be one true answer suitable for all use cases all the time.

If I have to choose between code I can observe and improve vs. some kind of black box that requires blind trust, I'm biased towards the former, unless there's strong evidence that whatever the black box will do is objectively better (e.g. there's h/w support for exactly this operation). Without that, it boils down to which can optimize things better, LLVM or ptxas. LLVM has the benefit of having access to more information about the code. ptxas knows more about the hardware. Which one wins depends on the details. Most of the time both are good enough, so the result is a wash.

> Yes, it may not apply to all types equally, still I think it is simplest and clearest to prefer fabs and fneg in all cases due to readability and the possibility of unknown ptxas optimizations.

Readability is a cosmetic factor. Unlike with high-level languages, approximately nobody cares about PTX readability. It's nice to have, but it's not something to lose performance over.

"unknown ptxas optimizations" are not a very strong argument, IMO. For generic operations like abs/neg, I doubt that ptxas or any other compiler will have some magic sauce which will make whole lot of a difference. In the absence of specific evidence of SASS level advantage, my preference is to handle it somewhere where we control the source code.

That said, I do not see much of a downside to this change either, so for the sake of saving time for both of us I'll stamp the change.

Member Author

Yea, you make some good points. The main reason I think this is worth doing is that in many cases abs and neg are essentially free due to ptxas optimizations (here is a slightly larger example highlighting this: https://cuda.godbolt.org/z/86G9E6fMd).

In any case, thanks for the flexibility on this one. I'll go ahead and merge.

Member

GPU instructions with predicate math, operand negation, and modulus are fun:

FSETP.NEU.AND P0, PT, |R3|, -R0, PT 

Too bad that we have no way to access those features in a deterministic way via PTX, and have to second-guess what would get ptxas to produce good SASS. :-(


private:
const NVPTXSubtarget &STI; // cache the subtarget here
SDValue getParamSymbol(SelectionDAG &DAG, int idx, EVT) const;
8 changes: 4 additions & 4 deletions llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll
@@ -182,8 +182,8 @@ define <2 x bfloat> @test_fneg(<2 x bfloat> %a) #0 {
; CHECK-NEXT: .reg .b32 %r<3>;
; CHECK-EMPTY:
; CHECK-NEXT: // %bb.0:
-; CHECK-NEXT:    ld.param.u32 %r1, [test_fneg_param_0];
-; CHECK-NEXT:    xor.b32 %r2, %r1, -2147450880;
+; CHECK-NEXT:    ld.param.b32 %r1, [test_fneg_param_0];
+; CHECK-NEXT:    neg.bf16x2 %r2, %r1;
; CHECK-NEXT: st.param.b32 [func_retval0], %r2;
; CHECK-NEXT: ret;
%r = fneg <2 x bfloat> %a
@@ -532,8 +532,8 @@ define <2 x bfloat> @test_fabs(<2 x bfloat> %a) #0 {
; CHECK-NEXT: .reg .b32 %r<3>;
; CHECK-EMPTY:
; CHECK-NEXT: // %bb.0:
-; CHECK-NEXT:    ld.param.u32 %r1, [test_fabs_param_0];
-; CHECK-NEXT:    and.b32 %r2, %r1, 2147450879;
+; CHECK-NEXT:    ld.param.b32 %r1, [test_fabs_param_0];
+; CHECK-NEXT:    abs.bf16x2 %r2, %r1;
; CHECK-NEXT: st.param.b32 [func_retval0], %r2;
; CHECK-NEXT: ret;
%r = call <2 x bfloat> @llvm.fabs.f16(<2 x bfloat> %a)
34 changes: 34 additions & 0 deletions llvm/test/CodeGen/NVPTX/fabs-fneg-free.ll
@@ -0,0 +1,34 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_35 -verify-machineinstrs | FileCheck %s
; RUN: %if ptxas %{ llc < %s -mtriple=nvptx64 -mcpu=sm_35 | %ptxas-verify %}
target triple = "nvptx64-nvidia-cuda"

define float @fabs_free(i32 %in) {
; CHECK-LABEL: fabs_free(
; CHECK: {
; CHECK-NEXT: .reg .f32 %f<3>;
; CHECK-EMPTY:
; CHECK-NEXT: // %bb.0:
; CHECK-NEXT: ld.param.f32 %f1, [fabs_free_param_0];
; CHECK-NEXT: abs.f32 %f2, %f1;
; CHECK-NEXT: st.param.f32 [func_retval0], %f2;
; CHECK-NEXT: ret;
%b = bitcast i32 %in to float
%f = call float @llvm.fabs.f32(float %b)
ret float %f
}

define float @fneg_free(i32 %in) {
; CHECK-LABEL: fneg_free(
; CHECK: {
; CHECK-NEXT: .reg .f32 %f<3>;
; CHECK-EMPTY:
; CHECK-NEXT: // %bb.0:
; CHECK-NEXT: ld.param.f32 %f1, [fneg_free_param_0];
; CHECK-NEXT: neg.f32 %f2, %f1;
; CHECK-NEXT: st.param.f32 [func_retval0], %f2;
; CHECK-NEXT: ret;
%b = bitcast i32 %in to float
%f = fneg float %b
ret float %f
}