-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[NVPTX] designate fabs and fneg as free #121513
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
AlexMaclean
merged 2 commits into
llvm:main
from
AlexMaclean:dev/amaclean/upstream-fneg-fabs-free
Jan 8, 2025
Merged
Changes from all commits
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5 | ||
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_35 -verify-machineinstrs | FileCheck %s | ||
; RUN: %if ptxas %{ llc < %s -mtriple=nvptx64 -mcpu=sm_35 | %ptxas-verify %} | ||
target triple = "nvptx64-nvidia-cuda" | ||
|
||
define float @fabs_free(i32 %in) { | ||
; CHECK-LABEL: fabs_free( | ||
; CHECK: { | ||
; CHECK-NEXT: .reg .f32 %f<3>; | ||
; CHECK-EMPTY: | ||
; CHECK-NEXT: // %bb.0: | ||
; CHECK-NEXT: ld.param.f32 %f1, [fabs_free_param_0]; | ||
; CHECK-NEXT: abs.f32 %f2, %f1; | ||
; CHECK-NEXT: st.param.f32 [func_retval0], %f2; | ||
; CHECK-NEXT: ret; | ||
%b = bitcast i32 %in to float | ||
%f = call float @llvm.fabs.f32(float %b) | ||
ret float %f | ||
} | ||
|
||
define float @fneg_free(i32 %in) { | ||
; CHECK-LABEL: fneg_free( | ||
; CHECK: { | ||
; CHECK-NEXT: .reg .f32 %f<3>; | ||
; CHECK-EMPTY: | ||
; CHECK-NEXT: // %bb.0: | ||
; CHECK-NEXT: ld.param.f32 %f1, [fneg_free_param_0]; | ||
; CHECK-NEXT: neg.f32 %f2, %f1; | ||
; CHECK-NEXT: st.param.f32 [func_retval0], %f2; | ||
; CHECK-NEXT: ret; | ||
%b = bitcast i32 %in to float | ||
%f = fneg float %b | ||
ret float %f | ||
} |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do we get in SASS for those? I do not see any FABS/FNEG instructions in https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#hopper-instruction-set
If ptxas ends up lowering fabs/fneg to the logical ops, we may as well let LLVM do it which may let it optimize some of that away better than ptxas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SASS emitted will be dependent on lots of decisions in
ptxas
and I'm not able to say anything definitive or non-public about how PTX instructions will be emitted in SASS. However, here is a simple compiler explorer program which shows a case where these instructions appear to be free in terms of SASS instruction count: https://cuda.godbolt.org/z/MGeb71jjzThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My general rule of thumb is that it's likely beneficial to use specific PTX instruction, if there's a matching h/w instruction, but let LLVM handle it if it turns into generic code that LLVM can do itself. In this case I'm not convinced that folding some cases into "FADD" with a negated argument counts, as LLVM would be able to do the folding into
fsub
, which, I'm pretty sure, would result in the sameFADD a, -b
on the SASS level.This patch generates slightly more concise and readable PTX, with the downside of shifting some potential optimizations from LLVM to ptxas. Overall it looks like a wash to me.
There's also a question of whether the "free" applies to all types equally. E.g. for
bf16x2
the sign folding no longer happens: https://cuda.godbolt.org/z/e15cG6jdMThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is the right rule of thumb. I tend to think we should use the most expressive/specific ptx instruction even if we suspect it may just be syntactic sugar.
ptxas
may have specific heuristics/peepholes which apply very unevenly and are dependent on which instruction is used. Even if in theory doing the expansion in LLVM should not matter, in many cases it can substantially impact perf.This particular example is simple and contrived and intended only to highlight the possibility that sometimes
ptxas
may be able to fold floating point abs and neg operands into a single instruction. There are many other similar cases. While this isn't true in all cases, it can happen enough that I think it makes sense to treat these operations as free.Yes, it may not apply to all types equally, still I think it is simplest and clearest to prefer fabs and fneg in all cases due to readability and the possibility of unknown
ptxas
optimizations.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All models are wrong, but some are still useful. :-) I'm all for refining mine.
It can be argued both ways. I've seen both the cases that support your assertions, and the cases where ptxas did not do a particularly great job. Granted, both LLVM and ptxas change over time, so there will never be one true answer suitable for all use cases all the time.
If I have to choose between the code I can observe, and improve, vs some kind of black box that requires blind trust, I'm biased towards the former, unless there's a strong evidence that whatever that black box will do is objectively better (e.g. has h/w support for exactly this operation). Without that, it boils down to who can optimize things better LLVM or PTXAS. LLVM has the benefit of having access to more information about the code. PTXAS knows more about the hardware. Which one wins depends on the details. Most of the time both are good enough, so the result is a wash.
Readability is a cosmetic factor. Unlike high level languages approximately nobody cares about PTX readability. It's nice to have, but it's not something to lose performance for.
"unknown ptxas optimizations" are not a very strong argument, IMO. For generic operations like abs/neg, I doubt that ptxas or any other compiler will have some magic sauce which will make whole lot of a difference. In the absence of specific evidence of SASS level advantage, my preference is to handle it somewhere where we control the source code.
That said, I do not see much of a downside to this change either, so for the sake of saving time for both of us I'll stamp the change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, you make some good points. The main reason I think this is worth doing is that in many cases abs and neg are essentially free due to ptxas optimizations (here is a slightly larger example highlighting this: https://cuda.godbolt.org/z/86G9E6fMd).
In any case, thanks for the flexibility on this one. I'll go ahead and merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GPU instructions with predicate math, operand negation, and modulus are fun:
Too bad that we have no way to access those features in a deterministic way via PTX, and have to second-guess what would get ptxas to produce good SASS. :-(