-
Notifications
You must be signed in to change notification settings - Fork 14.4k
[NVPTX] Skip processing BasicBlocks with single unreachable instruction in nvptx-lower-unreachable
pass.
#72641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Changes from all commits
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
; RUN: llc < %s -march=nvptx -mcpu=sm_20 -verify-machineinstrs | FileCheck %s | ||
|
||
declare noundef i32 @llvm.nvvm.read.ptx.sreg.tid.x() #1 | ||
|
||
define void @kernel_func() { | ||
|
||
%1 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() | ||
|
||
switch i32 %1, label %unreachabledefault [ | ||
i32 0, label %bb0 | ||
i32 1, label %bb1 | ||
i32 2, label %bb1 | ||
i32 3, label %bb2 | ||
] | ||
|
||
bb0: | ||
ret void | ||
|
||
bb1: | ||
ret void | ||
|
||
bb2: | ||
ret void | ||
|
||
unreachabledefault: | ||
unreachable | ||
|
||
; CHECK: @kernel_func | ||
; CHECK-NOT: unreachabledefault | ||
; CHECK: -- End function | ||
} |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're still proposing to trade off correctness for performance and I still strongly believe it's not something we want to do.
I do not think we should apply this patch. If you do want to live dangerously, you may want to try resurrecting
nvptx-exit-on-unreachable
command line option and then apply it to the compilation.If you do want this patch to land, you need a strong proof that not emitting exit in this particular case is safe, which will be hard to obtain, considering that all we know is that eliminating this explicit control flow hint results in a miscompilation by ptxas, NVIDIA's optimizing assembler, which we do not control or have much visibility into.
It took quite a few years, and multiple attempts to solve the issue, to eventually arrive at this workaround which appears to address the root cause. Partially undoing it does not make sense to me. "A little bit broken" is still broken, even if your particular use case happens to be fine.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Artem-B
I disagree with your comment. The
exit
with an additional node in CGF is introduced by this. If you build the test before that PR and with my change, you'll see they have identical ptx, whereas after that PR you'll see an additional node withexit
in it.If as claimed, this patch sacrifices correctness for performance, then before this we did not have correct behaviour in such situations.
I'd suggest comparing the ptx of three revisions. 1) before this, 2) after that, 3 after the fix introduced in this patch.
Overall, this patch fixes the overhead introduced by this which the attached test clearly shows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not disagreeing that it does make a difference in your use case, and that in your case not having an exit is benign and produces faster code.
The trouble is that I can not say that the patch is benign for all NVPTX users.
As far as LLVM is concerned you patch is correct, but in case of NVIDIA GPUs, we also need to take into account NVIDIA's assembler and we do know that it miscompiles sufficiently convoluted code, unless we explicitly annotate unreachable code with an
exit
instead of allowing it to fall through. ptxas does not have the same info as LLVM and does not know that the code is unreachable. The fall-through effectively looks like a new CFG edge which confuses it, and results in thread mis-convergence.I may be wrong and would be happy to be proven wrong. And example of the "works for me here" is not sufficient. I'll readily admit that it does, and will get back to my point above -- can we guarantee it to work for all users? Until we can, we should keep generating
exit
.I would be open to resurrecting a hidden compiler flag to disable generation of
exit
, instead.I personally do not know how to prove correctness in this case. Based on the past experience with this issue (~8 years of looking for the fix: https://bugs.llvm.org/show_bug.cgi?id=27738), empirical testing on this particular issue would also be insufficient. The issue tends to pop up in odd and unpredictable places and is very hard to reproduce and diagnose in most of those cases. So, please, take my word that there's a very good reason for the abundance of caution I'm advocating for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Artem-B
ptx
Running the shared code built using syclos 8e92193 revision (related with D152789) (with --save-temps added to the last command in ethminier/ethminer/build.sh) produces a buildinfo-sycl-nvptx64-nvidia-cuda-sm_50-ae5a30.s file (811816 bytes) and reports the following measurements:
Manually modifying the above ptx file to
buildinfo-sycl-nvptx64-nvidia-cuda-sm_50-204989.s
by moving theBasicBlock
havingexit
added by the originalnvptx-lower-unreachable
pass in D152789 and rebuilding the binary yields the following measurements:which is a massive
%47
performance improvement.We can't pinpoint a specific issue as to why a minor relocation of an added basic block back to its original position in the CFG, which is typically moved to the end of the CFG by block-placement, can have such a significant impact. Obviously, widening divergent areas could be a potential reason, which the original PR has aimed to address.
It's also challenging to modify the pass to prevent that particular basic block from being affected by optimisation passes. Such changes might not be straightforward and could introduce some complexity that may not align well with the the code standards. For instance, it may be achieved by having one extra pass to undo works done by block-placement optimisation, which may some don't fancy.
It would be valuable to have your input, and possibly input from @maleadt as well.
Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what kind of input you're still looking for. I'm not disputing that you see a performance regression, but I still do not think it warrants disabling lowering of unreachable to exit.
I remain open to re-introduction of a hidden option to allow disabling it, as an escape hatch.
If you can figure out why a seemingly minor differences result in such a huge performance difference and if there's something we can do to improve the code, without risking correctness, that would be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is true. It was also not intended to buy us any performance gains, so this part is WAI.
I strongly disagree with this assertion. The patch does resolve the issue we've been struggling with for a very long time. We currently do not have any other mechanism to avoid miscompilation in ptxas, which makes this pass essential to guarantee correctness.
While I understand your frustration with the performance regression, undoing the pass is not the way forward. I've already proposed few options that would help (escape hatch option to disable this pass if/when it's needed, working around the issue on the source code level), but for some reason you keep hammering on the "disable to fix part", without providing strong enough reasons for that. "let's break things for everyone so my code would run fast" is not a very strong argument.
I recommend alternative options that do not require reintroducing the miscompilation for everyone with NVIDIA GPUs.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Artem-B
My statements were derived from the contents of the original patch. I'd be happy to see issues the patch fixes, if you have access to any.
It appears that I may need to resort to modifying the source code as a an option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or reintroduce
-nvptx-exit-on-unreachable=0
option. This would be a relatively uncontroversial change.The original patch description provides a good overview https://reviews.llvm.org/D152789
It all started here: https://bugs.llvm.org/show_bug.cgi?id=27738
Julia folks eventually came up with a concise reproducer:
JuliaGPU/CUDAnative.jl#4
Since then we've attempted to keep CFG structured (it helped a lot, but not completely.)
Over time the issue kept popping up, again, and again. E.g. we had to disable some loop transform passes that happened to trigger the issue.
7 years later we've finally figured out what seems to be triggering ptxas miscompilation.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional reproducers can be found in JuliaGPU/CUDA.jl#1746 (comment), JuliaGPU/CUDA.jl#431 (comment), JuliaGPU/CUDA.jl#43 (comment). Basically, we've been running into this bug with various user applications roughly every year or so, necessitating more and more questionable workarounds (both in code, and in our compiler) until we finally found the root cause. Needless to say I'd be strongly opposed to reverting the fix.
However, the proposed
-nvptx-exit-on-unreachable=0
may be a viable default at some point in the future, because NVIDIA should have fixedptxas
to modeltrap
likeexit
, i.e., we just need to make sure that every unreachable block ends with atrap
terminator (which I guess is the current behavior). So that seems like a good option to add back, and would help with your performance issue right now.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can enable unreachable->trap lowering with a flag, but by default unreachable is still lowered into nothing. AFAICT, trap will likely create the same performance regression as exit does in this case.
For NVPTX, we will continue to need
trap
orexit
to avoid confusing ptxas about the intended control flow. If someone consciously wants/needs to trade off miscompilation vs performance, we'll still need to have an explicit "shoot this foot, please" option for that.