-
Notifications
You must be signed in to change notification settings - Fork 14.4k
[NVPTX] Skip processing BasicBlocks with single unreachable instruction in nvptx-lower-unreachable
pass.
#72641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Can you elaborate on the actual impact the While exit does result in code with more branches on PTX level, SASS looks largely identical with the difference that unreachable code does call exit. https://godbolt.org/z/s8f5nc8cs I'm concerned that special-casing unreachable in switch we may make it possible that we re-open a possibility that we'll re-introduce the original issue where we break structured CFG for ptxas if the switch ends up being replicated within more complicated CFG. |
This is coming from a DPCPP performance evaluation which showed 3% performance loss in a benchmark after introducing Lower unreachable to exit to allow ptxas to accurately reconstruct the CFG. I can share the ir and ptx of the related code before and after commit, if it helps. |
If you could also point at the source code that resulted in generating the switch that may help, too. AFAICT, the unreachable->exit change did not cause any obvious performance regressions on our benchmarks, so the regression may be specific to your tests. I'm really reluctant risking resurrection of a nasty miscompilation bug for the sake of a minor performance gain on a niche benchmark. The bar here is relatively high. It would either have to be a widespread regression with no workaround, or the patch would need to have a strong guarantee that it it can never break structured control flow (and I'm not quite sure how to do that. We'll need someone with more expertise than myself. @arsenm, @maleadt any suggestions? ). |
Hi @Artem-B |
I'm unfamiliar with LLVM's machine optimization level, but conceptually it does seem like a better/safer approach to try and restore the compiler's ability to remove the unreachable default block in the presence of |
Comparing the PTX in your reproducer I do not see any interesting differences. All the code gets optimized exactly the same way, down to exactly the same number of registers, so the diff is very clean. The exit is not blocking any optimizations and the difference is literally one extra predicated branch to I think compiler does exactly what it should be doing and that this particular issue must be handled at the user source code level. If some switch values are truly impossible, it should be made explicit in the source code and that should allow compiler to optimize that extra jump away. Previously the code had apparently relied on UB that resulted in a cost-free fall-through through the unreachable part. If the fall-through is intentional, or impossible and we expect compiler to do something specific, the user code should be explicit about that. We should not be restoring any particular kind of undefined behavior we may have done in the past. |
I agree the there is not much difference between the two ptx. However, it is claimed that it made ethminer -3% performance regression on A100 CUDA. Moreover, to the best of my understanding of the code, there is nothing to be done at source level to address this issue. |
You can use
That remains to be seen. There are certainly ways to tell compiler that some values are guaranteed not to be seen. E.g. something as simple as an explicit |
Updates the `nvptx-lower-unreachable` pass to bypass BasicBlocks containing just one unreachable instruction. This allows further optimization by subsequent passes, enhancing overall performance and efficiency.
nvptx-lower-unreachable
pass.
Thanks @Artem-B and apologies for late reply. |
// performance. To counteract this undesirable consequence, we choose to | ||
// refrain from processing BasicBlocks with just one unreachable instruction | ||
// in this pass. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're still proposing to trade off correctness for performance and I still strongly believe it's not something we want to do.
I do not think we should apply this patch. If you do want to live dangerously, you may want to try resurrecting nvptx-exit-on-unreachable
command line option and then apply it to the compilation.
If you do want this patch to land, you need a strong proof that not emitting exit in this particular case is safe, which will be hard to obtain, considering that all we know is that eliminating this explicit control flow hint results in a miscompilation by ptxas, NVIDIA's optimizing assembler, which we do not control or have much visibility into.
It took quite a few years, and multiple attempts to solve the issue, to eventually arrive at this workaround which appears to address the root cause. Partially undoing it does not make sense to me. "A little bit broken" is still broken, even if your particular use case happens to be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're still proposing to trade off correctness for performance and I still strongly believe it's not something we want to do.
Thank you @Artem-B
I disagree with your comment. The exit
with an additional node in CGF is introduced by this. If you build the test before that PR and with my change, you'll see they have identical ptx, whereas after that PR you'll see an additional node with exit
in it.
If as claimed, this patch sacrifices correctness for performance, then before this we did not have correct behaviour in such situations.
I'd suggest comparing the ptx of three revisions. 1) before this, 2) after that, 3 after the fix introduced in this patch.
Overall, this patch fixes the overhead introduced by this which the attached test clearly shows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not disagreeing that it does make a difference in your use case, and that in your case not having an exit is benign and produces faster code.
The trouble is that I can not say that the patch is benign for all NVPTX users.
As far as LLVM is concerned you patch is correct, but in case of NVIDIA GPUs, we also need to take into account NVIDIA's assembler and we do know that it miscompiles sufficiently convoluted code, unless we explicitly annotate unreachable code with an exit
instead of allowing it to fall through. ptxas does not have the same info as LLVM and does not know that the code is unreachable. The fall-through effectively looks like a new CFG edge which confuses it, and results in thread mis-convergence.
I may be wrong and would be happy to be proven wrong. And example of the "works for me here" is not sufficient. I'll readily admit that it does, and will get back to my point above -- can we guarantee it to work for all users? Until we can, we should keep generating exit
.
I would be open to resurrecting a hidden compiler flag to disable generation of exit
, instead.
I personally do not know how to prove correctness in this case. Based on the past experience with this issue (~8 years of looking for the fix: https://bugs.llvm.org/show_bug.cgi?id=27738), empirical testing on this particular issue would also be insufficient. The issue tends to pop up in odd and unpredictable places and is very hard to reproduce and diagnose in most of those cases. So, please, take my word that there's a very good reason for the abundance of caution I'm advocating for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running the shared code built using syclos 8e92193 revision (related with D152789) (with --save-temps added to the last command in ethminier/ethminer/build.sh) produces a buildinfo-sycl-nvptx64-nvidia-cuda-sm_50-ae5a30.s file (811816 bytes) and reports the following measurements:
Total Execution Time: 61.0493 s
Total Number of Hashes: 9875488768
Overall Hash rate: 161.763 MH/s
Manually modifying the above ptx file to buildinfo-sycl-nvptx64-nvidia-cuda-sm_50-204989.s
by moving the BasicBlock
having exit
added by the original nvptx-lower-unreachable
pass in D152789 and rebuilding the binary yields the following measurements:
Total Execution Time: 61.177 s
Total Number of Hashes: 14647558144
Overall Hash rate: 239.429 MH/s
which is a massive %47
performance improvement.
We can't pinpoint a specific issue as to why a minor relocation of an added basic block back to its original position in the CFG, which is typically moved to the end of the CFG by block-placement, can have such a significant impact. Obviously, widening divergent areas could be a potential reason, which the original PR has aimed to address.
It's also challenging to modify the pass to prevent that particular basic block from being affected by optimisation passes. Such changes might not be straightforward and could introduce some complexity that may not align well with the the code standards. For instance, it may be achieved by having one extra pass to undo works done by block-placement optimisation, which may some don't fancy.
It would be valuable to have your input, and possibly input from @maleadt as well.
Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what kind of input you're still looking for. I'm not disputing that you see a performance regression, but I still do not think it warrants disabling lowering of unreachable to exit.
I remain open to re-introduction of a hidden option to allow disabling it, as an escape hatch.
If you can figure out why a seemingly minor differences result in such a huge performance difference and if there's something we can do to improve the code, without risking correctness, that would be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and the patch does not offer any evidence of a performance improvement
This is true. It was also not intended to buy us any performance gains, so this part is WAI.
or resolution of a specific issue elsewhere,
I strongly disagree with this assertion. The patch does resolve the issue we've been struggling with for a very long time. We currently do not have any other mechanism to avoid miscompilation in ptxas, which makes this pass essential to guarantee correctness.
While I understand your frustration with the performance regression, undoing the pass is not the way forward. I've already proposed few options that would help (escape hatch option to disable this pass if/when it's needed, working around the issue on the source code level), but for some reason you keep hammering on the "disable to fix part", without providing strong enough reasons for that. "let's break things for everyone so my code would run fast" is not a very strong argument.
I recommend removing the problematic pass altogether since the proposed solutions to address performance issues of the pass have not satisfied reviewers.
I recommend alternative options that do not require reintroducing the miscompilation for everyone with NVIDIA GPUs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I strongly disagree with this assertion. The patch does resolve the issue we've been struggling with for a very long time. We currently do not have any other mechanism to avoid miscompilation in ptxas, which makes this pass essential to guarantee correctness.
Thanks @Artem-B
My statements were derived from the contents of the original patch. I'd be happy to see issues the patch fixes, if you have access to any.
It appears that I may need to resort to modifying the source code as a an option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or reintroduce -nvptx-exit-on-unreachable=0
option. This would be a relatively uncontroversial change.
I'd be happy to see issues the patch fixes, if you have access to any.
The original patch description provides a good overview https://reviews.llvm.org/D152789
It all started here: https://bugs.llvm.org/show_bug.cgi?id=27738
Julia folks eventually came up with a concise reproducer:
JuliaGPU/CUDAnative.jl#4
Since then we've attempted to keep CFG structured (it helped a lot, but not completely.)
Over time the issue kept popping up, again, and again. E.g. we had to disable some loop transform passes that happened to trigger the issue.
7 years later we've finally figured out what seems to be triggering ptxas miscompilation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional reproducers can be found in JuliaGPU/CUDA.jl#1746 (comment), JuliaGPU/CUDA.jl#431 (comment), JuliaGPU/CUDA.jl#43 (comment). Basically, we've been running into this bug with various user applications roughly every year or so, necessitating more and more questionable workarounds (both in code, and in our compiler) until we finally found the root cause. Needless to say I'd be strongly opposed to reverting the fix.
However, the proposed -nvptx-exit-on-unreachable=0
may be a viable default at some point in the future, because NVIDIA should have fixed ptxas
to model trap
like exit
, i.e., we just need to make sure that every unreachable block ends with a trap
terminator (which I guess is the current behavior). So that seems like a good option to add back, and would help with your performance issue right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
every unreachable block ends with a trap terminator (which I guess is the current behavior).
I think we can enable unreachable->trap lowering with a flag, but by default unreachable is still lowered into nothing. AFAICT, trap will likely create the same performance regression as exit does in this case.
For NVPTX, we will continue to need trap
or exit
to avoid confusing ptxas about the intended control flow. If someone consciously wants/needs to trade off miscompilation vs performance, we'll still need to have an explicit "shoot this foot, please" option for that.
This option appears to be gone now. #67478 |
Addressing Lower unreachable to exit to allow ptxas to accurately reconstruct the CFG resulted in performance regression In scenarios where a BasicBlock contains only one unreachable instruction. Before the patch, in such a case the joint action of
nvptx-isel
andunreachable-mbb-elimination
effectively optimized the BasicBlock out. However, adding an exit command to such a BasicBlock, as introduced by Lower unreachable to exit to allow ptxas to accurately reconstruct the CFG, preserves it within the Control Flow Graph (CFG), thereby negatively impacting size and performance. To counteract this undesirable consequence, we choose to refrain from processing BasicBlocks with just one unreachable instruction in this pass.