[NVPTX] Skip processing BasicBlocks with single unreachable instruction in `nvptx-lower-unreachable` pass. #72641

mmoadeli · 2023-11-17T12:09:52Z

Addressing Lower unreachable to exit to allow ptxas to accurately reconstruct the CFG resulted in performance regression In scenarios where a BasicBlock contains only one unreachable instruction. Before the patch, in such a case the joint action of nvptx-isel and unreachable-mbb-elimination effectively optimized the BasicBlock out. However, adding an exit command to such a BasicBlock, as introduced by Lower unreachable to exit to allow ptxas to accurately reconstruct the CFG, preserves it within the Control Flow Graph (CFG), thereby negatively impacting size and performance. To counteract this undesirable consequence, we choose to refrain from processing BasicBlocks with just one unreachable instruction in this pass.

Artem-B · 2024-01-09T20:00:03Z

Can you elaborate on the actual impact the exit does have on performance?

While exit does result in code with more branches on PTX level, SASS looks largely identical with the difference that unreachable code does call exit. https://godbolt.org/z/s8f5nc8cs

I'm concerned that special-casing unreachable in switch we may make it possible that we re-open a possibility that we'll re-introduce the original issue where we break structured CFG for ptxas if the switch ends up being replicated within more complicated CFG.

mmoadeli · 2024-01-10T12:00:26Z

Can you elaborate on the actual impact the exit does have on performance?

While exit does result in code with more branches on PTX level, SASS looks largely identical with the difference that unreachable code does call exit. https://godbolt.org/z/s8f5nc8cs

I'm concerned that special-casing unreachable in switch we may make it possible that we re-open a possibility that we'll re-introduce the original issue where we break structured CFG for ptxas if the switch ends up being replicated within more complicated CFG.

This is coming from a DPCPP performance evaluation which showed 3% performance loss in a benchmark after introducing Lower unreachable to exit to allow ptxas to accurately reconstruct the CFG. I can share the ir and ptx of the related code before and after commit, if it helps.

Artem-B · 2024-01-10T20:07:47Z

If you could also point at the source code that resulted in generating the switch that may help, too.

AFAICT, the unreachable->exit change did not cause any obvious performance regressions on our benchmarks, so the regression may be specific to your tests.
The fact that compiler decided to keep the unreachable/exit around may indicate that it could not rule out other values.
It may be possible to give compiler an explicit hint which would allow elimination of the unreachable branch.

I'm really reluctant risking resurrection of a nasty miscompilation bug for the sake of a minor performance gain on a niche benchmark. The bar here is relatively high. It would either have to be a widespread regression with no workaround, or the patch would need to have a strong guarantee that it it can never break structured control flow (and I'm not quite sure how to do that. We'll need someone with more expertise than myself. @arsenm, @maleadt any suggestions? ).

mmoadeli · 2024-01-15T10:30:03Z

Hi @Artem-B
Apologies for delayed response. I am not able to share the zip files and other related documents here. This OneDrive shared folder contains the Reproducer .
I'd be happy to share the contents in any other way, should you have other preferences.

maleadt · 2024-01-16T08:37:28Z

It may be possible to give compiler an explicit hint which would allow elimination of the unreachable branch.

I'm unfamiliar with LLVM's machine optimization level, but conceptually it does seem like a better/safer approach to try and restore the compiler's ability to remove the unreachable default block in the presence of exit, instead of reducing the scope of the NVPTXLowerUnreachablePass and risk re-introducing this bug (which seems like a possibility if the unreachable block can't get optimized away, for whatever reason, which could then result in widening of the divergent region again).

Artem-B · 2024-01-16T18:35:22Z

Comparing the PTX in your reproducer I do not see any interesting differences. All the code gets optimized exactly the same way, down to exactly the same number of registers, so the diff is very clean. The exit is not blocking any optimizations and the difference is literally one extra predicated branch to exit at the end of a function.

I think compiler does exactly what it should be doing and that this particular issue must be handled at the user source code level. If some switch values are truly impossible, it should be made explicit in the source code and that should allow compiler to optimize that extra jump away. Previously the code had apparently relied on UB that resulted in a cost-free fall-through through the unreachable part. If the fall-through is intentional, or impossible and we expect compiler to do something specific, the user code should be explicit about that. We should not be restoring any particular kind of undefined behavior we may have done in the past.

mmoadeli · 2024-01-16T22:48:07Z

Comparing the PTX in your reproducer I do not see any interesting differences. All the code gets optimized exactly the same way, down to exactly the same number of registers, so the diff is very clean. The exit is not blocking any optimizations and the difference is literally one extra predicated branch to exit at the end of a function.

I think compiler does exactly what it should be doing and that this particular issue must be handled at the user source code level. If some switch values are truly impossible, it should be made explicit in the source code and that should allow compiler to optimize that extra jump away. Previously the code had apparently relied on UB that resulted in a cost-free fall-through through the unreachable part. If the fall-through is intentional, or impossible and we expect compiler to do something specific, the user code should be explicit about that. We should not be restoring any particular kind of undefined behavior we may have done in the past.

I agree the there is not much difference between the two ptx. However, it is claimed that it made ethminer -3% performance regression on A100 CUDA. Moreover, to the best of my understanding of the code, there is nothing to be done at source level to address this issue.

Artem-B · 2024-01-16T23:05:50Z

You can use -mllvm -nvptx-exit-on-unreachable=0 option to disable lowering unreachable as exit, as a workaround.

there is nothing to be done at source level to address this issue.

That remains to be seen. There are certainly ways to tell compiler that some values are guaranteed not to be seen. E.g. something as simple as an explicit if (threadIdx.x <4 ) switch { case 0...3 ...} would probably do the trick.

Updates the `nvptx-lower-unreachable` pass to bypass BasicBlocks containing just one unreachable instruction. This allows further optimization by subsequent passes, enhancing overall performance and efficiency.

mmoadeli · 2024-02-15T10:16:00Z

You can use -mllvm -nvptx-exit-on-unreachable=0 option to disable lowering unreachable as exit, as a workaround.

there is nothing to be done at source level to address this issue.

That remains to be seen. There are certainly ways to tell compiler that some values are guaranteed not to be seen. E.g. something as simple as an explicit if (threadIdx.x <4 ) switch { case 0...3 ...} would probably do the trick.

Thanks @Artem-B and apologies for late reply.
I updated the PR to be more generic and not specific to a particular instruction.

Artem-B · 2024-02-15T19:35:23Z

llvm/lib/Target/NVPTX/NVPTXLowerUnreachable.cpp

+  // performance. To counteract this undesirable consequence, we choose to
+  // refrain from processing BasicBlocks with just one unreachable instruction
+  // in this pass.
+


You're still proposing to trade off correctness for performance and I still strongly believe it's not something we want to do.

I do not think we should apply this patch. If you do want to live dangerously, you may want to try resurrecting nvptx-exit-on-unreachable command line option and then apply it to the compilation.

If you do want this patch to land, you need a strong proof that not emitting exit in this particular case is safe, which will be hard to obtain, considering that all we know is that eliminating this explicit control flow hint results in a miscompilation by ptxas, NVIDIA's optimizing assembler, which we do not control or have much visibility into.

It took quite a few years, and multiple attempts to solve the issue, to eventually arrive at this workaround which appears to address the root cause. Partially undoing it does not make sense to me. "A little bit broken" is still broken, even if your particular use case happens to be fine.

You're still proposing to trade off correctness for performance and I still strongly believe it's not something we want to do.

Thank you @Artem-B
I disagree with your comment. The exit with an additional node in CGF is introduced by this. If you build the test before that PR and with my change, you'll see they have identical ptx, whereas after that PR you'll see an additional node with exit in it.

If as claimed, this patch sacrifices correctness for performance, then before this we did not have correct behaviour in such situations.

I'd suggest comparing the ptx of three revisions. 1) before this, 2) after that, 3 after the fix introduced in this patch.

Overall, this patch fixes the overhead introduced by this which the attached test clearly shows.

I'm not disagreeing that it does make a difference in your use case, and that in your case not having an exit is benign and produces faster code.

The trouble is that I can not say that the patch is benign for all NVPTX users.

As far as LLVM is concerned you patch is correct, but in case of NVIDIA GPUs, we also need to take into account NVIDIA's assembler and we do know that it miscompiles sufficiently convoluted code, unless we explicitly annotate unreachable code with an exit instead of allowing it to fall through. ptxas does not have the same info as LLVM and does not know that the code is unreachable. The fall-through effectively looks like a new CFG edge which confuses it, and results in thread mis-convergence.

I may be wrong and would be happy to be proven wrong. And example of the "works for me here" is not sufficient. I'll readily admit that it does, and will get back to my point above -- can we guarantee it to work for all users? Until we can, we should keep generating exit.

I would be open to resurrecting a hidden compiler flag to disable generation of exit, instead.

I personally do not know how to prove correctness in this case. Based on the past experience with this issue (~8 years of looking for the fix: https://bugs.llvm.org/show_bug.cgi?id=27738), empirical testing on this particular issue would also be insufficient. The issue tends to pop up in odd and unpredictable places and is very hard to reproduce and diagnose in most of those cases. So, please, take my word that there's a very good reason for the abundance of caution I'm advocating for.

@Artem-B

ptx

Running the shared code built using syclos 8e92193 revision (related with D152789) (with --save-temps added to the last command in ethminier/ethminer/build.sh) produces a buildinfo-sycl-nvptx64-nvidia-cuda-sm_50-ae5a30.s file (811816 bytes) and reports the following measurements:

Total Execution Time: 61.0493 s Total Number of Hashes: 9875488768 Overall Hash rate: 161.763 MH/s

Manually modifying the above ptx file to buildinfo-sycl-nvptx64-nvidia-cuda-sm_50-204989.s by moving the BasicBlock having exit added by the original nvptx-lower-unreachable pass in D152789 and rebuilding the binary yields the following measurements:

Total Execution Time: 61.177 s Total Number of Hashes: 14647558144 Overall Hash rate: 239.429 MH/s

which is a massive %47 performance improvement.

We can't pinpoint a specific issue as to why a minor relocation of an added basic block back to its original position in the CFG, which is typically moved to the end of the CFG by block-placement, can have such a significant impact. Obviously, widening divergent areas could be a potential reason, which the original PR has aimed to address.
It's also challenging to modify the pass to prevent that particular basic block from being affected by optimisation passes. Such changes might not be straightforward and could introduce some complexity that may not align well with the the code standards. For instance, it may be achieved by having one extra pass to undo works done by block-placement optimisation, which may some don't fancy.

It would be valuable to have your input, and possibly input from @maleadt as well.

Thanks

I'm not sure what kind of input you're still looking for. I'm not disputing that you see a performance regression, but I still do not think it warrants disabling lowering of unreachable to exit.

I remain open to re-introduction of a hidden option to allow disabling it, as an escape hatch.

If you can figure out why a seemingly minor differences result in such a huge performance difference and if there's something we can do to improve the code, without risking correctness, that would be helpful.

and the patch does not offer any evidence of a performance improvement

This is true. It was also not intended to buy us any performance gains, so this part is WAI.

or resolution of a specific issue elsewhere,

I strongly disagree with this assertion. The patch does resolve the issue we've been struggling with for a very long time. We currently do not have any other mechanism to avoid miscompilation in ptxas, which makes this pass essential to guarantee correctness.

While I understand your frustration with the performance regression, undoing the pass is not the way forward. I've already proposed few options that would help (escape hatch option to disable this pass if/when it's needed, working around the issue on the source code level), but for some reason you keep hammering on the "disable to fix part", without providing strong enough reasons for that. "let's break things for everyone so my code would run fast" is not a very strong argument.

I recommend removing the problematic pass altogether since the proposed solutions to address performance issues of the pass have not satisfied reviewers.

I recommend alternative options that do not require reintroducing the miscompilation for everyone with NVIDIA GPUs.

I strongly disagree with this assertion. The patch does resolve the issue we've been struggling with for a very long time. We currently do not have any other mechanism to avoid miscompilation in ptxas, which makes this pass essential to guarantee correctness.

Thanks @Artem-B
My statements were derived from the contents of the original patch. I'd be happy to see issues the patch fixes, if you have access to any.
It appears that I may need to resort to modifying the source code as a an option.

Or reintroduce -nvptx-exit-on-unreachable=0 option. This would be a relatively uncontroversial change.

I'd be happy to see issues the patch fixes, if you have access to any.

The original patch description provides a good overview https://reviews.llvm.org/D152789

It all started here: https://bugs.llvm.org/show_bug.cgi?id=27738
Julia folks eventually came up with a concise reproducer:
JuliaGPU/CUDAnative.jl#4

Since then we've attempted to keep CFG structured (it helped a lot, but not completely.)
Over time the issue kept popping up, again, and again. E.g. we had to disable some loop transform passes that happened to trigger the issue.

7 years later we've finally figured out what seems to be triggering ptxas miscompilation.

Additional reproducers can be found in JuliaGPU/CUDA.jl#1746 (comment), JuliaGPU/CUDA.jl#431 (comment), JuliaGPU/CUDA.jl#43 (comment). Basically, we've been running into this bug with various user applications roughly every year or so, necessitating more and more questionable workarounds (both in code, and in our compiler) until we finally found the root cause. Needless to say I'd be strongly opposed to reverting the fix.

However, the proposed -nvptx-exit-on-unreachable=0 may be a viable default at some point in the future, because NVIDIA should have fixed ptxas to model trap like exit, i.e., we just need to make sure that every unreachable block ends with a trap terminator (which I guess is the current behavior). So that seems like a good option to add back, and would help with your performance issue right now.

every unreachable block ends with a trap terminator (which I guess is the current behavior).

I think we can enable unreachable->trap lowering with a flag, but by default unreachable is still lowered into nothing. AFAICT, trap will likely create the same performance regression as exit does in this case.

For NVPTX, we will continue to need trap or exit to avoid confusing ptxas about the intended control flow. If someone consciously wants/needs to trade off miscompilation vs performance, we'll still need to have an explicit "shoot this foot, please" option for that.

Artem-B · 2024-02-15T19:36:21Z

You can use -mllvm -nvptx-exit-on-unreachable=0

This option appears to be gone now. #67478

Handle unreachable default in llvm::SwitchIns.

98061ec

ldrumm requested a review from Artem-B December 12, 2023 15:39

arsenm added the backend:NVPTX label Jan 11, 2024

Skip processing BasicBlocks with single unreachable instruction.

70fa78e

Updates the `nvptx-lower-unreachable` pass to bypass BasicBlocks containing just one unreachable instruction. This allows further optimization by subsequent passes, enhancing overall performance and efficiency.

mmoadeli changed the title ~~[NVPTX] Handle unreachable default in llvm::SwitchIns.~~ [NVPTX] Skip processing BasicBlocks with single unreachable instruction in nvptx-lower-unreachable pass. Feb 15, 2024

Artem-B mentioned this pull request Feb 15, 2024

[NVPTX] Fix code generation for trap-unreachable. #67478

Merged

Artem-B reviewed Feb 15, 2024

View reviewed changes

mmoadeli mentioned this pull request Apr 5, 2024

[ethminer] Remove unreachable default case in switch statements to prevent potential performance regression. oneapi-src/Velocity-Bench#53

Merged

mmoadeli closed this by deleting the head repository Sep 24, 2024

[NVPTX] Skip processing BasicBlocks with single unreachable instruction in nvptx-lower-unreachable pass. #72641

[NVPTX] Skip processing BasicBlocks with single unreachable instruction in nvptx-lower-unreachable pass. #72641

Uh oh!

Conversation

mmoadeli commented Nov 17, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Artem-B commented Jan 9, 2024

Uh oh!

mmoadeli commented Jan 10, 2024

Uh oh!

Artem-B commented Jan 10, 2024

Uh oh!

mmoadeli commented Jan 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

maleadt commented Jan 16, 2024

Uh oh!

Artem-B commented Jan 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mmoadeli commented Jan 16, 2024

Uh oh!

Artem-B commented Jan 16, 2024

Uh oh!

mmoadeli commented Feb 15, 2024

Uh oh!

Artem-B Feb 15, 2024

Choose a reason for hiding this comment

Uh oh!

mmoadeli Feb 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Artem-B Feb 16, 2024

Choose a reason for hiding this comment

Uh oh!

mmoadeli Mar 4, 2024

Choose a reason for hiding this comment

Uh oh!

Artem-B Mar 4, 2024

Choose a reason for hiding this comment

Uh oh!

Artem-B Mar 5, 2024

Choose a reason for hiding this comment

Uh oh!

mmoadeli Mar 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Artem-B Mar 5, 2024

Choose a reason for hiding this comment

Uh oh!

maleadt Mar 6, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Artem-B Mar 7, 2024

Choose a reason for hiding this comment

Uh oh!

Artem-B commented Feb 15, 2024

Uh oh!

Uh oh!

[NVPTX] Skip processing BasicBlocks with single unreachable instruction in `nvptx-lower-unreachable` pass. #72641

[NVPTX] Skip processing BasicBlocks with single unreachable instruction in `nvptx-lower-unreachable` pass. #72641

mmoadeli commented Nov 17, 2023 •

edited

Loading

mmoadeli commented Jan 15, 2024 •

edited

Loading

Artem-B commented Jan 16, 2024 •

edited

Loading

mmoadeli Feb 15, 2024 •

edited

Loading

mmoadeli Mar 5, 2024 •

edited

Loading

maleadt Mar 6, 2024 •

edited

Loading