Skip to content

[NVPTX] Skip processing BasicBlocks with single unreachable instruction in nvptx-lower-unreachable pass. #72641

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 14 additions & 1 deletion llvm/lib/Target/NVPTX/NVPTXLowerUnreachable.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -138,7 +138,19 @@ bool NVPTXLowerUnreachable::runOnFunction(Function &F) {
InlineAsm *Exit = InlineAsm::get(ExitFTy, "exit;", "", true);

bool Changed = false;
for (auto &BB : F)

// In scenarios where a BasicBlock contains only one unreachable instruction,
// the joint action of nvptx-isel and unreachable-mbb-elimination
// effectively optimizes the BasicBlock out. However, adding an exit
// command to such a BasicBlock, as suggested by this pass, preserves it
// within the Control Flow Graph (CFG), thereby negatively impacting size and
// performance. To counteract this undesirable consequence, we choose to
// refrain from processing BasicBlocks with just one unreachable instruction
// in this pass.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're still proposing to trade off correctness for performance and I still strongly believe it's not something we want to do.

I do not think we should apply this patch. If you do want to live dangerously, you may want to try resurrecting nvptx-exit-on-unreachable command line option and then apply it to the compilation.

If you do want this patch to land, you need a strong proof that not emitting exit in this particular case is safe, which will be hard to obtain, considering that all we know is that eliminating this explicit control flow hint results in a miscompilation by ptxas, NVIDIA's optimizing assembler, which we do not control or have much visibility into.

It took quite a few years, and multiple attempts to solve the issue, to eventually arrive at this workaround which appears to address the root cause. Partially undoing it does not make sense to me. "A little bit broken" is still broken, even if your particular use case happens to be fine.

Copy link
Contributor Author

@mmoadeli mmoadeli Feb 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're still proposing to trade off correctness for performance and I still strongly believe it's not something we want to do.

Thank you @Artem-B
I disagree with your comment. The exit with an additional node in CGF is introduced by this. If you build the test before that PR and with my change, you'll see they have identical ptx, whereas after that PR you'll see an additional node with exit in it.

If as claimed, this patch sacrifices correctness for performance, then before this we did not have correct behaviour in such situations.

I'd suggest comparing the ptx of three revisions. 1) before this, 2) after that, 3 after the fix introduced in this patch.

Overall, this patch fixes the overhead introduced by this which the attached test clearly shows.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not disagreeing that it does make a difference in your use case, and that in your case not having an exit is benign and produces faster code.

The trouble is that I can not say that the patch is benign for all NVPTX users.

As far as LLVM is concerned you patch is correct, but in case of NVIDIA GPUs, we also need to take into account NVIDIA's assembler and we do know that it miscompiles sufficiently convoluted code, unless we explicitly annotate unreachable code with an exit instead of allowing it to fall through. ptxas does not have the same info as LLVM and does not know that the code is unreachable. The fall-through effectively looks like a new CFG edge which confuses it, and results in thread mis-convergence.

I may be wrong and would be happy to be proven wrong. And example of the "works for me here" is not sufficient. I'll readily admit that it does, and will get back to my point above -- can we guarantee it to work for all users? Until we can, we should keep generating exit.

I would be open to resurrecting a hidden compiler flag to disable generation of exit, instead.

I personally do not know how to prove correctness in this case. Based on the past experience with this issue (~8 years of looking for the fix: https://bugs.llvm.org/show_bug.cgi?id=27738), empirical testing on this particular issue would also be insufficient. The issue tends to pop up in odd and unpredictable places and is very hard to reproduce and diagnose in most of those cases. So, please, take my word that there's a very good reason for the abundance of caution I'm advocating for.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Artem-B

ptx

Running the shared code built using syclos 8e92193 revision (related with D152789) (with --save-temps added to the last command in ethminier/ethminer/build.sh) produces a buildinfo-sycl-nvptx64-nvidia-cuda-sm_50-ae5a30.s file (811816 bytes) and reports the following measurements:

Total Execution Time: 61.0493 s
Total Number of Hashes: 9875488768
Overall Hash rate: 161.763 MH/s

Manually modifying the above ptx file to buildinfo-sycl-nvptx64-nvidia-cuda-sm_50-204989.s by moving the BasicBlock having exit added by the original nvptx-lower-unreachable pass in D152789 and rebuilding the binary yields the following measurements:

Total Execution Time: 61.177 s
Total Number of Hashes: 14647558144
Overall Hash rate: 239.429 MH/s

which is a massive %47 performance improvement.

We can't pinpoint a specific issue as to why a minor relocation of an added basic block back to its original position in the CFG, which is typically moved to the end of the CFG by block-placement, can have such a significant impact. Obviously, widening divergent areas could be a potential reason, which the original PR has aimed to address.
It's also challenging to modify the pass to prevent that particular basic block from being affected by optimisation passes. Such changes might not be straightforward and could introduce some complexity that may not align well with the the code standards. For instance, it may be achieved by having one extra pass to undo works done by block-placement optimisation, which may some don't fancy.

It would be valuable to have your input, and possibly input from @maleadt as well.

Thanks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what kind of input you're still looking for. I'm not disputing that you see a performance regression, but I still do not think it warrants disabling lowering of unreachable to exit.

I remain open to re-introduction of a hidden option to allow disabling it, as an escape hatch.

If you can figure out why a seemingly minor differences result in such a huge performance difference and if there's something we can do to improve the code, without risking correctness, that would be helpful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and the patch does not offer any evidence of a performance improvement

This is true. It was also not intended to buy us any performance gains, so this part is WAI.

or resolution of a specific issue elsewhere,

I strongly disagree with this assertion. The patch does resolve the issue we've been struggling with for a very long time. We currently do not have any other mechanism to avoid miscompilation in ptxas, which makes this pass essential to guarantee correctness.

While I understand your frustration with the performance regression, undoing the pass is not the way forward. I've already proposed few options that would help (escape hatch option to disable this pass if/when it's needed, working around the issue on the source code level), but for some reason you keep hammering on the "disable to fix part", without providing strong enough reasons for that. "let's break things for everyone so my code would run fast" is not a very strong argument.

I recommend removing the problematic pass altogether since the proposed solutions to address performance issues of the pass have not satisfied reviewers.

I recommend alternative options that do not require reintroducing the miscompilation for everyone with NVIDIA GPUs.

Copy link
Contributor Author

@mmoadeli mmoadeli Mar 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly disagree with this assertion. The patch does resolve the issue we've been struggling with for a very long time. We currently do not have any other mechanism to avoid miscompilation in ptxas, which makes this pass essential to guarantee correctness.

Thanks @Artem-B
My statements were derived from the contents of the original patch. I'd be happy to see issues the patch fixes, if you have access to any.
It appears that I may need to resort to modifying the source code as a an option.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or reintroduce -nvptx-exit-on-unreachable=0 option. This would be a relatively uncontroversial change.

I'd be happy to see issues the patch fixes, if you have access to any.

The original patch description provides a good overview https://reviews.llvm.org/D152789

It all started here: https://bugs.llvm.org/show_bug.cgi?id=27738
Julia folks eventually came up with a concise reproducer:
JuliaGPU/CUDAnative.jl#4

Since then we've attempted to keep CFG structured (it helped a lot, but not completely.)
Over time the issue kept popping up, again, and again. E.g. we had to disable some loop transform passes that happened to trigger the issue.

7 years later we've finally figured out what seems to be triggering ptxas miscompilation.

Copy link
Contributor

@maleadt maleadt Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional reproducers can be found in JuliaGPU/CUDA.jl#1746 (comment), JuliaGPU/CUDA.jl#431 (comment), JuliaGPU/CUDA.jl#43 (comment). Basically, we've been running into this bug with various user applications roughly every year or so, necessitating more and more questionable workarounds (both in code, and in our compiler) until we finally found the root cause. Needless to say I'd be strongly opposed to reverting the fix.

However, the proposed -nvptx-exit-on-unreachable=0 may be a viable default at some point in the future, because NVIDIA should have fixed ptxas to model trap like exit, i.e., we just need to make sure that every unreachable block ends with a trap terminator (which I guess is the current behavior). So that seems like a good option to add back, and would help with your performance issue right now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

every unreachable block ends with a trap terminator (which I guess is the current behavior).

I think we can enable unreachable->trap lowering with a flag, but by default unreachable is still lowered into nothing. AFAICT, trap will likely create the same performance regression as exit does in this case.

For NVPTX, we will continue to need trap or exit to avoid confusing ptxas about the intended control flow. If someone consciously wants/needs to trade off miscompilation vs performance, we'll still need to have an explicit "shoot this foot, please" option for that.

for (auto &BB : F) {
if ((BB.size() == 1) && (isa<UnreachableInst>(BB.front())))
continue;
for (auto &I : BB) {
if (auto unreachableInst = dyn_cast<UnreachableInst>(&I)) {
if (isLoweredToTrap(*unreachableInst))
Expand All @@ -147,6 +159,7 @@ bool NVPTXLowerUnreachable::runOnFunction(Function &F) {
Changed = true;
}
}
}
return Changed;
}

Expand Down
31 changes: 31 additions & 0 deletions llvm/test/CodeGen/NVPTX/unreachable-switch-case.ll
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
; RUN: llc < %s -march=nvptx -mcpu=sm_20 -verify-machineinstrs | FileCheck %s

declare noundef i32 @llvm.nvvm.read.ptx.sreg.tid.x() #1

define void @kernel_func() {

%1 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x()

switch i32 %1, label %unreachabledefault [
i32 0, label %bb0
i32 1, label %bb1
i32 2, label %bb1
i32 3, label %bb2
]

bb0:
ret void

bb1:
ret void

bb2:
ret void

unreachabledefault:
unreachable

; CHECK: @kernel_func
; CHECK-NOT: unreachabledefault
; CHECK: -- End function
}