[OpenMPOpt] Fix incorrect end-of-kernel barrier removal #65670

Merged 3 commits on Sep 27, 2023
16 changes: 11 additions & 5 deletions llvm/lib/Transforms/IPO/OpenMPOpt.cpp
@@ -2676,17 +2676,19 @@ struct AAExecutionDomainFunction : public AAExecutionDomain {
if (!ED.EncounteredAssumes.empty() && !A.isModulePass())
return;

// We can remove this barrier, if it is one, or all aligned barriers
// reaching the kernel end. In the latter case we can transitively work
// our way back until we find a barrier that guards a side-effect if we
// are dealing with the kernel end here.
// We can remove this barrier, if it is one, or aligned barriers reaching
// the kernel end (if CB is nullptr). Aligned barriers reaching the kernel
// end may have other successors besides the kernel end (especially if
// they're in loops) with non-local side-effects, so those barriers can
// only be removed if they also only reach the kernel end. If those
Member:

only be removed if they also only reach the kernel end.

That is not true, nor does it reflect what happens below. The kernel end is not an aligned barrier which is why your test case starts to work (the explicit aligned barrier does not only reach aligned barriers but also the kernel end and is consequently not removed).

Contributor Author:

This should be "if they also only reach the kernel end or other aligned barriers", right?

And since IsReachingAlignedBarrierOnly doesn't take side-effects like the @unknown() call into account, I'd need to check something else to make sure there aren't any extra side-effect instructions after this barrier that aren't ahead of the kernel end. Would I need to do another backwards propagation for that? It seems like EncounteredNonLocalSideEffect is only for side-effects preceding the execution domain.

Member (@jdoerfert, Sep 9, 2023):

This should be "if they also only reach the kernel end or other aligned barriers", right?

I don't think that is accurate, at least not with the rest in place. See my example (loop_barrier3) with an explicit barrier at the end. The one in the loop will hit only aligned barriers. The one at the end is useless and removed. The end -> barrier elimination will skip it and eliminate the one in the loop.

Here are my current thoughts:
The end -> barrier case is different because we delete the "former"(/only) barrier, not the latter one.
The logic we have is basically trying to determine when a barrier is not actually synchronizing/communicating anything, and it is wrong in the kernel end case because the barrier can (evidently) synchronize with itself.

The logic we use right now is fundamentally flawed since end -> barrier does not say anything about other paths from barrier to somewhere else. In fact, there is no need for the threads reaching the barrier to go (directly) to the end at all. I think this is the crucial point. Our logic only holds if the barrier we are about to eliminate, due to the kernel end case, cannot reach any other synchronization point. If it can, the path from it to the end is not enough reason to remove it. Even if it can only reach other aligned barriers, e.g.,:

if (tid == 7)
  G = 1;
aligned_barrier();        // currently eliminated, reached by kernel end
if (runtime_false())
  return; // kernel exit
aligned_barrier();        // currently eliminated, no side effect since the last barrier
use(G);

I see two paths forward:

  1. Remove the kernel end special case (and change the kernel end everywhere to not be an aligned barrier). This is simple and prevents the problem, but might cause some leftover barriers.
  2. Verify that each barrier we remove in the kernel end case will not reach any other sync point. The easiest solution is to start at the kernel end, and only remove barriers in blocks for which the kernel end block is the unique successor.
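Option 2 above can be sketched as a small CFG check. This is a toy model with made-up names, not the pass's actual data structures:

```python
# Hypothetical sketch of option 2: in the kernel-end case, only remove a
# barrier when the kernel-end block is the unique successor of the
# barrier's block, so the barrier cannot reach any other sync point.

def can_remove_end_barrier(barrier_block, kernel_end_block, successors):
    """True iff every path from the barrier leads directly, and only,
    to the kernel end."""
    return successors.get(barrier_block, []) == [kernel_end_block]

# CFG mirroring loop_barrier3: the loop block can branch back to itself,
# so its barrier must be kept; the exit block falls through only to the
# kernel end, so its barrier may be removed.
successors = {
    "loop": ["loop", "exit"],
    "exit": ["end"],
}
assert not can_remove_end_barrier("loop", "end", successors)
assert can_remove_end_barrier("exit", "end", successors)
```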

WDYT?

Contributor Author:

I also think what we're doing now with special-casing the kernel end is not the right approach, but I think there is some value in generalizing the "former" barrier removal if we can get the conditions for it right. That would help in cases like this:

define void @only_first_barrier_redundant(i1 %c) "kernel" {
  call void @unknown()
  br i1 %c, label %if, label %end

if:
  call void @unknown()
  call void @aligned_barrier() ; Barrier A
  br label %end

end:
  call void @aligned_barrier() ; Barrier B
  call void @unknown()
  ret void
}

We can't remove Barrier B in this case and we can't remove Barrier A using our normal "latter" barrier removal because it has an unknown call before it. But Barrier A is still redundant here because it's always immediately followed by Barrier B, so if we had "former" barrier removal in cases other than the kernel end case we could safely remove it.
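The generalized "former" barrier condition described above can be sketched as follows. This is an illustrative toy model, not the pass's implementation; the predicate name is made up:

```python
# Hypothetical sketch of generalized "former" barrier removal: a barrier
# is redundant if, on every CFG path leaving it, it reaches another
# aligned barrier with no intervening side effects.

def former_barrier_redundant(paths_after_barrier):
    """paths_after_barrier: for each path leaving the barrier, the
    instruction sequence up to and including the next barrier."""
    return all(
        path
        and path[-1] == "aligned_barrier"
        and all(inst == "aligned_barrier" for inst in path)
        for path in paths_after_barrier
    )

# Barrier A in @only_first_barrier_redundant: its only path immediately
# reaches Barrier B, so it is redundant.
assert former_barrier_redundant([["aligned_barrier"]])
# A side-effecting call before the next barrier blocks removal.
assert not former_barrier_redundant([["unknown", "aligned_barrier"]])
```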

As in your example with a conditional return between the barriers, one of the big challenges here is that if we're removing both former and latter barriers in the common straight-line case we'll remove the former barrier because it's immediately followed by a barrier and also the latter barrier because it's immediately preceded by a barrier. This is a stale analysis problem, though: once we remove one of the barriers it's no longer the case that the remaining barrier does not have any side effects (or only reaches aligned barriers) before/after it. We could fix this by making sure the analysis is up-to-date, either by propagating the results from barriers as we remove them or by re-computing the analysis between the first sweep removing latter barriers and the second sweep removing former barriers.
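The stale-analysis hazard in the straight-line case can be made concrete with a toy model (illustrative only, not OpenMPOpt's code): each of two adjacent barriers is redundant only while the other still exists.

```python
# Two adjacent barriers, A immediately followed by B. A is a redundant
# "former" barrier (a barrier follows it); B is a redundant "latter"
# barrier (a barrier precedes it) -- but only while the other remains.

barriers = ["A", "B"]

def redundant_former(b, remaining):
    idx = remaining.index(b)
    return idx + 1 < len(remaining)  # another barrier still follows

def redundant_latter(b, remaining):
    return remaining.index(b) > 0    # another barrier still precedes

# Naive: decide both removals against the ORIGINAL list -> both removed,
# which is unsound (no barrier left at all).
naive_removed = [b for b in barriers
                 if redundant_former(b, barriers) or redundant_latter(b, barriers)]
assert naive_removed == ["A", "B"]

# Refreshing the analysis after each removal keeps one barrier.
remaining = list(barriers)
for b in list(remaining):
    if redundant_former(b, remaining) or redundant_latter(b, remaining):
        remaining.remove(b)
assert remaining == ["B"]
```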

This is a more ambitious strategy that will take more time to get right, though. For now, I think keeping what we do now, but only removing barriers in the end -> barrier case when the kernel end is the unique successor, is a reasonable solution: it should fix all of the cases where barriers are being incorrectly removed while still removing barriers in simple cases. I'll update this PR to implement this instead.

Member:

I agree that this is reasonable. Also, change the initialization of ExitED; the kernel end is never an aligned barrier.

Contributor Author:

Done

// barriers have other barriers reaching them, those can be transitively
// removed as well.
if (CB) {
DeletedBarriers.insert(CB);
A.deleteAfterManifest(*CB);
++NumBarriersEliminated;
Changed = ChangeStatus::CHANGED;
} else if (!ED.AlignedBarriers.empty()) {
Member:

Shouldn't we just delete 2729-2731 and this conditional? I think the idea that the kernel end is an aligned barrier was fundamentally flawed.

Contributor Author:

I'm still wrapping my head around this pass, but from my understanding it doesn't seem unreasonable to treat the kernel end as an aligned barrier. Even if the kernel exit instructions themselves are divergent and different ones are reached in different threads, it seems like we can still model the control flow as re-converging after the different kernel exits before our implicit post-kernel barrier.

My understanding is that it's legal and useful to remove barriers near the end of kernels in cases like this:

define void @simple_end_barrier() "kernel" {
  call void @unknown()
  call void @aligned_barrier() ; Barrier A
  ret void
}

It seems like we need a concept of a post-kernel barrier to do this, though, since otherwise Barrier A isn't redundant and can't be removed.
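The intuition above (modeling an implicit post-kernel barrier so that a trailing explicit barrier becomes redundant) can be sketched as a toy model. All names are made up, and, as the reviewer notes below, whether the kernel end really behaves like an aligned barrier is exactly what is in dispute:

```python
# Toy model: append an implicit end-of-kernel barrier, then a trailing
# explicit barrier is redundant when only barriers separate it from the
# kernel end. Purely illustrative; not the pass's representation.

def redundant_trailing_barrier(instructions):
    events = instructions + ["implicit_end_barrier"]
    for i, inst in enumerate(events):
        if inst == "aligned_barrier":
            rest = events[i + 1:]
            if all(x in ("aligned_barrier", "implicit_end_barrier")
                   for x in rest):
                return True
    return False

# @simple_end_barrier: unknown(), barrier, ret -> barrier removable.
assert redundant_trailing_barrier(["unknown", "aligned_barrier"])
# A side effect after the barrier keeps it (cf. loop_barrier3's store).
assert not redundant_trailing_barrier(["aligned_barrier", "store_G"])
```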

Member:

We might want to treat the kernel end special, but it is not an aligned barrier. Threads are allowed to run into the kernel end independently, and the others can do something else in the meantime, e.g., wait in an aligned barrier. What I'm trying to say is that our current abstraction is too coarse-grained. That said, you might be right that for the purpose of barrier elimination, we can continue to "handle" the kernel end.

That said, I'm still not sure if this is the right way to fix this. I think the change in multiple_blocks_functions_kernel_effects_0 actually exposes a different error we should first address.

Member:

First, though, can we have a few more tests (and please pre-commit the tests)?

define void @loop_barrier2() "kernel" {
entry:
  br label %loop

loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  call void @unknown()
  call void @aligned_barrier()
  %i.next = add nuw nsw i32 %i, 1
  %cond = icmp ne i32 %i.next, 128
  br i1 %cond, label %loop, label %exit

exit:
  call void @aligned_barrier()
  call void @aligned_barrier()
  call void @aligned_barrier()
  call void @aligned_barrier()
  ret void
}

Also add an unknown call in between the 4 aligned barriers.
Then, replace unknown with a store to a generic global.
Now the fix you proposed fails and we race on that global, at least if we put an explicit barrier at the end, e.g.,:

define void @loop_barrier3() "kernel" {
entry:
  br label %loop

loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  store i32 %i, ptr @G1
  call void @aligned_barrier()
  %i.next = add nuw nsw i32 %i, 1
  %cond = icmp ne i32 %i.next, 128
  br i1 %cond, label %loop, label %exit

exit:
  call void @aligned_barrier()
  ret void
}

I'm still thinking about this.

Contributor Author:

I've added the requested testcases in my update.

With this new update, the three barriers in multiple_blocks_functions_kernel_effects_0 are back to being removed, but is this the correct behavior? It seems to me that we should preserve at least one of the barriers between the barrier_then_write0 calls and the write_then_barrier0 call.

Member:

I think you are right.

Contributor Author:

I'm seeing the aligned_barriers from both write_then_barrier0 and m3 in multiple_blocks_functions_kernel_effects_0 added to that function's end-of-function ED's AlignedBarriers. The write_then_barrier0 aligned_barrier is ignored because it's in another function, but it looks like we skip over the m3 aligned_barrier because it's already been removed, and then (incorrectly) remove the three other barriers because they precede the m3 aligned_barrier. Is this the expected behavior, or should the end-of-function ED for multiple_blocks_functions_kernel_effects_0 only have the write_then_barrier0 aligned_barrier in AlignedBarriers, not the one in m3?

Contributor Author:

This issue with the three barriers in multiple_blocks_functions_kernel_effects_0 seems like it isn't directly related to the main bug this PR is fixing; would it make sense to merge this PR now and keep working on a fix for these barriers being incorrectly removed as a separate change?

Contributor Author:

@jdoerfert, is this issue with these barriers still being removed what's holding up this change? It looks to me like the fix for this should be separate from the fix I've implemented here to only remove barriers with a straight line to the kernel end, but is there something I'm missing?

NumBarriersEliminated += ED.AlignedBarriers.size();
Changed = ChangeStatus::CHANGED;
SmallVector<CallBase *> Worklist(ED.AlignedBarriers.begin(),
ED.AlignedBarriers.end());
@@ -2697,7 +2699,11 @@ struct AAExecutionDomainFunction : public AAExecutionDomain {
continue;
if (LastCB->getFunction() != getAnchorScope())
continue;
const ExecutionDomainTy &PostLastED = CEDMap[{LastCB, POST}];
if (!PostLastED.IsReachingAlignedBarrierOnly)
continue;
if (!DeletedBarriers.count(LastCB)) {
++NumBarriersEliminated;
A.deleteAfterManifest(*LastCB);
continue;
}
37 changes: 36 additions & 1 deletion llvm/test/Transforms/OpenMP/barrier_removal.ll
@@ -736,13 +736,16 @@ define void @multiple_blocks_functions_kernel_effects_0(i1 %c0, i1 %c1, ptr %p)
; MODULE-NEXT: call void @barrier_then_write0(ptr [[P]])
; MODULE-NEXT: br label [[T0B3:%.*]]
; MODULE: t0b3:
; MODULE-NEXT: call void @aligned_barrier()
; MODULE-NEXT: br label [[M3:%.*]]
; MODULE: f03:
; MODULE-NEXT: call void @barrier_then_write0(ptr [[P]])
; MODULE-NEXT: br i1 [[C1]], label [[T13:%.*]], label [[F13:%.*]]
; MODULE: t13:
; MODULE-NEXT: call void @aligned_barrier()
; MODULE-NEXT: br label [[M3]]
; MODULE: f13:
; MODULE-NEXT: call void @aligned_barrier()
; MODULE-NEXT: br label [[M3]]
; MODULE: m3:
; MODULE-NEXT: call void @write_then_barrier0(ptr [[P]])
@@ -1065,8 +1068,38 @@ define void @caller_barrier2() "kernel" {
ret void
}

define void @loop_barrier() "kernel" {
; CHECK-LABEL: define {{[^@]+}}@loop_barrier
; CHECK-SAME: () #[[ATTR4]] {
; CHECK-NEXT: entry:
; CHECK-NEXT: br label [[LOOP:%.*]]
; CHECK: loop:
; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[I_NEXT:%.*]], [[LOOP]] ]
; CHECK-NEXT: call void @unknown()
; CHECK-NEXT: call void @aligned_barrier()
; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i32 [[I]], 1
; CHECK-NEXT: [[COND:%.*]] = icmp ne i32 [[I_NEXT]], 128
; CHECK-NEXT: br i1 [[COND]], label [[LOOP]], label [[EXIT:%.*]]
; CHECK: exit:
; CHECK-NEXT: ret void
;
entry:
br label %loop

loop:
%i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
call void @unknown()
call void @aligned_barrier()
%i.next = add nuw nsw i32 %i, 1
%cond = icmp ne i32 %i.next, 128
br i1 %cond, label %loop, label %exit

exit:
ret void
}

!llvm.module.flags = !{!16,!15}
!nvvm.annotations = !{!0,!1,!2,!3,!4,!5,!6,!7,!8,!9,!10,!11,!12,!13,!14,!17,!18,!19,!20,!21,!22,!23,!24,!25}
!nvvm.annotations = !{!0,!1,!2,!3,!4,!5,!6,!7,!8,!9,!10,!11,!12,!13,!14,!17,!18,!19,!20,!21,!22,!23,!24,!25,!26}

!0 = !{ptr @pos_empty_1, !"kernel", i32 1}
!1 = !{ptr @pos_empty_2, !"kernel", i32 1}
@@ -1079,6 +1112,7 @@ define void @caller_barrier2() "kernel" {
!23 = !{ptr @pos_empty_8, !"kernel", i32 1}
!24 = !{ptr @caller_barrier1, !"kernel", i32 1}
!25 = !{ptr @caller_barrier2, !"kernel", i32 1}
!26 = !{ptr @loop_barrier, !"kernel", i32 1}
!6 = !{ptr @neg_empty_8, !"kernel", i32 1}
!19 = !{ptr @neg_empty_9, !"kernel", i32 1}
!20 = !{ptr @pos_empty_10, !"kernel", i32 1}
@@ -1128,4 +1162,5 @@ define void @caller_barrier2() "kernel" {
; CHECK: [[META23:![0-9]+]] = !{ptr @pos_empty_8, !"kernel", i32 1}
; CHECK: [[META24:![0-9]+]] = !{ptr @caller_barrier1, !"kernel", i32 1}
; CHECK: [[META25:![0-9]+]] = !{ptr @caller_barrier2, !"kernel", i32 1}
; CHECK: [[META26:![0-9]+]] = !{ptr @loop_barrier, !"kernel", i32 1}
;.