[OpenMPOpt] Fix incorrect end-of-kernel barrier removal #65670

Merged: 3 commits into llvm:main on Sep 27, 2023

Conversation

@dwoodwor-intel (Contributor) commented Sep 7, 2023

Barrier removal in OpenMPOpt normally removes barriers by proving that they are redundant with barriers preceding them. However, it can't do this with the "pseudo-barrier" at the end of kernels, because that barrier can't be removed. Instead, it removes the barriers preceding the end of the kernel that the end-of-kernel barrier is redundant with. However, these barriers aren't always redundant with the end-of-kernel barrier when loops are involved, and removing them can lead to incorrect results in compiled code.

This change fixes this by requiring that these pre-end-of-kernel barriers also have the kernel end as a unique successor before removing them. It also changes the initialization of ExitED for kernels since the kernel end is not an aligned barrier.

Barrier removal in OpenMPOpt normally removes barriers by proving that
they are redundant with barriers preceding them. However, it can't do
this with the "pseudo-barrier" at the end of kernels because that can't
be removed. Instead, it removes the barriers preceding the end of the
kernel which that end-of-kernel barrier is redundant with. However,
these barriers aren't always redundant with the end-of-kernel barrier
when loops are involved, and removing them can lead to incorrect results
in compiled code.

This change fixes this by requiring that these pre-end-of-kernel
barriers are also only reaching aligned barriers before removing them.
@dwoodwor-intel dwoodwor-intel requested a review from a team as a code owner September 7, 2023 20:36
@jdoerfert (Member) left a comment

I need to look at it in more detail, but the reasoning makes sense (24656e9 already tried to fix this).

// they're in loops) with non-local side-effects, so those barriers can
// only be removed if they also only reach the kernel end. If those
// barriers have other barriers reaching them, those can be transitively
// removed as well.
if (CB) {
DeletedBarriers.insert(CB);
A.deleteAfterManifest(*CB);
++NumBarriersEliminated;
Changed = ChangeStatus::CHANGED;
} else if (!ED.AlignedBarriers.empty()) {
@jdoerfert (Member):

Shouldn't we just delete 2729-2731 and this conditional? I think the idea that the kernel end is an aligned barrier was fundamentally flawed.

@dwoodwor-intel (Contributor, author):

I'm still wrapping my head around this pass, but from my understanding it doesn't seem unreasonable to treat the kernel end as an aligned barrier. Even if the kernel exit instructions themselves are divergent and different ones are reached in different threads, it seems like we can still model the control flow as re-converging after the different kernel exits before our implicit post-kernel barrier.

My understanding is that it's legal and useful to remove barriers near the end of kernels in cases like this:

define void @simple_end_barrier() "kernel" {
  call void @unknown()
  call void @aligned_barrier() ; Barrier A
  ret void
}

It seems like we need a concept of a post-kernel barrier to do this, though, since otherwise Barrier A isn't redundant and can't be removed.

@jdoerfert (Member):

We might want to treat the kernel end special, but it is not an aligned barrier. Threads are allowed to run into the kernel end independently, and the others can do something else in the meantime, e.g., wait in an aligned barrier. What I'm trying to say is that our current abstraction is too coarse-grained. That said, you might be right that for the purpose of barrier elimination, we can continue to "handle" the kernel end.

That said, I'm still not sure if this is the right way to fix this. I think the change in multiple_blocks_functions_kernel_effects_0 actually exposes a different error we should first address.

@jdoerfert (Member):

First, though, can we have a few more tests (and please pre-commit the tests)?

define void @loop_barrier2() "kernel" {
entry:
  br label %loop

loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  call void @unknown()
  call void @aligned_barrier()
  %i.next = add nuw nsw i32 %i, 1
  %cond = icmp ne i32 %i.next, 128
  br i1 %cond, label %loop, label %exit

exit:
  call void @aligned_barrier()
  call void @aligned_barrier()
  call void @aligned_barrier()
  call void @aligned_barrier()
  ret void
}

Also add an unknown call in between the 4 aligned barriers.
Then, replace unknown with a store to a generic global.
Now the fix you proposed fails and we race on that global, at least if we put an explicit barrier at the end, e.g.,:

define void @loop_barrier3() "kernel" {
entry:
  br label %loop

loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  store i32 %i, ptr @G1
  call void @aligned_barrier()
  %i.next = add nuw nsw i32 %i, 1
  %cond = icmp ne i32 %i.next, 128
  br i1 %cond, label %loop, label %exit

exit:
  call void @aligned_barrier()
  ret void
}

I'm still thinking about this.

@dwoodwor-intel (Contributor, author):

I've added the requested testcases in my update.

With this update, the three barriers in multiple_blocks_functions_kernel_effects_0 are back to being removed, but is this the correct behavior? It seems to me that we should preserve at least one of the barriers between the barrier_then_write0 calls and the write_then_barrier0 call.

@jdoerfert (Member):

I think you are right.

@dwoodwor-intel (Contributor, author):

I'm seeing the aligned_barriers from both write_then_barrier0 and also m3 in multiple_blocks_functions_kernel_effects_0 added to the multiple_blocks_functions_kernel_effects_0 end-of-function ED's AlignedBarriers. The write_then_barrier0 aligned_barrier is ignored because it's in another function, but it looks like we skip over the m3 aligned_barrier because it's already been removed and (incorrectly) remove the three other barriers because those ones precede the m3 aligned_barrier. Is this the expected behavior, or should the end-of-function ED for multiple_blocks_functions_kernel_effects_0 only have the write_then_barrier0 aligned_barrier in AlignedBarriers and not the one in m3?

@dwoodwor-intel (Contributor, author):

This issue with the three barriers in multiple_blocks_functions_kernel_effects_0 seems like it isn't directly related to the main bug this PR is fixing; would it make sense to merge this PR now and keep working on a fix for these barriers being incorrectly removed as a separate change?

@dwoodwor-intel (Contributor, author):

@jdoerfert, is this issue with these barriers still being removed what's holding up this change? It looks to me like the fix for this should be separate from the fix I've implemented here to only remove barriers with a straight line to the kernel end, but is there something I'm missing?

// the kernel end (if CB is nullptr). Aligned barriers reaching the kernel
// end may have other successors besides the kernel end (especially if
// they're in loops) with non-local side-effects, so those barriers can
// only be removed if they also only reach the kernel end. If those
@jdoerfert (Member):

> only be removed if they also only reach the kernel end.

That is not true, nor does it reflect what happens below. The kernel end is not an aligned barrier which is why your test case starts to work (the explicit aligned barrier does not only reach aligned barriers but also the kernel end and is consequently not removed).

@dwoodwor-intel (Contributor, author):

This should be "if they also only reach the kernel end or other aligned barriers", right?

And since IsReachingAlignedBarrierOnly doesn't take side-effects like the @unknown() call into account, I'd need to check something else to make sure there aren't any extra side-effect instructions after this barrier that aren't ahead of the kernel end. Would I need to do another backwards propagation for that? It seems like EncounteredNonLocalSideEffect is only for side-effects preceding the execution domain.

@jdoerfert (Member) commented Sep 9, 2023:

> This should be "if they also only reach the kernel end or other aligned barriers", right?

I don't think that is accurate, at least not with the rest in place. See my example (loop_barrier3) with an explicit barrier at the end. The one in the loop will hit only aligned barriers. The one at the end is useless and removed. The end -> barrier elimination will skip it and eliminate the one in the loop.

Here are my current thoughts:
The end -> barrier case is different because we delete the "former"(/only) barrier, not the latter one.
The logic we have is basically trying to determine when a barrier is not actually synchronizing/communicating anything, and it is wrong in the kernel end case because the barrier can (evidently) synchronize with itself.

The logic we use right now is fundamentally flawed since end -> barrier does not say anything about other paths from barrier to somewhere else. In fact, there is no need for the threads reaching the barrier to go (directly) to the end at all. I think this is the crucial point. Our logic only holds if the barrier we are about to eliminate, due to the kernel end case, cannot reach any other synchronization point. If it can, the path from it to the end is not enough reason to remove it. Even if it can only reach other aligned barriers, e.g.,:

if (tid == 7)
  G = 1;
aligned_barrier();        // currently eliminated, reached by kernel end
if (runtime_false())
  return; // kernel exit
aligned_barrier();        // currently eliminated, no side effect since the last barrier
use(G);

I see two paths forward:

  1. Remove the kernel end special case (and change the kernel end everywhere to not be an aligned barrier). This is simple and prevents the problem, but might cause some leftover barriers.
  2. Verify that each barrier we remove in the kernel end case will not reach any other sync point. The easiest solution is to start at the kernel end, and only remove barriers in blocks for which the kernel end block is the unique successor.

WDYT?

@dwoodwor-intel (Contributor, author):

I also think what we're doing now with special-casing the kernel end is not the right approach, but I think there is some value in generalizing the "former" barrier removal if we can get the conditions for it right. That would help in cases like this:

define void @only_first_barrier_redundant(i1 %c) "kernel" {
  call void @unknown()
  br i1 %c, label %if, label %end

if:
  call void @unknown()
  call void @aligned_barrier() ; Barrier A
  br label %end

end:
  call void @aligned_barrier() ; Barrier B
  call void @unknown()
  ret void
}

We can't remove Barrier B in this case and we can't remove Barrier A using our normal "latter" barrier removal because it has an unknown call before it. But Barrier A is still redundant here because it's always immediately followed by Barrier B, so if we had "former" barrier removal in cases other than the kernel end case we could safely remove it.

As in your example with a conditional return between the barriers, one of the big challenges here is that if we're removing both former and latter barriers in the common straight-line case we'll remove the former barrier because it's immediately followed by a barrier and also the latter barrier because it's immediately preceded by a barrier. This is a stale analysis problem, though: once we remove one of the barriers it's no longer the case that the remaining barrier does not have any side effects (or only reaches aligned barriers) before/after it. We could fix this by making sure the analysis is up-to-date, either by propagating the results from barriers as we remove them or by re-computing the analysis between the first sweep removing latter barriers and the second sweep removing former barriers.

This is a more ambitious strategy that will take more time to get right, though. For now I think just doing what we are now but only removing barriers in the end -> barrier case when the kernel end is the unique successor is a reasonable solution since it should fix all of the cases where barriers are being incorrectly removed while still removing barriers in simple cases. I'll update this PR to implement this instead.

@jdoerfert (Member):

I agree that this is reasonable. Also, change the initialization of ExitED; the kernel end is never an aligned barrier.

@dwoodwor-intel (Contributor, author):

Done

[OpenMPOpt] Fix incorrect end-of-kernel barrier removal

Barrier removal in OpenMPOpt normally removes barriers by proving that
they are redundant with barriers preceding them. However, it can't do
this with the "pseudo-barrier" at the end of kernels because that can't
be removed. Instead, it removes the barriers preceding the end of the
kernel which that end-of-kernel barrier is redundant with. However,
these barriers aren't always redundant with the end-of-kernel barrier
when loops are involved, and removing them can lead to incorrect results
in compiled code.

This change fixes this by requiring that these pre-end-of-kernel
barriers also have the kernel end as a unique successor before removing
them.
@llvmbot commented Sep 11, 2023

@llvm/pr-subscribers-llvm-transforms


Full diff: https://github.com/llvm/llvm-project/pull/65670.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/IPO/OpenMPOpt.cpp (+21-5)
  • (modified) llvm/test/Transforms/OpenMP/barrier_removal.ll (+177-1)
diff --git a/llvm/lib/Transforms/IPO/OpenMPOpt.cpp b/llvm/lib/Transforms/IPO/OpenMPOpt.cpp
index 63493eb78c451a6..8a8d27d9be38b8d 100644
--- a/llvm/lib/Transforms/IPO/OpenMPOpt.cpp
+++ b/llvm/lib/Transforms/IPO/OpenMPOpt.cpp
@@ -2621,6 +2621,17 @@ struct AAICVTrackerCallSiteReturned : AAICVTracker {
   }
 };
 
+/// Determines if \p BB exits the function unconditionally itself or reaches a
+/// block that does through only unique successors.
+static bool hasFunctionEndAsUniqueSuccessor(const BasicBlock *BB) {
+  if (succ_empty(BB))
+    return true;
+  const BasicBlock *const Successor = BB->getUniqueSuccessor();
+  if (!Successor)
+    return false;
+  return hasFunctionEndAsUniqueSuccessor(Successor);
+}
+
 struct AAExecutionDomainFunction : public AAExecutionDomain {
   AAExecutionDomainFunction(const IRPosition &IRP, Attributor &A)
       : AAExecutionDomain(IRP, A) {}
@@ -2676,17 +2687,19 @@ struct AAExecutionDomainFunction : public AAExecutionDomain {
       if (!ED.EncounteredAssumes.empty() && !A.isModulePass())
         return;
 
-      // We can remove this barrier, if it is one, or all aligned barriers
-      // reaching the kernel end. In the latter case we can transitively work
-      // our way back until we find a barrier that guards a side-effect if we
-      // are dealing with the kernel end here.
+      // We can remove this barrier, if it is one, or aligned barriers reaching
+      // the kernel end (if CB is nullptr). Aligned barriers reaching the kernel
+      // end should only be removed if the kernel end is their unique successor;
+      // otherwise, they may have side-effects that aren't accounted for in the
+      // kernel end in their other successors. If those barriers have other
+      // barriers reaching them, those can be transitively removed as well as
+      // long as the kernel end is also their unique successor.
       if (CB) {
         DeletedBarriers.insert(CB);
         A.deleteAfterManifest(*CB);
         ++NumBarriersEliminated;
         Changed = ChangeStatus::CHANGED;
       } else if (!ED.AlignedBarriers.empty()) {
-        NumBarriersEliminated += ED.AlignedBarriers.size();
         Changed = ChangeStatus::CHANGED;
        SmallVector<CallBase *> Worklist(ED.AlignedBarriers.begin(),
                                          ED.AlignedBarriers.end());
@@ -2697,7 +2710,10 @@ struct AAExecutionDomainFunction : public AAExecutionDomain {
             continue;
           if (LastCB->getFunction() != getAnchorScope())
             continue;
+          if (!hasFunctionEndAsUniqueSuccessor(LastCB->getParent()))
+            continue;
           if (!DeletedBarriers.count(LastCB)) {
+            ++NumBarriersEliminated;
             A.deleteAfterManifest(*LastCB);
             continue;
           }
diff --git a/llvm/test/Transforms/OpenMP/barrier_removal.ll b/llvm/test/Transforms/OpenMP/barrier_removal.ll
index e45f746cbf4395d..8aca820afdfa883 100644
--- a/llvm/test/Transforms/OpenMP/barrier_removal.ll
+++ b/llvm/test/Transforms/OpenMP/barrier_removal.ll
@@ -1065,8 +1065,174 @@ define void @caller_barrier2() "kernel" {
   ret void
 }
 
+define void @loop_barrier() "kernel" {
+; CHECK-LABEL: define {{[^@]+}}@loop_barrier
+; CHECK-SAME: () #[[ATTR4]] {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br label [[LOOP:%.*]]
+; CHECK:       loop:
+; CHECK-NEXT:    [[I:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[I_NEXT:%.*]], [[LOOP]] ]
+; CHECK-NEXT:    call void @unknown()
+; CHECK-NEXT:    call void @aligned_barrier()
+; CHECK-NEXT:    [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT:    [[COND:%.*]] = icmp ne i32 [[I_NEXT]], 128
+; CHECK-NEXT:    br i1 [[COND]], label [[LOOP]], label [[EXIT:%.*]]
+; CHECK:       exit:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+
+loop:
+  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
+  call void @unknown()
+  call void @aligned_barrier()
+  %i.next = add nuw nsw i32 %i, 1
+  %cond = icmp ne i32 %i.next, 128
+  br i1 %cond, label %loop, label %exit
+
+exit:
+  ret void
+}
+
+define void @loop_barrier_end_barriers() "kernel" {
+; CHECK-LABEL: define {{[^@]+}}@loop_barrier_end_barriers
+; CHECK-SAME: () #[[ATTR4]] {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br label [[LOOP:%.*]]
+; CHECK:       loop:
+; CHECK-NEXT:    [[I:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[I_NEXT:%.*]], [[LOOP]] ]
+; CHECK-NEXT:    call void @unknown()
+; CHECK-NEXT:    call void @aligned_barrier()
+; CHECK-NEXT:    [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT:    [[COND:%.*]] = icmp ne i32 [[I_NEXT]], 128
+; CHECK-NEXT:    br i1 [[COND]], label [[LOOP]], label [[EXIT:%.*]]
+; CHECK:       exit:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+
+loop:
+  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
+  call void @unknown()
+  call void @aligned_barrier()
+  %i.next = add nuw nsw i32 %i, 1
+  %cond = icmp ne i32 %i.next, 128
+  br i1 %cond, label %loop, label %exit
+
+exit:
+  call void @aligned_barrier()
+  call void @aligned_barrier()
+  call void @aligned_barrier()
+  call void @aligned_barrier()
+  ret void
+}
+
+define void @loop_barrier_end_barriers_unknown() "kernel" {
+; CHECK-LABEL: define {{[^@]+}}@loop_barrier_end_barriers_unknown
+; CHECK-SAME: () #[[ATTR4]] {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br label [[LOOP:%.*]]
+; CHECK:       loop:
+; CHECK-NEXT:    [[I:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[I_NEXT:%.*]], [[LOOP]] ]
+; CHECK-NEXT:    call void @unknown()
+; CHECK-NEXT:    call void @aligned_barrier()
+; CHECK-NEXT:    [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT:    [[COND:%.*]] = icmp ne i32 [[I_NEXT]], 128
+; CHECK-NEXT:    br i1 [[COND]], label [[LOOP]], label [[EXIT:%.*]]
+; CHECK:       exit:
+; CHECK-NEXT:    call void @unknown()
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+
+loop:
+  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
+  call void @unknown()
+  call void @aligned_barrier()
+  %i.next = add nuw nsw i32 %i, 1
+  %cond = icmp ne i32 %i.next, 128
+  br i1 %cond, label %loop, label %exit
+
+exit:
+  call void @aligned_barrier()
+  call void @aligned_barrier()
+  call void @unknown()
+  call void @aligned_barrier()
+  call void @aligned_barrier()
+  ret void
+}
+
+define void @loop_barrier_store() "kernel" {
+; CHECK-LABEL: define {{[^@]+}}@loop_barrier_store
+; CHECK-SAME: () #[[ATTR4]] {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br label [[LOOP:%.*]]
+; CHECK:       loop:
+; CHECK-NEXT:    [[I:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[I_NEXT:%.*]], [[LOOP]] ]
+; CHECK-NEXT:    store i32 [[I]], ptr @G1, align 4
+; CHECK-NEXT:    call void @aligned_barrier()
+; CHECK-NEXT:    [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT:    [[COND:%.*]] = icmp ne i32 [[I_NEXT]], 128
+; CHECK-NEXT:    br i1 [[COND]], label [[LOOP]], label [[EXIT:%.*]]
+; CHECK:       exit:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+
+loop:
+  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
+  store i32 %i, ptr @G1
+  call void @aligned_barrier()
+  %i.next = add nuw nsw i32 %i, 1
+  %cond = icmp ne i32 %i.next, 128
+  br i1 %cond, label %loop, label %exit
+
+exit:
+  ret void
+}
+
+define void @loop_barrier_end_barriers_store() "kernel" {
+; CHECK-LABEL: define {{[^@]+}}@loop_barrier_end_barriers_store
+; CHECK-SAME: () #[[ATTR4]] {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br label [[LOOP:%.*]]
+; CHECK:       loop:
+; CHECK-NEXT:    [[I:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[I_NEXT:%.*]], [[LOOP]] ]
+; CHECK-NEXT:    store i32 [[I]], ptr @G1, align 4
+; CHECK-NEXT:    call void @aligned_barrier()
+; CHECK-NEXT:    [[I_NEXT]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT:    [[COND:%.*]] = icmp ne i32 [[I_NEXT]], 128
+; CHECK-NEXT:    br i1 [[COND]], label [[LOOP]], label [[EXIT:%.*]]
+; CHECK:       exit:
+; CHECK-NEXT:    store i32 [[I_NEXT]], ptr @G1, align 4
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+
+loop:
+  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
+  store i32 %i, ptr @G1
+  call void @aligned_barrier()
+  %i.next = add nuw nsw i32 %i, 1
+  %cond = icmp ne i32 %i.next, 128
+  br i1 %cond, label %loop, label %exit
+
+exit:
+  call void @aligned_barrier()
+  call void @aligned_barrier()
+  store i32 %i.next, ptr @G1
+  call void @aligned_barrier()
+  call void @aligned_barrier()
+  ret void
+}
+
 !llvm.module.flags = !{!16,!15}
-!nvvm.annotations = !{!0,!1,!2,!3,!4,!5,!6,!7,!8,!9,!10,!11,!12,!13,!14,!17,!18,!19,!20,!21,!22,!23,!24,!25}
+!nvvm.annotations = !{!0,!1,!2,!3,!4,!5,!6,!7,!8,!9,!10,!11,!12,!13,!14,!17,!18,!19,!20,!21,!22,!23,!24,!25,!26,!27,!28,!29,!30}
 
 !0 = !{ptr @pos_empty_1, !"kernel", i32 1}
 !1 = !{ptr @pos_empty_2, !"kernel", i32 1}
@@ -1079,6 +1245,11 @@ define void @caller_barrier2() "kernel" {
 !23 = !{ptr @pos_empty_8, !"kernel", i32 1}
 !24 = !{ptr @caller_barrier1, !"kernel", i32 1}
 !25 = !{ptr @caller_barrier2, !"kernel", i32 1}
+!26 = !{ptr @loop_barrier, !"kernel", i32 1}
+!27 = !{ptr @loop_barrier_end_barriers, !"kernel", i32 1}
+!28 = !{ptr @loop_barrier_end_barriers_unknown, !"kernel", i32 1}
+!29 = !{ptr @loop_barrier_store, !"kernel", i32 1}
+!30 = !{ptr @loop_barrier_end_barriers_store, !"kernel", i32 1}
 !6 = !{ptr @neg_empty_8, !"kernel", i32 1}
 !19 = !{ptr @neg_empty_9, !"kernel", i32 1}
 !20 = !{ptr @pos_empty_10, !"kernel", i32 1}
@@ -1128,4 +1299,9 @@ define void @caller_barrier2() "kernel" {
 ; CHECK: [[META23:![0-9]+]] = !{ptr @pos_empty_8, !"kernel", i32 1}
 ; CHECK: [[META24:![0-9]+]] = !{ptr @caller_barrier1, !"kernel", i32 1}
 ; CHECK: [[META25:![0-9]+]] = !{ptr @caller_barrier2, !"kernel", i32 1}
+; CHECK: [[META26:![0-9]+]] = !{ptr @loop_barrier, !"kernel", i32 1}
+; CHECK: [[META27:![0-9]+]] = !{ptr @loop_barrier_end_barriers, !"kernel", i32 1}
+; CHECK: [[META28:![0-9]+]] = !{ptr @loop_barrier_end_barriers_unknown, !"kernel", i32 1}
+; CHECK: [[META29:![0-9]+]] = !{ptr @loop_barrier_store, !"kernel", i32 1}
+; CHECK: [[META30:![0-9]+]] = !{ptr @loop_barrier_end_barriers_store, !"kernel", i32 1}
 ;.

@jdoerfert (Member) left a comment
Thanks for this fix. Apologies it took me so long to review it.

const BasicBlock *const Successor = BB->getUniqueSuccessor();
if (!Successor)
return false;
return hasFunctionEndAsUniqueSuccessor(Successor);
@jdoerfert (Member):

It is likely not to matter here, but we usually avoid recursion and go with loops.
Also, `const BB *const` is a long way to spell `auto`.

@dwoodwor-intel (Contributor, author):

Yes, this is tail recursion, so it should be functionally equivalent.
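For illustration, the recursive walk can be rewritten as the loop the reviewer suggests. This is a standalone sketch using a minimal `Block` stand-in for `llvm::BasicBlock` (the `Block` type and its members are hypothetical, not LLVM's API):

```cpp
#include <vector>

// Minimal stand-in for llvm::BasicBlock, for illustration only: a block
// with an edge list and a getUniqueSuccessor()-style accessor.
struct Block {
  std::vector<Block *> Succs;
  // Simplified: returns the single successor, or null if there are zero
  // or multiple successors (LLVM's version also collapses duplicate edges).
  const Block *getUniqueSuccessor() const {
    return Succs.size() == 1 ? Succs[0] : nullptr;
  }
};

// Iterative equivalent of the recursive hasFunctionEndAsUniqueSuccessor:
// follow the chain of unique successors; succeed on reaching a block with
// no successors (the function end), fail as soon as the chain branches.
// Like the recursive version, it assumes the unique-successor chain is
// acyclic.
static bool hasFunctionEndAsUniqueSuccessor(const Block *BB) {
  while (!BB->Succs.empty()) {
    BB = BB->getUniqueSuccessor();
    if (!BB)
      return false;
  }
  return true;
}
```

Since the original is tail recursion, a compiler typically lowers it to exactly this kind of loop, which is why the author considered the two forms functionally equivalent.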

@dwoodwor-intel (Contributor, author):

I don't have write access; @jdoerfert, can you merge this on my behalf?

@jdoerfert jdoerfert merged commit ac29405 into llvm:main Sep 27, 2023
legrosbuffle pushed a commit to legrosbuffle/llvm-project that referenced this pull request Sep 29, 2023
@dwoodwor-intel dwoodwor-intel deleted the openmp-opt-final-barrier-loop branch November 6, 2023 20:01