Skip to content

[mlir][OpenMP][flang] make private variable allocation implicit in omp.private #124019

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Jan 31, 2025

Conversation

tblah
Copy link
Contributor

@tblah tblah commented Jan 22, 2025

The intention of this work is to give MLIR->LLVMIR conversion freedom to
control how the private variable is allocated so that it can be
allocated on the stack in ordinary cases or as part of a structure used
to give closure context for tasks which might outlive the current stack
frame. See RFC:
https://discourse.llvm.org/t/rfc-openmp-supporting-delayed-task-execution-with-firstprivate-variables/83084

For example, a privatizer for an integer used to look like

  omp.private {type = private} @x.privatizer : !fir.ref<i32> alloc {
  ^bb0(%arg0: !fir.ref<i32>):
    %0 = ... allocate proper memory for the private clone ...
    omp.yield(%0 : !fir.ref<i32>)
  }

After this change, allocation become implicit in the operation:

  omp.private {type = private} @x.privatizer : i32

For more complex types that require initialization after allocation, an init region can be used:

  omp.private {type = private} @x.privatizer : !some.type init {
  ^bb0(%arg0: !some.pointer<!some.type>, %arg1: !some.pointer<!some.type>):
    // initialize %arg1, using %arg0 as a mold for allocations
    omp.yield(%arg1 : !some.pointer<!some.type>)
  } dealloc {
    ^bb0(%arg0: !some.pointer<!some.type>):
    ... deallocate memory allocated by the init region ...
    omp.yield
  }

This patch lays the groundwork for delayed task execution but is not enough on its own.

After this patch all gfortran tests which previously passed still pass. There
are the following changes to the Fujitsu test suite:

  • 0380_0009 and 0435_0009 are fixed
  • 0688_0041 now fails at runtime. This patch is testing firstprivate variables with tasks. Previously we got lucky with the undefined behavior and won the race. After these changes we no longer get lucky. This patch lays the groundwork for a proper fix for this issue.

In flang the lowering re-uses the existing lowering used for reduction init and dealloc regions.

In flang, before this patch we hit a TODO with the same wording when
generating the copy region for firstprivate polymorphic variables. After
this patch the box-like fir.class is passed by reference into the copy
region, leading to a different path that didn't hit that old TODO but
the generated code still didn't work so I added a new TODO in
DataSharingProcessor.


Please read mlir changes first and then flang changes.

This has to be merged as one massive patch to avoid testsuite regressions,
but for ease of review I have broken it up into a series of (mostly) short patches.
In these patches I have only updated unit tests as I go along. They will only all
pass at the end of the series.

I have tried to make the changes to init/dealloc region generation self-contained
so that it is easier to see what changed when reviewing. For some commits I have
added extra commentary in the commit message for that commit.

By the end of this series the init/dealloc region generation is a bit hard to follow so I will refactor
it in a follow-up (hopefull NFC) patch.

In flang lowering I box up all arrays and pass the boxes by reference so
that the existing code for reduction init and dealloc regions can be
shared.

The TODOs for pointers, derived types, characters etc will be resolved
in later patches in this same series (to be squashed into this one). I
separated it to make it easier to review.

Assumed rank was already broken before this patch.
I can't find any mention of "assumed rank" in the openmp standard so
I guess it is not prohibited.

Other than the omp.private operation definition changes, the test changes
are mostly down to slightly different codegen from re-using the reduction
init region. That code is already well tested so I didn't want to change it.

@luporl please could you review my re-implementation of your recent fix in 4d52dde and especially the changes to the logic for the inclusion of barriers.

@tblah tblah changed the base branch from main to users/tblah/delayed-task-exec-0 January 22, 2025 21:48
@tblah tblah requested a review from kaviya2510 January 23, 2025 11:25
@tblah
Copy link
Contributor Author

tblah commented Jan 23, 2025

I'm not sure why the bot didn't run on this.

@llvm/pr-subscribers-flang-openmp

Copy link
Member

@ergawy ergawy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great work @tblah! I partially reviewed since this is a huge PR (understandably). I will come back and continue reviewing the rest later.

assert(declOp &&
"Expected defining Op of privatized var to be hlfir.declare");
assert(definingOp &&
"Privatizing a block argument without any hlfir.declare");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't restrict ourselves to DeclareOp, do we still need that assertion? If so, the error message needs to be generalized.

Copy link
Contributor Author

@tblah tblah Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MLIR values can come from two places:

  1. the result of an operation
  2. a block argument

We can't assume that getting the defining operation produces a non-null result because the value might be a block argument.

For example,

func.func @func(%arg0 : !type0, %arg1 : !type1) {
  %0 = hlfir.declare %arg0
  omp.target private(%0, %arg1)
}

Here we can get a defining operation for %0 because it is the hlfir.declare for that function argument (which is the normal way flang would lower a function argument). %arg1 has no defining operation because it is a block argument.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MLIR values can come from two places:

That's exactly my point. What prevents us from working with block args here? Why do we need to assume it is defined by an op?

I am not against that, we can keep the assertion. But beyond the fact that we need to call getBase below, we only care about var and a Value and not about its defining op.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was nervous to make any functional change to the target stuff because I don't know how to test it. The previous implementation also wouldn't have worked for block arguments.

I can fix this if somebody at AMD is willing to test omp target for me?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove the assertion and see if any of our smoke tests triggers any problems (I will monitor that). Then we can add handle whatever is missing and add a regression test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! 44bbbe8

Copy link
Member

@ergawy ergawy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks Tom! However, I have to admit, the fir dialect type system is still a "semi-"blackbox to me, so someone more familiar with it needs to carefully review changes in PrivateReductionUtils.cpp.

assert(declOp &&
"Expected defining Op of privatized var to be hlfir.declare");
assert(definingOp &&
"Privatizing a block argument without any hlfir.declare");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MLIR values can come from two places:

That's exactly my point. What prevents us from working with block args here? Why do we need to assume it is defined by an op?

I am not against that, we can keep the assertion. But beyond the fact that we need to call getBase below, we only care about var and a Value and not about its defining op.

Base automatically changed from users/tblah/delayed-task-exec-0 to main January 24, 2025 14:48
Copy link
Contributor

@luporl luporl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the effort you have put into this @tblah. I'm looking forward to privatization support in delayed tasks.

I like your approach for detecting the need of barriers, based on the uses of moldArg, it's clever.

I have a few comments, but no major concerns with this PR.

Comment on lines +431 to +494
mlir::Value newBox =
builder.create<fir::EmboxOp>(loc, boxedTy, allocatedPrivVarArg);
mlir::Value moldBox = builder.create<fir::EmboxOp>(loc, boxedTy, moldArg);
initializeIfDerivedTypeBox(
builder, loc, newBox, moldBox, needsInitialization,
/*isFirstPrivate=*/kind == DeclOperationKind::FirstPrivate);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a future optimization, you could move part of initializeIfDerivedTypeBox()'s body to a separate function. This new function would detect if any initialization is needed and could be used here to eliminate the need to always embox newBox and moldBox.

BTW, emboxing moldArg for derived types will always create an use of it, right? This causes barriers to be inserted for the privatization of any derived types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. It is even worse than just derived types, we load moldArg to fetch the length parameters whether or not there really are any. That load (and probably these embox operations) get removed by DCE but not before the barriers are created.

tblah added 17 commits January 30, 2025 12:34
The intention of this work is to give MLIR->LLVMIR conversion freedom to
control how the private variable is allocated so that it can be
allocated on the stack in ordinary cases or as part of a structure used
to give closure context for tasks which might outlive the current stack
frame. See RFC:
https://discourse.llvm.org/t/rfc-openmp-supporting-delayed-task-execution-with-firstprivate-variables/83084

In flang, before this patch we hit a TODO with the same wording when
generating the copy region for firstprivate polymorphic variables. After
this patch the box-like fir.class is passed by reference into the copy
region, leading to a different path that didn't hit that old TODO but
the generated code still didn't work so I added a new TODO in
DataSharingProcessor.

---

Please read mlir changes first and then flang changes.

In flang lowering I box up all arrays and pass the boxes by reference so
that the existing code for reduction init and dealloc regions can be
shared.

The TODOs for pointers, derived types, characters etc will be resolved
in later patches in this same series (to be squashed into this one). I
separated it to make it easier to review.

Assumed rank was already broken before this patch.
I can't find any mention of "assumed rank" in the openmp standard so
I guess it is not prohibited.

Other than the omp.private operation definition changes, the test changes
are mostly down to slightly different codegen from re-using the reduction
init region. That code is already well tested so I didn't want to change it.
Arrays are boxed and those boxed arrays are passed by reference. This
can cause problems if we try to perform eager privatization (e.g.
lastprivate) on a variable already privatized using delayed
privatization. Work around this by loading the reference to the box.
The way boxchars are handled isn't ideal, but !fir.ref<!fir.boxchar<>>
seems to violate a lot of assumptions in the wider ecosystem of (hl)fir
helpers making it difficult to generate a copy region. I suspect
!fir.ref<!fir.boxchar<>> is not supposed to work (it looks like a
mutable character box, which isn't possible because a boxchar should be
an SSA value). Fixing this would be a big change beyond the scope of
this already too large PR.
The copy region codegen hasn't changed as a result of this patch series.
However I think there is a bug in the copy region generated in
equivalence.f90. This patch series is already too big so I won't change
the copy region here.
This only affects POINTERs (which should not be initialized) and NULL
allocatables so the actual contents of the shape doesn't matter. This is
just so the embox operation is converted to LLVMIR correctly.
There are three cases handled:
- Boxed (maybe allocatable or pointer) scalar derived types
- Boxed (maybe allocatable or pointer) arrays of derived types
- Unboxed scalar derived types

Currently I support both boxed and unboxed derived types because unboxed
derived types aren't hlfir::Entities so I worry there could be cases
where they are in fact boxed.

The changes to parallel-private-clause.f90 are because of the arrays
becoming boxed. I re-organised the test a bit because the CHECK-DAGs
were matching completely the wrong lines outside of the parallel region.
This will deallocate in exactly the same cases as the old implementation
before this series. I expect some cases are missed but I want to avoid
functional changes wherever possible in this (already too big) patch
series.
I made a mistake when rewriting the privatizer, which led to the test
reading incorrectly.
@tblah tblah force-pushed the users/tblah/delayed-task-exec-1 branch from ce45103 to 9284bb3 Compare January 30, 2025 13:50
@tblah
Copy link
Contributor Author

tblah commented Jan 30, 2025

@luporl ping for review

Copy link
Contributor

@luporl luporl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks.

If I understood correctly, with this change the generated code will have more barriers than before, although some of those might be fixing existing race conditions.
Is my understanding correct, that the goal is to prioritize correctness for now and to remove unnecessary barriers in the future?

@tblah
Copy link
Contributor Author

tblah commented Jan 30, 2025

If I understood correctly, with this change the generated code will have more barriers than before, although some of those might be fixing existing race conditions. Is my understanding correct, that the goal is to prioritize correctness for now and to remove unnecessary barriers in the future?

Yes that is my intention

Thanks for the reivew

@tblah tblah merged commit aeaafce into main Jan 31, 2025
8 checks passed
@tblah tblah deleted the users/tblah/delayed-task-exec-1 branch January 31, 2025 09:35
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Feb 3, 2025
…it in omp.private (llvm#124019)"

Breaks 535.weather , hangs on ringo

This reverts commit aeaafce.
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Feb 7, 2025
…p.private (llvm#124019)

The intention of this work is to give MLIR->LLVMIR conversion freedom to
control how the private variable is allocated so that it can be
allocated on the stack in ordinary cases or as part of a structure used
to give closure context for tasks which might outlive the current stack
frame. See RFC:

https://discourse.llvm.org/t/rfc-openmp-supporting-delayed-task-execution-with-firstprivate-variables/83084

For example, a privatizer for an integer used to look like
```mlir
  omp.private {type = private} @x.privatizer : !fir.ref<i32> alloc {
  ^bb0(%arg0: !fir.ref<i32>):
    %0 = ... allocate proper memory for the private clone ...
    omp.yield(%0 : !fir.ref<i32>)
  }
```

After this change, allocation become implicit in the operation:
```mlir
  omp.private {type = private} @x.privatizer : i32
```

For more complex types that require initialization after allocation, an
init region can be used:
``` mlir
  omp.private {type = private} @x.privatizer : !some.type init {
  ^bb0(%arg0: !some.pointer<!some.type>, %arg1: !some.pointer<!some.type>):
    // initialize %arg1, using %arg0 as a mold for allocations
    omp.yield(%arg1 : !some.pointer<!some.type>)
  } dealloc {
    ^bb0(%arg0: !some.pointer<!some.type>):
    ... deallocate memory allocated by the init region ...
    omp.yield
  }
```

This patch lays the groundwork for delayed task execution but is not
enough on its own.

After this patch all gfortran tests which previously passed still pass.
There
are the following changes to the Fujitsu test suite:
- 0380_0009 and 0435_0009 are fixed
- 0688_0041 now fails at runtime. This patch is testing firstprivate
variables with tasks. Previously we got lucky with the undefined
behavior and won the race. After these changes we no longer get lucky.
This patch lays the groundwork for a proper fix for this issue.

In flang the lowering re-uses the existing lowering used for reduction
init and dealloc regions.

In flang, before this patch we hit a TODO with the same wording when
generating the copy region for firstprivate polymorphic variables. After
this patch the box-like fir.class is passed by reference into the copy
region, leading to a different path that didn't hit that old TODO but
the generated code still didn't work so I added a new TODO in
DataSharingProcessor.
@@ -184,7 +218,8 @@ bool DataSharingProcessor::needBarrier() {
// Emit implicit barrier for linear clause. Maybe on somewhere else.
for (const semantics::Symbol *sym : allPrivatizedSymbols) {
if (sym->test(semantics::Symbol::Flag::OmpLastPrivate) &&
(sym->test(semantics::Symbol::Flag::OmpFirstPrivate) || callsInitClone))
(sym->test(semantics::Symbol::Flag::OmpFirstPrivate) ||
mightHaveReadHostSym))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @tblah for the late comment to this PR. I am looking into an issue that came up recently, and I believe it's related to this check here and your extensions to initialize that flag also during delayed privatization. So, I wanted to ask about it to try to understand it better, and see what the best solution is.

As far as I can tell, the mightHaveReadHostSym flag is set globally. That is, whenever a symbol is known to either initialize a clone or to have a privatizer with an init mold argument that is used, this will be true. Then, a barrier will be inserted e.g. if it is true and there are any lastprivate symbols. Is that the intended behavior?

My concern is that it doesn't look at whether the lastprivate symbol is the same one that caused the flag to be set. This is something that is triggered in the following example:

subroutine simd()
  integer :: i, j
  integer, dimension ( 3 ) :: x

  !$omp simd collapse(2) private(x)
  do i = 1, 10
    do j = 1, 10
      call foo(x)
    end do
  end do
  !$omp end simd
end subroutine simd

The i variable is implicitly marked as lastprivate (not sure where that's done, but I'm assuming it's simd-specific behavior) and there is a private clause for a non-primitive type, so this results in the insertion of a barrier. However, replacing simd for do no longer causes this issue because i is no longer lastprivate then. I was just wondering if it's on purpose that we're adding a barrier due to different symbols meeting each side of the condition.

I guess that would have been triggered previously as well, it's just that this change made it also happen for delayed privatization, which then impacts composite construct lowering and ends up triggering an MLIR verifier error for target SPMD due to the added barrier.

Maybe @ergawy or @luporl can also chime in if you know whether this is the expected behavior. I have a small patch updating mightHaveReadHostSym so that a barrier would not be inserted in this case, which I can share if this isn't expected behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIU, a barrier should not be needed in this case, since it is not possible to have concurrent read/write of the same memory region, as i is only written to in the last iteration and, as it seems, x is only read during privatization (this could probably be optimized, since x's shape is known at compile time).

I hadn't noticed it before, but having a single mightHaveReadHostSym variable for all privatized symbols is a problem indeed. A possible solution would be to have a per-symbol mightHaveReadHostSym.

Copy link
Member

@skatrak skatrak Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the quick response. I just created #127074 to address the second part of the issue. Though it looks like, from what you're saying, we might want to treat induction variables differently as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick fix.
I only wanted to point out that, in this case, x wouldn't even need to be read at all.
Actually, this optimization is implemented in #125901.

Do you think there is any advantage in treating induction variables differently?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thank you for the explanation, that makes sense. I would try to have as few edge cases as possible, so if we can get away with treating every variable the same way, I'd prefer it.

DO loop bounds have to be scalars, so it seems to me that the induction variable couldn't trigger this via the mightHaveReadHostSym=true path. Also, semantics triggers an error about passing a loop induction variable into a firstprivate clause, so it seems like we should be fine without adding an edge case for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for my slow response @skatrak, I was on holiday. Thanks for fixing it, and thanks @luporl for chiming in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants