-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[mlir][OpenMP][flang] make private variable allocation implicit in omp.private #124019
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I'm not sure why the bot didn't run on this. @llvm/pr-subscribers-flang-openmp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the great work @tblah! I partially reviewed since this is a huge PR (understandably). I will come back and continue reviewing the rest later.
assert(declOp && | ||
"Expected defining Op of privatized var to be hlfir.declare"); | ||
assert(definingOp && | ||
"Privatizing a block argument without any hlfir.declare"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't restrict ourselves to DeclareOp
, do we still need that assertion? If so, the error message needs to be generalized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MLIR values can come from two places:
- the result of an operation
- a block argument
We can't assume that getting the defining operation produces a non-null result because the value might be a block argument.
For example,
func.func @func(%arg0 : !type0, %arg1 : !type1) {
%0 = hlfir.declare %arg0
omp.target private(%0, %arg1)
}
Here we can get a defining operation for %0
because it is the hlfir.declare
for that function argument (which is the normal way flang would lower a function argument). %arg1
has no defining operation because it is a block argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MLIR values can come from two places:
That's exactly my point. What prevents us from working with block args here? Why do we need to assume it is defined by an op?
I am not against that, we can keep the assertion. But beyond the fact that we need to call getBase
below, we only care about var
and a Value
and not about its defining op.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was nervous to make any functional change to the target stuff because I don't know how to test it. The previous implementation also wouldn't have worked for block arguments.
I can fix this if somebody at AMD is willing to test omp target
for me?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove the assertion and see if any of our smoke tests triggers any problems (I will monitor that). Then we can add handle whatever is missing and add a regression test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! 44bbbe8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks Tom! However, I have to admit, the fir
dialect type system is still a "semi-"blackbox to me, so someone more familiar with it needs to carefully review changes in PrivateReductionUtils.cpp
.
assert(declOp && | ||
"Expected defining Op of privatized var to be hlfir.declare"); | ||
assert(definingOp && | ||
"Privatizing a block argument without any hlfir.declare"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MLIR values can come from two places:
That's exactly my point. What prevents us from working with block args here? Why do we need to assume it is defined by an op?
I am not against that, we can keep the assertion. But beyond the fact that we need to call getBase
below, we only care about var
and a Value
and not about its defining op.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the effort you have put into this @tblah. I'm looking forward to privatization support in delayed tasks.
I like your approach for detecting the need of barriers, based on the uses of moldArg
, it's clever.
I have a few comments, but no major concerns with this PR.
mlir::Value newBox = | ||
builder.create<fir::EmboxOp>(loc, boxedTy, allocatedPrivVarArg); | ||
mlir::Value moldBox = builder.create<fir::EmboxOp>(loc, boxedTy, moldArg); | ||
initializeIfDerivedTypeBox( | ||
builder, loc, newBox, moldBox, needsInitialization, | ||
/*isFirstPrivate=*/kind == DeclOperationKind::FirstPrivate); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a future optimization, you could move part of initializeIfDerivedTypeBox()
's body to a separate function. This new function would detect if any initialization is needed and could be used here to eliminate the need to always embox newBox
and moldBox
.
BTW, emboxing moldArg
for derived types will always create an use of it, right? This causes barriers to be inserted for the privatization of any derived types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. It is even worse than just derived types, we load moldArg
to fetch the length parameters whether or not there really are any. That load (and probably these embox operations) get removed by DCE but not before the barriers are created.
The intention of this work is to give MLIR->LLVMIR conversion freedom to control how the private variable is allocated so that it can be allocated on the stack in ordinary cases or as part of a structure used to give closure context for tasks which might outlive the current stack frame. See RFC: https://discourse.llvm.org/t/rfc-openmp-supporting-delayed-task-execution-with-firstprivate-variables/83084 In flang, before this patch we hit a TODO with the same wording when generating the copy region for firstprivate polymorphic variables. After this patch the box-like fir.class is passed by reference into the copy region, leading to a different path that didn't hit that old TODO but the generated code still didn't work so I added a new TODO in DataSharingProcessor. --- Please read mlir changes first and then flang changes. In flang lowering I box up all arrays and pass the boxes by reference so that the existing code for reduction init and dealloc regions can be shared. The TODOs for pointers, derived types, characters etc will be resolved in later patches in this same series (to be squashed into this one). I separated it to make it easier to review. Assumed rank was already broken before this patch. I can't find any mention of "assumed rank" in the openmp standard so I guess it is not prohibited. Other than the omp.private operation definition changes, the test changes are mostly down to slightly different codegen from re-using the reduction init region. That code is already well tested so I didn't want to change it.
Arrays are boxed and those boxed arrays are passed by reference. This can cause problems if we try to perform eager privatization (e.g. lastprivate) on a variable already privatized using delayed privatization. Work around this by loading the reference to the box.
The way boxchars are handled isn't ideal, but !fir.ref<!fir.boxchar<>> seems to violate a lot of assumptions in the wider ecosystem of (hl)fir helpers making it difficult to generate a copy region. I suspect !fir.ref<!fir.boxchar<>> is not supposed to work (it looks like a mutable character box, which isn't possible because a boxchar should be an SSA value). Fixing this would be a big change beyond the scope of this already too large PR.
The copy region codegen hasn't changed as a result of this patch series. However I think there is a bug in the copy region generated in equivalence.f90. This patch series is already too big so I won't change the copy region here.
This only affects POINTERs (which should not be initialized) and NULL allocatables so the actual contents of the shape doesn't matter. This is just so the embox operation is converted to LLVMIR correctly.
There are three cases handled: - Boxed (maybe allocatable or pointer) scalar derived types - Boxed (maybe allocatable or pointer) arrays of derived types - Unboxed scalar derived types Currently I support both boxed and unboxed derived types because unboxed derived types aren't hlfir::Entities so I worry there could be cases where they are in fact boxed. The changes to parallel-private-clause.f90 are because of the arrays becoming boxed. I re-organised the test a bit because the CHECK-DAGs were matching completely the wrong lines outside of the parallel region.
For reference see both #120295 and #121808 This changes the barrier logic. See discussion here: https://github.com/llvm/llvm-project/pull/120295/files#r1910663473
This will deallocate in exactly the same cases as the old implementation before this series. I expect some cases are missed but I want to avoid functional changes wherever possible in this (already too big) patch series.
I made a mistake when rewriting the privatizer, which led to the test reading incorrectly.
ce45103
to
9284bb3
Compare
@luporl ping for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks.
If I understood correctly, with this change the generated code will have more barriers than before, although some of those might be fixing existing race conditions.
Is my understanding correct, that the goal is to prioritize correctness for now and to remove unnecessary barriers in the future?
Yes that is my intention Thanks for the reivew |
…it in omp.private (llvm#124019)" Breaks 535.weather , hangs on ringo This reverts commit aeaafce.
…p.private (llvm#124019) The intention of this work is to give MLIR->LLVMIR conversion freedom to control how the private variable is allocated so that it can be allocated on the stack in ordinary cases or as part of a structure used to give closure context for tasks which might outlive the current stack frame. See RFC: https://discourse.llvm.org/t/rfc-openmp-supporting-delayed-task-execution-with-firstprivate-variables/83084 For example, a privatizer for an integer used to look like ```mlir omp.private {type = private} @x.privatizer : !fir.ref<i32> alloc { ^bb0(%arg0: !fir.ref<i32>): %0 = ... allocate proper memory for the private clone ... omp.yield(%0 : !fir.ref<i32>) } ``` After this change, allocation become implicit in the operation: ```mlir omp.private {type = private} @x.privatizer : i32 ``` For more complex types that require initialization after allocation, an init region can be used: ``` mlir omp.private {type = private} @x.privatizer : !some.type init { ^bb0(%arg0: !some.pointer<!some.type>, %arg1: !some.pointer<!some.type>): // initialize %arg1, using %arg0 as a mold for allocations omp.yield(%arg1 : !some.pointer<!some.type>) } dealloc { ^bb0(%arg0: !some.pointer<!some.type>): ... deallocate memory allocated by the init region ... omp.yield } ``` This patch lays the groundwork for delayed task execution but is not enough on its own. After this patch all gfortran tests which previously passed still pass. There are the following changes to the Fujitsu test suite: - 0380_0009 and 0435_0009 are fixed - 0688_0041 now fails at runtime. This patch is testing firstprivate variables with tasks. Previously we got lucky with the undefined behavior and won the race. After these changes we no longer get lucky. This patch lays the groundwork for a proper fix for this issue. In flang the lowering re-uses the existing lowering used for reduction init and dealloc regions. In flang, before this patch we hit a TODO with the same wording when generating the copy region for firstprivate polymorphic variables. After this patch the box-like fir.class is passed by reference into the copy region, leading to a different path that didn't hit that old TODO but the generated code still didn't work so I added a new TODO in DataSharingProcessor.
@@ -184,7 +218,8 @@ bool DataSharingProcessor::needBarrier() { | |||
// Emit implicit barrier for linear clause. Maybe on somewhere else. | |||
for (const semantics::Symbol *sym : allPrivatizedSymbols) { | |||
if (sym->test(semantics::Symbol::Flag::OmpLastPrivate) && | |||
(sym->test(semantics::Symbol::Flag::OmpFirstPrivate) || callsInitClone)) | |||
(sym->test(semantics::Symbol::Flag::OmpFirstPrivate) || | |||
mightHaveReadHostSym)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry @tblah for the late comment to this PR. I am looking into an issue that came up recently, and I believe it's related to this check here and your extensions to initialize that flag also during delayed privatization. So, I wanted to ask about it to try to understand it better, and see what the best solution is.
As far as I can tell, the mightHaveReadHostSym
flag is set globally. That is, whenever a symbol is known to either initialize a clone or to have a privatizer with an init mold argument that is used, this will be true. Then, a barrier will be inserted e.g. if it is true and there are any lastprivate
symbols. Is that the intended behavior?
My concern is that it doesn't look at whether the lastprivate
symbol is the same one that caused the flag to be set. This is something that is triggered in the following example:
subroutine simd()
integer :: i, j
integer, dimension ( 3 ) :: x
!$omp simd collapse(2) private(x)
do i = 1, 10
do j = 1, 10
call foo(x)
end do
end do
!$omp end simd
end subroutine simd
The i
variable is implicitly marked as lastprivate
(not sure where that's done, but I'm assuming it's simd
-specific behavior) and there is a private
clause for a non-primitive type, so this results in the insertion of a barrier. However, replacing simd
for do
no longer causes this issue because i
is no longer lastprivate
then. I was just wondering if it's on purpose that we're adding a barrier due to different symbols meeting each side of the condition.
I guess that would have been triggered previously as well, it's just that this change made it also happen for delayed privatization, which then impacts composite construct lowering and ends up triggering an MLIR verifier error for target SPMD due to the added barrier.
Maybe @ergawy or @luporl can also chime in if you know whether this is the expected behavior. I have a small patch updating mightHaveReadHostSym
so that a barrier would not be inserted in this case, which I can share if this isn't expected behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIU, a barrier should not be needed in this case, since it is not possible to have concurrent read/write of the same memory region, as i
is only written to in the last iteration and, as it seems, x
is only read during privatization (this could probably be optimized, since x
's shape is known at compile time).
I hadn't noticed it before, but having a single mightHaveReadHostSym
variable for all privatized symbols is a problem indeed. A possible solution would be to have a per-symbol mightHaveReadHostSym
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the quick response. I just created #127074 to address the second part of the issue. Though it looks like, from what you're saying, we might want to treat induction variables differently as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick fix.
I only wanted to point out that, in this case, x
wouldn't even need to be read at all.
Actually, this optimization is implemented in #125901.
Do you think there is any advantage in treating induction variables differently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thank you for the explanation, that makes sense. I would try to have as few edge cases as possible, so if we can get away with treating every variable the same way, I'd prefer it.
DO loop bounds have to be scalars, so it seems to me that the induction variable couldn't trigger this via the mightHaveReadHostSym=true
path. Also, semantics triggers an error about passing a loop induction variable into a firstprivate
clause, so it seems like we should be fine without adding an edge case for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intention of this work is to give MLIR->LLVMIR conversion freedom to
control how the private variable is allocated so that it can be
allocated on the stack in ordinary cases or as part of a structure used
to give closure context for tasks which might outlive the current stack
frame. See RFC:
https://discourse.llvm.org/t/rfc-openmp-supporting-delayed-task-execution-with-firstprivate-variables/83084
For example, a privatizer for an integer used to look like
After this change, allocation become implicit in the operation:
For more complex types that require initialization after allocation, an init region can be used:
This patch lays the groundwork for delayed task execution but is not enough on its own.
After this patch all gfortran tests which previously passed still pass. There
are the following changes to the Fujitsu test suite:
In flang the lowering re-uses the existing lowering used for reduction init and dealloc regions.
In flang, before this patch we hit a TODO with the same wording when
generating the copy region for firstprivate polymorphic variables. After
this patch the box-like fir.class is passed by reference into the copy
region, leading to a different path that didn't hit that old TODO but
the generated code still didn't work so I added a new TODO in
DataSharingProcessor.
Please read mlir changes first and then flang changes.
This has to be merged as one massive patch to avoid testsuite regressions,
but for ease of review I have broken it up into a series of (mostly) short patches.
In these patches I have only updated unit tests as I go along. They will only all
pass at the end of the series.
I have tried to make the changes to init/dealloc region generation self-contained
so that it is easier to see what changed when reviewing. For some commits I have
added extra commentary in the commit message for that commit.
By the end of this series the init/dealloc region generation is a bit hard to follow so I will refactor
it in a follow-up (hopefull NFC) patch.
In flang lowering I box up all arrays and pass the boxes by reference so
that the existing code for reduction init and dealloc regions can be
shared.
The TODOs for pointers, derived types, characters etc will be resolved
in later patches in this same series (to be squashed into this one). I
separated it to make it easier to review.
Assumed rank was already broken before this patch.
I can't find any mention of "assumed rank" in the openmp standard so
I guess it is not prohibited.
Other than the omp.private operation definition changes, the test changes
are mostly down to slightly different codegen from re-using the reduction
init region. That code is already well tested so I didn't want to change it.
@luporl please could you review my re-implementation of your recent fix in 4d52dde and especially the changes to the logic for the inclusion of barriers.