[flang][OpenMP] Translate OpenMP scopes when compiling for target device #130078
Conversation
@llvm/pr-subscribers-mlir-openmp @llvm/pr-subscribers-mlir

Author: Kareem Ergawy (ergawy)

Changes

Todo:
With these changes, we can handle inputs like:

subroutine test_sections_target_with_allocatable
implicit none
integer,parameter :: N=100, rt=8
real(kind=rt), allocatable, dimension(:) :: a
integer :: i
allocate(a(1:N))
!$omp parallel
!$omp parallel
!$omp sections
!$omp section
!$omp target
a(1) = 1.0_rt
!$omp end target
!$omp end sections
!$omp end parallel
!$omp end parallel
!$omp parallel
!$omp parallel
!$omp do
do i=1,10
!$omp target
a(i) = 1.0_rt
!$omp end target
end do
!$omp end do
!$omp end parallel
!$omp end parallel
end subroutine

Full diff: https://github.com/llvm/llvm-project/pull/130078.diff

1 File Affected:
diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
index 32c7c501d03c3..6af8856618f27 100644
--- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
+++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
@@ -542,6 +542,13 @@ static llvm::omp::ProcBindKind getProcBindKind(omp::ClauseProcBindKind kind) {
llvm_unreachable("Unknown ClauseProcBindKind kind");
}
+static void forwardArgs(LLVM::ModuleTranslation &moduleTranslation,
+ llvm::ArrayRef<BlockArgument> blockArgs,
+ OperandRange operands) {
+ for (auto [arg, var] : llvm::zip_equal(blockArgs, operands))
+ moduleTranslation.mapValue(arg, moduleTranslation.lookupValue(var));
+}
+
/// Helper function to map block arguments defined by ignored loop wrappers to
/// LLVM values and prevent any uses of those from triggering null pointer
/// dereferences.
@@ -554,18 +561,12 @@ convertIgnoredWrapper(omp::LoopWrapperInterface opInst,
// Map block arguments directly to the LLVM value associated to the
// corresponding operand. This is semantically equivalent to this wrapper not
// being present.
- auto forwardArgs =
- [&moduleTranslation](llvm::ArrayRef<BlockArgument> blockArgs,
- OperandRange operands) {
- for (auto [arg, var] : llvm::zip_equal(blockArgs, operands))
- moduleTranslation.mapValue(arg, moduleTranslation.lookupValue(var));
- };
-
return llvm::TypeSwitch<Operation *, LogicalResult>(opInst)
.Case([&](omp::SimdOp op) {
auto blockArgIface = cast<omp::BlockArgOpenMPOpInterface>(*op);
- forwardArgs(blockArgIface.getPrivateBlockArgs(), op.getPrivateVars());
- forwardArgs(blockArgIface.getReductionBlockArgs(),
+ forwardArgs(moduleTranslation, blockArgIface.getPrivateBlockArgs(),
+ op.getPrivateVars());
+ forwardArgs(moduleTranslation, blockArgIface.getReductionBlockArgs(),
op.getReductionVars());
op.emitWarning() << "simd information on composite construct discarded";
return success();
@@ -5255,6 +5256,49 @@ convertTargetOpsInNest(Operation *op, llvm::IRBuilderBase &builder,
return WalkResult::interrupt();
return WalkResult::skip();
}
+
+ // Non-target ops might nest target-related ops, therefore, we
+ // translate them as non-OpenMP scopes. Translating them is needed by
+ // nested target-related ops since they might use LLVM values defined in
+ // their parent non-target ops.
+ if (isa<omp::OpenMPDialect>(oper->getDialect()) &&
+ oper->getParentOfType<LLVM::LLVMFuncOp>() &&
+ !oper->getRegions().empty()) {
+
+ // TODO Handle other ops with entry block args.
+ if (auto wsloop = dyn_cast<omp::WsloopOp>(oper)) {
+ auto blockArgIface = cast<omp::BlockArgOpenMPOpInterface>(oper);
+ if (blockArgIface) {
+ forwardArgs(moduleTranslation,
+ blockArgIface.getPrivateBlockArgs(),
+ wsloop.getPrivateVars());
+ forwardArgs(moduleTranslation,
+ blockArgIface.getReductionBlockArgs(),
+ wsloop.getReductionVars());
+ }
+ }
+
+ if (auto loopNest = dyn_cast<omp::LoopNestOp>(oper)) {
+ for (auto iv : loopNest.getIVs()) {
+ // Create fake allocas just to maintain IR validity.
+ moduleTranslation.mapValue(
+ iv, builder.CreateAlloca(
+ moduleTranslation.convertType(iv.getType())));
+ }
+ }
+
+ for (Region &region : oper->getRegions()) {
+ auto result = convertOmpOpRegions(
+ region, oper->getName().getStringRef().str() + ".fake.region",
+ builder, moduleTranslation);
+ if (result.takeError())
+ return WalkResult::interrupt();
+ builder.SetInsertPoint(result.get(), result.get()->end());
+ }
+
+ return WalkResult::skip();
+ }
+
return WalkResult::advance();
}).wasInterrupted();
return failure(interrupted);
Thank you for this work Kareem. I think this approach should work, but we need to keep an eye out in case there are other edge cases that break this, since we're translating inapplicable operations that we discard later.
Long term, I think we should try to go the opposite direction and keep as few host operations as we can get away with (possibly just omp.map.info, omp.map.bounds, global definitions and recipe ops like omp.declare_reduction, omp.private, etc.). The root issue appears to be that we're gathering non-static map information during compilation for both host and device that only actually exists on the host and is only used there, and this wasn't crashing the compiler because top-level operations inside of an llvm.func are translated to LLVM IR. I think we should probably work on avoiding that and stop keeping and translating these top-level operations at all, but that's something for the near future IMO.
One other nit: we would probably want to rename convertTargetOpsInNest to something like convertHostOpsForDevice, since that's what it would be doing now.
    oper->getParentOfType<LLVM::LLVMFuncOp>() &&
    !oper->getRegions().empty()) {

// TODO Handle other ops with entry block args.
It's an unfortunate limitation of the current implementation of the BlockArgOpenMPOpInterface that we can't also get the matching operation arguments associated to the block arguments. I think something like what's currently done for the numXyzBlockArgs() methods should work. Then, here, instead of checking for each type, we could do something like:
if (auto blockArgIface = dyn_cast<BlockArgOpenMPOpInterface>(oper)) {
  for (auto [arg, var] : llvm::zip_equal(llvm::concat(blockArgIface.getPrivateBlockArgs(), ...), blockArgIface.getPrivateVars(), ...))
    moduleTranslation.mapValue(arg, moduleTranslation.lookupValue(var));
}
I'll work on that, since it's been on my TODO list for some time, but no need to block this work waiting for it. We can simplify it later.
Thanks for the interface updates. They are awesome. Used them in the PR.
✅ With the latest revision this PR passed the C/C++ code formatter.
iv, builder.CreateAlloca(
        moduleTranslation.convertType(iv.getType())));
Is this somehow guaranteed to be inserting the allocas in the right spot or should we set the insertion point before this?
This is (indirectly) called from ModuleTranslation::convertOperation, which dispatches convertOperation to all op interfaces. Looking at the implementation, having a valid insertion point is a precondition assumed by all implementations of LLVMTranslationDialectInterface, including, for example, LLVMDialectLLVMIRTranslationInterface::convertOperation; see https://github.com/llvm/llvm-project/blob/main/mlir/lib/Target/LLVMIR/Dialect/LLVMIR/LLVMToLLVMIRTranslation.cpp#L297.
To be safe, I added an assert.
Yes, I meant that in other places we need to call findAllocaInsertPoint before adding new allocas. That will either return the entry block of the function or an alternative spot, if an OpenMPAllocaStackFrame was introduced by a parent operation.
So the current insertion point there will always be valid, as you say, but maybe it wouldn't be the right place for allocas specifically. I'm not that familiar with this, so I may be wrong. At the end of the day, the operations created here are later removed from the LLVM module, so maybe this doesn't even matter.
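For reference, routing the placeholder through that helper could look roughly like the sketch below, assuming the findAllocaInsertPoint helper already defined in OpenMPToLLVMIRTranslation.cpp; this is only an illustration of the suggestion, not the actual patch:
// Illustration: create the placeholder at the dedicated alloca insertion
// point instead of the current insertion point, then restore the builder.
llvm::OpenMPIRBuilder::InsertPointTy allocaIP =
    findAllocaInsertPoint(builder, moduleTranslation);
llvm::IRBuilderBase::InsertPointGuard guard(builder);
builder.restoreIP(allocaIP);
llvm::Value *placeholder =
    builder.CreateAlloca(moduleTranslation.convertType(iv.getType()));
moduleTranslation.mapValue(iv, placeholder);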
One thing that I'm thinking about, for both this case and OpenMP ops that return values, is whether it makes sense to use undef to represent them temporarily until they are removed, or something else, rather than alloca.
I think undef is a better idea than using a fake alloca, less confusing. Did that. Edit: used poison since undef is deprecated.
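A minimal sketch of what mapping the loop-nest IVs to poison (instead of creating fake allocas) might look like; names follow the diff above, but treat this as an illustration rather than the exact final change:
// Map each IV to a poison value of its converted type. These placeholders
// only keep the IR valid until the host-only ops are discarded from the
// device module.
if (auto loopNest = dyn_cast<omp::LoopNestOp>(oper)) {
  for (mlir::Value iv : loopNest.getIVs())
    moduleTranslation.mapValue(
        iv, llvm::PoisonValue::get(
                moduleTranslation.convertType(iv.getType())));
}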
    })
    .Case([&](omp::TaskOp taskOp) {
      forwardPrivateArgs(taskOp, moduleTranslation);
    });
One way to figure out which ops we need to look at here is to search for OpenMP_InReductionClause, OpenMP_PrivateClause, OpenMP_ReductionClause, OpenMP_TaskReductionClause, OpenMP_UseDeviceAddrClause and OpenMP_UseDevicePtrClause in OpenMPOps.td. I think we need to have cases for all of them before merging this, to avoid crashing the compiler.
Yes, I was going to say that if there is a foolproof way of making sure we have an exhaustive and complete list of ops that can have omp.target ops nested in them and will need to forward their args, it'd be very useful.
Should cover all ops now, after using Sergio's latest updates to BlockArgOpenMPOpInterface.
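Roughly, the generic handling enabled by the updated interface could look like the sketch below; the accessor name on BlockArgOpenMPOpInterface is assumed here (getBlockArgsPairs), so take it as an outline rather than the literal code:
// For any op implementing the interface, map every entry block argument to
// the LLVM value of its corresponding operand, regardless of which clause
// (private, reduction, in_reduction, ...) introduced it.
if (auto blockArgIface = dyn_cast<omp::BlockArgOpenMPOpInterface>(oper)) {
  for (auto [var, arg] : blockArgIface.getBlockArgsPairs())
    moduleTranslation.mapValue(arg, moduleTranslation.lookupValue(var));
}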
for (Region &region : oper->getRegions()) {
  auto result = convertOmpOpRegions(
      region, oper->getName().getStringRef().str() + ".fake.region",
Why is this named .fake.region? As far as my understanding of the problem and the solution you have implemented goes, this is a region that has been created by this pass.
It is fake in the sense that it is not a truthful translation of the OpenMP construct being converted. We just need this to prepare the kernel invocation args. Added a comment to clarify.
// in their parent non-target ops.
if (isa<omp::OpenMPDialect>(oper->getDialect()) &&
    oper->getParentOfType<LLVM::LLVMFuncOp>() &&
    !oper->getRegions().empty()) {
I'm wondering what should happen to OpenMP ops that don't have regions. If they return a value, it seems like that value could end up impacting what's passed into an omp.map.info as an argument. Maybe we should map their results to something as well.
Can you provide an example where this might happen?
I was thinking of something like this (since omp.threadprivate seems to be the only non-map-related OpenMP op that returns some value at the moment):
module test
implicit none
integer :: n
!$omp threadprivate(n)
contains
subroutine foo(x)
integer, intent(inout) :: x(10)
!$omp target map(tofrom: x(1:n))
call bar(x)
!$omp end target
end subroutine
end module
Lowering it to MLIR for the device results in the following sequence of operations:
%3 = omp.threadprivate ...
%4 = fir.declare %3
%13 = omp.map.info var_ptr(%4 : !fir.ref<i32>, i32) ...
omp.target map_entries(%13 -> %arg2 ...) {
...
}
Thanks for the example. However, this is not an issue since, when translating the surrounding region, the LLVM value corresponding to the original n value will be resolved:
omp.parallel.fake.region:
%2 = load i32, ptr @_QFtestEn, align 4
%3 = sext i32 %2 to i64
%4 = sub i64 %3, 1
%5 = sub i64 %4, 0
%6 = add i64 %5, 1
%7 = mul i64 1, %6
%8 = mul i64 %7, 4
br label %omp.region.cont
(Just to clarify, I surrounded the target op with parallel to make sure the issue could be reproduced.)
So the translation happens without issues.
I think we should keep things simple and not add the extra translation for non-region ops since there does not seem to be a need for that, at least at the moment. I know it would not be much more effort; the main reason is just to simplify things, even if only by a little bit. Let me know if you disagree.
I'm surprised that the example above isn't causing crashes, since %3 is not mapped to any LLVM value and it contributes to the initialization of %4, which is passed to an omp.map.info.
Actually, if that had caused a crash, it looks like any non-OpenMP dialect operation found during the op->walk() in convertTargetOpsInNest would have triggered it in the same way, since their results are not mapped to anything there either. I'm not clear how this is working, but apparently it is.
I will dig deeper into this and try to provide a more detailed explanation.
✅ With the latest revision this PR passed the undef deprecator.
If a `target` directive is nested in a host OpenMP directive (e.g. parallel, task, or a worksharing loop), flang currently crashes if the target directive-related MLIR ops (e.g. `omp.map.bounds` and `omp.map.info`) depend on SSA values defined inside the parent host OpenMP directives/ops. This PR tries to solve this problem by treating these parent OpenMP ops as "SSA scopes". Whenever we are translating for the device, instead of completely translating host ops, we just translate their MLIR ops as pure SSA values.
Ping! Please take a look when you have time.
Hi Kareem,
aomp/bin$ ./run_sollve.sh test_team_default_shared.F90
lvm-project/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp:464: llvm::Expe
Taking a look ...
…for target device (#132165) Fixes a bug introduced by #130078.

For non-BlockArgOpenMPOpInterface ops, we also want to map their entry block arguments to their operands, if any. For the current support in the OpenMP dialect, the table below lists all ops that have arguments (SSA operands and/or attributes) and are not target-related. Of all these ops, we only need to process `omp.atomic.update` since it is the only op that has SSA operands and an attached region. Therefore, the region's entry block arguments must be mapped to the op's operands in case they are referenced inside the region.

| op | operands? | region(s)? | parent is func? | processed? |
|-------------------|-----|-----|-----|-----|
| atomic.read | yes | no | yes | no |
| atomic.write | yes | no | yes | no |
| atomic.update | yes | yes | yes | yes |
| critical | no | no | yes | no |
| declare_mapper | no | yes | no | no |
| declare_reduction | no | yes | no | no |
| flush | yes | no | yes | no |
| private | no | yes | yes | no |
| threadprivate | yes | no | yes | no |
| yield | yes | no | yes | no |
…rget device (llvm#130078)" This reverts commit e737b84.
…ice (llvm#130078) If a `target` directive is nested in a host OpenMP directive (e.g. parallel, task, or a worksharing loop), flang currently crashes if the target directive-related MLIR ops (e.g. `omp.map.bounds` and `omp.map.info`) depend on SSA values defined inside the parent host OpenMP directives/ops. This PR tries to solve this problem by treating these parent OpenMP ops as "SSA scopes". Whenever we are translating for the device, instead of completely translating host ops, we just translate their MLIR ops as pure SSA values.
If a `target` directive is nested in a host OpenMP directive (e.g. parallel, task, or a worksharing loop), flang currently crashes if the target directive-related MLIR ops (e.g. `omp.map.bounds` and `omp.map.info`) depend on SSA values defined inside the parent host OpenMP directives/ops.
This PR tries to solve this problem by treating these parent OpenMP ops as "SSA scopes". Whenever we are translating for the device, instead of completely translating host ops, we just translate their MLIR ops as pure SSA values.