[flang][OpenMP] Upstream do concurrent loop-nest detection. #127478

Closed
wants to merge 2 commits into from
4 changes: 4 additions & 0 deletions clang/include/clang/Driver/Options.td
@@ -6927,6 +6927,10 @@ defm loop_versioning : BoolOptionWithoutMarshalling<"f", "version-loops-for-stri

def fhermetic_module_files : Flag<["-"], "fhermetic-module-files">, Group<f_Group>,
HelpText<"Emit hermetic module files (no nested USE association)">;

def fdo_concurrent_to_openmp_EQ : Joined<["-"], "fdo-concurrent-to-openmp=">,
HelpText<"Try to map `do concurrent` loops to OpenMP [none|host|device]">,
Values<"none, host, device">;
} // let Visibility = [FC1Option, FlangOption]

def J : JoinedOrSeparate<["-"], "J">,
3 changes: 2 additions & 1 deletion clang/lib/Driver/ToolChains/Flang.cpp
@@ -153,7 +153,8 @@ void Flang::addCodegenOptions(const ArgList &Args,
CmdArgs.push_back("-fversion-loops-for-stride");

Args.addAllArgs(CmdArgs,
-{options::OPT_flang_experimental_hlfir,
+{options::OPT_fdo_concurrent_to_openmp_EQ,
+options::OPT_flang_experimental_hlfir,
options::OPT_flang_deprecated_no_hlfir,
options::OPT_fno_ppc_native_vec_elem_order,
options::OPT_fppc_native_vec_elem_order,
229 changes: 229 additions & 0 deletions flang/docs/DoConcurrentConversionToOpenMP.md
@@ -0,0 +1,229 @@
<!--===- docs/DoConcurrentConversionToOpenMP.md

Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
See https://llvm.org/LICENSE.txt for license information.
SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

-->

# `DO CONCURRENT` mapping to OpenMP

```{contents}
---
local:
---
```

This document describes the effort to parallelize `do concurrent` loops by
mapping them to OpenMP worksharing constructs. The goals of this document
are:
* Describing how to instruct `flang` to map `DO CONCURRENT` loops to OpenMP
constructs.
* Tracking the current status of such mapping.
* Describing the limitations of the current implementation.
* Describing next steps.
* Tracking the current upstreaming status (from the AMD ROCm fork).

## Usage

In order to enable `do concurrent` to OpenMP mapping, `flang` adds a new
compiler flag: `-fdo-concurrent-to-openmp`. This flag has 3 possible values:
1. `host`: this maps `do concurrent` loops to run in parallel on the host CPU.
This maps such loops to the equivalent of `omp parallel do`.
2. `device`: this maps `do concurrent` loops to run in parallel on a target device.
This maps such loops to the equivalent of
`omp target teams distribute parallel do`.
3. `none`: this disables `do concurrent` mapping altogether. In that case, such
loops are emitted as sequential loops.

The `-fdo-concurrent-to-openmp` compiler switch is currently available only when
OpenMP is also enabled. So you need to provide the following options to flang in
order to enable it:
```
flang ... -fopenmp -fdo-concurrent-to-openmp=[host|device|none] ...
```
For mapping to device, the target device architecture must be specified as well.
See `-fopenmp-targets` and `--offload-arch` for more info.
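
As a rough illustration of how the three flag values relate to the
`DoConcurrentMappingKind` enum introduced by this patch, the sketch below maps
the text after `=` to an enumerator. The helper `parseDoConcurrentMapping` is
hypothetical and only meant to show the idea; it is not the driver's actual
parsing code.

```cpp
#include <cassert>
#include <optional>
#include <string_view>

// Mirrors the enum added in flang/include/flang/Optimizer/OpenMP/Utils.h.
enum class DoConcurrentMappingKind { DCMK_None, DCMK_Host, DCMK_Device };

// Hypothetical helper (an assumption, not part of the patch): translate the
// value given to `-fdo-concurrent-to-openmp=` into a mapping kind. Unknown
// spellings yield std::nullopt so that a caller could report a diagnostic.
std::optional<DoConcurrentMappingKind>
parseDoConcurrentMapping(std::string_view value) {
  if (value == "none")
    return DoConcurrentMappingKind::DCMK_None;
  if (value == "host")
    return DoConcurrentMappingKind::DCMK_Host;
  if (value == "device")
    return DoConcurrentMappingKind::DCMK_Device;
  return std::nullopt;
}
```

With this sketch, `parseDoConcurrentMapping("host")` yields `DCMK_Host`, while
an unsupported spelling such as `"gpu"` yields an empty optional.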

## Current status

Under the hood, `do concurrent` mapping is implemented in the
`DoConcurrentConversionPass`. This is still an experimental pass which means
that:
* It has been tested in a very limited way so far.
* It has been tested mostly on simple synthetic inputs.

### Loop nest detection

On the `FIR` dialect level, the following loop:
```fortran
do concurrent(i=1:n, j=1:m, k=1:o)
a(i,j,k) = i + j + k
end do
```
is modelled as a nest of `fir.do_loop` ops such that an outer loop's region
contains **only** the following:
1. The operations needed to assign/update the outer loop's induction variable.
1. The inner loop itself.

So the MLIR structure for the above example looks similar to the following:
```
fir.do_loop %i_idx = %34 to %36 step %c1 unordered {
%i_idx_2 = fir.convert %i_idx : (index) -> i32
fir.store %i_idx_2 to %i_iv#1 : !fir.ref<i32>

fir.do_loop %j_idx = %37 to %39 step %c1_3 unordered {
%j_idx_2 = fir.convert %j_idx : (index) -> i32
fir.store %j_idx_2 to %j_iv#1 : !fir.ref<i32>

fir.do_loop %k_idx = %40 to %42 step %c1_5 unordered {
%k_idx_2 = fir.convert %k_idx : (index) -> i32
fir.store %k_idx_2 to %k_iv#1 : !fir.ref<i32>

... loop nest body goes here ...
}
}
}
```
This applies to multi-range loops in general; they are represented in the IR as
a nest of `fir.do_loop` ops with the above nesting structure.
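
The nesting rule above can be sketched on a toy data structure. The `Op` type
below is a made-up stand-in for MLIR operations (it is not the FIR or MLIR
API), and classifying ops by a `kind` tag is a simplifying assumption; the
real pass has to check that the ops actually update the loop's induction
variable.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for an operation; NOT the real MLIR/FIR API.
struct Op {
  enum Kind { Convert, Store, Loop, Other } kind;
  std::vector<Op> body; // Only meaningful when kind == Loop.
};

// Returns how many loops, starting at `loop`, form a perfect nest under the
// rule described above: a loop's body may hold only IV-update ops (modelled
// here as Convert/Store) and exactly one inner loop. A loop that does not
// satisfy the rule ends the nest, so the result is at least 1.
std::size_t perfectNestDepth(const Op &loop) {
  const Op *inner = nullptr;
  for (const Op &op : loop.body) {
    if (op.kind == Op::Loop) {
      if (inner)
        return 1; // More than one nested loop: not a perfect nest.
      inner = &op;
    } else if (op.kind != Op::Convert && op.kind != Op::Store) {
      return 1; // Intervening op: the nest stops at this loop.
    }
  }
  return inner ? 1 + perfectNestDepth(*inner) : 1;
}
```

For a model of the three-level example above, this returns 3; adding any
`Other` op next to the `j` loop inside the `i` loop makes it return 1.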

Therefore, the pass detects such "perfectly" nested loop ops to identify multi-range
loops and map them as "collapsed" loops in OpenMP.
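
To illustrate what "collapsed" means conceptually, the sketch below treats the
`n * m * o` iterations of the example as one linear iteration space and
recovers `(i, j, k)` from a linear iteration number. This is only an
illustration of the idea; the function name `delinearize` is an assumption and
the code does not describe how the pass or an OpenMP runtime computes indices.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Recover the 1-based (i, j, k) of the Fortran example from a 0-based linear
// iteration number in [0, n*m*o). `k` belongs to the innermost loop, so it
// varies fastest.
std::array<std::int64_t, 3> delinearize(std::int64_t linear, std::int64_t n,
                                        std::int64_t m, std::int64_t o) {
  assert(linear >= 0 && linear < n * m * o);
  const std::int64_t k = linear % o;
  const std::int64_t j = (linear / o) % m;
  const std::int64_t i = linear / (o * m);
  return {i + 1, j + 1, k + 1};
}
```

Because every linear iteration number corresponds to exactly one `(i, j, k)`
triple, the whole nest can be divided among threads as a single loop.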

#### Further info regarding loop nest detection

Loop nest detection is currently limited to the scenario described in the previous
section. This is quite restrictive and can be extended in the future to cover
more cases. For example, in the following loop nest, even though both loops are
perfectly nested, only the outer loop is currently parallelized:
```fortran
do concurrent(i=1:n)
do concurrent(j=1:m)
a(i,j) = i * j
end do
end do
```

Similarly, for the following loop nest, even though the intervening statement `x = 41`
does not have any memory effects that would affect parallelization, the nest is
not detected as a loop nest either; only the outer loop is parallelized.

```fortran
do concurrent(i=1:n)
x = 41
do concurrent(j=1:m)
a(i,j) = i * j
end do
end do
```

The above also has the consequence that the `j` variable will **not** be
privatized in the OpenMP parallel/target region. In other words, it will be
treated as if it were a `shared` variable. For more details about privatization,
see the "Data environment" section below.

See `flang/test/Transforms/DoConcurrent/loop_nest_test.f90` for more examples
of what is and is not detected as a perfect loop nest.

<!--
More details about current status will be added along with relevant parts of the
implementation in later upstreaming patches.
-->

## Next steps

This section describes some of the open questions/issues that are not tackled yet
even in the downstream implementation.

### Delayed privatization

So far, we emit the privatization logic for IVs inline in the parallel/target
region. This is enough for our purposes right now since we don't
localize/privatize any sophisticated types of variables yet. Once we need
more advanced localization through `do concurrent`'s locality specifiers
(see below), delayed privatization will enable us to have a much cleaner IR.
Once the upstream implementation of delayed privatization supports the
constructs required by the pass, we will move to it rather than inlined/early
privatization.

### Locality specifiers for `do concurrent`

Locality specifiers will enable the user to control the data environment of the
loop nest in a more fine-grained way. Implementing these specifiers on the
`FIR` dialect level is needed in order to support this in the
`DoConcurrentConversionPass`.

Such specifiers will also unlock a potential solution to the
non-perfectly-nested loops' IVs issue described above. In particular, for a
non-perfectly nested loop, one middle-ground proposal/solution would be to:
* Emit the loop's IV as shared/mapped just like we do currently.
* Emit a warning that the IV of the loop is emitted as shared/mapped.
* Given support for `LOCAL`, we can recommend the user to explicitly
localize/privatize the loop's IV if they choose to.

#### Sharing TableGen clause records from the OpenMP dialect

At the moment, the FIR dialect does not have a way to model locality specifiers
on the IR level. Instead, something similar to early/eager privatization in OpenMP
is done for the locality specifiers in `fir.do_loop` ops. Having locality specifiers
modelled in a way similar to delayed privatization (i.e. the `omp.private` op) and
reductions (i.e. the `omp.declare_reduction` op) can make mapping `do concurrent`
to OpenMP (and other parallel programming models) much easier.

Therefore, one way to approach this problem is to extract the TableGen records
for relevant OpenMP clauses in a shared dialect for "data environment management"
and use these shared records for OpenMP, `do concurrent`, and possibly OpenACC
as well.

#### Supporting reductions

Similar to locality specifiers, mapping reductions from `do concurrent` to OpenMP
is also still an open TODO. We can potentially extend the MLIR infrastructure
proposed in the previous section to share reduction records among the different
relevant dialects as well.

### More advanced detection of loop nests

As pointed out earlier, any intervening code between the headers of 2 nested
`do concurrent` loops prevents us from detecting this as a loop nest. In some
cases this is overly conservative. Therefore, a more flexible detection logic
of loop nests needs to be implemented.

### Data-dependence analysis

Right now, we map loop nests without analysing whether such mapping is safe to
do or not. We probably need to at least warn the user of unsafe loop nests due
to loop-carried dependencies.
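
To make the concern concrete, the sketch below runs a loop body in ascending
and in descending iteration order and compares the results. A body with a
loop-carried dependency gives different results, which is why running its
iterations in parallel is unsafe. This brute-force check is purely
illustrative; the names `runLoop` and `isOrderSensitive` are assumptions, it is
not the analysis the pass would implement, and equal results for these two
orders do not prove that a loop is safe.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Body = std::function<void(std::vector<int> &, std::size_t)>;

// Runs `body` for i = 1 .. size-1 on a copy of `data`, either in ascending
// order or (when `reversed` is true) in descending order.
std::vector<int> runLoop(std::vector<int> data, const Body &body,
                         bool reversed) {
  const std::size_t n = data.size();
  for (std::size_t step = 1; step < n; ++step) {
    const std::size_t i = reversed ? n - step : step;
    body(data, i);
  }
  return data;
}

// True when the two iteration orders disagree, i.e. the body is certainly
// sensitive to the order in which iterations execute.
bool isOrderSensitive(const std::vector<int> &init, const Body &body) {
  return runLoop(init, body, false) != runLoop(init, body, true);
}
```

A body such as `a[i] = a[i - 1] + 1` is reported as order sensitive, while
`a[i] = 2 * i` is not.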

### Non-rectangular loop nests

So far, we have not needed to use the pass for non-rectangular loop nests. For
example:
```fortran
do concurrent(i=1:n)
do concurrent(j=i:n)
...
end do
end do
```
We defer this to the (hopefully) near future when we get the conversion in
good shape for the samples/projects at hand.
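
One reason such nests need extra care is that the inner loop's bounds depend on
the outer induction variable, so the total iteration count is no longer the
product of two independent trip counts. The sketch below simply counts the
iterations of the example above; the function names are illustrative
assumptions.

```cpp
#include <cstdint>

// Counts the iterations of `do i = 1, n` / `do j = i, n` by enumeration.
std::int64_t triangularTripCount(std::int64_t n) {
  std::int64_t count = 0;
  for (std::int64_t i = 1; i <= n; ++i)
    for (std::int64_t j = i; j <= n; ++j)
      ++count;
  return count;
}

// Trip count of the rectangular nest `do i = 1, n` / `do j = 1, n`.
std::int64_t rectangularTripCount(std::int64_t n) { return n * n; }
```

For `n = 4`, the triangular nest runs 10 iterations (that is, `n * (n + 1) / 2`)
whereas the rectangular one runs 16.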

### Generalizing the pass to other parallel programming models

Once we have a stable and capable `do concurrent` to OpenMP mapping, we can take
this in a more generalized direction and allow the pass to target other models;
e.g. OpenACC. This goal should be kept in mind from the get-go even while only
targeting OpenMP.


## Upstreaming status

- [x] Command line options for `flang` and `bbc`.
- [x] Conversion pass skeleton (no transformations happen yet).
- [x] Status description and tracking document (this document).
- [x] Loop nest detection to identify multi-range loops.
- [ ] Basic host/CPU mapping support.
- [ ] Basic device/GPU mapping support.
- [ ] More advanced host and device support (expanded to multiple items as needed).
1 change: 1 addition & 0 deletions flang/docs/index.md
@@ -50,6 +50,7 @@ on how to get in touch with us and to learn more about the current status.
DebugGeneration
Directives
DoConcurrent
DoConcurrentConversionToOpenMP
Extensions
F202X
FIRArrayOperations
2 changes: 2 additions & 0 deletions flang/include/flang/Frontend/CodeGenOptions.def
@@ -41,5 +41,7 @@ ENUM_CODEGENOPT(DebugInfo, llvm::codegenoptions::DebugInfoKind, 4, llvm::codeg
ENUM_CODEGENOPT(VecLib, llvm::driver::VectorLibrary, 3, llvm::driver::VectorLibrary::NoLibrary) ///< Vector functions library to use
ENUM_CODEGENOPT(FramePointer, llvm::FramePointerKind, 2, llvm::FramePointerKind::None) ///< Enable the usage of frame pointers

ENUM_CODEGENOPT(DoConcurrentMapping, DoConcurrentMappingKind, 2, DoConcurrentMappingKind::DCMK_None) ///< Map `do concurrent` to OpenMP

#undef CODEGENOPT
#undef ENUM_CODEGENOPT
5 changes: 5 additions & 0 deletions flang/include/flang/Frontend/CodeGenOptions.h
@@ -15,6 +15,7 @@
#ifndef FORTRAN_FRONTEND_CODEGENOPTIONS_H
#define FORTRAN_FRONTEND_CODEGENOPTIONS_H

#include "flang/Optimizer/OpenMP/Utils.h"
#include "llvm/Frontend/Debug/Options.h"
#include "llvm/Frontend/Driver/CodeGenOptions.h"
#include "llvm/Support/CodeGen.h"
@@ -143,6 +144,10 @@ class CodeGenOptions : public CodeGenOptionsBase {
/// (-mlarge-data-threshold).
uint64_t LargeDataThreshold;

/// Optionally map `do concurrent` loops to OpenMP. This is only valid if
/// OpenMP is enabled.
using DoConcurrentMappingKind = flangomp::DoConcurrentMappingKind;

// Define accessors/mutators for code generation options of enumeration type.
#define CODEGENOPT(Name, Bits, Default)
#define ENUM_CODEGENOPT(Name, Type, Bits, Default) \
2 changes: 2 additions & 0 deletions flang/include/flang/Optimizer/OpenMP/Passes.h
@@ -13,6 +13,7 @@
#ifndef FORTRAN_OPTIMIZER_OPENMP_PASSES_H
#define FORTRAN_OPTIMIZER_OPENMP_PASSES_H

#include "flang/Optimizer/OpenMP/Utils.h"
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Pass/Pass.h"
@@ -30,6 +31,7 @@ namespace flangomp {
/// divided into units of work.
bool shouldUseWorkshareLowering(mlir::Operation *op);

std::unique_ptr<mlir::Pass> createDoConcurrentConversionPass(bool mapToDevice);
} // namespace flangomp

#endif // FORTRAN_OPTIMIZER_OPENMP_PASSES_H
30 changes: 30 additions & 0 deletions flang/include/flang/Optimizer/OpenMP/Passes.td
@@ -50,6 +50,36 @@ def FunctionFilteringPass : Pass<"omp-function-filtering"> {
];
}

def DoConcurrentConversionPass : Pass<"omp-do-concurrent-conversion", "mlir::func::FuncOp"> {
let summary = "Map `DO CONCURRENT` loops to OpenMP worksharing loops.";

let description = [{ This is an experimental pass to map `DO CONCURRENT` loops
to their corresponding equivalent OpenMP worksharing constructs.

For now the following is supported:
- Mapping simple loops to `parallel do`.

Still TODO:
- More extensive testing.
}];

let dependentDialects = ["mlir::omp::OpenMPDialect"];

let options = [
Option<"mapTo", "map-to",
"flangomp::DoConcurrentMappingKind",
/*default=*/"flangomp::DoConcurrentMappingKind::DCMK_None",
"Try to map `do concurrent` loops to OpenMP [none|host|device]",
[{::llvm::cl::values(
clEnumValN(flangomp::DoConcurrentMappingKind::DCMK_None,
"none", "Do not lower `do concurrent` to OpenMP"),
clEnumValN(flangomp::DoConcurrentMappingKind::DCMK_Host,
"host", "Lower to run in parallel on the CPU"),
clEnumValN(flangomp::DoConcurrentMappingKind::DCMK_Device,
"device", "Lower to run in parallel on the GPU")
)}]>,
];
}

// Needs to be scheduled on Module as we create functions in it
def LowerWorkshare : Pass<"lower-workshare", "::mlir::ModuleOp"> {
26 changes: 26 additions & 0 deletions flang/include/flang/Optimizer/OpenMP/Utils.h
@@ -0,0 +1,26 @@
//===-- Optimizer/OpenMP/Utils.h --------------------------------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// Coding style: https://mlir.llvm.org/getting_started/DeveloperGuide/
//
//===----------------------------------------------------------------------===//

#ifndef FORTRAN_OPTIMIZER_OPENMP_UTILS_H
#define FORTRAN_OPTIMIZER_OPENMP_UTILS_H

namespace flangomp {

enum class DoConcurrentMappingKind {
DCMK_None, ///< Do not lower `do concurrent` to OpenMP.
DCMK_Host, ///< Lower to run in parallel on the CPU.
DCMK_Device ///< Lower to run in parallel on the GPU.
};

} // namespace flangomp

#endif // FORTRAN_OPTIMIZER_OPENMP_UTILS_H
18 changes: 15 additions & 3 deletions flang/include/flang/Optimizer/Passes/Pipelines.h
@@ -128,16 +128,28 @@ void createHLFIRToFIRPassPipeline(
mlir::PassManager &pm, bool enableOpenMP,
llvm::OptimizationLevel optLevel = defaultOptLevel);

struct OpenMPFIRPassPipelineOpts {
/// Whether code is being generated for a target device rather than the host
/// device
bool isTargetDevice;

/// Controls how to map `do concurrent` loops; to device, host, or none at
/// all.
Fortran::frontend::CodeGenOptions::DoConcurrentMappingKind
doConcurrentMappingKind;
};

/// Create a pass pipeline for handling certain OpenMP transformations needed
/// prior to FIR lowering.
///
/// WARNING: These passes must be run immediately after the lowering to ensure
/// that the FIR is correct with respect to OpenMP operations/attributes.
///
/// \param pm - MLIR pass manager that will hold the pipeline definition.
-/// \param isTargetDevice - Whether code is being generated for a target device
-/// rather than the host device.
-void createOpenMPFIRPassPipeline(mlir::PassManager &pm, bool isTargetDevice);
+/// \param opts - options to control OpenMP code-gen; see struct docs for more
+/// details.
+void createOpenMPFIRPassPipeline(mlir::PassManager &pm,
+OpenMPFIRPassPipelineOpts opts);

#if !defined(FLANG_EXCLUDE_CODEGEN)
void createDebugPasses(mlir::PassManager &pm,