Commit 2f9d714

[flang][OpenMP] Upstream first part of do concurrent mapping (llvm#126026)
This PR starts the effort to upstream AMD's internal implementation of `do concurrent` to OpenMP mapping. This replaces llvm#77285 since we extended this WIP quite a bit on our fork over the past year.

An important part of this PR is a document that describes the current status downstream, the upstreaming status, and next steps to make this pass much more useful.

In addition to this document, this PR also contains the skeleton of the pass (no useful transformations are done yet) and some testing for the added command line options.

This looks like a huge PR but a lot of the added stuff is documentation.

It is also worth noting that the downstream pass has been validated on https://github.com/BerkeleyLab/fiats. For the CPU mapping, this achieved performance speed-ups that match pure OpenMP. For GPU mapping, we are still working on extending our support for implicit memory mapping and locality specifiers.

PR stack:
- llvm#126026 (this PR)
- llvm#127595
- llvm#127633
- llvm#127634
- llvm#127635
1 parent ae54ead commit 2f9d714
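
For readers unfamiliar with the mapping this PR describes, the following is a minimal hand-written sketch of the source-level equivalence the documentation states: `host` corresponds to `omp parallel do` and `device` to `omp target teams distribute parallel do`. It is not output of the pass, which in this commit is only a skeleton that performs no transformations. The program name, array names, and problem size are illustrative assumptions.

```fortran
program do_concurrent_mapping_sketch
  implicit none
  integer, parameter :: n = 1000   ! illustrative size, not from the PR
  integer :: i
  integer :: a(n), b(n), c(n)

  ! Original form: iterations are independent and may run in any order.
  do concurrent (i = 1:n)
    a(i) = i * i
  end do

  ! Hand-written equivalent of the documented `host` mapping.
  ! The loop variable `i` is private to each thread by default.
  !$omp parallel do
  do i = 1, n
    b(i) = i * i
  end do
  !$omp end parallel do

  ! Hand-written equivalent of the documented `device` mapping.
  ! Without an offload target this region falls back to the host.
  !$omp target teams distribute parallel do map(from: c)
  do i = 1, n
    c(i) = i * i
  end do
  !$omp end target teams distribute parallel do

  ! All three forms must compute the same values.
  if (any(a /= b) .or. any(a /= c)) error stop "results differ"
  print *, "all three loops agree"
end program do_concurrent_mapping_sketch
```

Any Fortran compiler with OpenMP support should accept this sketch. Once the later patches in the stack land, the intent is that the first loop alone, compiled with `-fopenmp -fdo-concurrent-to-openmp=host` or `=device`, would be mapped by the compiler without the explicit directives.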

24 files changed (+147 -285 lines changed)

clang/include/clang/Driver/Options.td

Lines changed: 3 additions & 3 deletions
@@ -7130,9 +7130,9 @@ defm loop_versioning : BoolOptionWithoutMarshalling<"f", "version-loops-for-stri
 def fhermetic_module_files : Flag<["-"], "fhermetic-module-files">, Group<f_Group>,
 HelpText<"Emit hermetic module files (no nested USE association)">;

-def do_concurrent_parallel_EQ : Joined<["-"], "fdo-concurrent-parallel=">,
-HelpText<"Try to map `do concurrent` loops to OpenMP (on host or device)">,
-Values<"none,host,device">;
+def fdo_concurrent_to_openmp_EQ : Joined<["-"], "fdo-concurrent-to-openmp=">,
+HelpText<"Try to map `do concurrent` loops to OpenMP [none|host|device]">,
+Values<"none, host, device">;
 } // let Visibility = [FC1Option, FlangOption]

 def J : JoinedOrSeparate<["-"], "J">,

clang/lib/Driver/ToolChains/Flang.cpp

Lines changed: 1 addition & 1 deletion
@@ -165,7 +165,7 @@ void Flang::addCodegenOptions(const ArgList &Args,
 CmdArgs.push_back("-fversion-loops-for-stride");

 Args.addAllArgs(CmdArgs,
-{options::OPT_do_concurrent_parallel_EQ,
+{options::OPT_fdo_concurrent_to_openmp_EQ,
 options::OPT_flang_experimental_hlfir,
 options::OPT_flang_deprecated_no_hlfir,
 options::OPT_fno_ppc_native_vec_elem_order,

flang/docs/DoConcurrentConversionToOpenMP.md

Lines changed: 69 additions & 245 deletions
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,7 @@
66
77
-->
88

9-
# `DO CONCURENT` mapping to OpenMP
9+
# `DO CONCURRENT` mapping to OpenMP
1010

1111
```{contents}
1212
---
@@ -17,267 +17,52 @@ local:
1717
This document seeks to describe the effort to parallelize `do concurrent` loops
1818
by mapping them to OpenMP worksharing constructs. The goals of this document
1919
are:
20-
* Describing how to instruct `flang-new` to map `DO CONCURENT` loops to OpenMP
20+
* Describing how to instruct `flang` to map `DO CONCURRENT` loops to OpenMP
2121
constructs.
2222
* Tracking the current status of such mapping.
23-
* Describing the limitations of the current implmenentation.
23+
* Describing the limitations of the current implementation.
2424
* Describing next steps.
25+
* Tracking the current upstreaming status (from the AMD ROCm fork).
2526

2627
## Usage
2728

28-
In order to enable `do concurrent` to OpenMP mapping, `flang-new` adds a new
29-
compiler flag: `-fdo-concurrent-parallel`. This flags has 3 possible values:
30-
1. `host`: this maps `do concurent` loops to run in parallel on the host CPU.
29+
In order to enable `do concurrent` to OpenMP mapping, `flang` adds a new
30+
compiler flag: `-fdo-concurrent-to-openmp`. This flag has 3 possible values:
31+
1. `host`: this maps `do concurrent` loops to run in parallel on the host CPU.
3132
This maps such loops to the equivalent of `omp parallel do`.
32-
2. `device`: this maps `do concurent` loops to run in parallel on a device
33-
(GPU). This maps such loops to the equivalent of `omp target teams
34-
distribute parallel do`.
35-
3. `none`: this disables `do concurrent` mapping altogether. In such case, such
33+
2. `device`: this maps `do concurrent` loops to run in parallel on a target device.
34+
This maps such loops to the equivalent of
35+
`omp target teams distribute parallel do`.
36+
3. `none`: this disables `do concurrent` mapping altogether. In that case, such
3637
loops are emitted as sequential loops.
3738

38-
The above compiler switch is currently avaialble only when OpenMP is also
39-
enabled. So you need to provide the following options to flang in order to
40-
enable it:
39+
The `-fdo-concurrent-to-openmp` compiler switch is currently available only when
40+
OpenMP is also enabled. So you need to provide the following options to flang in
41+
order to enable it:
4142
```
42-
flang-new ... -fopenmp -fdo-concurrent-parallel=[host|device|none] ...
43+
flang ... -fopenmp -fdo-concurrent-to-openmp=[host|device|none] ...
4344
```
45+
For mapping to device, the target device architecture must be specified as well.
46+
See `-fopenmp-targets` and `--offload-arch` for more info.
4447

4548
## Current status
4649

4750
Under the hood, `do concurrent` mapping is implemented in the
4851
`DoConcurrentConversionPass`. This is still an experimental pass which means
4952
that:
5053
* It has been tested in a very limited way so far.
51-
* It has been tested on simple synthetic inputs.
54+
* It has been tested mostly on simple synthetic inputs.
5255

53-
To describe current status in more detail, following is a description of how
54-
the pass currently behaves for single-range loops and then for multi-range
55-
loops.
56-
57-
### Single-range loops
58-
59-
Given the following loop:
60-
```fortran
61-
do concurrent(i=1:n)
62-
a(i) = i * i
63-
end do
64-
```
65-
66-
#### Mapping to `host`
67-
68-
Mapping this loop to the `host`, generates MLIR operations of the following
69-
structure:
70-
71-
```mlir
72-
%4 = fir.address_of(@_QFEa) ...
73-
%6:2 = hlfir.declare %4 ...
74-
75-
omp.parallel {
76-
// Allocate private copy for `i`.
77-
%19 = fir.alloca i32 {bindc_name = "i"}
78-
%20:2 = hlfir.declare %19 {uniq_name = "_QFEi"} ...
79-
80-
omp.wsloop {
81-
omp.loop_nest (%arg0) : index = (%21) to (%22) inclusive step (%c1_2) {
82-
%23 = fir.convert %arg0 : (index) -> i32
83-
// Use the privatized version of `i`.
84-
fir.store %23 to %20#1 : !fir.ref<i32>
85-
...
86-
87-
// Use "shared" SSA value of `a`.
88-
%42 = hlfir.designate %6#0
89-
hlfir.assign %35 to %42
90-
...
91-
omp.yield
92-
}
93-
omp.terminator
94-
}
95-
omp.terminator
96-
}
97-
```
98-
99-
#### Mapping to `device`
100-
101-
Mapping the same loop to the `device`, generates MLIR operations of the
102-
following structure:
103-
104-
```mlir
105-
// Map `a` to the `target` region.
106-
%29 = omp.map.info ... {name = "_QFEa"}
107-
omp.target ... map_entries(..., %29 -> %arg4 ...) {
108-
...
109-
%51:2 = hlfir.declare %arg4
110-
...
111-
omp.teams {
112-
// Allocate private copy for `i`.
113-
%52 = fir.alloca i32 {bindc_name = "i"}
114-
%53:2 = hlfir.declare %52
115-
...
116-
117-
omp.distribute {
118-
omp.parallel {
119-
omp.wsloop {
120-
omp.loop_nest (%arg5) : index = (%54) to (%55) inclusive step (%c1_9) {
121-
// Use the privatized version of `i`.
122-
%56 = fir.convert %arg5 : (index) -> i32
123-
fir.store %56 to %53#1
124-
...
125-
// Use the mapped version of `a`.
126-
... = hlfir.designate %51#0
127-
...
128-
}
129-
omp.terminator
130-
}
131-
omp.terminator
132-
}
133-
omp.terminator
134-
}
135-
omp.terminator
136-
}
137-
omp.terminator
138-
}
139-
```
140-
141-
### Multi-range loops
142-
143-
The pass currently supports multi-range loops as well. Given the following
144-
example:
145-
146-
```fortran
147-
do concurrent(i=1:n, j=1:m)
148-
a(i,j) = i * j
149-
end do
150-
```
151-
152-
The generated `omp.loop_nest` operation look like:
153-
154-
```mlir
155-
omp.loop_nest (%arg0, %arg1)
156-
: index = (%17, %19) to (%18, %20)
157-
inclusive step (%c1_2, %c1_4) {
158-
fir.store %arg0 to %private_i#1 : !fir.ref<i32>
159-
fir.store %arg1 to %private_j#1 : !fir.ref<i32>
160-
...
161-
omp.yield
162-
}
163-
```
164-
165-
It is worth noting that we have privatized versions for both iteration
166-
variables: `i` and `j`. These are locally allocated inside the parallel/target
167-
OpenMP region similar to what the single-range example in previous section
168-
shows.
169-
170-
#### Multi-range and perfectly-nested loops
171-
172-
Currently, on the `FIR` dialect level, the following 2 loops are modelled in
173-
exactly the same way:
174-
175-
```fortran
176-
do concurrent(i=1:n, j=1:m)
177-
a(i,j) = i * j
178-
end do
179-
```
180-
181-
```fortran
182-
do concurrent(i=1:n)
183-
do concurrent(j=1:m)
184-
a(i,j) = i * j
185-
end do
186-
end do
187-
```
188-
189-
Both of the above loops are modelled as:
190-
191-
```mlir
192-
fir.do_loop %arg0 = %11 to %12 step %c1 unordered {
193-
...
194-
fir.do_loop %arg1 = %14 to %15 step %c1_1 unordered {
195-
...
196-
}
197-
}
198-
```
199-
200-
Consequently, from the `DoConcurrentConversionPass`' perspective, both loops
201-
are treated in the same manner. Under the hood, the pass detects
202-
perfectly-nested loop nests and maps such nests as if they were multi-range
203-
loops.
204-
205-
#### Non-perfectly-nested loops
206-
207-
One limitation that the pass currently have is that it treats any intervening
208-
code in a loop nest as being disruptive to detecting that nest as a single
209-
unit. For example, given the following input:
210-
211-
```fortran
212-
do concurrent(i=1:n)
213-
x = 41
214-
do concurrent(j=1:m)
215-
a(i,j) = i * j
216-
end do
217-
end do
218-
```
219-
220-
Since there at least one statement between the 2 loop header (i.e. `x = 41`),
221-
the pass does not detect the `i` and `j` loops as a nest. Rather, the pass in
222-
that case only maps the `i` loop to OpenMP and leaves the `j` loop in its
223-
origianl form. In theory, in this example, we can sink the intervening code
224-
into the `j` loop and detect the complete nest. However, such transformation is
225-
still to be implemented in the future.
226-
227-
The above also has the consequence that the `j` variable will **not** be
228-
privatized in the OpenMP parallel/target region. In other words, it will be
229-
treated as if it was a `shared` variable. For more details about privatization,
230-
see the "Data environment" section below.
231-
232-
### Data environment
233-
234-
By default, variables that are used inside a `do concurernt` loop nest are
235-
either treated as `shared` in case of mapping to `host`, or mapped into the
236-
`target` region using a `map` clause in case of mapping to `device`. The only
237-
exceptions to this are:
238-
1. the loop's iteration variable(s) (IV) of **perfect** loop nests. In that
239-
case, for each IV, we allocate a local copy as shown the by the mapping
240-
examples above.
241-
1. any values that are from allocations outside the loop nest and used
242-
exclusively inside of it. In such cases, a local privatized
243-
value is created in the OpenMP region to prevent multiple teams of threads
244-
from accessing and destroying the same memory block which causes runtime
245-
issues. For an example of such cases, see
246-
`flang/test/Transforms/DoConcurrent/locally_destroyed_temp.f90`.
247-
248-
#### Non-perfectly-nested loops' IVs
249-
250-
For non-perfectly-nested loops, the IVs are still treated as `shared` or
251-
`map` entries as pointed out above. This **might not** be consistent with what
252-
the Fortran specficiation tells us. In particular, taking the following
253-
snippets from the spec (version 2023) into account:
254-
255-
> § 3.35
256-
> ------
257-
> construct entity
258-
> entity whose identifier has the scope of a construct
259-
260-
> § 19.4
261-
> ------
262-
> A variable that appears as an index-name in a FORALL or DO CONCURRENT
263-
> construct, or ... is a construct entity. A variable that has LOCAL or
264-
> LOCAL_INIT locality in a DO CONCURRENT construct is a construct entity.
265-
> ...
266-
> The name of a variable that appears as an index-name in a DO CONCURRENT
267-
> construct, FORALL statement, or FORALL construct has a scope of the statement
268-
> or construct. A variable that has LOCAL or LOCAL_INIT locality in a DO
269-
> CONCURRENT construct has the scope of that construct.
270-
271-
From the above quotes, it seems there is an equivalence between the IV of a `do
272-
concurrent` loop and a variable with a `LOCAL` locality specifier (equivalent
273-
to OpenMP's `private` clause). Which means that we should probably
274-
localize/privatize a `do concurernt` loop's IV even if it is not perfectly
275-
nested in the nest we are parallelizing. For now, however, we **do not** do
276-
that as pointed out previously. In the near future, we propose a middle-ground
277-
solution (see the Next steps section for more details).
56+
<!--
57+
More details about current status will be added along with relevant parts of the
58+
implementation in later upstreaming patches.
59+
-->
27860

27961
## Next steps
28062

63+
This section describes some of the open questions/issues that are not tackled yet
64+
even in the downstream implementation.
65+
28166
### Delayed privatization
28267

28368
So far, we emit the privatization logic for IVs inline in the parallel/target
@@ -296,25 +81,46 @@ loop nest in a more fine-grained way. Implementing these specifiers on the
29681
`FIR` dialect level is needed in order to support this in the
29782
`DoConcurrentConversionPass`.
29883

299-
Such specified will also unlock a potential solution to the
84+
Such specifiers will also unlock a potential solution to the
30085
non-perfectly-nested loops' IVs issue described above. In particular, for a
30186
non-perfectly nested loop, one middle-ground proposal/solution would be to:
30287
* Emit the loop's IV as shared/mapped just like we do currently.
30388
* Emit a warning that the IV of the loop is emitted as shared/mapped.
30489
* Given support for `LOCAL`, we can recommend the user to explicitly
30590
localize/privatize the loop's IV if they choose to.
30691

92+
#### Sharing TableGen clause records from the OpenMP dialect
93+
94+
At the moment, the FIR dialect does not have a way to model locality specifiers
95+
on the IR level. Instead, something similar to early/eager privatization in OpenMP
96+
is done for the locality specifiers in `fir.do_loop` ops. Having locality specifier
97+
modelled in a way similar to delayed privatization (i.e. the `omp.private` op) and
98+
reductions (i.e. the `omp.declare_reduction` op) can make mapping `do concurrent`
99+
to OpenMP (and other parallel programming models) much easier.
100+
101+
Therefore, one way to approach this problem is to extract the TableGen records
102+
for relevant OpenMP clauses in a shared dialect for "data environment management"
103+
and use these shared records for OpenMP, `do concurrent`, and possibly OpenACC
104+
as well.
105+
106+
#### Supporting reductions
107+
108+
Similar to locality specifiers, mapping reductions from `do concurrent` to OpenMP
109+
is also still an open TODO. We can potentially extend the MLIR infrastructure
110+
proposed in the previous section to share reduction records among the different
111+
relevant dialects as well.
112+
307113
### More advanced detection of loop nests
308114

309115
As pointed out earlier, any intervening code between the headers of 2 nested
310-
`do concurrent` loops prevents us currently from detecting this as a loop nest.
311-
In some cases this is overly conservative. Therefore, a more flexible detection
312-
logic of loop nests needs to be implemented.
116+
`do concurrent` loops prevents us from detecting this as a loop nest. In some
117+
cases this is overly conservative. Therefore, a more flexible detection logic
118+
of loop nests needs to be implemented.
313119

314120
### Data-dependence analysis
315121

316122
Right now, we map loop nests without analysing whether such mapping is safe to
317-
do or not. We probalby need to at least warn the use of unsafe loop nests due
123+
do or not. We probably need to at least warn the user of unsafe loop nests due
318124
to loop-carried dependencies.
319125

320126
### Non-rectangular loop nests
@@ -330,3 +136,21 @@ end do
330136
```
331137
We defer this to the (hopefully) near future when we get the conversion in a
332138
good share for the samples/projects at hand.
139+
140+
### Generalizing the pass to other parallel programming models
141+
142+
Once we have a stable and capable `do concurrent` to OpenMP mapping, we can take
143+
this in a more generalized direction and allow the pass to target other models;
144+
e.g. OpenACC. This goal should be kept in mind from the get-go even while only
145+
targeting OpenMP.
146+
147+
148+
## Upstreaming status
149+
150+
- [x] Command line options for `flang` and `bbc`.
151+
- [x] Conversion pass skeleton (no transormations happen yet).
152+
- [x] Status description and tracking document (this document).
153+
- [ ] Basic host/CPU mapping support.
154+
- [ ] Basic device/GPU mapping support.
155+
- [ ] More advanced host and device support (expaned to multiple items as needed).
156+
