<!--===- docs/DoConcurrentMappingToOpenMP.md

Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
See https://llvm.org/LICENSE.txt for license information.
SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

-->

# `DO CONCURRENT` mapping to OpenMP

```{contents}
---
local:
---
```

This document seeks to describe the effort to parallelize `do concurrent`
loops by mapping them to OpenMP worksharing constructs. The goals of this
document are:
* Describing how to instruct `flang-new` to map `DO CONCURRENT` loops to
  OpenMP constructs.
* Tracking the current status of such mapping.
* Describing the limitations of the current implementation.
* Describing next steps.

## Usage

In order to enable `do concurrent` to OpenMP mapping, `flang-new` adds a new
compiler flag: `-fdo-concurrent-parallel`. This flag has 3 possible values:
1. `host`: this maps `do concurrent` loops to run in parallel on the host CPU.
   This maps such loops to the equivalent of `omp parallel do`.
2. `device`: this maps `do concurrent` loops to run in parallel on a device
   (GPU). This maps such loops to the equivalent of `omp target teams
   distribute parallel do`.
3. `none`: this disables `do concurrent` mapping altogether. In that case,
   such loops are emitted as sequential loops.

The above compiler switch is currently available only when OpenMP is also
enabled, so you need to provide the following options to flang in order to
enable it:
```
flang-new ... -fopenmp -fdo-concurrent-parallel=[host|device|none] ...
```
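
For illustration, here is a minimal, self-contained program one might compile
with the invocation above (the program itself is just an example, not taken
from the test suite):

```fortran
program do_concurrent_example
  implicit none
  integer, parameter :: n = 1024
  integer :: a(n), i

  ! Each iteration is independent, so the loop is a valid mapping candidate.
  do concurrent(i=1:n)
    a(i) = i * i
  end do

  print *, a(n)
end program do_concurrent_example
```

Compiled with `-fopenmp -fdo-concurrent-parallel=host`, the loop body should
execute in parallel across host threads.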

## Current status

Under the hood, `do concurrent` mapping is implemented in the
`DoConcurrentConversionPass`. This is still an experimental pass, which means
that:
* It has been tested in a very limited way so far.
* It has been tested only on simple synthetic inputs.

To describe the current status in more detail, the following is a description
of how the pass currently behaves for single-range loops and then for
multi-range loops.

### Single-range loops

Given the following loop:
```fortran
do concurrent(i=1:n)
  a(i) = i * i
end do
```

#### Mapping to `host`

Mapping this loop to the `host` generates MLIR operations of the following
structure:

```mlir
%4 = fir.address_of(@_QFEa) ...
%6:2 = hlfir.declare %4 ...

omp.parallel {
  // Allocate private copy for `i`.
  %19 = fir.alloca i32 {bindc_name = "i"}
  %20:2 = hlfir.declare %19 {uniq_name = "_QFEi"} ...

  omp.wsloop {
    omp.loop_nest (%arg0) : index = (%21) to (%22) inclusive step (%c1_2) {
      %23 = fir.convert %arg0 : (index) -> i32
      // Use the privatized version of `i`.
      fir.store %23 to %20#1 : !fir.ref<i32>
      ...

      // Use the "shared" SSA value of `a`.
      %42 = hlfir.designate %6#0
      hlfir.assign %35 to %42
      ...
      omp.yield
    }
    omp.terminator
  }
  omp.terminator
}
```

#### Mapping to `device`

Mapping the same loop to the `device` generates MLIR operations of the
following structure:

```mlir
// Map `a` to the `target` region.
%29 = omp.map.info ... {name = "_QFEa"}
omp.target ... map_entries(..., %29 -> %arg4 ...) {
  ...
  %51:2 = hlfir.declare %arg4
  ...
  omp.teams {
    // Allocate private copy for `i`.
    %52 = fir.alloca i32 {bindc_name = "i"}
    %53:2 = hlfir.declare %52
    ...

    omp.distribute {
      omp.parallel {
        omp.wsloop {
          omp.loop_nest (%arg5) : index = (%54) to (%55) inclusive step (%c1_9) {
            // Use the privatized version of `i`.
            %56 = fir.convert %arg5 : (index) -> i32
            fir.store %56 to %53#1
            ...
            // Use the mapped version of `a`.
            ... = hlfir.designate %51#0
            ...
          }
          omp.terminator
        }
        omp.terminator
      }
      omp.terminator
    }
    omp.terminator
  }
  omp.terminator
}
```

### Multi-range loops

The pass currently supports multi-range loops as well. Given the following
example:

```fortran
do concurrent(i=1:n, j=1:m)
  a(i,j) = i * j
end do
```

The generated `omp.loop_nest` operation looks like:

```mlir
omp.loop_nest (%arg0, %arg1)
    : index = (%17, %19) to (%18, %20)
    inclusive step (%c1_2, %c1_4) {
  fir.store %arg0 to %private_i#1 : !fir.ref<i32>
  fir.store %arg1 to %private_j#1 : !fir.ref<i32>
  ...
  omp.yield
}
```

It is worth noting that we have privatized versions of both iteration
variables: `i` and `j`. These are locally allocated inside the parallel/target
OpenMP region, similar to what the single-range example in the previous
section shows.
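
In OpenMP source terms (an illustrative analogy, not compiler output), this
multi-range mapping to `host` corresponds roughly to a collapsed worksharing
loop; for `device`, the analogue would be `target teams distribute parallel
do collapse(2)`:

```fortran
! Rough source-level analogue of the host mapping above; the iteration
! variables of collapsed loops are private in OpenMP (listed here only
! for emphasis).
!$omp parallel do collapse(2) private(i, j)
do i = 1, n
  do j = 1, m
    a(i,j) = i * j
  end do
end do
```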

#### Multi-range and perfectly-nested loops

Currently, on the `FIR` dialect level, the following 2 loops are modelled in
exactly the same way:

```fortran
do concurrent(i=1:n, j=1:m)
  a(i,j) = i * j
end do
```

```fortran
do concurrent(i=1:n)
  do concurrent(j=1:m)
    a(i,j) = i * j
  end do
end do
```

Both of the above loops are modelled as:

```mlir
fir.do_loop %arg0 = %11 to %12 step %c1 unordered {
  ...
  fir.do_loop %arg1 = %14 to %15 step %c1_1 unordered {
    ...
  }
}
```

Consequently, from the `DoConcurrentConversionPass`' perspective, both loops
are treated in the same manner. Under the hood, the pass detects
perfectly-nested loop nests and maps such nests as if they were multi-range
loops.

#### Non-perfectly-nested loops

One limitation that the pass currently has is that it treats any intervening
code in a loop nest as disruptive to detecting that nest as a single unit.
For example, given the following input:

```fortran
do concurrent(i=1:n)
  x = 41
  do concurrent(j=1:m)
    a(i,j) = i * j
  end do
end do
```

Since there is at least one statement between the two loop headers (i.e.
`x = 41`), the pass does not detect the `i` and `j` loops as a nest. Rather,
the pass in that case only maps the `i` loop to OpenMP and leaves the `j`
loop in its original form. In theory, in this example, we can sink the
intervening code into the `j` loop and detect the complete nest, as sketched
below. However, such a transformation is still to be implemented in the
future.
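
For illustration, the sunk form would conceptually look as follows
(hypothetical output; the pass does not perform this transformation yet):

```fortran
do concurrent(i=1:n)
  do concurrent(j=1:m)
    ! Sunk from between the loop headers; it now executes once per (i, j)
    ! iteration, which is harmless here since the assignment is idempotent.
    x = 41
    a(i,j) = i * j
  end do
end do
```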

The above also has the consequence that the `j` variable will **not** be
privatized in the OpenMP parallel/target region. In other words, it will be
treated as if it were a `shared` variable. For more details about
privatization, see the "Data environment" section below.

### Data environment

By default, variables that are used inside a `do concurrent` loop nest are
either treated as `shared` in case of mapping to `host`, or mapped into the
`target` region using a `map` clause in case of mapping to `device`. The only
exceptions to this are:
1. The loop's iteration variable(s) (IV) of **perfect** loop nests. In that
   case, for each IV, we allocate a local copy as shown by the mapping
   examples above.
2. Any values that come from allocations outside the loop nest and are used
   exclusively inside of it. In such cases, a local privatized value is
   created in the OpenMP region to prevent multiple teams of threads from
   accessing and destroying the same memory block, which causes runtime
   issues. For an example of such cases, see
   `flang/test/Transforms/DoConcurrent/locally_destroyed_temp.f90`.

#### Non-perfectly-nested loops' IVs

For non-perfectly-nested loops, the IVs are still treated as `shared` or
`map` entries as pointed out above. This **might not** be consistent with
what the Fortran specification tells us. In particular, consider the
following snippets from the spec (version 2023):

> § 3.35
> ------
> construct entity
> entity whose identifier has the scope of a construct

> § 19.4
> ------
> A variable that appears as an index-name in a FORALL or DO CONCURRENT
> construct, or ... is a construct entity. A variable that has LOCAL or
> LOCAL_INIT locality in a DO CONCURRENT construct is a construct entity.
> ...
> The name of a variable that appears as an index-name in a DO CONCURRENT
> construct, FORALL statement, or FORALL construct has a scope of the statement
> or construct. A variable that has LOCAL or LOCAL_INIT locality in a DO
> CONCURRENT construct has the scope of that construct.

From the above quotes, it seems there is an equivalence between the IV of a
`do concurrent` loop and a variable with a `LOCAL` locality specifier
(equivalent to OpenMP's `private` clause). This means that we should probably
localize/privatize a `do concurrent` loop's IV even if it is not perfectly
nested in the nest we are parallelizing. For now, however, we **do not** do
that, as pointed out previously. For the near future, we propose a
middle-ground solution (see the "Next steps" section for more details).
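
In explicit OpenMP terms (an illustrative sketch, not what the pass emits
today), localizing the IV of a non-perfectly-nested inner loop would
correspond to privatizing it on the enclosing parallel construct:

```fortran
! Illustrative only: `j` privatized on the enclosing parallel construct,
! matching the construct-entity semantics quoted above.
!$omp parallel do private(j)
do i = 1, n
  do j = 1, m
    a(i,j) = i * j
  end do
end do
```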

## Next steps

### Delayed privatization

So far, we emit the privatization logic for IVs inline in the parallel/target
region. This is enough for our purposes right now since we don't
localize/privatize any sophisticated types of variables yet. Once we need
more advanced localization through `do concurrent`'s locality specifiers
(see below), delayed privatization will enable us to have much cleaner IR.
Once the upstream implementation of delayed privatization supports the
constructs required by the pass, we will move to it rather than
inlined/early privatization.

### Locality specifiers for `do concurrent`

Locality specifiers will enable the user to control the data environment of
the loop nest in a more fine-grained way. Implementing these specifiers on
the `FIR` dialect level is needed in order to support this in the
`DoConcurrentConversionPass`.
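
For reference, here is a hedged example of what such user code looks like
(standard Fortran 2018 syntax; the pass does not handle these specifiers
yet, and the subroutine is made up for illustration):

```fortran
subroutine scale_add(a, n)
  integer, intent(in) :: n
  real, intent(inout) :: a(n)
  real :: tmp
  integer :: i

  ! `tmp` gets a fresh, uninitialized private copy per iteration, while
  ! `a` is explicitly shared across iterations.
  do concurrent(i=1:n) local(tmp) shared(a)
    tmp = a(i) * 2.0
    a(i) = tmp + 1.0
  end do
end subroutine scale_add
```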

Such specifiers will also unlock a potential solution to the
non-perfectly-nested loops' IVs issue described above. In particular, for a
non-perfectly-nested loop, one middle-ground proposal/solution would be to:
* Emit the loop's IV as shared/mapped just like we do currently.
* Emit a warning that the IV of the loop is emitted as shared/mapped.
* Given support for `LOCAL`, we can recommend that the user explicitly
  localize/privatize the loop's IV if they choose to.

### More advanced detection of loop nests

As pointed out earlier, any intervening code between the headers of 2 nested
`do concurrent` loops currently prevents us from detecting this as a loop
nest. In some cases this is overly conservative; the example below shows one
such case. Therefore, a more flexible detection logic of loop nests needs to
be implemented.
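
For instance, in the following hypothetical input, the intervening statement
only feeds the inner loop and could in principle be sunk into it, recovering
a perfect nest; the pass nevertheless gives up on the nest today:

```fortran
do concurrent(i=1:n)
  ! Intervening statement: its result is used only by the inner loop.
  tmp = 2 * i
  do concurrent(j=1:m)
    a(i,j) = tmp + j
  end do
end do
```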

### Data-dependence analysis

Right now, we map loop nests without analysing whether such mapping is safe
to do or not. We probably need to at least warn the user about unsafe loop
nests due to loop-carried dependencies, such as the one shown below.
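
For example (an illustrative snippet), the following loop carries a
dependence across iterations; it is not a conforming `do concurrent` to
begin with, but the pass would currently map it silently:

```fortran
! Each iteration reads a value written by the previous one, so the
! iterations cannot safely execute in parallel.
do concurrent(i=2:n)
  a(i) = a(i-1) + 1
end do
```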

### Non-rectangular loop nests

So far, we have not needed to use the pass for non-rectangular loop nests.
For example:
```fortran
do concurrent(i=1:n)
  do concurrent(j=i:n)
    ...
  end do
end do
```
We defer this to the (hopefully) near future, when we get the conversion in
good shape for the samples/projects at hand.
