-->

- # `DO CONCURENT` mapping to OpenMP
+ # `DO CONCURRENT` mapping to OpenMP

```{contents}
---
@@ -17,267 +17,52 @@ local:
This document seeks to describe the effort to parallelize `do concurrent` loops
by mapping them to OpenMP worksharing constructs. The goals of this document
are:
- * Describing how to instruct `flang-new` to map `DO CONCURENT` loops to OpenMP
+ * Describing how to instruct `flang` to map `DO CONCURRENT` loops to OpenMP
constructs.
* Tracking the current status of such mapping.
- * Describing the limitations of the current implmenentation.
+ * Describing the limitations of the current implementation.
* Describing next steps.
+ * Tracking the current upstreaming status (from the AMD ROCm fork).

## Usage

- In order to enable `do concurrent` to OpenMP mapping, `flang-new` adds a new
- compiler flag: `-fdo-concurrent-parallel`. This flags has 3 possible values:
- 1. `host`: this maps `do concurent` loops to run in parallel on the host CPU.
+ In order to enable `do concurrent` to OpenMP mapping, `flang` adds a new
+ compiler flag: `-fdo-concurrent-to-openmp`. This flag has 3 possible values:
+ 1. `host`: this maps `do concurrent` loops to run in parallel on the host CPU.
This maps such loops to the equivalent of `omp parallel do`.
- 2. `device`: this maps `do concurent` loops to run in parallel on a device
- (GPU). This maps such loops to the equivalent of `omp target teams
- distribute parallel do`.
- 3. `none`: this disables `do concurrent` mapping altogether. In such case, such
+ 2. `device`: this maps `do concurrent` loops to run in parallel on a target device.
+ This maps such loops to the equivalent of
+ `omp target teams distribute parallel do`.
+ 3. `none`: this disables `do concurrent` mapping altogether. In that case, such
loops are emitted as sequential loops.

- The above compiler switch is currently avaialble only when OpenMP is also
- enabled. So you need to provide the following options to flang in order to
- enable it:
+ The `-fdo-concurrent-to-openmp` compiler switch is currently available only when
+ OpenMP is also enabled. So you need to provide the following options to flang in
+ order to enable it:
```
- flang-new ... -fopenmp -fdo-concurrent-parallel=[host|device|none] ...
+ flang ... -fopenmp -fdo-concurrent-to-openmp=[host|device|none] ...
```
+ For mapping to device, the target device architecture must be specified as well.
+ See `-fopenmp-targets` and `--offload-arch` for more info.
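
As a minimal sketch of the kind of input the flag applies to, the module below contains one single-range `do concurrent` loop with independent iterations. The names `dc_demo` and `fill_squares` are illustrative assumptions and are not part of flang or its test suite.

```fortran
! Illustrative module; all names here are assumptions, not part of flang.
module dc_demo
  implicit none
contains
  ! Store i*i in a(i). Each iteration touches only its own element, so the
  ! loop is a natural candidate for `do concurrent` to OpenMP mapping.
  subroutine fill_squares(a)
    integer, intent(out) :: a(:)
    integer :: i
    do concurrent (i = 1:size(a))
      a(i) = i * i
    end do
  end subroutine fill_squares
end module dc_demo
```

Assuming the module is saved as `dc_demo.f90` (a hypothetical file name), it could be compiled for host mapping with only the options described above, for example `flang -c -fopenmp -fdo-concurrent-to-openmp=host dc_demo.f90`. Device mapping would additionally need the target options mentioned above, whose values depend on the hardware at hand.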

## Current status

Under the hood, `do concurrent` mapping is implemented in the
`DoConcurrentConversionPass`. This is still an experimental pass which means
that:
* It has been tested in a very limited way so far.
- * It has been tested on simple synthetic inputs.
+ * It has been tested mostly on simple synthetic inputs.

- To describe current status in more detail, following is a description of how
- the pass currently behaves for single-range loops and then for multi-range
- loops.
-
- ### Single-range loops
-
- Given the following loop:
- ```fortran
- do concurrent(i=1:n)
-   a(i) = i * i
- end do
- ```
-
- #### Mapping to `host`
-
- Mapping this loop to the `host`, generates MLIR operations of the following
- structure:
-
- ```mlir
- %4 = fir.address_of(@_QFEa) ...
- %6:2 = hlfir.declare %4 ...
-
- omp.parallel {
-   // Allocate private copy for `i`.
-   %19 = fir.alloca i32 {bindc_name = "i"}
-   %20:2 = hlfir.declare %19 {uniq_name = "_QFEi"} ...
-
-   omp.wsloop {
-     omp.loop_nest (%arg0) : index = (%21) to (%22) inclusive step (%c1_2) {
-       %23 = fir.convert %arg0 : (index) -> i32
-       // Use the privatized version of `i`.
-       fir.store %23 to %20#1 : !fir.ref<i32>
-       ...
-
-       // Use "shared" SSA value of `a`.
-       %42 = hlfir.designate %6#0
-       hlfir.assign %35 to %42
-       ...
-       omp.yield
-     }
-     omp.terminator
-   }
-   omp.terminator
- }
- ```
-
- #### Mapping to `device`
-
- Mapping the same loop to the `device`, generates MLIR operations of the
- following structure:
-
- ```mlir
- // Map `a` to the `target` region.
- %29 = omp.map.info ... {name = "_QFEa"}
- omp.target ... map_entries(..., %29 -> %arg4 ...) {
-   ...
-   %51:2 = hlfir.declare %arg4
-   ...
-   omp.teams {
-     // Allocate private copy for `i`.
-     %52 = fir.alloca i32 {bindc_name = "i"}
-     %53:2 = hlfir.declare %52
-     ...
-
-     omp.distribute {
-       omp.parallel {
-         omp.wsloop {
-           omp.loop_nest (%arg5) : index = (%54) to (%55) inclusive step (%c1_9) {
-             // Use the privatized version of `i`.
-             %56 = fir.convert %arg5 : (index) -> i32
-             fir.store %56 to %53#1
-             ...
-             // Use the mapped version of `a`.
-             ... = hlfir.designate %51#0
-             ...
-           }
-           omp.terminator
-         }
-         omp.terminator
-       }
-       omp.terminator
-     }
-     omp.terminator
-   }
-   omp.terminator
- }
- ```
-
- ### Multi-range loops
-
- The pass currently supports multi-range loops as well. Given the following
- example:
-
- ```fortran
- do concurrent(i=1:n, j=1:m)
-   a(i,j) = i * j
- end do
- ```
-
- The generated `omp.loop_nest` operation look like:
-
- ```mlir
- omp.loop_nest (%arg0, %arg1)
-     : index = (%17, %19) to (%18, %20)
-     inclusive step (%c1_2, %c1_4) {
-   fir.store %arg0 to %private_i#1 : !fir.ref<i32>
-   fir.store %arg1 to %private_j#1 : !fir.ref<i32>
-   ...
-   omp.yield
- }
- ```
-
- It is worth noting that we have privatized versions for both iteration
- variables: `i` and `j`. These are locally allocated inside the parallel/target
- OpenMP region similar to what the single-range example in previous section
- shows.
-
- #### Multi-range and perfectly-nested loops
-
- Currently, on the `FIR` dialect level, the following 2 loops are modelled in
- exactly the same way:
-
- ```fortran
- do concurrent(i=1:n, j=1:m)
-   a(i,j) = i * j
- end do
- ```
-
- ```fortran
- do concurrent(i=1:n)
-   do concurrent(j=1:m)
-     a(i,j) = i * j
-   end do
- end do
- ```
-
- Both of the above loops are modelled as:
-
- ```mlir
- fir.do_loop %arg0 = %11 to %12 step %c1 unordered {
-   ...
-   fir.do_loop %arg1 = %14 to %15 step %c1_1 unordered {
-     ...
-   }
- }
- ```
-
- Consequently, from the `DoConcurrentConversionPass`' perspective, both loops
- are treated in the same manner. Under the hood, the pass detects
- perfectly-nested loop nests and maps such nests as if they were multi-range
- loops.
-
- #### Non-perfectly-nested loops
-
- One limitation that the pass currently have is that it treats any intervening
- code in a loop nest as being disruptive to detecting that nest as a single
- unit. For example, given the following input:
-
- ```fortran
- do concurrent(i=1:n)
-   x = 41
-   do concurrent(j=1:m)
-     a(i,j) = i * j
-   end do
- end do
- ```
-
- Since there at least one statement between the 2 loop header (i.e. `x = 41`),
- the pass does not detect the `i` and `j` loops as a nest. Rather, the pass in
- that case only maps the `i` loop to OpenMP and leaves the `j` loop in its
- origianl form. In theory, in this example, we can sink the intervening code
- into the `j` loop and detect the complete nest. However, such transformation is
- still to be implemented in the future.
-
- The above also has the consequence that the `j` variable will **not** be
- privatized in the OpenMP parallel/target region. In other words, it will be
- treated as if it was a `shared` variable. For more details about privatization,
- see the "Data environment" section below.
-
- ### Data environment
-
- By default, variables that are used inside a `do concurernt` loop nest are
- either treated as `shared` in case of mapping to `host`, or mapped into the
- `target` region using a `map` clause in case of mapping to `device`. The only
- exceptions to this are:
- 1. the loop's iteration variable(s) (IV) of **perfect** loop nests. In that
- case, for each IV, we allocate a local copy as shown the by the mapping
- examples above.
- 1. any values that are from allocations outside the loop nest and used
- exclusively inside of it. In such cases, a local privatized
- value is created in the OpenMP region to prevent multiple teams of threads
- from accessing and destroying the same memory block which causes runtime
- issues. For an example of such cases, see
- `flang/test/Transforms/DoConcurrent/locally_destroyed_temp.f90`.
-
- #### Non-perfectly-nested loops' IVs
-
- For non-perfectly-nested loops, the IVs are still treated as `shared` or
- `map` entries as pointed out above. This **might not** be consistent with what
- the Fortran specficiation tells us. In particular, taking the following
- snippets from the spec (version 2023) into account:
-
- > § 3.35
- > ------
- > construct entity
- > entity whose identifier has the scope of a construct
-
- > § 19.4
- > ------
- > A variable that appears as an index-name in a FORALL or DO CONCURRENT
- > construct, or ... is a construct entity. A variable that has LOCAL or
- > LOCAL_INIT locality in a DO CONCURRENT construct is a construct entity.
- > ...
- > The name of a variable that appears as an index-name in a DO CONCURRENT
- > construct, FORALL statement, or FORALL construct has a scope of the statement
- > or construct. A variable that has LOCAL or LOCAL_INIT locality in a DO
- > CONCURRENT construct has the scope of that construct.
-
- From the above quotes, it seems there is an equivalence between the IV of a `do
- concurrent` loop and a variable with a `LOCAL` locality specifier (equivalent
- to OpenMP's `private` clause). Which means that we should probably
- localize/privatize a `do concurernt` loop's IV even if it is not perfectly
- nested in the nest we are parallelizing. For now, however, we **do not** do
- that as pointed out previously. In the near future, we propose a middle-ground
- solution (see the Next steps section for more details).
+ <!--
+ More details about current status will be added along with relevant parts of the
+ implementation in later upstreaming patches.
+ -->

## Next steps

+ This section describes some of the open questions/issues that are not tackled yet
+ even in the downstream implementation.
+
### Delayed privatization

So far, we emit the privatization logic for IVs inline in the parallel/target
@@ -296,25 +81,46 @@ loop nest in a more fine-grained way. Implementing these specifiers on the
`FIR` dialect level is needed in order to support this in the
`DoConcurrentConversionPass`.

- Such specified will also unlock a potential solution to the
+ Such specifiers will also unlock a potential solution to the
non-perfectly-nested loops' IVs issue described above. In particular, for a
non-perfectly nested loop, one middle-ground proposal/solution would be to:
* Emit the loop's IV as shared/mapped just like we do currently.
* Emit a warning that the IV of the loop is emitted as shared/mapped.
* Given support for `LOCAL`, we can recommend the user to explicitly
localize/privatize the loop's IV if they choose to.

+ #### Sharing TableGen clause records from the OpenMP dialect
+
+ At the moment, the FIR dialect does not have a way to model locality specifiers
+ on the IR level. Instead, something similar to early/eager privatization in OpenMP
+ is done for the locality specifiers in `fir.do_loop` ops. Having locality specifiers
+ modelled in a way similar to delayed privatization (i.e. the `omp.private` op) and
+ reductions (i.e. the `omp.declare_reduction` op) can make mapping `do concurrent`
+ to OpenMP (and other parallel programming models) much easier.
+
+ Therefore, one way to approach this problem is to extract the TableGen records
+ for relevant OpenMP clauses in a shared dialect for "data environment management"
+ and use these shared records for OpenMP, `do concurrent`, and possibly OpenACC
+ as well.
+
+ #### Supporting reductions
+
+ Similar to locality specifiers, mapping reductions from `do concurrent` to OpenMP
+ is also still an open TODO. We can potentially extend the MLIR infrastructure
+ proposed in the previous section to share reduction records among the different
+ relevant dialects as well.
+
### More advanced detection of loop nests

As pointed out earlier, any intervening code between the headers of 2 nested
- `do concurrent` loops prevents us currently from detecting this as a loop nest.
- In some cases this is overly conservative. Therefore, a more flexible detection
- logic of loop nests needs to be implemented.
+ `do concurrent` loops prevents us from detecting this as a loop nest. In some
+ cases this is overly conservative. Therefore, a more flexible detection logic
+ of loop nests needs to be implemented.
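
To make the limitation concrete, the sketch below contrasts a nest with no code between the two loop headers against one with a single intervening statement. The module and subroutine names are hypothetical. Both subroutines compute the same result; only the second contains the kind of intervening code that, per the text above, prevents the two loops from being detected as one nest.

```fortran
! Illustrative module; all names here are assumptions, not part of flang.
module nest_demo
  implicit none
contains
  ! Perfectly nested: nothing sits between the two loop headers.
  subroutine perfect_nest(a)
    integer, intent(out) :: a(:, :)
    integer :: i, j
    do concurrent (i = 1:size(a, 1))
      do concurrent (j = 1:size(a, 2))
        a(i, j) = i * j
      end do
    end do
  end subroutine perfect_nest

  ! Not perfectly nested: `x = 41` sits between the two loop headers.
  subroutine imperfect_nest(a)
    integer, intent(out) :: a(:, :)
    integer :: i, j, x
    do concurrent (i = 1:size(a, 1))
      x = 41
      do concurrent (j = 1:size(a, 2))
        a(i, j) = i * j
      end do
    end do
  end subroutine imperfect_nest
end module nest_demo
```

One conceivable relaxation for a case like `imperfect_nest` would be to sink the intervening statement into the inner loop before detection. That is only an illustration of the idea, not a description of implemented behavior.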

### Data-dependence analysis

Right now, we map loop nests without analysing whether such mapping is safe to
- do or not. We probalby need to at least warn the use of unsafe loop nests due
+ do or not. We probably need to at least warn the user of unsafe loop nests due
to loop-carried dependencies.
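
As a hedged illustration of what a loop-carried dependency looks like, the sketch below pairs a loop whose iterations are independent with one where each iteration reads the value written by the previous one. The names are hypothetical. The second loop is deliberately written as an ordinary `do`, since its iterations must run in order and it would not be a valid `do concurrent` loop.

```fortran
! Illustrative module; all names here are assumptions, not part of flang.
module dep_demo
  implicit none
contains
  ! Safe to run in any order: each iteration reads and writes only a(i).
  subroutine scale_all(a)
    integer, intent(inout) :: a(:)
    integer :: i
    do concurrent (i = 1:size(a))
      a(i) = 2 * a(i)
    end do
  end subroutine scale_all

  ! Loop-carried dependency: iteration i reads a(i-1), which iteration i-1
  ! has just written, so the iterations cannot be reordered or run in parallel.
  subroutine running_sum(a)
    integer, intent(inout) :: a(:)
    integer :: i
    do i = 2, size(a)
      a(i) = a(i) + a(i - 1)
    end do
  end subroutine running_sum
end module dep_demo
```

If a loop shaped like `running_sum` were written as `do concurrent` and mapped to OpenMP, the result could depend on thread scheduling. This is the kind of case a warning would be meant to catch.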

### Non-rectangular loop nests
@@ -330,3 +136,21 @@ end do
```
We defer this to the (hopefully) near future when we get the conversion in a
good shape for the samples/projects at hand.
+
+ ### Generalizing the pass to other parallel programming models
+
+ Once we have a stable and capable `do concurrent` to OpenMP mapping, we can take
+ this in a more generalized direction and allow the pass to target other models;
+ e.g. OpenACC. This goal should be kept in mind from the get-go even while only
+ targeting OpenMP.
+
+
+ ## Upstreaming status
+
+ - [x] Command line options for `flang` and `bbc`.
+ - [x] Conversion pass skeleton (no transformations happen yet).
+ - [x] Status description and tracking document (this document).
+ - [ ] Basic host/CPU mapping support.
+ - [ ] Basic device/GPU mapping support.
+ - [ ] More advanced host and device support (expanded to multiple items as needed).
+