|
| 1 | +<!--===- docs/DoConcurrentMappingToOpenMP.md |
| 2 | +
|
| 3 | + Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. |
| 4 | + See https://llvm.org/LICENSE.txt for license information. |
| 5 | + SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception |
| 6 | +
|
| 7 | +--> |
| 8 | + |
| 9 | +# `DO CONCURRENT` mapping to OpenMP |
| 10 | + |
| 11 | +```{contents} |
| 12 | +--- |
| 13 | +local: |
| 14 | +--- |
| 15 | +``` |
| 16 | + |
| 17 | +This document seeks to describe the effort to parallelize `do concurrent` loops |
| 18 | +by mapping them to OpenMP worksharing constructs. The goals of this document |
| 19 | +are: |
| 20 | +* Describing how to instruct `flang-new` to map `DO CONCURRENT` loops to OpenMP |
| 21 | + constructs. |
| 22 | +* Tracking the current status of such mapping. |
| 23 | +* Describing the limitations of the current implementation. |
| 24 | +* Describing next steps. |
| 25 | + |
| 26 | +## Usage |
| 27 | + |
| 28 | +In order to enable `do concurrent` to OpenMP mapping, `flang-new` adds a new |
| 29 | +compiler flag: `-fdo-concurrent-parallel`. This flag has 3 possible values: |
| 30 | +1. `host`: this maps `do concurrent` loops to run in parallel on the host CPU. |
| 31 | + This maps such loops to the equivalent of `omp parallel do`. |
| 32 | +2. `device`: this maps `do concurrent` loops to run in parallel on a device |
| 33 | + (GPU). This maps such loops to the equivalent of `omp target teams |
| 34 | + distribute parallel do`. |
| 35 | +3. `none`: this disables `do concurrent` mapping altogether. In that case, such |
| 36 | + loops are emitted as sequential loops. |
| 37 | + |
| 38 | +The above compiler switch is currently available only when OpenMP is also |
| 39 | +enabled. So you need to provide the following options to flang in order to |
| 40 | +enable it: |
| 41 | +``` |
| 42 | +flang-new ... -fopenmp -fdo-concurrent-parallel=[host|device|none] ... |
| 43 | +``` |
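| | + |
| | +As a rough illustration of what the `host` and `device` values correspond to, |
| | +the following hand-written sketch spells out the equivalent OpenMP directives |
| | +on ordinary `do` loops. The program name, the array size and the explicit |
| | +`map(tofrom: a)` clause are assumptions made for this example; the pass itself |
| | +emits MLIR operations rather than source-level directives. |
| | + |
| | +```fortran |
| | +program explicit_openmp_equivalents |
| | +  implicit none |
| | +  integer, parameter :: n = 8 |
| | +  integer :: a(n), i |
| | + |
| | +  ! Rough source-level counterpart of `-fdo-concurrent-parallel=host`. |
| | +  !$omp parallel do |
| | +  do i = 1, n |
| | +    a(i) = i * i |
| | +  end do |
| | +  !$omp end parallel do |
| | +  print *, a |
| | + |
| | +  ! Rough source-level counterpart of `-fdo-concurrent-parallel=device`. |
| | +  ! The explicit map clause is an assumption made for this sketch. |
| | +  !$omp target teams distribute parallel do map(tofrom: a) |
| | +  do i = 1, n |
| | +    a(i) = 2 * i |
| | +  end do |
| | +  !$omp end target teams distribute parallel do |
| | +  print *, a |
| | +end program explicit_openmp_equivalents |
| | +``` |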
| 44 | + |
| 45 | +## Current status |
| 46 | + |
| 47 | +Under the hood, `do concurrent` mapping is implemented in the |
| 48 | +`DoConcurrentConversionPass`. This is still an experimental pass which means |
| 49 | +that: |
| 50 | +* It has been tested in a very limited way so far. |
| 51 | +* It has only been tested on simple synthetic inputs. |
| 52 | + |
| 53 | +To describe the current status in more detail, the following sections show how |
| 54 | +the pass currently behaves for single-range loops and then for multi-range |
| 55 | +loops. |
| 56 | + |
| 57 | +### Single-range loops |
| 58 | + |
| 59 | +Given the following loop: |
| 60 | +```fortran |
| 61 | + do concurrent(i=1:n) |
| 62 | + a(i) = i * i |
| 63 | + end do |
| 64 | +``` |
| 65 | + |
| 66 | +#### Mapping to `host` |
| 67 | + |
| 68 | +Mapping this loop to the `host` generates MLIR operations of the following |
| 69 | +structure: |
| 70 | + |
| 71 | +```mlir |
| 72 | +%4 = fir.address_of(@_QFEa) ... |
| 73 | +%6:2 = hlfir.declare %4 ... |
| 74 | +
|
| 75 | +omp.parallel { |
| 76 | + // Allocate private copy for `i`. |
| 77 | + %19 = fir.alloca i32 {bindc_name = "i"} |
| 78 | + %20:2 = hlfir.declare %19 {uniq_name = "_QFEi"} ... |
| 79 | +
|
| 80 | + omp.wsloop { |
| 81 | + omp.loop_nest (%arg0) : index = (%21) to (%22) inclusive step (%c1_2) { |
| 82 | + %23 = fir.convert %arg0 : (index) -> i32 |
| 83 | + // Use the privatized version of `i`. |
| 84 | + fir.store %23 to %20#1 : !fir.ref<i32> |
| 85 | + ... |
| 86 | +
|
| 87 | + // Use "shared" SSA value of `a`. |
| 88 | + %42 = hlfir.designate %6#0 |
| 89 | + hlfir.assign %35 to %42 |
| 90 | + ... |
| 91 | + omp.yield |
| 92 | + } |
| 93 | + omp.terminator |
| 94 | + } |
| 95 | + omp.terminator |
| 96 | +} |
| 97 | +``` |
| 98 | + |
| 99 | +#### Mapping to `device` |
| 100 | + |
| 101 | +Mapping the same loop to the `device` generates MLIR operations of the |
| 102 | +following structure: |
| 103 | + |
| 104 | +```mlir |
| 105 | +// Map `a` to the `target` region. |
| 106 | +%29 = omp.map.info ... {name = "_QFEa"} |
| 107 | +omp.target ... map_entries(..., %29 -> %arg4 ...) { |
| 108 | + ... |
| 109 | + %51:2 = hlfir.declare %arg4 |
| 110 | + ... |
| 111 | + omp.teams { |
| 112 | + // Allocate private copy for `i`. |
| 113 | + %52 = fir.alloca i32 {bindc_name = "i"} |
| 114 | + %53:2 = hlfir.declare %52 |
| 115 | + ... |
| 116 | +
|
| 117 | + omp.distribute { |
| 118 | + omp.parallel { |
| 119 | + omp.wsloop { |
| 120 | + omp.loop_nest (%arg5) : index = (%54) to (%55) inclusive step (%c1_9) { |
| 121 | + // Use the privatized version of `i`. |
| 122 | + %56 = fir.convert %arg5 : (index) -> i32 |
| 123 | + fir.store %56 to %53#1 |
| 124 | + ... |
| 125 | + // Use the mapped version of `a`. |
| 126 | + ... = hlfir.designate %51#0 |
| 127 | + ... |
| 128 | + } |
| 129 | + omp.terminator |
| 130 | + } |
| 131 | + omp.terminator |
| 132 | + } |
| 133 | + omp.terminator |
| 134 | + } |
| 135 | + omp.terminator |
| 136 | + } |
| 137 | + omp.terminator |
| 138 | +} |
| 139 | +``` |
| 140 | + |
| 141 | +### Multi-range loops |
| 142 | + |
| 143 | +The pass currently supports multi-range loops as well. Given the following |
| 144 | +example: |
| 145 | + |
| 146 | +```fortran |
| 147 | + do concurrent(i=1:n, j=1:m) |
| 148 | + a(i,j) = i * j |
| 149 | + end do |
| 150 | +``` |
| 151 | + |
| 152 | +The generated `omp.loop_nest` operation looks like: |
| 153 | + |
| 154 | +```mlir |
| 155 | +omp.loop_nest (%arg0, %arg1) |
| 156 | + : index = (%17, %19) to (%18, %20) |
| 157 | + inclusive step (%c1_2, %c1_4) { |
| 158 | + fir.store %arg0 to %private_i#1 : !fir.ref<i32> |
| 159 | + fir.store %arg1 to %private_j#1 : !fir.ref<i32> |
| 160 | + ... |
| 161 | + omp.yield |
| 162 | +} |
| 163 | +``` |
| 164 | + |
| 165 | +It is worth noting that we have privatized versions for both iteration |
| 166 | +variables: `i` and `j`. These are locally allocated inside the parallel/target |
| 167 | +OpenMP region similar to what the single-range example in the previous section |
| 168 | +shows. |
| 169 | + |
| 170 | +#### Multi-range and perfectly-nested loops |
| 171 | + |
| 172 | +Currently, on the `FIR` dialect level, the following 2 loops are modelled in |
| 173 | +exactly the same way: |
| 174 | + |
| 175 | +```fortran |
| 176 | +do concurrent(i=1:n, j=1:m) |
| 177 | + a(i,j) = i * j |
| 178 | +end do |
| 179 | +``` |
| 180 | + |
| 181 | +```fortran |
| 182 | +do concurrent(i=1:n) |
| 183 | + do concurrent(j=1:m) |
| 184 | + a(i,j) = i * j |
| 185 | + end do |
| 186 | +end do |
| 187 | +``` |
| 188 | + |
| 189 | +Both of the above loops are modelled as: |
| 190 | + |
| 191 | +```mlir |
| 192 | +fir.do_loop %arg0 = %11 to %12 step %c1 unordered { |
| 193 | + ... |
| 194 | + fir.do_loop %arg1 = %14 to %15 step %c1_1 unordered { |
| 195 | + ... |
| 196 | + } |
| 197 | +} |
| 198 | +``` |
| 199 | + |
| 200 | +Consequently, from the `DoConcurrentConversionPass`' perspective, both loops |
| 201 | +are treated in the same manner. Under the hood, the pass detects |
| 202 | +perfectly-nested loop nests and maps such nests as if they were multi-range |
| 203 | +loops. |
| 204 | + |
| 205 | +#### Non-perfectly-nested loops |
| 206 | + |
| 207 | +One limitation that the pass currently has is that it treats any intervening |
| 208 | +code in a loop nest as being disruptive to detecting that nest as a single |
| 209 | +unit. For example, given the following input: |
| 210 | + |
| 211 | +```fortran |
| 212 | +do concurrent(i=1:n) |
| 213 | + x = 41 |
| 214 | + do concurrent(j=1:m) |
| 215 | + a(i,j) = i * j |
| 216 | + end do |
| 217 | +end do |
| 218 | +``` |
| 219 | + |
| 220 | +Since there is at least one statement between the 2 loop headers (i.e. `x = 41`), |
| 221 | +the pass does not detect the `i` and `j` loops as a nest. Rather, the pass in |
| 222 | +that case only maps the `i` loop to OpenMP and leaves the `j` loop in its |
| 223 | +original form. In theory, in this example, we can sink the intervening code |
| 224 | +into the `j` loop and detect the complete nest. However, such a transformation is |
| 225 | +still to be implemented in the future. |
| 226 | + |
| 227 | +The above also has the consequence that the `j` variable will **not** be |
| 228 | +privatized in the OpenMP parallel/target region. In other words, it will be |
| 229 | +treated as if it was a `shared` variable. For more details about privatization, |
| 230 | +see the "Data environment" section below. |
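| | + |
| | +The sinking idea mentioned above can be sketched by hand at the source level. |
| | +This is only an illustration of the intended end result, not something the |
| | +pass does today; the program name and the array bounds are assumptions made |
| | +for the example. |
| | + |
| | +```fortran |
| | +program sink_intervening_code |
| | +  implicit none |
| | +  integer, parameter :: n = 4, m = 3 |
| | +  integer :: a(n, m), i, j, x |
| | + |
| | +  ! `x = 41` has been moved into the `j` loop by hand, so the two loop |
| | +  ! headers are adjacent and the nest is perfect. |
| | +  do concurrent (i = 1:n) |
| | +    do concurrent (j = 1:m) |
| | +      x = 41 |
| | +      a(i, j) = i * j |
| | +    end do |
| | +  end do |
| | + |
| | +  print *, a |
| | +end program sink_intervening_code |
| | +``` |
| | + |
| | +Note that the rewrite only preserves the assignment to `x` when the inner loop |
| | +runs at least once, which is one reason an automatic version of this |
| | +transformation needs care. |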
| 231 | + |
| 232 | +### Data environment |
| 233 | + |
| 234 | +By default, variables that are used inside a `do concurrent` loop nest are |
| 235 | +either treated as `shared` in case of mapping to `host`, or mapped into the |
| 236 | +`target` region using a `map` clause in case of mapping to `device`. The only |
| 237 | +exceptions to this are: |
| 238 | + 1. the loop's iteration variable(s) (IV) of **perfect** loop nests. In that |
| 239 | + case, for each IV, we allocate a local copy as shown by the mapping |
| 240 | + examples above. |
| 241 | + 1. any values that are from allocations outside the loop nest and used |
| 242 | + exclusively inside of it. In such cases, a local privatized |
| 243 | + value is created in the OpenMP region to prevent multiple teams of threads |
| 244 | + from accessing and destroying the same memory block which causes runtime |
| 245 | + issues. For an example of such cases, see |
| 246 | + `flang/test/Transforms/DoConcurrent/locally_destroyed_temp.f90`. |
| 247 | + |
| 248 | +#### Non-perfectly-nested loops' IVs |
| 249 | + |
| 250 | +For non-perfectly-nested loops, the IVs are still treated as `shared` or |
| 251 | +`map` entries as pointed out above. This **might not** be consistent with what |
| 252 | +the Fortran specification tells us. In particular, taking the following |
| 253 | +snippets from the spec (version 2023) into account: |
| 254 | + |
| 255 | +> § 3.35 |
| 256 | +> ------ |
| 257 | +> construct entity |
| 258 | +> entity whose identifier has the scope of a construct |
| 259 | +
|
| 260 | +> § 19.4 |
| 261 | +> ------ |
| 262 | +> A variable that appears as an index-name in a FORALL or DO CONCURRENT |
| 263 | +> construct, or ... is a construct entity. A variable that has LOCAL or |
| 264 | +> LOCAL_INIT locality in a DO CONCURRENT construct is a construct entity. |
| 265 | +> ... |
| 266 | +> The name of a variable that appears as an index-name in a DO CONCURRENT |
| 267 | +> construct, FORALL statement, or FORALL construct has a scope of the statement |
| 268 | +> or construct. A variable that has LOCAL or LOCAL_INIT locality in a DO |
| 269 | +> CONCURRENT construct has the scope of that construct. |
| 270 | +
|
| 271 | +From the above quotes, it seems there is an equivalence between the IV of a `do |
| 272 | +concurrent` loop and a variable with a `LOCAL` locality specifier (equivalent |
| 273 | +to OpenMP's `private` clause). This means that we should probably |
| 274 | +localize/privatize a `do concurrent` loop's IV even if it is not perfectly |
| 275 | +nested in the nest we are parallelizing. For now, however, we **do not** do |
| 276 | +that as pointed out previously. In the near future, we propose a middle-ground |
| 277 | +solution (see the Next steps section for more details). |
| 278 | + |
| 279 | +## Next steps |
| 280 | + |
| 281 | +### Delayed privatization |
| 282 | + |
| 283 | +So far, we emit the privatization logic for IVs inline in the parallel/target |
| 284 | +region. This is enough for our purposes right now since we don't |
| 285 | +localize/privatize any sophisticated types of variables yet. Once we have need |
| 286 | +for more advanced localization through `do concurrent`'s locality specifiers |
| 287 | +(see below), delayed privatization will enable us to have a much cleaner IR. |
| 288 | +Once the upstream implementation of delayed privatization supports the |
| 289 | +constructs required by the pass, we will move to it rather than inlined/early |
| 290 | +privatization. |
| 291 | + |
| 292 | +### Locality specifiers for `do concurrent` |
| 293 | + |
| 294 | +Locality specifiers will enable the user to control the data environment of the |
| 295 | +loop nest in a more fine-grained way. Implementing these specifiers on the |
| 296 | +`FIR` dialect level is needed in order to support this in the |
| 297 | +`DoConcurrentConversionPass`. |
| 298 | + |
| 299 | +Such specifiers will also unlock a potential solution to the |
| 300 | +non-perfectly-nested loops' IVs issue described above. In particular, for a |
| 301 | +non-perfectly nested loop, one middle-ground proposal/solution would be to: |
| 302 | +* Emit the loop's IV as shared/mapped just like we do currently. |
| 303 | +* Emit a warning that the IV of the loop is emitted as shared/mapped. |
| 304 | +* Given support for `LOCAL`, we can recommend the user to explicitly |
| 305 | + localize/privatize the loop's IV if they choose to. |
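| | + |
| | +For readers unfamiliar with locality specifiers, here is a minimal sketch of |
| | +the `LOCAL` specifier in source form. The variable `tmp` and the program name |
| | +are made up for the example, and whether a given `flang-new` build lowers the |
| | +specifier is outside the scope of this sketch. |
| | + |
| | +```fortran |
| | +program locality_specifier_example |
| | +  implicit none |
| | +  integer, parameter :: n = 8 |
| | +  integer :: a(n), i, tmp |
| | + |
| | +  ! `local(tmp)` gives every iteration its own `tmp`, comparable to |
| | +  ! OpenMP's `private` clause. |
| | +  do concurrent (i = 1:n) local(tmp) |
| | +    tmp = i * i |
| | +    a(i) = tmp + 1 |
| | +  end do |
| | + |
| | +  print *, a |
| | +end program locality_specifier_example |
| | +``` |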
| 306 | + |
| 307 | +### More advanced detection of loop nests |
| 308 | + |
| 309 | +As pointed out earlier, any intervening code between the headers of 2 nested |
| 310 | +`do concurrent` loops prevents us currently from detecting this as a loop nest. |
| 311 | +In some cases this is overly conservative. Therefore, a more flexible detection |
| 312 | +logic of loop nests needs to be implemented. |
| 313 | + |
| 314 | +### Data-dependence analysis |
| 315 | + |
| 316 | +Right now, we map loop nests without analysing whether such mapping is safe to |
| 317 | +do or not. We probably need to at least warn the user about unsafe loop nests due |
| 318 | +to loop-carried dependencies. |
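| | + |
| | +A small example of the kind of loop such an analysis would need to flag is |
| | +sketched below. The program is an assumption made for illustration: it |
| | +violates the `do concurrent` restrictions because one iteration reads an |
| | +element that another iteration writes, so its result can depend on the order |
| | +in which iterations run once the loop is parallelized. |
| | + |
| | +```fortran |
| | +program loop_carried_dependency |
| | +  implicit none |
| | +  integer, parameter :: n = 8 |
| | +  integer :: a(n), i |
| | + |
| | +  a = 1 |
| | + |
| | +  ! Iteration `i` reads a(i - 1), which iteration `i - 1` writes. |
| | +  ! Mapping this loop to OpenMP without a check is unsafe. |
| | +  do concurrent (i = 2:n) |
| | +    a(i) = a(i - 1) + 1 |
| | +  end do |
| | + |
| | +  print *, a |
| | +end program loop_carried_dependency |
| | +``` |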
| 319 | + |
| 320 | +### Non-rectangular loop nests |
| 321 | + |
| 322 | +So far, we did not need to use the pass for non-rectangular loop nests. For |
| 323 | +example: |
| 324 | +```fortran |
| 325 | +do concurrent(i=1:n) |
| 326 | + do concurrent(j=i:n) |
| 327 | + ... |
| 328 | + end do |
| 329 | +end do |
| 330 | +``` |
| 331 | +We defer this to the (hopefully) near future when we get the conversion in a |
| 332 | +good shape for the samples/projects at hand. |