|
| 1 | +# Explicit SIMD design notes |
| 2 | + |
| 3 | +This documents is a collection of notes describing design and/or implementation |
| 4 | +of various parts of the ESIMD programming model support within the DPC++. |
| 5 | + |
| 6 | +## Overview of ESIMD support in DPC++ components |
| 7 | + |
| 8 | +ESIMD support is spread across a number of components in the oneAPI software |
| 9 | +stack, spanning compile time, link time and runtime. The picture below shows |
| 10 | +simplified view of the DPC++ compiler and runtime diagram and where ESIMD |
| 11 | +(sub-)components fit in it. |
| 12 | + |
| 13 | + |
| 14 | + |
| 15 | +### User program |
| 16 | + |
| 17 | +User program can contain both SYCL and ESIMD kernels, either in the same or |
| 18 | +different translation units. DPC++ ESIMD support will automatically split the |
| 19 | +device code into SYCL and ESIMD parts to redirect them to different back-ends. |
| 20 | +To facilitate this splitting, compiler will automatically identify markup and |
| 21 | +clone parts of the ESIMD callgraph starting from kernels and functions |
| 22 | +explicitly marked with the `intel::sycl_explicit_simd` attribute. |
| 23 | + |
| 24 | +### Device headers |
| 25 | + |
| 26 | +There are two distinct parts of ESIMD support in the SYCL device headers. The |
| 27 | +first one is ESIMD-related "tweaks" within the usual SYCL headers, and the |
| 28 | +second is the ESIMD APIs themselves. |
| 29 | + |
| 30 | +#### ESIMD tweaks in SYCL device headers |
| 31 | + |
| 32 | +The most important one is device-side definition of the sycl::accessor class |
| 33 | +definition. It has different layout for SYCL and ESIMD (but the same size due |
| 34 | +to padding), as well as different initialization function used to assemble |
| 35 | +accessor object on the device side from incoming kernel arguments. For SYCL, |
| 36 | +the function name is `sycl::accessor::__init`, for ESIMD it is |
| 37 | +`sycl::accessor::__init_esimd`. The difference is caused by the ESIMD limitation |
| 38 | +\- it does not support offset, memory and access range for an accessor object. |
| 39 | + |
| 40 | +There are also couple ESIMD-specific proxy classes used by ESIMD API |
| 41 | +implementation to access internals of ESIMD objects: |
| 42 | + |
| 43 | +- `sycl::ext::intel::esimd::detail::AccessorPrivateProxy` |
| 44 | + this one is used by ESIMD memory APIs to access internals of `sycl::accessor` |
| 45 | + objects |
| 46 | +- `sycl::ext::intel::esimd::detail::WrapperElementTypeProxy` |
| 47 | + used to access internals of SYCL types such as `sycl::detail::half_impl::half` |
| 48 | + |
| 49 | +#### ESIMD API headers |
| 50 | + |
| 51 | +These headers define ESIMD APIs to be used by ESIMD user kernels. For example, |
| 52 | +the basic vector data type `sycl::ext::intel::esimd::simd`, |
| 53 | +`sycl::ext::intel::esimd::gather` memory APIs |
| 54 | + |
| 55 | +Source locations: |
| 56 | + |
| 57 | +- `sycl/ext/intel/esimd` |
| 58 | +- `sycl/ext/intel/experimental/esimd` |
| 59 | + |
| 60 | +### Clang driver |
| 61 | + |
| 62 | +TODO: describe driver modifications. |
| 63 | + |
| 64 | +Source locations: |
| 65 | + |
| 66 | +- `clang/lib/Driver/ToolChains/Clang.cpp` |
| 67 | +- `clang/include/clang/Driver/Options.td` |
| 68 | + |
| 69 | +### Clang front-end |
| 70 | + |
| 71 | +#### Semantic analyzer, integration header generator |
| 72 | + |
| 73 | +- Generation of ESIMD kernel signature and prolog code (which re-assembles |
| 74 | + objects such as accessors from incoming arguments) is tweaked for ESIMD |
| 75 | + kernels: |
| 76 | + - accessors generate fewer kernel arguments |
| 77 | + - accessor object is assembled in the prolog using `__init_esimd` function |
| 78 | + instead of `__init`. See function name definition and its usage |
| 79 | + [here](https://github.com/intel/llvm/blob/eb33bbcfbeab7af1a7f58fb4dc6b53bc47f73dba/clang/lib/Sema/SemaSYCL.cpp#L62). |
| 80 | +- Defines `intel::sycl_explicit_simd` function and variable attribute to markup |
| 81 | + kernels top-level ESIMD external functions and "private globals". |
| 82 | +- Performs various ESIMD-specific semantic analysis and diagnostics. All kinds |
| 83 | + of ESIMD error messages can be found in |
| 84 | + `clang/include/clang/Basic/DiagnosticSemaKinds.td`. |
| 85 | +- Integration header generator inserts additional |
| 86 | + `static constexpr bool isESIMD()` function into kernel discriptor type, which |
| 87 | + is used by the host compiler to obtain compile-time details for the kernel via |
| 88 | + inclusion of the integration header generated by the generator. For ESIMD |
| 89 | + kernels `isESIMD()` returns true, and this is used in SYCL RT headers to |
| 90 | + distinguish ESIMD kernels from SYCL kernels, e.g. to customize kernel argument |
| 91 | + setting for accessors. |
| 92 | +- defines `sycl_esimd_vectorize` attribute propagated to the VC BE, aimed at |
| 93 | + making it possible to call SIMT function from ESIMD code while vectorizing the |
| 94 | + former. Used internally to implement certain math functions in SYCL spec |
| 95 | + conformant manner (proper precision) via vectorizing scalar conformant |
| 96 | + implementations - see example |
| 97 | + [usage](https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/intel/esimd/detail/math_intrin.hpp#L199) |
| 98 | + |
| 99 | +Source locations: |
| 100 | + |
| 101 | +- `clang/lib/Sema/SemaSYCL.cpp` |
| 102 | +- `clang/lib/Sema/SemaDecl.cpp` |
| 103 | +- `clang/lib/Sema/SemaExpr.cpp` |
| 104 | +- `clang/include/clang/Basic/Attr.td` |
| 105 | +- `clang/include/clang/Basic/DiagnosticSemaKinds.td` |
| 106 | + |
| 107 | +#### Code (LLVMIR) generator |
| 108 | + |
| 109 | +ESIMD-specific code generator tweaks are mostly translations of internal FE |
| 110 | +representation of variaous ESIMD attributes into LLVM IR attributes or metadata. |
| 111 | + |
| 112 | +##### Kernel signature generation |
| 113 | + |
| 114 | +For ESIMD kernels, a number of additional attributes are generated for the |
| 115 | +kernel function itself as well as certain argument. |
| 116 | + |
| 117 | +- Kernels are annotated with `sycl_explicit_simd` and |
| 118 | + `intel_reqd_sub_group_size` attributes. The latter must always be `1` for a |
| 119 | + ESIMD kernel or function. |
| 120 | +- An argument which conveys accessor's pointer is assigned a |
| 121 | + `kernel_arg_accessor_ptr` attribute |
| 122 | + |
| 123 | +##### Global variable code generation |
| 124 | + |
| 125 | +ESIMD supports "private globals" - global variables which have one copy per |
| 126 | +thread of execution (similar to C++ thread_local), normally allocated of Gen |
| 127 | +register file. To make a global variable a "private global", |
| 128 | +`__attribute__((opencl_private)) __attribute__((sycl_explicit_simd))` |
| 129 | +attributes are used. Globals of this can be forced to a specific register using |
| 130 | +the `__attribute__((register_num(n)))` attribute. The clang code generator |
| 131 | +translates these to `genx_volatile` and `genx_byte_offset` LLVM IR attributes. |
| 132 | + |
| 133 | +##### Function attributes translations |
| 134 | + |
| 135 | +- `sycl_esimd_vectorize` -> `CMGenxSIMT` |
| 136 | + |
| 137 | +Source locations: |
| 138 | + |
| 139 | +- `clang/lib/CodeGen/CGSYCLRuntime.cpp` |
| 140 | +- `clang/lib/CodeGen/CodeGenFunction.cpp` |
| 141 | +- `clang/lib/CodeGen/CodeGenModule.cpp` |
| 142 | + |
| 143 | +### Clang middle-end |
| 144 | + |
| 145 | +#### ESIMD API restriction verifier |
| 146 | + |
| 147 | +This component is an LLVM IR pass over a compiled translation unit. It checks |
| 148 | +for presence of certain SYCL APIs which are disallowed within ESIMD code. For |
| 149 | +exaple, SYCL reductions are not allowed in ESIMD. The verifier does this by |
| 150 | +demangling all the call targets within ESIMD code and matching them with |
| 151 | +internal sub-string filters. Invoked from `clang/lib/CodeGen/BackendUtil.cpp`. |
| 152 | + |
| 153 | +Source locations: |
| 154 | + |
| 155 | +- `llvm/lib/SYCLLowerIR/ESIMD/ESIMDVerifier.cpp` |
| 156 | + |
| 157 | +### sycl-post-link transformations |
| 158 | + |
| 159 | +As a part of the input device code module transformation pipeline, the |
| 160 | +sycl-post-link tool splits the input module (or modules resulting from splitting |
| 161 | +by other characteristics, such as aspects) into two - SYCL and ESIMD ones. |
| 162 | +Shared functions invoked both from SYCL and ESIMD are cloned during the process. |
| 163 | +This is necessary because SYCL and ESIMD parts must undergo different set of |
| 164 | +transformations before generating resulting SPIR-V. ESIMD modules resulting from |
| 165 | +splitting are marked with specific device binary property `isEsimdImage` (see |
| 166 | +[source](https://github.com/intel/llvm/blob/9dc14a292f479880b5ab2e95f99a8414b31d1165/llvm/tools/sycl-post-link/sycl-post-link.cpp#L453) |
| 167 | +.) |
| 168 | + |
| 169 | +`sycl-post-link` is the post-link process driver, it invokes necessary |
| 170 | +transformations as well as optimizations on fully linked device code. As a part |
| 171 | +of the process it splits SYCL and ESIMD parts of the code into separate LLVM IR |
| 172 | +modules and invokes different set or transformations on them. If a program has |
| 173 | +an `invoke_simd` call in it, then sycl-post-link will link SYCL and ESIMD parts |
| 174 | +back, cloning overlaping parts as needed. |
| 175 | + |
| 176 | +Source locations: |
| 177 | + |
| 178 | +- `llvm/tools/sycl-post-link/sycl-post-link.cpp` |
| 179 | + |
| 180 | +#### ESIMD Lowerer |
| 181 | + |
| 182 | +ESIMD part of device code undergoes a set of ESIMD-specific transformations. |
| 183 | +First, intrinsic lowering and metadata generation phase happens. It is |
| 184 | +implemented in the `SYCLLowerESIMDPass` LLVM IR Module pass. Its primary |
| 185 | +purposes are: |
| 186 | + |
| 187 | +- translate `__esimd_*` intrinsic calls into corresponding `genx.*` intrinsics |
| 188 | + known to the VC BE |
| 189 | + - in some cases, there is no direct equivalent (for example, |
| 190 | + `__esimd_pack_mask`), in which case the lowerer generates LLVM IR with |
| 191 | + desired semantics |
| 192 | +- translate some of the `__spirv.*` intrinsics to something acceptable by VC |
| 193 | + BE |
| 194 | + |
| 195 | +Source locations: |
| 196 | + |
| 197 | +- `LowerESIMD.cpp` |
| 198 | +- `ESIMDOptimizeVecArgCallConv.cpp` |
| 199 | +- `LowerESIMDVecArg.cpp` |
| 200 | +- `LowerESIMDVLoadVStore.cpp` |
| 201 | + |
| 202 | +#### Genx SPIR-V writer adaptor |
| 203 | + |
| 204 | +`(part of vc-intrinsics repo)` |
| 205 | + |
| 206 | +### SYCL Runtime |
| 207 | + |
| 208 | +SYCL runtime (RT) has a few places where ESIMD is handled specially: |
| 209 | + |
| 210 | +- When setting kernel invocation arguments corresponding to an accessor, RT will |
| 211 | + skip setting offset, memory and access ranges arguments (normally set for |
| 212 | + usual SYCL kernels), because ESIMD does not support these. In other words, an |
| 213 | + accessors used within kernel (and captured in kernel lambda) is translated to |
| 214 | + 4 SPIR-V kernel arguments for a normal SYCL kernel, and just to 1 argument for |
| 215 | + a ESIMD kernel. |
| 216 | + [Link](https://github.com/intel/llvm/blob/d7a7de79f8a6498bae52331f4789adcac76b8e8c/sycl/source/handler.cpp#L373). |
| 217 | +- When creating JIT compilation options, SYCL runtime checks if the device |
| 218 | + binary image to be JIT-compiled has "isESIMDImage" property, in which case it |
| 219 | + adds `-vc-codegen` JIT options, which makes Intel GPU runtime use the vector |
| 220 | + backend (aka 'VC BE') to JIT-compile the device binary (SPIR-V). |
| 221 | + [Link](https://github.com/intel/llvm/blob/d7a7de79f8a6498bae52331f4789adcac76b8e8c/sycl/source/detail/program_manager/program_manager.cpp#L412). |
| 222 | +- Code related to ESIMD emulator support and ESIMD kernel invocation via ESIMD |
| 223 | + emulator. Plugin |
| 224 | + [sources](https://github.com/intel/llvm/blob/sycl/sycl/plugins/esimd_emulator/pi_esimd_emulator.cpp), |
| 225 | + ESIMD kernel invocation via the emulator |
| 226 | + [link](https://github.com/intel/llvm/blob/d7a7de79f8a6498bae52331f4789adcac76b8e8c/sycl/source/detail/scheduler/commands.cpp#L2369). |
| 227 | + |
| 228 | +## TODOs |
| 229 | + |
| 230 | +This section lists current major ESIMD gaps/TODOs. |
| 231 | + |
| 232 | +1. Move all APIs out of the experimental namespace. One of the major APIs there |
| 233 | + is LSC memory accesses. The main roadblock for making it stable API is |
| 234 | + absense of specification for cache hints, which should be shared between SYCL |
| 235 | + and ESIMD. |
| 236 | +1. Architecture specific APIs should be explicitly marked as such in the user |
| 237 | + documentation with references to the |
| 238 | + [list of architectures](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_device_architecture.asciidoc) |
| 239 | + known to oneAPI. |
| 240 | +1. Properly markup architecture-specific APIs, such as `dpas`, with required |
| 241 | + aspects, according to the "optional device features" |
| 242 | + [design](https://github.com/intel/llvm/blob/sycl/sycl/doc/design/OptionalDeviceFeatures.md). |
| 243 | + This might require splitting implementations into per-architecture variants. |
| 244 | + `if_device_has` |
| 245 | + [feature](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/proposed/sycl_ext_oneapi_device_if.asciidoc) |
| 246 | + may help avoid duplication of common parts and dispatch to |
| 247 | + architecture-dependent code at fine-grained level from within a function. |
| 248 | +1. As VC BE moves away from `genx.*` intrinsics replacing them with `__spirv_*` |
| 249 | + ones defined in various extensions, ESIMD should catch up. |
| 250 | +1. Unification of common `simd_view`/`simd` interfaces in fact leads to |
| 251 | + significant complication of implementation rather than its intended |
| 252 | + simplification via avoiding code duplication, might make sense to have |
| 253 | + separate implementations. |
| 254 | + |
| 255 | +## Directions |
| 256 | + |
| 257 | +This section lists possible directions for ESIMD improvements. |
| 258 | + |
| 259 | +1. Support `std::simd`. This is the standard C++ way for explicit SIMD |
| 260 | + programming. Can help run (subsest of ESIMD) on CPU efficiently in the |
| 261 | + future. |
| 262 | +1. Clear (via namespace?) separation of ESIMD APIs into portable and |
| 263 | + architecture-specific parts. |
| 264 | +1. Standardizing `simd_view` or equivalent. This is effectively a reference |
| 265 | + to a *subset* of `esimd::simd` vector object's elements. The subset is |
| 266 | + defined in a regular way via starting offset, stride and number of elements |
| 267 | + in the subset. This proved to be very useful and loved by users. Missing in |
| 268 | + `std::simd`. |
| 269 | +1. Design something like `invoke_spmd` (similar to `invoke_simd` extension) to |
| 270 | + be able to invoke SPMD functions from ESIMD code while vectorizing the calls |
| 271 | + in the back-end. This would replace `sycl_esimd_vectorize` and make this |
| 272 | + concept usable by all users, not only internal ESIMD implementation. |
| 273 | +1. Create a specification for ESIMD kernel ABI and stand-alone kernel |
| 274 | + declaration rules to make ESIMD kernels callable by arbitrary host offload |
| 275 | + runtimes, such as Level Zero. |
0 commit comments