|
| 1 | +# ESIMD "stateless" accessors support design |
| 2 | + |
| 3 | +This document describes design of automatic conversion of "stateful" memory |
| 4 | +accesses via SYCL accessors to "stateless" accesses within ESIMD kernels by the |
| 5 | +SYCL compiler. |
| 6 | + |
| 7 | +## Overview of Intel GPU memory access types |
| 8 | +Intel GPU hardware has two main modes of accessing memory - *stateless* and |
| 9 | +*stateful*, with specific memory access data port messages corresponding to each |
| 10 | +of the modes (`send` instructions with dataport fixed function target). |
| 11 | +- In a *stateless* access base memory location(s) is represented with a single |
| 12 | +virtual memory address, which can be a USM pointer. |
| 13 | +- In *stateful* - with a <*surface index*, *32-bit offset*> pair of values, where |
| 14 | +the *surface index* is an index into a "binding table" which contains surface |
| 15 | +descriptors available to the kernel. Surface is a contigous memory area |
| 16 | +accessible through its descriptor by stateful accesses. Each descriptor contains |
| 17 | +various information about the surface - for example, its size and format. |
| 18 | + |
| 19 | +Pointers used in statless accesses are usually coming from USM or C++ memory |
| 20 | +allocation routines and are passed directly by the runtime as kernel arguments. |
| 21 | + |
| 22 | +The stateful access style has a number of drawbacks which makes it undesirable |
| 23 | +to use in HPC application. The biggest one is 4Gb limitation on the surface |
| 24 | +size. Another one is problems with creating data structures with nested |
| 25 | +pointer fields or double indirection on host and use them on the device. |
| 26 | + |
| 27 | +## Accessor and USM pointer kernel argument passing details |
| 28 | +ESIMD compiler when compiling a kernel records information about each memory |
| 29 | +argument and stores it together with the kernel's SPIRV. Basically, for each |
| 30 | +kernel argument, there is information whether it is a memory argument, and, if |
| 31 | +yes, whether it is surface-based or pointer-based. |
| 32 | + |
| 33 | +When JITting the kernel, the scalar GPU compiler back-end can convert memory |
| 34 | +arguments and memory accesses between the two modes depending on optimization |
| 35 | +or other settings, and record final type of memory argument with the generated |
| 36 | +kernel executable. GPU runtime uses that information to wrap/not wrap incoming |
| 37 | +memory pointer with a surface before passing it onto the harware into the |
| 38 | +actual kernel argument. |
| 39 | + |
| 40 | +The vector back-end can't do this in many cases, as memory accesses in SPIRV |
| 41 | +are represented by hardware-specific intrinsics rather then standard generic |
| 42 | +memory access SPIRV instructions. This design basically enables the vector BE |
| 43 | +to redirect code generation for stateful memory access APIs to stateless or |
| 44 | +stateful intrinsics, and also generate correct annotations. Since it uses the |
| 45 | +same runtime, which relies on parameter annotation when making the wrap/no-wrap |
| 46 | +decision, the runtime part does not need much changes. |
| 47 | + |
| 48 | +## Problem definition |
| 49 | + |
| 50 | +Currently, ESIMD compiler always maps buffer/accessor-based memory accesses to |
| 51 | +stateful accesses, thus imposing the 4Gb datum size limitation on user programs |
| 52 | +with accessors. |
| 53 | + |
| 54 | +## Proposed solution |
| 55 | + |
| 56 | +### Short/mid-term |
| 57 | + |
| 58 | +#### API header changes |
| 59 | +The general idea is to introduce C++ preprocessor macro |
| 60 | +`ESIMD_FORCE_STATELESS_MEM_ACCESS` which will control code generation for the |
| 61 | +stateful memory access APIs - such as: |
| 62 | + |
| 63 | +```cpp |
| 64 | +template <typename Tx, int N, typename AccessorTy, |
| 65 | + typename Flags = vector_aligned_tag, |
| 66 | + typename = std::enable_if_t<is_simd_flag_type_v<Flags>>, |
| 67 | + class T = detail::__raw_t<Tx>> |
| 68 | +__ESIMD_API simd<Tx, N> block_load(AccessorTy acc, uint32_t offset, |
| 69 | + Flags = {}); |
| 70 | + |
| 71 | +template <typename T, int N, typename AccessorTy> |
| 72 | +__ESIMD_API std::enable_if_t<(sizeof(T) <= 4) && |
| 73 | + (N == 1 || N == 8 || N == 16 || N == 32) && |
| 74 | + !std::is_pointer<AccessorTy>::value, |
| 75 | + simd<T, N>> |
| 76 | +gather(AccessorTy acc, simd<uint32_t, N> offsets, uint32_t glob_offset = 0, |
| 77 | + simd_mask<N> mask = 1); |
| 78 | +``` |
| 79 | +
|
| 80 | +Implementation of the APIs would follow this pattern: |
| 81 | +
|
| 82 | +```cpp |
| 83 | +// this API should verify that accessor is to global address space, this is needed both for |
| 84 | +// the case with conversion to stateless and the case w/o. |
| 85 | +T stateful_memory_api(accessor acc, uint32 offset, args...) { |
| 86 | +#ifdef ESIMD_FORCE_STATELESS_MEM_ACCESS |
| 87 | + accessor_elelemt_type *ptr = acc.get_pointer() + offset; |
| 88 | + return stateless_memory_api(ptr, args...); |
| 89 | +#else |
| 90 | + <original implementation> |
| 91 | +#endif |
| 92 | +} |
| 93 | +``` |
| 94 | + |
| 95 | +The new macro is supposed to be set by users directly or via some logic based on |
| 96 | +other macros set by users. |
| 97 | + |
| 98 | +#### Compiler changes |
| 99 | + |
| 100 | +The API part of the implementation is as simple as above, the compiler |
| 101 | +one is slightly more complicated. Compiler needs to make sure that in presence |
| 102 | +of `ESIMD_FORCE_STATELESS_MEM_ACCESS` macro, the actual memory parameter |
| 103 | +annotation described above is correct and tells that memory is a pointer, not a |
| 104 | +surface index. Parameter annotations are generated by the front-end - these are |
| 105 | +`kernel_arg_accessor_ptr` and `kernel_arg_type` metadata nodes, which are |
| 106 | +then translated to `buffer_t` (for surface) or `svmptr_t` (for pointer) |
| 107 | +metadata annotations consumed by the back-end. |
| 108 | + |
| 109 | +##### Variant 1 |
| 110 | +This is the recommended variant. A new driver option is added - |
| 111 | +`-fsycl-esimd-force-stateless-mem-access`. Under this option: |
| 112 | +- SYCL C++ device compiler FE defines the `ESIMD_FORCE_STATELESS_MEM_ACCESS` |
| 113 | + macro |
| 114 | +- `sycl-post-link` tool is run with a new option |
| 115 | + `-esimd-force-stateless-mem-access`. Under this option, the tool |
| 116 | + configures the LowerESIMD.cpp pass to ignore the `kernel_arg_accessor_ptr` |
| 117 | + and always generate `svmptr_t` annotation for memory arguments. |
| 118 | + |
| 119 | +##### Variant 2 |
| 120 | +Clang C++ FE is changed to generate desired `kernel_arg_accessor_ptr` |
| 121 | +metadata depending on `ESIMD_FORCE_STATELESS_MEM_ACCESS` macro setting. If |
| 122 | +set, it will mark all memory arguments as pointers in |
| 123 | +`kernel_arg_accessor_ptr` and `kernel_arg_type` MD nodes. |
| 124 | + |
| 125 | + |
| 126 | +##### Variant 3 (no go) |
| 127 | +Definition of `SYCL_ESIMD_KERNEL` is changed depending on presence of |
| 128 | +`ESIMD_FORCE_STATELESS_MEM_ACCESS`: |
| 129 | + |
| 130 | +```cpp |
| 131 | +#ifdef ESIMD_FORCE_STATELESS_MEM_ACCESS |
| 132 | +#define SYCL_ESIMD_KERNEL __attribute__((sycl_explicit_simd)) __attribute__((sycl_explicit_simd_force_stateless)) |
| 133 | +#else |
| 134 | +#define SYCL_ESIMD_KERNEL __attribute__((sycl_explicit_simd)) |
| 135 | +#endif |
| 136 | +``` |
| 137 | +Then LowerESIMD lowers parameter annotation depending on |
| 138 | +`sycl_explicit_simd_force_stateless` attribute presence. |
| 139 | +The drawback is that is allowed to use `[[intel::sycl_explicit_simd]]` w/o |
| 140 | +`SYCL_ESIMD_KERNEL` |
| 141 | + |
| 142 | +#### ESIMD Verifier changes |
| 143 | + |
| 144 | +All the compiler variants require that accessor::get_pointer() can be used in |
| 145 | +the device code. `ESIMDVerifier.cpp` needs to additionally allow the following |
| 146 | +regexps: |
| 147 | +``` |
| 148 | + "^cl::sycl::accessor<.+>::getPointerAdjusted", |
| 149 | + "^cl::sycl::accessor<.+>::getQualifiedPtr", |
| 150 | + "^cl::sycl::accessor<.+>::get_pointer", |
| 151 | + "^cl::sycl::multi_ptr<.+>::.+" |
| 152 | +``` |
| 153 | +But only if it is run in "force-stateless" mode. |
| 154 | + |
| 155 | +### Long-term |
| 156 | + |
| 157 | +Long term solution would be replacing the |
| 158 | +``` |
| 159 | + #ifdef ESIMD_FORCE_STATELESS_MEM_ACCESS |
| 160 | +``` |
| 161 | +with |
| 162 | +``` |
| 163 | +if_device_has(platform_requires_stateless_access) |
| 164 | +``` |
| 165 | +and removing all the changes in other components. |
| 166 | +Plus VC BE need to be taught to generate correct pointer parameter annotation |
| 167 | +not relying on the middle-end providing it. |
0 commit comments