|
| 1 | += sycl_ext_oneapi_named_sub_group_sizes |
| 2 | + |
| 3 | +:source-highlighter: coderay |
| 4 | +:coderay-linenums-mode: table |
| 5 | + |
| 6 | +// This section needs to be after the document title. |
| 7 | +:doctype: book |
| 8 | +:toc2: |
| 9 | +:toc: left |
| 10 | +:encoding: utf-8 |
| 11 | +:lang: en |
| 12 | +:dpcpp: pass:[DPC++] |
| 13 | + |
| 14 | +// Set the default source code type in this document to C++, |
| 15 | +// for syntax highlighting purposes. This is needed because |
| 16 | +// docbook uses c++ and html5 uses cpp. |
| 17 | +:language: {basebackend@docbook:c++:cpp} |
| 18 | + |
| 19 | + |
| 20 | +== Notice |
| 21 | + |
| 22 | +[%hardbreaks] |
| 23 | +Copyright (C) 2019-2022 Intel Corporation. All rights reserved. |
| 24 | + |
| 25 | +Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks |
| 26 | +of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by |
| 27 | +permission by Khronos. |
| 28 | + |
| 29 | + |
| 30 | +== Contact |
| 31 | + |
| 32 | +To report problems with this extension, please open a new issue at: |
| 33 | + |
| 34 | +https://github.com/intel/llvm/issues |
| 35 | + |
| 36 | + |
| 37 | +== Dependencies |
| 38 | + |
| 39 | +This extension is written against the SYCL 2020 revision 4 specification. All |
| 40 | +references below to the "core SYCL specification" or to section numbers in the |
| 41 | +SYCL specification refer to that revision. |
| 42 | + |
| 43 | +This extension also depends on the following other SYCL extensions: |
| 44 | + |
| 45 | +* link:../experimental/sycl_ext_oneapi_properties.asciidoc[ |
| 46 | + sycl_ext_oneapi_properties] |
| 47 | + |
| 48 | +* link:../proposed/sycl_ext_oneapi_kernel_properties.asciidoc[ |
| 49 | + sycl_ext_oneapi_kernel_properties] |
| 50 | + |
| 51 | + |
| 52 | +== Status |
| 53 | + |
| 54 | +This is a proposed extension specification, intended to gather community |
| 55 | +feedback. Interfaces defined in this specification may not be implemented yet |
| 56 | +or may be in a preliminary state. The specification itself may also change in |
| 57 | +incompatible ways before it is finalized. Shipping software products should not |
| 58 | +rely on APIs defined in this specification. |
| 59 | + |
| 60 | + |
| 61 | +== Overview |
| 62 | + |
| 63 | +SYCL provides a mechanism to set a required sub-group size for a kernel via |
| 64 | +an attribute, and the sycl_ext_oneapi_kernel_properties extension provides an |
| 65 | +equivalent property. |
| 66 | + |
| 67 | +Either mechanism is sufficient when tuning individual kernels for specific |
| 68 | +devices, but their usage quickly becomes complicated in real-life scenarios |
| 69 | +because: |
| 70 | + |
| 71 | +1. An integral sub-group size must be provided at host compile-time. |
| 72 | + |
| 73 | +2. The sub-group sizes supported by a device are not known until run-time. |
| 74 | + |
| 75 | +3. It is common for the same sub-group size to be used for all kernels |
| 76 | + (e.g. because the sub-group size is reflected in data structures). |
| 77 | + |
| 78 | +Applications wishing to write portable sub-group code that can target multiple |
| 79 | +architectures must therefore multi-version their C++ code (e.g. via templates), |
| 80 | +dispatch to the correct kernel(s) based on the result of a run-time query, and |
| 81 | +repeat this process for every kernel individually. |
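
The multi-versioning burden described above can be pictured with a minimal sketch in plain C++. It does not use the SYCL headers: `run_kernel`, `dispatch` and the candidate sizes 32, 16 and 8 are hypothetical names and values chosen for illustration, and the vector stands in for the result of a run-time device query.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a kernel compiled for one fixed sub-group size.
// In real SYCL code, S would be passed to the required sub-group size
// attribute or property; here it is only returned so the dispatch is visible.
template <std::uint32_t S>
std::uint32_t run_kernel() {
  return S;
}

// Selects an instantiation from the sizes a device reports at run time.
// The candidate sizes are an assumption of this sketch, and every kernel
// in the application would need its own copy of this switch.
inline std::uint32_t dispatch(const std::vector<std::uint32_t>& supported) {
  for (std::uint32_t s : supported) {
    switch (s) {
      case 32: return run_kernel<32>();
      case 16: return run_kernel<16>();
      case 8:  return run_kernel<8>();
      default: break;  // no variant compiled for this size, try the next one
    }
  }
  throw std::runtime_error("no compiled variant matches the device");
}
```

Calling `dispatch({16, 8})` runs the variant for the first reported size that was compiled into the binary. Named sub-group sizes are intended to make this kind of hand-written dispatch unnecessary.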
| 82 | + |
| 83 | +This extension aims to simplify the process of using sub-groups by introducing |
| 84 | +the notion of _named_ sub-group sizes, allowing developers to request a |
| 85 | +sub-group size that meets certain requirements at host compile-time and |
| 86 | +deferring the selection of a specific sub-group size until the kernel is |
| 87 | +compiled for a specific device. |
| 88 | + |
| 89 | +This extension also defines the default behavior of sub-groups in SYCL code to |
| 90 | +improve the out-of-the-box experience for new developers, without preventing |
| 91 | +experts and existing developers from requesting the existing compiler behavior. |
| 92 | + |
| 93 | + |
| 94 | +== Specification |
| 95 | + |
| 96 | +=== Feature test macro |
| 97 | + |
| 98 | +This extension provides a feature-test macro as described in the core SYCL |
| 99 | +specification. An implementation supporting this extension must predefine the |
| 100 | +macro `SYCL_EXT_ONEAPI_NAMED_SUB_GROUP_SIZES` to one of the values defined in the |
| 101 | +table below. Applications can test for the existence of this macro to |
| 102 | +determine if the implementation supports this feature, or applications can test |
| 103 | +the macro's value to determine which of the extension's features the |
| 104 | +implementation supports. |
| 105 | + |
| 106 | +[%header,cols="1,5"] |
| 107 | +|=== |
| 108 | +|Value |
| 109 | +|Description |
| 110 | + |
| 111 | +|1 |
| 112 | +|The APIs of this experimental extension are not versioned, so the |
| 113 | + feature-test macro always has this value. |
| 114 | +|=== |
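
As an illustration of testing the macro, the sketch below wraps the check in a helper. The helper name `named_sub_group_sizes_version` is hypothetical, and the value 0 for "not supported" is a convention of this sketch only. In SYCL code the check would normally come after including the SYCL header.

```cpp
// Returns the value of the feature-test macro, or 0 when the macro is not
// defined. The function name and the 0 fallback are assumptions of this
// sketch, not part of the extension.
constexpr int named_sub_group_sizes_version() {
#ifdef SYCL_EXT_ONEAPI_NAMED_SUB_GROUP_SIZES
  return SYCL_EXT_ONEAPI_NAMED_SUB_GROUP_SIZES;
#else
  return 0;
#endif
}

// Code that depends on the extension can then be guarded at compile time.
constexpr bool has_named_sub_group_sizes =
    named_sub_group_sizes_version() >= 1;
```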
| 115 | + |
| 116 | + |
| 117 | +=== Changes to sub-group behavior |
| 118 | + |
| 119 | +Much of the behavior related to sub-groups in SYCL 2020 is |
| 120 | +implementation-defined. Different kernels may use different sub-group sizes, |
| 121 | +and even the same kernel may use different sub-group sizes on some devices (e.g. for |
| 122 | +different ND-range launch configurations). |
| 123 | + |
| 124 | +The extension introduces simpler behavior for sub-groups: |
| 125 | + |
| 126 | +- If no sub-group size property appears on a kernel or `SYCL_EXTERNAL` |
| 127 | + function, the default behavior of an implementation must be to compile and |
| 128 | + execute the kernel or function using a device's _primary_ sub-group size. The |
| 129 | + primary sub-group size must be compatible with all core language features. |
| 130 | + |
| 131 | +- If a developer does not require a stable sub-group size across all kernels |
| 132 | + and kernel launches, they can explicitly request an _automatic_ sub-group |
| 133 | + size chosen by the implementation. |
| 134 | + |
| 135 | +- Implementations are free to provide mechanisms that override the default |
| 136 | + sub-group behavior (e.g. via compiler flags), but developers must use such a |
| 137 | + mechanism explicitly in order to opt in to any change in behavior. |
| 138 | + |
| 139 | + |
| 140 | +=== Device queries |
| 141 | + |
| 142 | +A new `info::device::primary_sub_group_size` device query is introduced to |
| 143 | +query a device's primary sub-group size. |
| 144 | + |
| 145 | +[%header,cols="1,5,5"] |
| 146 | +|=== |
| 147 | +|Device Descriptor |
| 148 | +|Return Type |
| 149 | +|Description |
| 150 | + |
| 151 | +|`info::device::primary_sub_group_size` |
| 152 | +|`uint32_t` |
| 153 | +|Return a sub-group size supported by this device that is guaranteed to support |
| 154 | + all core language features for the device. |
| 155 | +|=== |
| 156 | + |
| 157 | + |
| 158 | +=== Properties |
| 159 | + |
| 160 | +```c++ |
| 161 | +namespace sycl { |
| 162 | +namespace ext { |
| 163 | +namespace oneapi { |
| 164 | +namespace experimental { |
| 165 | + |
| 166 | +struct named_sub_group_size { |
| 167 | + static constexpr uint32_t primary = /* unspecified */; |
| 168 | + static constexpr uint32_t automatic = /* unspecified */; |
| 169 | +}; |
| 170 | + |
| 171 | +inline constexpr sub_group_size_key::value_t<named_sub_group_size::primary> sub_group_size_primary; |
| 172 | + |
| 173 | +inline constexpr sub_group_size_key::value_t<named_sub_group_size::automatic> sub_group_size_automatic; |
| 174 | + |
| 175 | +} // namespace experimental |
| 176 | +} // namespace oneapi |
| 177 | +} // namespace ext |
| 178 | +} // namespace sycl |
| 179 | +``` |
| 180 | + |
| 181 | +NOTE: The named sub-group size properties are deliberately designed to reuse as |
| 182 | +much of the existing `sub_group_size` property infrastructure as possible. |
| 183 | +Implementations are free to choose the integral value associated with each |
| 184 | +named sub-group size, but it is expected that many implementations will use |
| 185 | +values like 0 (which is otherwise not a meaningful sub-group size) or -1 |
| 186 | +(which would otherwise correspond to a sub-group size so large it is unlikely |
| 187 | +any device would support it). |
| 188 | + |
| 189 | +|=== |
| 190 | +|Property|Description |
| 191 | + |
| 192 | +|`sub_group_size_primary` |
| 193 | +|The `sub_group_size_primary` property adds the requirement that the kernel |
| 194 | + must be compiled and executed with the primary sub-group size of the device to |
| 195 | + which the kernel is submitted (as reported by the |
| 196 | + `info::device::primary_sub_group_size` query). |
| 197 | + |
| 198 | +|`sub_group_size_automatic` |
| 199 | +|The `sub_group_size_automatic` property adds the requirement that the kernel |
| 200 | + can be compiled and executed with any of the valid sub-group sizes associated |
| 201 | + with the device to which the kernel is submitted (as reported by the |
| 202 | + `info::device::sub_group_sizes` query). The manner in which the sub-group size |
| 203 | + is selected is implementation-defined. |
| 204 | + |
| 205 | +|=== |
| 206 | + |
| 207 | +At most one of the `sub_group_size`, `sub_group_size_primary` and |
| 208 | +`sub_group_size_automatic` properties may be associated with a kernel or |
| 209 | +device function. |
| 210 | + |
| 211 | +NOTE: No special handling is required to detect this case, since |
| 212 | +`sub_group_size_primary` and `sub_group_size_automatic` are simply named |
| 213 | +shorthands for properties associated with `sub_group_size_key`. |
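
The following mock, written in plain C++17 without the SYCL headers, sketches why no special handling is needed. The `key_t` alias, the `same_key` helper and the reserved values 0 and -1 are assumptions made for illustration; the extension leaves the actual values unspecified, and the real property machinery is defined by the sycl_ext_oneapi_properties extension.

```cpp
#include <cstdint>
#include <type_traits>

// Mock of the property key, NOT the real SYCL definition.
struct sub_group_size_key {
  template <std::uint32_t Size>
  struct value_t {
    using key_t = sub_group_size_key;  // hypothetical alias for this sketch
    static constexpr std::uint32_t value = Size;
  };
};

// Assumed reserved values; the extension leaves them unspecified.
struct named_sub_group_size {
  static constexpr std::uint32_t primary = 0;
  static constexpr std::uint32_t automatic = static_cast<std::uint32_t>(-1);
};

template <std::uint32_t Size>
inline constexpr sub_group_size_key::value_t<Size> sub_group_size{};
inline constexpr sub_group_size_key::value_t<named_sub_group_size::primary>
    sub_group_size_primary{};
inline constexpr sub_group_size_key::value_t<named_sub_group_size::automatic>
    sub_group_size_automatic{};

// True when two property values share a key, which means they cannot both
// be associated with the same kernel or device function.
template <typename A, typename B>
constexpr bool same_key(A, B) {
  return std::is_same_v<typename A::key_t, typename B::key_t>;
}

static_assert(same_key(sub_group_size<16>, sub_group_size_primary));
static_assert(same_key(sub_group_size_primary, sub_group_size_automatic));
```

Because all three spellings produce values of the same key, whatever duplicate-key check a property list already performs would reject a combination of them.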
| 214 | + |
| 215 | +There are special requirements whenever a device function defined in one |
| 216 | +translation unit makes a call to a device function that is defined in a second |
| 217 | +translation unit. In such a case, the second device function is always declared |
| 218 | +using `SYCL_EXTERNAL`. If the kernel calling these device functions is defined |
| 219 | +using a sub-group size property, the functions declared using `SYCL_EXTERNAL` |
| 220 | +must be similarly decorated to ensure that the same sub-group size is used. |
| 221 | +This decoration must exist in both the translation unit making the call and |
| 222 | +also in the translation unit that defines the function. If the sub-group size |
| 223 | +property is missing in the translation unit that makes the call, or if the |
| 224 | +sub-group size of the called function does not match the sub-group size of the |
| 225 | +calling function, the program is ill-formed and the compiler must raise a |
| 226 | +diagnostic. |
| 227 | + |
| 228 | +Note that a compiler may choose a different sub-group size for each kernel and |
| 229 | +`SYCL_EXTERNAL` function using an automatic sub-group size. If kernels with an |
| 230 | +automatic sub-group size call `SYCL_EXTERNAL` functions using an automatic |
| 231 | +sub-group size, the program may be ill-formed. The behavior when |
| 232 | +`SYCL_EXTERNAL` is used in conjunction with an automatic sub-group size is |
| 233 | +implementation-defined, and code relying on specific behavior should not be |
| 234 | +expected to be portable across implementations. If a kernel calls a |
| 235 | +`SYCL_EXTERNAL` function with an incompatible sub-group size, the compiler must |
| 236 | +raise a diagnostic -- it is expected that this diagnostic will be raised during |
| 237 | +link-time, since this is the first time the compiler will see both translation |
| 238 | +units together. |
| 239 | + |
| 240 | + |
| 241 | +=== DPC++ compiler flags |
| 242 | + |
| 243 | +This non-normative section describes command line flags that the DPC++ compiler |
| 244 | +supports. Other compilers are free to provide their own command line flags (if |
| 245 | +any). |
| 246 | + |
| 247 | +The `-fsycl-default-sub-group-size` flag controls the default sub-group size |
| 248 | +used within a translation unit, which applies to all kernels and |
| 249 | +`SYCL_EXTERNAL` functions without an explicitly specified sub-group size. |
| 250 | + |
| 251 | +If the argument passed to `-fsycl-default-sub-group-size` is an integer `S`, |
| 252 | +all kernels and `SYCL_EXTERNAL` functions without an explicitly specified |
| 253 | +sub-group size are compiled as if `sub_group_size<S>` was specified as a |
| 254 | +property of that kernel or function. |
| 255 | + |
| 256 | +If the argument passed to `-fsycl-default-sub-group-size` is a string `NAME`, |
| 257 | +all kernels and `SYCL_EXTERNAL` functions without an explicitly specified |
| 258 | +sub-group size are compiled as if `sub_group_size_NAME` was |
| 259 | +specified as a property of that kernel or function. |
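
The two rules above amount to a simple mapping from the flag's argument to a property spelling. The sketch below models that mapping as a string transformation; it is not the DPC++ implementation, and the function name `property_for_flag_argument` is hypothetical. Validation of the argument (for example, rejecting unknown names) is not modeled.

```cpp
#include <cctype>
#include <string>

// Returns the spelling of the property that a flag argument stands for:
// an integer S maps to sub_group_size<S>, and any other string NAME maps
// to sub_group_size_NAME. Purely illustrative.
inline std::string property_for_flag_argument(const std::string& arg) {
  bool all_digits = !arg.empty();
  for (unsigned char c : arg) {
    if (!std::isdigit(c)) {
      all_digits = false;
      break;
    }
  }
  if (all_digits) {
    return "sub_group_size<" + arg + ">";
  }
  return "sub_group_size_" + arg;
}
```

Under this model, an argument of `16` corresponds to `sub_group_size<16>` and an argument of `automatic` corresponds to `sub_group_size_automatic`.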
| 260 | + |
| 261 | + |
| 262 | +== Implementation notes |
| 263 | + |
| 264 | +This non-normative section provides information about one possible |
| 265 | +implementation of this extension. It is not part of the specification of the |
| 266 | +extension's API. |
| 267 | + |
| 268 | +The existing mechanism of describing a required sub-group size in SPIR-V may |
| 269 | +need to be augmented to support named sub-group sizes. The existing sub-group |
| 270 | +size descriptors could be used with reserved values (similar to the template |
| 271 | +arguments in the properties), or new descriptors could be created for each |
| 272 | +case. |
| 273 | + |
| 274 | +Device compilers will need to be taught to interpret these named sub-group |
| 275 | +sizes as equivalent to a device-specific integral sub-group size at |
| 276 | +compile-time. |
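
One way to picture this step is a function that maps a requested value to a concrete size for a given device. The sketch below is plain C++ and makes several assumptions: the reserved encodings `kPrimary` and `kAutomatic`, the `device_info` struct, and the "largest supported size" policy for automatic selection are all hypothetical, since the extension leaves the selection implementation-defined.

```cpp
#include <cstdint>
#include <vector>

// Assumed reserved encodings for the named sizes.
constexpr std::uint32_t kPrimary = 0;
constexpr std::uint32_t kAutomatic = static_cast<std::uint32_t>(-1);

// Hypothetical summary of what a device compiler knows about its target.
struct device_info {
  std::uint32_t primary_sub_group_size;
  std::vector<std::uint32_t> sub_group_sizes;
};

// Maps a requested value to a concrete sub-group size for one device.
// Returns 0 when an explicit integral size is not supported by the device.
inline std::uint32_t resolve_sub_group_size(std::uint32_t requested,
                                            const device_info& dev) {
  if (requested == kPrimary) {
    return dev.primary_sub_group_size;
  }
  if (requested == kAutomatic) {
    // Implementation-defined in the extension; this sketch simply picks
    // the largest supported size.
    std::uint32_t best = dev.primary_sub_group_size;
    for (std::uint32_t s : dev.sub_group_sizes) {
      if (s > best) best = s;
    }
    return best;
  }
  for (std::uint32_t s : dev.sub_group_sizes) {
    if (s == requested) return requested;
  }
  return 0;
}
```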
| 277 | + |
| 278 | + |
| 279 | +== Issues |
| 280 | + |
| 281 | +. What should the sub-group size compatible with all features be called? |
| 282 | ++ |
| 283 | +-- |
| 284 | +*RESOLVED*: The name adopted is "primary", to convey that it is an integral |
| 285 | +part of sub-group support provided by the device. Other names considered are |
| 286 | +listed here for posterity: "default", "stable", "fixed", "core". These terms |
| 287 | +are easy to misunderstand (i.e. the "default" size may not be chosen by |
| 288 | +default, the "stable" size is unrelated to the software release cycle, the |
| 289 | +"fixed" sub-group size may change between devices or compiler releases, the |
| 290 | +"core" size is unrelated to hardware cores). |
| 291 | +-- |
| 292 | + |
| 293 | +. How does sub-group size interact with `SYCL_EXTERNAL` functions? The current |
| 294 | +behavior requires exact matching. Should this be relaxed to allow alternative |
| 295 | +implementations (e.g. link-time optimization, multi-versioning)? |
| 296 | ++ |
| 297 | +-- |
| 298 | +*RESOLVED*: Exact matching is required to ensure that developers can reason about |
| 299 | +the portability of their code across different implementations. Setting the |
| 300 | +default sub-group size to "primary" and providing an override flag to select |
| 301 | +"automatic" everywhere means that only advanced developers who are tuning |
| 302 | +sub-group size on a per-kernel basis will have to worry about potential |
| 303 | +matching issues. |
| 304 | +-- |