|
| 1 | += sycl_ext_oneapi_auto_local_range |
| 2 | + |
| 3 | +:source-highlighter: coderay |
| 4 | +:coderay-linenums-mode: table |
| 5 | + |
| 6 | +// This section needs to be after the document title. |
| 7 | +:doctype: book |
| 8 | +:toc2: |
| 9 | +:toc: left |
| 10 | +:encoding: utf-8 |
| 11 | +:lang: en |
| 12 | +:dpcpp: pass:[DPC++] |
| 13 | + |
| 14 | +// Set the default source code type in this document to C++, |
| 15 | +// for syntax highlighting purposes. This is needed because |
| 16 | +// docbook uses c++ and html5 uses cpp. |
| 17 | +:language: {basebackend@docbook:c++:cpp} |
| 18 | + |
| 19 | + |
| 20 | +== Notice |
| 21 | + |
| 22 | +[%hardbreaks] |
| 23 | +Copyright (C) 2022-2022 Intel Corporation. All rights reserved. |
| 24 | + |
| 25 | +Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks |
| 26 | +of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by |
| 27 | +permission by Khronos. |
| 28 | + |
| 29 | + |
| 30 | +== Contact |
| 31 | + |
| 32 | +To report problems with this extension, please open a new issue at: |
| 33 | + |
| 34 | +https://github.com/intel/llvm/issues |
| 35 | + |
| 36 | + |
| 37 | +== Dependencies |
| 38 | + |
| 39 | +This extension is written against the SYCL 2020 revision 4 specification. All |
| 40 | +references below to the "core SYCL specification" or to section numbers in the |
| 41 | +SYCL specification refer to that revision. |
| 42 | + |
| 43 | + |
| 44 | +== Status |
| 45 | + |
| 46 | +This is a proposed extension specification, intended to gather community |
| 47 | +feedback. Interfaces defined in this specification may not be implemented yet |
| 48 | +or may be in a preliminary state. The specification itself may also change in |
| 49 | +incompatible ways before it is finalized. *Shipping software products should |
| 50 | +not rely on APIs defined in this specification.* |
| 51 | + |
| 52 | + |
| 53 | +== Overview |
| 54 | + |
| 55 | +SYCL provides a basic form of `parallel_for` that allows developers to specify |
| 56 | +the total number of work-items to launch (i.e. the size of the global range) |
| 57 | +without also specifying the number of work-items per work-group (i.e. the size |
| 58 | +of the local range). However, this form of `parallel_for` does not provide |
| 59 | +access to the `sycl::group` or `sycl::sub_group` classes; if a developer wants |
| 60 | +to use these classes, a kernel must be launched with a `sycl::nd_range` that |
| 61 | +specifies both the global and local sizes. |
| 62 | + |
| 63 | +The only way to allow an implementation to choose the local work-group size |
| 64 | +with the ND-range form of `parallel_for` is to use kernel queries, following |
| 65 | +an approach like the one shown below: |
| 66 | + |
| 67 | +[source, c++] |
| 68 | +---- |
| 69 | +auto bundle = sycl::get_kernel_bundle(q.get_context()); |
| 70 | +auto kernel = bundle.get_kernel<class KernelName>(); |
| 71 | +auto multiple = kernel.get_info<sycl::info::kernel_device_specific::preferred_work_group_size_multiple>(q.get_device()); |
| 72 | +auto max = kernel.get_info<sycl::info::kernel_device_specific::work_group_size>(q.get_device()); |
| 73 | +sycl::range<1> local; |
| 74 | +if (N % multiple == 0) { |
| 75 | + // Use largest work-group size compatible with preferred multiple |
| 76 | + local = static_cast<size_t>(max / multiple) * multiple; |
| 77 | +} |
| 78 | +else { |
| 79 | + local = /* find largest work-group size smaller than max that divides N */; |
| 80 | +} |
| 81 | +q.parallel_for<class KernelName>(sycl::nd_range<1>{N, local}, [=](sycl::nd_item<1>) { |
| 82 | + /* kernel body */ |
| 83 | +}); |
| 84 | +---- |
| 85 | + |
| 86 | +The example above is very verbose, and requires developers to learn about |
| 87 | +kernel naming, `sycl::kernel_bundle`, `sycl::kernel` and a number of |
| 88 | +device/kernel queries. Extending the logic to support two- and |
| 89 | +three-dimensional kernels complicates things further. With this extension, the |
| 90 | +example simplifies to: |
| 91 | + |
| 92 | +[source, c++] |
| 93 | +---- |
| 94 | +q.parallel_for(sycl::nd_range<1>{N, sycl::ext::oneapi::experimental::auto_range<1>}, [=](sycl::nd_item<1>) { |
| 95 | + /* kernel body */ |
| 96 | +}); |
| 97 | +---- |
| 98 | + |
| 99 | +The SYCL 2020 specification recommends that extensions should not alter |
| 100 | +existing constructors without ensuring that one of the parameters comes from |
| 101 | +the vendor's extension namespace. This restriction prevents this extension from |
| 102 | +defaulting the second argument of the `sycl::nd_range` constructor to |
| 103 | +`auto_range`, which would make the example above even simpler: |
| 104 | + |
| 105 | +[source, c++] |
| 106 | +---- |
| 107 | +q.parallel_for(sycl::nd_range<1>{N}, [=](sycl::nd_item<1>) { |
| 108 | + /* kernel body */ |
| 109 | +}); |
| 110 | +---- |
| 111 | + |
| 112 | +If this extension is proposed for inclusion in a future SYCL standard, altering |
| 113 | +the definition of `sycl::nd_range` should be considered. |
| 114 | + |
| 115 | + |
| 116 | +== Specification |
| 117 | + |
| 118 | +=== Feature test macro |
| 119 | + |
| 120 | +This extension provides a feature-test macro as described in the core SYCL |
| 121 | +specification. An implementation supporting this extension must predefine the |
| 122 | +macro `SYCL_EXT_ONEAPI_AUTO_LOCAL_RANGE` to one of the values defined in the |
| 123 | +table below. Applications can test for the existence of this macro to |
| 124 | +determine if the implementation supports this feature, or applications can test |
| 125 | +the macro's value to determine which of the extension's features the |
| 126 | +implementation supports. |
| 127 | + |
| 128 | +[%header,cols="1,5"] |
| 129 | +|=== |
| 130 | +|Value |
| 131 | +|Description |
| 132 | + |
| 133 | +|1 |
| 134 | +|Initial version of this extension. |
| 135 | +|=== |
| 136 | + |
| 137 | + |
| 138 | +=== `auto_range` |
| 139 | + |
| 140 | +This extension defines a new `sycl::ext::oneapi::experimental::auto_range` |
| 141 | +variable which can be used to define a `sycl::nd_range` with an unspecified |
| 142 | +work-group size. If such a `sycl::nd_range` object is used to launch a SYCL |
| 143 | +kernel, an implementation is free to launch the kernel with any valid |
| 144 | +work-group size (as defined by the SYCL specification). |
| 145 | + |
| 146 | +The manner in which a work-group size is selected is implementation-defined. |
| 147 | +However, the total number of work-items launched by the kernel must match the |
| 148 | +number specified as the first argument to the `sycl::nd_range` constructor; |
| 149 | +an implementation may not adjust the size of the global range. |
| 150 | + |
| 151 | +NOTE: Developers must take care to avoid awkward global range sizes when using |
| 152 | +an `auto_range`, to avoid performance issues. Since implementations must still |
| 153 | +choose a work-group size that divides the total number of work-items, the |
| 154 | +implementation may be forced to choose a sub-optimal work-group size (e.g. if |
| 155 | +the total number of work-items is a prime number, the work-group size must be |
| 156 | +1). Although this division requirement holds for all SYCL kernels, it is easier |
| 157 | +to miss when using an `auto_range`. |
| 158 | + |
| 159 | + |
| 160 | +[source, c++] |
| 161 | +---- |
| 162 | +namespace sycl { |
| 163 | +namespace ext { |
| 164 | +namespace oneapi { |
| 165 | +namespace experimental { |
| 166 | +
|
| 167 | +template <int Dimensions> |
| 168 | +static const inline range<Dimensions> auto_range = /* implementation-defined */; |
| 169 | +
|
| 170 | +} |
| 171 | +} |
| 172 | +} |
| 173 | +} |
| 174 | +---- |
| 175 | + |
| 176 | +NOTE: The `auto_range` variable is not `constexpr` because `sycl::range` is not |
| 177 | +required to be a literal type by SYCL 2020. If this changes in the future, this |
| 178 | +extension will be updated. |
| 179 | + |
| 180 | + |
| 181 | +== Implementation notes |
| 182 | + |
| 183 | +This non-normative section provides information about one possible |
| 184 | +implementation of this extension. It is not part of the specification of the |
| 185 | +extension's API. |
| 186 | + |
| 187 | +The value of `auto_range` is implementation-defined to maximize freedom for |
| 188 | +implementations. If an implementation wants to use a reserved value to |
| 189 | +represent a request for an automatic local range, a range with every element |
| 190 | +set to 0 has no valid interpretation in SYCL 2020 and may be a logical choice. |
| 191 | +A trivial implementation of this extension can use a range with every element |
| 192 | +set to 1, since this is always a valid local range. |
| 193 | + |
| 194 | +Implementations using OpenCL backends can build on existing OpenCL |
| 195 | +functionality where `clEnqueueNDRangeKernel` is called with a `local_work_size` |
| 196 | +value of `NULL`. |
| 197 | + |
| 198 | +Implementations using other backends (e.g. Level Zero, CUDA) can use a |
| 199 | +combination of device and kernel queries to determine a good work-group size. |
| 200 | + |
| 201 | + |
| 202 | +== Issues |
| 203 | + |
| 204 | +None. |
0 commit comments