|
| 1 | +# SYCL(TM) Proposals: Reductions for ND-Range Parallelism |
| 2 | + |
| 3 | +**IMPORTANT**: The functionality introduced by this extension is deprecated in favor of the standard reduction functionality outlined in [Section 4.9.2 "Reduction variables"](https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:reduction) of the SYCL 2020 Specification, Revision 3. |
| 4 | + |
| 5 | +**NOTE**: Khronos(R) is a registered trademark and SYCL(TM) is a trademark of the Khronos Group, Inc. |
| 6 | + |
| 7 | +It is common for parallel kernels to produce a single output resulting from some combination of all inputs (e.g. the sum). Writing efficient reductions is a complex task, depending on both device and runtime characteristics: whether the value is being reduced across all parallel workers or some subset of them; whether the reduction can be accelerated by a specific scope of memory (e.g. work-group local memory); and whether the data type and reduction operator can be implemented with fast hardware atomics. Providing an abstraction for reductions in SYCL would greatly improve programmer productivity. |
| 8 | + |
| 9 | +This proposal focuses on introducing reductions to the ND-range version of `parallel_for`, using syntax that is roughly aligned with OpenMP and C++ [for_loop](https://wg21.link/p0075). Reductions within hierarchical parallelism kernels and hints describing desired properties of reductions are left to a future iteration of the proposal. |
| 10 | + |
| 11 | +# Reduction Semantics |
| 12 | + |
| 13 | +A reduction produces a single value by _combining_ multiple values in an unspecified order, using an operator that is both _associative_ and _commutative_ (e.g. addition). Only the _final_ value resulting from a reduction is of interest to the programmer. |
| 14 | + |
| 15 | +It should also be noted that reductions are not limited to scalar values: the behavior of reductions is well-defined for structs and even containers, given appropriate reduction operators. |
| 16 | + |
| 17 | +# `reduction` Objects |
| 18 | + |
| 19 | +```c++ |
| 20 | +namespace sycl { |
| 21 | +namespace ext { |
| 22 | +namespace oneapi { |
| 23 | +template <class T, class BinaryOperation> |
| 24 | +unspecified reduction(accessor<T>& var, BinaryOperation combiner); |
| 25 | + |
| 26 | +template <class T, class BinaryOperation> |
| 27 | +unspecified reduction(accessor<T>& var, const T& identity, BinaryOperation combiner); |
| 28 | + |
| 29 | +template <class T, class BinaryOperation> |
| 30 | +unspecified reduction(T* var, BinaryOperation combiner); |
| 31 | + |
| 32 | +template <class T, class BinaryOperation> |
| 33 | +unspecified reduction(T* var, const T& identity, BinaryOperation combiner); |
| 34 | + |
| 35 | +template <class T, class Extent, class BinaryOperation> |
| 36 | +unspecified reduction(span<T, Extent> var, BinaryOperation combiner); |
| 37 | + |
| 38 | +template <class T, class Extent, class BinaryOperation> |
| 39 | +unspecified reduction(span<T, Extent> var, const T& identity, BinaryOperation combiner); |
| 40 | +} |
| 41 | +} |
| 42 | +} |
| 43 | +``` |
| 44 | +
|
| 45 | +The exact behavior of a reduction is specific to an implementation; the only interface exposed to the user is the set of functions above, which construct an unspecified `reduction` object encapsulating the reduction variable, an optional operator identity and the reduction operator. For user-defined binary operations, an implementation should issue a compile-time warning if an identity is not specified and this is known to negatively impact performance (e.g. as a result of the implementation choosing a different reduction algorithm). For standard binary operations (e.g. `std::plus`) on arithmetic types, the implementation must determine the correct identity automatically in order to avoid performance penalties. |
| 46 | +
|
| 47 | +If an implementation can identify the identity value for a given combination of accumulator type `AccumulatorT` and function object type `BinaryOperation`, the value is defined as a member of the `known_identity` trait class: |
| 48 | +```c++ |
| 49 | +namespace sycl { |
| 50 | +namespace ext { |
| 51 | +namespace oneapi { |
| 52 | +template <typename BinaryOperation, typename AccumulatorT> |
| 53 | +struct known_identity { |
| 54 | + static constexpr AccumulatorT value; |
| 55 | +}; |
| 56 | +
|
| 57 | +// Available if C++17 |
| 58 | +template <typename BinaryOperation, typename AccumulatorT> |
| 59 | +inline constexpr AccumulatorT known_identity_v = known_identity<BinaryOperation, AccumulatorT>::value; |
| 60 | +} |
| 61 | +} |
| 62 | +} |
| 63 | +``` |
| 64 | + |
| 65 | +Whether `known_identity<BinaryOperation, AccumulatorT>::value` exists can be tested using the `has_known_identity` trait class: |
| 66 | + |
| 67 | +```c++ |
| 68 | +namespace sycl { |
| 69 | +namespace ext { |
| 70 | +namespace oneapi { |
| 71 | +template <typename BinaryOperation, typename AccumulatorT> |
| 72 | +struct has_known_identity { |
| 73 | + static constexpr bool value; |
| 74 | +}; |
| 75 | + |
| 76 | +// Available if C++17 |
| 77 | +template <typename BinaryOperation, typename AccumulatorT> |
| 78 | +inline constexpr bool has_known_identity_v = has_known_identity<BinaryOperation, AccumulatorT>::value; |
| 79 | +} |
| 80 | +} |
| 81 | +} |
| 82 | +``` |
| 83 | +
|
| 84 | +The dimensionality of the `accessor` passed to the `reduction` function specifies the dimensionality of the reduction variable: a 0-dimensional `accessor` represents a scalar reduction, and any other dimensionality represents an array reduction. Specifying an array reduction of size N is functionally equivalent to specifying N independent scalar reductions. The access mode of the accessor determines whether the reduction variable's original value is included in the reduction (i.e. for `access::mode::read_write` it is included, and for `access::mode::discard_write` it is not). Multiple reductions aliasing the same output results in undefined behavior. |
| 85 | +
|
| 86 | +`T` must be trivially copyable, permitting an implementation to (optionally) use atomic operations to implement the reduction. This restriction is aligned with `std::atomic<T>` and `std::atomic_ref<T>`. |
| 87 | +
|
| 88 | +# `reducer` Objects |
| 89 | +
|
| 90 | +```c++ |
| 91 | +namespace sycl { |
| 92 | +namespace ext { |
| 93 | +namespace oneapi { |
| 94 | +// Exposition only |
| 95 | +template <class T, class BinaryOperation, int Dimensions, /* implementation-defined */> |
| 96 | +class reducer |
| 97 | +{ |
| 98 | + // forbid reducer objects from being copied |
| 99 | + reducer(const reducer<T,BinaryOperation,Dimensions>&) = delete; |
| 100 | + reducer<T,BinaryOperation,Dimensions>& operator(const reducer<T,BinaryOperation,Dimensions>&) = delete; |
| 101 | +
|
| 102 | + // combine partial result with reducer |
| 103 | + // only available if Dimensions == 0 |
| 104 | + void combine(const T& partial); |
| 105 | +
|
| 106 | + // only available if Dimensions > 1 |
| 107 | + unspecified &operator[](size_t index) const; |
| 108 | +
|
| 109 | + // get identity of the associated reduction (if known) |
| 110 | + T identity() const; |
| 111 | +}; |
| 112 | +
|
| 113 | +// other operators should be made available for standard functors |
| 114 | +template <typename T> auto& operator+=(reducer<T,std::plus<T>,0>&, const T&); |
| 115 | +} |
| 116 | +} |
| 117 | +} |
| 118 | +``` |
| 119 | + |
| 120 | +The `reducer` class is not user-constructible, and can only be constructed by an implementation given a `reduction` object. The `combine` function uses the specified `BinaryOperation` to combine the `partial` result with the value held (or referenced) by an instance of `reducer`, and is the only way to update the reducer value for user-supplied combination functions. Other convenience operators should be defined for standard combination functions (e.g. `+=` for `std::plus`). |
| 121 | + |
| 122 | +To enable compile-time specialization of reduction algorithms, an implementation may define additional template arguments to the `reducer` class. The `reducer` type for a given reduction can be inspected using `decltype(reduction(var, identity, combiner))::reducer_type`. |
| 123 | + |
| 124 | +# Adding `reduction` Objects to `parallel_for` |
| 125 | + |
| 126 | +```c++ |
| 127 | +template <typename KernelName, typename KernelType, int dimensions, typename... Rest> |
| 128 | +void parallel_for(range<dimensions>numWorkItems, Rest&&... rest); |
| 129 | +``` |
| 130 | +
|
| 131 | +The `rest` parameter pack consists of 0 or more `reduction` objects followed by the kernel functor. For each `reduction` object operating on values of type `T`, the kernel functor should take an additional parameter of type `reducer<T, BinaryOperation>&`. For convenience and to avoid supplying the same information twice, it is expected that developers using C++14 will typically make use of `auto&` in place of specifying the reducer type. |
| 132 | +
|
| 133 | +The implementation must guarantee that it is safe for each concurrently executing work-item to call the `combine` function of a reducer in parallel. An implementation is free to re-use reducer variables (e.g. across work-groups scheduled to the same compute unit) if it can guarantee that it is safe to do so. |
| 134 | +
|
| 135 | +The combination order of different reducers is unspecified, as are when and how the value of each reducer is combined with the original variable. The value of the original variable at any point during execution of the kernel is undefined, and the final value is only visible after the kernel completes. |
| 136 | +
|
| 137 | +## Example |
| 138 | +```c++ |
| 139 | +// Compute a dot-product by reducing all computed values using standard plus functor |
| 140 | +queue.submit([&](handler& cgh) |
| 141 | +{ |
| 142 | + auto a = a_buf.get_access<access::mode::read>(cgh); |
| 143 | + auto b = b_buf.get_access<access::mode::read>(cgh); |
| 144 | + auto sum = accessor<int,0,access::mode::write,access::target::global_buffer>(sum_buf, cgh); |
| 145 | + cgh.parallel_for<class dot_product>(nd_range<1>{N, M}, reduction(sum, 0, plus<int>()), [=](nd_item<1> it, auto& sum) |
| 146 | + { |
| 147 | + int i = it.get_global_id(0); |
| 148 | + sum += (a[i] * b[i]); |
| 149 | + }); |
| 150 | +}); |
| 151 | +``` |
| 152 | + |
| 153 | +# Reductions using USM Pointers |
| 154 | + |
| 155 | +Unlike a buffer, a [USM pointer](https://github.com/intel/llvm/tree/sycl/sycl/doc/extensions/USM) does not carry information describing the extent of the memory it points to; there is no way to distinguish between a scalar in device memory and an array. This proposal assumes that the majority of reductions are scalar, and that a pointer passed to a reduction should therefore always be interpreted as a reduction of a single element. The user must explicitly request an array reduction by passing a `span` denoting the memory region to include in the reduction. |
| 156 | + |
| 157 | +## Example |
| 158 | + |
| 159 | +```c++ |
| 160 | +// Treat an input pointer as N independent reductions |
| 161 | +int* out = static_cast<int*>(sycl_malloc<alloc::shared>(4 * sizeof(int))); |
| 162 | +queue.submit([&](handler& cgh) |
| 163 | +{ |
| 164 | + cgh.parallel_for<class sum>(nd_range<1>{N, M}, reduction(span(out, 4), 0, plus<int>()), [=](nd_item<1> it, auto& out) |
| 165 | + { |
| 166 | + int i = it.get_global_id(0); |
| 167 | + int j = foo(i); |
| 168 | + out[j] += in[i]; |
| 169 | + }); |
| 170 | +}); |
| 171 | +``` |
| 172 | + |
| 173 | +# Code Generation |
| 174 | + |
| 175 | +The semantics of this proposal have been carefully chosen to permit implementation freedom for different devices. Example mappings of the dot-product code above to several potential implementations are given below. This is not intended as an exhaustive list of implementations, but serves to demonstrate the flexibility of the proposal and its mapping to different hardware. |
| 176 | + |
| 177 | +## Hierarchical Reduction |
| 178 | + |
| 179 | +A simple way to implement this proposal is as a hierarchical reduction, combining results from work-items in the same work-group before combining results from different work-groups. For example, on devices with OpenCL 2.0 support this could be achieved using built-in work-group reductions, followed by an atomic update to global memory. |
| 180 | + |
| 181 | +```c++ |
| 182 | +__kernel void dot_product(__global float* a, __global float* b, __global float* sum) |
| 183 | +{ |
| 184 | + // Separate reducer per work-item is initialized to the reduction's identity value |
| 185 | + int item_partial_sum = 0; |
| 186 | + |
| 187 | + // User-provided lambda function |
| 188 | + int i = get_global_id(0); |
| 189 | + item_partial_sum += a[i] * b[i]; |
| 190 | + |
| 191 | + // Reducer values are combined within a work-group before atomically updating global value |
| 192 | + int wg_partial_sum = work_group_reduce_add(item_partial_sum); |
| 193 | + if (get_local_id(0) == 0) |
| 194 | + { |
| 195 | + atomic_add(sum, wg_partial_sum); |
| 196 | + } |
| 197 | +} |
| 198 | +``` |
| 199 | +
|
| 200 | +## Direct Atomics |
| 201 | +
|
| 202 | +For devices with very fast hardware atomics, it may be sufficient to simply update the global value atomically from each work-item. |
| 203 | +
|
| 204 | +```c++ |
| 205 | +__kernel void dot_product(__global float* a, __global float* b, __global float* sum) |
| 206 | +{ |
| 207 | + // User-provided lambda function |
| 208 | + // Each work-item directly updates the global value using (fast) hardware atomics |
| 209 | + int i = get_global_id(0); |
| 210 | + atomic_add(sum, a[i] * b[i]); |
| 211 | +} |
| 212 | +``` |
| 213 | + |
| 214 | +## Feature Test Macro |
| 215 | + |
| 216 | +This extension provides a feature-test macro as described in the core SYCL |
| 217 | +specification section 6.3.3 "Feature test macros". Therefore, an implementation |
| 218 | +supporting this extension must predefine the macro `SYCL_EXT_ONEAPI_ND_RANGE_REDUCTIONS` |
| 219 | +to one of the values defined in the table below. Applications can test for the |
| 220 | +existence of this macro to determine if the implementation supports this |
| 221 | +feature, or applications can test the macro's value to determine which of the |
| 222 | +extension's APIs the implementation supports. |
| 223 | + |
| 224 | +|Value |Description| |
| 225 | +|:---- |:---------:| |
| 226 | +|1 |Initial extension version. Base features are supported.| |
0 commit comments