Skip to content

Commit cb4e702

Browse files
authored
[SYCL][Doc] Add auto_local_range extension (#5950)
This extension defines a new auto_range variable which can be used to define a sycl::nd_range with an unspecified work-group size. If such a sycl::nd_range object is used to launch a SYCL kernel, an implementation is free to launch the kernel with any valid work-group size (as defined by the SYCL specification). Signed-off-by: John Pennycook <[email protected]>
1 parent 7fe81de commit cb4e702

File tree

1 file changed

+204
-0
lines changed

1 file changed

+204
-0
lines changed
Lines changed: 204 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,204 @@
1+
= sycl_ext_oneapi_auto_local_range
2+
3+
:source-highlighter: coderay
4+
:coderay-linenums-mode: table
5+
6+
// This section needs to be after the document title.
7+
:doctype: book
8+
:toc2:
9+
:toc: left
10+
:encoding: utf-8
11+
:lang: en
12+
:dpcpp: pass:[DPC++]
13+
14+
// Set the default source code type in this document to C++,
15+
// for syntax highlighting purposes. This is needed because
16+
// docbook uses c++ and html5 uses cpp.
17+
:language: {basebackend@docbook:c++:cpp}
18+
19+
20+
== Notice
21+
22+
[%hardbreaks]
23+
Copyright (C) 2022-2022 Intel Corporation. All rights reserved.
24+
25+
Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks
26+
of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by
27+
permission by Khronos.
28+
29+
30+
== Contact
31+
32+
To report problems with this extension, please open a new issue at:
33+
34+
https://github.com/intel/llvm/issues
35+
36+
37+
== Dependencies
38+
39+
This extension is written against the SYCL 2020 revision 4 specification. All
40+
references below to the "core SYCL specification" or to section numbers in the
41+
SYCL specification refer to that revision.
42+
43+
44+
== Status
45+
46+
This is a proposed extension specification, intended to gather community
47+
feedback. Interfaces defined in this specification may not be implemented yet
48+
or may be in a preliminary state. The specification itself may also change in
49+
incompatible ways before it is finalized. *Shipping software products should
50+
not rely on APIs defined in this specification.*
51+
52+
53+
== Overview
54+
55+
SYCL provides a basic form of `parallel_for` that allows developers to specify
56+
the total number of work-items to launch (i.e. the size of the global range)
57+
without also specifying the number of work-items per work-group (i.e. the size
58+
of the local range). However, this form of `parallel_for` does not provide
59+
access to the `sycl::group` or `sycl::sub_group` classes; if a developer wants
60+
to use these classes, a kernel must be launched with a `sycl::nd_range` that
61+
specifies both the global and local sizes.
62+
63+
The only way to allow an implementation to choose the local work-group size
64+
with the ND-range form of `parallel_for` is to use kernel queries, following
65+
an approach like the one shown below:
66+
67+
[source, c++]
68+
----
69+
auto bundle = sycl::get_kernel_bundle(q.get_context());
70+
auto kernel = bundle.get_kernel<class KernelName>();
71+
auto multiple = kernel.get_info<sycl::info::kernel_device_specific::preferred_work_group_size_multiple>(q.get_device());
72+
auto max = kernel.get_info<sycl::info::kernel_device_specific::work_group_size>(q.get_device());
73+
sycl::range<1> local;
74+
if (N % multiple == 0) {
75+
// Use largest work-group size compatible with preferred multiple
76+
local = static_cast<size_t>(max / multiple) * multiple;
77+
}
78+
else {
79+
local = /* find largest work-group size smaller than max that divides N */;
80+
}
81+
q.parallel_for<class KernelName>(sycl::nd_range<1>{N, local}, [=](sycl::nd_item<1>) {
82+
/* kernel body */
83+
});
84+
----
85+
86+
The example above is very verbose, and requires developers to learn about
87+
kernel naming, `sycl::kernel_bundle`, `sycl::kernel` and a number of
88+
device/kernel queries. Extending the logic to support two- and
89+
three-dimensional kernels complicates things further. With this extension, the
90+
example simplifies to:
91+
92+
[source, c++]
93+
----
94+
q.parallel_for(sycl::nd_range<1>{N, sycl::ext::oneapi::experimental::auto_range<1>}, [=](sycl::nd_item<1>) {
95+
/* kernel body */
96+
});
97+
----
98+
99+
The SYCL 2020 specification recommends that extensions should not alter
100+
existing constructors without ensuring that one of the parameters comes from
101+
the vendor's extension namespace. This restriction prevents this extension from
102+
defaulting the second argument of the `sycl::nd_range` constructor to
103+
`auto_range`, which would make the example above even simpler:
104+
105+
[source, c++]
106+
----
107+
q.parallel_for(sycl::nd_range<1>{N}, [=](sycl::nd_item<1>) {
108+
/* kernel body */
109+
});
110+
----
111+
112+
If this extension is proposed for inclusion in a future SYCL standard, altering
113+
the definition of `sycl::nd_range` should be considered.
114+
115+
116+
== Specification
117+
118+
=== Feature test macro
119+
120+
This extension provides a feature-test macro as described in the core SYCL
121+
specification. An implementation supporting this extension must predefine the
122+
macro `SYCL_EXT_ONEAPI_AUTO_LOCAL_RANGE` to one of the values defined in the
123+
table below. Applications can test for the existence of this macro to
124+
determine if the implementation supports this feature, or applications can test
125+
the macro's value to determine which of the extension's features the
126+
implementation supports.
127+
128+
[%header,cols="1,5"]
129+
|===
130+
|Value
131+
|Description
132+
133+
|1
134+
|Initial version of this extension.
135+
|===
136+
137+
138+
=== `auto_range`
139+
140+
This extension defines a new `sycl::ext::oneapi::experimental::auto_range`
141+
variable which can be used to define a `sycl::nd_range` with an unspecified
142+
work-group size. If such a `sycl::nd_range` object is used to launch a SYCL
143+
kernel, an implementation is free to launch the kernel with any valid
144+
work-group size (as defined by the SYCL specification).
145+
146+
The manner in which a work-group size is selected is implementation-defined.
147+
However, the total number of work-items launched by the kernel must match the
148+
number specified as the first argument to the `sycl::nd_range` constructor;
149+
an implementation may not adjust the size of the global range.
150+
151+
NOTE: Developers must take care to avoid awkward global range sizes when using
152+
an `auto_range`, to avoid performance issues. Since implementations must still
153+
choose a work-group size that divides the total number of work-items, the
154+
implementation may be forced to choose a sub-optimal work-group size (e.g. if
155+
the total number of work-items is a prime number, the work-group size must be
156+
1). Although this division requirement holds for all SYCL kernels, it is easier
157+
to miss when using an `auto_range`.
158+
159+
160+
[source, c++]
161+
----
162+
namespace sycl {
163+
namespace ext {
164+
namespace oneapi {
165+
namespace experimental {
166+
167+
template <int Dimensions>
168+
static const inline range<Dimensions> auto_range = /* implementation-defined */;
169+
170+
}
171+
}
172+
}
173+
}
174+
----
175+
176+
NOTE: The `auto_range` variable is not `constexpr` because `sycl::range` is not
177+
required to be a literal type by SYCL 2020. If this changes in the future, this
178+
extension will be updated.
179+
180+
181+
== Implementation notes
182+
183+
This non-normative section provides information about one possible
184+
implementation of this extension. It is not part of the specification of the
185+
extension's API.
186+
187+
The value of `auto_range` is implementation-defined to maximize freedom for
188+
implementations. If an implementation wants to use a reserved value to
189+
represent a request for an automatic local range, a range with every element
190+
set to 0 has no valid interpretation in SYCL 2020 and may be a logical choice.
191+
A trivial implementation of this extension can use a range with every element
192+
set to 1, since this is always a valid local range.
193+
194+
Implementations using OpenCL backends can build on existing OpenCL
195+
functionality where `clEnqueueNDRangeKernel` is called with a `local_work_size`
196+
value of `NULL`.
197+
198+
Implementations using other backends (e.g. Level Zero, CUDA) can use a
199+
combination of device and kernel queries to determine a good work-group size.
200+
201+
202+
== Issues
203+
204+
None.

0 commit comments

Comments
 (0)