Skip to content

Commit af7ea6d

Browse files
committed
[SYCL] Add spec for sycl_ext_intel_cslice
Add a proposed extension specification that allows partitioning a device by "cslice" (aka CCS-es).
1 parent 5d5e9f4 commit af7ea6d

File tree

1 file changed

+275
-0
lines changed

1 file changed

+275
-0
lines changed
Lines changed: 275 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,275 @@
1+
= sycl_ext_intel_cslice
2+
3+
:source-highlighter: coderay
4+
:coderay-linenums-mode: table
5+
6+
// This section needs to be after the document title.
7+
:doctype: book
8+
:toc2:
9+
:toc: left
10+
:encoding: utf-8
11+
:lang: en
12+
:dpcpp: pass:[DPC++]
13+
14+
// Set the default source code type in this document to C++,
15+
// for syntax highlighting purposes. This is needed because
16+
// docbook uses c++ and html5 uses cpp.
17+
:language: {basebackend@docbook:c++:cpp}
18+
19+
20+
== Notice
21+
22+
[%hardbreaks]
23+
Copyright (C) 2022-2022 Intel Corporation. All rights reserved.
24+
25+
Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks
26+
of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by
27+
permission by Khronos.
28+
29+
30+
== Contact
31+
32+
To report problems with this extension, please open a new issue at:
33+
34+
https://github.com/intel/llvm/issues
35+
36+
37+
== Dependencies
38+
39+
This extension is written against the SYCL 2020 revision 6 specification. All
40+
references below to the "core SYCL specification" or to section numbers in the
41+
SYCL specification refer to that revision.
42+
43+
44+
== Status
45+
46+
This is a proposed extension specification, intended to gather community
47+
feedback. Interfaces defined in this specification may not be implemented yet
48+
or may be in a preliminary state. The specification itself may also change in
49+
incompatible ways before it is finalized. *Shipping software products should
50+
not rely on APIs defined in this specification.*
51+
52+
53+
== Overview
54+
55+
:multi-CCS: https://github.com/intel/compute-runtime/blob/master/level_zero/doc/experimental_extensions/MULTI_CCS_MODES.md
56+
57+
Some Intel GPU devices can be partitioned at a granularity of "cslice" (compute
58+
slice), which is a smaller granularity than "tile". At present, the GPU device
59+
drivers don't expose this mode by default, so this form of partitioning is
60+
considered an advanced feature which most applications are not expected to use.
61+
This extension provides a way for these advanced applications to partition a
62+
device by cslice when it is enabled in the device driver.
63+
64+
Unlike "tile" partitions, a cslice partition does not have any different cache
65+
affinity from its sibling cslice partitions. Therefore, this extension does
66+
not expose this type of partitioning through
67+
`info::partition_property::partition_by_affinity_domain`. Instead, it adds a
68+
new partitioning type
69+
`info::partition_property::ext_intel_partition_by_cslice`.
70+
71+
Intel GPU devices that support this type of partitioning currently support it
72+
only at the "tile" level. Therefore, a device with multiple tiles (e.g. PVC)
73+
must first be partitioned into per-tile sub-devices via
74+
`partition_by_affinity_domain`, and then each of the
75+
resulting sub-devices can be further partitioned by
76+
`ext_intel_partition_by_cslice`. Single-tile devices (e.g. ATS-M) can be
77+
directly partitioned by `ext_intel_partition_by_cslice` (for those ATS-M parts
78+
that have multiple cslice partitions).
79+
80+
It is important to understand that the device driver virtualizes work
81+
submission to the cslice sub-devices. This virtualization happens only between
82+
processes, and not within a single process. For example, consider a single
83+
process that constructs two SYCL queues on cslice sub-device #0. Kernels
84+
submitted to these two queues are guaranteed to conflict, both using the same
85+
set of execution units. Therefore, if a single process wants to explicitly
86+
submit kernels to cslice sub-devices and it wants to avoid conflict, it should
87+
create queues on different sub-devices. By contrast, consider an example where
88+
two separate processes create a SYCL queue on cslice sub-device #0. In this
89+
case, the device driver virtualizes access to this cslice, and kernels
90+
submitted from the first process may run on different execution units than
91+
kernels submitted from the second process. In this second case, the device
92+
driver binds the process's requested cslice to a physical cslice according to
93+
the overall system load.
94+
95+
For information about configuring the device driver to support cslice
96+
partitioning, see the driver documentation on {multi-CCS}[multi-CCS] mode.
97+
Currently, it is only possible to partition a device by cslice if the driver is
98+
in "2 CCS Mode" or "4 CCS Mode". When in 2 CCS Mode, a tile can be partitioned
99+
into two cslice sub-devices. When in 4 CCS Mode, a tile can be partitioned
100+
into four cslice sub-devices.
101+
102+
Note that this extension can be supported by any backend and any device. If a
103+
backend or device does not support the concept of cslice partitions, it can
104+
still conform to this extension by declaring the new enumerator and member
105+
functions specified below. If the info descriptor query
106+
`info::device::partition_properties` does not report
107+
`ext_intel_partition_by_cslice`, then the implementation need not support
108+
the creation of cslice partitions.
109+
110+
111+
== Specification
112+
113+
=== Feature test macro
114+
115+
This extension provides a feature-test macro as described in the core SYCL
116+
specification. An implementation supporting this extension must predefine the
117+
macro `SYCL_EXT_INTEL_CSLICE` to one of the values defined in the table
118+
below. Applications can test for the existence of this macro to determine if
119+
the implementation supports this feature, or applications can test the macro's
120+
value to determine which of the extension's features the implementation
121+
supports.
122+
123+
[%header,cols="1,5"]
124+
|===
125+
|Value
126+
|Description
127+
128+
|1
129+
|Initial version of this extension.
130+
|===
131+
132+
=== New partition property
133+
134+
This extension adds a new enumerator named `ext_intel_partition_by_cslice` to
135+
`info::partition_property`:
136+
137+
```
138+
namespace sycl::info {
139+
140+
enum class partition_property : /* unspecified */ {
141+
// ...
142+
ext_intel_partition_by_cslice
143+
};
144+
145+
} // namespace sycl::info
146+
```
147+
148+
The behavior of the `info::device::partition_properties` info descriptor query
149+
is also extended to include `ext_intel_partition_by_cslice` in the vector of
150+
returned values if the device can be partitioned into at least two sub-devices
151+
along that partition property.
152+
153+
=== New function template specialization to create sub-devices
154+
155+
This extension adds a new function template specialization to the `device`
156+
class:
157+
158+
```
159+
namespace sycl {
160+
161+
class device {
162+
// ...
163+
164+
// Available only when
165+
// Prop == info::partition_property::ext_intel_partition_by_cslice
166+
template <info::partition_property Prop>
167+
std::vector<device> create_sub_devices() const;
168+
};
169+
170+
} // namespace sycl
171+
```
172+
173+
This function only participates in overload resolution if the `Prop` template
174+
parameter is `info::partition_property::ext_intel_partition_by_cslice`. It
175+
returns a `std::vector` of sub-devices partitioned from this SYCL `device`,
176+
each representing a fixed set of hardware cslices.
177+
178+
If the SYCL `device` does not support
179+
`info::partition_property::ext_intel_partition_by_cslice`, calling this
180+
function throws a synchronous `exception` with the
181+
`errc::feature_not_supported` error code.
182+
183+
=== Behavior of device info queries for a "cslice" sub-device
184+
185+
This section describes the behavior for some of the device info queries when
186+
applied to a `device` object that represents a "cslice" partition.
187+
188+
* `info::device::partition_type_property`
189+
+
190+
Returns `ext_intel_partition_by_cslice`.
191+
192+
* `info::device::max_compute_units`
193+
+
194+
When partitioning by `ext_intel_partition_by_cslice`, each sub-device
195+
represents a fixed subset of the parent device's compute units. This query
196+
returns the number of compute units represented by the sub-device.
197+
198+
The remaining device info queries return the properties or limits of the
199+
sub-device, as is typical for these queries. In general, if a resource is
200+
partitioned among the sub-devices, then the associated info query will
201+
return each sub-device's share of the resource. However, if a resource is
202+
shared by all of the sub-devices, then the associated info query for each
203+
sub-device will return the same value as for the parent device. For example,
204+
if device global memory is shared by all cslice partitions in a tile, then the
205+
info query `info::device::global_mem_size` will return the same value for the
206+
`device` object representing the tile as for the `device` object representing
207+
a cslice.
208+
209+
=== Behavior of the Level Zero backend interop functions
210+
211+
The Level Zero device driver doesn't use the concept of sub-device to represent
212+
a fixed partition of cslices. Instead, a Level Zero command queue can be
213+
created with a particular queue index, which represents a partition of the
214+
cslices.
215+
216+
As a result, calling `get_native` for a SYCL `device` that represents a cslice
217+
partition returns the same `ze_device_handle_t` as the parent device. If an
218+
application wants a native handle to the cslice partition, it must create a
219+
SYCL `queue` and then call `get_native` on the `queue`. This will return a
220+
`ze_command_queue_handle_t` that corresponds to the cslice partition.
221+
222+
=== Behavior of the OpenCL backend interop functions
223+
224+
The OpenCL device driver doesn't use the concept of sub-device to represent a
225+
fixed partition of cslices. Instead, an OpenCL command queue can be created
226+
with a particular queue index, which represents a partition of the cslices.
227+
228+
As a result, calling `get_native` for a SYCL `device` that represents a cslice
229+
partition returns the same `cl_device_id` as the parent device. If an
230+
application wants a native handle to the cslice partition, it must create a
231+
SYCL `queue` and then call `get_native` on the `queue`. This will return a
232+
`cl_command_queue` that corresponds to the cslice partition.
233+
234+
235+
== Impact on the ONEAPI_DEVICE_SELECTOR environment variable
236+
237+
:oneapi-device-selector: https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#oneapi_device_selector
238+
239+
This section describes the effect of this extension on the {dpcpp}
240+
`ONEAPI_DEVICE_SELECTOR` environment variable. Since this environment variable
241+
is not part of the SYCL specification, this section is not a normative part of
242+
the extension specification. Rather, it only describes the impact on {dpcpp}.
243+
244+
As described in the {oneapi-device-selector}[documentation] for the
245+
`ONEAPI_DEVICE_SELECTOR`, a term in the selector string can be an integral
246+
number followed by a decimal point (`.`), where the decimal point indicates a
247+
sub-device. For example, `1.2` means sub-device #2 of device #1. These
248+
decimal points can represent either a sub-device created via
249+
`partition_by_affinity_domain` or via `ext_intel_partition_by_cslice`. When
250+
{dpcpp} processes a term with a decimal point, it first attempts to partition
251+
by `ext_intel_partition_by_cslice`. If that is not possible, it next attempts
252+
to partition by `partition_by_affinity_domain` /
253+
`partition_affinity_domain::next_partitionable`.
254+
255+
It is important to keep in mind, though, that requesting a specific cslice via
256+
this environment variable has limited effect due to the device driver's
257+
virtualization of cslices. To illustrate, consider an example where two
258+
processes are launched as follows, selecting different cslice sub-devices:
259+
260+
```
261+
$ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.0 ZEX_NUMBER_OF_CCS=0:2 ./foo
262+
$ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.1 ZEX_NUMBER_OF_CCS=0:2 ./foo
263+
```
264+
265+
The first process selects cslice #0 while the second selects cslice #1. This
266+
does have the effect that each process is constrained to a single cslice (which
267+
is not the {dpcpp} default). However, the actual cslice number is irrelevant.
268+
Because of cslice virtualization, the device driver will choose some available
269+
cslice for each process, ignoring the value requested in the environment
270+
variable. As a result, the following example has exactly the same effect:
271+
272+
```
273+
$ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.0 ZEX_NUMBER_OF_CCS=0:2 ./foo
274+
$ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.0 ZEX_NUMBER_OF_CCS=0:2 ./foo
275+
```

0 commit comments

Comments
 (0)