Commit 5777e1f, authored by gmlueck

[SYCL] Add spec for sycl_ext_intel_cslice (#7513)

Add a proposed extension specification that allows partitioning a device by
"cslice" (aka CCS-es).

Co-authored-by: Ben Ashbaugh <[email protected]>
Co-authored-by: jbrodman <[email protected]>

= sycl_ext_intel_cslice

:source-highlighter: coderay
:coderay-linenums-mode: table

// This section needs to be after the document title.
:doctype: book
:toc2:
:toc: left
:encoding: utf-8
:lang: en
:dpcpp: pass:[DPC++]

// Set the default source code type in this document to C++,
// for syntax highlighting purposes. This is needed because
// docbook uses c++ and html5 uses cpp.
:language: {basebackend@docbook:c++:cpp}

== Notice

[%hardbreaks]
Copyright (C) 2022 Intel Corporation. All rights reserved.

Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks
of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by
permission by Khronos.

== Contact

To report problems with this extension, please open a new issue at:

https://github.com/intel/llvm/issues

== Dependencies

This extension is written against the SYCL 2020 revision 6 specification. All
references below to the "core SYCL specification" or to section numbers in the
SYCL specification refer to that revision.

== Status

This is a proposed extension specification, intended to gather community
feedback. Interfaces defined in this specification may not be implemented yet
or may be in a preliminary state. The specification itself may also change in
incompatible ways before it is finalized. *Shipping software products should
not rely on APIs defined in this specification.*

== Overview

:multi-CCS: https://github.com/intel/compute-runtime/blob/master/level_zero/doc/experimental_extensions/MULTI_CCS_MODES.md

Some Intel GPU devices can be partitioned at the granularity of a "cslice"
(compute slice), which is smaller than a "tile". This form of partitioning is
not currently enabled by default, so it is considered an advanced feature
which most applications are not expected to use. This extension provides a way
for these advanced applications to partition a device by cslice when it is
enabled in the device driver.

Unlike "tile" partitions, a cslice partition does not have a different cache
affinity from its sibling cslice partitions. Therefore, this extension does
not expose this type of partitioning through
`info::partition_property::partition_by_affinity_domain`. Instead, it adds a
new partitioning type,
`info::partition_property::ext_intel_partition_by_cslice`.

The only Intel GPU devices that currently support this type of partitioning
are the Data Center GPU Max series (aka PVC), and this support is only
available when the device driver is configured in {multi-CCS}[multi-CCS] mode.
See that documentation for instructions on how to enable this mode and for
other important information. Currently, it is only possible to partition a
device by cslice if the driver is in "2 CCS Mode" or "4 CCS Mode". When in
2 CCS Mode, a tile can be partitioned into two cslice sub-devices. When in
4 CCS Mode, a tile can be partitioned into four cslice sub-devices.

This type of partitioning is currently supported only at the "tile" level.
A device should be queried using `info::device::partition_properties` to
determine if it supports partitioning by `ext_intel_partition_by_cslice`. If a
device does not support partitioning by `ext_intel_partition_by_cslice`, it
may first need to be partitioned into per-tile sub-devices via
`partition_by_affinity_domain`, and then each of the resulting sub-devices may
be further partitioned by `ext_intel_partition_by_cslice`.
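The query-then-partition flow above can be sketched with a toy model. This is
plain C++, not the real SYCL API: `Device`, `PartitionProperty`, and
`partition_into_cslices` are hypothetical stand-ins that only illustrate the
fallback from direct cslice partitioning to partitioning into tiles first.

```cpp
#include <algorithm>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for info::partition_property (toy model only).
enum class PartitionProperty { by_affinity_domain, by_cslice };

// Toy device: lists the partition properties it supports and how many
// sub-devices a partition yields.
struct Device {
  std::vector<PartitionProperty> supported;
  int sub_device_count = 0;

  bool supports(PartitionProperty p) const {
    return std::find(supported.begin(), supported.end(), p) != supported.end();
  }

  std::vector<Device> create_sub_devices(PartitionProperty p) const {
    if (!supports(p))
      throw std::runtime_error("feature_not_supported");
    // In this model, partitioning into tiles yields tiles that each
    // support cslice partitioning (two cslices, as in 2 CCS Mode);
    // partitioning by cslice yields leaf devices.
    Device child{{PartitionProperty::by_cslice}, 2};
    if (p == PartitionProperty::by_cslice)
      child = Device{{}, 0};
    return std::vector<Device>(sub_device_count, child);
  }
};

// The flow from the text: partition by cslice directly when supported;
// otherwise partition into per-tile sub-devices first, then partition
// each tile by cslice.
std::vector<Device> partition_into_cslices(const Device& root) {
  if (root.supports(PartitionProperty::by_cslice))
    return root.create_sub_devices(PartitionProperty::by_cslice);
  std::vector<Device> cslices;
  for (const Device& tile :
       root.create_sub_devices(PartitionProperty::by_affinity_domain)) {
    auto parts = tile.create_sub_devices(PartitionProperty::by_cslice);
    cslices.insert(cslices.end(), parts.begin(), parts.end());
  }
  return cslices;
}
```

In this sketch, a root device supporting only affinity-domain partitioning
into two tiles yields four cslices in total under 2 CCS Mode.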

It is important to understand that the device driver virtualizes work
submission to the cslice sub-devices. (More specifically, the device driver
virtualizes work submission to different CCS-es, and this means that on Data
Center GPU Max series devices the work submission to a cslice is virtualized.)
This virtualization happens only between processes, and not within a single
process. For example, consider a single process that constructs two SYCL
queues on cslice sub-device #0. Kernels submitted to these two queues are
guaranteed to conflict, both using the same set of execution units. Therefore,
if a single process wants to explicitly submit kernels to cslice sub-devices
and it wants to avoid conflict, it should create queues on different
sub-devices. By contrast, consider an example where two separate processes
create a SYCL queue on cslice sub-device #0. In this case, the device driver
virtualizes access to this cslice, and kernels submitted from the first
process may run on different execution units than kernels submitted from the
second process. In this second case, the device driver binds the process's
requested cslice to a physical cslice according to the overall system load.

Note that this extension can be supported by any implementation. If an
implementation supports a backend or device without the concept of cslice
partitions, it can still conform to this extension by declaring the new
enumerator and member functions specified below. If the info descriptor query
`info::device::partition_properties` does not report
`ext_intel_partition_by_cslice`, then the backend or device does not support
the creation of cslice partitions.

== Specification

=== Feature test macro

This extension provides a feature-test macro as described in the core SYCL
specification. An implementation supporting this extension must predefine the
macro `SYCL_EXT_INTEL_CSLICE` to one of the values defined in the table
below. Applications can test for the existence of this macro to determine if
the implementation supports this feature, or applications can test the macro's
value to determine which of the extension's features the implementation
supports.

[%header,cols="1,5"]
|===
|Value
|Description

|1
|Initial version of this extension.
|===
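
For illustration, an application can guard its use of the extension with this
macro. The sketch below shows only the guard pattern; the helper name
`choose_partition_path` is hypothetical and not part of the extension. In a
translation unit compiled without a SYCL implementation, the macro is
undefined and the fallback branch is taken.

```cpp
#include <string>

// Reports which partitioning strategy this build would use, based on
// the extension's feature-test macro.
std::string choose_partition_path() {
#if defined(SYCL_EXT_INTEL_CSLICE) && SYCL_EXT_INTEL_CSLICE >= 1
  // The implementation predefines the macro: cslice partitioning may
  // be attempted (still subject to the runtime query).
  return "ext_intel_partition_by_cslice";
#else
  // Extension not available at compile time: fall back to
  // affinity-domain partitioning or use the device whole.
  return "partition_by_affinity_domain";
#endif
}
```

Note that the macro only indicates compile-time availability; whether a given
device actually supports the partition property must still be checked at run
time via `info::device::partition_properties`.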

=== New partition property

This extension adds a new enumerator named `ext_intel_partition_by_cslice` to
`info::partition_property`:

```
namespace sycl::info {

enum class partition_property : /* unspecified */ {
  // ...
  ext_intel_partition_by_cslice
};

} // namespace sycl::info
```

The behavior of the `info::device::partition_properties` info descriptor query
is also extended to include `ext_intel_partition_by_cslice` in the vector of
returned values if the device can be partitioned into at least two sub-devices
along that partition property.

=== New function template specialization to create sub-devices

This extension adds a new function template specialization to the `device`
class:

```
namespace sycl {

class device {
  // ...

  // Available only when
  // Prop == info::partition_property::ext_intel_partition_by_cslice
  template <info::partition_property Prop>
  std::vector<device> create_sub_devices() const;
};

} // namespace sycl
```

This function only participates in overload resolution if the `Prop` template
parameter is `info::partition_property::ext_intel_partition_by_cslice`. It
returns a `std::vector` of sub-devices partitioned from this SYCL `device`,
each representing a fixed set of hardware cslices.

If the SYCL `device` does not support
`info::partition_property::ext_intel_partition_by_cslice`, calling this
function throws a synchronous `exception` with the
`errc::feature_not_supported` error code.
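
The throwing behavior above suggests an attempt-and-fall-back pattern. The
sketch below models it with a toy type in plain C++ (not the real SYCL API;
`ToyDevice` and `try_partition` are hypothetical stand-ins).

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical toy stand-in: create_sub_devices_by_cslice throws when
// the device cannot be partitioned by cslice, mirroring the
// errc::feature_not_supported behavior described above.
struct ToyDevice {
  bool cslice_supported = false;

  std::vector<ToyDevice> create_sub_devices_by_cslice() const {
    if (!cslice_supported)
      throw std::runtime_error("errc::feature_not_supported");
    // e.g. 2 CCS Mode: a tile partitions into two cslices.
    return {ToyDevice{}, ToyDevice{}};
  }
};

// Attempt cslice partitioning; in the unsupported case, fall back to
// using the device as a whole.
std::vector<ToyDevice> try_partition(const ToyDevice& d) {
  try {
    return d.create_sub_devices_by_cslice();
  } catch (const std::runtime_error&) {
    return {d};
  }
}
```

In real code, an application would instead check
`info::device::partition_properties` before calling `create_sub_devices`, and
reserve the exception path for unexpected cases.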

=== Behavior of device info queries for a "cslice" sub-device

This section describes the behavior for some of the device info queries when
applied to a `device` object that represents a "cslice" partition.

* `info::device::partition_type_property`
+
Returns `ext_intel_partition_by_cslice`.

* `info::device::max_compute_units`
+
When partitioning by `ext_intel_partition_by_cslice`, each sub-device
represents a fixed subset of the parent device's compute units. This query
returns the number of compute units represented by the sub-device.

The remaining device info queries return the properties or limits of the
sub-device, as is typical for these queries. In general, if a resource is
partitioned among the sub-devices, then the associated info query will
return each sub-device's share of the resource. However, if a resource is
shared by all of the sub-devices, then the associated info query for each
sub-device will return the same value as for the parent device. For example,
if device global memory is shared by all cslice partitions in a tile, then the
info query `info::device::global_mem_size` will return the same value for the
`device` object representing the tile as for the `device` object representing
a cslice.

=== Behavior of the Level Zero backend interop functions

The Level Zero device driver doesn't use the concept of sub-device to represent
a fixed partition of cslices. Instead, a Level Zero command queue can be
created with a particular queue index, which represents a partition of the
cslices.

As a result, calling `get_native` for a SYCL `device` that represents a cslice
partition returns the same `ze_device_handle_t` as the parent device. If an
application wants a native handle to the cslice partition, it must create a
SYCL `queue` and then call `get_native` on the `queue`. This will return a
`ze_command_queue_handle_t` that corresponds to the cslice partition.

=== Behavior of the OpenCL backend interop functions

The OpenCL device driver doesn't use the concept of sub-device to represent a
fixed partition of cslices. Instead, an OpenCL command queue can be created
with a particular queue index, which represents a partition of the cslices.

As a result, calling `get_native` for a SYCL `device` that represents a cslice
partition returns the same `cl_device_id` as the parent device. If an
application wants a native handle to the cslice partition, it must create a
SYCL `queue` and then call `get_native` on the `queue`. This will return a
`cl_command_queue` that corresponds to the cslice partition.

== Impact on the ONEAPI_DEVICE_SELECTOR environment variable

:oneapi-device-selector: https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#oneapi_device_selector

This section describes the effect of this extension on the {dpcpp}
`ONEAPI_DEVICE_SELECTOR` environment variable. Since this environment variable
is not part of the SYCL specification, this section is not a normative part of
the extension specification. Rather, it only describes the impact on {dpcpp}.

As described in the {oneapi-device-selector}[documentation] for the
`ONEAPI_DEVICE_SELECTOR`, a term in the selector string can be an integral
number followed by a decimal point (`.`), where the decimal point indicates a
sub-device. For example, `1.2` means sub-device #2 of device #1. These
decimal points can represent either a sub-device created via
`partition_by_affinity_domain` or via `ext_intel_partition_by_cslice`. When
{dpcpp} processes a term with a decimal point, it first attempts to partition
by `ext_intel_partition_by_cslice`. If that is not possible, it next attempts
to partition by `partition_by_affinity_domain` /
`partition_affinity_domain::next_partitionable`.
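
The term syntax above can be illustrated with a small parser. This is a
sketch under assumptions, not the {dpcpp} implementation: it handles only a
single `backend:N.N.N` term, while the real `ONEAPI_DEVICE_SELECTOR` grammar
also supports wildcards and comma-separated lists.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Parsed form of one selector term such as "level_zero:0.1.0": a
// backend name plus a path of indices, where each index after the
// first selects a sub-device of the previous level.
struct SelectorTerm {
  std::string backend;
  std::vector<int> indices;
};

// Minimal parser for a single "backend:N.N.N" term (illustration only).
SelectorTerm parse_term(const std::string& term) {
  SelectorTerm out;
  std::size_t colon = term.find(':');
  out.backend = term.substr(0, colon);
  std::string rest = term.substr(colon + 1);
  std::size_t pos = 0;
  while (pos != std::string::npos) {
    std::size_t dot = rest.find('.', pos);
    out.indices.push_back(std::stoi(rest.substr(pos, dot - pos)));
    pos = (dot == std::string::npos) ? dot : dot + 1;
  }
  return out;
}
```

So `level_zero:0.1.0` parses to backend `level_zero` with index path
`0 . 1 . 0`: device #0, sub-device #1 (a tile), sub-sub-device #0 (a cslice,
when cslice partitioning is available).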

It is important to keep in mind, though, that requesting a specific cslice via
this environment variable has limited effect due to the device driver's
virtualization of cslices. To illustrate, consider an example where two
processes are launched as follows, selecting different cslice sub-devices:

```
$ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.0 ZEX_NUMBER_OF_CCS=0:2 ./foo
$ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.1 ZEX_NUMBER_OF_CCS=0:2 ./foo
```

The first process selects cslice #0 while the second selects cslice #1. This
does have the effect that each process is constrained to a single cslice (which
is not the {dpcpp} default). However, the actual cslice number is irrelevant.
Because of cslice virtualization, the device driver will choose some available
cslice for each process instead of honoring the value requested in the
environment variable. As a result, the following example has exactly the same
effect:

```
$ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.0 ZEX_NUMBER_OF_CCS=0:2 ./foo
$ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.0 ZEX_NUMBER_OF_CCS=0:2 ./foo
```
