|
| 1 | += sycl_ext_intel_cslice |
| 2 | + |
| 3 | +:source-highlighter: coderay |
| 4 | +:coderay-linenums-mode: table |
| 5 | + |
| 6 | +// This section needs to be after the document title. |
| 7 | +:doctype: book |
| 8 | +:toc2: |
| 9 | +:toc: left |
| 10 | +:encoding: utf-8 |
| 11 | +:lang: en |
| 12 | +:dpcpp: pass:[DPC++] |
| 13 | + |
| 14 | +// Set the default source code type in this document to C++, |
| 15 | +// for syntax highlighting purposes. This is needed because |
| 16 | +// docbook uses c++ and html5 uses cpp. |
| 17 | +:language: {basebackend@docbook:c++:cpp} |
| 18 | + |
| 19 | + |
| 20 | +== Notice |
| 21 | + |
| 22 | +[%hardbreaks] |
| 23 | +Copyright (C) 2022-2022 Intel Corporation. All rights reserved. |
| 24 | + |
| 25 | +Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks |
| 26 | +of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by |
| 27 | +permission by Khronos. |
| 28 | + |
| 29 | + |
| 30 | +== Contact |
| 31 | + |
| 32 | +To report problems with this extension, please open a new issue at: |
| 33 | + |
| 34 | +https://github.com/intel/llvm/issues |
| 35 | + |
| 36 | + |
| 37 | +== Dependencies |
| 38 | + |
| 39 | +This extension is written against the SYCL 2020 revision 6 specification. All |
| 40 | +references below to the "core SYCL specification" or to section numbers in the |
| 41 | +SYCL specification refer to that revision. |
| 42 | + |
| 43 | + |
| 44 | +== Status |
| 45 | + |
| 46 | +This is a proposed extension specification, intended to gather community |
| 47 | +feedback. Interfaces defined in this specification may not be implemented yet |
| 48 | +or may be in a preliminary state. The specification itself may also change in |
| 49 | +incompatible ways before it is finalized. *Shipping software products should |
| 50 | +not rely on APIs defined in this specification.* |
| 51 | + |
| 52 | + |
| 53 | +== Overview |
| 54 | + |
| 55 | +:multi-CCS: https://github.com/intel/compute-runtime/blob/master/level_zero/doc/experimental_extensions/MULTI_CCS_MODES.md |
| 56 | + |
| 57 | +Some Intel GPU devices can be partitioned at a granularity of "cslice" (compute |
| 58 | +slice), which is a smaller granularity than "tile". This form of partitioning |
| 59 | +is not currently enabled by default, so it is considered an advanced feature |
| 60 | +which most applications are not expected to use. This extension provides a way |
| 61 | +for these advanced applications to partition a device by cslice when it is |
| 62 | +enabled in the device driver. |
| 63 | + |
| 64 | +Unlike "tile" partitions, a cslice partition does not have any different cache |
| 65 | +affinity from its sibling cslice partitions. Therefore, this extension does |
| 66 | +not expose this type of partitioning through |
| 67 | +`info::partition_property::partition_by_affinity_domain`. Instead, it adds a |
| 68 | +new partitioning type |
| 69 | +`info::partition_property::ext_intel_partition_by_cslice`. |
| 70 | + |
| 71 | +The only Intel GPU devices that currently support this type of partitioning |
| 72 | +are the Data Center GPU Max series (aka PVC), and this support is only |
| 73 | +available when the device driver is configured in {multi-CCS}[multi-CCS] mode. |
| 74 | +See that documentation for instructions on how to enable this mode and for |
| 75 | +other important information. Currently, it is only possible to partition a |
| 76 | +device by cslice if the driver is in "2 CCS Mode" or "4 CCS Mode". When in |
| 77 | +2 CCS Mode, a tile can be partitioned into two cslice sub-devices. When in |
| 78 | +4 CCS Mode, a tile can be partitioned into four cslice sub-devices. |
| 79 | + |
| 80 | +This type of partitioning is currently supported only at the "tile" level. |
| 81 | +A device should be queried using `info::device::partition_properties` to |
| 82 | +determine if it supports partitioning by `ext_intel_partition_by_cslice`. If a |
| 83 | +device does not support partitioning by `ext_intel_partition_by_cslice` it may |
| 84 | +first need to be partitioned into per-tile sub-devices via |
| 85 | +`partition_by_affinity_domain`, and then each of the resulting sub-devices may |
| 86 | +be further partitioned by `ext_intel_partition_by_cslice`. |
| 87 | + |
| 88 | +It is important to understand that the device driver virtualizes work |
| 89 | +submission to the cslice sub-devices. (More specifically, the device driver |
| 90 | +virtualizes work submission to different CCS-es, and this means that on Data |
| 91 | +Center GPU Max series devices the work submission to a cslice is virtualized.) |
| 92 | +This virtualization happens only between processes, and not within a single |
| 93 | +process. For example, consider a single process that constructs two SYCL |
| 94 | +queues on cslice sub-device #0. Kernels submitted to these two queues are |
| 95 | +guaranteed to conflict, both using the same set of execution units. Therefore, |
| 96 | +if a single process wants to explicitly submit kernels to cslice sub-devices |
| 97 | +and it wants to avoid conflict, it should create queues on different |
| 98 | +sub-devices. By contrast, consider an example where two separate processes |
| 99 | +create a SYCL queue on cslice sub-device #0. In this case, the device driver |
| 100 | +virtualizes access to this cslice, and kernels submitted from the first process |
| 101 | +may run on different execution units than kernels submitted from the second |
| 102 | +process. In this second case, the device driver binds the process's requested |
| 103 | +cslice to a physical cslice according to the overall system load. |
| 104 | + |
| 105 | +Note that this extension can be supported by any implementation. If an |
| 106 | +implementation supports a backend or device without the concept of cslice |
| 107 | +partitions it can still conform to this extension by declaring the new |
| 108 | +enumerator and member functions specified below. If the info descriptor query |
| 109 | +`info::device::partition_properties` does not report |
| 110 | +`ext_intel_partition_by_cslice`, then the backend or device does not support |
| 111 | +the creation of cslice partitions. |
| 112 | + |
| 113 | + |
| 114 | +== Specification |
| 115 | + |
| 116 | +=== Feature test macro |
| 117 | + |
| 118 | +This extension provides a feature-test macro as described in the core SYCL |
| 119 | +specification. An implementation supporting this extension must predefine the |
| 120 | +macro `SYCL_EXT_INTEL_CSLICE` to one of the values defined in the table |
| 121 | +below. Applications can test for the existence of this macro to determine if |
| 122 | +the implementation supports this feature, or applications can test the macro's |
| 123 | +value to determine which of the extension's features the implementation |
| 124 | +supports. |
| 125 | + |
| 126 | +[%header,cols="1,5"] |
| 127 | +|=== |
| 128 | +|Value |
| 129 | +|Description |
| 130 | + |
| 131 | +|1 |
| 132 | +|Initial version of this extension. |
| 133 | +|=== |
| 134 | + |
| 135 | +=== New partition property |
| 136 | + |
| 137 | +This extension adds a new enumerator named `ext_intel_partition_by_cslice` to |
| 138 | +`info::partition_property`: |
| 139 | + |
| 140 | +``` |
| 141 | +namespace sycl::info { |
| 142 | + |
| 143 | +enum class partition_property : /* unspecified */ { |
| 144 | + // ... |
| 145 | + ext_intel_partition_by_cslice |
| 146 | +}; |
| 147 | + |
| 148 | +} // namespace sycl::info |
| 149 | +``` |
| 150 | + |
| 151 | +The behavior of the `info::device::partition_properties` info descriptor query |
| 152 | +is also extended to include `ext_intel_partition_by_cslice` in the vector of |
| 153 | +returned values if the device can be partitioned into at least two sub-devices |
| 154 | +along that partition property. |
| 155 | + |
| 156 | +=== New function template specialization to create sub-devices |
| 157 | + |
| 158 | +This extension adds a new function template specialization to the `device` |
| 159 | +class: |
| 160 | + |
| 161 | +``` |
| 162 | +namespace sycl { |
| 163 | + |
| 164 | +class device { |
| 165 | + // ... |
| 166 | + |
| 167 | + // Available only when |
| 168 | + // Prop == info::partition_property::ext_intel_partition_by_cslice |
| 169 | + template <info::partition_property Prop> |
| 170 | + std::vector<device> create_sub_devices() const; |
| 171 | +}; |
| 172 | + |
| 173 | +} // namespace sycl |
| 174 | +``` |
| 175 | + |
| 176 | +This function only participates in overload resolution if the `Prop` template |
| 177 | +parameter is `info::partition_property::ext_intel_partition_by_cslice`. It |
| 178 | +returns a `std::vector` of sub-devices partitioned from this SYCL `device`, |
| 179 | +each representing a fixed set of hardware cslices. |
| 180 | + |
| 181 | +If the SYCL `device` does not support |
| 182 | +`info::partition_property::ext_intel_partition_by_cslice`, calling this |
| 183 | +function throws a synchronous `exception` with the |
| 184 | +`errc::feature_not_supported` error code. |
| 185 | + |
| 186 | +=== Behavior of device info queries for a "cslice" sub-device |
| 187 | + |
| 188 | +This section describes the behavior for some of the device info queries when |
| 189 | +applied to a `device` object that represents a "cslice" partition. |
| 190 | + |
| 191 | +* `info::device::partition_type_property` |
| 192 | ++ |
| 193 | +Returns `ext_intel_partition_by_cslice`. |
| 194 | + |
| 195 | +* `info::device::max_compute_units` |
| 196 | ++ |
| 197 | +When partitioning by `ext_intel_partition_by_cslice`, each sub-device |
| 198 | +represents a fixed subset of the parent device's compute units. This query |
| 199 | +returns the number of compute units represented by the sub-device. |
| 200 | + |
| 201 | +The remaining device info queries return the properties or limits of the |
| 202 | +sub-device, as is typical for these queries. In general, if a resource is |
| 203 | +partitioned among the sub-devices, then the associated info query will |
| 204 | +return each sub-device's share of the resource. However, if a resource is |
| 205 | +shared by all of the sub-devices, then the associated info query for each |
| 206 | +sub-device will return the same value as for the parent device. For example, |
| 207 | +if device global memory is shared by all cslice partitions in a tile, then the |
| 208 | +info query `info::device::global_mem_size` will return the same value for the |
| 209 | +`device` object representing the tile as for the `device` object representing |
| 210 | +a cslice. |
| 211 | + |
| 212 | +=== Behavior of the Level Zero backend interop functions |
| 213 | + |
| 214 | +The Level Zero device driver doesn't use the concept of sub-device to represent |
| 215 | +a fixed partition of cslices. Instead, a Level Zero command queue can be |
| 216 | +created with a particular queue index, which represents a partition of the |
| 217 | +cslices. |
| 218 | + |
| 219 | +As a result, calling `get_native` for a SYCL `device` that represents a cslice |
| 220 | +partition returns the same `ze_device_handle_t` as the parent device. If an |
| 221 | +application wants a native handle to the cslice partition, it must create a |
| 222 | +SYCL `queue` and then call `get_native` on the `queue`. This will return a |
| 223 | +`ze_command_queue_handle_t` that corresponds to the cslice partition. |
| 224 | + |
| 225 | +=== Behavior of the OpenCL backend interop functions |
| 226 | + |
| 227 | +The OpenCL device driver doesn't use the concept of sub-device to represent a |
| 228 | +fixed partition of cslices. Instead, an OpenCL command queue can be created |
| 229 | +with a particular queue index, which represents a partition of the cslices. |
| 230 | + |
| 231 | +As a result, calling `get_native` for a SYCL `device` that represents a cslice |
| 232 | +partition returns the same `cl_device_id` as the parent device. If an |
| 233 | +application wants a native handle to the cslice partition, it must create a |
| 234 | +SYCL `queue` and then call `get_native` on the `queue`. This will return a |
| 235 | +`cl_command_queue` that corresponds to the cslice partition. |
| 236 | + |
| 237 | + |
| 238 | +== Impact on the ONEAPI_DEVICE_SELECTOR environment variable |
| 239 | + |
| 240 | +:oneapi-device-selector: https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#oneapi_device_selector |
| 241 | + |
| 242 | +This section describes the effect of this extension on the {dpcpp} |
| 243 | +`ONEAPI_DEVICE_SELECTOR` environment variable. Since this environment variable |
| 244 | +is not part of the SYCL specification, this section is not a normative part of |
| 245 | +the extension specification. Rather, it only describes the impact on {dpcpp}. |
| 246 | + |
| 247 | +As described in the {oneapi-device-selector}[documentation] for the |
| 248 | +`ONEAPI_DEVICE_SELECTOR`, a term in the selector string can be an integral |
| 249 | +number followed by a decimal point (`.`), where the decimal point indicates a |
| 250 | +sub-device. For example, `1.2` means sub-device #2 of device #1. These |
| 251 | +decimal points can represent either a sub-device created via |
| 252 | +`partition_by_affinity_domain` or via `ext_intel_partition_by_cslice`. When |
| 253 | +{dpcpp} processes a term with a decimal point, it first attempts to partition |
| 254 | +by `ext_intel_partition_by_cslice`. If that is not possible, it next attempts |
| 255 | +to partition by `partition_by_affinity_domain` / |
| 256 | +`partition_affinity_domain::next_partitionable`. |
| 257 | + |
| 258 | +It is important to keep in mind, though, that requesting a specific cslice via |
| 259 | +this environment variable has limited effect due to the device driver's |
| 260 | +virtualization of cslices. To illustrate, consider an example where two |
| 261 | +processes are launched as follows, selecting different cslice sub-devices: |
| 262 | + |
| 263 | +``` |
| 264 | +$ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.0 ZEX_NUMBER_OF_CCS=0:2 ./foo |
| 265 | +$ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.1 ZEX_NUMBER_OF_CCS=0:2 ./foo |
| 266 | +``` |
| 267 | + |
| 268 | +The first process selects cslice #0 while the second selects cslice #1. This |
| 269 | +does have the effect that each process is constrained to a single cslice (which |
| 270 | +is not the {dpcpp} default). However, the actual cslice number is irrelevant. |
| 271 | +Because of cslice virtualization, the device driver will choose some available |
| 272 | +cslice for each process instead of honoring the value requested in the |
| 273 | +environment variable. As a result, the following example has exactly the same |
| 274 | +effect: |
| 275 | + |
| 276 | +``` |
| 277 | +$ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.0 ZEX_NUMBER_OF_CCS=0:2 ./foo |
| 278 | +$ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.0 ZEX_NUMBER_OF_CCS=0:2 ./foo |
| 279 | +``` |
0 commit comments