
Commit 7b8e5e8

[SYCL][L0] Add MD markup
Signed-off-by: Sergey V Maslov <[email protected]>
1 parent 7ae164f commit 7b8e5e8

1 file changed: 104 additions & 93 deletions

# Considerations for programming to multi-tile and multi-card under Level-Zero backend

## 1 Devices discovery
### 1.1 Root-devices

Intel GPUs are represented as SYCL GPU devices, root-devices.
The discovery of root-devices is best done with the "sycl-ls" tool, for example:

```
$ sycl-ls
[opencl:0] GPU : Intel(R) OpenCL HD Graphics 3.0 [21.19.19792]
[opencl:0] CPU : Intel(R) OpenCL 2.1 [2020.11.11.0.03_160000]
[level_zero:0] GPU : Intel(R) Level-Zero 1.1 [1.1.19792]
[host:0] HOST: SYCL host platform 1.2 [1.2]
```

Note that "sycl-ls" shows all devices from all platforms of all SYCL backends that are seen
by the SYCL runtime. Thus in the example above there is one CPU (managed by the OpenCL backend)
and two GPUs, both corresponding to the single physical GPU (managed by either the OpenCL or the
Level-Zero backend). There are a few ways to filter the observable root-devices.

One is the SYCL_DEVICE_FILTER environment variable described in [EnvironmentVariables.md](https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md):
```
$ SYCL_DEVICE_FILTER=level_zero sycl-ls
[level_zero:0] GPU : Intel(R) Level-Zero 1.1 [1.1.19792]
```
Another way is to use the similar SYCL API described in [FilterSelector.adoc](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/FilterSelector/FilterSelector.adoc).
E.g. `filter_selector("level_zero")` will only see Level-Zero operated devices.
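
For instance, a queue bound to a Level-Zero operated device can be created from such a selector. This is a minimal sketch; the exact namespace and header of `filter_selector` are defined by the FilterSelector extension linked above and may differ between compiler versions (assumed here to be `cl::sycl::ONEAPI::filter_selector`):

``` C++
#include <CL/sycl.hpp>
#include <iostream>

int main() {
  // filter_selector comes from the FilterSelector extension referenced above;
  // the namespace is an assumption and may need adjusting for your compiler.
  cl::sycl::ONEAPI::filter_selector Selector("level_zero");

  // The queue (and its implicit context) only sees Level-Zero operated devices.
  cl::sycl::queue Q(Selector);

  std::cout << Q.get_device().get_info<cl::sycl::info::device::name>()
            << std::endl;
  return 0;
}
```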

If there are multiple GPUs in a system then they will be seen as multiple different root-devices.
On Linux these would be multiple SYCL root-devices of the same SYCL platform (representing the Level-Zero driver).
On Windows these would appear as root-devices of multiple different SYCL platforms (Level-Zero drivers).

The `CreateMultipleRootDevices=N` environment variable can be used to emulate multiple GPU cards.

### 1.2 Sub-devices

Some Intel GPU HW is composed of multiple tiles, e.g. the 4-tile ATS.
The root-device in such cases can be partitioned into sub-devices, each corresponding to a physical tile:

``` C++
try {
  vector<device> SubDevices = RootDevice.create_sub_devices<
      cl::sycl::info::partition_property::partition_by_affinity_domain>(
      cl::sycl::info::partition_affinity_domain::next_partitionable);
} catch (cl::sycl::exception &) {
  // The root-device cannot be further partitioned (e.g. a single-tile GPU).
}
```

Each call to `create_sub_devices` will return exactly the same sub-devices and in the persistent order.
To control which sub-devices are exposed by the Level-Zero UMD, one can use the ZE_AFFINITY_MASK environment variable.

NOTE: `partition_by_affinity_domain` is the only partitioning scheme supported for Intel GPUs.
Similarly, `next_partitionable` and `numa` are the only affinity domains supported (both doing the same thing).

The `CreateMultipleSubDevices=N` environment variable can be used to emulate multiple tiles of a GPU.
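
A root-device can also be queried for how many sub-devices it supports before attempting to partition it. A minimal sketch using the standard SYCL device-info query (the GPU selection via `gpu_selector` is an assumption for illustration):

``` C++
#include <CL/sycl.hpp>
#include <iostream>
#include <vector>

using namespace cl::sycl;

int main() {
  device RootDevice(gpu_selector{});

  // Number of sub-devices (tiles) the root-device can be partitioned into;
  // 0 means the device cannot be partitioned.
  auto MaxSubDevices =
      RootDevice.get_info<info::device::partition_max_sub_devices>();
  std::cout << "Max sub-devices: " << MaxSubDevices << std::endl;

  if (MaxSubDevices > 0) {
    std::vector<device> SubDevices = RootDevice.create_sub_devices<
        info::partition_property::partition_by_affinity_domain>(
        info::partition_affinity_domain::next_partitionable);
    std::cout << "Created " << SubDevices.size() << " sub-devices" << std::endl;
  }
  return 0;
}
```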

## 2 Context

Contexts are used for resources isolation and sharing. A SYCL context may consist of one or multiple devices.
Both root-devices and sub-devices can be within a single context, but they all should be of the same SYCL platform.
A SYCL program (kernel_bundle) created against a context with multiple devices will be built for each of the root-devices in the context.
For a context that consists of multiple sub-devices of the same root-device, only a single build (for that root-device) is needed.
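
As an illustration, a context covering all tiles of a single card can be built directly from the sub-devices obtained above, and a context covering all cards from the platform's GPU devices. A minimal sketch under those assumptions:

``` C++
#include <CL/sycl.hpp>
#include <vector>

using namespace cl::sycl;

int main() {
  device RootDevice(gpu_selector{});
  std::vector<device> SubDevices = RootDevice.create_sub_devices<
      info::partition_property::partition_by_affinity_domain>(
      info::partition_affinity_domain::next_partitionable);

  // One context shared by all tiles of the same root-device: a program built
  // for this context needs only a single build (for that root-device).
  context MultiTileContext(SubDevices);

  // A context may also span several root-devices of the same platform,
  // e.g. all GPU cards operated by the Level-Zero driver.
  platform Plt = RootDevice.get_platform();
  context MultiCardContext(Plt.get_devices(info::device_type::gpu));
  return 0;
}
```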

## 3 Memory
### 3.1 USM

There are multiple ways to allocate memory:

`malloc_device`:
- Allocation can only be accessed by the specified device but not by other devices in the context nor by the host.
- The data stays on the device all the time and thus is the fastest available for kernel execution.
- An explicit copy is needed for transferring data to the host or other devices in the context.

`malloc_host`:
- Allocation can be accessed by the host and any other device in the context.
- The data stays on the host all the time and is accessed via PCI from the devices.
- No explicit copy is needed for synchronizing the data with the host or devices.

`malloc_shared`:
- Allocation can be accessed by the host and the specified device only.
- The data can migrate (operated by the Level-Zero driver) between the host and the device for faster access.
- No explicit copy is necessary for synchronizing between the host and the device, but it is needed for other devices in the context.
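
The allocation kinds are used as follows. A minimal sketch contrasting `malloc_device` (explicit copies required) with `malloc_shared` (migration handled by the driver); the simple increment kernel is an assumption for illustration:

``` C++
#include <CL/sycl.hpp>
#include <vector>

using namespace cl::sycl;

int main() {
  queue Q{gpu_selector{}};
  constexpr size_t N = 1024;
  std::vector<int> HostData(N, 1);

  // Device-only allocation: fastest for kernels, explicit copies needed.
  int *Dev = malloc_device<int>(N, Q);
  Q.memcpy(Dev, HostData.data(), N * sizeof(int)).wait();
  Q.parallel_for(range<1>{N}, [=](id<1> I) { Dev[I] += 1; }).wait();
  Q.memcpy(HostData.data(), Dev, N * sizeof(int)).wait();

  // Shared allocation: accessible on host and the device, no explicit copies.
  int *Shared = malloc_shared<int>(N, Q);
  Q.parallel_for(range<1>{N}, [=](id<1> I) { Shared[I] = I[0]; }).wait();
  int First = Shared[0]; // Direct host access; data migrates as needed.

  free(Dev, Q);
  free(Shared, Q);
  return (HostData[0] == 2 && First == 0) ? 0 : 1;
}
```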

NOTE: Memory allocated against a root-device is accessible by all of its sub-devices (tiles).
So if operating on a context with multiple sub-devices of the same root-device, you can use `malloc_device` on that root-device instead of using the slower `malloc_host`.
Remember that if using `malloc_device` you'd need an explicit copy out to the host if it is necessary to see the data there.
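
As an illustration of this note, the sketch below allocates device memory against the root-device and uses it from queues attached to its sub-devices. It assumes, per the note above, that the Level-Zero backend accepts the parent root-device in `malloc_device` for a context built from that device's tiles:

``` C++
#include <CL/sycl.hpp>
#include <vector>

using namespace cl::sycl;

int main() {
  device RootDevice(gpu_selector{});
  std::vector<device> SubDevices = RootDevice.create_sub_devices<
      info::partition_property::partition_by_affinity_domain>(
      info::partition_affinity_domain::next_partitionable);
  context Ctx(SubDevices);

  constexpr size_t N = 1024;
  // Allocated against the root-device: assumed visible to every tile in Ctx
  // (see the note above), and faster than malloc_host.
  int *Data = malloc_device<int>(N, RootDevice, Ctx);

  for (size_t T = 0; T < SubDevices.size(); ++T) {
    queue Q(Ctx, SubDevices[T]);
    // Every tile reads/writes the same root-device allocation.
    Q.parallel_for(range<1>{N}, [=](id<1> I) { Data[I] = int(T); }).wait();
  }

  // An explicit copy out is still required to see the data on the host.
  std::vector<int> HostData(N);
  queue QCopy(Ctx, SubDevices[0]);
  QCopy.memcpy(HostData.data(), Data, N * sizeof(int)).wait();

  free(Data, Ctx);
  return 0;
}
```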

### 3.2 Buffer

SYCL buffers are also created against a context and, under the hood, are mapped to the Level-Zero USM allocations discussed above.
The current mapping is the following:

- For an integrated device the allocations are made on the host and are accessible by the host and the device without any copying.
- Memory buffers for a context with sub-devices of the same root-device (possibly including the root-device itself) are allocated on that root-device.
  Thus they are readily accessible by all the devices in such a context. The synchronization with the host is performed by the SYCL RT with map/unmap doing implicit copies when necessary.
- Memory buffers for a context with devices from different root-devices in it are allocated on the host (thus made accessible to all devices).
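
A short sketch of a buffer shared by queues on two sub-devices of the same root-device, so that (per the mapping above) it is backed by a single allocation on that root-device; the write-then-increment kernels are assumptions for illustration:

``` C++
#include <CL/sycl.hpp>
#include <vector>

using namespace cl::sycl;

int main() {
  device RootDevice(gpu_selector{});
  std::vector<device> SubDevices = RootDevice.create_sub_devices<
      info::partition_property::partition_by_affinity_domain>(
      info::partition_affinity_domain::next_partitionable);
  context Ctx(SubDevices);

  constexpr size_t N = 1024;
  std::vector<int> HostData(N, 0);
  {
    // One buffer in a multi-sub-device context: allocated on the root-device
    // and accessible from every tile; host sync happens via map/unmap.
    buffer<int, 1> Buf(HostData.data(), range<1>{N});

    // Writer on the first tile.
    queue Q0(Ctx, SubDevices[0]);
    Q0.submit([&](handler &CGH) {
      auto Acc = Buf.get_access<access::mode::write>(CGH);
      CGH.parallel_for(range<1>{N}, [=](id<1> I) { Acc[I] = 1; });
    });

    // Updater on the second tile: the data stays on the root-device,
    // no host round-trip is needed between the two submissions.
    queue Q1(Ctx, SubDevices.size() > 1 ? SubDevices[1] : SubDevices[0]);
    Q1.submit([&](handler &CGH) {
      auto Acc = Buf.get_access<access::mode::read_write>(CGH);
      CGH.parallel_for(range<1>{N}, [=](id<1> I) { Acc[I] += 1; });
    });
  } // Buffer destruction writes the final data back to HostData.
  return HostData[0] == 2 ? 0 : 1;
}
```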

## 4 Queue

A SYCL queue is always attached to a single device in a possibly multi-device context.
Some typical scenarios are the following (from most performant to least performant):

A. Context with a single sub-device in it, and the queue is attached to that sub-device (tile)
- The execution/visibility is limited to the single sub-device only
- Expected to offer the best performance per tile

B. Context with multiple sub-devices of the same root-device (multi-tile)
- Queues are to be attached to the sub-devices, effectively implementing "explicit scaling" (see the sketch after this list)
- The root-device should not be passed to such a context for better performance

C. Context with a single root-device in it, and the queue is attached to that root-device
- The work will be automatically distributed across all sub-devices/tiles via "implicit scaling" by the driver
- The simplest way to enable multi-tile HW, but it doesn't offer the possibility to target specific tiles

D. Contexts with multiple root-devices (multi-card)
- The most unrestrictive context with queues attached to different root-devices
- Offers the most sharing possibilities at the cost of slow access through host memory or explicit copies being needed
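
The following is a minimal sketch of scenario B (explicit scaling): one queue per tile in a shared multi-sub-device context, with the iteration range split manually across the tiles. The even split, per-tile `malloc_device` allocations, and the trivial fill kernel are assumptions for illustration:

``` C++
#include <CL/sycl.hpp>
#include <vector>

using namespace cl::sycl;

int main() {
  device RootDevice(gpu_selector{});
  std::vector<device> SubDevices = RootDevice.create_sub_devices<
      info::partition_property::partition_by_affinity_domain>(
      info::partition_affinity_domain::next_partitionable);
  context Ctx(SubDevices); // Only sub-devices in the context: scenario B.

  constexpr size_t N = 1 << 20;
  std::vector<float> HostData(N, 0.f);
  size_t Chunk = N / SubDevices.size(); // Assumes N divides evenly.

  std::vector<queue> Queues;
  std::vector<float *> Parts;
  std::vector<event> Kernels;
  for (auto &SubDev : SubDevices) {
    queue Q(Ctx, SubDev);
    // Each tile gets its own device allocation and its own slice of work.
    float *Part = malloc_device<float>(Chunk, Q);
    Kernels.push_back(
        Q.parallel_for(range<1>{Chunk}, [=](id<1> I) { Part[I] = 42.f; }));
    Queues.push_back(Q);
    Parts.push_back(Part);
  }

  // Explicit copies back to the host, one per tile, after its kernel is done.
  for (size_t T = 0; T < Queues.size(); ++T) {
    Kernels[T].wait();
    Queues[T]
        .memcpy(HostData.data() + T * Chunk, Parts[T], Chunk * sizeof(float))
        .wait();
  }

  for (float *Part : Parts)
    free(Part, Ctx);
  return 0;
}
```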

Depending on the chosen programming model (A, B, C, D) and the algorithm used, make sure to do proper memory allocation/synchronization.
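
For scenario D, sharing between cards goes through host memory or explicit copies. A minimal sketch moving data from one root-device to another via a host staging buffer; it assumes at least two Level-Zero GPU root-devices are present:

``` C++
#include <CL/sycl.hpp>
#include <vector>

using namespace cl::sycl;

int main() {
  // All GPU root-devices of the platform in one context (scenario D).
  device D0(gpu_selector{});
  std::vector<device> GPUs = D0.get_platform().get_devices(info::device_type::gpu);
  if (GPUs.size() < 2)
    return 0; // Needs at least two cards.
  context Ctx(GPUs);

  queue Q0(Ctx, GPUs[0]), Q1(Ctx, GPUs[1]);
  constexpr size_t N = 1024;
  int *Src = malloc_device<int>(N, Q0);
  int *Dst = malloc_device<int>(N, Q1);

  Q0.parallel_for(range<1>{N}, [=](id<1> I) { Src[I] = int(I[0]); }).wait();

  // Explicit copy between the cards via a host staging buffer.
  std::vector<int> Staging(N);
  Q0.memcpy(Staging.data(), Src, N * sizeof(int)).wait();
  Q1.memcpy(Dst, Staging.data(), N * sizeof(int)).wait();

  free(Src, Ctx);
  free(Dst, Ctx);
  return 0;
}
```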

## 5 Examples

These are a few examples of programming to multiple tiles and multiple cards:
- https://github.com/jeffhammond/PRK/blob/dpct/Cxx11/dgemm-multigpu-onemkl.cc
- https://github.com/pvelesko/PPP/tree/master/languages/c%2B%2B/sycl/gpu2gpu
