Skip to content

Commit f1992a0

Browse files
authored
[SYCL][Graph] Support for native-command (#16871)
Support [sycl_ext_codeplay_enqueue_native_command](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_codeplay_enqueue_native_command.asciidoc) with SYCL-Graph for all of L0, CUDA, HIP, and OpenCL backends. Introduces `interop_handle::ext_oneapi_get_native_graph<backend>()` to give the user access to the native graph object which native commands can be appended to. Implemented using new UR command-buffer entry-points `urCommandBufferAppendNativeCommandExp` and `urCommandBufferGetNativeHandleExp`. To use CUDA as an example, code using `ext_codeplay_enqueue_native_command` eagerly can be updated from: ```cpp CGH.ext_codeplay_enqueue_native_command([=](interop_handle IH) { auto NativeStream = IH.get_native_queue<cuda>(); myNativeLibraryCall(NativeStream); } ``` To ```cpp CGH.ext_codeplay_enqueue_native_command([=](interop_handle IH) { if (IH.ext_oneapi_has_graph()) { auto NativeGraph = IH.ext_oneapi_get_native_graph<cuda>(); auto NativeStream = IH.get_native_queue<cuda>(); // Start capture stream calls into graph cuStreamBeginCaptureToGraph(NativeStream, NativeGraph, nullptr, nullptr, 0, CU_STREAM_CAPTURE_MODE_GLOBAL); myNativeLibraryCall(NativeStream); // Stop capturing stream calls into graph cuStreamEndCapture(NativeStream, &NativeGraph); } else { auto NativeStream = IH.get_native_queue<cuda>(); myNativeLibraryCall(NativeStream ); } } ``` Example of how this integration could work in GROMACS https://gitlab.com/gromacs/gromacs/-/merge_requests/4954
1 parent d43aef3 commit f1992a0

File tree

73 files changed

+2951
-45
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

73 files changed

+2951
-45
lines changed

sycl/doc/extensions/experimental/sycl_ext_codeplay_enqueue_native_command.asciidoc

Lines changed: 238 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -53,12 +53,16 @@ specification.*
5353

5454
== Backend support status
5555

56-
This extension is currently implemented in {dpcpp} only for GPU devices and
56+
The base extension is currently implemented in {dpcpp} only for GPU devices and
5757
only when using the CUDA or HIP backends. Attempting to use this extension in
5858
kernels that run on other devices or backends may result in undefined
5959
behavior. Be aware that the compiler is not able to issue a diagnostic to
6060
warn you if this happens.
6161

62+
The semantics in the <<sycl-graph-interaction, SYCL-Graph Interaction>> section
63+
however are implemented for all of Level-Zero, OpenCL, CUDA and HIP devices.
64+
Where support is conditional on the device reporting the
65+
`aspect::ext_oneapi_limited_graph` aspect.
6266

6367
== Overview
6468

@@ -173,13 +177,118 @@ dependencies are satisfied.
173177
The SYCL command described above completes once all of the native asynchronous
174178
tasks it contains have completed.
175179

176-
The call to `interopCallable` must not submit any synchronous tasks to the
177-
native backend object, and it must not block waiting for any tasks to complete.
178-
The call also must not add tasks to backend objects that underly any other
179-
queue, aside from the queue that is associated with this handler. If it does
180-
any of these things, the behavior is undefined.
180+
The call to `interopCallable` must not add tasks to backend objects that underly
181+
any other queue, aside from the queue that is associated with this handler,
182+
otherwise, the behavior is undefined.
181183

182-
== Example
184+
[_Note:_ The function object `interopCallable` is invoked to enqueue commands to a
185+
native queue or graph and therefore, APIs which block or synchronize could
186+
prolong or interfere with other commands being enqueued to the backend.
187+
_{endnote}_]
188+
189+
=== SYCL-Graph Interaction
190+
191+
This section defines the interaction with the
192+
link:../experimental/sycl_ext_oneapi_graph.asciidoc[sycl_ext_oneapi_graph]
193+
extension.
194+
195+
The `interopCallable` object will be invoked during `command_graph::finalize()`
196+
when the backend object for the graph is available to give to the user as a
197+
handle. The user may then add nodes using native APIs to the backend graph
198+
object queried with `interop_handle::ext_codeplay_get_native_graph()`. The
199+
runtime will schedule the dependencies of the user added nodes such
200+
that they respect the graph node edges.
201+
202+
==== Interop Handle Class Modifications
203+
204+
```c++
205+
// Alias is for editorial brevity in the ext_codeplay_get_native_graph
206+
// definition, and is non-normative.
207+
using graph = ext::oneapi::experimental::command_graph<
208+
ext::oneapi::experimental::graph_state::executable>;
209+
210+
class interop_handle {
211+
bool ext_codeplay_has_graph() const;
212+
213+
template <backend Backend>
214+
backend_return_t<Backend, graph> ext_codeplay_get_native_graph() const;
215+
};
216+
```
217+
218+
==== New Interop Handle Member Functions
219+
220+
Table {counter: tableNumber}. Additional member functions of the `sycl::interop_handle` class.
221+
[cols="2a,a"]
222+
|===
223+
|Member function|Description
224+
225+
|
226+
[source,c++]
227+
----
228+
bool interop_handle::ext_codeplay_has_graph() const;
229+
----
230+
231+
|
232+
_Returns_: True if the `interop_handle object` was constructed and passed to
233+
an enqueue native command function object by `ext_codeplay_enqueue_native_command`,
234+
that was invoked when adding a graph node, either explicitly or implicitly
235+
via graph record.
236+
237+
[_Note:_ that host-task nodes in a `command_graph` will return `false` from this
238+
query, as the host-task callable is invoked during graph execution rather than
239+
graph finalization.
240+
_{endnote}_]
241+
242+
|
243+
[source,c++]
244+
----
245+
template <backend Backend>
246+
backend_return_t<Backend, graph>
247+
interop_handle::ext_codeplay_get_native_graph() const;
248+
----
249+
250+
|
251+
_Returns_: The native graph object associated with the `interop_handle`.
252+
253+
_Throws_: An exception with the `errc::invalid` error code if
254+
`ext_codeplay_has_graph()` returns `false`.
255+
256+
|===
257+
258+
== Implementation Notes
259+
260+
When `interop_handle::get_native_queue()` is invoked in a native command
261+
function object on graph finalize, the queue that is returned to the user is an
262+
internal queue created by the SYCL runtime, as there is no user provided queue
263+
at the point of graph finalization. This queue has the same device and context
264+
as the graph was created with. The only valid usage of this queue is to perform
265+
stream capture to a graph for backend APIs that provide this functionality.
266+
267+
Table {counter: tableNumber}. Native types for
268+
`template <backend Backend, class T> backend_return_t<Backend, T>` where `T` is
269+
instantiated as `command_graph<graph_state::executable>`.
270+
271+
[cols="2a,a"]
272+
|===
273+
|Backend|Native Graph Type
274+
275+
| `backend::opencl`
276+
| `cl_command_buffer_khr`
277+
278+
| `backend::ext_oneapi_level_zero`
279+
| `ze_command_list_handle_t`
280+
281+
| `backend::ext_oneapi_cuda`
282+
| `CUGraph`
283+
284+
| `backend::ext_oneapi_hip`
285+
| `hipGraph_t`
286+
287+
|===
288+
289+
== Examples
290+
291+
=== HIP Native Task
183292

184293
This example demonstrates how to use this extension to enqueue asynchronous
185294
native tasks on the HIP backend.
@@ -197,21 +306,135 @@ q.submit([&](sycl::handler &cgh) {
197306

198307
// Can enqueue arbitrary backend operations. This could also be a kernel
199308
// launch or call to a library that enqueues operations on the stream etc
200-
//
201-
// Important: Enqueuing a *synchronous* backend operation results in
202-
// undefined behavior.
203309
hipMemcpyAsync(target_ptr, native_mem, test_size * sizeof(int),
204310
hipMemcpyDeviceToHost, stream);
205311
});
206312
});
207313
q.wait();
208314
```
209315

210-
== Issues
316+
=== Level-Zero Add Native Graph Node
211317

212-
=== sycl_ext_oneapi_graph
318+
This example demonstrates how to use this extension to add a native command
319+
to a SYCL-Graph object on the Level-Zero backend. The command is doing a memory
320+
copy between two USM pointers.
213321

214-
`ext_codeplay_enqueue_native_command`
215-
cannot be used in graph nodes. A synchronous exception will be thrown with error
216-
code `invalid` if a user tries to add them to a graph.
322+
```c++
323+
Graph.add([&](sycl::handler &CGH) {
324+
CGH.ext_codeplay_enqueue_native_command([=](sycl::interop_handle IH) {
325+
ze_command_list_handle_t NativeGraph =
326+
IH.ext_codeplay_get_native_graph<sycl::backend::ext_oneapi_level_zero>();
217327

328+
zeCommandListAppendMemoryCopy(
329+
NativeGraph, PtrY, PtrX, Size * sizeof(int), nullptr, 0, nullptr);
330+
});
331+
});
332+
```
333+
334+
=== OpenCL Add Native Graph Node
335+
336+
This example demonstrates how to use this extension to add a native command to
337+
a SYCL-Graph object on the OpenCL backend. The command is doing a copy between
338+
two buffer objects.
339+
340+
```c++
341+
sycl::queue Queue;
342+
auto Platform = get_native<sycl::backend::opencl>(Queue.get_context().get_platform());
343+
clCommandCopyBufferKHR_fn clCommandCopyBufferKHR =
344+
reinterpret_cast<clCommandCopyBufferKHR_fn>(
345+
clGetExtensionFunctionAddressForPlatform(Platform, "clCommandCopyBufferKHR"));
346+
347+
Graph.add([&](sycl::handler &CGH) {
348+
auto AccX = BufX.get_access(CGH);
349+
auto AccY = BufY.get_access(CGH);
350+
CGH.ext_codeplay_enqueue_native_command([=](sycl::interop_handle IH) {
351+
cl_command_buffer_khr NativeGraph =
352+
IH.ext_codeplay_get_native_graph<sycl::backend::opencl>();
353+
auto SrcBuffer = IH.get_native_mem<sycl::backend::opencl>(AccX);
354+
auto DstBuffer = IH.get_native_mem<sycl::backend::opencl>(AccY);
355+
356+
clCommandCopyBufferKHR(
357+
NativeGraph, nullptr, nullptr, SrcBuffer[0], DstBuffer[0], 0, 0,
358+
Size * sizeof(int), 0, nullptr, nullptr, nullptr);
359+
});
360+
});
361+
```
362+
363+
=== CUDA Add Native Graph Node
364+
365+
This example demonstrates how to use this extension to add a native command to
366+
a SYCL-Graph object on the CUDA backend. The command is doing a memory copy
367+
between two device USM pointers.
368+
369+
```c++
370+
Graph.add([&](sycl::handler &CGH) {
371+
CGH.ext_codeplay_enqueue_native_command([=](sycl::interop_handle IH) {
372+
CUgraph NativeGraph =
373+
IH.ext_codeplay_get_native_graph<sycl::backend::ext_oneapi_cuda>();
374+
375+
CUDA_MEMCPY3D Params;
376+
std::memset(&Params, 0, sizeof(CUDA_MEMCPY3D));
377+
Params.srcMemoryType = CU_MEMORYTYPE_DEVICE;
378+
Params.srcDevice = (CUdeviceptr)PtrX;
379+
Params.srcHost = nullptr;
380+
Params.dstMemoryType = CU_MEMORYTYPE_DEVICE;
381+
Params.dstDevice = (CUdeviceptr)PtrY;
382+
Params.dstHost = nullptr;
383+
Params.WidthInBytes = Size * sizeof(int);
384+
Params.Height = 1;
385+
Params.Depth = 1;
386+
387+
CUgraphNode Node;
388+
CUcontext Context = IH.get_native_context<sycl::backend::ext_oneapi_cuda>();
389+
cuGraphAddMemcpyNode(&Node, NativeGraph, nullptr, 0, &Params, Context);
390+
});
391+
});
392+
```
393+
394+
=== HIP Add Native Graph Node
395+
396+
This example demonstrates how to use this extension to add a native command to
397+
a SYCL-Graph object on the HIP backend. The command is doing a memory copy
398+
between two device USM pointers.
399+
400+
```c++
401+
Graph.add([&](sycl::handler &CGH) {
402+
CGH.ext_codeplay_enqueue_native_command([=](sycl::interop_handle IH) {
403+
HIPGraph NativeGraph =
404+
IH.ext_codeplay_get_native_graph<sycl::backend::ext_oneapi_hip>();
405+
406+
HIPGraphNode Node;
407+
hipGraphAddMemcpyNode1D(&Node, NativeGraph, nullptr, 0, PtrY, PtrX,
408+
Size * sizeof(int), hipMemcpyDefault);
409+
});
410+
});
411+
```
412+
413+
=== CUDA Stream Record Native Graph Nodes
414+
415+
This example demonstrates how to use this extension to add stream recorded
416+
native nodes to a SYCL-Graph object on the CUDA backend.
417+
418+
```c++
419+
q.submit([&](sycl::handler &CGH) {
420+
CGH.ext_codeplay_enqueue_native_command([=](sycl::interop_handle IH) {
421+
auto NativeStream = h.get_native_queue<cuda>();
422+
if (IH.ext_codeplay_has_graph()) {
423+
auto NativeGraph =
424+
IH.ext_codeplay_get_native_graph<sycl::backend::ext_oneapi_cuda>();
425+
426+
// Start capture stream calls into graph
427+
cuStreamBeginCaptureToGraph(NativeStream, NativeGraph, nullptr,
428+
nullptr, 0,
429+
CU_STREAM_CAPTURE_MODE_GLOBAL);
430+
431+
myNativeLibraryCall(NativeStream);
432+
433+
// Stop capturing stream calls into graph
434+
cuStreamEndCapture(NativeStream, &NativeGraph);
435+
} else {
436+
myNativeLibraryCall(NativeStream);
437+
}
438+
});
439+
});
440+
```

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc

Lines changed: 3 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -2119,13 +2119,10 @@ extensions.
21192119

21202120
==== sycl_ext_codeplay_enqueue_native_command
21212121

2122-
`ext_codeplay_enqueue_native_command`, defined in
2122+
The new methods defined by
21232123
link:../experimental/sycl_ext_codeplay_enqueue_native_command.asciidoc[sycl_ext_codeplay_enqueue_native_command]
2124-
cannot be used in graph nodes. A synchronous exception will be thrown with error
2125-
code `invalid` if a user tries to add them to a graph.
2126-
2127-
Removing this restriction is something we may look at for future revisions of
2128-
`sycl_ext_oneapi_graph`.
2124+
can be used in graph nodes. For futher details see the section on
2125+
link:../experimental/sycl_ext_codeplay_enqueue_native_command.asciidoc#sycl-graph-interaction[SYCL-Graph interaction].
21292126

21302127
==== sycl_ext_intel_queue_index
21312128

sycl/doc/extensions/supported/sycl_ext_oneapi_backend_level_zero.md

Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -270,6 +270,28 @@ struct {
270270
```
271271
</td>
272272
</tr>
273+
274+
<tr>
275+
<td>
276+
277+
``` C++
278+
ext::oneapi::experimental::command_graph<
279+
ext::oneapi::experimental::graph_state::executable>
280+
```
281+
282+
See [sycl_ext_oneapi_graph](../experimental/sycl_ext_oneapi_graph.asciidoc)
283+
284+
</td>
285+
286+
<td>
287+
288+
``` C++
289+
ze_command_list_handle_t
290+
```
291+
</td>
292+
<td></td>
293+
</tr>
294+
273295
</table>
274296

275297
### 4.2 Obtaining of native Level-Zero handles from SYCL objects
@@ -642,3 +664,4 @@ The behavior of the SYCL buffer destructor depends on the Ownership flag. As wit
642664
|10|2022-08-18|Sergey Maslov|Moved free_memory device info query to be sycl_ext_intel_device_info extension
643665
|11|2023-03-14|Rajiv Deodhar|Added support for Level Zero immediate command lists
644666
|12|2023-04-06|Chris Perkins|Introduced make_image() API
667+
|13|2023-04-06|Ewan Crawford|Add backend_return_t for SYCL-Graph

sycl/doc/syclgraph/SYCLGraphUsageGuide.md

Lines changed: 51 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -171,6 +171,57 @@ In this case it may be necessary to first manually trigger the warmup by calling
171171
`Graph.begin_recording(Queue)` to prevent the warmup from being captured in a
172172
graph when recording.
173173
174+
#### ext_codeplay_enqueue_native_command
175+
176+
The SYCL-Graph extension is compatible with the
177+
[ext_codeplay_enqueue_native_command](../extensions/experimental/sycl_ext_codeplay_enqueue_native_command.asciidoc)
178+
extension that can be used to capture asynchronous library commands as graph
179+
nodes. However, existing `ext_codeplay_enqueue_native_command` user code will
180+
need modifications to work correctly for submission to a sycl queue that can be
181+
in either the executable or recording state.
182+
183+
Using the CUDA backend as an example, existing code which uses a
184+
native-command to invoke a library call:
185+
186+
```c++
187+
q.submit([&](sycl::handler &CGH) {
188+
CGH.ext_codeplay_enqueue_native_command([=](sycl::interop_handle IH) {
189+
auto NativeStream = IH.get_native_queue<cuda>();
190+
myNativeLibraryCall(NativeStream);
191+
});
192+
});
193+
```
194+
195+
Can be ported as below to work with SYCL-Graph, where the queue may be in
196+
a recording state. If the code is not ported but the queue is in a recording
197+
state, then asynchronous work in `myNativeLibraryCall` will be scheduled
198+
immediately as part of graph finalize, rather than being added to the graph as
199+
a node, which is unlikely to be the desired user behavior.
200+
201+
```c++
202+
q.submit([&](sycl::handler &CGH) {
203+
CGH.ext_codeplay_enqueue_native_command([=](sycl::interop_handle IH) {
204+
auto NativeStream = h.get_native_queue<cuda>();
205+
if (IH.ext_codeplay_has_graph()) {
206+
auto NativeGraph =
207+
IH.ext_codeplay_get_native_graph<sycl::backend::ext_oneapi_cuda>();
208+
209+
// Start capture stream calls into graph
210+
cuStreamBeginCaptureToGraph(NativeStream, NativeGraph, nullptr,
211+
nullptr, 0,
212+
CU_STREAM_CAPTURE_MODE_GLOBAL);
213+
214+
myNativeLibraryCall(NativeStream);
215+
216+
// Stop capturing stream calls into graph
217+
cuStreamEndCapture(NativeStream, &NativeGraph);
218+
} else {
219+
myNativeLibraryCall(NativeStream);
220+
}
221+
});
222+
});
223+
```
224+
174225
## Code Examples
175226

176227
The examples below demonstrate intended usage of the extension, but may not be

0 commit comments

Comments
 (0)