Skip to content

[SYCL][Doc] Expand device_global related PI API #5906

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
40 changes: 31 additions & 9 deletions sycl/doc/design/DeviceGlobal.md
Original file line number Diff line number Diff line change
Expand Up @@ -563,15 +563,37 @@ instance of a device global variable in a `pi_program`. This functionality is
exposed as two new PI interfaces:

```
pi_result piextCopyToDeviceVariable(pi_device Device, pi_program Program,
const char *name, const void *src, size_t count, size_t offset);

pi_result piextCopyFromDeviceVariable(pi_device Device, pi_program Program,
const char *name, void *dst, size_t count, size_t offset);
pi_result piextEnqueueDeviceVariableRead(pi_queue Queue, pi_program Program,
const char *Name, pi_bool BlockingRead,
size_t Count, size_t Offset, void *Dst,
pi_uint32 NumEventsInWaitList,
const pi_event *EventsWaitList,
pi_event *Event);

pi_result piextEnqueueDeviceVariableWrite(pi_queue Queue, pi_program Program,
const char *Name,
pi_bool BlockingWrite, size_t Count,
size_t Offset, const void *Src,
pi_uint32 NumEventsInWaitList,
const pi_event *EventsWaitList,
pi_event *Event);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not opposed to this, but I'm curious. Does the DPC++ runtime really need all this flexibility? For example, will it ever use this API with BlockingRead set to false? If not, why expose all this in the PI interface?

This is one of the criticisms I have about the PI interface in general. It just duplicates much of the OpenCL API even if the DPC++ runtime doesn't need all that flexibility. This makes the API more difficult to port to new backends because you need to implement stuff that is allowed in the PI API even though it's never used by the runtime.

In reality, backend ports often don't implement the PI interfaces fully because they know the runtime doesn't really needs all that functionality. This creates a sort of double standard. We have the "official" PI interface that no one really implements, and the "reality" version of the interface that isn't documented anywhere.

It seems like much of this would be improved if the PI interface exposed only the functionality that DPC++ really needs. So, does the DPC++ runtime really need the extra parameters you added?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I don't disagree with most of your sentiments. However, the feature generally boils down to getting the pointer to the device variable and then a generic copy to or from it, so I think it makes sense to have it in line with the other memory operations, which have similar parameters to this.

Additionally, I expect to use all parameters except for blocking_read and blocking_write. Note also that the first parameter has been changed from pi_device to pi_queue as it is an enqueued operation and we can get the device from the queue but not vice versa.

```

In both cases the `name` parameter is the same as the `sycl-unique-id` string
that is associated with the device global variable.
The `piextEnqueueDeviceVariableRead` function reads `Count` bytes at byte-offset
`Offset` from a device global variable in `Program` identified by the name
`Name`. The read data is stored in `Dst`. Likewise, the
`piextEnqueueDeviceVariableWrite` function reads `Count` bytes from `Dst` and
stores them at byte-offset `Offset` in the device global variable in `Program`
identified by the name `Name`.

Both functions will enqueue the associated memory command on `Queue` where it
will first wait for `NumEventsInWaitList` events in `EventsWaitList` to finish.
`Event` will be populated with the event associated with resulting enqueued
command. If either `BlockingRead` or `BlockingWrite` is `true` the call will
block on the host until the enqueued command finishes execution.

For `device_global` variables the `Name` parameter in calls to these functions
is the same as the associated `sycl-unique-id` string.

The Level Zero backend has existing APIs that can implement these PI
interfaces. The plugin first calls [`zeModuleGetGlobalPointer()`][8] to get a
Expand Down Expand Up @@ -616,8 +638,8 @@ depends upon implementation of that OpenCL extension.

[10]: <opencl-extensions/cl_intel_global_variable_access.asciidoc>

The CUDA backend has existing APIs `cudaMemcpyToSymbol()` and
`cudaMemcpyFromSymbol()` which can be used to implement these PI interfaces.
The CUDA backend has existing APIs `cuModuleGetGlobal()` and `cuMemcpyAsync()`
which can be used to implement these PI interfaces.


## Design choices
Expand Down