-
Notifications
You must be signed in to change notification settings - Fork 790
[SYCL][Doc] device_global: device_image_scope Update #11212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@GarveyJoe @gmlueck @steffenlarsen Please take a look and let me know what you think. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the use of "instance" here is fine, but I am unsure what is meant by "queue whose device cannot load a device image that references the device global." In what case would that happen? Is this not already implied by the fact that the device image could be loaded onto the device for the copy operation?
sycl/doc/extensions/experimental/sycl_ext_oneapi_device_global.asciidoc
Outdated
Show resolved
Hide resolved
Correct. What I am trying to communicate is what happens in the scenario below. My opinion is that this is UB. I think this, because there is no device_image that a device_global can be associated with, and so there is no device_global to access. device_global<int, decltype(properties(device_image_scope))> dg;
int main () {
queue q;
int val;
// can't load a device_image that references a device_global - what is expected behaviour?
// is this nonsensical?
q.copy(&val, dg);
int * other = malloc_host<int>(1, q);
q.single_task([=]() {
*other = 7;
});
// what is expected here?
q.copy(dg, &val);
} |
This is covered in the design document for the device global feature. See the section "Implementing the copy and memcpy functions in queue and handler":
|
@@ -615,18 +615,18 @@ global variable is allocated for each device, and that instance is shared by | |||
all kernels that belong to the same context and are submitted to the same device, | |||
regardless of which _device image_ contains the kernel. When this property is | |||
specified, it is an assertion by the user that the device global is referenced | |||
only from kernels that are contained by the same _device image_. An | |||
only from kernels that are contained by the same _device image_ instance, a _device image_ loaded onto a specific device. An |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you made this change to address this point in the PR description:
- The spec doesn't currently state that all accesses to the device_global from device code must be from the same device.
- In the JiT flow you can build a device_image containing a device_global and then run kernels from that device_image on multiple targets. By the current wording of the spec, we were worried that, even with device_image_scope, a user could expect that device_global to retain its value across device.
I think this is not a valid concern, and I'd rather not make the change you suggest here.
The first paragraph of the section "Representation of device globals" says:
device_global
provides a mechanism to allocate device scope memory - memory which has unique underlying storage (of type T) for eachsycl::device
andsycl::context
combination. If multiple valid device and context combinations are present then each receives its own unique underlying allocation. All kernels that reference the samedevice_global
entity (either directly or via a pointer to its underlying object of type T) share the same allocation of that object when those kernels run on the same device and context.
All device_global
variables store values that are scoped to a particular device, and this paragraph is attempting to state that. This is true regardless of whether the device global uses the device_image_scope
property.
While reading this, though, I think there is a different problem with the spec. The paragraph I quoted above says that the storage of a device global variable is unique for each combination of device
and context
. I think the part about context
is not true when using the device_image_scope
property. In that case, I think the storage is the same regardless of context
.
Even when not using device_image_scope
, I think that paragraph is not quite right regarding context
. Instead, I think the implementation has a requirement that all queues that access a device global variable must use the same context
. Tagging @steffenlarsen for his opinion here.
It occurs to me now that this is an unfortunate limitation, which could make it difficult to use device global variables in some programs. I think the limitation comes from the requirement that USM device memory may only be accessed from queues that have the same context as the context used to allocate that USM memory. I wonder if there should be some way to allocate USM device memory that can be accessed from any context that is valid for the device on which it is allocated? Tagging @bashbaug for his thoughts here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which could make it difficult to use device global variables in some programs.
I think the proper "fix" for this issue is to minimize the number of contexts per device in a program, ideally down to just one, unless an application chooses to explicitly opt-in to multiple contexts. We're already doing this with things like the "default context" that make it harder to accidentally create more contexts than intended.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, this is a good BKM. However, the problem remains for applications that opt-in to using multiple contexts. For these applications, it will be hard to use device global variables.
Is there a good reason to have such strict context rules for USM device memory? If this type of memory is allocated on device A with context C1, is it really necessary that a kernel running on device A must also be launched with context C1 in order to access it? Regardless of context, it seems like a kernel running on device A should have access to any device memory allocated on that same device. Is there really something in the hardware or low-level software stacks that requires the contexts to be the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there really something in the hardware or low-level software stacks that requires the contexts to be the same?
I think this discussion is expanding beyond the scope of this PR, but in general: yes, the contexts need to be the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @bashbaug. I agree it's a bit of a tangent, but this discussion points to an error in that first paragraph in "Representation of device globals" that I quote above.
I'll describe the correction I think we need to make, but we can do that in a different PR. My overall point is that the change proposed in this PR is not necessary because of this paragraph, which already states that the value of a device_global
is per-device. I think the correct wording of that paragraph should be like this:
device_global
provides a mechanism to allocate device scope memory - memory which has unique underlying storage (of type T) for eachsycl::device
. All kernels that reference the samedevice_global
entity (either directly or via a pointer to its underlying object of type T) share the same allocation of that object when those kernels run on the same device.Accesses to
device_global
have an important limitation with respect to thesycl::context
. All accesses to adevice_global
variable from a specific device must use the same context. To illustrate, consider a kernel that is submitted to device D using context C, where the kernel accesses adevice_global
variable V. Any time that kernel is submitted to device D, it must be submitted using context C. Any other kernels that access the same variable V must also use context C when submitted to device D. Any copy operations to or from variable V on device D must also use context C.
sycl/doc/extensions/experimental/sycl_ext_oneapi_device_global.asciidoc
Outdated
Show resolved
Hide resolved
however no diagnostic is required for an indirect access from another _device | ||
image_. | ||
image_ instance. | ||
|
||
When a device global is decorated with this property, the implementation | ||
re-initializes it whenever the _device image_ is loaded onto the device. As a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's orthogonal to this change, but should we simplify this by saying that it is implementation defined when a device_image_scope is reinitialized and leave the specifics to our docs? I worry that even the little we've defined here is wrong. "Loaded onto the device" is imprecise and not totally correct. For example, if I run the same host program containing only one device image twice in a row we re-initialize the device_image_scope device_globals even if reloading the device image is not necessary. Not doing so would cause program behaviour to be dependent on previously run programs which would result in very hard to debug issues for users and potentially security vulnerabilities. On the FPGA, a precise definition would be:
Define an "activity associated with a device image" as a run of a kernel defined in that device image or a copy to or from a device_image_scope device_global defined in that device image.
Define the "currently active" device image for a device as the last device image to have an associated activity on this device.
On FPGA devices we reinitialize device_image_scope device_globals before any activities associated with that device image on that device if that device image is not the currently active device image.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From our offline discussion, we could go with something like this to be more useful than just saying "implementation defined" while still providing some implementation flexibility:
When a device global is decorated with this property, in addition to the general device_global rules about initialization mentioned previously, the implementation may also re-initialize it at implementation-defined times. However, the implementation is guaranteed not to re-initialize the device_global on a given deveice if all kernels run and all copied to/from device_image_scope device_globals on that device are from the same device image.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this:
A device global variable is guaranteed to be initialized for a device prior to the first time it is accessed (whether from a kernel or a copy operation). Device globals may also be re-initialized at implementation-defined times if multiple device images are used on the same device. To avoid unexpected re-initializations, applications should ensure that all kernels that are enqueued to a device D come from the same device image. In addition, applications should ensure that all device global copy operation enqueued to device D correspond to that same device image.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like @gmlueck version better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that.
If `PropertyListT` contains the `device_image_scope` property, kernels in the | ||
_device image_ containing the _dest_ variable must be compatible with the | ||
device that is associated with the `handler`. If this is not the case, the | ||
implementation throws an `exception` with the `errc::kernel_not_supported` | ||
error code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gmlueck I don't like the term compatible
here. It is not clear that we are going to throw an error if there are no kernels that use the device_global.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This wording that I proposed about "compatible" was intended to address this concern that you raised:
It is undefined behavior to copy to or from a device global on a queue whose device cannot load a device image that references the device global
I.e. "cannot load" --> "not compatible"
The request to throw an error when no kernel references the device global is something new that we just decided in the meeting today. We would need some new wording to specify that error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this error be a runtime error? I was speaking with Artem today and it came up that there would be instances where we don't know in all cases if a particular device_image is going to be loaded or not, which would make this difficult to implement as a compile time error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gmlueck how about?
If
PropertyListT
contains thedevice_image_scope
property, at least one kernel in the
device image containing the dest variable must access the dest variable. If this is not the case, the
implementation throws anexception
with theerrc::kernel_not_supported
error code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@justin-rosner yes this will be a runtime error.
Two issues have been identified with the wording of the
device_image_scope
property.