Skip to content

[SYCL][Doc] device_global: device_image_scope Update #11212

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Oct 19, 2023

Conversation

artemrad
Copy link
Contributor

Two issues have been identified with the wording of the device_image_scope property.

  1. The spec doesn't currently state that all accesses to the device_global from device code must be from the same device.
    • In the JiT flow you can build a device_image containing a device_global and then run kernels from that device_image on multiple targets. By the current wording of the spec, we were worried that, even with device_image_scope, a user could expect that device_global to retain its value across device.
  2. The spec doesn't constrain which queues can be used to access a device_global from host code.
    • A queue is associated with a specific device. We were worried that it’s legal to copy to/from a device that doesn’t use a device_image_scope device_global and that that would require every device image to have a copy of every device global.
    • How should copies behave when no device_image accesses the device global?

@artemrad artemrad requested a review from a team as a code owner September 18, 2023 20:19
@Pennycook Pennycook assigned GarveyJoe and unassigned GarveyJoe Sep 18, 2023
@Pennycook Pennycook requested a review from GarveyJoe September 18, 2023 20:56
@artemrad
Copy link
Contributor Author

@GarveyJoe @gmlueck @steffenlarsen Please take a look and let me know what you think.

Copy link
Contributor

@steffenlarsen steffenlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the use of "instance" here is fine, but I am unsure what is meant by "queue whose device cannot load a device image that references the device global." In what case would that happen? Is this not already implied by the fact that the device image could be loaded onto the device for the copy operation?

@artemrad
Copy link
Contributor Author

Is this not already implied by the fact that the device image could be loaded onto the device for the copy operation?

Correct. What I am trying to communicate is what happens in the scenario below. My opinion is that this is UB. I think this, because there is no device_image that a device_global can be associated with, and so there is no device_global to access.

device_global<int, decltype(properties(device_image_scope))> dg;

int main () {
  queue q;
  int val;
  
  // can't load a device_image that references a device_global - what is expected behaviour?
  // is this nonsensical?
  q.copy(&val, dg); 

  int * other = malloc_host<int>(1, q);
  q.single_task([=]() {
    *other = 7;
  });

  // what is expected here?
  q.copy(dg, &val); 

}

@gmlueck
Copy link
Contributor

gmlueck commented Sep 22, 2023

Correct. What I am trying to communicate is what happens in the scenario below. My opinion is that this is UB. I think this, because there is no device_image that a device_global can be associated with, and so there is no device_global to access.

This is covered in the design document for the device global feature. See the section "Implementing the copy and memcpy functions in queue and handler":

It is possible that a device global variable with device_image_scope is not referenced by any kernel, in which case the variable's unique string will not exist in any property set. In this case, the runtime simply uses the host instance of the device_global variable to hold the value and copies to or from the val member.

@@ -615,18 +615,18 @@ global variable is allocated for each device, and that instance is shared by
all kernels that belong to the same context and are submitted to the same device,
regardless of which _device image_ contains the kernel. When this property is
specified, it is an assertion by the user that the device global is referenced
only from kernels that are contained by the same _device image_. An
only from kernels that are contained by the same _device image_ instance, a _device image_ loaded onto a specific device. An
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you made this change to address this point in the PR description:

  1. The spec doesn't currently state that all accesses to the device_global from device code must be from the same device.
    • In the JiT flow you can build a device_image containing a device_global and then run kernels from that device_image on multiple targets. By the current wording of the spec, we were worried that, even with device_image_scope, a user could expect that device_global to retain its value across device.

I think this is not a valid concern, and I'd rather not make the change you suggest here.

The first paragraph of the section "Representation of device globals" says:

device_global provides a mechanism to allocate device scope memory - memory which has unique underlying storage (of type T) for each sycl::device and sycl::context combination. If multiple valid device and context combinations are present then each receives its own unique underlying allocation. All kernels that reference the same device_global entity (either directly or via a pointer to its underlying object of type T) share the same allocation of that object when those kernels run on the same device and context.

All device_global variables store values that are scoped to a particular device, and this paragraph is attempting to state that. This is true regardless of whether the device global uses the device_image_scope property.

While reading this, though, I think there is a different problem with the spec. The paragraph I quoted above says that the storage of a device global variable is unique for each combination of device and context. I think the part about context is not true when using the device_image_scope property. In that case, I think the storage is the same regardless of context.

Even when not using device_image_scope, I think that paragraph is not quite right regarding context. Instead, I think the implementation has a requirement that all queues that access a device global variable must use the same context. Tagging @steffenlarsen for his opinion here.

It occurs to me now that this is an unfortunate limitation, which could make it difficult to use device global variables in some programs. I think the limitation comes from the requirement that USM device memory may only be accessed from queues that have the same context as the context used to allocate that USM memory. I wonder if there should be some way to allocate USM device memory that can be accessed from any context that is valid for the device on which it is allocated? Tagging @bashbaug for his thoughts here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which could make it difficult to use device global variables in some programs.

I think the proper "fix" for this issue is to minimize the number of contexts per device in a program, ideally down to just one, unless an application chooses to explicitly opt-in to multiple contexts. We're already doing this with things like the "default context" that make it harder to accidentally create more contexts than intended.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, this is a good BKM. However, the problem remains for applications that opt-in to using multiple contexts. For these applications, it will be hard to use device global variables.

Is there a good reason to have such strict context rules for USM device memory? If this type of memory is allocated on device A with context C1, is it really necessary that a kernel running on device A must also be launched with context C1 in order to access it? Regardless of context, it seems like a kernel running on device A should have access to any device memory allocated on that same device. Is there really something in the hardware or low-level software stacks that requires the contexts to be the same?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there really something in the hardware or low-level software stacks that requires the contexts to be the same?

I think this discussion is expanding beyond the scope of this PR, but in general: yes, the contexts need to be the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @bashbaug. I agree it's a bit of a tangent, but this discussion points to an error in that first paragraph in "Representation of device globals" that I quote above.

I'll describe the correction I think we need to make, but we can do that in a different PR. My overall point is that the change proposed in this PR is not necessary because of this paragraph, which already states that the value of a device_global is per-device. I think the correct wording of that paragraph should be like this:

device_global provides a mechanism to allocate device scope memory - memory which has unique underlying storage (of type T) for each sycl::device. All kernels that reference the same device_global entity (either directly or via a pointer to its underlying object of type T) share the same allocation of that object when those kernels run on the same device.

Accesses to device_global have an important limitation with respect to the sycl::context. All accesses to a device_global variable from a specific device must use the same context. To illustrate, consider a kernel that is submitted to device D using context C, where the kernel accesses a device_global variable V. Any time that kernel is submitted to device D, it must be submitted using context C. Any other kernels that access the same variable V must also use context C when submitted to device D. Any copy operations to or from variable V on device D must also use context C.

however no diagnostic is required for an indirect access from another _device
image_.
image_ instance.

When a device global is decorated with this property, the implementation
re-initializes it whenever the _device image_ is loaded onto the device. As a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's orthogonal to this change, but should we simplify this by saying that it is implementation defined when a device_image_scope is reinitialized and leave the specifics to our docs? I worry that even the little we've defined here is wrong. "Loaded onto the device" is imprecise and not totally correct. For example, if I run the same host program containing only one device image twice in a row we re-initialize the device_image_scope device_globals even if reloading the device image is not necessary. Not doing so would cause program behaviour to be dependent on previously run programs which would result in very hard to debug issues for users and potentially security vulnerabilities. On the FPGA, a precise definition would be:

Define an "activity associated with a device image" as a run of a kernel defined in that device image or a copy to or from a device_image_scope device_global defined in that device image.

Define the "currently active" device image for a device as the last device image to have an associated activity on this device.

On FPGA devices we reinitialize device_image_scope device_globals before any activities associated with that device image on that device if that device image is not the currently active device image.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From our offline discussion, we could go with something like this to be more useful than just saying "implementation defined" while still providing some implementation flexibility:

When a device global is decorated with this property, in addition to the general device_global rules about initialization mentioned previously, the implementation may also re-initialize it at implementation-defined times. However, the implementation is guaranteed not to re-initialize the device_global on a given deveice if all kernels run and all copied to/from device_image_scope device_globals on that device are from the same device image.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this:

A device global variable is guaranteed to be initialized for a device prior to the first time it is accessed (whether from a kernel or a copy operation). Device globals may also be re-initialized at implementation-defined times if multiple device images are used on the same device. To avoid unexpected re-initializations, applications should ensure that all kernels that are enqueued to a device D come from the same device image. In addition, applications should ensure that all device global copy operation enqueued to device D correspond to that same device image.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like @gmlueck version better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that.

Comment on lines 1043 to 1047
If `PropertyListT` contains the `device_image_scope` property, kernels in the
_device image_ containing the _dest_ variable must be compatible with the
device that is associated with the `handler`. If this is not the case, the
implementation throws an `exception` with the `errc::kernel_not_supported`
error code.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gmlueck I don't like the term compatible here. It is not clear that we are going to throw an error if there are no kernels that use the device_global.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wording that I proposed about "compatible" was intended to address this concern that you raised:

It is undefined behavior to copy to or from a device global on a queue whose device cannot load a device image that references the device global

I.e. "cannot load" --> "not compatible"

The request to throw an error when no kernel references the device global is something new that we just decided in the meeting today. We would need some new wording to specify that error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this error be a runtime error? I was speaking with Artem today and it came up that there would be instances where we don't know in all cases if a particular device_image is going to be loaded or not, which would make this difficult to implement as a compile time error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gmlueck how about?

If PropertyListT contains the device_image_scope property, at least one kernel in the
device image containing the dest variable must access the dest variable. If this is not the case, the
implementation throws an exception with the errc::kernel_not_supported
error code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justin-rosner yes this will be a runtime error.

@againull againull changed the title device_global: device_image_scope Update [SYCL][Doc] device_global: device_image_scope Update Oct 19, 2023
@againull againull merged commit be8e031 into intel:sycl Oct 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants