
[SYCL] Fix broken uniqueness in make_device #6204


Closed
wants to merge 5 commits into from

Conversation

JackAKirk
Copy link
Contributor

This is a general fix for #6055.
CUDA device interop is not available yet, but a corresponding fix will be added to the CUDA specialization of make_device in #6202 shortly.

If it is uncontroversial, I will add an overload of device::get_devices that takes a backend argument and returns only the devices corresponding to that backend. This would allow removing the dev.get_backend() == Backend check introduced in this PR and would make the implementation more efficient.

Note that I expect there could be similar uniqueness issues for other interop types, such as context. We will consider such issues in #6055.

I think that interop tests essentially comprising the code sample in #6055 would be useful. I can add such a test for device to intel/llvm-test-suite.

Signed-off-by: JackAKirk [email protected]

@smaslov-intel
Copy link
Contributor

Why is this not relying on the existing device-cache? I would expect that the new "device_impl" would not be created because of this code:

    // If we've already seen this device, return the impl

> If a device is created via the interop API it will always create a new device_impl even when there already exists a device_impl with the same backend handle.

I am not saying it is not a problem, but where in the SYCL 2020 OpenCL backend spec does it say what should be returned by "make_device"? Do we need to update it?

> Note that I expect there could be similar uniqueness issues for other interop types, such as context. We will consider such issues in #6055.

Why context? I think only platform and device should be such, no?
Others would create unique objects.

> I think that interop tests essentially comprising the code sample in #6055 would be useful. I can add such a test for device to intel/llvm-test-suite.

Yes, please add such test(s).

@JackAKirk
Copy link
Contributor Author

JackAKirk commented May 31, 2022

> Why is this not relying on the existing device-cache? I would expect that the new "device_impl" would not be created because of this code:
>
>     // If we've already seen this device, return the impl
>
> > If a device is created via the interop API it will always create a new device_impl even when there already exists a device_impl with the same backend handle.
>
> I am not saying it is not a problem, but where in the SYCL 2020 OpenCL backend spec does it say what should be returned by "make_device"? Do we need to update it?
>
> > Note that I expect there could be similar uniqueness issues for other interop types, such as context. We will consider such issues in #6055.
>
> Why context? I think only platform and device should be such, no? Others would create unique objects.
>
> > I think that interop tests essentially comprising the code sample in #6055 would be useful. I can add such a test for device to intel/llvm-test-suite.
>
> Yes, please add such test(s).

It is relying on the device-cache in the platform class since this patch.
You can see here:

    std::make_shared<device_impl>(PiDevice, Plugin));

that createSyclObjFromImpl<device> does not call the Platform->getOrMakeDeviceImpl method that checks the device-cache. This is why we needed to call device::get_devices, which does check the device-cache and, as far as I can tell, is the only existing route that checks the device-cache of all platforms within a backend.

I think you're right that this use-case and expected behaviour are not described in the OpenCL backend spec. However, the runtime currently allows the creation of a second shared_ptr that takes ownership of the native object in the sample code here: #6055.
It doesn't seem to be a good idea to allow multiple independent shared_ptrs to own the same object if it can be helped. What do you think?

I don't see any difference between the above motivation for device and the context case. However, a sycl::context maps to multiple CUDA contexts, so I'm not sure we can easily do much about this issue for context, at least for the CUDA backend (the same is true of queue and event). We are still finalizing the correct behaviour of interop within the CUDA backend. Device is the most straightforward interop case for the CUDA backend.

@smaslov-intel
Copy link
Contributor

> that createSyclObjFromImpl<device> does not call the Platform->getOrMakeDeviceImpl method that checks the device-cache. This is why we needed to call device::get_devices, which does check the device-cache and, as far as I can tell, is the only existing route that checks the device-cache of all platforms within a backend.

That's why I am asking: should we rather change "make_device" to be aware of the device-cache, instead of using get_devices where there was no need to read all the devices?

> I think you're right that this use-case and expected behaviour are not described in the OpenCL backend spec.

Would you spawn a spec issue to clarify this, please?

@JackAKirk
Copy link
Contributor Author

JackAKirk commented Jun 2, 2022

> > that createSyclObjFromImpl<device> does not call the Platform->getOrMakeDeviceImpl method that checks the device-cache. This is why we needed to call device::get_devices, which does check the device-cache and, as far as I can tell, is the only existing route that checks the device-cache of all platforms within a backend.
>
> That's why I am asking: should we rather change "make_device" to be aware of the device-cache, instead of using get_devices where there was no need to read all the devices?

MDeviceCache is a private member of platform_impl, so I thought it could be controversial to let make_device access it directly. It is true that if make_device were to access it directly the impl could be made a bit more efficient, if that is worth it.
Or did you mean just iterating, directly within make_device, over only the devices formed from platforms within the specified backend, as I have done in the latest commit, rather than overloading device::get_devices to take a backend argument?

> > I think you're right that this use-case and expected behaviour are not described in the OpenCL backend spec.
>
> Would you spawn a spec issue to clarify this, please?

OK

Signed-off-by: JackAKirk <[email protected]>
@smaslov-intel
Copy link
Contributor

> MDeviceCache is a private member of platform_impl, so I thought it could be controversial to let make_device access it directly. It is true that if make_device were to access it directly the impl could be made a bit more efficient, if that is worth it.
> Or did you mean just iterating, directly within make_device, over only the devices formed from platforms within the specified backend, as I have done in the latest commit, rather than overloading device::get_devices to take a backend argument?

platform.get_devices() uses platform_impl.getOrMakeDeviceImpl to do the caching. I think having make_device similarly obtain a device impl through platform_impl.getOrMakeDeviceImpl makes for a more consistent (and more performant) implementation.

@JackAKirk JackAKirk marked this pull request as draft July 14, 2022 14:08
@JackAKirk
Copy link
Contributor Author

JackAKirk commented Jul 14, 2022

> > MDeviceCache is a private member of platform_impl, so I thought it could be controversial to let make_device access it directly. It is true that if make_device were to access it directly the impl could be made a bit more efficient, if that is worth it.
> > Or did you mean just iterating, directly within make_device, over only the devices formed from platforms within the specified backend, as I have done in the latest commit, rather than overloading device::get_devices to take a backend argument?
>
> platform.get_devices() uses platform_impl.getOrMakeDeviceImpl to do the caching. I think having make_device similarly obtain a device impl through platform_impl.getOrMakeDeviceImpl makes for a more consistent (and more performant) implementation.

I don't think we can directly use platform_impl.getOrMakeDeviceImpl, because a backend can contain more than one platform. Therefore, if we do not find the device in MDeviceCache, we do not want to call:

  // Otherwise make the impl
  std::shared_ptr<device_impl> Result =
      std::make_shared<device_impl>(PiDevice, PlatformImpl);
  MDeviceCache.emplace_back(Result);

until we have checked all platforms in the backend.

I've followed your suggestion with a rough draft implementation using a new member, platform_impl.getDeviceImpl, which performs only the first part of platform_impl.getOrMakeDeviceImpl. My latest commit will fix the assert failure in:

    #include <sycl/sycl.hpp>

    #include <cassert>

    int main() {
      sycl::device sdev(sycl::default_selector{});
      auto ocl_dev = sycl::get_native<sycl::backend::opencl>(sdev);
      sycl::device dev(sycl::make_device<sycl::backend::opencl>(ocl_dev));
      assert(sdev == dev);
    }

However if we have something like this (perhaps unusual) case:

    #include "cuda_helper.hpp"
    #define SYCL_EXT_ONEAPI_BACKEND_CUDA_EXPERIMENTAL 1
    #include <sycl/ext/oneapi/experimental/backend/cuda.hpp>
    #include <sycl/sycl.hpp>

    #include <cassert>

    int main() {
      sycl::device sdev(sycl::default_selector{});
      CUdevice m_cu_device;
      CUDA_CHECK(cuDeviceGet(&m_cu_device, 0));
      sycl::device dev1(
          sycl::make_device<sycl::backend::ext_oneapi_cuda>(m_cu_device));

      auto m_cu_device_clone =
          sycl::get_native<sycl::backend::ext_oneapi_cuda>(dev1);
      sycl::device dev1_clone(
          sycl::make_device<sycl::backend::ext_oneapi_cuda>(m_cu_device_clone));

      assert(dev1 == dev1_clone);
    }

we will hit the assert, because we have not updated MDeviceCache (first we would have to find the platform associated with the device_impl that was made by the plugin). So we will need to create another route to find the correct platform and update platform_impl.MDeviceCache.
This change would also mean that platform_impl.getOrMakeDeviceImpl becomes redundant; we should instead call platform_impl.getDeviceImpl from platform_impl::get_devices and remove platform_impl.getOrMakeDeviceImpl.

I'm not sure that make_device is performance-critical code, since I don't imagine it will be called many times per application; perhaps I am wrong? This solution could be improving the performance of non-performance-critical code at the expense of reduced encapsulation of platform_impl, additional class methods on platform_impl, and probably worse code readability.
I want to check that I did not misunderstand what you meant: do you have a better solution, is this still the correct direction, or is my original solution that made use of platform::get_devices (which I have added to the CUDA implementation in the last commit for reference, beginning from

    auto plts = platform::get_platforms();

) better?

Thanks

@JackAKirk
Copy link
Contributor Author

@smaslov-intel What do you think about my above comment/questions?

@AerialMantis
Copy link
Contributor

@smaslov-intel friendly ping on this, what are your thoughts on @JackAKirk's comments above?
