Skip to content

[RFC] Improve argument passing / Support some non device copiable types #5320

Open
@Naghasan

Description

@Naghasan

While working on Kokkos, we looked at supporting directly called kernels, but this however implies supporting types that are does not meet device copyable criteria.

Context:

By default Kokkos passes it arguments by storing the values into a USM buffer and pass the pointer to the kernel. This is a common approach that is also used for the CUDA backend.

So in pseudo code we have something like this:

auto * args = sycl::malloc_shared<T>(...)
*args = /* set struct */

[...]

parallel_for(range, [=](item i) {
  args->use_fields;
})

The type T used in the pseudo code doesn't meet the requirements to be device copyable (like non trivial destructor). And setting specializing is_device_copyable_v is of no help. T contains a std::string field (unused) and generate a call to delete. However, T is intended to be bitwise copied to the device, so even if it is technically not device copyable, there is a "promise" it is.

Further more, we also noticed that large arrays are copied over using a copy loop rather than a mem copy. This prevent SROA to operate (dynamic indexing makes it bails out) and as a consequence prevents DAE as well.

Proposition:

During the generation of the opencl/spir like kernel's body:

  • Wrap the sycl kernel inside a union instead of direct instantiation (this will prevent the call to the destructor);
  • Initialize each non special types using a memcopy;
  • Initialize each special types using a placement new (force the call to the default Ctor) and call __init/__finalize to maintain current behavior.

The proposition aims to tackle 2 aspects:

  • Allow non device copiable type to be usable
  • Shape code to be more SROA friendly

This bend the specs as it will allow types that should normally be rejected, but remains within the limits (https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#sec::device.copyable). Namely, this exploits:

  • It is unspecified whether the implementation actually calls the copy constructor, move constructor, copy assignment operator, or move assignment operator of a class declared as is_device_copyable_v when doing an inter-device copy.
  • The destructor has no effect when executed on the device

I have a prototype that is close to be finished, if code owners are fine with the approach, I should be able to push it this week.

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions