Description
While working on Kokkos, we looked at supporting directly called kernels, but this however implies supporting types that are does not meet device copyable criteria.
Context:
By default Kokkos passes it arguments by storing the values into a USM buffer and pass the pointer to the kernel. This is a common approach that is also used for the CUDA backend.
So in pseudo code we have something like this:
auto * args = sycl::malloc_shared<T>(...)
*args = /* set struct */
[...]
parallel_for(range, [=](item i) {
args->use_fields;
})
The type T
used in the pseudo code doesn't meet the requirements to be device copyable (like non trivial destructor). And setting specializing is_device_copyable_v
is of no help. T
contains a std::string field (unused) and generate a call to delete. However, T
is intended to be bitwise copied to the device, so even if it is technically not device copyable, there is a "promise" it is.
Further more, we also noticed that large arrays are copied over using a copy loop rather than a mem copy. This prevent SROA to operate (dynamic indexing makes it bails out) and as a consequence prevents DAE as well.
Proposition:
During the generation of the opencl/spir like kernel's body:
- Wrap the sycl kernel inside a union instead of direct instantiation (this will prevent the call to the destructor);
- Initialize each non special types using a memcopy;
- Initialize each special types using a placement new (force the call to the default Ctor) and call
__init
/__finalize
to maintain current behavior.
The proposition aims to tackle 2 aspects:
- Allow non device copiable type to be usable
- Shape code to be more SROA friendly
This bend the specs as it will allow types that should normally be rejected, but remains within the limits (https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#sec::device.copyable). Namely, this exploits:
It is unspecified whether the implementation actually calls the copy constructor, move constructor, copy assignment operator, or move assignment operator of a class declared as is_device_copyable_v when doing an inter-device copy.
The destructor has no effect when executed on the device
I have a prototype that is close to be finished, if code owners are fine with the approach, I should be able to push it this week.