Skip to content

[SYCL][CUDA][HIP] Support zero range kernel for cuda and hip backends. #7044

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Nov 10, 2022
Merged
5 changes: 5 additions & 0 deletions sycl/plugins/cuda/pi_cuda.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -2988,6 +2988,11 @@ pi_result cuda_piEnqueueKernelLaunch(
assert(work_dim > 0);
assert(work_dim < 4);

if (*global_work_size == 0) {
return cuda_piEnqueueEventsWaitWithBarrier(
command_queue, num_events_in_wait_list, event_wait_list, event);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing an early exit here means that we do not create an event. I fear that could cause unexpected problems. Same goes for HIP.

Copy link
Contributor Author

@mmoadeli mmoadeli Oct 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@steffenlarsen I agree with you on this regarding the event.
If I move the condition (if (*global_work_size =! 0 )) to guard the call to PI_CHECK_ERROR(cuLaunchKernel( it will preserve the functionality related to event handling. Do you have any reservation doing that?
I have not tried the L0, but the opencl:cpu and esimd (and host in an earlier version of DP++) backend are tried and work without the need for any modifications.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either that or you could copy the event creation to here, something like:

if (event) {
  std::unique_ptr<_pi_event> retImplEv{nullptr};
  retImplEv = std::unique_ptr<_pi_event>(_pi_event::make_native(
      PI_COMMAND_TYPE_NDRANGE_KERNEL, command_queue));
  retImplEv->start();
  retImplEv->record();
  *event = retImplEv.release();
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it will be a correct implementation. A call with NDRange == 0 while should do nothing in terms of running the kernel, the event produced should be as if we run a kernel:

  1. Completion of such an event should guarantee completion of events passed in event_wait_list
  2. The event can be used in a event_wait_list of a subsequent enqueue.

So, I think that the more correct implementation would be just calling cuda_piEnqueueEventsWaitWithBarrier if NDRange == 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point. Now that there are multiple queues, recording an event is not enough to act like a barrier.


// Set the number of threads per block to the number of threads per warp
// by default unless user has provided a better number
size_t threadsPerBlock[3] = {32u, 1u, 1u};
Expand Down
5 changes: 5 additions & 0 deletions sycl/plugins/hip/pi_hip.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -2758,6 +2758,11 @@ pi_result hip_piEnqueueKernelLaunch(
assert(work_dim > 0);
assert(work_dim < 4);

if (*global_work_size == 0) {
return hip_piEnqueueEventsWaitWithBarrier(
command_queue, num_events_in_wait_list, event_wait_list, event);
}

// Set the number of threads per block to the number of threads per warp
// by default unless user has provided a better number
size_t threadsPerBlock[3] = {32u, 1u, 1u};
Expand Down
2 changes: 1 addition & 1 deletion sycl/source/detail/scheduler/commands.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1934,7 +1934,7 @@ static void adjustNDRangePerKernel(NDRDescT &NDR, RT::PiKernel Kernel,
if (NDR.GlobalSize[0] != 0)
return; // GlobalSize is set - no need to adjust
// check the prerequisites:
assert(NDR.NumWorkGroups[0] != 0 && NDR.LocalSize[0] == 0);
assert(NDR.LocalSize[0] == 0);
// TODO might be good to cache this info together with the kernel info to
// avoid get_kernel_work_group_info on every kernel run
range<3> WGSize = get_kernel_device_specific_info<
Expand Down