Commit fa501fd

[SYCL][CUDA] Add fix for local size calculation regression (#9736)
This PR fixes a performance regression in work-group size selection when only `sycl::range` is used. The regression was reported in issue [#5627](#5627). We want the work-groups to be uniformly distributed, but that can lead to non-optimally sized work-groups if the global work size is not an even number. Ideally, we want to ensure that the work-group size is a power of two.
1 parent 37bb6a2 commit fa501fd

1 file changed, +8 -1 lines changed

sycl/plugins/cuda/pi_cuda.cpp

Lines changed: 8 additions & 1 deletion
@@ -325,9 +325,16 @@ void guessLocalWorkSize(_pi_device *device, size_t *threadsPerBlock,
       std::min(maxThreadsPerBlock[0],
                std::min(global_work_size[0], static_cast<size_t>(gridDim[0])));
 
+  static auto isPowerOf2 = [](size_t value) -> bool {
+    return value && !(value & (value - 1));
+  };
+
   // Find a local work group size that is a divisor of the global
   // work group size to produce uniform work groups.
-  while (0u != (global_work_size[0] % threadsPerBlock[0])) {
+  // Additionally, for best compute utilisation, the local size has
+  // to be a power of two.
+  while (0u != (global_work_size[0] % threadsPerBlock[0]) ||
+         !isPowerOf2(threadsPerBlock[0])) {
     --threadsPerBlock[0];
   }
 }
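
The effect of the changed loop condition can be tried outside the plugin with a minimal standalone sketch. The function names `pickDivisor` and `pickPow2Divisor`, the `maxThreads` parameter, and the zero-size guard are hypothetical and are not part of `pi_cuda.cpp`. The sketch only mirrors the one-dimensional search shown in the hunk above, before and after the change.

```cpp
#include <algorithm>
#include <cstddef>

// Same bit trick as the lambda in the diff: true for 1, 2, 4, 8, ...
inline bool isPowerOf2(std::size_t value) {
  return value && !(value & (value - 1));
}

// Hypothetical stand-in for the loop before the change:
// largest divisor of globalSize that does not exceed maxThreads.
inline std::size_t pickDivisor(std::size_t globalSize, std::size_t maxThreads) {
  std::size_t local = std::min(globalSize, maxThreads);
  if (local == 0) return 0; // guard added for this sketch only
  while (globalSize % local != 0) {
    --local;
  }
  return local;
}

// Hypothetical stand-in for the loop after the change:
// largest power-of-two divisor of globalSize that does not exceed maxThreads.
// The loop always terminates because 1 divides everything and is a power of two.
inline std::size_t pickPow2Divisor(std::size_t globalSize,
                                   std::size_t maxThreads) {
  std::size_t local = std::min(globalSize, maxThreads);
  if (local == 0) return 0; // guard added for this sketch only
  while (globalSize % local != 0 || !isPowerOf2(local)) {
    --local;
  }
  return local;
}
```

With a global size of 1536 and a limit of 1024, `pickDivisor` returns 768 while `pickPow2Divisor` returns 512. For an odd global size such as 1001, the power-of-two search falls through to 1, since 1 is the only power of two that divides an odd number.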