Commit 35fba19

[UR][CUDA] Avoid unnecessary calls to cuFuncSetAttribute (#16928)
Calling `cuFuncSetAttribute` to set `CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES` is required to launch kernels using more than 48 kB of local memory[1] (CUDA dynamic shared memory). Without this, `cuLaunchKernel` fails with `CUDA_ERROR_INVALID_VALUE`.

However, calling `cuFuncSetAttribute` introduces synchronisation in the CUDA runtime, which blocks its execution until all H2D/D2H memory copies are finished (the reason is unknown). This effectively prevents kernel launches from overlapping with memory copies, and causes significant performance degradation in some workflows, specifically in applications launching overlapping memory copies and kernels from multiple host threads into multiple CUDA streams on the same GPU.

Avoid the CUDA runtime synchronisation by removing the `cuFuncSetAttribute` call unless it is strictly necessary. Call it only when a specific carveout is requested by the user (using environment variables) or when the kernel launch would fail without it (local memory size > 48 kB).

Good performance is recovered for default settings with kernels using little or no local memory. No effect on kernel execution time was observed after removing the attribute across a wide range of tested kernels using various amounts of local memory.

[1] Related to the 48 kB static shared memory limit, see the footnote for "Maximum amount of shared memory per thread block" in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications-technical-specifications-per-compute-capability
1 parent 671468a commit 35fba19

1 file changed: +3 -1 lines changed

unified-runtime/source/adapters/cuda/enqueue.cpp

Lines changed: 3 additions & 1 deletion
@@ -290,7 +290,9 @@ setKernelParams([[maybe_unused]] const ur_context_handle_t Context,
           CuFunc, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES,
           Device->getMaxChosenLocalMem()));
 
-    } else {
+    } else if (LocalSize > 48 * 1024) {
+      // CUDA requires explicit carveout of dynamic shared memory size if larger
+      // than 48 kB, otherwise cuLaunchKernel fails.
       UR_CHECK_ERROR(cuFuncSetAttribute(
           CuFunc, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES, LocalSize));
     }
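
The branching in this hunk can be read as a small decision function. The following is a minimal standalone sketch, not the adapter's code: `AttrDecision`, `decideMaxDynamicShared`, and `UserCarveoutBytes` are hypothetical names, and because the condition of the first branch is not visible in the diff, it is modelled here simply as "the user requested a non-zero carveout". The sketch only computes whether `cuFuncSetAttribute` would need to be called and with which value; it does not call the CUDA driver.

```cpp
#include <cstddef>

// Threshold taken from the commit's new `else if` condition (48 kB).
constexpr std::size_t OptInThresholdBytes = 48 * 1024;

// Hypothetical result type: whether the attribute call is needed, and the
// byte count that would be passed for
// CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES.
struct AttrDecision {
  bool NeedsCall;
  std::size_t Bytes;
};

// Hypothetical helper mirroring the post-commit branching.
// UserCarveoutBytes == 0 is assumed to mean "no carveout requested".
AttrDecision decideMaxDynamicShared(std::size_t LocalSize,
                                    std::size_t UserCarveoutBytes) {
  if (UserCarveoutBytes > 0) {
    // User asked for a specific carveout: always set the attribute.
    return {true, UserCarveoutBytes};
  }
  if (LocalSize > OptInThresholdBytes) {
    // Launch would fail without the attribute: set it to the requested size.
    return {true, LocalSize};
  }
  // Default case: skip the call and avoid the runtime synchronisation.
  return {false, 0};
}
```

Under these assumptions, a kernel requesting exactly 48 kB of local memory skips the call, since the comparison is strict, while one requesting a single byte more triggers it.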
