[UR][CUDA] Avoid unnecessary calls to cuFuncSetAttribute #16928
Conversation
Update UR tag to include oneapi-src/unified-runtime#2678
Calling `cuFuncSetAttribute` to set `CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES` is required to launch kernels using more than 48 kB of local memory[1] (CUDA dynamic shared memory); without it, `cuLaunchKernel` fails with `CUDA_ERROR_INVALID_VALUE`. However, calling `cuFuncSetAttribute` introduces synchronisation in the CUDA runtime that blocks its execution until all H2D/D2H memory copies have finished (for reasons that are unclear), effectively preventing kernel launches from overlapping with memory copies. This causes significant performance degradation in some workflows, specifically applications that launch overlapping memory copies and kernels from multiple host threads into multiple CUDA streams on the same GPU.

Avoid the CUDA runtime synchronisation that causes this poor performance by removing the `cuFuncSetAttribute` call unless it is strictly necessary: call it only when a specific carveout is requested by the user (via environment variables) or when the kernel launch would otherwise fail (local memory size > 48 kB). Good performance is recovered for default settings with kernels using little or no local memory.

No effect on kernel execution time was observed after removing the attribute, across a wide range of tested kernels using various amounts of local memory.

[1] Related to the 48 kB static shared memory limit; see the footnote for "Maximum amount of shared memory per thread block" in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications-technical-specifications-per-compute-capability
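A minimal sketch of the conditional logic described above, using the CUDA Driver API directly (this is an illustration, not the actual UR code; `launchWithLocalMem` and its parameters are hypothetical, and the env-variable carveout path is omitted):

```cpp
#include <cuda.h>

// Only raise the dynamic shared-memory limit when the launch actually needs
// it, so the synchronising cuFuncSetAttribute call is skipped on the common
// (<= 48 kB) path.
static CUresult launchWithLocalMem(CUfunction func, CUstream stream,
                                   unsigned gridX, unsigned blockX,
                                   size_t dynSharedBytes, void **kernelArgs) {
  // 48 kB is the default per-block shared-memory limit that applies unless
  // CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES is raised.
  constexpr size_t kDefaultSharedLimit = 48 * 1024;

  if (dynSharedBytes > kDefaultSharedLimit) {
    // Required for large launches; this call may synchronise with in-flight
    // H2D/D2H copies, which is why it is avoided otherwise.
    CUresult err = cuFuncSetAttribute(
        func, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES,
        static_cast<int>(dynSharedBytes));
    if (err != CUDA_SUCCESS)
      return err;
  }

  return cuLaunchKernel(func, gridX, 1, 1, blockX, 1, 1,
                        static_cast<unsigned>(dynSharedBytes), stream,
                        kernelArgs, /*extra=*/nullptr);
}
```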
Force-pushed from e682c9b to 6429ab6
@frasercrmck @Seanst98, I rebased this PR following the UR move. This is now making the same changes as the already-approved UR PR oneapi-src/unified-runtime#2678
friendly ping @intel/llvm-reviewers-cuda 👋
Sorry, missed this
Thank you, reviewers! @intel/llvm-gatekeepers, this is ready to be merged.