
[UR][CUDA] Avoid unnecessary calls to cuFuncSetAttribute #16928


Merged 2 commits into intel:sycl on Feb 27, 2025

Conversation

@rafbiels (Contributor) commented Feb 7, 2025

Calling cuFuncSetAttribute to set CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES is required to launch kernels using more than 48 kB of local memory[1] (CUDA dynamic shared memory). Without it, cuLaunchKernel fails with CUDA_ERROR_INVALID_VALUE. However, calling cuFuncSetAttribute introduces synchronisation in the CUDA runtime that blocks its execution until all H2D/D2H memory copies have finished (for reasons that are unclear), effectively preventing kernel launches from overlapping with memory copies. This causes significant performance degradation in some workflows, specifically in applications that launch overlapping memory copies and kernels from multiple host threads into multiple CUDA streams on the same GPU.

Avoid the CUDA runtime synchronisation that causes this poor performance by removing the cuFuncSetAttribute call unless it is strictly necessary. Call it only when a specific carveout is requested by the user (via environment variables) or when the kernel launch would fail without it (local memory size > 48 kB). Good performance is recovered for default settings with kernels using little or no local memory.

Removing the attribute had no observable effect on kernel execution time across a wide range of tested kernels using various amounts of local memory.

[1] Related to the 48 kB static shared memory limit, see the footnote for "Maximum amount of shared memory per thread block" in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications-technical-specifications-per-compute-capability

@rafbiels force-pushed the cuda-avoid-cuFuncSetAttribute branch from e682c9b to 6429ab6 on February 18, 2025 19:59
@rafbiels requested a review from a team as a code owner February 18, 2025 19:59
@rafbiels requested a review from Seanst98 February 18, 2025 19:59
@rafbiels (Contributor, Author) commented:

@frasercrmck @Seanst98, I rebased this PR following the UR move. This is now making the same changes as the already-approved UR PR oneapi-src/unified-runtime#2678

@rafbiels (Contributor, Author) commented:

friendly ping @intel/llvm-reviewers-cuda 👋

@frasercrmck (Contributor) left a comment:


Sorry, missed this

@rafbiels (Contributor, Author) commented:

Thank you, reviewers! @intel/llvm-gatekeepers this is ready to be merged

@uditagarwal97 merged commit 35fba19 into intel:sycl on Feb 27, 2025
30 checks passed
5 participants