Feature/occupancy #648

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

merged 18 commits into NVIDIA:main on Jun 4, 2025
Conversation

oleksandr-pavlyk (Contributor)

Description

closes #504

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.


copy-pr-bot bot commented May 20, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.



oleksandr-pavlyk commented May 20, 2025

Outstanding issues:

  • Write tests
  • Should we support a variable dynamic shared-memory size that depends on the block size in max_potential_block_size?
  • Method names use the fully spelled-out shared_memory_size; should shmem_size be used instead?
  • For cluster-related occupancy queries, the driver API takes a CUlaunchConfig struct, which maps to a pair of a LaunchConfig data-class instance and a Stream instance, because the data class does not contain stream information. Is this design decision acceptable? (A sketch of this mapping follows below.)
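
A minimal sketch of that pairing (the query signature shown in the comment is an assumption, not verbatim API): a LaunchConfig carries the grid/cluster/block shape but no stream, so a cluster-related occupancy query would take the stream separately.

from cuda.core.experimental import Device, LaunchConfig

dev = Device(0)
dev.set_current()
stream = dev.create_stream()

# LaunchConfig holds only the launch shape; it has no stream member.
config = LaunchConfig(grid=(4, 1, 1), cluster=(2, 1, 1), block=(128, 1, 1))

# A cluster-related query would then pass the stream alongside it, e.g.:
#     kernel.occupancy.max_active_clusters(config, stream=stream)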

@oleksandr-pavlyk added labels May 20, 2025: cuda.core (Everything related to the cuda.core module), enhancement (Any code-related improvements), P0 (High priority - Must do!)
Stream class does not have a _handle data member.
This is necessary to avoid a circular dependency.

Cluster-related occupancy functions need LaunchConfig.
Occupancy functions are defined in _module.py, while _launcher.py,
which used to house the definition of LaunchConfig, imports Kernel
from _module.py.
The KernelOccupancy class defines kernel occupancy query methods:

  - max_active_blocks_per_multiprocessor
  - max_potential_block_size
  - available_dynamic_shared_memory_per_block
  - max_potential_cluster_size
  - max_active_clusters

The implementation is based on the driver API. The following
occupancy-related driver functions are not used:
 - `cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags`
 - `cuOccupancyMaxPotentialBlockSizeWithFlags`

In `cuOccupancyMaxPotentialBlockSize`, only a constant dynamic shared-memory size
is supported for now. Supporting a variable dynamic shared-memory size that depends
on the block size is deferred until the design is resolved.
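
A minimal usage sketch of these queries (assuming a compiled kernel object `kernel`; only the positional form of max_potential_block_size is taken from the transcript later in this thread, the other signatures are assumptions):

# Blocks per SM for a fixed block size and dynamic shared-memory budget.
num_blocks = kernel.occupancy.max_active_blocks_per_multiprocessor(256, 0)

# Dynamic shared memory available per block at that occupancy.
smem_per_block = kernel.occupancy.available_dynamic_shared_memory_per_block(num_blocks, 256)

# Occupancy-maximizing block size for a constant dynamic shared-memory size;
# returns a MaxPotential(min_grid_size, max_block_size) named tuple.
res = kernel.occupancy.max_potential_block_size(0, 0)
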
@oleksandr-pavlyk

/ok to test


@leofang self-requested a review May 20, 2025 19:47
@leofang added this to the cuda.core beta 4 milestone May 20, 2025
Add the MaxPotential named tuple and use it as the return type for the
KernelOccupancy.max_potential_block_size output.

The cuda_utils.driver.CUoccupancyB2DSize type is supported. The argument
giving the required size of the dynamic shared-memory allocation was
renamed to dynamic_shared_memory_needed.

The test requires Numba; if Numba is absent, the test is skipped.
Otherwise `numba.cfunc` is used to compile a Python function, and the
ctypes.CFuncPtr object obtained from cfunc_res.ctypes is converted to
CUoccupancyB2DSize.
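
A sketch of that conversion (assuming Numba is installed; types.uintp stands in for size_t, which matches on common 64-bit platforms):

import ctypes

from numba import cfunc, types

from cuda.core.experimental._utils.cuda_utils import driver

@cfunc(types.uintp(types.intc))
def dynamic_shared_memory_needed(block_size):
    # 1 KiB of dynamic shared memory per warp's worth of threads
    return 0 if block_size <= 32 else ((block_size - 1) // 32) * 1024

# cfunc(...).ctypes is a ctypes.CFuncPtr; extract the raw address and
# wrap it in the driver's CUoccupancyB2DSize callback type.
fn_ptr = ctypes.cast(dynamic_shared_memory_needed.ctypes, ctypes.c_void_p).value
b2d_size = driver.CUoccupancyB2DSize(_ptr=fn_ptr)
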
@oleksandr-pavlyk marked this pull request as ready for review May 22, 2025 21:38

copy-pr-bot bot commented May 22, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.



oleksandr-pavlyk commented May 22, 2025

In case we do not want Numba as a test dependency (even an optional one), we could consider compiling the following b2dsize.c library:

// gcc -shared -fPIC b2dsize.c -o b2dsize.so
#include <stddef.h>

size_t dynamic_shared_memory_needed(int blockSize) {
  return (blockSize <= 32) ? (size_t)0 : (size_t)((blockSize - 1) / 32) * ((size_t)1024);
}

Then

import ctypes

from cuda.core.experimental._utils.cuda_utils import driver

# Load the helper library and take the raw address of its callback function.
lib = ctypes.cdll.LoadLibrary("./b2dsize.so")
cfunc = lib.dynamic_shared_memory_needed
fn_ptr = ctypes.cast(cfunc, ctypes.c_void_p).value

# Wrap the address in the driver's CUoccupancyB2DSize callback type.
dynamic_smem_needed_fn = driver.CUoccupancyB2DSize(_ptr=fn_ptr)

This would require a compiler to be available at test time, which is easy to arrange in conda, at least.

We could build a fixture that builds such a library and skips the test if the build step fails due to an absent compiler; a sketch of such a fixture follows.
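
A minimal sketch of such a fixture (hypothetical names, assuming pytest and a POSIX cc on PATH):

import ctypes
import subprocess

import pytest

B2DSIZE_C = """
#include <stddef.h>
size_t dynamic_shared_memory_needed(int blockSize) {
  return (blockSize <= 32) ? (size_t)0
       : (size_t)((blockSize - 1) / 32) * (size_t)1024;
}
"""

@pytest.fixture(scope="session")
def b2dsize_lib(tmp_path_factory):
    # Write the C source to a temp dir, build it, and load the result;
    # skip the requesting tests when no compiler is available.
    src = tmp_path_factory.mktemp("b2dsize") / "b2dsize.c"
    src.write_text(B2DSIZE_C)
    out = src.with_suffix(".so")
    try:
        subprocess.run(["cc", "-shared", "-fPIC", str(src), "-o", str(out)], check=True)
    except (OSError, subprocess.CalledProcessError):
        pytest.skip("no C compiler available to build b2dsize.so")
    return ctypes.cdll.LoadLibrary(str(out))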

@oleksandr-pavlyk

/ok to test


leofang commented May 29, 2025

  • document that Python callbacks are uncharted territory (due to GIL contention)
  • add release note


leofang commented May 29, 2025

/ok to test 436f111


leofang commented May 29, 2025

cc @dongxiao92 @pentschev @bandokihiro for vis

Expanded the docstring, adding an advisory about the possibility of deadlocks
should the function encoded by CUoccupancyB2DSize require the GIL.

Added argument type validation for the dynamic_shared_memory_needed
argument.
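
A sketch of the kind of validation meant (hypothetical helper, not the PR's actual code):

from cuda.core.experimental._utils.cuda_utils import driver

def _validate_b2d_size(dynamic_shared_memory_needed):
    # Accept either a constant size (int) or a CUoccupancyB2DSize callback.
    if not isinstance(dynamic_shared_memory_needed, (int, driver.CUoccupancyB2DSize)):
        raise TypeError(
            "dynamic_shared_memory_needed must be an int or "
            f"driver.CUoccupancyB2DSize, got {type(dynamic_shared_memory_needed)}"
        )
    return dynamic_shared_memory_needed
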
@oleksandr-pavlyk

Performed additional manual testing with Cython-generated C-API functions produced using the api keyword, including C-API functions that hold the GIL.

Steps to create Cython extension and run tests

Create Cython source file

# filename: cyx_b2ds.pyx

cdef inline int align_up(int num, int den) nogil:
    return ((num + den - 1) // den) * den

cdef inline size_t smem_needed(int block_size, size_t smem_bytes_per_warp) nogil:
    cdef int warp_size = 32
    cdef int bs = block_size * (block_size > 0)
    return (<size_t>align_up(bs, warp_size)) * smem_bytes_per_warp 

cdef api size_t smem_needed_64(int block_size) nogil:
    return smem_needed(block_size, 64)

cdef api size_t smem_needed_96(int block_size) nogil:
    return smem_needed(block_size, 96)

cdef api size_t smem_needed_128(int block_size) nogil:
    return smem_needed(block_size, 128)

cdef api size_t smem_needed_196(int block_size) nogil:
    return smem_needed(block_size, 196)

cdef api size_t smem_needed_256(int block_size) nogil:
    return smem_needed(block_size, 256)

cdef api size_t smem_needed_384(int block_size) nogil:
    return smem_needed(block_size, 384)

cdef api size_t smem_needed_512(int block_size) nogil:
    return smem_needed(block_size, 512)

cdef api size_t smem_needed_gil(int block_size):
    return smem_needed(block_size, 32)

Compile and build

cython -3 cyx_b2ds.pyx
cc cyx_b2ds.c -shared -fPIC $(python3-config --cflags) $(python3-config --ldflags) -o cyx_b2ds$(python3-config --extension-suffix)

Run test

import cuda.core.experimental as cc
cc.Device(0).set_current()

o1 = cc.Program("__global__ void bar(double *p, int n, double x) { *p = n * x; }", code_type="c++").compile("cubin", name_expressions=("bar",))
k1 = o1.get_kernel("bar")

import ctypes
import cyx_b2ds as ext
from cuda.core.experimental._utils.cuda_utils import driver

gp_fn = ctypes.pythonapi.PyCapsule_GetPointer
gp_fn.restype, gp_fn.argtypes = ctypes.c_void_p, [ctypes.py_object, ctypes.c_char_p]

def get_capi_fn_ptr(name):
    caps = ext.__pyx_capi__[name]
    capi_ptr = gp_fn(caps, b'size_t (int)')
    return driver.CUoccupancyB2DSize(_ptr=capi_ptr)

<<< Elided execution of code-block given above >>>

In [2]: [k1.occupancy.max_potential_block_size(get_capi_fn_ptr(name), 0) for name in ['smem_needed_64', 'smem_needed_96', 'smem_needed_128', 'smem_needed_196', 'smem_needed_256', 'smem_needed_384', 'smem_needed_512']]
Out[2]: 
[MaxPotential(min_grid_size=168, max_block_size=768),
 MaxPotential(min_grid_size=168, max_block_size=512),
 MaxPotential(min_grid_size=168, max_block_size=384),
 MaxPotential(min_grid_size=252, max_block_size=160),
 MaxPotential(min_grid_size=168, max_block_size=192),
 MaxPotential(min_grid_size=168, max_block_size=128),
 MaxPotential(min_grid_size=168, max_block_size=96)]

In [3]: k1.occupancy.max_potential_block_size(get_capi_fn_ptr('smem_needed_gil'), 0)
Out[3]: MaxPotential(min_grid_size=168, max_block_size=768)

In [4]: quit


kkraus14 commented May 30, 2025

  • document that Python callbacks are uncharted territory (due to GIL contention)

To expand on this and capture offline discussion... The concern here is that we have two global locks in play: one from the Python Global Interpreter Lock (GIL), and one from the CUDA driver. We risk running into the following situation:

  • Thread 1: releases the GIL and acquires the CUDA driver lock when calling cuOccupancyMaxPotentialBlockSize, then gets blocked waiting to reacquire the GIL for the callback function
  • Thread 2: calls a CUDA API without releasing the GIL, and gets blocked waiting to acquire the CUDA driver lock while holding the GIL

This would lead to a deadlock, and we've seen this behavior in the past, e.g. numba/numba#4581. A stand-in illustration follows.
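
A self-contained stand-in illustration of that interleaving (plain threading.Lock objects play the roles of the GIL and the driver lock; timeouts replace the real, indefinite waits so the script terminates instead of hanging):

import threading
import time

gil = threading.Lock()          # stand-in for the Python GIL
driver_lock = threading.Lock()  # stand-in for the CUDA driver's global lock

def occupancy_call():
    # Thread 1: like cuOccupancyMaxPotentialBlockSize with a Python callback.
    # The GIL was released and the driver lock taken; the callback now needs
    # the GIL back while the driver lock is still held.
    with driver_lock:
        time.sleep(0.1)                    # let thread 2 grab the "GIL" first
        if not gil.acquire(timeout=0.5):
            print("thread 1: cannot reacquire the GIL for the callback")
        else:
            gil.release()
        time.sleep(1.0)                    # keep holding the driver lock

def naive_cuda_call():
    # Thread 2: a CUDA API call issued while still holding the GIL.
    time.sleep(0.05)
    with gil:
        if not driver_lock.acquire(timeout=1.0):
            print("thread 2: cannot take the driver lock while holding the GIL")
        else:
            driver_lock.release()

t1 = threading.Thread(target=occupancy_call)
t2 = threading.Thread(target=naive_cuda_call)
t1.start(); t2.start()
t1.join(); t2.join()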

@leofang left a comment

Thanks, Sasha! LGTM overall, most comments below are doc-related.

For example, we need to add _launch_config.LaunchConfig, _module.KernelOccupancy, etc, to cuda_core/docs/source/api_private.rst to get them rendered and cross-ref'd.

@leofang mentioned this pull request Jun 3, 2025
Occupancy tests need not contain saxpy in the test name even though they
use a saxpy kernel for testing.

leofang commented Jun 4, 2025

/ok to test 496eb5b

@leofang left a comment

LGTM, thanks Sasha! I made a doc-only fix. The CI was green so let me admin-merge to save some resources.

@leofang merged commit 19c4169 into NVIDIA:main Jun 4, 2025
1 check passed

github-actions bot commented Jun 4, 2025

Doc Preview CI
Preview removed because the pull request was closed or merged.

Labels

cuda.core (Everything related to the cuda.core module), enhancement (Any code-related improvements), P0 (High priority - Must do!)

Successfully merging this pull request may close these issues: Cover occupancy calculator APIs (#504)