-
Notifications
You must be signed in to change notification settings - Fork 171
Feature/occupancy #648
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/occupancy #648
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
Outstanding issues:
|
Stream class does not have _handle data member.
This is necessary to avoid circular dependency. Cluster-related occupancy functions need LaunchConfig. Occupancy functions are defined in _module.py, and _launcher.py that used to house definition of LaunchConfig imports Kernel from _module.py
This class defines kernel occupancy query methods. - max_active_blocks_per_multiprocessor - max_potential_block_size - available_dynamic_shared_memory_per_block - max_potential_cluster_size - max_active_clusters Implementation is based on driver API. The following occupancy-related driver functions are not used - `cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags` - `cuOccupancyMaxPotentialBlockSizeWithFlags` In `cuOccupancyMaxPotentialBlockSize`, only constant dynamic shared-memory size is supported for now. Supporting variable dynamic shared-memory size that depends on the block size is deferred until design is resolved.
1e77bc0
to
b89c95f
Compare
/ok to test |
This comment has been minimized.
This comment has been minimized.
Use it as return type for the KernelOccupancy.max_potential_block_size output.
cuda_utils.driver.CUoccupancyB2DSize type is supported. Required size of dynamic shared memory allocation renamed to dynamic_shared_memory_needed
Test requires Numba. If numba is absent, it is skipped, otherwise `numba.cfunc` is used to compile Python function. ctypes.CFuncPtr object obtained from cfunc_res.ctypes is converted to CUoccupancyB2DSize.
Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
In case we do not want // gcc -fshared -fPIC b2dsize.c -o b2dsize.so
#include <stddef.h>
size_t dynamic_shared_memory_needed(int blockSize) {
return (blockSize <= 32) ? (size_t)0 : (size_t)((blockSize - 1) / 32) * ((size_t)1024);
} Then import ctypes
from cuda.core.experimental._utils.cuda_utils import driver
lib = ctypes.cdll.LoadLibrary("./b2dsize.so")
cfunc = lib.dynamic_shared_memory_needed
fn_ptr = ctypes.cast(cfunc, ctypes.c_void_p).value
dynamic_smem_needed_fn = driver.CUoccupancyB2DSize(_ptr = fn_ptr) This would require compiler being available at test time, which is easy to arrange in conda, at least. We could build a fixture that would build such a library, and skip the test if building step fails due to absent compiler. |
/ok to test |
|
/ok to test 436f111 |
cc @dongxiao92 @pentschev @bandokihiro for vis |
Expanded the docstring, added advisory about possibility of deadlocks should function encoded CUoccupancyB2DSize require GIL. Added argument type validation for dynamic_shared_memory_needed argument.
Performed additional manual testing with Cython-generated C-API functions produced using Steps to create Cython extension and run testsCreate Cython source file# filename: cyx_b2ds.pyx
cdef inline int align_up(int num, int den) nogil:
return ((num + den - 1) // den) * den
cdef inline size_t smem_needed(int block_size, size_t smem_bytes_per_warp) nogil:
cdef int warp_size = 32
cdef int bs = block_size * (block_size > 0)
return (<size_t>align_up(bs, warp_size)) * smem_bytes_per_warp
cdef api size_t smem_needed_64(int block_size) nogil:
return smem_needed(block_size, 64)
cdef api size_t smem_needed_96(int block_size) nogil:
return smem_needed(block_size, 96)
cdef api size_t smem_needed_128(int block_size) nogil:
return smem_needed(block_size, 128)
cdef api size_t smem_needed_196(int block_size) nogil:
return smem_needed(block_size, 196)
cdef api size_t smem_needed_256(int block_size) nogil:
return smem_needed(block_size, 256)
cdef api size_t smem_needed_384(int block_size) nogil:
return smem_needed(block_size, 384)
cdef api size_t smem_needed_512(int block_size) nogil:
return smem_needed(block_size, 512)
cdef api size_t smem_needed_gil(int block_size):
return smem_needed(block_size, 32) Compile and buildcython -3 cyx_b2ds.pyx
cc cyx_b2ds.c -shared -fPIC $(python3-config --cflags) $(python3-config --ldflags) -o cyx_b2ds$(python3-config --extension-suffix) Run testimport cuda.core.experimental as cc
cc.Device(0).set_current()
o1 = cc.Program("__global__ void bar(double *p, int n, double x) { *p = n * x; }", code_type="c++").compile("cubin", name_expressions=("bar",))
k1 = o1.get_kernel("bar")
import ctypes
import cyx_b2ds as ext
from cuda.core.experimental._utils.cuda_utils import driver
gp_fn = ctypes.pythonapi.PyCapsule_GetPointer
gp_fn.restype, gp_fn.argtypes = ctypes.c_void_p, [ctypes.py_object, ctypes.c_char_p]
def get_capi_fn_ptr(name):
caps = ext.__pyx_capi__[name]
capi_ptr = gp_fn(caps, b'size_t (int)')
return driver.CUoccupancyB2DSize(_ptr=capi_ptr)
|
To expand on this and capture offline discussion... The concern here is that we have two global locks in play, 1 from the Python Global Interpreter Lock, and 1 from the CUDA driver. We risk running into the following situation:
This would lead to a deadlock and we've seen this behavior in the past, i.e. numba/numba#4581 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Sasha! LGTM overall, most comments below are doc-related.
For example, we need to add _launch_config.LaunchConfig
, _module.KernelOccupancy
, etc, to cuda_core/docs/source/api_private.rst
to get them rendered and cross-ref'd.
Occupancy tests need not contain saxpy in the test name even though it uses saxpy kernel for testing.
/ok to test 496eb5b |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks Sasha! I made a doc-only fix. The CI was green so let me admin-merge to save some resources.
|
Description
closes #504
Checklist