
Commit 5d64579

Merge remote-tracking branch 'upstream/main' into nvjitlink-documentation

2 parents 297947b + dde2fe2

File tree: 20 files changed, +321 −66 lines changed

.github/actions/build/action.yml

Lines changed: 12 additions & 0 deletions

@@ -44,6 +44,12 @@ runs:
           $CHOWN -R $(whoami) ${{ env.CUDA_CORE_ARTIFACTS_DIR }}
           ls -lahR ${{ env.CUDA_CORE_ARTIFACTS_DIR }}

+    - name: Check cuda.core wheel
+      shell: bash --noprofile --norc -xeuo pipefail {0}
+      run: |
+        pip install twine
+        twine check ${{ env.CUDA_CORE_ARTIFACTS_DIR }}/*.whl
+
     - name: Upload cuda.core build artifacts
       uses: actions/upload-artifact@v4
       with:
@@ -82,6 +88,12 @@ runs:
           $CHOWN -R $(whoami) ${{ env.CUDA_BINDINGS_ARTIFACTS_DIR }}
           ls -lahR ${{ env.CUDA_BINDINGS_ARTIFACTS_DIR }}

+    # TODO: enable this after NVIDIA/cuda-python#297 is resolved
+    # - name: Check cuda.bindings wheel
+    #   shell: bash --noprofile --norc -xeuo pipefail {0}
+    #   run: |
+    #     twine check ${{ env.CUDA_BINDINGS_ARTIFACTS_DIR }}/*.whl
+
     - name: Upload cuda.bindings build artifacts
       uses: actions/upload-artifact@v4
       with:

.github/actions/test/action.yml

Lines changed: 7 additions & 0 deletions

@@ -14,6 +14,13 @@ runs:
       shell: bash --noprofile --norc -xeuo pipefail {0}
       run: nvidia-smi

+    # The cache action needs this
+    - name: Install zstd
+      shell: bash --noprofile --norc -xeuo pipefail {0}
+      run: |
+        apt update
+        apt install zstd
+
     - name: Download bindings build artifacts
       uses: actions/download-artifact@v4
       with:

.github/workflows/gh-build-and-test.yml

Lines changed: 7 additions & 5 deletions

@@ -76,17 +76,19 @@ jobs:
   test:
     # TODO: improve the name once a separate test matrix is defined
     name: Test (CUDA ${{ inputs.cuda-version }})
-    # TODO: enable testing once linux-aarch64 & win-64 GPU runners are up
+    # TODO: enable testing once win-64 GPU runners are up
     if: ${{ (github.repository_owner == 'nvidia') &&
-            startsWith(inputs.host-platform, 'linux-x64') }}
+            startsWith(inputs.host-platform, 'linux') }}
     permissions:
       id-token: write # This is required for configure-aws-credentials
       contents: read  # This is required for actions/checkout
-    runs-on: ${{ (inputs.host-platform == 'linux-x64' && 'linux-amd64-gpu-v100-latest-1') }}
-    # TODO: use a different (nvidia?) container, or just run on bare image
+    runs-on: ${{ (inputs.host-platform == 'linux-x64' && 'linux-amd64-gpu-v100-latest-1') ||
+                 (inputs.host-platform == 'linux-aarch64' && 'linux-arm64-gpu-a100-latest-1') }}
+    # Our self-hosted runners require a container
+    # TODO: use a different (nvidia?) container
     container:
       options: -u root --security-opt seccomp=unconfined --privileged --shm-size 16g
-      image: condaforge/miniforge3:latest
+      image: ubuntu:22.04
       env:
         NVIDIA_VISIBLE_DEVICES: ${{ env.NVIDIA_VISIBLE_DEVICES }}
     needs:

cuda_core/DESCRIPTION.rst

Lines changed: 27 additions & 0 deletions

@@ -0,0 +1,27 @@
+*******************************************************
+cuda-core: Pythonic access to CUDA core functionalities
+*******************************************************
+
+`cuda.core <https://nvidia.github.io/cuda-python/cuda-core/>`_ bridges Python's productivity
+with CUDA's performance through intuitive and pythonic APIs.
+The mission is to provide users full access to all of the core CUDA features in Python,
+such as runtime control, compiler and linker.
+
+* `Repository <https://github.com/NVIDIA/cuda-python/tree/main/cuda_core>`_
+* `Documentation <https://nvidia.github.io/cuda-python/cuda-core/>`_
+* `Examples <https://github.com/NVIDIA/cuda-python/tree/main/cuda_core/examples>`_
+* `Issue tracker <https://github.com/NVIDIA/cuda-python/issues/>`_
+
+`cuda.core` is currently under active development. Any feedbacks or suggestions are welcomed!
+
+
+Installation
+============
+
+.. code-block:: bash
+
+   pip install cuda-core[cu12]
+
+Please refer to the `installation instructions
+<https://nvidia.github.io/cuda-python/cuda-core/latest/install.html>`_ for different
+ways of installing `cuda.core`, including building from source.
cuda_core/README.md

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 # `cuda.core`: (experimental) pythonic CUDA module

-Currently under active developmen; see [the documentation](https://nvidia.github.io/cuda-python/cuda-core/latest/) for more details.
+Currently under active development; see [the documentation](https://nvidia.github.io/cuda-python/cuda-core/latest/) for more details.

 ## Installing

cuda_core/cuda/core/_version.py

Lines changed: 1 addition & 1 deletion

@@ -2,4 +2,4 @@
 #
 # SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE

-__version__ = "0.1.0"
+__version__ = "0.1.1"

cuda_core/cuda/core/experimental/_device.py

Lines changed: 16 additions & 16 deletions

@@ -23,11 +23,11 @@ class Device:
     and use the same GPU device.

     While acting as the entry point, many other CUDA resources can be
-    allocated such as streams and buffers. Any :obj:`Context` dependent
+    allocated such as streams and buffers. Any :obj:`~_context.Context` dependent
     resource created through this device, will continue to refer to
     this device's context.

-    Newly returend :obj:`Device` object are is a thread-local singleton
+    Newly returned :obj:`~_device.Device` objects are thread-local singletons
     for a specified device.

     Note
@@ -37,7 +37,7 @@ class Device:
     Parameters
     ----------
     device_id : int, optional
-        Device ordinal to return a :obj:`Device` object for.
+        Device ordinal to return a :obj:`~_device.Device` object for.
         Default value of `None` return the currently used device.

     """
@@ -144,7 +144,7 @@ def compute_capability(self) -> ComputeCapability:
     @property
     @precondition(_check_context_initialized)
     def context(self) -> Context:
-        """Return the current :obj:`Context` associated with this device.
+        """Return the current :obj:`~_context.Context` associated with this device.

         Note
         ----
@@ -157,7 +157,7 @@ def context(self) -> Context:

     @property
     def memory_resource(self) -> MemoryResource:
-        """Return :obj:`MemoryResource` associated with this device."""
+        """Return :obj:`~_memory.MemoryResource` associated with this device."""
         return self._mr

     @memory_resource.setter
@@ -168,7 +168,7 @@ def memory_resource(self, mr):

     @property
     def default_stream(self) -> Stream:
-        """Return default CUDA :obj:`Stream` associated with this device.
+        """Return default CUDA :obj:`~_stream.Stream` associated with this device.

         The type of default stream returned depends on if the environment
         variable CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM is set.
@@ -191,18 +191,18 @@ def set_current(self, ctx: Context = None) -> Union[Context, None]:

         Initializes CUDA and sets the calling thread to a valid CUDA
         context. By default the primary context is used, but optional `ctx`
-        parameter can be used to explicitly supply a :obj:`Context` object.
+        parameter can be used to explicitly supply a :obj:`~_context.Context` object.

         Providing a `ctx` causes the previous set context to be popped and returned.

         Parameters
         ----------
-        ctx : :obj:`Context`, optional
+        ctx : :obj:`~_context.Context`, optional
             Optional context to push onto this device's current thread stack.

         Returns
         -------
-        Union[:obj:`Context`, None], optional
+        Union[:obj:`~_context.Context`, None], optional
             Popped context.

         Examples
@@ -247,20 +247,20 @@ def set_current(self, ctx: Context = None) -> Union[Context, None]:
         self._has_inited = True

     def create_context(self, options: ContextOptions = None) -> Context:
-        """Create a new :obj:`Context` object.
+        """Create a new :obj:`~_context.Context` object.

         Note
         ----
         The newly context will not be set as current.

         Parameters
         ----------
-        options : :obj:`ContextOptions`, optional
+        options : :obj:`~_context.ContextOptions`, optional
             Customizable dataclass for context creation options.

         Returns
         -------
-        :obj:`Context`
+        :obj:`~_context.Context`
             Newly created context object.

         """
@@ -286,12 +286,12 @@ def create_stream(self, obj=None, options: StreamOptions = None) -> Stream:
         ----------
         obj : Any, optional
             Any object supporting the __cuda_stream__ protocol.
-        options : :obj:`StreamOptions`, optional
+        options : :obj:`~_stream.StreamOptions`, optional
             Customizable dataclass for stream creation options.

         Returns
         -------
-        :obj:`Stream`
+        :obj:`~_stream.Stream`
             Newly created stream object.

         """
@@ -314,13 +314,13 @@ def allocate(self, size, stream=None) -> Buffer:
         ----------
         size : int
             Number of bytes to allocate.
-        stream : :obj:`Stream`, optional
+        stream : :obj:`~_stream.Stream`, optional
             The stream establishing the stream ordering semantic.
             Default value of `None` uses default stream.

         Returns
         -------
-        :obj:`Buffer`
+        :obj:`~_memory.Buffer`
             Newly created buffer object.

         """

cuda_core/cuda/core/experimental/_event.py

Lines changed: 3 additions & 3 deletions

@@ -12,7 +12,7 @@

 @dataclass
 class EventOptions:
-    """Customizable :obj:`Event` options.
+    """Customizable :obj:`~_event.Event` options.

     Attributes
     ----------
@@ -46,8 +46,8 @@ class Event:
     of work up to event's record, and help establish dependencies
     between GPU work submissions.

-    Directly creating an :obj:`Event` is not supported due to ambiguity,
-    and they should instead be created through a :obj:`Stream` object.
+    Directly creating an :obj:`~_event.Event` is not supported due to ambiguity,
+    and they should instead be created through a :obj:`~_stream.Stream` object.

     """


cuda_core/cuda/core/experimental/_launcher.py

Lines changed: 30 additions & 10 deletions

@@ -7,6 +7,7 @@
 from typing import Optional, Union

 from cuda import cuda
+from cuda.core.experimental._device import Device
 from cuda.core.experimental._kernel_arg_handler import ParamHolder
 from cuda.core.experimental._module import Kernel
 from cuda.core.experimental._stream import Stream
@@ -38,11 +39,15 @@ class LaunchConfig:
     ----------
     grid : Union[tuple, int]
         Collection of threads that will execute a kernel function.
+    cluster : Union[tuple, int]
+        Group of blocks (Thread Block Cluster) that will execute on the same
+        GPU Processing Cluster (GPC). Blocks within a cluster have access to
+        distributed shared memory and can be explicitly synchronized.
     block : Union[tuple, int]
         Group of threads (Thread Block) that will execute on the same
-        multiprocessor. Threads within a thread blocks have access to
-        shared memory and can be explicitly synchronized.
-    stream : :obj:`Stream`
+        streaming multiprocessor (SM). Threads within a thread blocks have
+        access to shared memory and can be explicitly synchronized.
+    stream : :obj:`~_stream.Stream`
         The stream establishing the stream ordering semantic of a
         launch.
     shmem_size : int, optional
@@ -53,13 +58,22 @@ class LaunchConfig:

     # TODO: expand LaunchConfig to include other attributes
     grid: Union[tuple, int] = None
+    cluster: Union[tuple, int] = None
     block: Union[tuple, int] = None
     stream: Stream = None
     shmem_size: Optional[int] = None

     def __post_init__(self):
+        _lazy_init()
         self.grid = self._cast_to_3_tuple(self.grid)
         self.block = self._cast_to_3_tuple(self.block)
+        # thread block clusters are supported starting H100
+        if self.cluster is not None:
+            if not _use_ex:
+                raise CUDAError("thread block clusters require cuda.bindings & driver 11.8+")
+            if Device().compute_capability < (9, 0):
+                raise CUDAError("thread block clusters are not supported on devices with compute capability < 9.0")
+            self.cluster = self._cast_to_3_tuple(self.cluster)
         # we handle "stream=None" in the launch API
         if self.stream is not None and not isinstance(self.stream, Stream):
             try:
@@ -69,8 +83,6 @@ def __post_init__(self):
         if self.shmem_size is None:
             self.shmem_size = 0

-        _lazy_init()
-
     def _cast_to_3_tuple(self, cfg):
         if isinstance(cfg, int):
             if cfg < 1:
@@ -96,16 +108,16 @@ def _cast_to_3_tuple(self, cfg):


 def launch(kernel, config, *kernel_args):
-    """Launches a :obj:`~cuda.core.experimental._module.Kernel`
+    """Launches a :obj:`~_module.Kernel`
     object with launch-time configuration.

     Parameters
     ----------
-    kernel : :obj:`~cuda.core.experimental._module.Kernel`
+    kernel : :obj:`~_module.Kernel`
         Kernel to launch.
-    config : :obj:`LaunchConfig`
+    config : :obj:`~_launcher.LaunchConfig`
         Launch configurations inline with options provided by
-        :obj:`LaunchConfig` dataclass.
+        :obj:`~_launcher.LaunchConfig` dataclass.
     *kernel_args : Any
         Variable length argument list that is provided to the
         launching kernel.
@@ -133,7 +145,15 @@ def launch(kernel, config, *kernel_args):
         drv_cfg.blockDimX, drv_cfg.blockDimY, drv_cfg.blockDimZ = config.block
         drv_cfg.hStream = config.stream.handle
         drv_cfg.sharedMemBytes = config.shmem_size
-        drv_cfg.numAttrs = 0  # TODO
+        attrs = []  # TODO: support more attributes
+        if config.cluster:
+            attr = cuda.CUlaunchAttribute()
+            attr.id = cuda.CUlaunchAttributeID.CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION
+            dim = attr.value.clusterDim
+            dim.x, dim.y, dim.z = config.cluster
+            attrs.append(attr)
+        drv_cfg.numAttrs = len(attrs)
+        drv_cfg.attrs = attrs
         handle_return(cuda.cuLaunchKernelEx(drv_cfg, int(kernel._handle), args_ptr, 0))
     else:
         # TODO: check if config has any unsupported attrs
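The `__post_init__` above normalizes `grid`, `block`, and (when present) `cluster` through a `_cast_to_3_tuple` helper before launch. A standalone sketch of that normalization, assuming an int `n` becomes `(n, 1, 1)` and shorter tuples are padded with trailing 1s (the usual convention for CUDA launch dimensions):

```python
def cast_to_3_tuple(cfg):
    """Normalize an int or tuple launch dimension into a 3-tuple (sketch)."""
    if isinstance(cfg, int):
        cfg = (cfg,)
    if not isinstance(cfg, tuple) or not 1 <= len(cfg) <= 3:
        raise ValueError(f"expected an int or a tuple of length 1-3, got {cfg!r}")
    if any((not isinstance(d, int)) or d < 1 for d in cfg):
        raise ValueError("all dimensions must be positive integers")
    # pad missing dimensions with 1 so the driver always sees (x, y, z)
    return cfg + (1,) * (3 - len(cfg))


print(cast_to_3_tuple(256))     # (256, 1, 1)
print(cast_to_3_tuple((4, 2)))  # (4, 2, 1)
```

This mirrors why `config.grid`, `config.block`, and `config.cluster` can each be unpacked into three driver-config fields (`gridDimX/Y/Z`, `blockDimX/Y/Z`, `clusterDim.x/y/z`) in the `launch` path.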

cuda_core/cuda/core/experimental/_memory.py

Lines changed: 3 additions & 3 deletions

@@ -37,7 +37,7 @@ class Buffer:
         Allocated buffer handle object
     size : Any
         Memory size of the buffer
-    mr : :obj:`MemoryResource`, optional
+    mr : :obj:`~_memory.MemoryResource`, optional
         Memory resource associated with the buffer

     """
@@ -126,7 +126,7 @@ def copy_to(self, dst: Buffer = None, *, stream) -> Buffer:

         Parameters
         ----------
-        dst : :obj:`Buffer`
+        dst : :obj:`~_memory.Buffer`
             Source buffer to copy data from
         stream : Any
             Keyword argument specifying the stream for the
@@ -149,7 +149,7 @@ def copy_from(self, src: Buffer, *, stream):

         Parameters
         ----------
-        src : :obj:`Buffer`
+        src : :obj:`~_memory.Buffer`
             Source buffer to copy data from
         stream : Any
             Keyword argument specifying the stream for the
