
Commit 2279bda

Merge branch 'main' into path_finder_review1

2 parents bdfc6a7 + 19df0d9

File tree

10 files changed (+199, -63 lines)


CONTRIBUTING.md

Lines changed: 26 additions & 0 deletions
@@ -12,3 +12,29 @@ Thank you for your interest in contributing to CUDA Python! Based on the type of
 - Please refer to each component's guideline:
   - [`cuda.core`](https://nvidia.github.io/cuda-python/cuda-core/latest/contribute.html)
   - [`cuda.bindings`](https://nvidia.github.io/cuda-python/cuda-bindings/latest/contribute.html)
+
+## Pre-commit
+This project uses [pre-commit.ci](https://pre-commit.ci/) with GitHub Actions. All pull requests are automatically checked for pre-commit compliance, and any pre-commit failures block merging until resolved.
+
+To run pre-commit checks locally and catch issues before pushing your changes, follow these steps:
+
+* Install pre-commit: `pip install pre-commit`
+* Check all files at any time by running: `pre-commit run --all-files`
+
+This command runs all configured hooks (such as linters and formatters) across the repository, letting you review and address issues before committing.
+
+**Optional: Enable automatic checks on every commit**
+If you want pre-commit hooks to run automatically on each commit, install the git hook with:
+
+`pre-commit install`
+
+This sets up a git pre-commit hook so that all configured checks run before each commit is accepted. If any hook fails, the commit is blocked until the issues are resolved.
+
+**Note on workflow flexibility**
+Some contributors prefer to commit intermediate or work-in-progress changes that may not pass all pre-commit checks, and only clean up their commits before pushing (for example, by squashing and running `pre-commit run --all-files` manually at the end). If this fits your workflow, you may skip `pre-commit install` and rely on manual checks instead. This avoids disruption during iterative development while still ensuring code quality before changes are shared or merged.
+
+Choose the setup that best fits your workflow and development style.
+
+## Code signing
+
+This repository implements a security check to prevent the CI system from running untrusted code. Part of this check verifies that git commits are signed. Please ensure that your commits are signed, [following GitHub's instructions](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification).
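For contributors who prefer the manual workflow described in the note above, a small helper can make the final `pre-commit run --all-files` pass harder to forget. The snippet below is only an illustrative sketch, not part of this commit or the repository; it assumes `pre-commit` is already installed and simply forwards its exit code.

```python
# check_hooks.py -- hypothetical helper: run all configured pre-commit hooks
# across the repository and exit with pre-commit's own status code.
import subprocess
import sys

result = subprocess.run(["pre-commit", "run", "--all-files"])
sys.exit(result.returncode)
```

Running this by hand before opening a pull request is equivalent to invoking `pre-commit run --all-files` directly; the wrapper just gives you one place to add further project-specific checks later.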

cuda_bindings/DESCRIPTION.rst

Lines changed: 9 additions & 2 deletions
@@ -1,7 +1,14 @@
+.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE
+
 ****************************************
-cuda.bindings: Low-level CUDA interfaces
+cuda-bindings: Low-level CUDA interfaces
 ****************************************
 
-`cuda.bindings` is a standard set of low-level interfaces, providing full coverage of and 1:1 access to the CUDA host APIs from Python. Checkout the `Overview <https://nvidia.github.io/cuda-python/cuda-bindings/latest/overview.html>`_ for the workflow and performance results.
+`cuda.bindings <https://nvidia.github.io/cuda-python/cuda-bindings/>`_ is a standard set of low-level interfaces, providing full coverage of and 1:1 access to the CUDA host APIs from Python. Check out the `Overview <https://nvidia.github.io/cuda-python/cuda-bindings/latest/overview.html>`_ for the workflow and performance results.
+
+* `Repository <https://github.com/NVIDIA/cuda-python/tree/main/cuda_bindings>`_
+* `Documentation <https://nvidia.github.io/cuda-python/cuda-bindings/>`_
+* `Examples <https://github.com/NVIDIA/cuda-python/tree/main/cuda_bindings/examples>`_
+* `Issue tracker <https://github.com/NVIDIA/cuda-python/issues/>`_
 
 For installation instructions, please refer to the `Installation <https://nvidia.github.io/cuda-python/cuda-bindings/latest/install.html>`_ page.
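To make the "1:1 access to the CUDA host APIs" claim above concrete, here is a minimal sketch of calling the driver API through `cuda.bindings`. It is not part of this commit; it assumes a recent `cuda-bindings` release where the driver bindings are exposed as `cuda.bindings.driver` and each call returns an `(error, result...)` tuple.

```python
# Minimal driver-API call through the low-level bindings (illustrative sketch).
from cuda.bindings import driver

(err,) = driver.cuInit(0)                   # initialize the driver API
assert err == driver.CUresult.CUDA_SUCCESS

err, version = driver.cuDriverGetVersion()  # e.g. 12040 for CUDA 12.4
assert err == driver.CUresult.CUDA_SUCCESS
print(f"Driver API version: {version // 1000}.{(version % 1000) // 10}")
```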

cuda_bindings/README.md

Lines changed: 1 addition & 23 deletions
@@ -2,35 +2,13 @@
 
 `cuda.bindings` is a standard set of low-level interfaces, providing full coverage of and access to the CUDA host APIs from Python. Check out the [Overview page](https://nvidia.github.io/cuda-python/cuda-bindings/latest/overview.html) for the workflow and performance results.
 
-`cuda.bindings` is a subpackage of `cuda-python`.
-
 ## Installing
 
 Please refer to the [Installation page](https://nvidia.github.io/cuda-python/cuda-bindings/latest/install.html) for instructions and required/optional dependencies.
 
 ## Developing
 
-We use `pre-commit` to manage various tools to help development and ensure consistency.
-```shell
-pip install pre-commit
-```
-
-### Code linting
-
-Run this command before checking in the code changes
-```shell
-pre-commit run -a --show-diff-on-failure
-```
-to ensure the code formatting is in line of the requirements (as listed in [`pyproject.toml`](./pyproject.toml)).
-
-### Code signing
-
-This repository implements a security check to prevent the CI system from running untrusted code. A part of the
-security check consists of checking if the git commits are signed. See
-[here](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/faqs/#why-did-i-receive-a-comment-that-my-pull-request-requires-additional-validation)
-and
-[here](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification)
-for more details, including how to sign your commits.
+This subpackage adheres to the development practices described in the parent metapackage's [CONTRIBUTING.md](https://github.com/NVIDIA/cuda-python/blob/main/CONTRIBUTING.md).
 
 ## Testing
 

cuda_core/DESCRIPTION.rst

Lines changed: 4 additions & 7 deletions
@@ -1,11 +1,10 @@
+.. SPDX-License-Identifier: Apache-2.0
+
 *******************************************************
 cuda-core: Pythonic access to CUDA core functionalities
 *******************************************************
 
-`cuda.core <https://nvidia.github.io/cuda-python/cuda-core/>`_ bridges Python's productivity
-with CUDA's performance through intuitive and pythonic APIs.
-The mission is to provide users full access to all of the core CUDA features in Python,
-such as runtime control, compiler and linker.
+`cuda.core <https://nvidia.github.io/cuda-python/cuda-core/>`_ bridges Python's productivity with CUDA's performance through intuitive and pythonic APIs. The mission is to provide users full access to all of the core CUDA features in Python, such as runtime control, compiler and linker.
 
 * `Repository <https://github.com/NVIDIA/cuda-python/tree/main/cuda_core>`_
 * `Documentation <https://nvidia.github.io/cuda-python/cuda-core/>`_
@@ -22,6 +21,4 @@ Installation
 
    pip install cuda-core[cu12]
 
-Please refer to the `installation instructions
-<https://nvidia.github.io/cuda-python/cuda-core/latest/install.html>`_ for different
-ways of installing `cuda.core`, including building from source.
+Please refer to the `installation instructions <https://nvidia.github.io/cuda-python/cuda-core/latest/install.html>`_ for different ways of installing `cuda.core`, including building from source.
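As a taste of the "runtime control, compiler and linker" access mentioned above, the sketch below compiles a trivial kernel with `cuda.core`. It is not part of this commit; it only reuses the `Device`/`Program`/`ProgramOptions` calls exercised by the `pytorch_example.py` added later in this commit, and assumes an NVIDIA GPU plus the `pip install cuda-core[cu12]` shown above.

```python
# Compile-only sketch of the cuda.core JIT path (illustrative, not repository code).
from cuda.core.experimental import Device, Program, ProgramOptions

code = r"""
extern "C" __global__ void noop() {}
"""

dev = Device()
dev.set_current()

# Target the current GPU's architecture, e.g. "sm_90".
arch = "".join(str(i) for i in dev.compute_capability)
prog = Program(code, code_type="c++", options=ProgramOptions(std="c++11", arch=f"sm_{arch}"))
mod = prog.compile("cubin")
print("Compiled kernel:", mod.get_kernel("noop"))
```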

cuda_core/README.md

Lines changed: 2 additions & 28 deletions
@@ -4,37 +4,11 @@ Currently under active development; see [the documentation](https://nvidia.githu
 
 ## Installing
 
-To build from source, just do:
-```shell
-$ git clone https://github.com/NVIDIA/cuda-python
-$ cd cuda-python/cuda_core # move to the directory where this README locates
-$ pip install .
-```
-For now `cuda-python` is a required dependency.
+Please refer to the [Installation page](https://nvidia.github.io/cuda-python/cuda-bindings/latest/install.html) for instructions and required/optional dependencies.
 
 ## Developing
 
-We use `pre-commit` to manage various tools to help development and ensure consistency.
-```shell
-pip install pre-commit
-```
-
-### Code linting
-
-Run this command before checking in the code changes
-```shell
-pre-commit run -a --show-diff-on-failure
-```
-to ensure the code formatting is in line of the requirements (as listed in [`pyproject.toml`](./pyproject.toml)).
-
-### Code signing
-
-This repository implements a security check to prevent the CI system from running untrusted code. A part of the
-security check consists of checking if the git commits are signed. See
-[here](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/faqs/#why-did-i-receive-a-comment-that-my-pull-request-requires-additional-validation)
-and
-[here](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification)
-for more details, including how to sign your commits.
+This subpackage adheres to the development practices described in the parent metapackage's [CONTRIBUTING.md](https://github.com/NVIDIA/cuda-python/blob/main/CONTRIBUTING.md).
 
 ## Testing
 
cuda_core/examples/pytorch_example.py

Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. ALL RIGHTS RESERVED.
+#
+# SPDX-License-Identifier: Apache-2.0
+
+## Usage: pip install "cuda-core[cu12]"
+##        python pytorch_example.py
+import sys
+
+import torch
+
+from cuda.core.experimental import Device, LaunchConfig, Program, ProgramOptions, launch
+
+# SAXPY kernel - passing a as a pointer to avoid any type issues
+code = """
+template<typename T>
+__global__ void saxpy_kernel(const T* a, const T* x, const T* y, T* out, size_t N) {
+    const unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
+    if (tid < N) {
+        // Dereference a to get the scalar value
+        out[tid] = (*a) * x[tid] + y[tid];
+    }
+}
+"""
+
+dev = Device()
+dev.set_current()
+
+# Get PyTorch's current stream
+pt_stream = torch.cuda.current_stream()
+print(f"PyTorch stream: {pt_stream}")
+
+
+# Create a wrapper class that implements __cuda_stream__
+class PyTorchStreamWrapper:
+    def __init__(self, pt_stream):
+        self.pt_stream = pt_stream
+
+    def __cuda_stream__(self):
+        stream_id = self.pt_stream.cuda_stream
+        return (0, stream_id)  # Return format required by CUDA Python
+
+
+s = PyTorchStreamWrapper(pt_stream)
+
+# prepare program
+arch = "".join(f"{i}" for i in dev.compute_capability)
+program_options = ProgramOptions(std="c++11", arch=f"sm_{arch}")
+prog = Program(code, code_type="c++", options=program_options)
+mod = prog.compile(
+    "cubin",
+    logs=sys.stdout,
+    name_expressions=("saxpy_kernel<float>", "saxpy_kernel<double>"),
+)
+
+# Run in single precision
+ker = mod.get_kernel("saxpy_kernel<float>")
+dtype = torch.float32
+
+# prepare input/output
+size = 64
+# Use a single element tensor for 'a'
+a = torch.tensor([10.0], dtype=dtype, device="cuda")
+x = torch.rand(size, dtype=dtype, device="cuda")
+y = torch.rand(size, dtype=dtype, device="cuda")
+out = torch.empty_like(x)
+
+# prepare launch
+block = 32
+grid = int((size + block - 1) // block)
+config = LaunchConfig(grid=grid, block=block)
+ker_args = (a.data_ptr(), x.data_ptr(), y.data_ptr(), out.data_ptr(), size)
+
+# launch kernel on our stream
+launch(s, config, ker, *ker_args)
+
+# check result
+assert torch.allclose(out, a.item() * x + y)
+print("Single precision test passed!")
+
+# let's repeat again with double precision
+ker = mod.get_kernel("saxpy_kernel<double>")
+dtype = torch.float64
+
+# prepare input
+size = 128
+# Use a single element tensor for 'a'
+a = torch.tensor([42.0], dtype=dtype, device="cuda")
+x = torch.rand(size, dtype=dtype, device="cuda")
+y = torch.rand(size, dtype=dtype, device="cuda")
+
+# prepare output
+out = torch.empty_like(x)
+
+# prepare launch
+block = 64
+grid = int((size + block - 1) // block)
+config = LaunchConfig(grid=grid, block=block)
+ker_args = (a.data_ptr(), x.data_ptr(), y.data_ptr(), out.data_ptr(), size)
+
+# launch kernel on PyTorch's stream
+launch(s, config, ker, *ker_args)
+
+# check result
+assert torch.allclose(out, a * x + y)
+print("Double precision test passed!")
+print("All tests passed successfully!")

cuda_core/examples/saxpy.py

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@
                size_t N) {
     const unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
     for (size_t i=tid; i<N; i+=gridDim.x*blockDim.x) {
-        out[tid] = a * x[tid] + y[tid];
+        out[i] = a * x[i] + y[i];
     }
 }
 """

cuda_core/tests/example_tests/utils.py

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ def run_example(samples_path, filename, env=None):
         exec(script, env if env else {})  # nosec B102
     except ImportError as e:
         # for samples requiring any of optional dependencies
-        for m in ("cupy",):
+        for m in ("cupy", "torch"):
             if f"No module named '{m}'" in str(e):
                 pytest.skip(f"{m} not installed, skipping related tests")
                 break

cuda_python/DESCRIPTION.rst

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE
+
+**************************************************************
+cuda-python: Metapackage collection of CUDA Python subpackages
+**************************************************************
+
+CUDA Python is the home for accessing NVIDIA's CUDA platform from Python. It consists of multiple components:
+
+* `cuda.core <https://nvidia.github.io/cuda-python/cuda-core/latest>`_: Pythonic access to CUDA Runtime and other core functionalities
+* `cuda.bindings <https://nvidia.github.io/cuda-python/cuda-bindings/latest>`_: Low-level Python bindings to CUDA C APIs
+* `cuda.cooperative <https://nvidia.github.io/cccl/cuda_cooperative/>`_: A Python package providing CCCL's reusable block-wide and warp-wide *device* primitives for use within Numba CUDA kernels
+* `cuda.parallel <https://nvidia.github.io/cccl/cuda_parallel/>`_: A Python package for easy access to CCCL's highly efficient and customizable parallel algorithms, like `sort`, `scan`, `reduce`, and `transform`, that are callable on the *host*
+* `numba.cuda <https://nvidia.github.io/numba-cuda/>`_: Numba's target for CUDA GPU programming, which directly compiles a restricted subset of Python code into CUDA kernels and device functions following the CUDA execution model
+
+For access to NVIDIA CPU & GPU Math Libraries, please refer to `nvmath-python <https://docs.nvidia.com/cuda/nvmath-python/latest>`_.
+
+CUDA Python is currently undergoing an overhaul to improve existing components and bring up new ones. All of the previously available functionality from the `cuda-python` package will continue to be available; please refer to the `cuda.bindings <https://nvidia.github.io/cuda-python/cuda-bindings/latest>`_ documentation for the installation guide and further details.
+
+cuda-python as a metapackage
+============================
+
+`cuda-python` is now a metapackage that contains a collection of subpackages. Each subpackage is versioned independently, allowing installation of each component as needed.
+
+Subpackage: cuda.core
+---------------------
+
+The `cuda.core` package offers idiomatic, pythonic access to CUDA Runtime and other functionalities.
+
+The goals are to:
+
+1. Provide **idiomatic ("pythonic")** access to CUDA Driver, Runtime, and JIT compiler toolchain
+2. Focus on **developer productivity** by ensuring end-to-end CUDA development can be performed quickly and entirely in Python
+3. **Avoid homegrown** Python abstractions for CUDA for new Python GPU libraries starting from scratch
+4. **Ease** the developer **burden of maintaining** and catching up with the latest CUDA features
+5. **Flatten the learning curve** for current and future generations of CUDA developers
+
+Subpackage: cuda.bindings
+-------------------------
+
+The `cuda.bindings` package is a standard set of low-level interfaces, providing full coverage of and access to the CUDA host APIs from Python.
+
+The available interfaces are:
+
+* CUDA Driver
+* CUDA Runtime
+* NVRTC
+* nvJitLink
+* NVVM
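A small sketch (not part of this commit) of the "versioned independently" point above: each subpackage ships as its own distribution, so its version can be queried on its own through standard packaging metadata.

```python
# Print the installed version of each distribution named in this description.
from importlib.metadata import PackageNotFoundError, version

for dist in ("cuda-python", "cuda-core", "cuda-bindings"):
    try:
        print(f"{dist}: {version(dist)}")
    except PackageNotFoundError:
        print(f"{dist}: not installed")
```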

cuda_python/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "cuda-python"
 description = "CUDA Python: Performance meets Productivity"
-readme = {file = "README.md", content-type = "text/markdown"}
+readme = {file = "DESCRIPTION.rst", content-type = "text/x-rst"}
 authors = [{name = "NVIDIA Corporation", email = "[email protected]"},]
 license = "LicenseRef-NVIDIA-SOFTWARE-LICENSE"
 classifiers = [
