cuda clang: Add support for CUDA surfaces #132883


Merged: 7 commits into llvm:main, Apr 3, 2025
Conversation

AustinSchuh (Contributor) commented Mar 25, 2025

This adds support for all the surface read and write calls to clang. It extends the pattern used for textures to surfaces too.

I tested this by generating all the various permutations of the calls and argument types in a Python script, compiling them with both clang and nvcc, and comparing the generated PTX for equivalence. They all agree, ignoring register allocation and some places where Clang picks different memory write instructions. An example kernel is:

__global__ void testKernel(cudaSurfaceObject_t surfObj, int x, float2* result) {
    *result = surf1Dread<float2>(surfObj, x, cudaBoundaryModeZero);
}
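The testing approach described above enumerates every (call, element type, boundary mode) combination. A minimal sketch of that enumeration (the author's actual script is in Python and covers the full surface API; the reduced type list and the kernel-naming scheme here are assumptions for illustration only):

```cpp
#include <string>
#include <vector>

// Emit one test kernel per (element type, boundary mode) pair for the
// 1D read call. The real script also varies the call (1D/2D/3D/layered/
// cubemap, read and write) and uses the full set of element types.
std::string makeKernels() {
    const std::vector<std::string> types = {"int", "float", "float2"};
    const std::vector<std::string> modes = {
        "cudaBoundaryModeZero", "cudaBoundaryModeClamp", "cudaBoundaryModeTrap"};
    std::string src;
    int n = 0;
    for (const std::string &t : types)
        for (const std::string &m : modes)
            src += "__global__ void k" + std::to_string(n++) +
                   "(cudaSurfaceObject_t s, int x, " + t + " *out) {\n"
                   "    *out = surf1Dread<" + t + ">(s, x, " + m + ");\n"
                   "}\n";
    return src;
}
```

Each generated file is then compiled with both clang and nvcc, and the resulting PTX compared modulo register allocation.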


Signed-off-by: Austin Schuh <[email protected]>

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository; in that case, you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by pinging it with a comment saying "Ping". The common courtesy ping rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot llvmbot added clang Clang issues not falling into any other category backend:X86 clang:headers Headers provided by Clang, e.g. for intrinsics labels Mar 25, 2025
llvmbot (Member) commented Mar 25, 2025

@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-clang

Author: Austin Schuh (AustinSchuh)

Changes

This adds support for all the surface read and write calls to clang. It extends the pattern used for textures to surfaces too.

I tested this by generating all the various permutations of the calls and argument types in a Python script, compiling them with both clang and nvcc, and comparing the generated PTX for equivalence. They all agree, ignoring register allocation and some places where Clang picks different memory write instructions. An example kernel is:

__global__ void testKernel(cudaSurfaceObject_t surfObj, int x, float2* result) {
    *result = surf1Dread<float2>(surfObj, x, cudaBoundaryModeZero);
}

Patch is 28.37 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/132883.diff

2 Files Affected:

  • (modified) clang/lib/Headers/__clang_cuda_runtime_wrapper.h (+1)
  • (modified) clang/lib/Headers/__clang_cuda_texture_intrinsics.h (+417-2)
diff --git a/clang/lib/Headers/__clang_cuda_runtime_wrapper.h b/clang/lib/Headers/__clang_cuda_runtime_wrapper.h
index d369c86fe1064..8182c961ec32f 100644
--- a/clang/lib/Headers/__clang_cuda_runtime_wrapper.h
+++ b/clang/lib/Headers/__clang_cuda_runtime_wrapper.h
@@ -386,6 +386,7 @@ __host__ __device__ void __nv_tex_surf_handler(const char *name, T *ptr,
 #endif // __cplusplus >= 201103L && CUDA_VERSION >= 9000
 #include "texture_fetch_functions.h"
 #include "texture_indirect_functions.h"
+#include "surface_indirect_functions.h"
 
 // Restore state of __CUDA_ARCH__ and __THROW we had on entry.
 #pragma pop_macro("__CUDA_ARCH__")
diff --git a/clang/lib/Headers/__clang_cuda_texture_intrinsics.h b/clang/lib/Headers/__clang_cuda_texture_intrinsics.h
index a71952211237b..2ea83f66036d4 100644
--- a/clang/lib/Headers/__clang_cuda_texture_intrinsics.h
+++ b/clang/lib/Headers/__clang_cuda_texture_intrinsics.h
@@ -28,6 +28,7 @@
 #pragma push_macro("__Args")
 #pragma push_macro("__ID")
 #pragma push_macro("__IDV")
+#pragma push_macro("__OP_TYPE_SURFACE")
 #pragma push_macro("__IMPL_2DGATHER")
 #pragma push_macro("__IMPL_ALIAS")
 #pragma push_macro("__IMPL_ALIASI")
@@ -45,6 +46,64 @@
 #pragma push_macro("__IMPL_SI")
 #pragma push_macro("__L")
 #pragma push_macro("__STRIP_PARENS")
+#pragma push_macro("__SURF_WRITE_V2")
+#pragma push_macro("__SW_ASM_ARGS")
+#pragma push_macro("__SW_ASM_ARGS1")
+#pragma push_macro("__SW_ASM_ARGS2")
+#pragma push_macro("__SW_ASM_ARGS4")
+#pragma push_macro("__SURF_WRITE_V2")
+#pragma push_macro("__SURF_READ_V2")
+#pragma push_macro("__SW_ASM_ARGS")
+#pragma push_macro("__SW_ASM_ARGS1")
+#pragma push_macro("__SW_ASM_ARGS2")
+#pragma push_macro("__SW_ASM_ARGS4")
+#pragma push_macro("__SURF_READ1D");
+#pragma push_macro("__SURF_READ2D");
+#pragma push_macro("__SURF_READ3D");
+#pragma push_macro("__SURF_READ1DLAYERED");
+#pragma push_macro("__SURF_READ2DLAYERED");
+#pragma push_macro("__SURF_READCUBEMAP");
+#pragma push_macro("__SURF_READCUBEMAPLAYERED");
+#pragma push_macro("__1DV1");
+#pragma push_macro("__1DV2");
+#pragma push_macro("__1DV4");
+#pragma push_macro("__2DV1");
+#pragma push_macro("__2DV2");
+#pragma push_macro("__2DV4");
+#pragma push_macro("__1DLAYERV1");
+#pragma push_macro("__1DLAYERV2");
+#pragma push_macro("__1DLAYERV4");
+#pragma push_macro("__3DV1");
+#pragma push_macro("__3DV2");
+#pragma push_macro("__3DV4");
+#pragma push_macro("__2DLAYERV1");
+#pragma push_macro("__2DLAYERV2");
+#pragma push_macro("__2DLAYERV4");
+#pragma push_macro("__CUBEMAPV1");
+#pragma push_macro("__CUBEMAPV2");
+#pragma push_macro("__CUBEMAPV4");
+#pragma push_macro("__CUBEMAPLAYERV1");
+#pragma push_macro("__CUBEMAPLAYERV2");
+#pragma push_macro("__CUBEMAPLAYERV4");
+#pragma push_macro("__SURF_READXD_ALL");
+#pragma push_macro("__SURF_WRITE1D_V2");
+#pragma push_macro("__SURF_WRITE1DLAYERED_V2");
+#pragma push_macro("__SURF_WRITE2D_V2");
+#pragma push_macro("__SURF_WRITE2DLAYERED_V2");
+#pragma push_macro("__SURF_WRITE3D_V2");
+#pragma push_macro("__SURF_CUBEMAPWRITE_V2");
+#pragma push_macro("__SURF_CUBEMAPLAYEREDWRITE_V2");
+#pragma push_macro("__SURF_WRITEXD_V2_ALL");
+#pragma push_macro("__1DV1");
+#pragma push_macro("__1DV2");
+#pragma push_macro("__1DV4");
+#pragma push_macro("__2DV1");
+#pragma push_macro("__2DV2");
+#pragma push_macro("__2DV4");
+#pragma push_macro("__3DV1");
+#pragma push_macro("__3DV2");
+#pragma push_macro("__3DV4");
+
 
 // Put all functions into anonymous namespace so they have internal linkage.
 // The device-only function here must be internal in order to avoid ODR
@@ -186,6 +245,20 @@ template <class __T> struct __TypeInfoT {
   using __fetch_t = typename __TypeInfoT<__base_t>::__fetch_t;
 };
 
+// Tag structs to distinguish operation types
+struct __texture_op_tag {};
+struct __surface_op_tag {};
+
+// Template specialization to determine operation type based on tag value
+template <class __op>
+struct __op_type_traits {
+  using type = __texture_op_tag;
+};
+
+// Specialize for known surface operation tags
+#define __OP_TYPE_SURFACE(__op) \
+    template <> struct __op_type_traits<__op> { using type = __surface_op_tag; };
+
 // Classes that implement specific texture ops.
 template <class __op> struct __tex_fetch_v4;
 
@@ -649,6 +722,283 @@ template <class __DestT, class __SrcT> struct __convert {
   }
 };
 
+// There are a couple of layers here.  First, __op_type_traits is used to dispatch to either surface write calls, or to
+// the texture read calls.
+//
+// Then, that dispatches to __tex_fetch_impl below, which dispatches by both tag and datatype to the appropriate
+// __surf_read_write_v2.
+// TODO(austin): Do the reads too.
+
+// Mark which of the ids we should be dispatching to surface write calls.
+__OP_TYPE_SURFACE(__ID("__isurf1Dread"));
+__OP_TYPE_SURFACE(__ID("__isurf2Dread"));
+__OP_TYPE_SURFACE(__ID("__isurf3Dread"));
+__OP_TYPE_SURFACE(__ID("__isurf1DLayeredread"));
+__OP_TYPE_SURFACE(__ID("__isurf2DLayeredread"));
+__OP_TYPE_SURFACE(__ID("__isurfCubemapread"));
+__OP_TYPE_SURFACE(__ID("__isurfCubemapLayeredread"));
+__OP_TYPE_SURFACE(__ID("__isurf1Dwrite_v2"));
+__OP_TYPE_SURFACE(__ID("__isurf2Dwrite_v2"));
+__OP_TYPE_SURFACE(__ID("__isurf3Dwrite_v2"));
+__OP_TYPE_SURFACE(__ID("__isurf1DLayeredwrite_v2"));
+__OP_TYPE_SURFACE(__ID("__isurf2DLayeredwrite_v2"));
+__OP_TYPE_SURFACE(__ID("__isurfCubemapwrite_v2"));
+__OP_TYPE_SURFACE(__ID("__isurfCubemapLayeredwrite_v2"));
+
+template <class __op, typename __type>
+struct __surf_read_write_v2;
+
+// For the various write calls, we need to be able to generate variations with different IDs, different numbers of
+// arguments, and different numbers of outputs.
+
+#define __SURF_WRITE_V2(__op, __asm_dim, __asmtype, __type, __index_op_args, __index_args, __index_asm_args,          \
+                        __asm_op_args, __asm_args)                                                                    \
+    template <>                                                                                                       \
+    struct __surf_read_write_v2<__op, __type> {                                                                       \
+        static __device__ void __run(__type *__ptr, cudaSurfaceObject_t obj, __L(__index_args),                       \
+                                     cudaSurfaceBoundaryMode mode) {                                                  \
+            switch (mode) {                                                                                           \
+                case cudaBoundaryModeZero:                                                                            \
+                    asm volatile("sust.b." __asm_dim "." __asmtype ".zero [%0, " __index_op_args "], " __asm_op_args  \
+                                 ";"                                                                                  \
+                                 :                                                                                    \
+                                 : "l"(obj), __L(__index_asm_args), __L(__asm_args));                                 \
+                    break;                                                                                            \
+                case cudaBoundaryModeClamp:                                                                           \
+                    asm volatile("sust.b." __asm_dim "." __asmtype ".clamp [%0, " __index_op_args "], " __asm_op_args \
+                                 ";"                                                                                  \
+                                 :                                                                                    \
+                                 : "l"(obj), __L(__index_asm_args), __L(__asm_args));                                 \
+                    break;                                                                                            \
+                case cudaBoundaryModeTrap:                                                                            \
+                    asm volatile("sust.b." __asm_dim "." __asmtype ".trap [%0, " __index_op_args "], " __asm_op_args  \
+                                 ";"                                                                                  \
+                                 :                                                                                    \
+                                 : "l"(obj), __L(__index_asm_args), __L(__asm_args));                                 \
+                    break;                                                                                            \
+            }                                                                                                         \
+        }                                                                                                             \
+    }
+
+#define __SURF_READ_V2(__op, __asm_dim, __asmtype, __type, __asm_op_args, __asm_args, __index_args, __index_asm_args) \
+    template <>                                                                                                       \
+    struct __surf_read_write_v2<__op, __type> {                                                                       \
+        static __device__ void __run(__type *__ptr, cudaSurfaceObject_t obj, __L(__index_args),                       \
+                                     cudaSurfaceBoundaryMode mode) {                                                  \
+            switch (mode) {                                                                                           \
+                case cudaBoundaryModeZero:                                                                            \
+                    asm("suld.b." __asm_dim "." __asmtype ".zero " __asm_op_args ";"                                  \
+                        : __L(__asm_args)                                                                             \
+                        : "l"(obj), __L(__index_asm_args));                                                           \
+                    break;                                                                                            \
+                case cudaBoundaryModeClamp:                                                                           \
+                    asm("suld.b." __asm_dim "." __asmtype ".clamp " __asm_op_args ";"                                 \
+                        : __L(__asm_args)                                                                             \
+                        : "l"(obj), __L(__index_asm_args));                                                           \
+                    break;                                                                                            \
+                case cudaBoundaryModeTrap:                                                                            \
+                    asm("suld.b." __asm_dim "." __asmtype ".trap " __asm_op_args ";"                                  \
+                        : __L(__asm_args)                                                                             \
+                        : "l"(obj), __L(__index_asm_args));                                                           \
+                    break;                                                                                            \
+            }                                                                                                         \
+        }                                                                                                             \
+    }
+
+// Amazing, the read side should follow the same flow, I just need to change the generated assembly calls, and the rest
+// should fall in line.
+
+#define __SW_ASM_ARGS(__type) (__type(*__ptr))
+#define __SW_ASM_ARGS1(__type) (__type(__ptr->x))
+#define __SW_ASM_ARGS2(__type) (__type(__ptr->x), __type(__ptr->y))
+#define __SW_ASM_ARGS4(__type) (__type(__ptr->x), __type(__ptr->y), __type(__ptr->z), __type(__ptr->w))
+
+#define __SURF_READ1D(__asmtype, __type, __asm_op_args, __asm_args) \
+    __SURF_READ_V2(__ID("__isurf1Dread"), "1d", __asmtype, __type, __asm_op_args, __asm_args, (int x), ("r"(x)))
+#define __SURF_READ2D(__asmtype, __type, __asm_op_args, __asm_args)                                           \
+    __SURF_READ_V2(__ID("__isurf2Dread"), "2d", __asmtype, __type, __asm_op_args, __asm_args, (int x, int y), \
+                   ("r"(x), "r"(y)))
+#define __SURF_READ3D(__asmtype, __type, __asm_op_args, __asm_args)                                                  \
+    __SURF_READ_V2(__ID("__isurf3Dread"), "3d", __asmtype, __type, __asm_op_args, __asm_args, (int x, int y, int z), \
+                   ("r"(x), "r"(y), "r"(z)))
+
+#define __SURF_READ1DLAYERED(__asmtype, __type, __asm_op_args, __asm_args)                            \
+    __SURF_READ_V2(__ID("__isurf1DLayeredread"), "a1d", __asmtype, __type, __asm_op_args, __asm_args, \
+                   (int x, int layer), ("r"(x), "r"(layer)))
+#define __SURF_READ2DLAYERED(__asmtype, __type, __asm_op_args, __asm_args)                            \
+    __SURF_READ_V2(__ID("__isurf2DLayeredread"), "a2d", __asmtype, __type, __asm_op_args, __asm_args, \
+                   (int x, int y, int layer), ("r"(x), "r"(y), "r"(layer)))
+#define __SURF_READCUBEMAP(__asmtype, __type, __asm_op_args, __asm_args)                            \
+    __SURF_READ_V2(__ID("__isurfCubemapread"), "a2d", __asmtype, __type, __asm_op_args, __asm_args, \
+                   (int x, int y, int face), ("r"(x), "r"(y), "r"(face)))
+#define __SURF_READCUBEMAPLAYERED(__asmtype, __type, __asm_op_args, __asm_args)                            \
+    __SURF_READ_V2(__ID("__isurfCubemapLayeredread"), "a2d", __asmtype, __type, __asm_op_args, __asm_args, \
+                   (int x, int y, int layerface), ("r"(x), "r"(y), "r"(layerface)))
+
+#define __1DV1 "{%0}, [%1, {%2}]"
+#define __1DV2 "{%0, %1}, [%2, {%3}]"
+#define __1DV4 "{%0, %1, %2, %3}, [%4, {%5}]"
+
+#define __2DV1 "{%0}, [%1, {%2, %3}]"
+#define __2DV2 "{%0, %1}, [%2, {%3, %4}]"
+#define __2DV4 "{%0, %1, %2, %3}, [%4, {%5, %6}]"
+
+#define __1DLAYERV1 "{%0}, [%1, {%3, %2}]"
+#define __1DLAYERV2 "{%0, %1}, [%2, {%4, %3}]"
+#define __1DLAYERV4 "{%0, %1, %2, %3}, [%4, {%6, %5}]"
+
+#define __3DV1 "{%0}, [%1, {%2, %3, %4, %4}]"
+#define __3DV2 "{%0, %1}, [%2, {%3, %4, %5, %5}]"
+#define __3DV4 "{%0, %1, %2, %4}, [%4, {%5, %6, %7, %7}]"
+
+#define __2DLAYERV1 "{%0}, [%1, {%4, %2, %3, %3}]"
+#define __2DLAYERV2 "{%0, %1}, [%2, {%5, %3, %4, %4}]"
+#define __2DLAYERV4 "{%0, %1, %2, %3}, [%4, {%7, %5, %6, %6}]"
+
+#define __CUBEMAPV1 "{%0}, [%1, {%4, %2, %3, %3}]"
+#define __CUBEMAPV2 "{%0, %1}, [%2, {%5, %3, %4, %4}]"
+#define __CUBEMAPV4 "{%0, %1, %2, %3}, [%4, {%7, %5, %6, %6}]"
+
+#define __CUBEMAPLAYERV1 "{%0}, [%1, {%4, %2, %3, %3}]"
+#define __CUBEMAPLAYERV2 "{%0, %1}, [%2, {%5, %3, %4, %4}]"
+#define __CUBEMAPLAYERV4 "{%0, %1, %2, %3}, [%4, {%7, %5, %6, %6}]"
+
+#define __SURF_READXD_ALL(__xdv1, __xdv2, __xdv4, __surf_readxd_v2)           \
+    __surf_readxd_v2("b8", char, __xdv1, __SW_ASM_ARGS("=h"));                \
+    __surf_readxd_v2("b8", signed char, __xdv1, __SW_ASM_ARGS("=h"));         \
+    __surf_readxd_v2("b8", char1, __xdv1, __SW_ASM_ARGS1("=h"));              \
+    __surf_readxd_v2("b8", unsigned char, __xdv1, __SW_ASM_ARGS("=h"));       \
+    __surf_readxd_v2("b8", uchar1, __xdv1, __SW_ASM_ARGS1("=h"));             \
+    __surf_readxd_v2("b16", short, __xdv1, __SW_ASM_ARGS("=h"));              \
+    __surf_readxd_v2("b16", short1, __xdv1, __SW_ASM_ARGS1("=h"));            \
+    __surf_readxd_v2("b16", unsigned short, __xdv1, __SW_ASM_ARGS("=h"));     \
+    __surf_readxd_v2("b16", ushort1, __xdv1, __SW_ASM_ARGS1("=h"));           \
+    __surf_readxd_v2("b32", int, __xdv1, __SW_ASM_ARGS("=r"));                \
+    __surf_readxd_v2("b32", int1, __xdv1, __SW_ASM_ARGS1("=r"));              \
+    __surf_readxd_v2("b32", unsigned int, __xdv1, __SW_ASM_ARGS("=r"));       \
+    __surf_readxd_v2("b32", uint1, __xdv1, __SW_ASM_ARGS1("=r"));             \
+    __surf_readxd_v2("b64", long long, __xdv1, __SW_ASM_ARGS("=l"));          \
+    __surf_readxd_v2("b64", longlong1, __xdv1, __SW_ASM_ARGS1("=l"));         \
+    __surf_readxd_v2("b64", unsigned long long, __xdv1, __SW_ASM_ARGS("=l")); \
+    __surf_readxd_v2("b64", ulonglong1, __xdv1, __SW_ASM_ARGS1("=l"));        \
+    __surf_readxd_v2("b32", float, __xdv1, __SW_ASM_ARGS("=r"));              \
+    __surf_readxd_v2("b32", float1, __xdv1, __SW_ASM_ARGS1("=r"));            \
+                                                                              \
+    __surf_readxd_v2("v2.b8", char2, __xdv2, __SW_ASM_ARGS2("=h"));           \
+    __surf_readxd_v2("v2.b8", uchar2, __xdv2, __SW_ASM_ARGS2("=h"));          \
+    __surf_readxd_v2("v2.b16", short2, __xdv2, __SW_ASM_ARGS2("=h"));         \
+    __surf_readxd_v2("v2.b16", ushort2, __xdv2, __SW_ASM_ARGS2("=h"));        \
+    __surf_readxd_v2("v2.b32", int2, __xdv2, __SW_ASM_ARGS2("=r"));           \
+    __surf_readxd_v2("v2.b32", uint2, __xdv2, __SW_ASM_ARGS2("=r"));          \
+    __surf_readxd_v2("v2.b64", longlong2, __xdv2, __SW_ASM_ARGS2("=l"));      \
+    __surf_readxd_v2("v2.b64", ulonglong2, __xdv2, __SW_ASM_ARGS2("=l"));     \
+    __surf_readxd_v2("v2.b32", float2, __xdv2, __SW_ASM_ARGS2("=r"));         \
+                                                                              \
+    __surf_readxd_v2("v4.b8", char4, __xdv4, __SW_ASM_ARGS4("=h"));           \
+    __surf_readxd_v2("v4.b8", uchar4, __xdv4, __SW_ASM_ARGS4("=h"));          \
+    __surf_readxd_v2("v4.b16", short4, __xdv4, __SW_ASM_ARGS4("=h"));         \
+    __surf_readxd_v2("v4.b16", ushort4, __xdv4, __SW_ASM_ARGS4("=h"));        \
+    __surf_readxd_v2("v4.b32", int4, __xdv4, __SW_ASM_ARGS4("=r"));           \
+    __surf_readxd_v2("v4.b32", uint4, __xdv4, __SW_ASM_ARGS4("=r"));          \
+    __surf_readxd_v2("v4.b32", float4, __xdv4, __SW_ASM_ARGS4("=r"))
+
+__SURF_READXD_ALL(__1DV1, __1DV2, __1DV4, __SURF_READ1D);
+__SURF_READXD_ALL(__2DV1, __2DV2, __2DV4, __SURF_READ2D);
+__SURF_READXD_ALL(__3DV1, __3DV2, __3DV4, __SURF_READ3D);
+__SURF_READXD_ALL(__1DLAYERV1, __1DLAYERV2, __1DLAYERV4, __SURF_READ1DLAYERED);
+__SURF_READXD_ALL(__2DLAYERV1, __2DLAYERV2, __2DLAYERV4, __SURF_READ2DLAYERED);
+__SURF_READXD_ALL(__CUBEMAPV1, __CUBEMAPV2, __CUBEMAPV4, __SURF_READCUBEMAP);
+__SURF_READXD_ALL(__CUBEMAPLAYERV1, __CUBEMAPLAYERV2, __CUBEMAPLAYERV4, __SURF_READCUBEMAPLAYERED);
+
+
+#define __SURF_WRITE1D_V2(__asmtype, __type, __asm_op_args, __asm_args)                                     \
+    __SURF_WRITE_V2(__ID("__isurf1Dwrite_v2"), "1d", __asmtype, __type, "{%1}", (int x), ("r"(x)), __asm_op_args, \
+                    __asm_args)
+#define __SURF_WRITE1DLAYERED_V2(__asmtype, __type, __asm_op_args, __asm_args)                                  \
+    __SURF_WRITE_V2(__ID("__isurf1DLayeredwrite_v2"), "a1d", __asmtype, __type, "{%2, %1}", (int x, int layer), \
+                    ("r"(x), "r"(layer)), __asm_op_args, __asm_args)
+#define __SURF_WRITE2D_V2(__asmtype, __type, __asm_op_args, __asm_args)                                               \
+    __SURF_WRITE_V2(__ID("__isurf2Dwrite_v2"), "2d", __asmtype, __type, "{%1, %2}", (int x, int y), ("r"(x), "r"(y)), \
+                    __asm_op_args, __asm_args)
+#define __SURF_WRITE2DLAYERED_V2(__asmtype, __type, __asm_op_args, __asm_args)                      \
+    __SURF_WRITE_V2(__ID("__isurf2DLayeredwrite_v2"), "a2d", __asmtype, __type, "{%3, %1, %2, %2}", \
+                    (int x, int y, int layer), ("r"(x), "r"(y), "r"(layer)), __asm_op_args, __asm_args)
+#define __SURF_WRITE3D_V2(__asmtype, __type, __asm_op_args, __asm_args)                                            \
+    __SURF_WRITE_V2(__ID("__isurf3Dwrite_v2"), "3d", __asmtype, __type, "{%1, %2, %3, %3}", (int x, int y, int z), \
+                    ("r"(x), "r"(y), "r"(z)), __asm_op_args, __asm_args)
+
+#define __SURF_CUBEMAPWRITE_V2(__asmtype, __type, __asm_op_args, __asm_args)                      \
+    __SURF_WRITE_V2(__ID("__isurfCubemapwrite_v2"), "a2d", __asmtype, __type, "{%3, %1, %2, %2}", \
+                    (int x, int y, int face), ("r"(x),...
[truncated]

AustinSchuh (Contributor, Author)

@Artem-B I think you added texture support originally.

A lot of the language in that file is focused on just textures, not textures and surfaces. I am happy to adjust that if desired. I figured that working code, even if a bit ugly, plus early feedback was preferable.

Artem-B (Member) left a comment

LGTM in principle, but it could use some tests.
The change is surprisingly compact. Thank you for filling in one of the long-standing gaps in clang's CUDA support story.

Comment on lines +733 to +746
__OP_TYPE_SURFACE(__ID("__isurf1Dread"));
__OP_TYPE_SURFACE(__ID("__isurf2Dread"));
__OP_TYPE_SURFACE(__ID("__isurf3Dread"));
__OP_TYPE_SURFACE(__ID("__isurf1DLayeredread"));
__OP_TYPE_SURFACE(__ID("__isurf2DLayeredread"));
__OP_TYPE_SURFACE(__ID("__isurfCubemapread"));
__OP_TYPE_SURFACE(__ID("__isurfCubemapLayeredread"));
__OP_TYPE_SURFACE(__ID("__isurf1Dwrite_v2"));
__OP_TYPE_SURFACE(__ID("__isurf2Dwrite_v2"));
__OP_TYPE_SURFACE(__ID("__isurf3Dwrite_v2"));
__OP_TYPE_SURFACE(__ID("__isurf1DLayeredwrite_v2"));
__OP_TYPE_SURFACE(__ID("__isurf2DLayeredwrite_v2"));
__OP_TYPE_SURFACE(__ID("__isurfCubemapwrite_v2"));
__OP_TYPE_SURFACE(__ID("__isurfCubemapLayeredwrite_v2"));
Artem-B (Member) commented Mar 25, 2025

The perfect hash used by __ID may need to be re-generated. We may be lucky and it may happen to work well enough for the surface strings. It will likely eventually fail if you add more operations in the future. If the strings above are all you need, then we're OK.

AustinSchuh (Contributor, Author)

I was worried about that, but surprisingly, there is no collision. I ran all the different strings through it (texture + surface), and none of them collided. Happy to share my test program if that is helpful.
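A collision check like the one described above works with any hash over the operation names: feed every texture and surface string through it and see whether two distinct names land on the same slot. A sketch (toyHash below is a stand-in; the header's actual __ID perfect hash is different):

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Stand-in hash; the real header uses a compile-time perfect hash
// generated for the texture builtin names, not this function.
std::size_t toyHash(const std::string &s) {
    std::size_t h = 0;
    for (unsigned char c : s)
        h = h * 33 + c;
    return h % 512; // small table, like a perfect-hash output range
}

// Count how many distinct names land on an already-occupied slot.
int countCollisions(const std::vector<std::string> &names) {
    std::map<std::size_t, std::string> seen;
    int collisions = 0;
    for (const std::string &n : names) {
        auto result = seen.emplace(toyHash(n), n);
        if (!result.second && result.first->second != n)
            ++collisions;
    }
    return collisions;
}
```

Running all the __itex* and __isurf* names through the real hash this way is what the author's test program did; a nonzero count would mean the perfect hash needs regenerating.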

AustinSchuh (Contributor, Author)

LGTM in principle, but it could use some tests. The change is surprisingly compact. Thank you for filling in one of the long-standing gaps in clang's CUDA support story.

I might need some hints on where to start. How would you go about testing this, or are there any tests I can start from? I cribbed heavily from your texture code, which only has a compile test of the headers.

Artem-B (Member) commented Mar 26, 2025

LGTM in principle, but it could use some tests. The change is surprisingly compact. Thank you for filling in one of the long-standing gaps in clang's CUDA support story.

I might need some hints on where to start. How would you go about testing this, or are there any tests I can start from? I cribbed heavily from your texture code, which only has a compile test of the headers.

Ideally, it would be great to see that each builtin call generates the correct surface instruction, and that those instructions are accepted by ptxas. The catch is that ptxas is part of the CUDA SDK, so we can't use it for testing unconditionally. We've plumbed it through as an optional testing tool on the LLVM side, but I do not think we have it on the clang side at the moment.

If the header is compilable without the CUDA SDK (maybe with some stub headers in the test's Inputs directory), then a source file with a lot of builtin calls and autogenerated checks to verify the produced instructions would do.
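An autogenerated test of that shape pairs each builtin call with a FileCheck line for the PTX instruction it should produce (the patch's inline asm emits suld.b.<dim>.<type>.<mode> for reads). A sketch of such a generator, where the function names and the 1D/trap-only case selection are illustrative assumptions, not the committed nvptx-surface.cu test:

```cpp
#include <string>
#include <vector>

struct Case {
    std::string ctype;  // element type of the surface read
    std::string suffix; // expected PTX type suffix, e.g. b32
};

// Emit one __device__ function per case, each preceded by FileCheck
// directives for the expected suld instruction.
std::string emitTests(const std::vector<Case> &cases) {
    std::string out;
    int i = 0;
    for (const Case &c : cases) {
        const std::string name = "read_" + std::to_string(i++);
        out += "// CHECK-LABEL: " + name + "\n" +
               "// CHECK: suld.b.1d." + c.suffix + ".trap\n" +
               "__device__ " + c.ctype + " " + name +
               "(cudaSurfaceObject_t s, int x) {\n"
               "  return surf1Dread<" + c.ctype + ">(s, x);\n"
               "}\n\n";
    }
    return out;
}
```

The emitted file is compiled with clang targeting NVPTX and the PTX output piped through FileCheck, so each overload's instruction selection is pinned down by the test.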

AustinSchuh (Contributor, Author)

If the header is compilable without the CUDA SDK (maybe with some stub headers in the test's Inputs directory), then a source file with a lot of builtin calls and autogenerated checks to verify the produced instructions would do.

That wasn't too bad, done!


github-actions bot commented Mar 30, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

Artem-B (Member) left a comment

Nice.

Now we're missing the last two steps:

  • that ptxas accepts the inline asm instructions we generate
  • that those instructions actually do what they are intended to do.

Can you manually verify that the test file actually compiles to a GPU binary?
I.e., try to manually compile the test file to a .o file, so that clang produces a PTX file and compiles it with ptxas. We do not have ptxas plumbed into testing on the clang side, so manual verification will do for now.

As for functional testing, that would need to be done in the LLVM test-suite, where we do have the ability to compile code with the actual CUDA SDK and run it on a GPU; but we do not have any functional tests for surfaces, so that's optional.

AustinSchuh (Contributor, Author)

  • that ptxas accepts the inline asm instructions we generate
  • that those instructions actually do what they are intended to do.

I ran our internal test suite against this patch, and surf2Dwrite with an int32_t passes. Not an exhaustive test at all, but at least good evidence that it isn't completely wrong.

Can you manually verify that the test file actually compiles to a GPU binary? I.e., try to manually compile the test file to a .o file, so that clang produces a PTX file and compiles it with ptxas. We do not have ptxas plumbed into testing on the clang side, so manual verification will do for now.

/home/austin/local/llvm-project/bin/clang -resource-dir /home/austin/local/llvm-project/lib/clang/21 -Xclang -nostdsysteminc --cuda-path=/usr/lib/cuda/ --cuda-gpu-arch=sm_87 -O3 -o foo.o -c /home/austin/local/llvm-project/clang/test/CodeGen/nvptx-surface.cu --save-temps

succeeds. I looked at the temporary files left behind, and the .s files show that it is actually compiling the intended code. I did a s/__device__/__global__/ before compiling.

(I haven't landed 3 commits in LLVM yet, so you'll have to merge this one too when you are happy.)

Artem-B merged commit 2abcdd8 into llvm:main Apr 3, 2025
11 checks passed

github-actions bot commented Apr 3, 2025

@AustinSchuh Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

llvm-ci (Collaborator) commented Apr 3, 2025

LLVM Buildbot has detected a new failure on builder sanitizer-aarch64-linux running on sanitizer-buildbot7 while building clang at step 2 "annotate".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/51/builds/13779

Here is the relevant piece of the build log for reference:
Step 2 (annotate) failure: 'python ../sanitizer_buildbot/sanitizers/zorg/buildbot/builders/sanitizers/buildbot_selector.py' (failure)
...
[182/186] Generating MSAN_INST_TEST_OBJECTS.msan_test.cpp.aarch64-with-call.o
[183/186] Generating Msan-aarch64-with-call-Test
[184/186] Generating MSAN_INST_TEST_OBJECTS.msan_test.cpp.aarch64.o
[185/186] Generating Msan-aarch64-Test
[185/186] Running compiler_rt regression tests
llvm-lit: /home/b/sanitizer-aarch64-linux/build/llvm-project/llvm/utils/lit/lit/discovery.py:276: warning: input '/home/b/sanitizer-aarch64-linux/build/build_default/runtimes/runtimes-bins/compiler-rt/test/interception/Unit' contained no tests
llvm-lit: /home/b/sanitizer-aarch64-linux/build/llvm-project/llvm/utils/lit/lit/discovery.py:276: warning: input '/home/b/sanitizer-aarch64-linux/build/build_default/runtimes/runtimes-bins/compiler-rt/test/sanitizer_common/Unit' contained no tests
llvm-lit: /home/b/sanitizer-aarch64-linux/build/llvm-project/llvm/utils/lit/lit/main.py:72: note: The test suite configuration requested an individual test timeout of 0 seconds but a timeout of 900 seconds was requested on the command line. Forcing timeout to be 900 seconds.
-- Testing: 6015 tests, 72 workers --
Testing:  0.. 10.. 20
FAIL: HWAddressSanitizer-aarch64 :: TestCases/Linux/fixed-shadow.c (1536 of 6015)
******************** TEST 'HWAddressSanitizer-aarch64 :: TestCases/Linux/fixed-shadow.c' FAILED ********************
Exit Code: 2

Command Output (stdout):
--
1 0x10000000000
2 0x20000000000
3 0x30000000000
4 0x40000000000
5 0x50000000000
6 0x60000000000
7 0x70000000000
8 0x80000000000
9 0x90000000000
10 0xa0000000000
11 0xb0000000000
12 0xc0000000000
13 0xd0000000000
14 0xe0000000000
15 0xf0000000000
16 0x100000000000
17 0x110000000000
18 0x120000000000
19 0x130000000000
20 0x140000000000
21 0x150000000000
22 0x160000000000
23 0x170000000000
24 0x180000000000
25 0x190000000000
26 0x1a0000000000
27 0x1b0000000000
28 0x1c0000000000
29 0x1d0000000000
30 0x1e0000000000
31 0x1f0000000000
32 0x200000000000
33 0x210000000000
Step 14 (test compiler-rt default) failure: test compiler-rt default (failure)
(same log as Step 2 above)

@@ -0,0 +1,3329 @@
// RUN: %clang_cc1 -triple nvptx-unknown-unknown -fcuda-is-device -O3 -o - %s -emit-llvm | FileCheck %s
// RUN: %clang_cc1 -triple nvptx64-unknown-unknown -fcuda-is-device -O3 -o - %s -emit-llvm | FileCheck %s
#include "../Headers/Inputs/include/cuda.h"
Collaborator

Should "include/cuda.h" be in clang/test/CodeGen/Inputs because it's needed for a CodeGen test?

test/Headers contains (if I'm not mistaken) test cases related to headers, not header files for other tests. And it seems unusual for a test to reach into the input files of a different category of tests.

Contributor Author

@AustinSchuh Apr 3, 2025

Any suggestions for how best to avoid duplicating it? I could copy it over, but that feels worse. It is needed for the headers tests too. Essentially, I need definitions for all the vector types (int2, etc.) and other pieces so the texture + surface header can be included without errors.

Member

Relative include paths are a problem for tests -- some run remotely with only the test directory staged, so there may not always be a ../Headers directory where the tests actually run. Duplicating test inputs is OK.

Contributor Author

#134459 has the fixup

AustinSchuh added a commit to AustinSchuh/llvm-project that referenced this pull request Apr 4, 2025
llvm#132883 added support for cuda
surfaces but reached into clang/test/Headers/ from clang/test/CodeGen/
to grab the minimal cuda.h.  Duplicate that file instead based on
comments in the review, to fix remote test runs.

Signed-off-by: Austin Schuh <[email protected]>
slackito pushed a commit that referenced this pull request Apr 4, 2025
#132883 added support for cuda
surfaces but reached into clang/test/Headers/ from clang/test/CodeGen/
to grab the minimal cuda.h. Duplicate that file instead based on
comments in the review, to fix remote test runs.

Signed-off-by: Austin Schuh <[email protected]>
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Apr 4, 2025
…34459)

llvm/llvm-project#132883 added support for cuda
surfaces but reached into clang/test/Headers/ from clang/test/CodeGen/
to grab the minimal cuda.h. Duplicate that file instead based on
comments in the review, to fix remote test runs.

Signed-off-by: Austin Schuh <[email protected]>
@Artem-B
Member

Artem-B commented Apr 7, 2025

@AustinSchuh One thing I've missed during review is that the test clang/test/CodeGen/nvptx-surface.cu should probably go into clang/test/CodeGenCUDA

This would also obviate the need for #134459.

Can you send the patch to move the test to the right location?

@AustinSchuh
Contributor Author

You got it. #134758

Labels
backend:X86, clang:headers (Headers provided by Clang, e.g. for intrinsics), clang (Clang issues not falling into any other category)
5 participants