*TODO: the diagram needs to be updated to reflect the latest driver additions.*
The clang driver orchestrates the compilation and linking process based on a
SYCL-specific offload action builder and invokes external tools as needed. In
the diagram above, every dark-blue box is a tool invoked as a separate process
* `per_kernel` - enables emitting a separate module for each kernel
* `off` - disables device code split
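
The values above select a device code split mode on the command line. As a hedged sketch, assuming the DPC++ spelling `-fsycl-device-code-split` and an illustrative source file name:

```shell
# Emit a separate device module for each kernel (file names are illustrative)
clang++ -fsycl -fsycl-device-code-split=per_kernel app.cpp -o app
```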
#### CUDA support
The driver supports compilation to NVPTX when the `nvptx64-nvidia-cuda-sycldevice` triple is passed to `-fsycl-targets`.
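
For example, a compilation for the CUDA target can be requested as follows (source and output file names here are illustrative):

```shell
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda-sycldevice app.cpp -o app
```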
Unlike other AOT targets, the bitcode module linked from intermediate compiled objects never goes through SPIR-V. Instead, it is passed directly in bitcode form down to the NVPTX backend. All produced bitcode depends on two libraries: `libdevice.bc` (provided by the CUDA SDK) and `libspirv-nvptx64--nvidiacl.bc` (built by the libclc project).
During the device linking step (device linker box in the [Separate Compilation and Linking](#separate-compilation-and-linking) illustration), LLVM bitcode objects for the CUDA target are linked together alongside `libspirv-nvptx64--nvidiacl.bc` and `libdevice.bc`, compiled to PTX using the NVPTX backend, and assembled into a cubin using the `ptxas` tool (part of the CUDA SDK). The PTX file and cubin are assembled together using `fatbinary` to produce a CUDA fatbin. The CUDA fatbin is then passed to the offload wrapper tool.
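
The device-link flow described above can be sketched as a sequence of tool invocations. The exact commands are internal to the driver; the file names, SM architecture, and flags below are illustrative assumptions:

```shell
# Link device bitcode with the two support libraries
llvm-link a.bc b.bc libspirv-nvptx64--nvidiacl.bc libdevice.bc -o device.bc
# Compile the linked module to PTX using the NVPTX backend
llc -march=nvptx64 -mcpu=sm_50 device.bc -o device.s
# Assemble the PTX into a cubin (CUDA SDK tool)
ptxas -m64 -arch=sm_50 device.s -o device.cubin
# Bundle the PTX and cubin into a CUDA fatbin (CUDA SDK tool)
fatbinary --create=device.fatbin -64 \
  --image=profile=sm_50,file=device.cubin \
  --image=profile=compute_50,file=device.s
```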
##### Checking if the compiler is targeting NVPTX
When the SYCL compiler is in device mode and targeting the NVPTX backend, the compiler defines the macro `__SYCL_NVPTX__`.
This macro can safely be used to enable NVPTX-specific code paths in SYCL kernels.
*Note: this macro is only defined during the device compilation phase.*
##### NVPTX Builtins
When the SYCL compiler is in device mode and targeting the NVPTX backend, the compiler exposes NVPTX builtins supported by clang.
*Note: this enables NVPTX-specific features which cannot be supported by other targets or the host.*
Example:
```cpp
double my_min(double x, double y) {
#ifdef __SYCL_NVPTX__
  // Only available if in device mode and
  // while compiling for the NVPTX target.
  return __nvvm_fmin_d(x, y);
#else
  return x < y ? x : y;
#endif
}
```
##### Local memory support
In CUDA, users can only allocate one chunk of host-allocated shared memory (which maps to SYCL's local accessors).
This chunk of memory is allocated as an array `extern __shared__ <type> <name>[];`, which LLVM represents as an external global symbol in the CUDA shared memory address space.
The NVPTX backend then lowers this into a `.extern .shared .align 4 .b8` PTX directive.
In SYCL, users can allocate multiple local accessors and pass them as kernel parameters. When the SYCL frontend lowers the SYCL kernel invocation into an OpenCL-compliant kernel entry point, it lowers local accessors into pointers to OpenCL local memory (CUDA shared memory), but this is not legal for CUDA kernels.
To legalize the SYCL lowering for CUDA, a SYCL-for-CUDA specific pass does the following:

- Create a global symbol in the CUDA shared memory address space
- Transform all pointers to CUDA shared memory into a 32-bit integer representing the offset in bytes to use with the global symbol
- Replace all uses of the transformed pointers with the address of the global symbol offset by the value of the integer passed as parameter

```llvm
  %new_local_ptr = getelementptr inbounds [0 x i8], [0 x i8] addrspace(3)* @SYCL_generated_kernel.shared_mem, i32 0, i32 %local_ptr_offset
  %new_local_ptr2 = getelementptr inbounds [0 x i8], [0 x i8] addrspace(3)* @SYCL_generated_kernel.shared_mem, i32 0, i32 %local_ptr_offset2
  %0 = load i32, i32 addrspace(3)* %new_local_ptr
  %1 = load i64, i64 addrspace(3)* %new_local_ptr2
}
```
On the runtime side, when setting local memory arguments, the CUDA PI implementation internally sets the argument as the offset with respect to the accumulated size of used local memory. This approach preserves the existing PI interface.
### Integration with SPIR-V format
This section explains how to generate SPIR-V specific types and operations from
## DPC++ Language extensions to SYCL
The list of language extensions can be found at [extensions](extensions).