
Commit 928b815 (1 parent: e911de7)

[SYCL-PTX] Update documentation for the CUDA backend. (#1820)

Sync the post-link process for PTX following the illustration update. Add a
description of the global offset handling.

Signed-off-by: Victor Lomuller <[email protected]>

File tree: 2 files changed (+2954, −9 lines)


sycl/doc/CompilerAndRuntimeDesign.md

Lines changed: 101 additions & 9 deletions
down to the NVPTX Back End. All produced bitcode depends on two libraries,
`libdevice.bc` (provided by the CUDA SDK) and `libspirv-nvptx64--nvidiacl.bc`
(built by the libclc project).

##### Device code post-link step

During the "PTX target processing" stage of the device linking step (see
the illustration below), the LLVM bitcode objects for the CUDA target are
linked together alongside `libspirv-nvptx64--nvidiacl.bc` and
`libdevice.bc`, compiled to PTX using the NVPTX backend, and assembled
into a cubin using the `ptxas` tool (part of the CUDA SDK). The PTX file
and cubin are assembled together using `fatbinary` to produce a CUDA
fatbin. The CUDA fatbin is then passed to the offload wrapper tool.

![NVPTX AOT build](images/DevicePTXProcessing.svg)
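The steps above correspond roughly to the following manual tool
invocations (a hedged sketch, not the actual driver command lines; the
file names and the `sm_50` target are placeholders, and exact
`fatbinary` options vary across CUDA versions):

```shell
# Link the device bitcode with the device libraries.
llvm-link kernel.bc libspirv-nvptx64--nvidiacl.bc libdevice.bc -o linked.bc

# Compile the linked bitcode to PTX with the NVPTX backend.
llc -march=nvptx64 -mcpu=sm_50 linked.bc -o kernel.ptx

# Assemble the PTX into a cubin (ptxas is part of the CUDA SDK).
ptxas -arch=sm_50 kernel.ptx -o kernel.cubin

# Bundle the PTX and cubin into a CUDA fatbin for the offload wrapper.
fatbinary --create=kernel.fatbin -64 \
  --image=profile=sm_50,file=kernel.cubin \
  --image=profile=compute_50,file=kernel.ptx
```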
##### Checking if the compiler is targeting NVPTX

```
define void @SYCL_generated_kernel(i32 %local_ptr_offset, i32 %arg, i32 %local_ptr_offset2)
```

On the runtime side, when setting local memory arguments, the CUDA PI
implementation will internally set the argument as the offset with
respect to the accumulated size of used local memory. This approach
preserves the existing PI interface.
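The bookkeeping this implies can be sketched as follows (a minimal
illustration, not the actual CUDA PI plugin code; the class and method
names are hypothetical):

```python
class LocalMemArgs:
    """Hypothetical sketch of offset-based local memory arguments."""

    def __init__(self):
        self.total_size = 0  # accumulated dynamic shared memory, in bytes
        self.offsets = []    # per-argument byte offsets passed to the kernel

    def add_local_arg(self, size, align):
        # Round the running total up to the argument's alignment, record
        # that as the argument's offset, then grow the accumulated size.
        self.total_size = (self.total_size + align - 1) // align * align
        self.offsets.append(self.total_size)
        self.total_size += size
        return self.offsets[-1]

args = LocalMemArgs()
off_a = args.add_local_arg(64, 8)    # first local buffer  -> offset 0
off_b = args.add_local_arg(100, 16)  # second local buffer -> offset 64
# args.total_size is the dynamic shared memory to request at kernel launch
```

Each kernel argument thus stays a plain integer, which is why the
existing PI interface needs no change.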
##### Global offset support

The CUDA API does not natively support the global offset parameter
expected by SYCL.

In order to emulate it and make the generated kernels compliant, an
intrinsic `llvm.nvvm.implicit.offset` (clang builtin
`__builtin_ptx_implicit_offset`) was introduced to materialize the use
of this implicit parameter for the NVPTX backend. The intrinsic returns
a pointer to `i32` referring to a three-element array.

Each non-kernel function that reaches the implicit offset intrinsic in
the call graph is augmented with an extra implicit parameter of type
pointer to `i32`. Kernels calling one of these intrinsic-using
functions are cloned:
- the original kernel initializes an array of 3 `i32` to 0 and passes
  a pointer to this array to each function that takes the implicit
  parameter;
- the cloned kernel has its type augmented with an implicit parameter
  of type array of 3 `i32`; a pointer to this array is then passed to
  each function that takes the implicit parameter.
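The set of functions to augment and kernels to clone can be found by
walking the call graph backwards from the intrinsic's users. A minimal
sketch of that reachability computation (the call graph is a plain dict
here, and all names besides the intrinsic are hypothetical):

```python
# Hypothetical call graph: caller -> list of callees.
CALL_GRAPH = {
    "example_kernel": ["other_function", "other_function2"],
    "other_function": ["llvm.nvvm.implicit.offset"],
    "other_function2": [],
}
KERNELS = {"example_kernel"}
INTRINSIC = "llvm.nvvm.implicit.offset"

def functions_to_augment(graph):
    # Start from direct callers of the intrinsic and walk callers
    # upwards; non-kernel functions get the implicit parameter,
    # kernels get cloned.
    reach = {f for f, callees in graph.items() if INTRINSIC in callees}
    worklist = list(reach)
    while worklist:
        f = worklist.pop()
        for caller, callees in graph.items():
            if f in callees and caller not in reach:
                reach.add(caller)
                worklist.append(caller)
    return reach - KERNELS, reach & KERNELS  # (augmented, cloned)

augmented, cloned = functions_to_augment(CALL_GRAPH)
```

With the example graph above, only `other_function` is augmented and
only `example_kernel` is cloned; `other_function2` is left untouched
because it never reaches the intrinsic.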
The runtime will query both kernels and call the appropriate one based
on the following logic:

- If both versions exist, the original kernel is called when the global
  offset is 0; otherwise the cloned kernel is called and the offset is
  passed by value;
- If only one version exists, it is assumed that the kernel makes no
  use of this parameter, and the offset is therefore ignored.
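This dispatch decision can be sketched as follows (a hedged
illustration only; the kernel names and the `launch` helper are
hypothetical, not the actual CUDA PI code):

```python
def launch(kernels, offset, args):
    # `kernels` maps kernel names to callables; the "_with_offset"
    # entry is the clone described above, if the module contains one.
    original = kernels.get("example_kernel")
    clone = kernels.get("example_kernel_with_offset")
    if original is not None and clone is not None:
        # Both versions exist: call the original for a zero offset,
        # otherwise pass the offset by value to the clone.
        if offset == (0, 0, 0):
            return original(*args)
        return clone(offset, *args)
    # Only one version: the kernel does not use the offset; ignore it.
    return original(*args)

calls = []
kernels = {
    "example_kernel": lambda *a: calls.append(("original", a)),
    "example_kernel_with_offset": lambda off, *a: calls.append(("clone", off)),
}
launch(kernels, (0, 0, 0), (42,))  # zero offset     -> original kernel
launch(kernels, (4, 0, 0), (42,))  # non-zero offset -> clone, by value
```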
As an example, the following code:

```
declare i32* @llvm.nvvm.implicit.offset()

define weak_odr dso_local i64 @other_function() {
  %1 = tail call i32* @llvm.nvvm.implicit.offset()
  %2 = getelementptr inbounds i32, i32* %1, i64 2
  %3 = load i32, i32* %2, align 4
  %4 = zext i32 %3 to i64
  ret i64 %4
}

define weak_odr dso_local void @other_function2() {
  ret void
}

define weak_odr dso_local void @example_kernel() {
entry:
  %0 = call i64 @other_function()
  call void @other_function2()
  ret void
}
```
is transformed into this in the `sycldevice` environment:

```
define weak_odr dso_local i64 @other_function(i32* %0) {
  %2 = getelementptr inbounds i32, i32* %0, i64 2
  %3 = load i32, i32* %2, align 4
  %4 = zext i32 %3 to i64
  ret i64 %4
}

define weak_odr dso_local void @example_kernel() {
entry:
  %0 = alloca [3 x i32], align 4
  %1 = bitcast [3 x i32]* %0 to i8*
  call void @llvm.memset.p0i8.i64(i8* nonnull align 4 dereferenceable(12) %1, i8 0, i64 12, i1 false)
  %2 = getelementptr inbounds [3 x i32], [3 x i32]* %0, i32 0, i32 0
  %3 = call i64 @other_function(i32* %2)
  call void @other_function2()
  ret void
}

define weak_odr dso_local void @example_kernel_with_offset([3 x i32]* byval([3 x i32]) %0) {
entry:
  %1 = bitcast [3 x i32]* %0 to i32*
  %2 = call i64 @other_function(i32* %1)
  call void @other_function2()
  ret void
}
```
Note: Kernel naming is not fully stable for now.
### Integration with SPIR-V format

This section explains how to generate SPIR-V specific types and operations from
