@@ -510,14 +510,18 @@ down to the NVPTX Back End. All produced bitcode depends on two libraries,
`libdevice.bc` (provided by the CUDA SDK) and `libspirv-nvptx64--nvidiacl.bc`
(built by the libclc project).

- During the device linking step (device linker box in the
- [Separate Compilation and Linking](#separate-compilation-and-linking)
- illustration), llvm bitcode objects for the CUDA target are linked together
- alongside `libspirv-nvptx64--nvidiacl.bc` and `libdevice.bc`, compiled to PTX
- using the NVPTX backend, and assembled into a cubin using the `ptxas` tool (part
- of the CUDA SDK). The PTX file and cubin are assembled together using
- `fatbinary` to produce a CUDA fatbin. The CUDA fatbin is then passed to the
- offload wrapper tool.
+ ##### Device code post-link step
+
+ During the "PTX target processing" stage of the device linking step (see
+ the [Device code post-link step](#device-code-post-link-step) illustration
+ below), the LLVM bitcode objects for the CUDA target are linked together
+ alongside `libspirv-nvptx64--nvidiacl.bc` and `libdevice.bc`, compiled to
+ PTX using the NVPTX backend, and assembled into a cubin using the `ptxas`
+ tool (part of the CUDA SDK). The PTX file and cubin are assembled together
+ using `fatbinary` to produce a CUDA fatbin. The CUDA fatbin is then passed
+ to the offload wrapper tool.
+
+ ![NVPTX AOT build](images/DevicePTXProcessing.svg)
##### Checking if the compiler is targeting NVPTX
@@ -592,9 +596,97 @@ define void @SYCL_generated_kernel(i32 %local_ptr_offset, i32 %arg, i32 %local_p

On the runtime side, when setting local memory arguments, the CUDA PI
implementation will internally set the argument as the offset with respect to
- the accumulated size of used local memory. This approach preserves the exisiting
+ the accumulated size of used local memory. This approach preserves the existing
PI interface.
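
As an illustration, the following is a minimal C++ sketch of that
bookkeeping, using hypothetical helper names rather than the actual
CUDA PI plugin code: each local-memory argument is replaced by the byte
offset at which its storage will live inside a single dynamically
allocated shared-memory block.

```
#include <cstddef>

// Hypothetical sketch, not the actual CUDA PI plugin code.
struct LocalArgTracker {
  std::size_t TotalLocalSize = 0; // accumulated size of used local memory

  // Returns the byte offset that is set (by value) as the kernel
  // argument in place of the local-memory pointer.
  std::size_t addLocalArg(std::size_t Size, std::size_t Align) {
    // Align the running total, record the offset, then grow the total.
    TotalLocalSize = (TotalLocalSize + Align - 1) & ~(Align - 1);
    std::size_t Offset = TotalLocalSize;
    TotalLocalSize += Size;
    return Offset;
  }
  // At launch, TotalLocalSize is used as the dynamic shared-memory
  // size (the sharedMemBytes argument of cuLaunchKernel).
};
```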
+ ##### Global offset support
+
+ The CUDA API does not natively support the global offset parameter
+ expected by SYCL.
+
+ In order to emulate it and make the generated kernels compliant, an
+ intrinsic `llvm.nvvm.implicit.offset` (clang builtin
+ `__builtin_ptx_implicit_offset`) was introduced to materialize the use
+ of this implicit parameter for the NVPTX backend. The intrinsic returns
+ a pointer to `i32` referring to a 3-element array.
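+
+ For illustration, a device-side helper could be built on top of the
+ builtin as sketched below; the helper name and the builtin's exact
+ return type are assumptions made for this example, not the actual
+ libclc implementation.
+
+ ```
+ // Hypothetical sketch; assumes the builtin returns a pointer to the
+ // three 32-bit components (x, y, z) of the launch's global offset.
+ extern "C" unsigned long get_global_offset(int dimension) {
+   const auto *offset = __builtin_ptx_implicit_offset();
+   return offset[dimension];
+ }
+ ```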
+
+ Each non-kernel function reaching the implicit offset intrinsic in the
+ call graph is augmented with an extra implicit parameter of type
+ pointer to `i32`. Kernels calling one of these functions through
+ this intrinsic are cloned:
+
+ - the original kernel initializes an array of 3 `i32` to 0 and passes
+   the pointer to this array to each function with the implicit
+   parameter;
+ - the cloned kernel's function type is augmented with an implicit
+   parameter of type array of 3 `i32`. The pointer to this array is
+   then passed to each function with the implicit parameter.
+
+ The runtime will query both kernels and call the appropriate one based
+ on the following logic (sketched in code after this list):
+
+ - If both versions exist, the original kernel is called when the
+   global offset is 0; otherwise the runtime calls the cloned kernel
+   and passes the offset by value;
+ - If only one version exists, it is assumed that the kernel makes no
+   use of this parameter, and the offset is therefore ignored.
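+
+ The following is a minimal C++ sketch of that dispatch; `Kernel`,
+ `ArgList`, and `selectKernel` are illustrative stand-ins, not the
+ actual SYCL runtime or PI API.
+
+ ```
+ #include <cstddef>
+
+ struct Kernel; // stand-in for the runtime's kernel handle
+ struct ArgList {
+   void appendByValue(const void *Data, std::size_t Size); // stand-in
+ };
+
+ // Pick which compiled kernel to launch for a given global offset.
+ Kernel *selectKernel(Kernel *Original, Kernel *WithOffset,
+                      const int Offset[3], ArgList &Args) {
+   const bool NonZeroOffset = Offset[0] || Offset[1] || Offset[2];
+   if (WithOffset && NonZeroOffset) {
+     // The cloned version exists and the offset is used: pass it by value.
+     Args.appendByValue(Offset, 3 * sizeof(int));
+     return WithOffset;
+   }
+   // Zero offset, or only the original exists: the implicit parameter
+   // is not needed, so the original kernel is called.
+   return Original;
+ }
+ ```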
+
+ As an example, the following code:
+ ```
+ declare i32* @llvm.nvvm.implicit.offset()
+
+ define weak_odr dso_local i64 @other_function() {
+   %1 = tail call i32* @llvm.nvvm.implicit.offset()
+   %2 = getelementptr inbounds i32, i32* %1, i64 2
+   %3 = load i32, i32* %2, align 4
+   %4 = zext i32 %3 to i64
+   ret i64 %4
+ }
+
+ define weak_odr dso_local void @other_function2() {
+   ret void
+ }
+
+ define weak_odr dso_local void @example_kernel() {
+ entry:
+   %0 = call i64 @other_function()
+   call void @other_function2()
+   ret void
+ }
+ ```
+
+ is transformed into this in the `sycldevice` environment:
+ ```
+ define weak_odr dso_local i64 @other_function(i32* %0) {
+   %2 = getelementptr inbounds i32, i32* %0, i64 2
+   %3 = load i32, i32* %2, align 4
+   %4 = zext i32 %3 to i64
+   ret i64 %4
+ }
+
+ define weak_odr dso_local void @example_kernel() {
+ entry:
+   %0 = alloca [3 x i32], align 4
+   %1 = bitcast [3 x i32]* %0 to i8*
+   call void @llvm.memset.p0i8.i64(i8* nonnull align 4 dereferenceable(12) %1, i8 0, i64 12, i1 false)
+   %2 = getelementptr inbounds [3 x i32], [3 x i32]* %0, i32 0, i32 0
+   %3 = call i64 @other_function(i32* %2)
+   call void @other_function2()
+   ret void
+ }
+
+ define weak_odr dso_local void @example_kernel_with_offset([3 x i32]* byval([3 x i32]) %0) {
+ entry:
+   %1 = bitcast [3 x i32]* %0 to i32*
+   %2 = call i64 @other_function(i32* %1)
+   call void @other_function2()
+   ret void
+ }
+ ```
+
+ Note: kernel naming is not yet fully stable.
+

### Integration with SPIR-V format
This section explains how to generate SPIR-V specific types and operations from