[SYCL] Global offset docs

jchlanda · jchlanda · commit ffddff3a4be2 · 2022-04-08T08:10:55.000Z
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
@@ -14985,6 +14985,33 @@ track the usage for each kernel. However, in some cases careful organization of
 the kernels and functions in the source file means there is minimal additional
 effort required to accurately calculate GPR usage.
 
+SYCL Kernel Metadata
+====================
+
+This section describes the additional metadata that is inserted for SYCL
+kernels. As SYCL is a single-source programming model, functions can execute
+on either a host or a device (i.e. a GPU). Device kernels are akin to kernel
+entry points in a GPU program. To mark an LLVM IR function as a device kernel
+function, we make use of special LLVM metadata. The AMDGCN back-end will look
+for a named metadata node called ``amdgcn.annotations``. This named metadata
+must contain a list of metadata nodes that describe the kernel IR. For our
+purposes, we need to declare a metadata node that assigns the ``"kernel"``
+attribute to the LLVM IR function that should be emitted as a SYCL kernel
+function. These metadata nodes take the form:
+
+.. code-block:: text
+
+  !{<function ref>, !"kernel", i32 1}
+
+Consider the metadata generated by the global-offset pass for a void kernel
+function ``example_kernel_with_offset`` that takes one argument, a pointer to
+an array of three ``i32`` values:
+
+.. code-block:: llvm
+
+  !amdgcn.annotations = !{!0}
+  !0 = !{void ([3 x i32]*)* @_ZTS14example_kernel_with_offset, !"kernel", i32 1}
+
 Additional Documentation
 ========================
 
diff --git a/sycl/doc/design/CompilerAndRuntimeDesign.md b/sycl/doc/design/CompilerAndRuntimeDesign.md
@@ -659,11 +659,12 @@ PI interface.
 The CUDA API does not natively support the global offset parameter
 expected by the SYCL.
 
-In order to emulate this and make generated kernel compliant, an
-intrinsic `llvm.nvvm.implicit.offset` (clang builtin
-`__builtin_ptx_implicit_offset`) was introduced materializing the use
-of this implicit parameter for the NVPTX backend. The intrinsic returns
-a pointer to `i32` referring to a 3 elements array.
+In order to emulate this and make the generated kernel compliant, an intrinsic
+`llvm.nvvm.implicit.offset` (clang builtin `__builtin_ptx_implicit_offset`) was
+introduced, materializing the use of this implicit parameter for the NVPTX
+backend. AMDGCN uses the same approach with `llvm.amdgcn.implicit.offset` and
+`__builtin_amdgcn_implicit_offset`. The intrinsic returns a pointer to `i32`
+referring to a 3-element array.
 
 Each non-kernel function reaching the implicit offset intrinsic in the
 call graph is augmented with an extra implicit parameter of type
@@ -682,7 +683,7 @@ on the following logic:
 
 - If the 2 versions exist, the original kernel is called if global
   offset is 0 otherwise it will call the cloned one and pass the
-  offset by value;
+  offset by value (for the CUDA backend) or by reference (for AMDGCN);
 - If only 1 function exist, it is assumed that the kernel makes no use
   of this parameter and therefore ignores it.