[SYCL] Update compiler design doc to reflect changed action graphs. #1625

Merged · 8 commits · May 8, 2020
sycl/doc/CompilerAndRuntimeDesign.md (310 changes: 185 additions & 125 deletions)
@@ -12,21 +12,38 @@ DPC++ application compilation flow:

![High level component diagram for DPC++ Compiler](images/Compiler-HLD.svg)

*Diagram 1. Application build flow.*

The DPC++ compiler can logically be split into the host compiler and a number
of device compilers—one per supported target. The clang driver orchestrates
the compilation process: it invokes the device compiler once per requested
target, then invokes the host compiler to compile the host part of a SYCL
source. In the simplest case, when compilation and linkage are done in one
compiler driver invocation, once compilation is finished, the device object
files (which are really LLVM IR files) are linked with the `llvm-link` tool.
The resulting LLVM IR module is then translated into a SPIR-V module using the
`llvm-spirv` tool and wrapped in a host object file using the
`clang-offload-wrapper` tool. Once all the host object files and the wrapped
object with device code are ready, the driver invokes the usual platform
linker, and the final executable called a "fat binary" is produced. This is a
host executable or library with embedded linked images for each target
specified at the command line.
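
As an illustration, the simplest-case pipeline can be sketched as follows (the
file names are illustrative; the exact tool invocations are internal to the
driver):

```
a.cpp, b.cpp --clang--> a.o, b.o (host) + a.bc, b.bc (device LLVM IR)
a.bc, b.bc --llvm-link--> app.bc --llvm-spirv--> app.spv
app.spv --clang-offload-wrapper--> wrapper.o
a.o, b.o, wrapper.o --platform linker--> app ("fat binary")
```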

There are many variations of the compilation process depending on whether the
user chose to do one or more of the following:

- perform compilation separately from linkage
- compile the device SPIR-V module ahead-of-time for one or more targets
- perform device code splitting so that device code is distributed across
multiple modules rather than enclosed in a single one
- perform linkage of static device libraries

Sections below provide more details on some of those scenarios.

SYCL sources can also be compiled as regular C++ code; in this mode there is
no "device part" of the code - everything is executed on the host.

Device compiler is further split into the following major components:

- **Front-end** - parses input source, "outlines" device part of the code,
applies additional restrictions on the device code (e.g. no exceptions or
virtual calls), generates LLVM IR for the device code only and "integration
header" which provides information like kernel name, parameters order and data
@@ -38,8 +55,17 @@ back-end. Today middle-end transformations include just a couple of passes:
transformation with only one limitation: back-end compiler should be able to
handle transformed LLVM IR.
- Optionally: LLVM IR → SPIR-V translator.
- **Back-end** - produces native "device" code. It is shown as the
"Target-specific LLVM compiler" box on Diagram 1. It is invoked either at
compile time (in the ahead-of-time compilation scenario) or at runtime
(in the just-in-time compilation scenario).

*Design note: in the current design we use the SYCL device front-end compiler
to produce the integration header for two reasons. First, it must be possible
to use any host compiler to produce SYCL heterogeneous applications. Second,
even if the same clang compiler is used for the host compilation, information
provided in the integration header is used (included) by the SYCL runtime
implementation, so the header must be available before the host compilation
starts.*
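
For illustration, below is a simplified sketch of what a generated integration
header provides. The names and layout (`KernelInfo`, `kernel_param_desc_t`, the
mangled kernel name) are abridged assumptions for the sake of the example and
may not match the actual generated header exactly:

```C++
// Simplified sketch of an integration header generated by the device
// front-end compiler and pre-included into the host compilation.
namespace cl { namespace sycl { namespace detail {

// Describes one kernel argument: its kind and byte offset in the lambda.
struct kernel_param_desc_t {
  int kind;   // e.g. accessor, standard-layout value, sampler, ...
  int info;   // kind-specific extra information
  int offset; // offset of the captured value inside the lambda object
};

// Specialized for each kernel found in the translation unit.
template <class KernelNameType> struct KernelInfo;

template <> struct KernelInfo<class MyKernel> {
  static constexpr unsigned getNumParams() { return 1; }
  static constexpr kernel_param_desc_t getParamDesc(unsigned) {
    return {/*kind*/ 0, /*info*/ 0, /*offset*/ 0};
  }
  // Name the runtime uses to find the kernel in the device image.
  static constexpr const char *getName() { return "_ZTSZ4mainE8MyKernel"; }
};

}}} // namespace cl::sycl::detail
```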

### SYCL support in Clang front-end

@@ -150,7 +176,69 @@ defines:

- target triple and a native tool chain for each target (including "virtual"
targets like SPIR-V).
- SYCL offload action based on generic offload action.

The SYCL compilation pipeline has a peculiarity compared to other compilation
scenarios - some of the actions in the pipeline may output multiple "clusters"
of files, consumed later by other actions. For example, each device binary may
be accompanied by a symbol table and a specialization constant map - additional
information used by the SYCL runtime library - which needs to be stored into
the device binary descriptor by the offload wrapper tool. With the device code
splitting feature enabled, there can be multiple such sets (clusters) of
files - one per separate device binary.

The current design of the clang driver does not allow this to be modeled,
namely:
1. Multiple inputs/outputs in the action graph.
2. Logical grouping of multiple inputs/outputs. For example, an input or output
can consist of multiple pairs of files, where each pair represents information
for a single device code module: [a file with device code, a file with exported
symbols].

To support this, SYCL introduces the `file-table-tform` tool. This tool
transforms file tables following commands passed as input arguments. Each row
in the table represents a file cluster, and each column a type of data
associated with a cluster. The tool can replace and extract columns. For
example, the `sycl-post-link` tool can output two file clusters and the
following file table referencing all the files in the clusters:
```
[Code|Symbols|Properties]
a_0.bc|a_0.sym|a_0.props
a_1.bc|a_1.sym|a_1.props
```

When participating in the action graph, this tool inputs a file table
(`TY_Tempfiletable` clang input type) and/or a file list (`TY_Tempfilelist`),
performs the requested transformations and outputs a file table or list. From
the clang design standpoint there is still a single input and output, even
though in reality there are multiple.

For example, depending on compilation options, files from the "Code" column
above may need to undergo AOT compilation after the device code splitting step,
performed as a part of the code transformation sequence done by the
`sycl-post-link` tool. The driver will then:
- Use the `file-table-tform` tool to extract the code files and produce a
file list:
```
a_0.bc
a_1.bc
```
- Pass this file list to the `llvm-foreach` tool along with the AOT
compilation command to invoke it on every file in the list (an example
invocation is sketched below). This will result in another file list:
```
a_0.bin
a_1.bin
```
- Then `file-table-tform` is invoked again to replace `.bc` with `.bin` in
the file table to get a new file table:
```
[Code|Symbols|Properties]
a_0.bin|a_0.sym|a_0.props
a_1.bin|a_1.sym|a_1.props
```
- Finally, this file table is passed to the `clang-offload-wrapper` tool to
construct a wrapper object which embeds all those files.

Note that the graph does not change when more rows (clusters) or columns
(e.g. a "manifest" file) are added to the table.


#### Enable SYCL offload

Expand Down Expand Up @@ -188,14 +276,8 @@ a set of target architectures for which to compile device code. By default the
compiler generates SPIR-V, and the OpenCL device JIT compiler produces the
native target binary.

To produce binaries for target architectures identified by target triples
`triple1` and `triple2`, the following SYCL compiler option is used:

`-fsycl-targets=triple1,triple2`
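
For example, with the SPIR-V and GEN device triples used at the time of
writing (the concrete spellings below are illustrative):

`-fsycl-targets=spir64-unknown-unknown-sycldevice,spir64_gen-unknown-unknown-sycldevice`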

@@ -232,56 +314,29 @@ generation?

#### Separate Compilation and Linking

The compiler supports features such as:
- linking of device code obtained from different source files before generating
the final SPIR-V to be fed to the back-end
- splitting the application build into separate compile and link steps

The overall build flow changes compared to the one shown on Diagram 1 in the
following way.
**Compilation step** ends with engaging the offload bundler to generate a
so-called "fat object" for each <host object, device code IR> pair produced
from the same heterogeneous source. The fat object files become the result of
compilation, similar to object files with a usual non-offload compiler.
**Link step** starts with breaking the input fat objects back into their
constituents, then continues the same way as on Diagram 1 - host code and
device code are linked separately, and finally a "fat binary" is produced.

The diagram below illustrates the changes in the build flow. The offload
bundler/unbundler actions are basically inserted between the `llvm-link` and
the `linker` invocations shown on Diagram 1.

![Multi source compilation flow](images/SplitCompileAndLink.svg)
*Diagram 2. Split compilation and linkage.*
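
A minimal sketch of the corresponding user-facing commands (file names are
illustrative):

```
clang++ -fsycl -c a.cpp -o a_fat.o     # compile: emits a fat object
clang++ -fsycl -c b.cpp -o b_fat.o     # (host code + device code IR)
clang++ -fsycl a_fat.o b_fat.o -o app  # link: unbundle, then link host and
                                       # device code separately
```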


*The current implementation uses LLVM IR as the default device binary format
for `fat objects` and translates "linked LLVM IR" to SPIR-V. One of the
reasons for this
@@ -290,68 +345,35 @@ be defined in multiple modules and linker must resolve multiple definitions.
LLVM IR uses function attributes to satisfy "one definition rule", which have
no counterparts in SPIR-V.*

#### Fat binary creation details

A "fat binary" is the result of the final host linking step - a host binary
with embedded device binaries. When run, it automatically registers all
available device binaries with the SYCL runtime library. This section
describes how this is achieved.

The output fat binary is created with the usual linker - e.g. `ld` on Linux
and `link.exe` on Windows. For the linker to be able to embed the device
binaries, they are first "wrapped" into a host object file called the "wrapper
object". This wrapper object is then linked normally with the rest of the host
objects and/or libraries.
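
A sketch of the resulting host link step (the driver constructs the actual
linker command line; the file and library names here are illustrative):

```
ld a.o b.o wrapper.o -lsycl -o app   # host objects + wrapper object
                                     # + host libs + SYCL runtime library
```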

The wrapper object is created by the `clang-offload-wrapper` tool, or simply
the "offload wrapper". The created wrapper object has two main components:
1. A global symbol - the offload descriptor - pointing to a special data
structure put into the object's data section. It encompasses all the needed
information about the wrapped device binaries - the number of binaries, the
symbols each binary defines, etc. - as well as the binaries themselves.
2. Registration/unregistration functions. The first is put into a special
section so that it is invoked when the parent fat binary is loaded into a
process at runtime; the second is put into another section to be invoked when
the parent fat binary is unloaded. The registration function basically takes
a pointer to the offload descriptor and passes it to the SYCL runtime
library's registration function as a parameter.

The offload descriptor type hierarchy is described in the `pi.h` header. The
top-level structure is `pi_device_binaries_struct`.
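
A simplified sketch of the mechanism is shown below. The field lists are
abridged and the exact declarations live in `pi.h`, so the shapes shown here
may not match the implementation exactly:

```C++
// Abridged sketch of the offload descriptor and the registration hooks
// referenced by the wrapper object.
#include <cstdint>

struct pi_device_binary_struct {
  // ...version/kind/format fields omitted in this sketch...
  const char *DeviceTargetSpec;     // target the binary was compiled for
  const unsigned char *BinaryStart; // device image start
  const unsigned char *BinaryEnd;   // device image end
  // ...entry (symbol) and property tables omitted...
};

struct pi_device_binaries_struct {
  std::uint16_t NumDeviceBinaries;
  pi_device_binary_struct *DeviceBinaries;
};

// Implemented by the SYCL runtime library; the wrapper object's
// registration/unregistration functions call these on fat binary
// load/unload, passing the offload descriptor.
extern "C" void __sycl_register_lib(pi_device_binaries_struct *desc);
extern "C" void __sycl_unregister_lib(pi_device_binaries_struct *desc);
```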


#### Device Link
The -fsycl-link flag instructs the compiler to fully link device code without
@@ -390,7 +412,39 @@ llvm-no-spir-kernel host.bc

It returns 0 if no kernels are present and 1 otherwise.

#### Device code post-link step

At link time all the device code is always linked into a single LLVM IR
module. The `sycl-post-link` tool performs a number of final transformations
on this module before handing it off to the offload wrapper (an example
invocation is sketched below). Those include:
- device code splitting
- symbol table generation
- specialization constants lowering
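
For illustration, an invocation enabling all three transformations might look
like the following; the option spellings are assumptions made for this example
and may not match the actual tool interface:

```
sycl-post-link -split=source -symbols -spec-const=rt app.bc -o app.table
```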

Depending on the options, `sycl-post-link` can output either a single LLVM IR
file, or multiple files plus a file table referencing all of them. See the
"SYCL support in the driver" section for an overall description of file
tables. The diagram below shows the possible clang action graphs which the
compilation process will follow from the single linked LLVM IR module to
creating the wrapper object. There are multiple possible variants of the graph
depending on:
- specific target requirements
- device code splitting
- AOT compilation

![Multi source compilation flow](images/DeviceLinkAndWrap.svg)

*Diagram 3. Device code link flows.*

Colors of the graph's edges show which paths are taken depending on the above
factors. Each edge is also annotated with the input/output file type.
For clarity, the diagram does not show the `llvm-foreach` tool invocations.
This tool invokes a given command line over each file in a file list. In this
diagram the tool is applied to `llvm-spirv` and the AOT backend whenever the
input/output type is `TY_Tempfilelist`. The second invocation of
`file-table-tform` takes two inputs - the file table and a file list coming
either from `llvm-spirv` or from the AOT backend.

##### Device code splitting

Putting all device code into a single SPIR-V module does not work well in the
following cases:
@@ -436,6 +490,12 @@ unit)
* `per_kernel` - enables emitting a separate module for each kernel
* `off` - disables device code split
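
For example, to emit a separate device module per kernel, the option would be
spelled as follows (assuming the driver interface at the time of writing):

`-fsycl-device-code-split=per_kernel`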

##### Symbol table generation
TBD

##### Specialization constants lowering
TBD

#### CUDA support

The driver supports compilation to NVPTX when the `nvptx64-nvidia-cuda-sycldevice` is passed to `-fsycl-targets`.
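
For example (an illustrative command line; the output name is arbitrary):

`clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda-sycldevice a.cpp -o app`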