Commit c6197e3

kbobrovs, pvchupin, mdtoguchi, and Artem Gindinson committed
[SYCL] Update compiler design doc to reflect changed action graphs.
- Correct the high-level application build diagram.
- Describe the new file-table-tform tool usage.
- Describe clang action graphs used in various SYCL compilation scenarios in more details.

Signed-off-by: Konstantin S Bobrovsky <[email protected]>
Co-authored-by: Pavel Chupin <[email protected]>
Co-authored-by: mdtoguchi <[email protected]>
Co-authored-by: Artem Gindinson <[email protected]>
1 parent 5e21b71 commit c6197e3

4 files changed

+22783
-2483
lines changed

sycl/doc/CompilerAndRuntimeDesign.md

Lines changed: 183 additions & 125 deletions
@@ -12,21 +12,38 @@ DPC++ application compilation flow:
1212

1313
![High level component diagram for DPC++ Compiler](images/Compiler-HLD.svg)
1414

15+
<div align="center"> Diagram 1. Application build flow. </div>
16+
1517
DPC++ compiler logically can be split into the host compiler and a number of
1618
device compilers—one per each supported target. Clang driver orchestrates the
1719
compilation process, it will invoke the device compiler once per each requested
1820
target, then it will invoke the host compiler to compile the host part of a
19-
SYCL source. The result of compilation is a set of so-called "fat objects" -
20-
one fat object per SYCL source file. A fat object contains compiled host code
21-
and a number of compiled device code instances—one per each target. Fat
22-
objects can be linked into "fat binary".
21+
SYCL source. In the simplest case, when compilation and linkage are done in one
22+
compiler driver invocation, once compilation is finished, the device object
23+
files (which are really LLVM IR files) are linked with the `llvm-link` tool.
24+
The resulting LLVM IR module is then translated into a SPIR-V module using the
25+
`llvm-spirv` tool and wrapped in a host object file using the
26+
`clang-offload-wrapper` tool. Once all the host object files and the wrapped
27+
object with device code are ready, the driver invokes the usual platform linker
28+
and the final executable called "fat binary" is produced. This is a host
29+
executable or library with embedded linked images for each target specified at the command
30+
line.
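
The single-invocation flow described above can be summarized as an ordered list of tool steps. The following Python sketch is only an illustration: the tool names come from the text, while the `Step` type, the `one_step_build_plan` helper and all file names (`a.bc`, `linked.bc`, `wrapper.o`, `app`) are hypothetical, and no real command-line options are modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    tool: str            # tool named in the text above
    inputs: List[str]    # hypothetical input file names
    output: str          # hypothetical output file name


def one_step_build_plan(sources: List[str]) -> List[Step]:
    """Return the device-side steps followed by the final host link."""
    stems = [s.rsplit(".", 1)[0] for s in sources]
    device_ir = [f"{stem}.bc" for stem in stems]   # device objects are LLVM IR
    host_objs = [f"{stem}.o" for stem in stems]
    return [
        Step("llvm-link", device_ir, "linked.bc"),
        Step("llvm-spirv", ["linked.bc"], "linked.spv"),
        Step("clang-offload-wrapper", ["linked.spv"], "wrapper.o"),
        # The platform linker combines host objects and the wrapper object.
        Step("linker", host_objs + ["wrapper.o"], "app"),
    ]


plan = one_step_build_plan(["a.cpp", "b.cpp"])
for step in plan:
    print(f"{step.tool}: {step.inputs} -> {step.output}")
```

The sketch deliberately leaves out the host and device compilations themselves and covers only what happens once they have finished.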
31+
32+
There are many variations of the compilation process depending on whether the user
33+
chooses to do one or more of the following:
34+
- perform compilation separately from linkage
35+
- compile the device SPIR-V module ahead-of-time for one or more targets
36+
- perform device code splitting so that device code is distributed across
37+
multiple modules rather than enclosed in a single one
38+
- link static device libraries
39+
Sections below provide more details on some of those scenarios.
2340

2441
SYCL sources can be also compiled as a regular C++ code, in this mode there is
25-
no "device part" of the code - everything is executed on the host.
42+
no "device part" of the code - everything is executed on the host.
2643

2744
Device compiler is further split into the following major components:
2845

29-
- **Front-end** - parses input source, outlines "device part" of the code,
46+
- **Front-end** - parses input source, "outlines" device part of the code,
3047
applies additional restrictions on the device code (e.g. no exceptions or
3148
virtual calls), generates LLVM IR for the device code only and "integration
3249
header" which provides information like kernel name, parameters order and data
@@ -38,8 +55,17 @@ back-end. Today middle-end transformations include just a couple of passes:
3855
transformation with only one limitation: back-end compiler should be able to
3956
handle transformed LLVM IR.
4057
- Optionally: LLVM IR → SPIR-V translator.
41-
- **Back-end** - produces native "device" code in ahead-of-time compilation
42-
mode.
58+
- **Back-end** - produces native "device" code. It is shown as
59+
"Target-specific LLVM compiler" box on Diagram 1. It is invoked either at
60+
compile time (in the ahead-of-time compilation scenario) or at runtime
61+
(in the just-in-time compilation scenario).
62+
63+
*Design note: in current design we use SYCL device front-end compiler to produce the
64+
integration header for two reasons. First, it must be possible to use any host
65+
compiler to produce SYCL heterogeneous applications. Second, even if the
66+
same clang compiler is used for the host compilation, information provided in the
67+
integration header is used (included) by the SYCL runtime implementation, so the
68+
header must be available before the host compilation starts.*
4369

4470
### SYCL support in Clang front-end
4571

@@ -150,7 +176,69 @@ defines:
150176

151177
- target triple and a native tool chain for each target (including "virtual"
152178
targets like SPIR-V).
153-
- SYCL offload action based on generic offload action
179+
- SYCL offload action based on generic offload action.
180+
181+
SYCL compilation pipeline has a peculiarity compared to other compilation
182+
scenarios - some of the actions in the pipeline may output multiple "clusters"
183+
of files, consumed later by other actions. For example, each device binary may be
184+
accompanied by a symbol table and a specialization constant map - additional
185+
information used by the SYCL runtime library - and it needs to be stored into
186+
the device binary descriptor by the offload wrapper tool. With device code
187+
splitting feature enabled, there can be multiple such sets (clusters) of files -
188+
one per each separate device binary.
189+
190+
The current design of the clang driver does not allow modeling that, namely:
191+
1. Multiple inputs/outputs in the action graph.
192+
1. Logical grouping of multiple inputs/outputs. For example, an input or output can consist of multiple pairs of files, where each pair represents information for a single device code module: [a file with device code, a file with exported symbols].
193+
194+
To support this, SYCL introduces the `file-table-tform` tool. This tool can
195+
transform file tables following commands passed as input arguments. Each row
196+
in the table represents a file cluster, each column - a type of data associated
197+
with a cluster. The tool can replace and extract columns. For example, the
198+
`sycl-post-link` tool can output two file clusters and the following file
199+
table referencing all the files in the clusters:
200+
```
201+
[Code|Symbols|Properties]
202+
a_0.bc|a_0.sym|a_0.props
203+
a_1.bc|a_1.sym|a_1.props
204+
```
205+
206+
When participating in the action graph this tool inputs a file table
207+
(`TY_Tempfiletable` clang input type) and/or a file list (`TY_Tempfilelist`),
208+
performs requested transformations and outputs a file table or list. From the
209+
clang design standpoint there is still single input and output, even though in
210+
reality there are multiple.
211+
212+
For example, depending on compilation options, files from the "Code" column
213+
above may need to undergo AOT compilation after the device code splitting step,
214+
performed as a part of the code transformation sequence done by the
215+
`sycl-post-link` tool. The driver will then:
216+
- Use the `file-table-tform` to extract the code files and produce a file
217+
list:
218+
```
219+
a_0.bc
220+
a_1.bc
221+
```
222+
- Pass this file list to the `llvm-for-each` tool along with AOT compilation
223+
command to invoke it on every file in the list. This will result in another
224+
file list:
225+
```
226+
a_0.bin
227+
a_1.bin
228+
```
229+
- Then `file-table-tform` is invoked again to replace `.bc` with `.bin` in
230+
the filetable to get a new filetable:
231+
```
232+
[Code|Symbols|Properties]
233+
a_0.bin|a_0.sym|a_0.props
234+
a_1.bin|a_1.sym|a_1.props
235+
```
236+
- Finally, this filetable is passed to the `clang-offload-wrapper` tool to
237+
construct a wrapper object which embeds all those files.
238+
239+
Note that the graph does not change when more rows (clusters) or columns
240+
(e.g. a "manifest" file) are added to the table.
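
The column extraction and replacement described above can be modeled in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation or command-line interface of `file-table-tform`: the function names are hypothetical, and the AOT compilation step is replaced by a simple rename from `.bc` to `.bin`.

```python
from typing import List, Tuple

Table = Tuple[List[str], List[List[str]]]  # (column names, rows)


def parse_table(text: str) -> Table:
    """Parse the textual file table format shown above."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].strip("[]").split("|")
    rows = [ln.split("|") for ln in lines[1:]]
    return header, rows


def extract_column(table: Table, name: str) -> List[str]:
    """Produce a file list from one column, one entry per cluster (row)."""
    header, rows = table
    idx = header.index(name)
    return [row[idx] for row in rows]


def replace_column(table: Table, name: str, files: List[str]) -> Table:
    """Return a new table with column `name` replaced row by row."""
    header, rows = table
    if len(files) != len(rows):
        raise ValueError("file list must have one entry per row")
    idx = header.index(name)
    new_rows = [row[:idx] + [f] + row[idx + 1:] for row, f in zip(rows, files)]
    return header, new_rows


def format_table(table: Table) -> str:
    """Render a table back into the textual format."""
    header, rows = table
    return "\n".join(["[" + "|".join(header) + "]"] + ["|".join(r) for r in rows])


table = parse_table("""
[Code|Symbols|Properties]
a_0.bc|a_0.sym|a_0.props
a_1.bc|a_1.sym|a_1.props
""")
code_files = extract_column(table, "Code")
# Hypothetical stand-in for the AOT compilation run on every file in the list.
binaries = [f.rsplit(".", 1)[0] + ".bin" for f in code_files]
print(format_table(replace_column(table, "Code", binaries)))
```

Because both operations work on whole rows and columns, the same calls apply unchanged when the table gains more clusters or more columns.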
241+
154242

155243
#### Enable SYCL offload
156244

@@ -188,14 +276,8 @@ a set of target architectures for which to compile device code. By default the
188276
compiler generates SPIR-V and OpenCL device JIT compiler produces native target
189277
binary.
190278

191-
There are existing options for OpenMP\* offload:
192-
193-
`-fopenmp-targets=triple1,triple2`
194-
195-
would produce binaries for target architectures identified by target triples
196-
`triple1` and `triple2`.
197-
198-
A similar approach is used for SYCL:
279+
To produce binaries for target architectures identified by target triples
280+
`triple1` and `triple2`, the following SYCL compiler options are used:
199281

200282
`-fsycl-targets=triple1,triple2`
201283

@@ -232,56 +314,29 @@ generation?
232314

233315
#### Separate Compilation and Linking
234316

235-
The compiler supports linking of device code obtained from different source
236-
files before generating the final SPIR-V to be fed to the back-end. The basic
237-
mechanism is to produce "fat objects" as a result of compilation—object files
238-
containing both host and device code for all targets—then break fat objects
239-
into their constituents before linking and link host code and device code
240-
(per-target) separately and finally produce a "fat binary" - a host executable
241-
with embedded linked images for each target specified at the command line.
242-
243-
![Multi source compilation flow](Multi-source-compilation-flow.png)
244-
245-
*TODO: the diagram needs to be updating to reflect the latest driver additions.*
246-
247-
The clang driver orchestrates compilation and linking process based on a
248-
SYCL-specific offload action builder and invokes external tools as needed. On
249-
the diagram above, every dark-blue box is a tool invoked as a separate process
250-
by the clang driver.
251-
252-
Compilation starts with compiling the input source `a.cpp` for one of the
253-
targets requested via the command line - `T2`. When doing this first
254-
compilation, the driver requests the device compiler to generate an
255-
"integration header" via a special option. Device compilation for other targets
256-
\- `T1` - don't need to generate the integration header, as it must be the same
257-
for all the targets.
258-
259-
*Design note: Current design does not use the host compiler to produce the
260-
integration header for two reasons: first, it must be possible to use any host
261-
compiler to produce SYCL heterogeneous application, and second, even if the
262-
same clang is used for the host compilation, information provided in the
263-
integration header is used (included) by the SYCL runtime implementation so it
264-
must be ready before host compilation starts.*
265-
266-
Now, after all the device compilations are completed resulting in `a_T2.bin`
267-
and `a_T1.bin`, and the integration header `a.h` is generated, the driver
268-
invokes the host compiler passing it the integration header via `-include`
269-
option to produce the host object `a.obj`. Then the offload bundler tool is
270-
invoked to pack `a_T2.bin`, `a_T1.bin` and `a.obj` into `a_fat.obj` - the fat
271-
object file for the source `a.cpp`.
272-
273-
The compilation process is repeated for all the sources in the application
274-
(maybe on different machines).
275-
276-
Device linking starts with breaking the fat objects back into constituents with
277-
the unbundler tool (bundler invoked with `-unbundle` option). For each fat
278-
object the unbundler produces a target list file which contains pairs
279-
"`<target-triple>, <filename>`" each representing a device object extracted
280-
from the fat object and its target. Once all the fat objects are unbundled, the
281-
driver uses the target list files to construct a list of targets available for
282-
linking and a list of corresponding object files for each: "`<T1: a_T1.bin>,
283-
<T2: a_T2.bin, b_T2.bin>, <T3: b_T3.bin>`". Then the driver invokes linkers for
284-
each of the targets to produce device binary images for those targets.
317+
The compiler supports such features as
318+
- linking of device code obtained from different source files before generating
319+
the final SPIR-V to be fed to the back-end.
320+
- splitting application build into separate compile and link steps.
321+
322+
Overall build flow changes compared to the one shown on the Diagram 1
323+
above in the following way.
324+
**Compilation step** ends with engaging the offload
325+
bundler to generate so-called "fat object" for each
326+
<host object, device code IR> pair produced from the same heterogeneous source.
327+
The fat object files become the result of compilation similar to object
328+
files with usual non-offload compiler.
329+
**Link step** starts with breaking the input fat objects back into their
330+
constituents, then continues the same way as on the Diagram 1 - link host code
331+
and device code separately and finally produce a "fat binary".
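
The split between the two steps can be sketched as a pair of inverse operations. The Python model below is an illustration only: `bundle` and `unbundle` are hypothetical helpers that stand in for the offload bundler, a fat object is reduced to a dictionary, and the file names are made up.

```python
from typing import Dict, List

FatObject = Dict[str, str]  # part kind ("host" or "device") -> file name


def bundle(host_obj: str, device_ir: str) -> FatObject:
    """Compilation step: pair the host object and device IR of one source."""
    return {"host": host_obj, "device": device_ir}


def unbundle(fat_objects: List[FatObject]) -> Dict[str, List[str]]:
    """Link step: split fat objects back into per-kind input lists."""
    parts: Dict[str, List[str]] = {"host": [], "device": []}
    for fat in fat_objects:
        parts["host"].append(fat["host"])
        parts["device"].append(fat["device"])
    return parts


# Two sources compiled separately, possibly in different driver invocations.
fat_objs = [bundle("a.o", "a.bc"), bundle("b.o", "b.bc")]

parts = unbundle(fat_objs)
print("device link inputs:", parts["device"])
print("host link inputs:  ", parts["host"])
```

After unbundling, the device list feeds the device link and the host list feeds the final host link, which is the same flow as in the single-invocation case.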
332+
333+
The diagram below illustrates the changes in the build flow. The offload
334+
bundler/unbundler actions are basically inserted between the `llvm-link` and
335+
the `linker` invocations as shown on the Diagram 1.
336+
337+
![Multi source compilation flow](images/SplitCompileAndLink.svg)
338+
<div align="center"> Diagram 2. Split compilation and linkage. </div>
339+
285340

286341
*Current implementation uses LLVM IR as a default device binary format for `fat
287342
objects` and translates "linked LLVM IR" to SPIR-V. One of the reasons for this
@@ -290,68 +345,35 @@ be defined in multiple modules and linker must resolve multiple definitions.
290345
LLVM IR uses function attributes to satisfy "one definition rule", which have
291346
no counterparts in SPIR-V.*
292347

293-
Host linking starts after all device images are produced - with invocation of
294-
the offload wrapper tool. Its main function is to create a host object file
295-
wrapping all the device images and provide the runtime with access to that
296-
information. So when creating the host wrapper object the offload wrapper tool
297-
does the following:
298-
299-
- creates a `.sycl_offloading.descriptor` symbol which is a structure
300-
containing the number of device images and the array of the device images
301-
themselves
302-
303-
```C++
304-
struct __tgt_device_image {
305-
void *ImageStart;
306-
void *ImageEnd;
307-
};
308-
struct __tgt_bin_desc {
309-
int32_t NumDeviceImages;
310-
__tgt_device_image *DeviceImages;
311-
};
312-
__tgt_bin_desc .sycl_offloading.descriptor;
313-
```
314-
315-
- creates a `void .sycl_offloading.descriptor_reg()` function and registers it
316-
for execution at module loading; this function invokes the `__tgt_register_lib`
317-
registration function (the name can also be specified via an option) which must
318-
be implemented by the runtime and which registers the device images with the
319-
runtime:
320-
321-
```C++
322-
void __tgt_register_lib(__tgt_bin_desc *desc);
323-
```
324-
325-
- creates a `void .sycl_offloading.descriptor_unreg()` function and registers
326-
it for execution at module unloading; this function calls the
327-
`__tgt_unregister_lib` function (the name can also be specified via an option)
328-
which must be implemented by the runtime and which unregisters the device
329-
images with the runtime:
330-
331-
```C++
332-
void __tgt_unregister_lib(__tgt_bin_desc *desc);
333-
```
348+
#### Fat binary creation details
334349

335-
Once the offload wrapper object file is ready, the driver finally invokes the
336-
host linker giving it the following input:
350+
"Fat binary" is a result of the final host linking step - this is a host binary
351+
with one or more embedded device binaries. When run, it automatically registers
352+
all available device binaries within the SYCL runtime library. This section
353+
describes how this is achieved.
337354

338-
- all the application host objects (result of compilation or unbundling)
339-
- the offload wrapper object file
340-
- all the host libraries needed by the application
341-
- the SYCL runtime library
355+
The output fat binary is created with usual linker - e.g. `ld` on Linux and
356+
`link.exe` on Windows. For the linker to be able to embed the device binaries,
357+
they are first "wrapped" into a host object file called "wrapper object". Then
358+
this wrapper object is linked normally with the rest of host objects and/or
359+
libraries.
342360

343-
The result is so-called "fat binary image" containing the host code, code for
344-
all the targets plus the registration/unregistration functions and the
345-
information about the device binary images.
361+
The wrapper object is created by the `clang-offload-wrapper` tool, or simply
362+
"offload wrapper". The created wrapper object has two main components:
363+
1. Global symbol - offload descriptor - pointing to a special data structure
364+
put into the object's data section. It encompasses all needed information
365+
about the wrapped device binaries - number of binaries, symbols each binary
366+
defines, etc. - as well as the binaries themselves.
367+
1. Registration/unregistration functions. The first one is put into a special
368+
section so that it is invoked when the parent fat binary is loaded into a
369+
process at runtime, the second one is put into another section to be invoked
370+
when the parent fat binary is unloaded. The registration function basically
371+
takes the pointer to the offload descriptor and invokes SYCL runtime library's
372+
registration function passing it as a parameter.
346373

347-
When compilation and linking is done in single compiler driver invocation, the
348-
bundling and unbundling steps are skipped.
374+
The offload descriptor type hierarchy is described in the `pi.h` header. The
375+
top level structure is `pi_device_binaries_struct`.
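
The register-on-load and unregister-on-unload behavior of the wrapper object can be pictured with a toy model. The Python sketch below is purely conceptual: `DeviceBinary`, `OffloadDescriptor` and `Runtime` are hypothetical stand-ins, not the types declared in `pi.h`, and the target names are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DeviceBinary:          # hypothetical stand-in, not a type from pi.h
    target: str
    image: bytes
    symbols: List[str] = field(default_factory=list)


@dataclass
class OffloadDescriptor:     # hypothetical stand-in for the offload descriptor
    binaries: List[DeviceBinary]


class Runtime:
    """Toy runtime that tracks registered descriptors by identity."""

    def __init__(self) -> None:
        self._registered: Dict[int, OffloadDescriptor] = {}

    def register(self, desc: OffloadDescriptor) -> None:
        self._registered[id(desc)] = desc

    def unregister(self, desc: OffloadDescriptor) -> None:
        self._registered.pop(id(desc), None)

    def available_targets(self) -> List[str]:
        return sorted({b.target for d in self._registered.values()
                       for b in d.binaries})


runtime = Runtime()
descriptor = OffloadDescriptor([
    DeviceBinary("target_a", b"\x00", ["kernel_0"]),
    DeviceBinary("target_b", b"\x01", ["kernel_0"]),
])

runtime.register(descriptor)      # models the function run at load time
print(runtime.available_targets())
runtime.unregister(descriptor)    # models the function run at unload time
print(runtime.available_targets())
```

In the real flow the two calls are not made by user code; they are triggered by the loader through the special sections mentioned above.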
349376

350-
*Design note: the described scheme differs from current llvm.org
351-
implementation. Current design uses Linux-specific linker script approach and
352-
requires that all the linked fat objects are compiled for the same set of
353-
targets. The described design uses OS-neutral offload-wrapper tool and does not
354-
impose restrictions on fat objects.*
355377

356378
#### Device Link
357379
The -fsycl-link flag instructs the compiler to fully link device code without
@@ -390,7 +412,37 @@ llvm-no-spir-kernel host.bc
390412

391413
It returns 0 if no kernels are present and 1 otherwise.
392414

393-
#### Device code split
415+
#### Device code post-link step
416+
417+
At link time all the device code is always linked into a single LLVM IR module.
418+
`sycl-post-link` tool performs a number of final transformations on this LLVM IR module before handing it off to
419+
the offload wrapper. Those include:
420+
- device code splitting
421+
- symbol table generation
422+
- specialization constants lowering
423+
424+
Depending on options, `sycl-post-link` can output either a single LLVM IR file,
425+
or multiple files plus a file table referencing all of them. See the
426+
"SYCL support in the driver" section for overall description of file table. The
427+
diagram below shows possible clang action graphs which compilation process will
428+
follow from the single linked LLVM IR module to creating the wrapper object.
429+
There are multiple possible variants of the graph depending on:
430+
- specific target requirements
431+
- device code splitting
432+
- AOT compilation
433+
434+
![Multi source compilation flow](images/DeviceLinkAndWrap.svg)
435+
<div align="center"> Diagram 3. Device code link flows</div>
436+
Colors of the graph's edges show which paths are taken depending on the above
437+
factors. Each edge is also annotated with the input/output file type.
438+
The diagram does not show the `llvm-for-each` tool invocations for clarity.
439+
This tool invokes a given command line on each file in a file list. In this
440+
diagram the tool is applied to `llvm-spirv` and AOT backend whenever the
441+
input/output type is `TY_Tempfilelist`. The second invocation of the
442+
`file-table-tform` takes two inputs - the file table and a file list coming
443+
either from `llvm-spirv` or from the AOT backend.
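
The role of `llvm-for-each` in these graphs is a one-to-one mapping from an input file list to an output file list. The following Python sketch shows that idea only; `for_each` and `fake_translate` are hypothetical names, and the callable merely renames files instead of running a real tool.

```python
from typing import Callable, List


def for_each(file_list: List[str], command: Callable[[str], str]) -> List[str]:
    """Run `command` once per input file and collect outputs in input order."""
    return [command(path) for path in file_list]


def fake_translate(path: str) -> str:
    # Hypothetical stand-in for one `llvm-spirv` run; it only renames the file.
    return path.rsplit(".", 1)[0] + ".spv"


outputs = for_each(["a_0.bc", "a_1.bc"], fake_translate)
print(outputs)
```

Keeping the output order identical to the input order is what allows the resulting list to be put back into the file table as a replacement column.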
444+
445+
##### Device code splitting
394446

395447
Putting all device code into a single SPIR-V module does not work well in the
396448
following cases:
@@ -436,6 +488,12 @@ unit)
436488
* `per_kernel` - enables emitting a separate module for each kernel
437489
* `off` - disables device code split
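
The effect of the split modes can be sketched as a grouping of kernels into modules. The Python function below is a simplified illustration that covers only the two modes whose descriptions appear in full above (`per_kernel` and `off`); the function name and the kernel names are hypothetical, and real splitting operates on LLVM IR rather than on name lists.

```python
from typing import List


def split_kernels(kernels: List[str], mode: str) -> List[List[str]]:
    """Group kernel names into modules according to the split mode."""
    if mode == "off":
        # No splitting: every kernel stays in one module.
        return [list(kernels)]
    if mode == "per_kernel":
        # One module per kernel.
        return [[kernel] for kernel in kernels]
    raise ValueError(f"mode not modeled in this sketch: {mode}")


kernels = ["kernel_a", "kernel_b", "kernel_c"]
print(split_kernels(kernels, "off"))
print(split_kernels(kernels, "per_kernel"))
```

Each inner list corresponds to one device binary, and therefore to one row (cluster) in the file table produced by `sycl-post-link`.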
438490

491+
##### Symbol table generation
492+
TBD
493+
494+
##### Specialization constants lowering
495+
TBD
496+
439497
#### CUDA support
440498

441499
The driver supports compilation to NVPTX when the `nvptx64-nvidia-cuda-sycldevice` is passed to `-fsycl-targets`.
