-
Notifications
You must be signed in to change notification settings - Fork 787
[Doc] Add design doc for dynamic linking of device code feature #3210
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 6 commits
a27bcbf
6fe222d
604909c
df953fc
702e1a4
60054b1
459730b
7f95079
93e202c
b8fb778
850b94f
d176e1c
fbb67d1
7b7aa66
9f2b787
b496ef4
03f6b9f
8ba2c92
7b60419
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,379 @@ | ||||||||||||||||||||
# Dynamic linking of device code | ||||||||||||||||||||
|
||||||||||||||||||||
This document describes purpose and design of dynamic linking of device code | ||||||||||||||||||||
feature. | ||||||||||||||||||||
|
||||||||||||||||||||
## Background | ||||||||||||||||||||
Sometimes users want to link device code dynamically at run time. One possible | ||||||||||||||||||||
use case for such linkage - providing device functions via shared libraries. | ||||||||||||||||||||
Simple source example: | ||||||||||||||||||||
``` | ||||||||||||||||||||
// App: | ||||||||||||||||||||
|
||||||||||||||||||||
CGH.parallel_for<app_kernel>(/* ... */ { | ||||||||||||||||||||
library_function(); | ||||||||||||||||||||
}); | ||||||||||||||||||||
|
||||||||||||||||||||
|
||||||||||||||||||||
// Shared library: | ||||||||||||||||||||
SYCL_EXTERNAL void library_function() { | ||||||||||||||||||||
// do something | ||||||||||||||||||||
} | ||||||||||||||||||||
``` | ||||||||||||||||||||
It is possible to manually create `sycl::program` in both app and shared | ||||||||||||||||||||
library, then use `link` SYCL API to get a single program and launch kernels | ||||||||||||||||||||
using it. But it is not user-friendly and it is very different from regular | ||||||||||||||||||||
C/C++ workflow. | ||||||||||||||||||||
|
||||||||||||||||||||
Another possible scenario - use functions defined in pre-compiled device image | ||||||||||||||||||||
provided by user. Example: | ||||||||||||||||||||
``` | ||||||||||||||||||||
// a.cpp | ||||||||||||||||||||
SYCL_EXTERNAL void foo(); | ||||||||||||||||||||
Fznamznon marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
... | ||||||||||||||||||||
parallel_for([]() { foo(); }); | ||||||||||||||||||||
Fznamznon marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
|
||||||||||||||||||||
// b.cpp | ||||||||||||||||||||
/*no SYCL_EXTERNAL*/ void foo() { ... } | ||||||||||||||||||||
``` | ||||||||||||||||||||
We have a `SYCL_EXTERNAL` function `foo` called from a kernel, but the | ||||||||||||||||||||
Fznamznon marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
application defined only host version of this function. Then user adds device | ||||||||||||||||||||
image with definition of `foo` to the fat object via special option. | ||||||||||||||||||||
Fznamznon marked this conversation as resolved.
Show resolved
Hide resolved
Fznamznon marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
|
||||||||||||||||||||
The main purpose of this feature is to provide a mechanism which allows to | ||||||||||||||||||||
Fznamznon marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
link device code dynamically at runtime. | ||||||||||||||||||||
Fznamznon marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
|
||||||||||||||||||||
## Requirements: | ||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. NIT:
Suggested change
|
||||||||||||||||||||
User's device code that consists of some device API (`SYCL_EXTERNAL` functions), | ||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||||||||||||||||
is compiled into some form and it is not linked statically with device code of | ||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||||||||||||||||
application. It can be a shared library that contains some device code or a | ||||||||||||||||||||
separate device image supplied with property information. This code is linked | ||||||||||||||||||||
Fznamznon marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
dynamically at run time with device code of a user's application in order to | ||||||||||||||||||||
resolve dependencies. | ||||||||||||||||||||
For this combination the following statements must be true: | ||||||||||||||||||||
|
||||||||||||||||||||
- `SYCL_EXTERNAL` functions defined in dynamically linked code can be called | ||||||||||||||||||||
(directly or indirectly) from device code of the application. | ||||||||||||||||||||
- Function pointers taken in application should work inside the dynamically | ||||||||||||||||||||
linked code. | ||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Some parts of the reqs seem too specific and not general enough. As a variant: The presented dynamic device code linkage mechanism must:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure about this point:
Do you mean some specific SYCL API? Because if dlopen is used it will be just specific usage of dynamically linked code that "embedded into a host shared object by standard SYCL compiler driver invocation". There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I meant some new API, which would load a device library from disk (optionally returning // suppose, mylib.spv defines SYCL_EXTERNAL function foo, then this call:
device_image img = device_dlopen("mylib.spv");
// will make foo available for dynamic symbol resolution. If any subsequent JIT compilations
// try to compile device code with external reference to foo, it can now be resolved following
// the resolution mechanism described in this doc, and JIT compilation will succeed. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok, thanks. I mentioned that. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we really need a new API? Can't we just suggest compiling marked with SYCL_EXTERNAL as a regular "fat" .so and then dlopen it? dlopen'ing should trigger device image registration, so such a device image should be available for symbols resolution. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Sorry for delay. Yes, you are right, that should be the preferred way. But sometimes there is a need to load x.spv originating from different tools - e.g. OpenCL. It would be UB trying to define kernels in this x.spv and then trying to call them from SYCL, but it should be perfectly OK to link x.spv with .spv originating from SYCL. |
||||||||||||||||||||
- Specific code changes are not required, i.e. the mechanism of linking works | ||||||||||||||||||||
as close as possible to regular shared libraries. | ||||||||||||||||||||
|
||||||||||||||||||||
## Design | ||||||||||||||||||||
The overall idea: | ||||||||||||||||||||
|
||||||||||||||||||||
- Each device image is supplied with a list of imported symbol names | ||||||||||||||||||||
through device image properties mechanism | ||||||||||||||||||||
- `SYCL_EXTERNAL` functions are arranged into separate device images supplied | ||||||||||||||||||||
with a list of exported symbol names | ||||||||||||||||||||
- Before compiling a device image DPC++ RT will check if device image has a list | ||||||||||||||||||||
Fznamznon marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
of imported symbols and if it has, then RT will search for device images which | ||||||||||||||||||||
define required symbols using lists of exported symbols. | ||||||||||||||||||||
Fznamznon marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
- Besides symbol names, additional attributes are taken into account (like | ||||||||||||||||||||
device image format: SPIR-V or device asm) | ||||||||||||||||||||
smaslov-intel marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
- Actual linking is performed by underlying backend (OpenCL/L0/etc.) | ||||||||||||||||||||
|
||||||||||||||||||||
Next sections describe details of changes in each component. | ||||||||||||||||||||
Fznamznon marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
|
||||||||||||||||||||
### DPC++ front-end changes | ||||||||||||||||||||
|
||||||||||||||||||||
DPC++ front-end generates `sycl-module-id` attribute on each `SYCL_EXTERNAL` function. | ||||||||||||||||||||
It was generated only on kernels earlier. There are two reasons to start | ||||||||||||||||||||
generating this attribute on `SYCL_EXTERNAL` functions: | ||||||||||||||||||||
|
||||||||||||||||||||
- Later in pipeline, this attribute will be used by `sycl-post-link` tool to | ||||||||||||||||||||
separate `SYCL_EXTERNAL` functions from non-`SYCL_EXTERNAL` functions with | ||||||||||||||||||||
external linkage. | ||||||||||||||||||||
- `sycl-module-id` attribute also contains information about source file where the | ||||||||||||||||||||
function comes from. This information will be used to perform device code | ||||||||||||||||||||
split on device images that contain only exported functions. | ||||||||||||||||||||
AlexeySachkov marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
|
||||||||||||||||||||
### sycl-post-link changes | ||||||||||||||||||||
|
||||||||||||||||||||
To support dynamic device linkage, `sycl-post-link` performs 3 main tasks: | ||||||||||||||||||||
- Arranges `SYCL_EXTERNAL` functions into a separate device image(s) | ||||||||||||||||||||
- Supplies device images containing exports with an information about exported | ||||||||||||||||||||
symbols | ||||||||||||||||||||
- Supplies each device image with an information about imported symbols | ||||||||||||||||||||
|
||||||||||||||||||||
`sycl-post-link` outlines `SYCL_EXTERNAL` functions with all their reachable | ||||||||||||||||||||
dependencies (functions with definitions called from `SYCL_EXTERNAL` ones) | ||||||||||||||||||||
into a separate device image(s) in order to create minimal self-contained | ||||||||||||||||||||
device images that can be linked from the user's app. There are several | ||||||||||||||||||||
notable moments though. | ||||||||||||||||||||
|
||||||||||||||||||||
If a `SYCL_EXTERNAL` function is used within a kernel defined in a shared | ||||||||||||||||||||
library, it will be duplicated: one instance will be stored in the kernel's | ||||||||||||||||||||
device image and the function won't exported from this device image, while the | ||||||||||||||||||||
other will be stored in a special device image for other `SYCL_EXTERNAL` | ||||||||||||||||||||
functions and will be marked as exported there. Such duplication is need for | ||||||||||||||||||||
two reasons: | ||||||||||||||||||||
- We aim to make device images with kernels self-contained so no JIT linker | ||||||||||||||||||||
invocations would be needed if we have definitions of all called functions. | ||||||||||||||||||||
Also note that if AOT is requested, it would be impossible to link anything | ||||||||||||||||||||
at runtime. | ||||||||||||||||||||
- We could export `SYCL_EXTERNAL` functions from device images with kernels, | ||||||||||||||||||||
but it would mean that when user's app calls `SYCL_EXTERNAL` function, it has | ||||||||||||||||||||
to link a whole kernel and all its dependencies - not only it increases the | ||||||||||||||||||||
amount of unnecessary linked code, but might also lead to build errors if the | ||||||||||||||||||||
kernel uses some features, which are not supported by target device (and they | ||||||||||||||||||||
are not used in the `SYCL_EXTERNAL` function). | ||||||||||||||||||||
Besides separating `SYCL_EXTERNAL` functions from kernels, `sycl-post-link` | ||||||||||||||||||||
can also distribute those functions into separate device images if device code | ||||||||||||||||||||
split is requested. This is done by grouping them using `module-id` attribute. | ||||||||||||||||||||
Non-`SYCL_EXTERNAL` functions used by `SYCL_EXTERNAL` functions with different | ||||||||||||||||||||
`sycl-module-id` attributes are copied to device images corresponding to those | ||||||||||||||||||||
`SYCL_EXTERNAL` functions to make them self-contained. | ||||||||||||||||||||
In case one `SYCL_EXTERNAL` function uses another `SYCL_EXTERNAL` function | ||||||||||||||||||||
with different `sycl-module-id` attribute, the second one is not copied to the | ||||||||||||||||||||
device image with the first function, but dependency between those device images | ||||||||||||||||||||
is recorder instead. | ||||||||||||||||||||
|
||||||||||||||||||||
AlexeySachkov marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
After `SYCL_EXTERNAL` functions are arranged into a separate device image(s), | ||||||||||||||||||||
all non-`SYCL_EXTERNAL` functions and `SYCL_EXTERNAL` functions left in device | ||||||||||||||||||||
images with kernels marked with internal linkage to avoid multiple definition | ||||||||||||||||||||
errors during runtime linking. | ||||||||||||||||||||
Device images with `SYCL_EXTERNAL` functions will also get a list of names | ||||||||||||||||||||
of exported functions attached to them through device image properties | ||||||||||||||||||||
(described below). | ||||||||||||||||||||
|
||||||||||||||||||||
**NOTE**: If device code split is enabled, it seems reasonable to perform | ||||||||||||||||||||
exports arrangement before device code split procedure. | ||||||||||||||||||||
|
||||||||||||||||||||
In order to collect information about imported symbols `sycl-post-link` looks | ||||||||||||||||||||
through LLVM IR and for each declared but not defined symbol records its name, | ||||||||||||||||||||
except the following cases: | ||||||||||||||||||||
- Declarations with `__` prefix in demangled name are not recorded as imported | ||||||||||||||||||||
functions | ||||||||||||||||||||
- Declarations with `__spirv_*` prefix should not be recorded as dependencies | ||||||||||||||||||||
since they represent SPIR-V operations and will be transformed to SPIR-V | ||||||||||||||||||||
instructions during LLVM->SPIR-V translation. | ||||||||||||||||||||
Fznamznon marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
- Based on some attributes (which could be defined later) we may want to avoid | ||||||||||||||||||||
listing some functions as imported ones | ||||||||||||||||||||
- This is needed to have possibility to call device-specific builtins not | ||||||||||||||||||||
starting with `__` by forward-declaring them in DPC++ code | ||||||||||||||||||||
Fznamznon marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
|
||||||||||||||||||||
**NOTE**: If device code split is enabled, imports collection is performed after | ||||||||||||||||||||
split and it is performed on splitted images. | ||||||||||||||||||||
|
||||||||||||||||||||
All collected information is attached to a device image via properties | ||||||||||||||||||||
mechanism. | ||||||||||||||||||||
|
||||||||||||||||||||
Each device image is supplied with an array of property sets: | ||||||||||||||||||||
``` | ||||||||||||||||||||
struct pi_device_binary_struct { | ||||||||||||||||||||
... | ||||||||||||||||||||
// Array of property sets | ||||||||||||||||||||
pi_device_binary_property_set PropertySetsBegin; | ||||||||||||||||||||
pi_device_binary_property_set PropertySetsEnd; | ||||||||||||||||||||
}; | ||||||||||||||||||||
|
||||||||||||||||||||
``` | ||||||||||||||||||||
Each property set is represented by the following struct: | ||||||||||||||||||||
``` | ||||||||||||||||||||
// Named array of properties. | ||||||||||||||||||||
struct _pi_device_binary_property_set_struct { | ||||||||||||||||||||
char *Name; // the name | ||||||||||||||||||||
pi_device_binary_property PropertiesBegin; // array start | ||||||||||||||||||||
pi_device_binary_property PropertiesEnd; // array end | ||||||||||||||||||||
}; | ||||||||||||||||||||
``` | ||||||||||||||||||||
It contains name of property set and array of properties. Each property is | ||||||||||||||||||||
represented by the following struct: | ||||||||||||||||||||
``` | ||||||||||||||||||||
struct _pi_device_binary_property_struct { | ||||||||||||||||||||
char *Name; // null-terminated property name | ||||||||||||||||||||
void *ValAddr; // address of property value | ||||||||||||||||||||
uint32_t Type; // _pi_property_type | ||||||||||||||||||||
uint64_t ValSize; // size of property value in bytes | ||||||||||||||||||||
}; | ||||||||||||||||||||
``` | ||||||||||||||||||||
|
||||||||||||||||||||
List of imported symbols is represented as a single property set with name | ||||||||||||||||||||
`SYCL/imported symbols` recorded in the `Name` field of property set. | ||||||||||||||||||||
Each property in this set holds name of the particular imported symbol recorded | ||||||||||||||||||||
in the `Name` field of the property. | ||||||||||||||||||||
s-kanaev marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
List of exported symbols is represented in the same way, except the | ||||||||||||||||||||
corresponding set has the name `SYCL/exported symbols`. | ||||||||||||||||||||
|
||||||||||||||||||||
### DPC++ runtime changes | ||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I found it very difficult to understand the big picture for the DPC++runtime changes. It would be nice to have an overview up front. Maybe something like this:
Note that I'm not really sure this is how the design will work, but I think this is what you intend. Can you confirm if my overall understanding of the runtime design is accurate? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think your understanding is correct, except that the design doesn't actually depend on device images format. I.e. the algorithm of searching won't be changed if device images are pre-compiled native device binaries. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sure, the symbol-search part of the algorithm can probably work for device images in native code format. However, I think the online linking part will only work for device images in SPIR-V format. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I also wanted to stress that I think it's important to add some description similar to what I suggest to the introduction to the "DPC++ runtime changes" section. I think this will clarify the following points which are currently unclear:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I don't agree with you here. Yes, right now we don't have backends that support linking of native device binaries. But we don't say that it is not possible in the future. I think format of device image shouldn't affect design of runtime changes. Most likely when linking of native device binaries is supported, it won't matter for runtime which format device image has, it will just call some PI API for link. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @smaslov-intel, my understanding is that in OpenCL terminology, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. https://www.khronos.org/registry/OpenCL/sdk/2.2/docs/man/html/clCompileProgram.html
Compile does produce "binary", otherwise the above wouldn't be possible. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I think @bashbaug should be able to clarify this. In the meantime: From clCreateKernel:
From clBuildProgram:
From clLinkProgram:
From clCompileProgram:
It seems to me that intent was that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Can we somehow query OpenCL about what is in the binary such that we could properly control it's linking? |
||||||||||||||||||||
|
||||||||||||||||||||
DPC++ RT performs *device images collection* task by grouping all device | ||||||||||||||||||||
AlexeySachkov marked this conversation as resolved.
Show resolved
Hide resolved
kbobrovs marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
images required to execute a kernel based on the list of exports/imports and | ||||||||||||||||||||
links them together using PI API. | ||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I suggest that we add a sub-section about this The sub-section could describe the API, its behavior (whether we assume that it is capable to link native binaries regardless of device or that it is some optional capability which should be checked before usage), possible implementation/limitations. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added, please see lines 236-259. |
||||||||||||||||||||
|
||||||||||||||||||||
#### Device images collection | ||||||||||||||||||||
|
||||||||||||||||||||
DPC++ Runtime class named ProgramManager stores device images using following | ||||||||||||||||||||
Fznamznon marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
data structure: | ||||||||||||||||||||
``` | ||||||||||||||||||||
/// Keeps all available device executable images added via \ref addImages. | ||||||||||||||||||||
/// Organizes the images as a map from a kernel set id to the vector of images | ||||||||||||||||||||
/// containing kernels from that set. | ||||||||||||||||||||
/// Access must be guarded by the \ref Sync::getGlobalLock() | ||||||||||||||||||||
std::unordered_map<SymbolSetId, | ||||||||||||||||||||
std::unique_ptr<std::vector<RTDeviceBinaryImageUPtr>>> | ||||||||||||||||||||
m_DeviceImages; | ||||||||||||||||||||
|
||||||||||||||||||||
using StrToKSIdMap = std::unordered_map<string_class, SymbolSetId>; | ||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should this be a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think use-case when kernels have the same name is not really legal. SYCL 2020 spec says "In the case of the shared-source compilation model, the kernels have to be uniquely identified by both host and device compiler.". So if two kernels (even if they are in different objects) have the same name, they don't really uniquely identified, right? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let's have the following use-case: // Application
queue Q;
Lib1Func(Q);
Lib2Func(Q);
// Lib1
void Lib1Func(queue &Q) {
Q.sbumit([](handle &H) {
H.parallel_for<class Calculate>(...);
});
}
// Lib2
void Lib2Func(queue &Q) {
Q.sbumit([](handle &H) {
H.parallel_for<class Calculate>(...);
});
}
Is this use-case handled by SYCL 2020 spec? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure that SYCL 2020 (or other) spec say something about such use cases, especially when libraries are involved. However, looking at this comment #3210 (comment) , I think this case is illegal. @gmlueck , could you please confirm? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe this is illegal because the spec says this:
The spec could certainly be more clear about this, though. I'll propose updated wording to the SYCL 2020 spec clarifying that a kernel name must be unique across the entire application. Note that it is legal for two translation units (or two libraries) to call the same kernel, in which case both instances of the kernel will have the same name. For example, consider a named kernel object that is called from two different libraries:
In such a case, both libraries will have a definition of the kernel named |
||||||||||||||||||||
/// Maps names of kernels from a specific OS module (.exe .dll) to their set | ||||||||||||||||||||
/// id (the sets are disjoint). | ||||||||||||||||||||
std::unordered_map<OSModuleHandle, SymbolSetId> m_SymbolSets; | ||||||||||||||||||||
Fznamznon marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
``` | ||||||||||||||||||||
Assume each device image represents some combination of symbols and different | ||||||||||||||||||||
device images may contain only exactly the same or not overlapping combination | ||||||||||||||||||||
of symbols. If it is not so, there can be two cases: | ||||||||||||||||||||
Fznamznon marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
- Symbols are the same. In this case it doesn't matter which device image is | ||||||||||||||||||||
s-kanaev marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
taken to use duplicated symbol | ||||||||||||||||||||
Fznamznon marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
- Symbols are not the same. In this case ODR violation takes place, such | ||||||||||||||||||||
Fznamznon marked this conversation as resolved.
Show resolved
Hide resolved
Fznamznon marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
situation leads to undefined behaviour. For more details refer to | ||||||||||||||||||||
[ODR violations](#ODR-violations) section. | ||||||||||||||||||||
Fznamznon marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
|
||||||||||||||||||||
Each combination of symbols is assigned with an Id number - symbol set Id. | ||||||||||||||||||||
A combination of symbols can exist in different formats (i.e. SPIR-V/AOT | ||||||||||||||||||||
compiled binary and etc). | ||||||||||||||||||||
`m_DeviceImages` maps an Id number to an array with device images which represent | ||||||||||||||||||||
Fznamznon marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
the same combination of symbols in different formats. | ||||||||||||||||||||
`m_SymbolSets` contains mapping from symbol name to symbol set Id for each OS | ||||||||||||||||||||
module (.exe/.so/.dll). | ||||||||||||||||||||
`std::unordered_map` allows to search and access its elements with constant-time | ||||||||||||||||||||
s-kanaev marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
complexity. | ||||||||||||||||||||
|
||||||||||||||||||||
Before compilation of device image to execute a kernel RT checks if the image | ||||||||||||||||||||
contains any import information in its properies and if it does, then RT | ||||||||||||||||||||
performs device images collection in order to resolve dependencies. | ||||||||||||||||||||
|
||||||||||||||||||||
Ids of all needed symbol sets are found. This is done by iterating through | ||||||||||||||||||||
`m_SymbolSets` map, i.e. iterating through all available OS modules without | ||||||||||||||||||||
predefined order and searching for first unresolved symbol in list of imports | ||||||||||||||||||||
s-kanaev marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
set of target device image. Once device image that contains first symbol is | ||||||||||||||||||||
met, remaining exported symbols are checked in found image and if | ||||||||||||||||||||
they match some imported symbols then these matched symbols will be marked as | ||||||||||||||||||||
resolved. The procedure repeats until all imported symbols are resolved. | ||||||||||||||||||||
For each found symbol set Id program cache is checked in case if | ||||||||||||||||||||
necessary set of `SYCL_EXTERNAL` functions has been compiled and if it is true, | ||||||||||||||||||||
then compiled device image will be re-used for linking. | ||||||||||||||||||||
Otherwise device image containing required symbols set will be compiled and | ||||||||||||||||||||
stored in cache. | ||||||||||||||||||||
|
||||||||||||||||||||
#### Program caching | ||||||||||||||||||||
|
||||||||||||||||||||
Existing support for device code caching is re-used to cache programs created | ||||||||||||||||||||
from device images with SYCL external functions and linked device images with | ||||||||||||||||||||
imports information. | ||||||||||||||||||||
|
||||||||||||||||||||
##### In-memory cache | ||||||||||||||||||||
|
||||||||||||||||||||
Programs that contain only `SYCL_EXTERNAL` functions will be cached only in | ||||||||||||||||||||
compiled state, so they can be linked with other programs during dependency | ||||||||||||||||||||
resolution. | ||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay, there can be two cases when
Does it make sense? |
||||||||||||||||||||
|
||||||||||||||||||||
The existing mechanism of caching is not changed for programs with | ||||||||||||||||||||
imports information. They are stored in cache after they compiled and linked | ||||||||||||||||||||
with programs that provide their dependencies. To identify linked programs | ||||||||||||||||||||
Id of "main" set of symbols (i.e. the one which actually contain kernels) will | ||||||||||||||||||||
be used. | ||||||||||||||||||||
|
||||||||||||||||||||
##### Persistent cache | ||||||||||||||||||||
|
||||||||||||||||||||
The documented approach to persistent cache needs to be expanded in presence | ||||||||||||||||||||
of dynamic linking support. One of the identifiers for built image hash is | ||||||||||||||||||||
hash made out of device image used as input for the JIT compilation. | ||||||||||||||||||||
In case when "main" image have imports information, device image hash should be | ||||||||||||||||||||
created from all device images that are necessary to build it, i.e. hash out | ||||||||||||||||||||
of "main" device image and set of 'SYCL_EXTERNAL'-only images that define all | ||||||||||||||||||||
symbols imported by "main device image. | ||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Current approach assume storing final binary after linking it means:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Do you mean compile to binary "main" image and "library" image and store them on disk separately? It sounds reasonable (the real dynamic libraries actually work this way), however I don't think we have devices that support linking of native device binaries yet, so right now It doesn't seem possible. |
||||||||||||||||||||
|
||||||||||||||||||||
Fznamznon marked this conversation as resolved.
Show resolved
Hide resolved
Fznamznon marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
## Corner cases and limitations | ||||||||||||||||||||
|
||||||||||||||||||||
It is not guaranteed that behaviour of host shared libraries and device shared | ||||||||||||||||||||
libraries will always match. There are several cases when it can occur, the | ||||||||||||||||||||
Fznamznon marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||
next sections will cover details of such cases. | ||||||||||||||||||||
|
||||||||||||||||||||
### ODR violations | ||||||||||||||||||||
|
||||||||||||||||||||
C++ standard defines One Definition Rule as: | ||||||||||||||||||||
> Every program shall contain exactly one definition of every non-inline | ||||||||||||||||||||
function or variable that is odr-used in that program outside of a discarded | ||||||||||||||||||||
statement; no diagnostic required. | ||||||||||||||||||||
The definition can appear explicitly in the program, it can be found in the | ||||||||||||||||||||
standard or a user-defined library, or (when appropriate) it is implicitly | ||||||||||||||||||||
defined. | ||||||||||||||||||||
|
||||||||||||||||||||
|
||||||||||||||||||||
Here is an example: | ||||||||||||||||||||
|
||||||||||||||||||||
 | ||||||||||||||||||||
|
||||||||||||||||||||
Both libraries libB and libC provide two different definitions of function | ||||||||||||||||||||
`b()`, so this example illustrates ODR violation. Technically this case has | ||||||||||||||||||||
undefined behaviour, however it is possible to run and compile this example on | ||||||||||||||||||||
Linux and Windows. Whereas on Linux only function `b()` from library libB is | ||||||||||||||||||||
called, on Windows both versions of function `b()` are used. | ||||||||||||||||||||
Most of backends online linkers act like static linkers, i.e. just merge | ||||||||||||||||||||
device images with each other, so it is not possible to correctly imitate | ||||||||||||||||||||
Windows behaviour in device code linking because attempts to do it will result | ||||||||||||||||||||
in multiple definition errors. | ||||||||||||||||||||
|
||||||||||||||||||||
Given that, it is not guaranteed that behaviour of shared host libraries and | ||||||||||||||||||||
shared device libraries will always match in case of such ODR violations. | ||||||||||||||||||||
|
||||||||||||||||||||
#### LD_PRELOAD | ||||||||||||||||||||
|
||||||||||||||||||||
Another way to violate ODR is `LD_PRELOAD` environment variable on Linux. It | ||||||||||||||||||||
allows to load specified shared library before any other shared libraries so it | ||||||||||||||||||||
will be searched for symbols before other shared libraries. It allows to | ||||||||||||||||||||
substitute functions from regular shared libraries by functions from preloaded | ||||||||||||||||||||
library. | ||||||||||||||||||||
Device code registration is implemented using global constructors. Order of | ||||||||||||||||||||
global constructors calling is not defined across different translation units, | ||||||||||||||||||||
so with current design of device shared libraries and device code registration | ||||||||||||||||||||
mechanism it is not possible to understand which device code comes from | ||||||||||||||||||||
preloaded library and which comes from regular shared libraries. | ||||||||||||||||||||
|
||||||||||||||||||||
Here is an example: | ||||||||||||||||||||
|
||||||||||||||||||||
 | ||||||||||||||||||||
|
||||||||||||||||||||
"libPreload" library is preloaded using `LD_PRELOAD` environment variable. | ||||||||||||||||||||
In this example, device code from "libPreload" might be registered after | ||||||||||||||||||||
device code from "libA". | ||||||||||||||||||||
|
||||||||||||||||||||
To implement basic support, for each device image we can record name of the | ||||||||||||||||||||
library where this device image comes from and parse content of `LD_PRELOAD` | ||||||||||||||||||||
environment variable to choose the proper images. However such implementation | ||||||||||||||||||||
will only allow to substitute a whole device image and not an arbitrary | ||||||||||||||||||||
function (unless it is the only function in a device image), because partial | ||||||||||||||||||||
substitution will cause multiple definition errors during runtime linking. | ||||||||||||||||||||
|
||||||||||||||||||||
### Run-time libraries loading | ||||||||||||||||||||
|
||||||||||||||||||||
It is possible to load shared library during run-time. Both Linux and Windows | ||||||||||||||||||||
provide a way to do so (for example `dlopen()` on Linux or `LoadLibrary` on | ||||||||||||||||||||
Windows). | ||||||||||||||||||||
In case run-time loading is used to load some shared library, the symbols from | ||||||||||||||||||||
this shared library do not appear in the namespace of the main program. It means | ||||||||||||||||||||
that even though shared library is loaded successfully in run-time, it is not | ||||||||||||||||||||
possible to use symbols from it directly. The symbols from run-time loaded | ||||||||||||||||||||
library can be accessed by address which can be obtained using corresponding | ||||||||||||||||||||
OS-dependent API (for example `dlsym()` on Linux). | ||||||||||||||||||||
|
||||||||||||||||||||
The problem here is that even though symbols from run-time loaded shared | ||||||||||||||||||||
library are not part of application's namespace, the library is loaded through | ||||||||||||||||||||
standard mechanism, i.e. its global constructors are invoked which means that | ||||||||||||||||||||
device code from it is registered, so it is not possible to | ||||||||||||||||||||
understand whether device code comes from run-time loaded library or not. | ||||||||||||||||||||
If such run-time loaded library exports device symbols and they | ||||||||||||||||||||
somehow match with symbols that actually directly used in device code | ||||||||||||||||||||
somewhere, it is possible that symbols from run-time loaded library | ||||||||||||||||||||
will be unexpectedly used. | ||||||||||||||||||||
|
||||||||||||||||||||
To resolve this problem we need to ensure that device code registered from | ||||||||||||||||||||
run-time loaded library appears at the end of symbols search list, however | ||||||||||||||||||||
having that device code registration is triggered by global constructors, it | ||||||||||||||||||||
doesn't seem possible. | ||||||||||||||||||||
|
||||||||||||||||||||
One more possible mitigation would be to record name of the library from which | ||||||||||||||||||||
each symbol should be imported, but it still won't resolve all potential | ||||||||||||||||||||
issues with run-time library loading, because user can load the library with the | ||||||||||||||||||||
same name as one of the explicitly linked libraries. |
Uh oh!
There was an error while loading. Please reload this page.