Commit d141bef: [Doc] Address comment
Author: Sergey Kanaev
Signed-off-by: Sergey Kanaev <[email protected]>
1 parent: 9454aee

1 file changed: +170 -41 lines

sycl/doc/KernelProgramCache.md

# A brief overview of the kernel/program caching mechanism

## Rationale behind caching

*Use-case #1.* Looped enqueue of the same kernel:
```C++
using namespace cl::sycl;

queue Q;
std::vector<buffer<int>> Bufs;

...
// initialize Bufs with some number of buffers
...

for (size_t Idx = 0; Idx < Bufs.size(); ++Idx) {
  Q.submit([&](handler &CGH) {
    auto Acc = Bufs[Idx].get_access<access::mode::read_write>(CGH);

    CGH.parallel_for<class TheKernel>(
        range<2>{N, M}, [=](item<2> Item) { ... });
  });
}
```

*Use-case #2.* Enqueue of multiple kernels within a single program<sup>[1](#what-is-program)</sup>:
```C++
using namespace cl::sycl;

queue Q;

Q.submit([&](handler &CGH) {
  ...

  CGH.parallel_for<class TheKernel_1>(
      range<2>{N_1, M_1}, [=](item<2> Item) { ... });
});
Q.submit([&](handler &CGH) {
  ...

  CGH.parallel_for<class TheKernel_2>(
      range<2>{N_2, M_2}, [=](item<2> Item) { ... });
});
Q.submit([&](handler &CGH) {
  ...

  CGH.parallel_for<class TheKernel_3>(
      range<2>{N_3, M_3}, [=](item<2> Item) { ... });
});
...
Q.submit([&](handler &CGH) {
  ...

  CGH.parallel_for<class TheKernel_K>(
      range<2>{N_K, M_K}, [=](item<2> Item) { ... });
});
```

Both of these use-cases will need to build the program or kernel multiple
times. When JIT is employed, this process may take quite a lot of time.

In order to eliminate this waste of run time we introduce kernel/program
caching. The cache is per-context and it caches the underlying objects of
non-interop kernels and programs which are built with no options.

<a name="what-is-program">1</a>: Here we use the term "program" in the same
sense as OpenCL does, i.e. a set of kernels.


## Data structure of cache

The cache stores the underlying PI objects of `cl::sycl::program` and
`cl::sycl::kernel` in per-context data storage. The storage consists of two
maps: one for programs and the other for kernels.

The key of the program map consists of three components: the kernel set
id<sup>[1](#what-is-ksid)</sup>, the specialization constants, and the device
the program is built for.

The key of the kernel map also consists of three components: the program the
kernel belongs to, the kernel name<sup>[2](#what-is-kname)</sup>, and the
device the program is built for.

<a name="what-is-ksid">1</a>: The kernel set id is merely the ordinal number of
the translation unit which contains at least one kernel.

<a name="what-is-kname">2</a>: The kernel name is the mangled name of the class
which is provided to methods of `cl::sycl::handler` (e.g. `parallel_for` or
`single_task`).
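
The shape of these two keys can be sketched in plain C++. This is only an illustration of the structure described above, not the actual implementation; the `Pi*Handle` aliases and the `std::string` encoding of the specialization constants are hypothetical stand-ins:

```C++
#include <map>
#include <string>
#include <tuple>

// Hypothetical stand-ins for the underlying PI handle types.
using PiProgramHandle = void *;
using PiKernelHandle = void *;
using PiDeviceHandle = int;

// Program cache key: (specialization constants, kernel set id, device).
using ProgramKey = std::tuple<std::string, int, PiDeviceHandle>;
// Kernel cache key: (program, kernel name, device).
using KernelKey = std::tuple<PiProgramHandle, std::string, PiDeviceHandle>;

// One storage per context, with one map per object kind.
std::map<ProgramKey, PiProgramHandle> ProgramCache;
std::map<KernelKey, PiKernelHandle> KernelCache;
```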


## Points of improvement (things to do)

- Implement LRU policy on cached objects. See [issue](https://github.com/intel/llvm/issues/2517).
- Allow for caching of objects built with some build options.
- Employ the same built object for multiple devices of the same ISA,
  capabilities and so on. *NOTE:* it's not really known if it's possible to
  check whether two distinct devices are *exactly* the same.
- Improve testing: cover real use-cases. See the currently covered cases [here](https://github.com/intel/llvm/blob/sycl/sycl/unittests/kernel-and-program/Cache.cpp).


## Implementation details

The caches are represented by an instance of the [`KernelProgramCache`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp)
class. The class is instantiated in a per-context manner.

The `KernelProgramCache` is the storage described above.

### When does the cache come into play?

The cache is employed when one submits a kernel for execution or builds a
program or kernel with the SYCL API. That means that the cache works either
when the user explicitly calls the
`program::build_with_kernel_type<>()`/`program::get_kernel<>()` methods or
when the SYCL RT builds or gets the required kernel. Cacheability of an object
is verified with the `program_impl::is_cacheable()` method. The SYCL RT will
check if the program is cacheable and will get the kernel with a call to the
`ProgramManager::getOrCreateKernel()` method.

*NOTE:* a kernel is cacheable if and only if the program it belongs to is
cacheable. On the other hand, if the program is cacheable, then each and every
kernel of this program will be cached too.

When invoked by the user, the `program::build_with_kernel_type<>()` and
`program::get_kernel<>()` methods will call the
`ProgramManager::getBuildPIProgram()` and `ProgramManager::getOrCreateKernel()`
methods respectively. Both of these methods will call the template
function [`getOrBuild()`](../source/detail/program_manager/program_manager.cpp#L149)
with multiple lambdas passed to it:
- Acquire function;
- GetCache function;
- Build function.

The *Acquire* function returns a locked version of the cache. Locking is
employed for thread safety. The threads are blocked only for the
insert-or-acquire attempt, i.e. when calling `map::insert` in the
[`getOrBuild`](../source/detail/program_manager/program_manager.cpp#L149)
function. The rest of the operation is done with the help of atomics and
condition variables (plus a mutex for the proper work of the condition
variable).
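
The wait side of this scheme can be sketched as follows. The `BuildEntry` structure and the function names are hypothetical illustrations of the atomics-plus-condition-variable idea, not the actual code:

```C++
#include <atomic>
#include <condition_variable>
#include <mutex>

// Illustrative sketch: a cache entry whose build may still be in flight.
struct BuildEntry {
  std::atomic<int> State{0}; // 0 = in progress, 1 = succeeded, 2 = failed
  std::mutex MWait;          // protects the condition variable below
  std::condition_variable CV;
};

// Called by threads that found an existing entry in the cache: block until
// the building thread publishes a terminal state.
void waitUntilBuilt(BuildEntry &E) {
  std::unique_lock<std::mutex> Lock(E.MWait);
  E.CV.wait(Lock, [&] { return E.State.load() != 0; });
}

// Called by the single thread that actually performed the build.
void publishResult(BuildEntry &E, bool Success) {
  E.State.store(Success ? 1 : 2);
  E.CV.notify_all();
}
```

Note that the cache-wide lock is not held while waiting; only the per-entry mutex backing the condition variable is.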

The *GetCache* function returns a reference to the `key->value` mapping out of
the locked instance of the cache. We will see the rationale behind it a bit
later.

The *Build* function actually builds the kernel or program.
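
A much-simplified sketch of how the three lambdas fit together is shown below. The name `getOrBuildSketch` and its signature are illustrative only; the real `getOrBuild()` in `program_manager.cpp` also handles waiting on in-progress builds and error propagation:

```C++
#include <map>

// KeyT/ValueT are given explicitly; the three callables are deduced.
template <typename KeyT, typename ValueT, typename AcquireF, typename GetCacheF,
          typename BuildF>
ValueT getOrBuildSketch(AcquireF &&Acquire, GetCacheF &&GetCache,
                        BuildF &&Build, const KeyT &Key) {
  auto &&LockedCache = Acquire();         // lock the per-context cache
  auto &Map = GetCache(LockedCache);      // pick the proper key->value map
  auto Ins = Map.insert({Key, ValueT{}}); // the insert-or-acquire attempt
  if (!Ins.second)
    return Ins.first->second; // entry already present: reuse it
  Ins.first->second = Build(); // we won the race: build and publish
  return Ins.first->second;
}
```

A second call with the same key finds the entry already inserted, so the Build function runs only once per key.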

Caching isn't done:
- when a program is built from source, i.e. with the
  `program::build_with_source()` or `program::compile_with_source()` method;
- when a program is the result of linking multiple programs.


### Thread-safety

Why do we need thread safety here? It's quite possible to have a use-case where
the `cl::sycl::context` is shared across multiple threads (e.g. via sharing a
queue). The possibility of enqueueing multiple cacheable kernels simultaneously
from multiple threads requires us to provide thread-safety for the cache.

It's worth noting that we don't cache the PI resource (kernel or program) on
its own. Instead, we augment the resource with the status of the build process.
Hence, what is cached is a wrapper structure, `BuildResult`, which contains
three information fields: a pointer to the built resource, the build error (if
applicable) and the current build status (one of "in progress", "succeeded",
"failed").

One can find the definition of the `BuildResult` template in [KernelProgramCache](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp).

The pointer to the built resource and the build status are both atomic
variables. Atomicity of these variables allows one to hold the lock on the
cache for quite a short time and perform the rest of the build/wait process
without the unwanted need for other threads to wait on lock availability.
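
A rough sketch of the wrapper is below. The field and type names are approximations for illustration; see `kernel_program_cache.hpp` for the real definition:

```C++
#include <atomic>
#include <string>

enum class BuildState { InProgress, Succeeded, Failed };

// T would be the underlying PI type, e.g. PiKernel or PiProgram.
template <typename T> struct BuildResultSketch {
  std::atomic<T *> Ptr{nullptr};                         // built PI resource
  std::atomic<BuildState> State{BuildState::InProgress}; // build status
  std::string ErrorMessage;                              // filled on failure
  int ErrorCode = 0;                                     // likewise
};
```

Only `Ptr` and `State` need to be atomic; the error fields are written before the state flips to `Failed` and read only after.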

A specialization of the helper class [Locked](https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/detail/locked.hpp)
for a reference to the proper mapping is returned by the Acquire function. The
class uses RAII to make the code look a bit cleaner. The GetCache function
will return the mapping to be employed, i.e. it will fetch the mapping of
kernel name plus device to `BuildResult` for the proper program, as
`getOrBuild` works with a mapping of a key (whichever it is) to a
`BuildResult` specialization. The structure is specialized with either
`PiKernel` or `PiProgram`<sup>[1](#remove-pointer)</sup>.
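
The RAII idea behind `Locked` can be sketched in a few lines. This is a hypothetical minimal version, not the actual class in `locked.hpp`:

```C++
#include <mutex>

// Holds the lock for as long as the object (and thus the reference to the
// guarded mapping) is alive.
template <typename T> class LockedSketch {
  std::unique_lock<std::mutex> Lock; // released automatically on destruction
  T &Value;                          // the guarded mapping

public:
  LockedSketch(T &V, std::mutex &M) : Lock{M}, Value{V} {}
  T &get() { return Value; }
};
```

A caller can mutate the guarded object through `get()` and the mutex is released as soon as the `LockedSketch` goes out of scope.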


### Core of caching mechanism

Now, how does `getOrBuild` work?
First, we fetch the cache with sequential calls to the Acquire and GetCache

...

process for this kernel/program is finished (either successfully or with a
failure).
76205

<a name="remove-pointer">1</a>: The use of `std::remove_pointer` was omitted
for the sake of simplicity here.

<a name="exception-data">2</a>: Actually, we store the contents of the
exception: its message and error code.