# A brief overview of kernel/program caching mechanism.

## Rationale behind caching

*Use-case #1.* Looped enqueue of the same kernel:
```C++
using namespace cl::sycl;

queue Q;
std::vector<buffer> Bufs;

...
// initialize Bufs with some number of buffers
...

for (size_t Idx = 0; Idx < Bufs.size(); ++Idx) {
  Q.submit([&](handler &CGH) {
    auto Acc = Bufs[Idx].get_access<access::mode::read_write>(CGH);

    CGH.parallel_for<class TheKernel>(
        range<2>{N, M}, [=](item<2> Item) { ... });
  });
}
```

*Use-case #2.* Enqueue of multiple kernels within a single program<sup>[1](#what-is-program)</sup>:
```C++
using namespace cl::sycl;

queue Q;

Q.submit([&](handler &CGH) {
  ...

  CGH.parallel_for<class TheKernel_1>(
      range<2>{N_1, M_1}, [=](item<2> Item) { ... });
});
Q.submit([&](handler &CGH) {
  ...

  CGH.parallel_for<class TheKernel_2>(
      range<2>{N_2, M_2}, [=](item<2> Item) { ... });
});
Q.submit([&](handler &CGH) {
  ...

  CGH.parallel_for<class TheKernel_3>(
      range<2>{N_3, M_3}, [=](item<2> Item) { ... });
});
...
Q.submit([&](handler &CGH) {
  ...

  CGH.parallel_for<class TheKernel_K>(
      range<2>{N_K, M_K}, [=](item<2> Item) { ... });
});
```

Both these use-cases will need to build the program or kernel multiple times.
When JIT is employed this process may take quite a lot of time.

In order to eliminate this waste of run-time we introduce kernel/program
caching. The cache is per-context and it caches the underlying objects of
non-interop kernels and programs which are built with no options.

<a name="what-is-program">1</a>: Here we use the term "program" in the same
sense as OpenCL does, i.e. a set of kernels.

## Data structure of cache

The cache stores the underlying PI objects of `cl::sycl::program` and
`cl::sycl::kernel` in a per-context data storage. The storage consists of two
maps: one for programs and the other for kernels.

The programs map's key consists of three components: the kernel set
id<sup>[1](#what-is-ksid)</sup>, the specialization constants, and the device
the program is built for.

The kernels map's key also consists of three components: the program the
kernel belongs to, the kernel name<sup>[2](#what-is-kname)</sup>, and the
device the program is built for.
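
To make the key structure concrete, here is a minimal sketch of the two maps.
All type names are hypothetical placeholders, and the values are shown as raw
handles for brevity even though the cache actually stores `BuildResult`
wrappers (see the "Thread-safety" section below); the real definitions live in
`kernel_program_cache.hpp`.

```C++
#include <cstddef>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical placeholders for the underlying PI handles and key parts.
using PiProgramT  = void *;
using PiKernelT   = void *;
using PiDeviceT   = void *;
using KernelSetId = size_t;                     // id of the kernel set (translation unit)
using SpecConstsT = std::vector<unsigned char>; // serialized specialization constants

// Program cache: (kernel set id, specialization constants, device) -> program.
using ProgramKeyT   = std::tuple<KernelSetId, SpecConstsT, PiDeviceT>;
using ProgramCacheT = std::map<ProgramKeyT, PiProgramT>;

// Kernel cache: (program, kernel name, device) -> kernel.
using KernelKeyT   = std::tuple<PiProgramT, std::string, PiDeviceT>;
using KernelCacheT = std::map<KernelKeyT, PiKernelT>;
```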

<a name="what-is-ksid">1</a>: The kernel set id is merely the number of the
translation unit which contains at least one kernel.

<a name="what-is-kname">2</a>: The kernel name is the mangled class name which
is provided to methods of `cl::sycl::handler` (e.g. `parallel_for` or
`single_task`).

## Points of improvement (things to do)

- Implement an LRU policy for cached objects. See the [issue](https://github.com/intel/llvm/issues/2517).
- Allow for caching of objects built with some build options.
- Employ the same built object for multiple devices of the same ISA,
  capabilities and so on. *NOTE:* It is not yet clear whether it is possible to
  check that two distinct devices are *exactly* the same.
- Improve testing: cover real use-cases. See the currently covered cases
  [here](https://github.com/intel/llvm/blob/sycl/sycl/unittests/kernel-and-program/Cache.cpp).

## Implementation details

The caches are represented by an instance of the [`KernelProgramCache`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp)
class. The class is instantiated in a per-context manner.

The `KernelProgramCache` is the storage described above.

### When does the cache come into play?

The cache is employed when one submits a kernel for execution or builds a
program or a kernel with the SYCL API. At the same time, programs and kernels
are cached only when they are built from C++ source, i.e. when the
`program::build_with_kernel_type<>()` and `program::get_kernel<>()` methods are
employed. This restriction is implemented via the
`program_impl::is_cacheable_with_options()` and `program_impl::is_cacheable()`
methods. The latter method only returns a boolean flag which is false by
default and is set to true in a single use-case. One can find the use-cases and
cache filling in the
[unit-tests](https://github.com/intel/llvm/blob/sycl/sycl/unittests/kernel-and-program/Cache.cpp).
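
For illustration, here is a minimal sketch of the cacheable path described
above; the kernel name, range and kernel body are placeholders, and whether a
particular submission is actually served from the cache is up to the runtime.

```C++
#include <CL/sycl.hpp>

using namespace cl::sycl;

class CacheableKernel; // placeholder kernel name

int main() {
  queue Q;

  // Built from C++ source with no build options: this is the cacheable path.
  program Prg(Q.get_context());
  Prg.build_with_kernel_type<CacheableKernel>();

  // Getting the kernel out of the built program is subject to caching as well.
  kernel Krn = Prg.get_kernel<CacheableKernel>();
  (void)Krn;

  // Repeated submissions of the same kernel within the same context can reuse
  // the cached PI program/kernel instead of JIT-compiling it again.
  Q.submit([&](handler &CGH) {
    CGH.parallel_for<CacheableKernel>(range<1>{1024}, [=](id<1>) { /* ... */ });
  });
  Q.wait();

  return 0;
}
```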

How does it work, i.e. at which point is the cache employed? At some point of
`ExecCGCommand`'s enqueue process one of the program manager's methods will be
called: either `ProgramManager::getBuildPIProgram` or
`ProgramManager::getOrCreateKernel`. Both of these methods will call the
template function [`getOrBuild`](../source/detail/program_manager/program_manager.cpp#L149).
The *Build* function passed to `getOrBuild` actually builds the kernel or
program; the Acquire and GetCache functions it also receives are described in
the "Thread-safety" section below.

Caching is not done:
- when the program is built from source, i.e. with the
  `program::build_with_source()` or `program::compile_with_source()` method;
- when the program is the result of linking multiple programs.
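
For contrast, a small hedged sketch of a non-cacheable path, building a program
from OpenCL C source at run time; the kernel source and name are placeholders,
and this only makes sense on a backend that accepts OpenCL C source.

```C++
#include <CL/sycl.hpp>

using namespace cl::sycl;

int main() {
  queue Q;
  program Prg(Q.get_context());

  // Built from OpenCL C source: the resulting program is not cached.
  Prg.build_with_source("kernel void ext_kernel() { }");
  kernel Krn = Prg.get_kernel("ext_kernel");

  Q.submit([&](handler &CGH) {
    // Interop-style launch of the pre-built kernel.
    CGH.single_task(Krn);
  });
  Q.wait();

  return 0;
}
```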

### Thread-safety

Why do we need thread-safety here? It is quite possible to have a use-case
where the `cl::sycl::context` is shared across multiple threads (e.g. via
sharing a queue). The possibility of enqueueing multiple cacheable kernels
simultaneously from multiple threads requires us to provide thread-safety for
the cache.
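
For instance, a sketch of the kind of scenario that requires the cache to be
thread-safe; the kernel name, iteration count and range are placeholders.

```C++
#include <CL/sycl.hpp>
#include <thread>

using namespace cl::sycl;

class WorkerKernel; // placeholder kernel name

int main() {
  queue Q; // one queue, and thus one context, shared by both threads

  auto Work = [&Q]() {
    for (int I = 0; I < 100; ++I)
      Q.submit([&](handler &CGH) {
        // Both threads may reach the per-context cache for the same kernel
        // at the same time.
        CGH.parallel_for<WorkerKernel>(range<1>{1024}, [=](id<1>) { /* ... */ });
      });
  };

  std::thread T1{Work};
  std::thread T2{Work};
  T1.join();
  T2.join();
  Q.wait();

  return 0;
}
```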

It is worth noting that we do not cache the PI resource (kernel or program) on
its own. Instead, we augment the resource with the status of its build process.
Hence, what is cached is a wrapper structure, `BuildResult`, which contains
three information fields: a pointer to the built resource, the build error (if
applicable) and the current build status (one of "in progress", "succeeded" or
"failed").

One can find the definition of the `BuildResult` template in
[KernelProgramCache](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp).
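
A condensed sketch of what such a wrapper may look like; the field names and
types below are illustrative only, the real definition is in the header linked
above.

```C++
#include <atomic>
#include <string>

enum class BuildState { InProgress, Succeeded, Failed };

// Illustrative only; see kernel_program_cache.hpp for the real BuildResult.
template <typename T> struct BuildResult {
  std::atomic<T *> Ptr{nullptr};                         // built resource, if any
  std::atomic<BuildState> State{BuildState::InProgress}; // current build status

  // Stored contents of the build error (its message and error code), if any.
  std::string ErrorMsg;
  int ErrorCode = 0;
};
```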

The pointer to the built resource and the build status are both atomic
variables. The atomicity of these variables allows one to hold the lock on the
cache for quite a short time and perform the rest of the build/wait process
without forcing other threads to wait on lock availability.

A specialization of the helper class [Locked](https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/detail/locked.hpp)
for a reference to the proper mapping is returned by the Acquire function. The
use of this class implements RAII to make the code a bit cleaner. The GetCache
function then returns the mapping to be employed, i.e. it fetches the mapping
of kernel name plus device to `BuildResult` for the proper program, since
`getOrBuild` works with a mapping of a key (whichever it is) to a `BuildResult`
specialization. The structure is specialized with either `PiKernel` or
`PiProgram`<sup>[1](#remove-pointer)</sup>.
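
A condensed sketch of the idea behind `Locked` and the Acquire/GetCache pair;
all names and signatures here are illustrative rather than the real ones.

```C++
#include <functional>
#include <map>
#include <mutex>
#include <string>

// RAII helper: the lock is held for as long as the returned object lives.
template <typename T> class Locked {
  std::reference_wrapper<T> MValue;
  std::unique_lock<std::mutex> MLock;

public:
  Locked(T &Value, std::mutex &Mtx) : MValue{Value}, MLock{Mtx} {}
  T &get() const { return MValue.get(); }
};

// Illustrative storage: Acquire locks the whole per-context storage, and
// GetCache then selects the concrete key -> BuildResult mapping out of it.
struct IllustrativeStorage {
  // Placeholder mapping; stands in for "key -> BuildResult specialization".
  std::map<std::string, int> MCache;
  std::mutex MMutex;

  Locked<std::map<std::string, int>> acquire() { return {MCache, MMutex}; }
};
```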

### Core of caching mechanism

Now, how does `getOrBuild` work?

First, we fetch the cache with sequential calls to the Acquire and GetCache
functions. Then we check whether we are the first to build this kernel/program.
If another thread has already started building the same object, we wait until
the build process for this kernel/program is finished (either successfully or
with a failure); on failure the stored exception<sup>[2](#exception-data)</sup>
is reported in the waiting thread as well.
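
Putting the pieces together, here is a hedged sketch of the overall
`getOrBuild` flow. It condenses the description above: every name is
illustrative, the Build callable is assumed to return a raw pointer, and the
error is kept as a `std::exception_ptr` rather than as a message plus error
code.

```C++
#include <atomic>
#include <condition_variable>
#include <exception>
#include <mutex>

enum class BuildState { InProgress, Succeeded, Failed };

template <typename T> struct BuildResult {
  std::atomic<T *> Ptr{nullptr};
  std::atomic<BuildState> State{BuildState::InProgress};
  std::exception_ptr Error;   // stored build error, if any
  std::mutex Mtx;             // protects the condition-variable handshake
  std::condition_variable CV; // signals waiters when the build finishes
};

// AcquireF locks the storage, GetCacheF yields a key -> BuildResult<T> map,
// BuildF performs the actual build. All of them are illustrative callables.
template <typename T, typename KeyT, typename AcquireF, typename GetCacheF,
          typename BuildF>
T *getOrBuild(AcquireF &&Acquire, GetCacheF &&GetCache, BuildF &&Build,
              const KeyT &Key) {
  BuildResult<T> *BR = nullptr;
  bool WeBuild = false;
  {
    auto LockedCache = Acquire();        // RAII lock on the storage
    auto &Cache = GetCache(LockedCache); // key -> BuildResult<T> mapping
    auto Ins = Cache.try_emplace(Key);   // insert an "in progress" stub
    BR = &Ins.first->second;
    WeBuild = Ins.second;                // true if we inserted the stub
  }                                      // the cache lock is released early

  if (WeBuild) {
    T *Built = nullptr;
    std::exception_ptr Err;
    try {
      Built = Build();
    } catch (...) {
      Err = std::current_exception();
    }
    {
      std::lock_guard<std::mutex> Guard{BR->Mtx};
      BR->Ptr.store(Built);
      BR->Error = Err;
      BR->State.store(Err ? BuildState::Failed : BuildState::Succeeded);
    }
    BR->CV.notify_all();
  } else {
    std::unique_lock<std::mutex> Lock{BR->Mtx};
    BR->CV.wait(Lock,
                [&] { return BR->State.load() != BuildState::InProgress; });
    if (BR->Error)
      std::rethrow_exception(BR->Error); // report the same failure to waiters
  }
  return BR->Ptr.load();
}
```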

<a name="remove-pointer">1</a>: The use of `std::remove_pointer` is omitted
here for the sake of simplicity.

<a name="exception-data">2</a>: Actually, we store the contents of the
exception: its message and error code.