You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[SYCL][Doc] Improvements design for program cache (#3119)
* add support for persistent cache on-disk;
* describe cache eviction mechanism;
* define support for caching of device code built with build options;
* describe configuration parameters for caches;
* optimize in-memory cache keys.
Note: Backend runtimes used by SYCL can have extra environment or configurations
124
+
values (e.g. IGC has [igc_flags.def](https://github.com/intel/intel-graphics-compiler/blob/7f91dd6b9f2ca9c1a8ffddd04fa86461311c4271/IGC/common/igc_flags.def) which affect JIT process). Changing such
125
+
configuration will invalidate cache and manual cache cleanup should be done.
100
126
101
-
- Implement LRU policy on cached objects. See [issue](https://github.com/intel/llvm/issues/2517).
102
-
- Allow for caching of objects built with some build options.
103
-
- Employ the same built object for multiple devices of the same ISA,
104
-
capabilities and so on. *NOTE:* It's not really known if it's possible to
105
-
check if two distinct devices are *exactly* the same. Probably this should be
106
-
an improvement request for plugins.
107
-
- Improve testing: cover real use-cases. See currently covered cases [here](https://github.com/intel/llvm/blob/sycl/sycl/unittests/kernel-and-program/Cache.cpp).
127
+
<aname="what-is-kname">3</a>: Kernel name is a kernel ID mangled class' name
128
+
which is provided to methods of `cl::sycl::handler` (e.g. `parallel_for` or
129
+
`single_task`).
108
130
131
+
### Persistent cache
132
+
133
+
The cache works behind in-memory cache and stores the same underlying PI
134
+
object behind `cl::sycl::program` user-level objects in a per-context data
135
+
storage.
136
+
The storage is organized as a map for storing device code image. It uses
137
+
different keys to address difference in SYCL objects ids between applications
138
+
runs as well as the fact that the same kernel name can be used in different
139
+
SYCL applications.
140
+
141
+
The programs map's key consists of four components:
142
+
- device image id<sup>[1](#what-is-diid)</sup>,
143
+
- specialization constants values,
144
+
- device id<sup>[2](#what-is-did)</sup> this program is built for,
145
+
- build options id<sup>[3](#what-is-bopts)</sup>.
146
+
147
+
Hashes are used for fast built device image identification and shorten
148
+
filenames on disk. Once cache directory on disc is identified (see
There is set of configuration parameters which can be set as environment variables or parameters in `sycl.conf` and affect cache behavior:
176
+
| Environment variable | Values | Description |
177
+
| -------------------- | ------ | ----------- |
178
+
|`SYCL_CACHE_DIR`| Path | Path to persistent cache root directory. Default values are `%AppData%\Intel\sycl_program_cache` for Windows and `$HOME/intel/sycl_program_cache` on Linux. |
179
+
|`SYCL_CACHE_ENABLED`| ON, OFF | Switches persistent cache switch on or off. Default value is ON. |
180
+
|`SYCL_CACHE_MAX_SIZE`| Positive integer | Cache eviction is triggered once total size of cached images exceeds the value in megabytes (default - 8 192 for 8 GB). Set to 0 to disable size-based cache eviction. |
181
+
|`SYCL_CACHE_THRESHOLD`| Positive integer | Cache eviction threshold in days (default value is 7 for 1 week). Set to 0 for disabling time-based cache eviction. |
182
+
|`SYCL_CACHE_MIN_DEVICE_IMAGE_SIZE`| Positive integer | Minimum size of device code image in kilobytes which is reasonable to cache on disk because disk access operation may take more time than do JIT compilation for it. Default value is 0 to cache all images. |
183
+
|`SYCL_CACHE_MAX_DEVICE_IMAGE_SIZE`| Positive integer | Maximum size of device image in megabytes which is cached. Too big kernels may overload disk too fast. Default value is 0 to cache all images. |
109
184
110
185
## Implementation details
111
186
@@ -117,24 +192,21 @@ representing a context).
117
192
118
193
The `KernelProgramCache` is essentially a pair of maps as described above.
119
194
120
-
121
195
### When does the cache come at work?
122
196
123
-
The cache is used when one submits a kernel for execution or builds program or
124
-
with SYCL API. That means that the cache works when either user explicitly calls
197
+
The cache is used when one submits a kernel for execution or builds program with
198
+
SYCL API. That means that the cache works when either user explicitly calls
125
199
`program::build_with_kernel_type<>()`/`program::get_kernel<>()` methods or SYCL
126
200
RT builds a program or gets the required kernel as needed during application
127
201
execution. Cacheability of an object can be tested with
128
202
`program_impl::is_cacheable()` method. SYCL RT will only try to insert cacheable
129
203
programs or kernels into the cache. This is done as a part of
130
204
`ProgramManager::getOrCreateKernel()` method.
131
205
132
-
133
206
*NOTE:* a kernel is only cacheable if and only if the program it belongs to is
134
207
cacheable. On the other hand if the program is cacheable, then each and every
135
208
kernel of this program will be cached also.
136
209
137
-
138
210
All requests to build a program or to create a kernel - whether they originate
139
211
from explicit user API calls or from internal SYCL runtime execution logic - end
140
212
up with calling the function [`getOrBuild()`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/program_manager/program_manager.cpp#L149)
@@ -154,12 +226,6 @@ instance of cache. We will see rationale behind it a bit later.
154
226
155
227
*Build* function actually builds the kernel or program.
156
228
157
-
Caching isn't done:
158
-
- when program is built out of source with `program::build_with_source()` or
159
-
`program::compile_with_source()` method;
160
-
- when program is a result of linking multiple programs.
161
-
162
-
163
229
### Thread-safety
164
230
165
231
Why do we need thread safety here? It is quite possible to have a use-case when
@@ -210,10 +276,14 @@ A specialization of helper class [Locked](https://github.com/intel/llvm/blob/syc
210
276
for reference of proper mapping is returned by Acquire function. The use of this
211
277
class implements RAII to make code look cleaner a bit. Now, GetCache function
212
278
will return the mapping to be employed that includes the 3 components: kernel
213
-
name, device as well as any specialization constants. These get added to
279
+
name, device as well as any specialization constants values. These get added to
214
280
`BuildResult` and are cached. The `BuildResult` structure is specialized with
215
281
either `PiKernel` or `PiProgram`<sup>[1](#remove-pointer)</sup>.
216
282
283
+
### Hash function
284
+
285
+
STL hash function specialized for std::string is going to be used:
286
+
`template<> struct hash<std::string>`
217
287
218
288
### Core of caching mechanism
219
289
@@ -236,13 +306,104 @@ thread will try to build the same object once more.
236
306
237
307
`BuildResult` structure also contains synchronization objects: mutex and
238
308
condition variable. We employ them to signal waiting threads that the build
239
-
process for this kernl/program is finished (either successfuly or with a
309
+
process for this kernel/program is finished (either successfully or with a
240
310
failure).
241
311
242
-
243
312
<aname="remove-pointer">1</a>: The use of `std::remove_pointer` was omitted for
244
313
the sake of simplicity here.
245
314
246
315
<aname="exception-data">2</a>: Actually, we store contents of the exception:
247
316
its message and error code.
248
317
318
+
### Persistent cache storage structure
319
+
320
+
The device code image are stored on file system using structure below:
-`<device_hash>` - hash out of device information used to identify target device;
332
+
-`<device_image_hash>` - hash made out of device image used as input for the JIT compilation;
333
+
-`<spec_constants_values_hash>` - hash for specialization constants values;
334
+
-`<build_options_hash>` - hash for all build options;
335
+
-`<n>` - sequential number of hash collisions. When hashes matches for the specific build but full values don't, new cache item is added with incremented value (enumeration started from 0).
336
+
337
+
Two files per cache item are stored on disk:
338
+
-`<n>.src` contains full values for build parameters (device information, specialization constant values, build options, device image) which is used to resolve hash collisions and analysis of cached items.
339
+
-`<n>.bin` contains built device code.
340
+
341
+
### Inter-process safety
342
+
343
+
For on-disk cache there might be access collisions for accessing the same file
344
+
from different instances of SYCL applications:
345
+
- write collision happens when 2 instances of the same application are started
346
+
to write to the same cache file/directory;
347
+
- read collision may happen if one application is writing to the file and the
348
+
other instance of the application is trying to read from it while write
349
+
operation is not finished.
350
+
351
+
To avoid collisions the file system entries are locked for read-write access
352
+
until write operation is finished. e.g if new file or directory should be
353
+
created/deleted parent directory is locked, file is created in locked state,
354
+
then the directory and the file are unlocked.
355
+
356
+
To address cases with high lock rate (multiple copies of the SYCL applications
357
+
are run in parallel and use the same cache directory) nested directories
358
+
representing cache keys are used to minimize locks down to applications which
359
+
build the same device with the same parameters. Directory is locked for minimum
360
+
time because it can be unlocked once subdirectory is created. If file is created in
361
+
a directory, the directory should be locked until file creation is done.
362
+
363
+
Advisory locking <sup>[1](#advisory-lock)</sup> is used to ensure that the
364
+
user/OS tools are able to manage files.
365
+
366
+
<aname="advisory-lock">1.</a> Advisory locks work only when a process
367
+
explicitly acquires and releases locks, and are ignored if a process is not aware
368
+
of locks.
369
+
370
+
### Cache eviction mechanism
371
+
372
+
Cache eviction mechanism is required to avoid resources overfloat both for
373
+
memory and disk. The general idea is to delete items following cache size or
374
+
LRU (least recently used) strategy both for in-memory and persistent cache.
375
+
376
+
#### In-memory cache eviction
377
+
378
+
It is initiated on program/kernel maps access/add item operation. When cache
379
+
size exceeds storage threshold the items which are least recently used are
380
+
deleted.
381
+
TODO: add detailed description of in-memory cache eviction mechanism.
382
+
383
+
384
+
#### Persistent cache eviction
385
+
386
+
Persistent cache eviction is going to be applied based on file last access (read/write) date (access
387
+
time). On SYCL application shutdown phase cache eviction process is initiated
388
+
which walks through cache directories as follows:
389
+
- if the file is locked, go to the next file;
390
+
- otherwise check file access time:
391
+
- if file access time is above threshold, delete the file and remove parent
392
+
directory while they are unlocked and empty;
393
+
- otherwise do nothing.
394
+
395
+
## Cache limitations
396
+
397
+
The caching isn't done when:
398
+
- when program is built out of source with `program::build_with_source()` or `program::compile_with_source()` method;
399
+
- when program is a result of linking multiple programs;
400
+
- program is built using interoperability methods.
401
+
402
+
## Points of improvement (things to do)
403
+
404
+
- Employ the same built object for multiple devices of the same ISA,
405
+
capabilities and so on. *NOTE:* It's not really known if it's possible to
406
+
check if two distinct devices are *exactly* the same. Probably this should be
407
+
an improvement request for plugins. By now it is assumed that two devices with the same device id <aname="what-is-did">2</a> are the same.
408
+
- Improve testing: cover real use-cases. See currently covered cases [here](https://github.com/intel/llvm/blob/sycl/sycl/unittests/kernel-and-program/Cache.cpp).
409
+
- Implement tool for exploring cache items (initially it is possible using OS utilities).
0 commit comments