Skip to content

Commit ed4b4c4

Browse files
authored
[SYCL][Doc] Improvements design for program cache (#3119)
* add support for persistent cache on-disk; * describe cache eviction mechanism; * define support for caching of device code built with build options; * describe configuration parameters for caches; * optimize in-memory cache keys.
1 parent dfaaaed commit ed4b4c4

File tree

1 file changed

+195
-34
lines changed

1 file changed

+195
-34
lines changed

sycl/doc/KernelProgramCache.md

Lines changed: 195 additions & 34 deletions
Original file line numberDiff line numberDiff line change
@@ -68,44 +68,119 @@ times, which may involve JIT compilation and take quite a lot of time.
6868

6969
In order to eliminate this waste of run-time we introduce a kernel and program
7070
caching. The cache is per-context and it caches underlying objects of non
71-
interop kernels and programs which are built with no options.
71+
interop kernels and programs.
72+
73+
*Use-case #3.* Rebuild of all programs on SYCL application restart:
74+
JIT compilation for cases when an application contains huge amount of device
75+
code (big kernels or multiple kernels) may take significant time. The kernels and
76+
programs are rebuilt on every program restart. AOT compilation can be used to
77+
avoid that but it ties application to specific backend runtime version(s) and
78+
predefined HW configuration(s). As a general solution it is reasonable to have
79+
program persistent cache which works between application restarts (e.g. cache
80+
on disk for device code built for specific HW/SW configuration).
7281

7382
<a name="what-is-program">1</a>: Here "program" means an internal SYCL runtime
74-
object corresponding to a SPIRV module or native binary defining a set of SYCL
75-
kernels and/or device functions.
83+
object corresponding to a device code module or native binary defining a set of
84+
SYCL kernels and/or device functions.
7685

7786

7887
## Data structure of cache
7988

89+
The cache is split into two levels:
90+
- in-memory cache which is used during application runtime for device code
91+
which has been already loaded and built for target device.
92+
- persistent (on-disk) cache which is used to store device binaries between
93+
application executions.
94+
95+
### In-memory cache
96+
8097
The cache stores underlying PI objects behind `cl::sycl::program` and
8198
`cl::sycl::kernel` user-level objects in a per-context data storage. The storage
8299
consists of two maps: one is for programs and the other is for kernels.
83100

84-
The programs map's key consists of three components: kernel set id<sup>[1](#what-is-ksid)</sup>,
85-
specialized constants, device this program is built for.
101+
The programs map's key consists of four components:
102+
- kernel set id<sup>[1](#what-is-ksid)</sup>,
103+
- specialization constants values,
104+
- the device this program is built for,
105+
- build options id <sup>[2](#what-is-bopts)</sup>.
86106

87-
The krnels map's key consists of three components too: program the kernel
88-
belongs to, kernel name<sup>[2](#what-is-kname)</sup>, device the program is
89-
built for.
107+
The kernels map's key consists of two components:
108+
- the program the kernel belongs to,
109+
- kernel name<sup>[3](#what-is-kname)</sup>.
90110

91111
<a name="what-is-ksid">1</a>: Kernel set id is an ordinal number of the device
92112
binary image the kernel is contained in.
93113

94-
<a name="what-is-kname">2</a>: Kernel name is a kernel ID mangled class' name
95-
which is provided to methods of `cl::sycl::handler` (e.g. `parallel_for` or
96-
`single_task`).
97-
114+
<a name="what-is-bopts">2</a>: The concatenation of build options (both compile
115+
and link options) set in application or environment variables. There are three
116+
sources of build options that the cache is aware of:
117+
- from device image (pi_device_binary_struct::CompileOptions,
118+
pi_device_binary_struct::LinkOptions);
119+
- environment variables (SYCL_PROGRAM_COMPILE_OPTIONS,
120+
SYCL_PROGRAM_LINK_OPTIONS);
121+
- options passed through SYCL API.
98122

99-
## Points of improvement (things to do)
123+
Note: Backend runtimes used by SYCL can have extra environment or configurations
124+
values (e.g. IGC has [igc_flags.def](https://github.com/intel/intel-graphics-compiler/blob/7f91dd6b9f2ca9c1a8ffddd04fa86461311c4271/IGC/common/igc_flags.def) which affect JIT process). Changing such
125+
configuration will invalidate cache and manual cache cleanup should be done.
100126

101-
- Implement LRU policy on cached objects. See [issue](https://github.com/intel/llvm/issues/2517).
102-
- Allow for caching of objects built with some build options.
103-
- Employ the same built object for multiple devices of the same ISA,
104-
capabilities and so on. *NOTE:* It's not really known if it's possible to
105-
check if two distinct devices are *exactly* the same. Probably this should be
106-
an improvement request for plugins.
107-
- Improve testing: cover real use-cases. See currently covered cases [here](https://github.com/intel/llvm/blob/sycl/sycl/unittests/kernel-and-program/Cache.cpp).
127+
<a name="what-is-kname">3</a>: Kernel name is a kernel ID mangled class' name
128+
which is provided to methods of `cl::sycl::handler` (e.g. `parallel_for` or
129+
`single_task`).
108130

131+
### Persistent cache
132+
133+
The cache works behind in-memory cache and stores the same underlying PI
134+
object behind `cl::sycl::program` user-level objects in a per-context data
135+
storage.
136+
The storage is organized as a map for storing device code image. It uses
137+
different keys to address difference in SYCL objects ids between applications
138+
runs as well as the fact that the same kernel name can be used in different
139+
SYCL applications.
140+
141+
The programs map's key consists of four components:
142+
- device image id<sup>[1](#what-is-diid)</sup>,
143+
- specialization constants values,
144+
- device id<sup>[2](#what-is-did)</sup> this program is built for,
145+
- build options id<sup>[3](#what-is-bopts)</sup>.
146+
147+
Hashes are used for fast built device image identification and shorten
148+
filenames on disk. Once cache directory on disc is identified (see
149+
[Persistent cache storage structure](#persistent-cache-storage-structure)
150+
for detailed directory structure) full key values are compared with the ones
151+
stored on disk (in every <n>.src file located in the cache item directory):
152+
- if they match the built image is loaded from correspoding <n>.bin file;
153+
- otherwise image build is done and new cache item is created on disk
154+
containing 2 files: <max_n+1>.src for key values and <max_n+1>.bin for
155+
built image.
156+
157+
<a name="what-is-diid">1</a>: Hash out of the device code image used as input for the build.
158+
159+
<a name="what-is-did">2</a>: Hash out of the string which is concatenation of values for
160+
`info::platform::name`, `info::device::name`, `info::device::version`,
161+
`info::device::driver_version` parameters to differentiate different HW and SW
162+
installed on the same host as well as SW/HW upgrades.
163+
164+
<a name="what-is-bopts">3</a>: Hash for the concatenation of build options (both
165+
compile and link options) set in application or environment variables. There are
166+
three sources of build options:
167+
- from device image (pi_device_binary_struct::CompileOptions,
168+
pi_device_binary_struct::LinkOptions);
169+
- environment variables (SYCL_PROGRAM_COMPILE_OPTIONS,
170+
SYCL_PROGRAM_LINK_OPTIONS);
171+
- options passed through SYCL API.
172+
173+
## Cache configuration
174+
175+
There is set of configuration parameters which can be set as environment variables or parameters in `sycl.conf` and affect cache behavior:
176+
| Environment variable | Values | Description |
177+
| -------------------- | ------ | ----------- |
178+
| `SYCL_CACHE_DIR`| Path | Path to persistent cache root directory. Default values are `%AppData%\Intel\sycl_program_cache` for Windows and `$HOME/intel/sycl_program_cache` on Linux. |
179+
| `SYCL_CACHE_ENABLED` | ON, OFF | Switches persistent cache switch on or off. Default value is ON. |
180+
| `SYCL_CACHE_MAX_SIZE` | Positive integer | Cache eviction is triggered once total size of cached images exceeds the value in megabytes (default - 8 192 for 8 GB). Set to 0 to disable size-based cache eviction. |
181+
| `SYCL_CACHE_THRESHOLD` | Positive integer | Cache eviction threshold in days (default value is 7 for 1 week). Set to 0 for disabling time-based cache eviction. |
182+
| `SYCL_CACHE_MIN_DEVICE_IMAGE_SIZE` | Positive integer | Minimum size of device code image in kilobytes which is reasonable to cache on disk because disk access operation may take more time than do JIT compilation for it. Default value is 0 to cache all images. |
183+
| `SYCL_CACHE_MAX_DEVICE_IMAGE_SIZE` | Positive integer | Maximum size of device image in megabytes which is cached. Too big kernels may overload disk too fast. Default value is 0 to cache all images. |
109184

110185
## Implementation details
111186

@@ -117,24 +192,21 @@ representing a context).
117192

118193
The `KernelProgramCache` is essentially a pair of maps as described above.
119194

120-
121195
### When does the cache come at work?
122196

123-
The cache is used when one submits a kernel for execution or builds program or
124-
with SYCL API. That means that the cache works when either user explicitly calls
197+
The cache is used when one submits a kernel for execution or builds program with
198+
SYCL API. That means that the cache works when either user explicitly calls
125199
`program::build_with_kernel_type<>()`/`program::get_kernel<>()` methods or SYCL
126200
RT builds a program or gets the required kernel as needed during application
127201
execution. Cacheability of an object can be tested with
128202
`program_impl::is_cacheable()` method. SYCL RT will only try to insert cacheable
129203
programs or kernels into the cache. This is done as a part of
130204
`ProgramManager::getOrCreateKernel()` method.
131205

132-
133206
*NOTE:* a kernel is only cacheable if and only if the program it belongs to is
134207
cacheable. On the other hand if the program is cacheable, then each and every
135208
kernel of this program will be cached also.
136209

137-
138210
All requests to build a program or to create a kernel - whether they originate
139211
from explicit user API calls or from internal SYCL runtime execution logic - end
140212
up with calling the function [`getOrBuild()`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/program_manager/program_manager.cpp#L149)
@@ -154,12 +226,6 @@ instance of cache. We will see rationale behind it a bit later.
154226

155227
*Build* function actually builds the kernel or program.
156228

157-
Caching isn't done:
158-
- when program is built out of source with `program::build_with_source()` or
159-
`program::compile_with_source()` method;
160-
- when program is a result of linking multiple programs.
161-
162-
163229
### Thread-safety
164230

165231
Why do we need thread safety here? It is quite possible to have a use-case when
@@ -210,10 +276,14 @@ A specialization of helper class [Locked](https://github.com/intel/llvm/blob/syc
210276
for reference of proper mapping is returned by Acquire function. The use of this
211277
class implements RAII to make code look cleaner a bit. Now, GetCache function
212278
will return the mapping to be employed that includes the 3 components: kernel
213-
name, device as well as any specialization constants. These get added to
279+
name, device as well as any specialization constants values. These get added to
214280
`BuildResult` and are cached. The `BuildResult` structure is specialized with
215281
either `PiKernel` or `PiProgram`<sup>[1](#remove-pointer)</sup>.
216282

283+
### Hash function
284+
285+
STL hash function specialized for std::string is going to be used:
286+
`template<> struct hash<std::string>`
217287

218288
### Core of caching mechanism
219289

@@ -236,13 +306,104 @@ thread will try to build the same object once more.
236306

237307
`BuildResult` structure also contains synchronization objects: mutex and
238308
condition variable. We employ them to signal waiting threads that the build
239-
process for this kernl/program is finished (either successfuly or with a
309+
process for this kernel/program is finished (either successfully or with a
240310
failure).
241311

242-
243312
<a name="remove-pointer">1</a>: The use of `std::remove_pointer` was omitted for
244313
the sake of simplicity here.
245314

246315
<a name="exception-data">2</a>: Actually, we store contents of the exception:
247316
its message and error code.
248317

318+
### Persistent cache storage structure
319+
320+
The device code image are stored on file system using structure below:
321+
```
322+
<cache_root>/
323+
<device_hash>/
324+
<device_image_hash>/
325+
<spec_constants_values_hash>/
326+
<build_options_hash>/
327+
<n>.src
328+
<n>.bin
329+
```
330+
- `<cache_root>` - root directory storing cache files;
331+
- `<device_hash>` - hash out of device information used to identify target device;
332+
- `<device_image_hash>` - hash made out of device image used as input for the JIT compilation;
333+
- `<spec_constants_values_hash>` - hash for specialization constants values;
334+
- `<build_options_hash>` - hash for all build options;
335+
- `<n>` - sequential number of hash collisions. When hashes matches for the specific build but full values don't, new cache item is added with incremented value (enumeration started from 0).
336+
337+
Two files per cache item are stored on disk:
338+
- `<n>.src` contains full values for build parameters (device information, specialization constant values, build options, device image) which is used to resolve hash collisions and analysis of cached items.
339+
- `<n>.bin` contains built device code.
340+
341+
### Inter-process safety
342+
343+
For on-disk cache there might be access collisions for accessing the same file
344+
from different instances of SYCL applications:
345+
- write collision happens when 2 instances of the same application are started
346+
to write to the same cache file/directory;
347+
- read collision may happen if one application is writing to the file and the
348+
other instance of the application is trying to read from it while write
349+
operation is not finished.
350+
351+
To avoid collisions the file system entries are locked for read-write access
352+
until write operation is finished. e.g if new file or directory should be
353+
created/deleted parent directory is locked, file is created in locked state,
354+
then the directory and the file are unlocked.
355+
356+
To address cases with high lock rate (multiple copies of the SYCL applications
357+
are run in parallel and use the same cache directory) nested directories
358+
representing cache keys are used to minimize locks down to applications which
359+
build the same device with the same parameters. Directory is locked for minimum
360+
time because it can be unlocked once subdirectory is created. If file is created in
361+
a directory, the directory should be locked until file creation is done.
362+
363+
Advisory locking <sup>[1](#advisory-lock)</sup> is used to ensure that the
364+
user/OS tools are able to manage files.
365+
366+
<a name="advisory-lock">1.</a> Advisory locks work only when a process
367+
explicitly acquires and releases locks, and are ignored if a process is not aware
368+
of locks.
369+
370+
### Cache eviction mechanism
371+
372+
Cache eviction mechanism is required to avoid resources overfloat both for
373+
memory and disk. The general idea is to delete items following cache size or
374+
LRU (least recently used) strategy both for in-memory and persistent cache.
375+
376+
#### In-memory cache eviction
377+
378+
It is initiated on program/kernel maps access/add item operation. When cache
379+
size exceeds storage threshold the items which are least recently used are
380+
deleted.
381+
TODO: add detailed description of in-memory cache eviction mechanism.
382+
383+
384+
#### Persistent cache eviction
385+
386+
Persistent cache eviction is going to be applied based on file last access (read/write) date (access
387+
time). On SYCL application shutdown phase cache eviction process is initiated
388+
which walks through cache directories as follows:
389+
- if the file is locked, go to the next file;
390+
- otherwise check file access time:
391+
- if file access time is above threshold, delete the file and remove parent
392+
directory while they are unlocked and empty;
393+
- otherwise do nothing.
394+
395+
## Cache limitations
396+
397+
The caching isn't done when:
398+
- when program is built out of source with `program::build_with_source()` or `program::compile_with_source()` method;
399+
- when program is a result of linking multiple programs;
400+
- program is built using interoperability methods.
401+
402+
## Points of improvement (things to do)
403+
404+
- Employ the same built object for multiple devices of the same ISA,
405+
capabilities and so on. *NOTE:* It's not really known if it's possible to
406+
check if two distinct devices are *exactly* the same. Probably this should be
407+
an improvement request for plugins. By now it is assumed that two devices with the same device id <a name="what-is-did">2</a> are the same.
408+
- Improve testing: cover real use-cases. See currently covered cases [here](https://github.com/intel/llvm/blob/sycl/sycl/unittests/kernel-and-program/Cache.cpp).
409+
- Implement tool for exploring cache items (initially it is possible using OS utilities).

0 commit comments

Comments
 (0)