Skip to content

[SYCL] Add environment variable to disable in-memory program caching #11751

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Nov 17, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions sycl/doc/EnvironmentVariables.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@ compiler and runtime.
| `SYCL_CACHE_DIR` | Path | Path to persistent cache root directory. Default values are `%AppData%\libsycl_cache` for Windows and `$XDG_CACHE_HOME/libsycl_cache` on Linux, if `XDG_CACHE_HOME` is not set then `$HOME/.cache/libsycl_cache`. When none of the environment variables are set SYCL persistent cache is disabled. |
| `SYCL_CACHE_DISABLE_PERSISTENT (deprecated)` | Any(\*) | Has no effect. |
| `SYCL_CACHE_PERSISTENT` | Integer | Controls persistent device compiled code cache. Turns it on if set to '1' and turns it off if set to '0'. When cache is enabled SYCL runtime will try to cache and reuse JIT-compiled binaries. Default is off. |
| `SYCL_CACHE_IN_MEM` | '1' or '0' | Enable ('1') or disable ('0') in-memory caching of device compiled code. When cache is enabled SYCL runtime will try to cache and reuse JIT-compiled binaries. Default is '1'. |
| `SYCL_CACHE_EVICTION_DISABLE` | Any(\*) | Switches cache eviction off when the variable is set. |
| `SYCL_CACHE_MAX_SIZE` | Positive integer | Cache eviction is triggered once total size of cached images exceeds the value in megabytes (default - 8 192 for 8 GB). Set to 0 to disable size-based cache eviction. |
| `SYCL_CACHE_THRESHOLD` | Positive integer | Cache eviction threshold in days (default value is 7 for 1 week). Set to 0 for disabling time-based cache eviction. |
Expand Down
1 change: 1 addition & 0 deletions sycl/source/detail/config.def
Original file line number Diff line number Diff line change
Expand Up @@ -40,3 +40,4 @@ CONFIG(SYCL_RT_WARNING_LEVEL, 4, __SYCL_RT_WARNING_LEVEL)
CONFIG(SYCL_REDUCTION_PREFERRED_WORKGROUP_SIZE, 16, __SYCL_REDUCTION_PREFERRED_WORKGROUP_SIZE)
CONFIG(ONEAPI_DEVICE_SELECTOR, 1024, __ONEAPI_DEVICE_SELECTOR)
CONFIG(SYCL_ENABLE_FUSION_CACHING, 1, __SYCL_ENABLE_FUSION_CACHING)
CONFIG(SYCL_CACHE_IN_MEM, 1, __SYCL_CACHE_IN_MEM)
28 changes: 28 additions & 0 deletions sycl/source/detail/config.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -614,6 +614,34 @@ template <> class SYCLConfig<SYCL_ENABLE_FUSION_CACHING> {
}
};

template <> class SYCLConfig<SYCL_CACHE_IN_MEM> {
using BaseT = SYCLConfigBase<SYCL_CACHE_IN_MEM>;

public:
static constexpr bool Default = true; // default is true
static bool get() { return getCachedValue(); }
static const char *getName() { return BaseT::MConfigName; }

private:
static bool parseValue() {
const char *ValStr = BaseT::getRawValue();
if (!ValStr)
return Default;
if (strlen(ValStr) != 1 || (ValStr[0] != '0' && ValStr[0] != '1')) {
std::string Msg =
std::string{"Invalid value for bool configuration variable "} +
getName() + std::string{": "} + ValStr;
throw runtime_error(Msg, PI_ERROR_INVALID_OPERATION);
}
return ValStr[0] == '1';
}

static bool getCachedValue() {
static bool Val = parseValue();
return Val;
}
};

#undef INVALID_CONFIG_EXCEPTION

} // namespace detail
Expand Down
6 changes: 0 additions & 6 deletions sycl/source/detail/kernel_impl.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -66,12 +66,6 @@ kernel_impl::kernel_impl(sycl::detail::pi::PiKernel Kernel,
MCreatedFromSource(false), MDeviceImageImpl(std::move(DeviceImageImpl)),
MKernelBundleImpl(std::move(KernelBundleImpl)),
MKernelArgMaskPtr{ArgMask} {

// kernel_impl shared ownership of kernel handle
if (!is_host()) {
getPlugin()->call<PiApiKind::piKernelRetain>(MKernel);
}

MIsInterop = MKernelBundleImpl->isInterop();
}

Expand Down
66 changes: 63 additions & 3 deletions sycl/source/detail/program_manager/program_manager.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -699,11 +699,21 @@ sycl::detail::pi::PiProgram ProgramManager::getBuiltPIProgram(
return Cache.getOrInsertProgram(CacheKey);
};

if (!SYCLConfig<SYCL_CACHE_IN_MEM>::get())
return BuildF();

auto BuildResult =
getOrBuild<sycl::detail::pi::PiProgram, compile_program_error>(
Cache, GetCachedBuildF, BuildF);
// getOrBuild is not supposed to return nullptr
assert(BuildResult != nullptr && "Invalid build result");

// If caching is enabled, one copy of the program handle will be
// stored in the cache, and one handle is returned to the
// caller. In that case, we need to increase the ref count of the
// program.
ContextImpl->getPlugin()->call<PiApiKind::piProgramRetain>(
*BuildResult->Ptr.load());
return *BuildResult->Ptr.load();
}

Expand All @@ -728,9 +738,20 @@ ProgramManager::getOrCreateKernel(const ContextImplPtr &ContextImpl,

auto key = std::make_tuple(std::move(SpecConsts), PiDevice,
CompileOpts + LinkOpts, KernelName);
auto ret_tuple = Cache.tryToGetKernelFast(key);
if (std::get<0>(ret_tuple))
return ret_tuple;
if (SYCLConfig<SYCL_CACHE_IN_MEM>::get()) {
auto ret_tuple = Cache.tryToGetKernelFast(key);
constexpr size_t Kernel = 0; // see KernelFastCacheValT tuple
constexpr size_t Program = 3; // see KernelFastCacheValT tuple
if (std::get<Kernel>(ret_tuple)) {
// Pulling a copy of a kernel and program from the cache,
// so we need to retain those resources.
ContextImpl->getPlugin()->call<PiApiKind::piKernelRetain>(
std::get<Kernel>(ret_tuple));
ContextImpl->getPlugin()->call<PiApiKind::piProgramRetain>(
std::get<Program>(ret_tuple));
return ret_tuple;
}
}

sycl::detail::pi::PiProgram Program =
getBuiltPIProgram(ContextImpl, DeviceImpl, KernelName);
Expand All @@ -757,6 +778,14 @@ ProgramManager::getOrCreateKernel(const ContextImplPtr &ContextImpl,
return Cache.getOrInsertKernel(Program, KernelName);
};

if (!SYCLConfig<SYCL_CACHE_IN_MEM>::get()) {
// The built kernel cannot be shared between multiple
// threads when caching is disabled, so we can return
// nullptr for the mutex.
auto [Kernel, ArgMask] = BuildF();
return make_tuple(Kernel, nullptr, ArgMask, Program);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this logic be moved to getOrBuild to avoid having to do it before each call to it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I think it'd probably be better suited for another PR. There are really two types of uses of getOrBuild, one with kernels and one with programs. The returns types differ between the two case, so it'd be a little awkward to insert this special logic for the program case in there. Alternatively, one could make a getOrBuildProgram and getOrBuildKernel program and then put that special logic in the getOrBuildKernel function, but there would still be a oddity with the return type: when caching, we wrap the value in a BuildResult and return a pointer to that (owned by the cache), but when are not caching, we bypass the BuildResult object, and would only want to return the value. This can still be resolved by further modifying these getOrBuild functions to return only the values we extract from the BuildResult anyways, but I believe this'll most likely create a lot of changes unrelated to the original goal of PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I remember of the code, it is due for a bit of an overhaul anyway. 👍


auto BuildResult = getOrBuild<KernelArgMaskPairT, invalid_object_error>(
Cache, GetCachedBuildF, BuildF);
// getOrBuild is not supposed to return nullptr
Expand All @@ -765,6 +794,12 @@ ProgramManager::getOrCreateKernel(const ContextImplPtr &ContextImpl,
auto ret_val = std::make_tuple(KernelArgMaskPair.first,
&(BuildResult->MBuildResultMutex),
KernelArgMaskPair.second, Program);
// If caching is enabled, one copy of the kernel handle will be
// stored in the cache, and one handle is returned to the
// caller. In that case, we need to increase the ref count of the
// kernel.
ContextImpl->getPlugin()->call<PiApiKind::piKernelRetain>(
KernelArgMaskPair.first);
Cache.saveKernel(key, ret_val);
return ret_val;
}
Expand Down Expand Up @@ -2297,6 +2332,17 @@ device_image_plain ProgramManager::build(const device_image_plain &DeviceImage,
return BuiltProgram.release();
};

if (!SYCLConfig<SYCL_CACHE_IN_MEM>::get()) {
auto ResProgram = BuildF();
DeviceImageImplPtr ExecImpl = std::make_shared<detail::device_image_impl>(
InputImpl->get_bin_image_ref(), Context, Devs, bundle_state::executable,
InputImpl->get_kernel_ids_ptr(), ResProgram,
InputImpl->get_spec_const_data_ref(),
InputImpl->get_spec_const_blob_ref());

return createSyclObjFromImpl<device_image_plain>(ExecImpl);
}

uint32_t ImgId = Img.getImageID();
const sycl::detail::pi::PiDevice PiDevice =
getRawSyclObjImpl(Devs[0])->getHandleRef();
Expand Down Expand Up @@ -2389,11 +2435,25 @@ ProgramManager::getOrCreateKernel(const context &Context,
return Cache.getOrInsertKernel(Program, KernelName);
};

if (!SYCLConfig<SYCL_CACHE_IN_MEM>::get()) {
// The built kernel cannot be shared between multiple
// threads when caching is disabled, so we can return
// nullptr for the mutex.
auto [Kernel, ArgMask] = BuildF();
return make_tuple(Kernel, nullptr, ArgMask);
}

auto BuildResult =
getOrBuild<KernelProgramCache::KernelArgMaskPairT, invalid_object_error>(
Cache, GetCachedBuildF, BuildF);
// getOrBuild is not supposed to return nullptr
assert(BuildResult != nullptr && "Invalid build result");
// If caching is enabled, one copy of the kernel handle will be
// stored in the cache, and one handle is returned to the
// caller. In that case, we need to increase the ref count of the
// kernel.
Ctx->getPlugin()->call<PiApiKind::piKernelRetain>(
BuildResult->Ptr.load()->first);
return std::make_tuple(BuildResult->Ptr.load()->first,
&(BuildResult->MBuildResultMutex),
BuildResult->Ptr.load()->second);
Expand Down
48 changes: 35 additions & 13 deletions sycl/source/detail/scheduler/commands.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -2403,8 +2403,8 @@ pi_int32 enqueueImpCommandBufferKernel(
auto ContextImpl = sycl::detail::getSyclObjImpl(Ctx);
const sycl::detail::PluginPtr &Plugin = ContextImpl->getPlugin();
pi_kernel PiKernel = nullptr;
std::mutex *KernelMutex = nullptr;
pi_program PiProgram = nullptr;
std::shared_ptr<kernel_impl> SyclKernelImpl = nullptr;
std::shared_ptr<device_image_impl> DeviceImageImpl = nullptr;

auto Kernel = CommandGroup.MSyclKernel;
Expand All @@ -2417,7 +2417,6 @@ pi_int32 enqueueImpCommandBufferKernel(
// and can therefore not be looked up, but since they are self-contained
// they can simply be launched directly.
if (KernelBundleImplPtr && !KernelBundleImplPtr->isInterop()) {
std::shared_ptr<kernel_impl> SyclKernelImpl;
auto KernelName = CommandGroup.MKernelName;
kernel_id KernelID =
detail::ProgramManager::getInstance().getSYCLKernelID(KernelName);
Expand All @@ -2427,14 +2426,14 @@ pi_int32 enqueueImpCommandBufferKernel(
PiKernel = SyclKernelImpl->getHandleRef();
DeviceImageImpl = SyclKernelImpl->getDeviceImage();
PiProgram = DeviceImageImpl->get_program_ref();
std::tie(PiKernel, KernelMutex, EliminatedArgMask) =
detail::ProgramManager::getInstance().getOrCreateKernel(
KernelBundleImplPtr->get_context(), KernelName,
/*PropList=*/{}, PiProgram);
EliminatedArgMask = SyclKernelImpl->getKernelArgMask();
} else if (Kernel != nullptr) {
PiKernel = Kernel->getHandleRef();
auto SyclProg = Kernel->getProgramImpl();
PiProgram = SyclProg->getHandleRef();
EliminatedArgMask = Kernel->getKernelArgMask();
} else {
std::tie(PiKernel, KernelMutex, EliminatedArgMask, PiProgram) =
std::tie(PiKernel, std::ignore, EliminatedArgMask, PiProgram) =
sycl::detail::ProgramManager::getInstance().getOrCreateKernel(
ContextImpl, DeviceImpl, CommandGroup.MKernelName);
}
Expand Down Expand Up @@ -2483,6 +2482,11 @@ pi_int32 enqueueImpCommandBufferKernel(
&NDRDesc.GlobalSize[0], LocalSize, SyncPoints.size(),
SyncPoints.size() ? SyncPoints.data() : nullptr, OutSyncPoint);

if (!SyclKernelImpl && !Kernel) {
Plugin->call<PiApiKind::piKernelRelease>(PiKernel);
Plugin->call<PiApiKind::piProgramRelease>(PiProgram);
}

if (Res != pi_result::PI_SUCCESS) {
throw sycl::exception(errc::invalid,
"Failed to add kernel to PI command-buffer");
Expand Down Expand Up @@ -2530,10 +2534,19 @@ pi_int32 enqueueImpKernel(

Program = DeviceImageImpl->get_program_ref();

std::tie(Kernel, KernelMutex, EliminatedArgMask) =
detail::ProgramManager::getInstance().getOrCreateKernel(
KernelBundleImplPtr->get_context(), KernelName,
/*PropList=*/{}, Program);
EliminatedArgMask = SyclKernelImpl->getKernelArgMask();
// When caching is enabled, kernel objects can be shared,
// so we need to retrieve the mutex associated to it via
// getOrCreateKernel
if (SYCLConfig<SYCL_CACHE_IN_MEM>::get()) {
auto [CachedKernel, CachedKernelMutex, CachedEliminatedArgMask] =
detail::ProgramManager::getInstance().getOrCreateKernel(
KernelBundleImplPtr->get_context(), KernelName,
/*PropList=*/{}, Program);
assert(CachedKernel == Kernel);
assert(CachedEliminatedArgMask == EliminatedArgMask);
KernelMutex = CachedKernelMutex;
}
} else if (nullptr != MSyclKernel) {
assert(MSyclKernel->get_info<info::kernel::context>() ==
Queue->get_context());
Expand Down Expand Up @@ -2574,8 +2587,11 @@ pi_int32 enqueueImpKernel(

pi_result Error = PI_SUCCESS;
{
assert(KernelMutex);
std::lock_guard<std::mutex> Lock(*KernelMutex);
// When KernelMutex is null, this means that in-memory caching is
// disabled, which means that kernel object is not shared, so no locking
// is necessary.
using LockT = std::unique_lock<std::mutex>;
auto Lock = KernelMutex ? LockT(*KernelMutex) : LockT();

// Set SLM/Cache configuration for the kernel if non-default value is
// provided.
Expand All @@ -2590,6 +2606,12 @@ pi_int32 enqueueImpKernel(
Error = SetKernelParamsAndLaunch(Queue, Args, DeviceImageImpl, Kernel,
NDRDesc, EventsWaitList, OutEventImpl,
EliminatedArgMask, getMemAllocationFunc);

const PluginPtr &Plugin = Queue->getPlugin();
if (!SyclKernelImpl && !MSyclKernel) {
Plugin->call<PiApiKind::piKernelRelease>(Kernel);
Plugin->call<PiApiKind::piProgramRelease>(Program);
}
}
if (PI_SUCCESS != Error) {
// If we have got non-success error code, let's analyze it to emit nice
Expand Down
87 changes: 87 additions & 0 deletions sycl/test-e2e/KernelAndProgram/disable-caching.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
// This test ensures created program/kernels are not retained
// if and only if caching is disabled.

// RUN: %{build} -o %t.out
// RUN: env SYCL_PI_TRACE=-1 SYCL_CACHE_IN_MEM=0 %{run} %t.out \
// RUN: | FileCheck %s
// RUN: env SYCL_PI_TRACE=-1 %{run} %t.out \
// RUN: | FileCheck %s --check-prefixes=CHECK-CACHE
#include <sycl/sycl.hpp>

using namespace sycl;

constexpr specialization_id<int> spec_id;

int main() {
queue q;
// CHECK: piProgramCreate
// CHECK-NOT: piProgramRetain
// CHECK: piKernelCreate
// CHECK-NOT: piKernelRetain
// CHECK: piEnqueueKernelLaunch
// CHECK: piKernelRelease
// CHECK: piProgramRelease
// CHECK: piEventsWait

// CHECK-CACHE: piProgramCreate
// CHECK-CACHE: piProgramRetain
// CHECK-CACHE: piKernelCreate
// CHECK-CACHE: piKernelRetain
// CHECK-CACHE: piEnqueueKernelLaunch
// CHECK-CACHE: piKernelRelease
// CHECK-CACHE: piProgramRelease
// CHECK-CACHE: piEventsWait
q.single_task([] {}).wait();

// CHECK: piProgramCreate
// CHECK-NOT: piProgramRetain
// CHECK: piKernelCreate
// CHECK-NOT: piKernelRetain
// CHECK: piEnqueueKernelLaunch
// CHECK: piKernelRelease
// CHECK: piProgramRelease
// CHECK: piEventsWait

// CHECK-CACHE: piProgramCreate
// CHECK-CACHE: piProgramRetain
// CHECK-CACHE: piKernelCreate
// CHECK-CACHE: piKernelRetain
// CHECK-CACHE: piEnqueueKernelLaunch
// CHECK-CACHE: piKernelRelease
// CHECK-CACHE: piProgramRelease
// CHECK-CACHE: piEventsWait

// CHECK: piProgramCreate
// CHECK-NOT: piProgramRetain
// CHECK: piKernelCreate
// CHECK-NOT: piKernelRetain
// CHECK: piEnqueueKernelLaunch
// CHECK: piKernelRelease
// CHECK: piProgramRelease
// CHECK: piEventsWait

// CHECK-CACHE: piProgramCreate
// CHECK-CACHE: piProgramRetain
// CHECK-CACHE: piKernelCreate
// CHECK-CACHE: piKernelRetain
// CHECK-CACHE: piEnqueueKernelLaunch
// CHECK-CACHE: piKernelRelease
// CHECK-CACHE: piProgramRelease
// CHECK-CACHE: piEventsWait
auto *p = malloc_shared<int>(1, q);
for (int i = 0; i < 2; ++i)
q.submit([&](handler &cgh) {
cgh.set_specialization_constant<spec_id>(i);
cgh.parallel_for(1, [=](auto, kernel_handler kh) {
*p = kh.get_specialization_constant<spec_id>();
});
}).wait();

free(p, q);
}

// (Program cache releases)
// CHECK-CACHE: piKernelRelease
// CHECK-CACHE: piProgramRelease
// CHECK-CACHE: piKernelRelease
// CHECK-CACHE: piProgramRelease
2 changes: 2 additions & 0 deletions sycl/test-e2e/XPTI/basic_event_collection_linux.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,8 @@
// CHECK: PI Call Begin : piextKernelSetArgPointer
// CHECK-NEXT: PI Call Begin : piKernelGetGroupInfo
// CHECK-NEXT: PI Call Begin : piEnqueueKernelLaunch
// CHECK-NEXT: PI Call Begin : piKernelRelease
// CHECK-NEXT: PI Call Begin : piProgramRelease
// CHECK-NEXT: Signal
// CHECK-DAG: sym_line_no : {{.*}}
// CHECK-DAG: sym_source_file_name : {{.*}}
Expand Down
4 changes: 2 additions & 2 deletions sycl/unittests/SYCL2020/GetNativeOpenCL.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -123,7 +123,7 @@ TEST(GetNative, GetNativeHandle) {
get_native<backend::opencl>(Buffer);

// Depending on global caches state, piDeviceRetain is called either once or
// twice, so there'll be 5 or 6 calls.
ASSERT_EQ(TestCounter, 5 + DeviceRetainCounter - 1)
// twice, so there'll be 6 or 7 calls.
ASSERT_EQ(TestCounter, 6 + DeviceRetainCounter - 1)
<< "Not all the retain methods were called";
}
Loading