[SYCL] Perform eager initialization on demand #6430

Merged
merged 3 commits into from
Jul 14, 2022
1 change: 1 addition & 0 deletions sycl/doc/EnvironmentVariables.md
@@ -23,6 +23,7 @@ compiler and runtime.
| `SYCL_ENABLE_DEFAULT_CONTEXTS` | '1' or '0' | Enable ('1') or disable ('0') creation of default platform contexts in SYCL runtime. The default context for each platform contains all devices in the platform. Refer to [Platform Default Contexts](extensions/supported/sycl_ext_oneapi_default_context.asciidoc) extension to learn more. Enabled by default on Linux and disabled on Windows. |
| `SYCL_RT_WARNING_LEVEL` | Positive integer | The higher warning level is used the more warnings and performance hints the runtime library may print. Default value is '0', which means no warning/hint messages from the runtime library are allowed. The value '1' enables performance warnings from device runtime/codegen. The values greater than 1 are reserved for future use. |
| `SYCL_USM_HOSTPTR_IMPORT` | Integer | Enable by specifying non-zero value. Buffers created with a host pointer will result in host data promotion to USM, improving data transfer performance. To use this feature, also set SYCL_HOST_UNIFIED_MEMORY=1. |
| `SYCL_EAGER_INIT` | Integer | Enable by specifying non-zero value. Tells the SYCL runtime to do as much as possible initialization at objects construction as opposed to doing lazy initialization on the fly. This may mean doing some redundant work at warmup but ensures fastest possible execution on the following hot and reportable paths. It also instructs PI plugins to do the same. Default is "0". |

`(*) Note: Any means this environment variable is effective when set to any non-null value.`
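The `SYCL_EAGER_INIT` row describes an integer-valued switch that is off unless set to a non-zero value. A minimal sketch of that interpretation follows, modelled on the `std::getenv` / `std::atoi` check this PR adds to `pi_level_zero.cpp`. The helper names `parseIntegerSwitch` and `eagerInitRequested` are hypothetical and do not appear in the PR.

```cpp
#include <cassert> // used by the checks in the usage note
#include <cstdlib>

// Hypothetical helper (not part of the PR): interprets the raw text of an
// integer-valued switch such as SYCL_EAGER_INIT.
//   unset (nullptr) or "0" -> false
//   any other integer      -> true
// std::atoi yields 0 for non-numeric text, so values like "yes" read as off.
inline bool parseIntegerSwitch(const char *Value) {
  return Value ? std::atoi(Value) != 0 : false;
}

// Reads the variable once; later changes to the environment are not observed.
inline bool eagerInitRequested() {
  static const bool Enabled =
      parseIntegerSwitch(std::getenv("SYCL_EAGER_INIT"));
  return Enabled;
}
```

With this sketch, `assert(parseIntegerSwitch("2"))` and `assert(!parseIntegerSwitch("yes"))` both hold, which is worth knowing when setting the variable from a shell script: only numeric, non-zero values turn the feature on.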

87 changes: 71 additions & 16 deletions sycl/plugins/level_zero/pi_level_zero.cpp
@@ -215,6 +215,15 @@ static void zePrint(const char *Format, ...) {
}
}

// Controls if we should choose doing eager initialization
// to make it happen on warmup paths and have the reportable
// paths be less likely affected.
//
static bool doEagerInit = [] {
const char *EagerInit = std::getenv("SYCL_EAGER_INIT");
return EagerInit ? std::atoi(EagerInit) != 0 : false;
}();

// Controls whether device-scope events are used, and how.
static const enum EventsScope {
// All events are created host-visible.
@@ -1230,7 +1239,7 @@ pi_result _pi_context::getAvailableCommandList(
// Each command list is paired with an associated fence to track when the
// command list is available for reuse.
_pi_result pi_result = PI_ERROR_OUT_OF_RESOURCES;
ZeStruct<ze_fence_desc_t> ZeFenceDesc;

// Initially, we need to check if a command list has already been created
// on this device that is available for use. If so, then reuse that
// Level-Zero Command List and Fence for this PI call.
@@ -1270,6 +1279,7 @@ pi_result _pi_context::getAvailableCommandList(
QueueGroupOrdinal = QGroup.getCmdQueueOrdinal(ZeCommandQueue);

ze_fence_handle_t ZeFence;
ZeStruct<ze_fence_desc_t> ZeFenceDesc;
ZE_CALL(zeFenceCreate, (ZeCommandQueue, &ZeFenceDesc, &ZeFence));
CommandList =
Queue->CommandListMap
@@ -1310,15 +1320,28 @@ pi_result _pi_context::getAvailableCommandList(
}
}

// If there are no available command lists nor signalled command lists, then
// we must create another command list.
// Once created, this command list & fence are added to the command list fence
// map.
ze_command_list_handle_t ZeCommandList;
// If there are no available command lists nor signalled command lists,
// then we must create another command list.
pi_result = Queue->createCommandList(UseCopyEngine, CommandList);
CommandList->second.ZeFenceInUse = true;
return pi_result;
}

// Helper function to create a new command-list to this queue and associated
// fence tracking its completion. This command list & fence are added to the
// map of command lists in this queue with ZeFenceInUse = false.
// The caller must hold a lock of the queue already.
pi_result
_pi_queue::createCommandList(bool UseCopyEngine,
pi_command_list_ptr_t &CommandList,
ze_command_queue_handle_t *ForcedCmdQueue) {

ze_fence_handle_t ZeFence;
ZeStruct<ze_fence_desc_t> ZeFenceDesc;
ze_command_list_handle_t ZeCommandList;

auto &QGroup = Queue->getQueueGroup(UseCopyEngine);
uint32_t QueueGroupOrdinal;
auto &QGroup = getQueueGroup(UseCopyEngine);
auto &ZeCommandQueue =
ForcedCmdQueue ? *ForcedCmdQueue : QGroup.getZeQueue(&QueueGroupOrdinal);
if (ForcedCmdQueue)
@@ -1327,19 +1350,16 @@ pi_result _pi_context::getAvailableCommandList(
ZeStruct<ze_command_list_desc_t> ZeCommandListDesc;
ZeCommandListDesc.commandQueueGroupOrdinal = QueueGroupOrdinal;

ZE_CALL(zeCommandListCreate,
(Queue->Context->ZeContext, Queue->Device->ZeDevice,
&ZeCommandListDesc, &ZeCommandList));
ZE_CALL(zeCommandListCreate, (Context->ZeContext, Device->ZeDevice,
&ZeCommandListDesc, &ZeCommandList));

ZE_CALL(zeFenceCreate, (ZeCommandQueue, &ZeFenceDesc, &ZeFence));
std::tie(CommandList, std::ignore) = Queue->CommandListMap.insert(
std::tie(CommandList, std::ignore) = CommandListMap.insert(
std::pair<ze_command_list_handle_t, pi_command_list_info_t>(
ZeCommandList, {ZeFence, true, ZeCommandQueue, QueueGroupOrdinal}));
if (auto Res = Queue->insertActiveBarriers(CommandList, UseCopyEngine))
return Res;
pi_result = PI_SUCCESS;
ZeCommandList, {ZeFence, false, ZeCommandQueue, QueueGroupOrdinal}));

return pi_result;
PI_CALL(insertActiveBarriers(CommandList, UseCopyEngine));
return PI_SUCCESS;
}

void _pi_queue::adjustBatchSizeForFullBatch(bool IsCopy) {
@@ -3396,6 +3416,41 @@ pi_result piQueueCreate(pi_context Context, pi_device Device,
} catch (...) {
return PI_ERROR_UNKNOWN;
}

// Do eager initialization of Level Zero handles on request.
if (doEagerInit) {
pi_queue Q = *Queue;
// Creates said number of command-lists.
auto warmupQueueGroup = [Q](bool UseCopyEngine,
uint32_t RepeatCount) -> pi_result {
pi_command_list_ptr_t CommandList;
while (RepeatCount--) {
if (UseImmediateCommandLists) {
CommandList = Q->getQueueGroup(UseCopyEngine).getImmCmdList();
} else {
// Heuristically create some number of regular command-list to reuse.
for (int I = 0; I < 10; ++I) {
PI_CALL(Q->createCommandList(UseCopyEngine, CommandList));
// Immediately return them to the cache of available command-lists.
std::vector<pi_event> EventsUnused;
PI_CALL(Q->resetCommandList(CommandList, true /* MakeAvailable */,
EventsUnused));
}
}
}
return PI_SUCCESS;
};
// Create as many command-lists as there are queues in the group.
// With this the underlying round-robin logic would initialize all
// native queues, and create command-lists and their fences.
PI_CALL(warmupQueueGroup(false, Q->ComputeQueueGroup.UpperIndex -
Q->ComputeQueueGroup.LowerIndex + 1));
if (Q->useCopyEngine()) {
PI_CALL(warmupQueueGroup(true, Q->CopyQueueGroup.UpperIndex -
Q->CopyQueueGroup.LowerIndex + 1));
}
// TODO: warmup event pools. Both host-visible and device-only.
}
return PI_SUCCESS;
}
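The change above splits command-list creation out of `getAvailableCommandList` so that `piQueueCreate` can pre-create lists when eager initialization is requested. The new helper inserts each list as not in use, and the on-demand path marks it in use afterwards. The trade-off can be sketched with a toy model. Everything below (`ToyQueue`, `ToyCommandList`, `warmup`, `release`, `createCount`) is hypothetical and involves no Level Zero calls; it only illustrates that warmed-up lists are reused instead of being created on the hot path.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a command list; the real plugin wraps Level Zero handles.
struct ToyCommandList {
  int Id;
  bool InUse; // plays a role similar in spirit to ZeFenceInUse
};

class ToyQueue {
public:
  // Creates a new list that is NOT marked in use; the caller decides.
  std::size_t createCommandList() {
    Lists.push_back({NextId++, false});
    ++CreateCount;
    return Lists.size() - 1;
  }

  // Reuses a free list if one exists, otherwise creates one on demand.
  std::size_t getAvailableCommandList() {
    for (std::size_t I = 0; I < Lists.size(); ++I) {
      if (!Lists[I].InUse) {
        Lists[I].InUse = true;
        return I;
      }
    }
    std::size_t I = createCommandList();
    Lists[I].InUse = true;
    return I;
  }

  // Returns a list to the pool of available lists.
  void release(std::size_t I) {
    assert(I < Lists.size());
    Lists[I].InUse = false;
  }

  // Eager warmup: pay the creation cost up front, leaving all lists free.
  void warmup(std::size_t Count) {
    for (std::size_t I = 0; I < Count; ++I)
      createCommandList();
  }

  std::size_t createCount() const { return CreateCount; }

private:
  std::vector<ToyCommandList> Lists;
  int NextId = 0;
  std::size_t CreateCount = 0;
};
```

After `Q.warmup(10)`, subsequent `getAvailableCommandList()` calls leave `createCount()` at 10 until more than ten lists are in use at once, whereas a queue that was not warmed up creates its first list on first use.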

8 changes: 8 additions & 0 deletions sycl/plugins/level_zero/pi_level_zero.hpp
100755 → 100644
@@ -954,6 +954,14 @@ struct _pi_queue : _pi_object {
// For non-copy commands, IsCopy is set to 'false'.
void adjustBatchSizeForPartialBatch(bool IsCopy);

// Helper function to create a new command-list to this queue and associated
// fence tracking its completion. This command list & fence are added to the
// map of command lists in this queue with ZeFenceInUse = false.
// The caller must hold a lock of the queue already.
pi_result
createCommandList(bool UseCopyEngine, pi_command_list_ptr_t &CommandList,
ze_command_queue_handle_t *ForcedCmdQueue = nullptr);

// Resets the Command List and Associated fence in the ZeCommandListFenceMap.
// If the reset command list should be made available, then MakeAvailable
// needs to be set to true. The caller must verify that this command list and
4 changes: 2 additions & 2 deletions sycl/source/detail/plugin.hpp
@@ -55,7 +55,7 @@ struct array_fill_helper<Kind, Idx, T> {

template <PiApiKind Kind, size_t Idx, typename T, typename... Args>
struct array_fill_helper<Kind, Idx, T, Args...> {
static void fill(unsigned char *Dst, const T &&Arg, Args &&... Rest) {
static void fill(unsigned char *Dst, const T &&Arg, Args &&...Rest) {
using ArgsTuple = typename PiApiArgTuple<Kind>::type;
// C-style cast is required here.
auto RealArg = (std::tuple_element_t<Idx, ArgsTuple>)(Arg);
@@ -71,7 +71,7 @@ constexpr size_t totalSize(const std::tuple<Ts...> &) {
}

template <PiApiKind Kind, typename... ArgsT>
auto packCallArguments(ArgsT &&... Args) {
auto packCallArguments(ArgsT &&...Args) {
using ArgsTuple = typename PiApiArgTuple<Kind>::type;

constexpr size_t TotalSize = totalSize(ArgsTuple{});