[SYCL][L0] optimize re-use of command-lists #8870

Merged: 3 commits, Mar 31, 2023
1 change: 1 addition & 0 deletions sycl/doc/EnvironmentVariables.md
100644 → 100755
@@ -257,6 +257,7 @@ variables in production code.</span>
| `SYCL_PI_LEVEL_ZERO_SINGLE_ROOT_DEVICE_BUFFER_MIGRATION` | Integer | When set to "0", a single root-device allocation is used for all devices in a context where all devices have the same root device. Otherwise regular buffer migration is performed. Default is 1. |
| `SYCL_PI_LEVEL_ZERO_REUSE_DISCARDED_EVENTS` | Integer | When set to a positive value, enables a mode in which discarded Level Zero events are reset and reused within the same in-order queue, based on the dependency chain between commands. Default is 1. |
| `SYCL_PI_LEVEL_ZERO_EXPOSE_CSLICE_IN_AFFINITY_PARTITIONING` (Deprecated) | Integer | When set to a non-zero value, exposes compute slices as sub-sub-devices in the `sycl::info::partition_property::partition_by_affinity_domain` partitioning scheme. Default is zero, meaning they are only exposed when partitioning by `sycl::info::partition_property::ext_intel_partition_by_cslice`. This option is introduced for compatibility reasons and is immediately deprecated; new code must not rely on this behavior. Also note that even if a sub-sub-device was created using `partition_by_affinity_domain`, it is still reported as created via partitioning by compute slices. |
| `SYCL_PI_LEVEL_ZERO_COMMANDLISTS_CLEANUP_THRESHOLD` | Integer | If non-negative, the threshold is set to this value; if negative, it is set to INT_MAX. Whenever the number of command lists in a queue exceeds this threshold, an attempt is made to clean up completed command lists for subsequent reuse. The default is 20. |
| `SYCL_PI_LEVEL_ZERO_IMMEDIATE_COMMANDLISTS_EVENT_CLEANUP_THRESHOLD` | Integer | If non-negative, the threshold is set to this value; if negative, it is set to INT_MAX. Whenever the number of events associated with an immediate command list exceeds this threshold, a check is made for signaled events, and those events are recycled. Setting this threshold low causes events to be checked more often, so events that are no longer needed are recycled sooner; however, more frequent event status checks may cost time. The default is 20. |
| `SYCL_PI_LEVEL_ZERO_USM_RESIDENT` | Integer | Controls if/where USM allocations are made resident at the time of allocation. If set to 0 (default), no special residency is forced. If set to 1, a device or shared allocation is made resident on the device of allocation. If set to 2, a device or shared allocation is made resident on all devices in the context of allocation that have P2P access to the device of allocation. For host allocations, any non-zero setting forces the allocation to be resident on all devices in the context. |
| `SYCL_PI_LEVEL_ZERO_USE_NATIVE_USM_MEMCPY2D` | Integer | When set to a positive value enables the use of Level Zero USM 2D memory copy operations. Default is 0. |
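The threshold-style variables above share one parsing convention: unset means the documented default, and a negative value effectively disables the limit. A minimal sketch of that convention, assuming a hypothetical `ParseThreshold` helper (the plugin inlines this logic in lambdas instead):

```cpp
#include <cassert>
#include <climits>
#include <cstdlib>

// Parse a threshold-style environment variable value:
//  - unset (nullptr) -> use the documented default
//  - negative        -> effectively disable the limit (INT_MAX)
//  - otherwise       -> the parsed value
static int ParseThreshold(const char *EnvStr, int Default) {
  if (!EnvStr)
    return Default;
  int Threshold = std::atoi(EnvStr);
  return Threshold < 0 ? INT_MAX : Threshold;
}

// Example: mirrors SYCL_PI_LEVEL_ZERO_COMMANDLISTS_CLEANUP_THRESHOLD (default 20).
static const int CmdListsCleanupThreshold = ParseThreshold(
    std::getenv("SYCL_PI_LEVEL_ZERO_COMMANDLISTS_CLEANUP_THRESHOLD"), 20);
```

The same shape applies to the immediate-command-list event cleanup threshold, with its own variable name and default.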
87 changes: 58 additions & 29 deletions sycl/plugins/level_zero/pi_level_zero.cpp
@@ -481,6 +481,24 @@ static const size_t ImmCmdListsEventCleanupThreshold = [] {
return Threshold;
}();

// Get value of the threshold for number of active command lists allowed before
// we start heuristically cleaning them up.
static const size_t CmdListsCleanupThreshold = [] {
const char *CmdListsCleanupThresholdStr =
std::getenv("SYCL_PI_LEVEL_ZERO_COMMANDLISTS_CLEANUP_THRESHOLD");
static constexpr int Default = 20;
if (!CmdListsCleanupThresholdStr)
return Default;

int Threshold = std::atoi(CmdListsCleanupThresholdStr);

// Basically disable threshold if negative value is provided.
if (Threshold < 0)
return INT_MAX;

return Threshold;
}();

pi_device _pi_context::getRootDevice() const {
assert(Devices.size() > 0);

Expand Down Expand Up @@ -1051,15 +1069,15 @@ static pi_result CleanupEventsInImmCmdLists(pi_queue Queue,

/// @brief Reset signalled command lists in the queue and put them to the cache
/// of command lists. Also cleanup events associated with signalled command
/// lists. Queue must not be locked by the caller.
/// lists. Queue must be locked by the caller for modification.
/// @param Queue Queue where we look for signalled command lists and cleanup
/// events.
/// @return PI_SUCCESS if successful, PI error code otherwise.
static pi_result resetCommandLists(pi_queue Queue) {
// Handle immediate command lists here, they don't need to be reset and we
// only need to cleanup events.
if (Queue->Device->ImmCommandListUsed) {
PI_CALL(CleanupEventsInImmCmdLists(Queue));
PI_CALL(CleanupEventsInImmCmdLists(Queue, true /*locked*/));
return PI_SUCCESS;
}

@@ -1068,31 +1086,29 @@ static pi_result resetCommandLists(pi_queue Queue) {
// locks are hard to control and can cause deadlocks if mutexes are locked in
// different order.
std::vector<pi_event> EventListToCleanup;
{
// We check for command lists that have been already signalled, but have not
// been added to the available list yet. Each command list has a fence
// associated which tracks if a command list has completed dispatch of its
// commands and is ready for reuse. If a command list is found to have been
// signalled, then the command list & fence are reset and command list is
// returned to the command list cache. All events associated with command
// list are cleaned up if command list was reset.
std::unique_lock<pi_shared_mutex> QueueLock(Queue->Mutex);
for (auto &&it = Queue->CommandListMap.begin();
it != Queue->CommandListMap.end(); ++it) {
// Immediate commandlists don't use a fence and are handled separately
// above.
assert(it->second.ZeFence != nullptr);
// It is possible that the fence was already noted as signalled and
// reset. In that case the ZeFenceInUse flag will be false.
if (it->second.ZeFenceInUse) {
ze_result_t ZeResult =
ZE_CALL_NOCHECK(zeFenceQueryStatus, (it->second.ZeFence));
if (ZeResult == ZE_RESULT_SUCCESS)
PI_CALL(Queue->resetCommandList(it, true, EventListToCleanup));
}

// We check for command lists that have been already signalled, but have not
// been added to the available list yet. Each command list has a fence
// associated which tracks if a command list has completed dispatch of its
// commands and is ready for reuse. If a command list is found to have been
// signalled, then the command list & fence are reset and command list is
// returned to the command list cache. All events associated with command
// list are cleaned up if command list was reset.
for (auto &&it = Queue->CommandListMap.begin();
it != Queue->CommandListMap.end(); ++it) {
// Immediate commandlists don't use a fence and are handled separately
// above.
assert(it->second.ZeFence != nullptr);
// It is possible that the fence was already noted as signalled and
// reset. In that case the ZeFenceInUse flag will be false.
if (it->second.ZeFenceInUse) {
ze_result_t ZeResult =
ZE_CALL_NOCHECK(zeFenceQueryStatus, (it->second.ZeFence));
if (ZeResult == ZE_RESULT_SUCCESS)
PI_CALL(Queue->resetCommandList(it, true, EventListToCleanup));
}
}
CleanupEventListFromResetCmdList(EventListToCleanup);
CleanupEventListFromResetCmdList(EventListToCleanup, true /*locked*/);
return PI_SUCCESS;
}
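The hunk above moves the queue lock out of `resetCommandLists` and into its callers; the loop itself still queries each command list's fence and, when signaled, resets the list and collects its events for cleanup. A toy sketch of that sweep, with illustrative stand-in types (`CmdListInfo`, `FenceSignaled`) replacing the real Level Zero fence query:

```cpp
#include <cassert>
#include <map>
#include <vector>

// Toy stand-ins for the per-queue command-list state tracked by the plugin.
struct CmdListInfo {
  bool FenceInUse;         // fence submitted and not yet observed signaled
  bool FenceSignaled;      // stand-in for zeFenceQueryStatus() == ZE_RESULT_SUCCESS
  std::vector<int> Events; // events associated with this command list
};

// Sketch of the reset sweep: every in-use, signaled command list is reset
// (returned to the cache) and its events are moved out for later cleanup.
// The caller is assumed to hold the queue lock, matching the new contract.
static int ResetSignaledCmdLists(std::map<int, CmdListInfo> &CmdListMap,
                                 std::vector<int> &EventsToCleanup) {
  int NumReset = 0;
  for (auto &Entry : CmdListMap) {
    CmdListInfo &CL = Entry.second;
    if (CL.FenceInUse && CL.FenceSignaled) {
      CL.FenceInUse = false; // fence reset; list goes back to the cache
      EventsToCleanup.insert(EventsToCleanup.end(), CL.Events.begin(),
                             CL.Events.end());
      CL.Events.clear();
      ++NumReset;
    }
  }
  return NumReset;
}
```

Collecting events into `EventsToCleanup` and cleaning them after the sweep mirrors the real code's reason for deferring cleanup: event cleanup takes other locks, and nested locking in mixed order risks deadlock.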

Expand All @@ -1113,6 +1129,14 @@ pi_result _pi_context::getAvailableCommandList(
if (auto Res = Queue->insertActiveBarriers(CommandList, UseCopyEngine))
return Res;
return PI_SUCCESS;
} else {
// Cleanup regular command-lists if there are too many.
// It handles the case that the queue is not synced to the host
// for a long time and we want to reclaim the command-lists for
// use by other queues.
if (Queue->CommandListMap.size() > CmdListsCleanupThreshold) {
resetCommandLists(Queue);
}
}

auto &CommandBatch =
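The new `else` branch applies the cleanup heuristic lazily: nothing happens until the queue has accumulated more command lists than the configured threshold, at which point the (relatively costly) fence-status sweep runs. A minimal sketch of the trigger, with an illustrative helper name:

```cpp
#include <cassert>
#include <cstddef>

// Heuristic trigger from getAvailableCommandList: only attempt the cleanup
// sweep once the queue holds more command lists than the threshold. This
// covers queues that are never synced to the host for a long time, so their
// command lists can still be reclaimed for reuse.
static bool ShouldCleanupCmdLists(std::size_t NumCmdLists,
                                  std::size_t CleanupThreshold) {
  return NumCmdLists > CleanupThreshold;
}
```

With the default threshold of 20, a queue holding exactly 20 command lists triggers no sweep; the 21st does.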
Expand Down Expand Up @@ -2810,8 +2834,10 @@ pi_result piQueueFinish(pi_queue Queue) {
// Reset signalled command lists and return them back to the cache of
// available command lists. Events in the immediate command lists are cleaned
// up in synchronize().
if (!Queue->Device->ImmCommandListUsed)
if (!Queue->Device->ImmCommandListUsed) {
std::unique_lock<pi_shared_mutex> Lock(Queue->Mutex);
resetCommandLists(Queue);
}
return PI_SUCCESS;
}

Expand Down Expand Up @@ -5072,9 +5098,10 @@ pi_result piEventsWait(pi_uint32 NumEvents, const pi_event *EventList) {

// We waited some events above, check queue for signaled command lists and
// reset them.
for (auto &Q : Queues)
for (auto &Q : Queues) {
std::unique_lock<pi_shared_mutex> Lock(Q->Mutex);
resetCommandLists(Q);

}
return PI_SUCCESS;
}

Expand Down Expand Up @@ -5471,8 +5498,10 @@ pi_result piEnqueueEventsWait(pi_queue Queue, pi_uint32 NumEventsInWaitList,
}
}

if (!Queue->Device->ImmCommandListUsed)
if (!Queue->Device->ImmCommandListUsed) {
std::unique_lock<pi_shared_mutex> Lock(Queue->Mutex);
resetCommandLists(Queue);
}

return PI_SUCCESS;
}
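The same caller-locks-then-calls pattern recurs in `piQueueFinish`, `piEventsWait`, and `piEnqueueEventsWait`: each takes the queue mutex exclusively before calling `resetCommandLists`, which now assumes the lock is held. A minimal sketch of that discipline, with `std::shared_mutex` standing in for `pi_shared_mutex` and toy names throughout:

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>

struct ToyQueue {
  std::shared_mutex Mutex;
  int ResetCount = 0;
};

// Requires: caller holds Queue.Mutex exclusively (for modification).
static void ResetCommandListsLocked(ToyQueue &Queue) { ++Queue.ResetCount; }

// Caller-side pattern introduced by this PR: take the queue lock once at the
// call site, then invoke the helper that assumes the lock is already held.
static void FinishQueue(ToyQueue &Queue) {
  std::unique_lock<std::shared_mutex> Lock(Queue.Mutex);
  ResetCommandListsLocked(Queue);
} // Lock released here
```

Moving the lock to the callers lets `getAvailableCommandList`, which already holds the queue lock, invoke the reset path directly without recursive locking.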