Skip to content

Commit 24ae015

Browse files
[SYCL][L0] optimize re-use of command-lists (#8870)
Heuristically prune command-lists for re-use by others in the same context --------- Signed-off-by: Sergey V Maslov <[email protected]>
1 parent 2b19458 commit 24ae015

File tree

2 files changed

+59
-29
lines changed

2 files changed

+59
-29
lines changed

sycl/doc/EnvironmentVariables.md

100644100755
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -257,6 +257,7 @@ variables in production code.</span>
257257
| `SYCL_PI_LEVEL_ZERO_SINGLE_ROOT_DEVICE_BUFFER_MIGRATION` | Integer | When set to "0" tells to use single root-device allocation for all devices in a context where all devices have same root. Otherwise performs regular buffer migration. Default is 1. |
258258
| `SYCL_PI_LEVEL_ZERO_REUSE_DISCARDED_EVENTS` | Integer | When set to a positive value enables the mode when discarded Level Zero events are reset and reused in scope of the same in-order queue based on the dependency chain between commands. Default is 1. |
259259
| `SYCL_PI_LEVEL_ZERO_EXPOSE_CSLICE_IN_AFFINITY_PARTITIONING` (Deprecated) | Integer | When set to non-zero value exposes compute slices as sub-sub-devices in `sycl::info::partition_property::partition_by_affinity_domain` partitioning scheme. Default is zero meaning that they are only exposed when partitioning by `sycl::info::partition_property::ext_intel_partition_by_cslice`. This option is introduced for compatibility reasons and is immediately deprecated. New code must not rely on this behavior. Also note that even if sub-sub-device was created using `partition_by_affinity_domain` it would still be reported as created via partitioning by compute slices. |
260+
| `SYCL_PI_LEVEL_ZERO_COMMANDLISTS_CLEANUP_THRESHOLD` | Integer | If non-negative then the threshold is set to this value. If negative, the threshold is set to INT_MAX. Whenever the number of command lists in a queue exceeds this threshold, an attempt is made to cleanup completed command lists for their subsequent reuse. The default is 20. |
260261
| `SYCL_PI_LEVEL_ZERO_IMMEDIATE_COMMANDLISTS_EVENT_CLEANUP_THRESHOLD` | Integer | If non-negative then the threshold is set to this value. If negative, the threshold is set to INT_MAX. Whenever the number of events associated with an immediate command list exceeds this threshold, a check is made for signaled events and these events are recycled. Setting this threshold low causes events to be checked more often, which could result in unneeded events being recycled sooner. However, more frequent event status checks may cost time. The default is 20. |
261262
| `SYCL_PI_LEVEL_ZERO_USM_RESIDENT` | Integer | Controls if/where to make USM allocations resident at the time of allocation. If set to 0 (default) then no special residency is forced. If set to 1 then allocation (device or shared) is made resident at the device of allocation. If set to 2 then allocation (device or shared) is made resident on all devices in the context of allocation that have P2P access to the device of allocation. For host allocation, any non-0 setting forces the allocation resident on all devices in the context. |
262263
| `SYCL_PI_LEVEL_ZERO_USE_NATIVE_USM_MEMCPY2D` | Integer | When set to a positive value enables the use of Level Zero USM 2D memory copy operations. Default is 0. |

sycl/plugins/level_zero/pi_level_zero.cpp

Lines changed: 58 additions & 29 deletions
Original file line numberDiff line numberDiff line change
@@ -481,6 +481,24 @@ static const size_t ImmCmdListsEventCleanupThreshold = [] {
481481
return Threshold;
482482
}();
483483

484+
// Get value of the threshold for number of active command lists allowed before
485+
// we start heuristically cleaning them up.
486+
static const size_t CmdListsCleanupThreshold = [] {
487+
const char *CmdListsCleanupThresholdStr =
488+
std::getenv("SYCL_PI_LEVEL_ZERO_COMMANDLISTS_CLEANUP_THRESHOLD");
489+
static constexpr int Default = 20;
490+
if (!CmdListsCleanupThresholdStr)
491+
return Default;
492+
493+
int Threshold = std::atoi(CmdListsCleanupThresholdStr);
494+
495+
// Basically disable threshold if negative value is provided.
496+
if (Threshold < 0)
497+
return INT_MAX;
498+
499+
return Threshold;
500+
}();
501+
484502
pi_device _pi_context::getRootDevice() const {
485503
assert(Devices.size() > 0);
486504

@@ -1051,15 +1069,15 @@ static pi_result CleanupEventsInImmCmdLists(pi_queue Queue,
10511069

10521070
/// @brief Reset signalled command lists in the queue and put them to the cache
10531071
/// of command lists. Also cleanup events associated with signalled command
1054-
/// lists. Queue must not be locked by the caller.
1072+
/// lists. Queue must be locked by the caller for modification.
10551073
/// @param Queue Queue where we look for signalled command lists and cleanup
10561074
/// events.
10571075
/// @return PI_SUCCESS if successful, PI error code otherwise.
10581076
static pi_result resetCommandLists(pi_queue Queue) {
10591077
// Handle immediate command lists here, they don't need to be reset and we
10601078
// only need to cleanup events.
10611079
if (Queue->Device->ImmCommandListUsed) {
1062-
PI_CALL(CleanupEventsInImmCmdLists(Queue));
1080+
PI_CALL(CleanupEventsInImmCmdLists(Queue, true /*locked*/));
10631081
return PI_SUCCESS;
10641082
}
10651083

@@ -1068,31 +1086,29 @@ static pi_result resetCommandLists(pi_queue Queue) {
10681086
// locks are hard to control and can cause deadlocks if mutexes are locked in
10691087
// different order.
10701088
std::vector<pi_event> EventListToCleanup;
1071-
{
1072-
// We check for command lists that have been already signalled, but have not
1073-
// been added to the available list yet. Each command list has a fence
1074-
// associated which tracks if a command list has completed dispatch of its
1075-
// commands and is ready for reuse. If a command list is found to have been
1076-
// signalled, then the command list & fence are reset and command list is
1077-
// returned to the command list cache. All events associated with command
1078-
// list are cleaned up if command list was reset.
1079-
std::unique_lock<pi_shared_mutex> QueueLock(Queue->Mutex);
1080-
for (auto &&it = Queue->CommandListMap.begin();
1081-
it != Queue->CommandListMap.end(); ++it) {
1082-
// Immediate commandlists don't use a fence and are handled separately
1083-
// above.
1084-
assert(it->second.ZeFence != nullptr);
1085-
// It is possible that the fence was already noted as signalled and
1086-
// reset. In that case the ZeFenceInUse flag will be false.
1087-
if (it->second.ZeFenceInUse) {
1088-
ze_result_t ZeResult =
1089-
ZE_CALL_NOCHECK(zeFenceQueryStatus, (it->second.ZeFence));
1090-
if (ZeResult == ZE_RESULT_SUCCESS)
1091-
PI_CALL(Queue->resetCommandList(it, true, EventListToCleanup));
1092-
}
1089+
1090+
// We check for command lists that have been already signalled, but have not
1091+
// been added to the available list yet. Each command list has a fence
1092+
// associated which tracks if a command list has completed dispatch of its
1093+
// commands and is ready for reuse. If a command list is found to have been
1094+
// signalled, then the command list & fence are reset and command list is
1095+
// returned to the command list cache. All events associated with command
1096+
// list are cleaned up if command list was reset.
1097+
for (auto &&it = Queue->CommandListMap.begin();
1098+
it != Queue->CommandListMap.end(); ++it) {
1099+
// Immediate commandlists don't use a fence and are handled separately
1100+
// above.
1101+
assert(it->second.ZeFence != nullptr);
1102+
// It is possible that the fence was already noted as signalled and
1103+
// reset. In that case the ZeFenceInUse flag will be false.
1104+
if (it->second.ZeFenceInUse) {
1105+
ze_result_t ZeResult =
1106+
ZE_CALL_NOCHECK(zeFenceQueryStatus, (it->second.ZeFence));
1107+
if (ZeResult == ZE_RESULT_SUCCESS)
1108+
PI_CALL(Queue->resetCommandList(it, true, EventListToCleanup));
10931109
}
10941110
}
1095-
CleanupEventListFromResetCmdList(EventListToCleanup);
1111+
CleanupEventListFromResetCmdList(EventListToCleanup, true /*locked*/);
10961112
return PI_SUCCESS;
10971113
}
10981114

@@ -1113,6 +1129,14 @@ pi_result _pi_context::getAvailableCommandList(
11131129
if (auto Res = Queue->insertActiveBarriers(CommandList, UseCopyEngine))
11141130
return Res;
11151131
return PI_SUCCESS;
1132+
} else {
1133+
// Cleanup regular command-lists if there are too many.
1134+
// It handles the case that the queue is not synced to the host
1135+
// for a long time and we want to reclaim the command-lists for
1136+
// use by other queues.
1137+
if (Queue->CommandListMap.size() > CmdListsCleanupThreshold) {
1138+
resetCommandLists(Queue);
1139+
}
11161140
}
11171141

11181142
auto &CommandBatch =
@@ -2810,8 +2834,10 @@ pi_result piQueueFinish(pi_queue Queue) {
28102834
// Reset signalled command lists and return them back to the cache of
28112835
// available command lists. Events in the immediate command lists are cleaned
28122836
// up in synchronize().
2813-
if (!Queue->Device->ImmCommandListUsed)
2837+
if (!Queue->Device->ImmCommandListUsed) {
2838+
std::unique_lock<pi_shared_mutex> Lock(Queue->Mutex);
28142839
resetCommandLists(Queue);
2840+
}
28152841
return PI_SUCCESS;
28162842
}
28172843

@@ -5072,9 +5098,10 @@ pi_result piEventsWait(pi_uint32 NumEvents, const pi_event *EventList) {
50725098

50735099
// We waited some events above, check queue for signaled command lists and
50745100
// reset them.
5075-
for (auto &Q : Queues)
5101+
for (auto &Q : Queues) {
5102+
std::unique_lock<pi_shared_mutex> Lock(Q->Mutex);
50765103
resetCommandLists(Q);
5077-
5104+
}
50785105
return PI_SUCCESS;
50795106
}
50805107

@@ -5471,8 +5498,10 @@ pi_result piEnqueueEventsWait(pi_queue Queue, pi_uint32 NumEventsInWaitList,
54715498
}
54725499
}
54735500

5474-
if (!Queue->Device->ImmCommandListUsed)
5501+
if (!Queue->Device->ImmCommandListUsed) {
5502+
std::unique_lock<pi_shared_mutex> Lock(Queue->Mutex);
54755503
resetCommandLists(Queue);
5504+
}
54765505

54775506
return PI_SUCCESS;
54785507
}

0 commit comments

Comments
 (0)