Skip to content

[SYCL] Start events cleanup for immediate command lists based on threshold #7773

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Dec 14, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions sycl/doc/EnvironmentVariables.md
Original file line number Diff line number Diff line change
Expand Up @@ -250,6 +250,7 @@ variables in production code.</span>
| `SYCL_PI_LEVEL_ZERO_SINGLE_ROOT_DEVICE_BUFFER_MIGRATION` | Integer | When set to "0" tells to use single root-device allocation for all devices in a context where all devices have same root. Otherwise performs regular buffer migration. Default is 1. |
| `SYCL_PI_LEVEL_ZERO_REUSE_DISCARDED_EVENTS` | Integer | When set to a positive value enables the mode when discarded Level Zero events are reset and reused in scope of the same in-order queue based on the dependency chain between commands. Default is 1. |
| `SYCL_PI_LEVEL_ZERO_EXPOSE_CSLICE_IN_AFFINITY_PARTITIONING` (Deprecated) | Integer | When set to non-zero value exposes compute slices as sub-sub-devices in `sycl::info::partition_property::partition_by_affinity_domain` partitioning scheme. Default is zero meaning that they are only exposed when partitioning by `sycl::info::partition_property::ext_intel_partition_by_cslice`. This option is introduced for compatibility reasons and is immediately deprecated. New code must not rely on this behavior. Also note that even if sub-sub-device was created using `partition_by_affinity_domain` it would still be reported as created via partitioning by compute slices. |
| `SYCL_PI_LEVEL_ZERO_IMMEDIATE_COMMANDLISTS_EVENT_CLEANUP_THRESHOLD` | Integer | If non-negative then the threshold is set to this value. If negative, the threshold is set to INT_MAX. Whenever the number of events associated with an immediate command list exceeds this threshold, a check is made for signaled events and these events are recycled. Setting this threshold low causes events to be checked more often, which could result in unneeded events being recycled sooner. However, more frequent event status checks may cost time. The default is 20. |

## Debugging variables for CUDA Plugin

Expand Down
25 changes: 25 additions & 0 deletions sycl/plugins/level_zero/pi_level_zero.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -845,6 +845,25 @@ static const int ImmediateCommandlistsSetting = [] {
return std::stoi(ImmediateCommandlistsSettingStr);
}();

// Get value of the threshold for number of events in immediate command lists.
// If number of events in the immediate command list exceeds this threshold then
// cleanup process for those events is executed.
static const size_t ImmCmdListsEventCleanupThreshold = [] {
const char *ImmCmdListsEventCleanupThresholdStr = std::getenv(
"SYCL_PI_LEVEL_ZERO_IMMEDIATE_COMMANDLISTS_EVENT_CLEANUP_THRESHOLD");
static constexpr int Default = 20;
if (!ImmCmdListsEventCleanupThresholdStr)
return Default;

int Threshold = std::atoi(ImmCmdListsEventCleanupThresholdStr);

// Basically disable threshold if negative value is provided.
if (Threshold < 0)
return INT_MAX;

return Threshold;
}();

// Whether immediate commandlists will be used for kernel launches and copies.
// The default is standard commandlists. Setting a value >=1 specifies use of
// immediate commandlists. Note: when immediate commandlists are used then
Expand Down Expand Up @@ -1439,6 +1458,12 @@ pi_result _pi_context::getAvailableCommandList(
// Immediate commandlists have been pre-allocated and are always available.
if (Queue->Device->useImmediateCommandLists()) {
CommandList = Queue->getQueueGroup(UseCopyEngine).getImmCmdList();
if (CommandList->second.EventList.size() >
ImmCmdListsEventCleanupThreshold) {
std::vector<pi_event> EventListToCleanup;
Queue->resetCommandList(CommandList, false, EventListToCleanup);
CleanupEventListFromResetCmdList(EventListToCleanup, true);
}
Comment on lines +1461 to +1466
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you find it desired to run more cleanup?

Copy link
Contributor Author

@againull againull Dec 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because currently for immediate command lists we cleanup only at point of synchronization. If application doesn't have any points of synchronization then we keep creating new events for immediate lists. This change results in better recycling of events and I see significant performance improvement for one of the sycl benchmarks (didn't check other)

Also I provided some details in the description of PR.

PI_CALL(Queue->insertStartBarrierIfDiscardEventsMode(CommandList));
if (auto Res = Queue->insertActiveBarriers(CommandList, UseCopyEngine))
return Res;
Expand Down