[SYCL][L0] Use compute engine for memory fill command #6802

Merged
merged 1 commit into from
Sep 21, 2022
1 change: 1 addition & 0 deletions sycl/doc/EnvironmentVariables.md
100755 → 100644
@@ -187,6 +187,7 @@ variables in production code.</span>
| `SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS` | Any(\*) | Enable support of device-scope events whose state is not visible to the host. If enabled mode is SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS=1 the Level Zero plugin would create all events having device-scope only and create proxy host-visible events for them when their status is needed (wait/query) on the host. If enabled mode is SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS=2 the Level Zero plugin would create all events having device-scope and add proxy host-visible event at the end of each command-list submission. The default is 2, meaning only the last event in a batch is host-visible. |
| `SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS` | Integer | When set to a positive value enables use of Level Zero immediate commandlists, which means there is no batching and all commands are immediately submitted for execution. Default is 0. Note: When immediate commandlist usage is enabled it is necessary to also set SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS to either 0 or 1. |
| `SYCL_PI_LEVEL_ZERO_USE_MULTIPLE_COMMANDLIST_BARRIERS` | Integer | When set to a positive value enables use of multiple Level Zero commandlists when submitting barriers. Default is 0. |
+ | `SYCL_PI_LEVEL_ZERO_USE_COPY_ENGINE_FOR_FILL` | Integer | When set to a positive value enables use of a copy engine for memory fill operations. Default is 0. |

## Debugging variables for CUDA Plugin

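As an illustration of the new table row, a process could opt in to copy-engine fills by setting the variable itself before it enqueues any fill. This is a minimal sketch, not code from this PR: `enableCopyEngineForFill`, `currentFillEngineSetting`, and `kFillEngineVar` are hypothetical names, and the use of POSIX `setenv` is an assumption about the platform. Setting the variable in the shell before launching the application is the more usual route.

```cpp
#include <stdlib.h> // setenv, getenv (POSIX; assumed available)
#include <string>

// Name of the variable documented in the row above.
constexpr const char *kFillEngineVar =
    "SYCL_PI_LEVEL_ZERO_USE_COPY_ENGINE_FOR_FILL";

// Hypothetical helper: request the copy engine for fills in this process.
// The third argument (1) overwrites any value already in the environment.
inline bool enableCopyEngineForFill() {
  return setenv(kFillEngineVar, "1", 1) == 0;
}

// Hypothetical helper: report the raw value the plugin would see,
// or an empty string when the variable is unset.
inline std::string currentFillEngineSetting() {
  const char *Raw = getenv(kFillEngineVar);
  return Raw ? std::string(Raw) : std::string();
}
```

Calling `enableCopyEngineForFill()` early in `main`, before any SYCL queue is created, is the conservative choice, since the sketch makes no assumption about when the runtime reads its environment.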
14 changes: 8 additions & 6 deletions sycl/plugins/level_zero/pi_level_zero.cpp
@@ -6654,7 +6654,7 @@ enqueueMemCopyHelper(pi_command_type CommandType, pi_queue Queue, void *Dst,
printZeEventList(WaitList);

ZE_CALL(zeCommandListAppendMemoryCopy,
- (ZeCommandList, Dst, Src, Size, ZeEvent, 0, nullptr));
+ (ZeCommandList, Dst, Src, Size, ZeEvent, 0, nullptr));

if (auto Res =
Queue->executeCommandList(CommandList, BlockingWrite, OkToBatch))
@@ -6898,15 +6898,17 @@ enqueueMemFillHelper(pi_command_type CommandType, pi_queue Queue, void *Ptr,

auto &Device = Queue->Device;

- // Performance analysis on a simple SYCL data "fill" test shows copy engine
- // is faster than compute engine for such operations.
- //
- bool PreferCopyEngine = true;
+ // Default to using compute engine for fill operation, but allow to
+ // override this with an environment variable.
+ const char *PreferCopyEngineEnv =
+ std::getenv("SYCL_PI_LEVEL_ZERO_USE_COPY_ENGINE_FOR_FILL");
+ bool PreferCopyEngine =
+ PreferCopyEngineEnv ? std::stoi(PreferCopyEngineEnv) != 0 : false;

// Make sure that pattern size matches the capability of the copy queues.
// Check both main and link groups as we don't known which one will be used.
//
- if (Device->hasCopyEngine()) {
+ if (PreferCopyEngine && Device->hasCopyEngine()) {
if (Device->hasMainCopyEngine() &&
Device->QueueGroup[_pi_device::queue_group_info_t::MainCopy]
.ZeProperties.maxMemoryFillPatternSize < PatternSize) {
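The environment-variable check added in `enqueueMemFillHelper` can be sketched in isolation as follows. This is an illustrative variant, not the code in the PR: `parseUseCopyEngineForFill` and `preferCopyEngineForFill` are hypothetical helpers, and the sketch swaps `std::stoi` for `std::strtol` so that malformed input falls back to the default instead of throwing `std::invalid_argument`. It keeps the diff's "non-zero enables" rule, which means a negative value also selects the copy engine even though the documentation row speaks of a positive value.

```cpp
#include <cerrno>
#include <cstdlib>

// Hypothetical helper (not part of the plugin): interpret the raw text of
// SYCL_PI_LEVEL_ZERO_USE_COPY_ENGINE_FOR_FILL.
// Assumption: anything that is not a clean base-10 integer counts as
// "disabled"; the std::stoi call in the diff would throw on such input.
inline bool parseUseCopyEngineForFill(const char *Raw) {
  if (!Raw || *Raw == '\0')
    return false; // unset or empty: keep the default (compute engine)

  errno = 0;
  char *End = nullptr;
  const long Value = std::strtol(Raw, &End, 10);

  // Reject input with no digits, trailing characters, or overflow.
  if (End == Raw || *End != '\0' || errno == ERANGE)
    return false;

  return Value != 0; // same non-zero test as the diff
}

// Hypothetical wrapper: read the variable from the environment.
inline bool preferCopyEngineForFill() {
  return parseUseCopyEngineForFill(
      std::getenv("SYCL_PI_LEVEL_ZERO_USE_COPY_ENGINE_FOR_FILL"));
}
```

Separating the parsing from the `std::getenv` call keeps the rule testable without touching the process environment. Whether silently ignoring malformed values is preferable to failing loudly is a design choice this sketch does not settle.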