-
Notifications
You must be signed in to change notification settings - Fork 787
[SYCL] [L0] Make immediate commandlists thread-specific #7041
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -6693,25 +6791,32 @@ pi_result piEnqueueEventsWaitWithBarrier(pi_queue Queue, | |||
It != Queue->CommandListMap.end(); ++It) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please double check that we indeed keep immediate command-lists (from all threads) in this Queue->CommandListMap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we have to do the same as in other places: https://github.com/intel/llvm/pull/7041/files#diff-15dd1eb076d2164bd9e87d9057f05f652a716498e8cdf5975e564c65309a0985R6976
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, all commandlists go into the map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then maybe use it everywhere instead of iterating with 2 for-loops?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A queue has a map by TID of compute queue groups and a map by TID of copy queue groups. So we have to first iterate over the two groups (compute and copy) and then per-thread entries in each map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I understand the reason for 2-deep loopnest, but here we get away by just iterating over CommandListMap
and assume we see all the same command-lists, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For standard commandlists there are many commandlists per ZeQueue (L0 queue).
For immediate commandlists there is only one immediate commandlist per L0 queue.
So, for immediate commandlists either type of scan yields the same set of L0 queues.
For standard commandlists if we used CommandListMap we would have to extract the unique L0 queues from the set of commandlists in the map.
You will notice that where ever a two-level for-loop structure is used, the separation between standard/immediate commandlists is inside. In this one case, we check standard/immediate outside, and then do a scan in different ways. This simply matches the code as it was before.
For consistency, this particular instance could also be written as a two-level for-loop with the standard/immediate check inside, if you wish.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency, this particular instance could also be written as a two-level for-loop with the standard/immediate check inside, if you wish.
Yes, that's what I was looking for, consistency. Either using one way or another everywhere (and using a single list is probably a faster option). BTW, why do we need to keep the immediate command-lists in the CommandListMap
at all?
Perhaps, if we could stop doing that then there would not be this confusing diff.
Having said that, this is NIT and should not block this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I've changed it for consistency.
The main reason immediate commandlists are also entered into the map is that throughout the plugin, commandlists are passed around as an iterator on that map, never as a commandlist itself. So, to retain that interface I put immediate commandlists in that map too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, thanks!
@@ -244,7 +244,7 @@ variables in production code.</span> | |||
| `SYCL_PI_LEVEL_ZERO_USE_COMPUTE_ENGINE` | Integer | It can be set to an integer (>=0) in which case all compute commands will be submitted to the command-queue with the given index in the compute command group. If it is instead set to a negative value then all available compute engines may be used. The default value is "0" | | |||
| `SYCL_PI_LEVEL_ZERO_USE_COPY_ENGINE_FOR_D2D_COPY` (experimental) | Integer | Allows the use of copy engine, if available in the device, in Level Zero plugin for device to device copy operations. The default is 0. This option is experimental and will be removed once heuristics are added to make a decision about use of copy engine for device to device copy operations. | | |||
| `SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS` | Any(\*) | Enable support of device-scope events whose state is not visible to the host. If enabled mode is SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS=1 the Level Zero plugin would create all events having device-scope only and create proxy host-visible events for them when their status is needed (wait/query) on the host. If enabled mode is SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS=2 the Level Zero plugin would create all events having device-scope and add proxy host-visible event at the end of each command-list submission. The default is 0, meaning all events have host visibility. | | |||
| `SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS` | Integer | When set to a positive value enables use of Level Zero immediate commandlists, which means there is no batching and all commands are immediately submitted for execution. Default is 0. Note: When immediate commandlist usage is enabled it is necessary to also set SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS to either 0 or 1. | | |||
| `SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS` | Integer | When set to a positive value enables use of Level Zero immediate commandlists, which means there is no batching and all commands are immediately submitted for execution. When set to 1, unique immediate commandlists are created for each SYCL queue. When set to 2, unique immediate commandlists are created per host thread per SYCL queue. Default is 0. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do I understand it correctly that "Default is 0" means that immediate command lists are not used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, by default immediate command lists are not used.
We are working towards making them default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, immediate command lists are not enabled by default yet.
This change makes immediate commandlists used by SYCL queues host-thread specific when SYCL_PI_LEVEL_ZERO_IMMEDIATE_COMMANDLISTS=2 is specified.