-
Notifications
You must be signed in to change notification settings - Fork 787
[SYCL] Implement parallel_for(range, reduction, func) #4101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL] Implement parallel_for(range, reduction, func) #4101
Conversation
Currently parallel_for accepting sycl::range may handle only 1 reduction variable. Also, this patch had to change/update the methods Reducer::atomic_combine(), which was the good moment to switch the methods implementation from the deprecated sycl::atomic class to sycl::ONEAPI::atomic_ref. The conditions on which the fast-atomics implementations are used were not changed as that deserves a separate patch. parallel_for accepting sycl::range works much faster than parallel_for accepting sycl::nd_range, which means that nd_range version needs some additional performance tuning soon. Signed-off-by: Vyacheslav N Klochkov <[email protected]>
size_t NWorkGroups = NWorkItems / WGSize; | ||
if (NWorkItems % WGSize) | ||
NWorkGroups++; | ||
size_t MaxNWorkGroups = NumEUThreads; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mapping to Intel GPUs is such that 1 EU thread == 1 sub-group. It's not clear to me that setting the number of work groups equal to the number of EU threads is particularly meaningful if the work-group size is large.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This definitely can be tuned additionally later. I see now that these heuristics give the best results. Lowering number of work-groups or reducing the size of work-groups gives slower perf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies if this comment appears twice -- having some GitHub problems. I think we should add an explicit TODO in the implementation of MaxNumConcurrentWorkGroups saying that it needs to be tuned for other devices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the comment. Heuristics definitely need some additional tuning.
There is a comment saying exactly that: https://github.com/v-klochkov/llvm/blob/public_vklochkov_reduction_range_review/sycl/source/detail/reduction.cpp#L57
I'll add "TODO: " to it in a separate [NFC] patch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I tried to post this message directly tied to the comment but GitHub wouldn't let me do it! Adding the TODO in a separate NFC patch sounds good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JackAKirk: This PR introduces some usages of |
No, but I think that is OK since it comes under this: "The conditions on which the fast-atomics implementations are used were Basically the 32 bit float case should be covered by IsReduOptForFastAtomicFetch rather than IsReduOptForAtomic64Add. |
Ah, you're right. I'd missed that in the PR description. Thanks. |
Signed-off-by: Vyacheslav N Klochkov <[email protected]>
…reduction_range_review
Signed-off-by: Vyacheslav N Klochkov <[email protected]>
…ted printing 'id' before 'range' Signed-off-by: Vyacheslav N Klochkov <[email protected]>
@alexbatashev - please take a quick look at this fix from 'abi breaking' point of view. |
The corresponding LIT tests are almost ready. I will upload them by the noon/end of Thursday. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ABI changes LGTM
Signed-off-by: Vyacheslav N Klochkov <[email protected]>
Currently parallel_for accepting sycl::range may handle only 1 reduction
variable.
Also, this patch had to change/update the methods Reducer::atomic_combine(),
which was the good moment to switch the methods implementation from
the deprecated sycl::atomic class to sycl::ONEAPI::atomic_ref.
The conditions on which the fast-atomics implementations are used were
not changed as that deserves a separate patch.
parallel_for accepting sycl::range works much faster than parallel_for
accepting sycl::nd_range, which means that nd_range version needs some
additional performance tuning soon.
Signed-off-by: Vyacheslav N Klochkov [email protected]