Skip to content

[SYCL] Implement parallel_for(range, reduction, func) #4101

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

v-klochkov
Copy link
Contributor

Currently parallel_for accepting sycl::range may handle only 1 reduction
variable.

Also, this patch had to change/update the methods Reducer::atomic_combine(),
which was the good moment to switch the methods implementation from
the deprecated sycl::atomic class to sycl::ONEAPI::atomic_ref.
The conditions on which the fast-atomics implementations are used were
not changed as that deserves a separate patch.

parallel_for accepting sycl::range works much faster than parallel_for
accepting sycl::nd_range, which means that nd_range version needs some
additional performance tuning soon.

Signed-off-by: Vyacheslav N Klochkov [email protected]

Currently parallel_for accepting sycl::range may handle only 1 reduction
variable.

Also, this patch had to change/update the methods Reducer::atomic_combine(),
which was the good moment to switch the methods implementation from
the deprecated sycl::atomic class to sycl::ONEAPI::atomic_ref.
The conditions on which the fast-atomics implementations are used were
not changed as that deserves a separate patch.

parallel_for accepting sycl::range works much faster than parallel_for
accepting sycl::nd_range, which means that nd_range version needs some
additional performance tuning soon.

Signed-off-by: Vyacheslav N Klochkov <[email protected]>
@v-klochkov v-klochkov requested a review from a team as a code owner July 14, 2021 05:57
size_t NWorkGroups = NWorkItems / WGSize;
if (NWorkItems % WGSize)
NWorkGroups++;
size_t MaxNWorkGroups = NumEUThreads;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mapping to Intel GPUs is such that 1 EU thread == 1 sub-group. It's not clear to me that setting the number of work groups equal to the number of EU threads is particularly meaningful if the work-group size is large.

Copy link
Contributor Author

@v-klochkov v-klochkov Jul 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This definitely can be tuned additionally later. I see now that these heuristics give the best results. Lowering number of work-groups or reducing the size of work-groups gives slower perf.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies if this comment appears twice -- having some GitHub problems. I think we should add an explicit TODO in the implementation of MaxNumConcurrentWorkGroups saying that it needs to be tuned for other devices.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the comment. Heuristics definitely need some additional tuning.
There is a comment saying exactly that: https://github.com/v-klochkov/llvm/blob/public_vklochkov_reduction_range_review/sycl/source/detail/reduction.cpp#L57

I'll add "TODO: " to it in a separate [NFC] patch

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I tried to post this message directly tied to the comment but GitHub wouldn't let me do it! Adding the TODO in a separate NFC patch sounds good to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the TODO comment to this PR: #4361 (commit: 900de46)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the TODO comment to this PR: #4361 (commit: 900de46)

@Pennycook
Copy link
Contributor

@JackAKirk: This PR introduces some usages of atomic_ref instead of atomic. Does it address your TODO from line 64?

@JackAKirk
Copy link
Contributor

@JackAKirk: This PR introduces some usages of atomic_ref instead of atomic. Does it address your TODO from line 64?

No, but I think that is OK since it comes under this:

"The conditions on which the fast-atomics implementations are used were
not changed as that deserves a separate patch."

Basically the 32 bit float case should be covered by IsReduOptForFastAtomicFetch rather than IsReduOptForAtomic64Add.

@Pennycook
Copy link
Contributor

@JackAKirk: This PR introduces some usages of atomic_ref instead of atomic. Does it address your TODO from line 64?

No, but I think that is OK since it comes under this:

"The conditions on which the fast-atomics implementations are used were
not changed as that deserves a separate patch."

Basically the 32 bit float case should be covered by IsReduOptForFastAtomicFetch rather than IsReduOptForAtomic64Add.

Ah, you're right. I'd missed that in the PR description. Thanks.

@v-klochkov
Copy link
Contributor Author

@alexbatashev - please take a quick look at this fix from 'abi breaking' point of view.
I believe I did not add any breaking changes. The reason why I am asking is the test layout_array.cpp, which started giving AST dumps for 'id' class before dumps for 'range'. So, I just changed their order in the test.

@v-klochkov v-klochkov requested a review from alexbatashev July 15, 2021 05:41
@v-klochkov
Copy link
Contributor Author

The corresponding LIT tests are almost ready. I will upload them by the noon/end of Thursday.

Copy link
Contributor

@alexbatashev alexbatashev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ABI changes LGTM

@v-klochkov v-klochkov merged commit d1556e4 into intel:sycl Jul 15, 2021
@v-klochkov v-klochkov deleted the public_vklochkov_reduction_range_review branch July 16, 2021 18:28
v-klochkov added a commit to v-klochkov/llvm that referenced this pull request Aug 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants