[SYCL] Implement parallel_for(range, reduction, func) #4101

v-klochkov · 2021-07-14T05:57:35Z

Currently parallel_for accepting sycl::range may handle only 1 reduction
variable.

This patch also had to update the Reducer::atomic_combine() methods,
which was a good opportunity to switch their implementation from
the deprecated sycl::atomic class to sycl::ONEAPI::atomic_ref.
The conditions on which the fast-atomics implementations are used were
not changed as that deserves a separate patch.

parallel_for accepting sycl::range works much faster than parallel_for
accepting sycl::nd_range, which means that the nd_range version needs
additional performance tuning soon.

Signed-off-by: Vyacheslav N Klochkov <[email protected]>
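
To make the "atomic combine" idea in the description concrete, here is a minimal host-side sketch in plain C++. It is only an analogy and not the code from this patch: `reduceOverRange` is a hypothetical name, `std::atomic` stands in for the device-side atomic_ref, and the choice of one atomic update per group is an assumption made for illustration.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side analogy only. `reduceOverRange` is a hypothetical name, and
// std::atomic stands in for the device-side atomic_ref used by the patch.
long long reduceOverRange(const std::vector<int> &Data, std::size_t WGSize) {
  assert(WGSize > 0 && "work-group size must be positive");
  std::atomic<long long> Result{0}; // the single reduction variable

  // Walk the range one "work-group" at a time (sequentially here).
  for (std::size_t Begin = 0; Begin < Data.size(); Begin += WGSize) {
    std::size_t End = std::min(Begin + WGSize, Data.size());

    long long Partial = 0; // partial sum private to this group
    for (std::size_t I = Begin; I < End; ++I)
      Partial += Data[I];

    // Assumed scheme: one atomic combine per group into the shared result.
    Result.fetch_add(Partial, std::memory_order_relaxed);
  }
  return Result.load();
}
```

Calling `reduceOverRange` on a vector of 1000 ones with a group size of 64 sums sixteen partial results, the last one covering a shorter group.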


sycl/include/CL/sycl/ONEAPI/reduction.hpp

sycl/include/CL/sycl/handler.hpp

Pennycook · 2021-07-14T17:47:34Z

sycl/include/CL/sycl/ONEAPI/reduction.hpp

+  size_t NWorkGroups = NWorkItems / WGSize;
+  if (NWorkItems % WGSize)
+    NWorkGroups++;
+  size_t MaxNWorkGroups = NumEUThreads;


The mapping to Intel GPUs is such that 1 EU thread == 1 sub-group. It's not clear to me that setting the number of work groups equal to the number of EU threads is particularly meaningful if the work-group size is large.

This can definitely be tuned further later. Right now I see that these heuristics give the best results: lowering the number of work-groups or reducing the work-group size gives slower performance.

Apologies if this comment appears twice -- having some GitHub problems. I think we should add an explicit TODO in the implementation of MaxNumConcurrentWorkGroups saying that it needs to be tuned for other devices.

Thank you for the comment. Heuristics definitely need some additional tuning.
There is a comment saying exactly that: https://github.com/v-klochkov/llvm/blob/public_vklochkov_reduction_range_review/sycl/source/detail/reduction.cpp#L57

I'll add "TODO: " to it in a separate [NFC] patch.

Yeah, I tried to post this message directly tied to the comment but GitHub wouldn't let me do it! Adding the TODO in a separate NFC patch sounds good to me.

I added the TODO comment to this PR: #4361 (commit: 900de46)
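
For readers following the heuristic discussed in this thread, the computation in the reviewed snippet can be sketched as a standalone function. The rounding-up division and the `MaxNWorkGroups = NumEUThreads` cap come from the snippet. The function name `cappedWorkGroupCount` and the final `std::min` clamp are assumptions, since the quoted diff stops before showing how the cap is applied.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of the patch. It mirrors the reviewed
// snippet and then clamps the result (the clamp is an assumption).
std::size_t cappedWorkGroupCount(std::size_t NWorkItems, std::size_t WGSize,
                                 std::size_t NumEUThreads) {
  assert(WGSize > 0 && "work-group size must be positive");

  // Round up so a partially filled last work-group is still counted.
  std::size_t NWorkGroups = NWorkItems / WGSize;
  if (NWorkItems % WGSize)
    NWorkGroups++;

  // Cap taken from the snippet: at most one work-group per EU thread.
  std::size_t MaxNWorkGroups = NumEUThreads;
  return std::min(NWorkGroups, MaxNWorkGroups);
}
```

With 1000 work-items and a work-group size of 256 the division rounds up to 4 groups. With many more work-items the result saturates at `NumEUThreads`, which is the behavior the review questions for large work-group sizes.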

Pennycook · 2021-07-14T17:48:34Z

@JackAKirk: This PR introduces some usages of atomic_ref instead of atomic. Does it address your TODO from line 64?

JackAKirk · 2021-07-14T18:16:37Z

> @JackAKirk: This PR introduces some usages of atomic_ref instead of atomic. Does it address your TODO from line 64?

No, but I think that is OK since it comes under this:

"The conditions on which the fast-atomics implementations are used were
not changed as that deserves a separate patch."

Basically the 32-bit float case should be covered by IsReduOptForFastAtomicFetch rather than IsReduOptForAtomic64Add.

Pennycook · 2021-07-14T18:56:20Z

> > @JackAKirk: This PR introduces some usages of atomic_ref instead of atomic. Does it address your TODO from line 64?
>
> No, but I think that is OK since it comes under this:
>
> "The conditions on which the fast-atomics implementations are used were
> not changed as that deserves a separate patch."
>
> Basically the 32-bit float case should be covered by IsReduOptForFastAtomicFetch rather than IsReduOptForAtomic64Add.

Ah, you're right. I'd missed that in the PR description. Thanks.


v-klochkov · 2021-07-15T05:38:23Z

@alexbatashev - please take a quick look at this fix from an ABI-breaking point of view.
I believe I did not add any breaking changes. I am asking because the test layout_array.cpp started giving AST dumps for the 'id' class before the dumps for 'range', so I just changed their order in the test.

v-klochkov · 2021-07-15T05:42:13Z

The corresponding LIT tests are almost ready. I will upload them by noon or the end of the day on Thursday.

alexbatashev

ABI changes LGTM


v-klochkov requested review from Pennycook and romanovvlad July 14, 2021 05:57

v-klochkov requested a review from a team as a code owner July 14, 2021 05:57

Pennycook reviewed Jul 14, 2021


v-klochkov added 4 commits July 14, 2021 16:26

Fixes for reviewer's comments + 1 correctness fix in handler.hpp

808bfa5

Signed-off-by: Vyacheslav N Klochkov <[email protected]>

Merge remote-tracking branch 'intel_llvm/sycl' into public_vklochkov_reduction_range_review

9191291

Fixed LIT test sycl_symbols_linux.dump

6efc98c

Signed-off-by: Vyacheslav N Klochkov <[email protected]>

NFC fix in layout_array.cpp LIT test. For some reasons AST dumps started printing 'id' before 'range'

63ca16a

Signed-off-by: Vyacheslav N Klochkov <[email protected]>

v-klochkov requested a review from alexbatashev July 15, 2021 05:41

alexbatashev approved these changes Jul 15, 2021


Pennycook approved these changes Jul 15, 2021


v-klochkov merged commit d1556e4 into intel:sycl Jul 15, 2021

v-klochkov mentioned this pull request Jul 15, 2021

[SYCL] Add lit tests for reduction + range (#4101) intel/llvm-test-suite#366

Merged

v-klochkov deleted the public_vklochkov_reduction_range_review branch July 16, 2021 18:28

v-klochkov added a commit to v-klochkov/llvm that referenced this pull request Aug 17, 2021

[SYCL][NFC] Add a TODO comment promised during review of intel#4101

900de46

Signed-off-by: Vyacheslav N Klochkov <[email protected]>

Review comment timeline: Pennycook (Jul 14, 2021); v-klochkov (Jul 15, 2021, edited); Pennycook (Jul 15, 2021); v-klochkov (Jul 15, 2021); Pennycook (Jul 15, 2021); v-klochkov (Aug 18, 2021); v-klochkov (Aug 18, 2021)