Skip to content

[SYCL][Reduction] Avoid implicit atomic64 requirements #9070

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 21 additions & 1 deletion sycl/include/sycl/reduction.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -2632,6 +2632,20 @@ template <> struct NDRangeReduction<reduction::strategy::auto_select> {
else
return Delegate(Impl<Strat::basic>{});
} else if constexpr (Reduction::has_fast_atomics) {
if constexpr (sizeof(typename Reduction::result_type) == 8) {
// Both group_reduce_and_atomic_cross_wg and
// local_mem_tree_and_atomic_cross_wg implicitly require
// aspect::atomic64 if the result type of the reduction is 64-bit. If
// the device does not support this, we need to fall back to more
// reliable strategies.
if (!getDeviceFromHandler(CGH).has(aspect::atomic64)) {
if constexpr (Reduction::has_fast_reduce)
return Delegate(Impl<Strat::group_reduce_and_multiple_kernels>{});
else
return Delegate(Impl<Strat::basic>{});
}
}

if constexpr (Reduction::has_fast_reduce) {
return Delegate(Impl<Strat::group_reduce_and_atomic_cross_wg>{});
} else {
Expand Down Expand Up @@ -2762,10 +2776,16 @@ void reduction_parallel_for(handler &CGH, range<Dims> Range,
// specification. However, implementing run-time check for that would
// result in an extra kernel compilation(s). We probably need to
// investigate if the usage of kernel_bundles can mitigate that.
// TODO: local_atomic_and_atomic_cross_wg uses atomics on the partial
// results, which may add an implicit requirement on aspect::atomic64. As
// a temporary work-around we do not pick this if the result type is
// 64-bit. In the future this selection should be done at runtime based
// on the device.
// Note: Identityless reductions cannot use group reductions.
if constexpr (Reduction::has_fast_reduce && Reduction::has_identity)
return reduction::strategy::group_reduce_and_last_wg_detection;
else if constexpr (Reduction::has_fast_atomics)
else if constexpr (Reduction::has_fast_atomics &&
sizeof(typename Reduction::result_type) != 8)
return reduction::strategy::local_atomic_and_atomic_cross_wg;
else
return reduction::strategy::range_basic;
Expand Down
67 changes: 67 additions & 0 deletions sycl/test-e2e/Regression/reduction_64bit_atomic64.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
// RUN: %clangxx -fsycl -fsycl-targets=%sycl_triple %s -o %t.out
//
// RUN: %CPU_RUN_PLACEHOLDER %t.out
// RUN: %GPU_RUN_PLACEHOLDER %t.out
// RUN: %ACC_RUN_PLACEHOLDER %t.out

// Tests that a previously known case for reduction doesn't cause a requirement
// for atomic64.
// TODO: When aspect requirements are added to testing, this test could be set
// to require that atomic64 is NOT supported, to limit how frequently the
// test is run. However, it should work on devices that support atomic64
// as well.
Comment on lines +9 to +12
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest adding runtime aspect check for that here in this PR with the expectation that we'll uplift it to // REQUIRES once the support is added to the llvm.lit.cfg.py.

I also think that a slightly better option would be to "unit-test" the dynamic strategy selector once it's implemented (i.e., don't even run the actual reduction kernel).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest adding runtime aspect check for that here in this PR with the expectation that we'll uplift it to // REQUIRES once the support is added to the llvm.lit.cfg.py.

We could do a runtime check if you prefer. The test is still valid, so this TODO is more of a suggestion than a requirement really.

I also think that a slightly better option would be to "unit-test" the dynamic strategy selector once it's implemented (i.e., don't even run the actual reduction kernel).

I agree that would be better, but sadly the unittests don't compile the kernels, so we would have no way of knowing if the corresponding kernels had any requirements. We could set the requirements ourselves, but that really defeats the purpose of the tests.


#include <sycl/sycl.hpp>

#include <iostream>

using namespace sycl;

int main() {
queue Q;

if (Q.get_device().has(aspect::atomic64)) {
std::cout << "Device supports aspect::atomic64 so we do not need to run "
"the test."
<< std::endl;
return 0;
}

long long *Out = malloc_shared<long long>(1, Q);

// Case 1: nd_range reduction with 64-bit integer and either sycl::plus,
// sycl::minimum or sycl::maximum. group_reduce_and_atomic_cross_wg strategy
// would normally be picked, but if the device does not support atomic64 that
// strategy is invalid.
Q.submit([&](handler &CGH) {
auto Redu = reduction(Out, 0ll, sycl::plus<long long>{});
CGH.parallel_for(nd_range<1>{range<1>{32}, range<1>{32}}, Redu,
[=](nd_item<1> It, auto &Sum) {
Sum.combine(It.get_global_linear_id());
});
}).wait();

// Case 2: nd_range reduction with 64-bit integer and either sycl::bit_or,
// sycl::bit_xor, sycl::bit_and. local_mem_tree_and_atomic_cross_wg strategy
// would normally be picked, but if the device does not support atomic64 that
// strategy is invalid.
Q.submit([&](handler &CGH) {
auto Redu = reduction(Out, 0ll, sycl::bit_and<long long>{});
CGH.parallel_for(nd_range<1>{range<1>{32}, range<1>{32}}, Redu,
[=](nd_item<1> It, auto &Sum) {
Sum.combine(It.get_global_linear_id());
});
}).wait();

// Case 3: range reduction with 64-bit integer and either sycl::bit_or,
// sycl::bit_xor, sycl::bit_and. local_atomic_and_atomic_cross_wg strategy
// would normally be picked, but if the device does not support atomic64 that
// strategy is invalid.
Q.submit([&](handler &CGH) {
auto Redu = reduction(Out, 0ll, sycl::bit_and<long long>{});
CGH.parallel_for(range<1>{32}, Redu,
[=](item<1> It, auto &Sum) { Sum.combine(It); });
}).wait();
sycl::free(Out, Q);
return 0;
}