[SYCL][NFCI] Don't go through variadic for `parallel_for(range<N>, krn)` #18019

aelovikov-intel · 2025-04-14T23:16:10Z

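The variadic `parallel_for(range<N>, ...)` is a "reduction" overload that, for a plain `parallel_for(range<N>, krn)` call, just happens to dispatch immediately to the non-reduction range+properties version of `parallel_for`. Going through the simpler overload (unused before this PR) directly seems to be cheaper to compile.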

E.g., for

```
#include <sycl/sycl.hpp>

#include <cstddef>
#include <cstdint>
#include <variant>

template <typename...> struct Name;

template <typename Krn> struct Invoker {
  static void call(void *p, int i) { (*static_cast<Krn *>(p))(i); }
};

void invoke(void (*)(void *, int));

struct Kernel {
  using PointersVariant =
      std::variant<std::int8_t *, std::int16_t *, std::uint8_t *,
                   std::uint16_t *, float *, double *, sycl::half *>;

  PointersVariant lhs;
  PointersVariant rhs;
  std::size_t sz;
  PointersVariant out;

  template <typename T>
  Kernel(T *l, T *r, std::size_t size, T *o)
      : lhs(l), rhs(r), sz(size), out(o) {}

  void operator()(sycl::handler &h) {
    // std::visit instantiates the lambda (and so one parallel_for) for every
    // combination of pointer types held by the three variants.
    std::visit(
        [&](auto lhs_ptr, auto rhs_ptr, auto dst_ptr) {
          auto L = [=](auto i) { dst_ptr[i] = lhs_ptr[i] + rhs_ptr[i]; };
          using N =
              Name<decltype(lhs_ptr), decltype(rhs_ptr), decltype(dst_ptr)>;
          h.parallel_for<N>(sz, L);
          invoke(&Invoker<decltype(L)>::call);
        },
        lhs, rhs, out);
  }
};

auto p = &Kernel::operator();
```

I see compile time improve from 10.35s to 9.9s for

`$ time clang++ -fsycl -c a.cpp -D__SYCL_DISABLE_PARALLEL_FOR_RANGE_ROUNDING__`
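To illustrate the shape of the change, here is a rough, self-contained model of the two call paths. It is not the actual DPC++ `handler` code: `MiniHandler`, `parallel_for_simple`, `parallel_for_variadic` and `check_paths_agree` are hypothetical names, the range is collapsed to a plain `std::size_t`, and the kernel runs as a host loop. It only shows that a variadic entry point which immediately forwards costs one extra template instantiation per kernel type while producing the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for sycl::handler (assumption, not the real class).
struct MiniHandler {
  int launches = 0;      // how many times a kernel was run
  int variadic_hops = 0; // how many calls went through the variadic layer

  // Simple overload: just a range and a kernel.
  template <typename Kernel>
  void parallel_for_simple(std::size_t n, const Kernel &k) {
    ++launches;
    for (std::size_t i = 0; i < n; ++i)
      k(i);
  }

  // Variadic overload modelled on a "(range, reductions..., kernel)" shape.
  // This sketch only handles the case with zero reductions, where it does
  // nothing except forward to the simple overload.
  template <typename... Rest>
  void parallel_for_variadic(std::size_t n, Rest &&...rest) {
    static_assert(sizeof...(Rest) == 1,
                  "sketch only models the no-reduction case");
    ++variadic_hops;
    parallel_for_simple(n, std::forward<Rest>(rest)...);
  }
};

// Runs the same kernel through both paths and checks they agree.
inline void check_paths_agree() {
  MiniHandler h;
  std::size_t via_variadic = 0, direct = 0;
  h.parallel_for_variadic(4, [&](std::size_t i) { via_variadic += i; });
  h.parallel_for_simple(4, [&](std::size_t i) { direct += i; });
  assert(via_variadic == direct);
  assert(h.launches == 2 && h.variadic_hops == 1);
}
```

In this model, calling `parallel_for_simple` directly skips the instantiation of `parallel_for_variadic` for that kernel type. With many distinct kernel types, as in the `std::visit` reproducer, skipped instantiations of this kind are presumably where the compile-time saving comes from.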

steffenlarsen

Seems reasonable! Less indirection is good.

aelovikov-intel requested a review from a team as a code owner April 14, 2025 23:16

aelovikov-intel requested review from sergey-semenov and steffenlarsen April 14, 2025 23:16

steffenlarsen approved these changes Apr 15, 2025

aelovikov-intel merged commit 567c077 into intel:sycl Apr 15, 2025
37 of 38 checks passed

aelovikov-intel deleted the avoid-red-overloads branch April 15, 2025 14:47
