[SYCL][NFCI] Don't go through variadic for parallel_for(range<N>, krn) #18019

Merged
merged 1 commit into from
Apr 15, 2025

aelovikov-intel
Contributor

This is a "reduction" overload that just happens to dispatch immediately to the non-reduction range+properties version of `parallel_for`. Going through the simpler overload (unused before this PR) seems to be cheaper to compile.

E.g., for


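```
// Headers the snippet needs (not shown in the original post).
#include <sycl/sycl.hpp>

#include <cstddef>
#include <cstdint>
#include <variant>

// Kernel name, unique per combination of pointer types.
template <typename...> struct Name;

// Forces a host-side instantiation of the kernel's call operator.
template <typename Krn> struct Invoker {
  static void call(void *p, int i) { (*static_cast<Krn *>(p))(i); }
};

void invoke(void (*)(void *, int));

struct Kernel {
  using PointersVariant =
      std::variant<std::int8_t *, std::int16_t *, std::uint8_t *,
                   std::uint16_t *, float *, double *, sycl::half *>;

  PointersVariant lhs;
  PointersVariant rhs;
  std::size_t sz;
  PointersVariant out;

  template <typename T>
  Kernel(T *l, T *r, std::size_t size, T *o)
      : lhs(l), rhs(r), sz(size), out(o) {}

  void operator()(sycl::handler &h) {
    // Visiting three variants instantiates the lambda below, and therefore
    // parallel_for, once per combination of alternatives.
    std::visit(
        [&](auto lhs_ptr, auto rhs_ptr, auto dst_ptr) {
          auto L = [=](auto i) { dst_ptr[i] = lhs_ptr[i] + rhs_ptr[i]; };
          using N =
              Name<decltype(lhs_ptr), decltype(rhs_ptr), decltype(dst_ptr)>;
          h.parallel_for<N>(sz, L);
          invoke(&Invoker<decltype(L)>::call);
        },
        lhs, rhs, out);
  }
};

// Taking the address forces operator() to be compiled.
auto p = &Kernel::operator();
```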

I see compile time improve from 10.35s to 9.9s for

`$ time clang++ -fsycl -c a.cpp -D__SYCL_DISABLE_PARALLEL_FOR_RANGE_ROUNDING__`
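To make the "less indirection" point concrete, here is a minimal toy sketch in plain C++ (no SYCL needed). Everything in it is hypothetical: `ToyHandler`, `parallel_for_direct`, `parallel_for_variadic`, `dispatch`, `run` and `add_vectors` are illustrative names, not DPC++ internals, and the real overloads are considerably more involved. It only shows the general shape: a variadic entry point has to unpack its argument pack to find the kernel before it reaches the common back end, while a dedicated `(range, kernel)` entry point gets there with one fewer template layer.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical stand-in for sycl::handler; not DPC++ code.
struct ToyHandler {
  int layers = 0; // function-template layers entered by the last submission

  // Common back end: run the kernel once per index.
  template <typename Kernel> void run(std::size_t n, const Kernel &k) {
    ++layers;
    for (std::size_t i = 0; i < n; ++i)
      k(i);
  }

  // Direct path: (range, kernel) goes straight to the back end.
  template <typename Kernel>
  void parallel_for_direct(std::size_t n, const Kernel &k) {
    layers = 1;
    run(n, k);
  }

  // Variadic path: the kernel is the last element of rest..., so the pack
  // has to be unpacked first. Reductions themselves are not modeled here.
  template <typename... Rest>
  void parallel_for_variadic(std::size_t n, Rest &&...rest) {
    static_assert(sizeof...(Rest) >= 1, "a kernel is required");
    layers = 1;
    dispatch(n, std::forward_as_tuple(std::forward<Rest>(rest)...),
             std::make_index_sequence<sizeof...(Rest) - 1>{});
  }

  // Extra layer that only exists on the variadic path.
  template <typename Tuple, std::size_t... Is>
  void dispatch(std::size_t n, Tuple &&args, std::index_sequence<Is...>) {
    ++layers;
    const auto &kernel = std::get<sizeof...(Is)>(args); // last element
    run(n, kernel);
  }
};

struct Result {
  std::vector<int> out;
  int layers;
};

// Adds two equally sized vectors through either entry point.
inline Result add_vectors(const std::vector<int> &lhs,
                          const std::vector<int> &rhs, bool use_variadic) {
  std::vector<int> out(lhs.size());
  auto kernel = [&](std::size_t i) { out[i] = lhs[i] + rhs[i]; };
  ToyHandler h;
  if (use_variadic)
    h.parallel_for_variadic(lhs.size(), kernel);
  else
    h.parallel_for_direct(lhs.size(), kernel);
  return {out, h.layers};
}
```

Both paths compute the same result; the only difference in this toy is the extra `dispatch` instantiation per kernel type. In the benchmark above, `std::visit` over three variants with seven alternatives each yields 7 × 7 × 7 = 343 distinct kernels, so any per-kernel template overhead is paid 343 times, which is presumably why a small saving per call shows up in the total compile time.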

Contributor

@steffenlarsen steffenlarsen left a comment

Seems reasonable! Less indirection is good.

@aelovikov-intel aelovikov-intel merged commit 567c077 into intel:sycl Apr 15, 2025
37 of 38 checks passed
@aelovikov-intel aelovikov-intel deleted the avoid-red-overloads branch April 15, 2025 14:47