Skip to content

[SYCL][Doc] Add auto_local_range extension #5950

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 4, 2022
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,204 @@
= sycl_ext_oneapi_auto_local_range

:source-highlighter: coderay
:coderay-linenums-mode: table

// This section needs to be after the document title.
:doctype: book
:toc2:
:toc: left
:encoding: utf-8
:lang: en
:dpcpp: pass:[DPC++]

// Set the default source code type in this document to C++,
// for syntax highlighting purposes. This is needed because
// docbook uses c++ and html5 uses cpp.
:language: {basebackend@docbook:c++:cpp}


== Notice

[%hardbreaks]
Copyright (C) 2022-2022 Intel Corporation. All rights reserved.

Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks
of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by
permission by Khronos.


== Contact

To report problems with this extension, please open a new issue at:

https://github.com/intel/llvm/issues


== Dependencies

This extension is written against the SYCL 2020 revision 4 specification. All
references below to the "core SYCL specification" or to section numbers in the
SYCL specification refer to that revision.


== Status

This is a proposed extension specification, intended to gather community
feedback. Interfaces defined in this specification may not be implemented yet
or may be in a preliminary state. The specification itself may also change in
incompatible ways before it is finalized. *Shipping software products should
not rely on APIs defined in this specification.*


== Overview

SYCL provides a basic form of `parallel_for` that allows developers to specify
the total number of work-items to launch (i.e. the size of the global range)
without also specifying the number of work-items per work-group (i.e. the size
of the local range). However, this form of `parallel_for` does not provide
access to the `sycl::group` or `sycl::sub_group` classes; if a developer wants
to use these classes, a kernel must be launched with a `sycl::nd_range` that
specifies both the global and local sizes.

The only way to allow an implementation to choose the local work-group size
with the ND-range form of `parallel_for` is to use kernel queries, following
an approach like the one shown below:

[source, c++]
----
auto bundle = sycl::get_kernel_bundle(q.get_context());
auto kernel = bundle.get_kernel<class KernelName>();
auto multiple = kernel.get_info<sycl::info::kernel_device_specific::preferred_work_group_size_multiple>(q.get_device());
auto max = kernel.get_info<sycl::info::kernel_device_specific::work_group_size>(q.get_device());
sycl::range<1> local;
if (N % multiple == 0) {
// Use largest work-group size compatible with preferred multiple
local = static_cast<size_t>(max / multiple) * multiple;
}
else {
local = /* find largest work-group size smaller than max that divides N */;
}
q.parallel_for<class KernelName>(sycl::nd_range<1>{N, local}, [=](sycl::nd_item<1>) {
/* kernel body */
});
----

The example above is very verbose, and requires developers to learn about
kernel naming, `sycl::kernel_bundle`, `sycl::kernel` and a number of
device/kernel queries. Extending the logic to support two- and
three-dimensional kernels complicates things further. With this extension, the
example simplifies to:

[source, c++]
----
q.parallel_for(sycl::nd_range<1>{N, sycl::ext::oneapi::experimental::auto_range<1>}, [=](sycl::nd_item<1>) {
/* kernel body */
});
----

The SYCL 2020 specification recommends that extensions should not alter
existing constructors without ensuring that one of the parameters comes from
the vendor's extension namespace. This restriction prevents this extension from
defaulting the second argument of the `sycl::nd_range` constructor to
`auto_range`, which would make the example above even simpler:

[source, c++]
----
q.parallel_for(sycl::nd_range<1>{N}, [=](sycl::nd_item<1>) {
/* kernel body */
});
----

If this extension is proposed for inclusion in a future SYCL standard, altering
the definition of `sycl::nd_range` should be considered.


== Specification

=== Feature test macro

This extension provides a feature-test macro as described in the core SYCL
specification. An implementation supporting this extension must predefine the
macro `SYCL_EXT_ONEAPI_AUTO_LOCAL_RANGE` to one of the values defined in the
table below. Applications can test for the existence of this macro to
determine if the implementation supports this feature, or applications can test
the macro's value to determine which of the extension's features the
implementation supports.

[%header,cols="1,5"]
|===
|Value
|Description

|1
|Initial version of this extension.
|===


=== `auto_range`

This extension defines a new `sycl::ext::oneapi::experimental::auto_range`
variable which can be used to define a `sycl::nd_range` with an unspecified
work-group size. If such a `sycl::nd_range` object is used to launch a SYCL
kernel, an implementation is free to launch the kernel with any valid
work-group size (as defined by the SYCL specification).

The manner in which a work-group size is selected is implementation-defined.
However, the total number of work-items launched by the kernel must match the
number specified as the first argument to the `sycl::nd_range` constructor;
an implementation may not adjust the size of the global range.

NOTE: Developers must take care to avoid awkward global range sizes when using
an `auto_range`, to avoid performance issues. Since implementations must still
choose a work-group size that divides the total number of work-items, the
implementation may be forced to choose a sub-optimal work-group size (e.g. if
the total number of work-items is a prime number, the work-group size must be
1). Although this division requirement holds for all SYCL kernels, it is easier
to miss when using an `auto_range`.


[source, c++]
----
namespace sycl {
namespace ext {
namespace oneapi {
namespace experimental {

template <int Dimensions>
static const inline range<Dimensions> auto_range = /* implementation-defined */;

}
}
}
}
----

NOTE: The `auto_range` variable is not `constexpr` because `sycl::range` is not
required to be a literal type by SYCL 2020. If this changes in the future, this
extension will be updated.


== Implementation notes

This non-normative section provides information about one possible
implementation of this extension. It is not part of the specification of the
extension's API.

The value of `auto_range` is implementation-defined to maximize freedom for
implementations. If an implementation wants to use a reserved value to
represent a request for an automatic local range, a range with every element
set to 0 has no valid interpretation in SYCL 2020 and may be a logical choice.
A trivial implementation of this extension can use a range with every element
set to 1, since this is always a valid local range.

Implementations using OpenCL backends can build on existing OpenCL
functionality where `clEnqueueNDRangeKernel` is called with a `local_work_size`
value of `NULL`.

Implementations using other backends (e.g. Level Zero, CUDA) can use a
combination of device and kernel queries to determine a good work-group size.


== Issues

None.