intel · bader · Apr 4, 2022 · Apr 1, 2022
@@ -0,0 +1,204 @@
+= sycl_ext_oneapi_auto_local_range
+
+:source-highlighter: coderay
+:coderay-linenums-mode: table
+
+// This section needs to be after the document title.
+:doctype: book
+:toc2:
+:toc: left
+:encoding: utf-8
+:lang: en
+:dpcpp: pass:[DPC++]
+
+// Set the default source code type in this document to C++,
+// for syntax highlighting purposes.  This is needed because
+// docbook uses c++ and html5 uses cpp.
+:language: {basebackend@docbook:c++:cpp}
+
+
+== Notice
+
+[%hardbreaks]
+Copyright (C) 2022-2022 Intel Corporation.  All rights reserved.
+
+Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks
+of The Khronos Group Inc.  OpenCL(TM) is a trademark of Apple Inc. used by
+permission by Khronos.
+
+
+== Contact
+
+To report problems with this extension, please open a new issue at:
+
+https://github.com/intel/llvm/issues
+
+
+== Dependencies
+
+This extension is written against the SYCL 2020 revision 4 specification.  All
+references below to the "core SYCL specification" or to section numbers in the
+SYCL specification refer to that revision.
+
+
+== Status
+
+This is a proposed extension specification, intended to gather community
+feedback.  Interfaces defined in this specification may not be implemented yet
+or may be in a preliminary state.  The specification itself may also change in
+incompatible ways before it is finalized.  *Shipping software products should
+not rely on APIs defined in this specification.*
+
+
+== Overview
+
+SYCL provides a basic form of `parallel_for` that allows developers to specify
+the total number of work-items to launch (i.e. the size of the global range)
+without also specifying the number of work-items per work-group (i.e. the size
+of the local range). However, this form of `parallel_for` does not provide
+access to the `sycl::group` or `sycl::sub_group` classes; if a developer wants
+to use these classes, a kernel must be launched with a `sycl::nd_range` that
+specifies both the global and local sizes.
+
+The only way to allow an implementation to choose the local work-group size
+with the ND-range form of `parallel_for` is to use kernel queries, following
+an approach like the one shown below:
+
+[source, c++]
+----
+auto bundle = sycl::get_kernel_bundle(q.get_context());
+auto kernel = bundle.get_kernel<class KernelName>();
+auto multiple = kernel.get_info<sycl::info::kernel_device_specific::preferred_work_group_size_multiple>(q.get_device());
+auto max = kernel.get_info<sycl::info::kernel_device_specific::work_group_size>(q.get_device());
+sycl::range<1> local;
+if (N % multiple == 0) {
+  // Use largest work-group size compatible with preferred multiple
+  local = static_cast<size_t>(max / multiple) * multiple;
+}
+else {
+  local = /* find largest work-group size smaller than max that divides N */;
+}
+q.parallel_for<class KernelName>(sycl::nd_range<1>{N, local}, [=](sycl::nd_item<1>) {
+  /* kernel body */
+});
+----
+
+The example above is very verbose, and requires developers to learn about
+kernel naming, `sycl::kernel_bundle`, `sycl::kernel` and a number of
+device/kernel queries. Extending the logic to support two- and
+three-dimensional kernels complicates things further. With this extension, the
+example simplifies to:
+
+[source, c++]
+----
+q.parallel_for(sycl::nd_range<1>{N, sycl::ext::oneapi::experimental::auto_range<1>}, [=](sycl::nd_item<1>) {
+  /* kernel body */
+});
+----
+
+The SYCL 2020 specification recommends that extensions should not alter
+existing constructors without ensuring that one of the parameters comes from
+the vendor's extension namespace. This restriction prevents this extension from
+defaulting the second argument of the `sycl::nd_range` constructor to
+`auto_range`, which would make the example above even simpler:
+
+[source, c++]
+----
+q.parallel_for(sycl::nd_range<1>{N}, [=](sycl::nd_item<1>) {
+  /* kernel body */
+});
+----
+
+If this extension is proposed for inclusion in a future SYCL standard, altering
+the definition of `sycl::nd_range` should be considered.
+
+
+== Specification
+
+=== Feature test macro
+
+This extension provides a feature-test macro as described in the core SYCL
+specification.  An implementation supporting this extension must predefine the
+macro `SYCL_EXT_ONEAPI_AUTO_LOCAL_RANGE` to one of the values defined in the
+table below.  Applications can test for the existence of this macro to
+determine if the implementation supports this feature, or applications can test
+the macro's value to determine which of the extension's features the
+implementation supports.
+
+[%header,cols="1,5"]
+|===
+|Value
+|Description
+
+|1
+|Initial version of this extension.
+|===
+
+
+=== `auto_range`
+
+This extension defines a new `sycl::ext::oneapi::experimental::auto_range`
+variable which can be used to define a `sycl::nd_range` with an unspecified
+work-group size. If such a `sycl::nd_range` object is used to launch a SYCL
+kernel, an implementation is free to launch the kernel with any valid
+work-group size (as defined by the SYCL specification).
+
+The manner in which a work-group size is selected is implementation-defined.
+However, the total number of work-items launched by the kernel must match the
+number specified as the first argument to the `sycl::nd_range` constructor;
+an implementation may not adjust the size of the global range.
+
+NOTE: Developers must take care to avoid awkward global range sizes when using
+an `auto_range`, to avoid performance issues. Since implementations must still
+choose a work-group size that divides the total number of work-items, the
+implementation may be forced to choose a sub-optimal work-group size (e.g. if
+the total number of work-items is a prime number, the work-group size must be
+1). Although this division requirement holds for all SYCL kernels, it is easier
+to miss when using an `auto_range`.
+
+
+[source, c++]
+----
+namespace sycl {
+namespace ext {
+namespace oneapi {
+namespace experimental {
+
+template <int Dimensions>
+static const inline range<Dimensions> auto_range = /* implementation-defined */;
+
+}
+}
+}
+}
+----
+
+NOTE: The `auto_range` variable is not `constexpr` because `sycl::range` is not
+required to be a literal type by SYCL 2020. If this changes in the future, this
+extension will be updated.
+
+
+== Implementation notes
+
+This non-normative section provides information about one possible
+implementation of this extension.  It is not part of the specification of the
+extension's API.
+
+The value of `auto_range` is implementation-defined to maximize freedom for
+implementations. If an implementation wants to use a reserved value to
+represent a request for an automatic local range, a range with every element
+set to 0 has no valid interpretation in SYCL 2020 and may be a logical choice.
+A trivial implementation of this extension can use a range with every element
+set to 1, since this is always a valid local range.
+
+Implementations using OpenCL backends can build on existing OpenCL
+functionality where `clEnqueueNDRangeKernel` is called with a `local_work_size`
+value of `NULL`.
+
+Implementations using other backends (e.g. Level Zero, CUDA) can use a
+combination of device and kernel queries to determine a good work-group size.
+
+
+== Issues
+
+None.