intel · aelovikov-intel · Jun 28, 2023 · Jun 7, 2023 · Jun 27, 2023 · Jun 27, 2023
@@ -0,0 +1,315 @@
+= sycl_ext_oneapi_work_group_local
+
+:source-highlighter: coderay
+:coderay-linenums-mode: table
+
+// This section needs to be after the document title.
+:doctype: book
+:toc2:
+:toc: left
+:encoding: utf-8
+:lang: en
+:dpcpp: pass:[DPC++]
+
+// Set the default source code type in this document to C++,
+// for syntax highlighting purposes.  This is needed because
+// docbook uses c++ and html5 uses cpp.
+:language: {basebackend@docbook:c++:cpp}
+
+
+== Notice
+
+[%hardbreaks]
+Copyright (C) 2023-2023 Intel Corporation.  All rights reserved.
+
+Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks
+of The Khronos Group Inc.  OpenCL(TM) is a trademark of Apple Inc. used by
+permission by Khronos.
+
+
+== Contact
+
+To report problems with this extension, please open a new issue at:
+
+https://github.com/intel/llvm/issues
+
+
+== Dependencies
+
+This extension is written against the SYCL 2020 revision 7 specification.  All
+references below to the "core SYCL specification" or to section numbers in the
+SYCL specification refer to that revision.
+
+The following extensions are required only for dynamic allocations:
+
+- link:../experimental/sycl_ext_oneapi_properties.asciidoc[sycl_ext_oneapi_properties]
+
+- link:../experimental/sycl_ext_oneapi_kernel_properties.asciidoc[sycl_ext_oneapi_kernel_properties]
+
+
+== Status
+
+This is a proposed extension specification, intended to gather community
+feedback.  Interfaces defined in this specification may not be implemented yet
+or may be in a preliminary state.  The specification itself may also change in
+incompatible ways before it is finalized.  *Shipping software products should
+not rely on APIs defined in this specification.*
+
+
+== Overview
+
+This extension defines a `sycl::ext::oneapi::experimental::work_group_local`
+class template with behavior inspired by the {cpp} `thread_local` keyword
+and the CUDA `+__shared__+` keyword.
+
+`work_group_local` variables can be allocated at global or function scope,
+lifting many of the restrictions in the existing
+link:../supported/sycl_ext_oneapi_local_memory.asciidoc[sycl_ext_oneapi_local_memory]
+extension. Note, however, that `work_group_local` variables currently place
+additional limits on the types that can be allocated, owing to differences in
+constructor behavior.
+
+
+== Specification
+
+=== Feature test macro
+
+This extension provides a feature-test macro as described in the core SYCL
+specification.  An implementation supporting this extension must predefine the
+macro `SYCL_EXT_ONEAPI_WORK_GROUP_LOCAL` to one of the values defined in the
+table below.  Applications can test for the existence of this macro to
+determine if the implementation supports this feature, or applications can test
+the macro's value to determine which of the extension's features the
+implementation supports.
+
+[%header,cols="1,5"]
+|===
+|Value
+|Description
+
+|1
+|The APIs of this experimental extension are not versioned, so the
+ feature-test macro always has this value.
+|===
+
+
+=== `work_group_local` class template
+
+The `work_group_local` class template acts as a view of an
+implementation-managed pointer to work-group local memory.
+
+[source,c++]
+----
+namespace sycl::ext::oneapi::experimental {
+
+template <typename T>
+class work_group_local {
+public:
+
+  work_group_local() = default;
+  work_group_local(const work_group_local&) = delete;
+  work_group_local& operator=(const work_group_local&) = delete;
+
+  operator T&() const noexcept;
+
+  // Available only if: std::is_array_v<T> == false
+  const work_group_local& operator=(const T& value) const noexcept;
+
+  T* operator&() const noexcept;
+
+private:
+  T* ptr; // exposition only
+
+};
+
+} // namespace sycl::ext::oneapi::experimental
+----
+
+`T` must be trivially constructible and trivially destructible.
+
+The storage for the object is allocated in work-group local memory before
+calling the user's kernel lambda, and deallocated when all work-items
+in the group have completed execution of the kernel.
+
+SYCL implementations conforming to the full feature set treat
+`work_group_local` similarly to the `thread_local` keyword, and when
+a `work_group_local` object is declared at block scope it behaves
+as if the `static` keyword was specified implicitly. SYCL implementations
+conforming to the reduced feature set require the `static` keyword to be
+specified explicitly.
+
+[NOTE]
+====
+If a `work_group_local` object is declared at function scope, the work-group
+local memory associated with the object will be identical for all usages of
+that function within the kernel. In cases where a function is called multiple
+times, developers must take care to avoid race conditions (e.g., by calling
+`group_barrier` before and after using the memory).
+====
+
+SYCL 2020 requires that all global variables accessed by a device function are
+`const` or `constexpr`. This extension lifts that restriction for
+`work_group_local` variables.
+
+[NOTE]
+====
+Since `work_group_local` acts as a view, wrapping an underlying pointer, a
+developer may still choose to declare variables as `const`.
+====
+
+When `T` is a class type or bounded array, the size of the allocation is known
+at compile-time, and a SYCL implementation may embed the size of the allocation
+directly within a kernel. Each instance of `work_group_local<T>` is associated
+with a unique allocation in work-group local memory.
+
+When `T` is an unbounded array, the size of the allocation is unknown at
+compile-time, and must be communicated to the SYCL implementation via the
+`work_group_local_memory_size` property. Every instance of `work_group_local`
+for which `T` is an unbounded array is associated with a single, shared,
+allocation in work-group local memory. For example, two instances declared as
+`work_group_local<int[]>` and `work_group_local<float[]>` will be associated
+with the same shared allocation.
+
+If the total amount of local memory requested (i.e., the sum of all memory
+requested by `local_accessor`, `group_local_memory`,
+`group_local_memory_for_overwrite` and `work_group_local`) exceeds a device's
+local memory capacity (as reported by `local_mem_size`) then the implementation
+must throw a synchronous `exception` with the `errc::memory_allocation` error
+code from the kernel invocation command (e.g. `parallel_for`).
+
+[source,c++]
+----
+operator T&() const noexcept;
+----
+_Returns_: A reference to the object stored in the work-group local memory
+associated with this instance of `work_group_local`.
+
+[source,c++]
+----
+const work_group_local<T>& operator=(const T& value) const noexcept;
+----
+_Constraints_: Available only if `std::is_array_v<T>>` is false.
+
+_Effects_: Replaces the value referenced by `*ptr` with `value`.
+
+_Returns_: A reference to this instance of `work_group_local`.
+
+[source,c++]
+----
+T* operator&() const noexcept;
+----
+_Returns_: A pointer to the work-group local memory associated with this
+instance of `work_group_local` (i.e., `ptr`).
+
+
+==== Kernel properties
+
+The `work_group_local_size` property must be passed to a kernel to determine
+the run-time size of the work-group local memory allocation associated with
+all `work_group_local` variables of unbounded array type.
+
+[source,c++]
+----
+namespace sycl::ext::oneapi::experimental {
+
+struct work_group_local_size {
+  constexpr work_group_local_size(size_t bytes) : value(bytes) {}
+  size_t value;
+}; // work_group_local_size
+
+using work_group_local_size_key = work_group_local_size;
+
+template <>struct is_property_key<work_group_local_size_key> : std::true_type {};
+
+} // namespace sycl::ext::oneapi::experimental
+----
+
+|===
+|Property|Description
+
+|`work_group_local_size`
+|The `work_group_local_size` property describes the amount of dynamic
+work-group local memory required by the kernel in bytes.
+
+|===
+
+
+==== Usage examples
+
+===== Allocations with size known at compile-time
+
+[source,c++]
+----
+using namespace syclex = sycl::ext::oneapi::experimental;
+
+/* optional: static const */ syclex::work_group_local<int> program_scope_scalar;
+/* optional: static const */ syclex::work_group_local<int[16]> program_scope_array;
+
+void foo() {
+  /* optional: static const */ syclex::work_group_local<int> function_scope_scalar;
+  function_scope_scalar = 1; // assignment via overloaded = operator
+  function_scope_scalar += 2; // += operator via implicit conversion to int&
+  int* ptr = &function_scope_scalar; // conversion to pointer via overloaded & operator
+}
+
+void bar() {
+  /* optional: static const */ sylex::work_group_local<int[64]> function_scope_array;
+  function_scope_array[0] = 1; // [] operator via implicit conversion to int(&)[64]
+  int* ptr = function_scope_array; // conversion to pointer via implicit conversion to int(&)[64]
+}
+----
+
+===== Allocations with size unknown at compile-time
+
+[source,c++]
+----
+using namespace syclex = sycl::ext::oneapi::experimental;
+
+/* optional: static const */ syclex::work_group_local<int[]> dynamic_program_scope_array;
+
+...
+
+q.parallel_for(sycl::nd_range<1>{N, M},
+  syclex::properties{syclex::work_group_local_size(M * sizeof(int))},
+  [=](sycl::nd_item<1> it) {
+  ...
+});
+----
+
+
+== Implementation notes
+
+This non-normative section provides information about one possible
+implementation of this extension.  It is not part of the specification of the
+extension's API.
+
+For class types and bounded arrays, the class can be implemented on top of
+the existing `__sycl_allocateLocalMemory` intrinsic:
+[source,c++]
+----
+#ifdef __SYCL_DEVICE_ONLY__
+  __attribute__((opencl_local)) T *ptr = reinterpret_cast<__attribute__((opencl_local)) T *>(__sycl_allocateLocalMemory(sizeof(T), alignof(T)));
+#else
+  T *ptr{};
+#endif
+----
+
+Note, however, that implementing the correct semantics may require some
+adjustment to the handling of this intrinsic. A simple class as written above
+would create a separate allocation for every call to an inlined function.
+Creating work-group local allocations should be handled before inlining to
+prevent this.
+
+For unbounded arrays, a separate specialization of the class will be required,
+and the implementation may need to generate some additional code to
+appropriately initialize the pointer(s) wrapped by `work_group_local` objects.
+Alternatively, it may be possible to initialize the pointer to the beginning
+of the device's local memory region (if that value is known). Either way, the
+implementation must account for the existence of one or more `local_accessor`
+objects (which themselves may allocate a dynamic amount of work-group local
+memory).
+
+
+== Issues
+
+None.