Skip to content

[SYCL][DOC] Add sycl_ext_oneapi_work_group_local #9799

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,315 @@
= sycl_ext_oneapi_work_group_local

:source-highlighter: coderay
:coderay-linenums-mode: table

// This section needs to be after the document title.
:doctype: book
:toc2:
:toc: left
:encoding: utf-8
:lang: en
:dpcpp: pass:[DPC++]

// Set the default source code type in this document to C++,
// for syntax highlighting purposes. This is needed because
// docbook uses c++ and html5 uses cpp.
:language: {basebackend@docbook:c++:cpp}


== Notice

[%hardbreaks]
Copyright (C) 2023-2023 Intel Corporation. All rights reserved.

Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks
of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by
permission by Khronos.


== Contact

To report problems with this extension, please open a new issue at:

https://github.com/intel/llvm/issues


== Dependencies

This extension is written against the SYCL 2020 revision 7 specification. All
references below to the "core SYCL specification" or to section numbers in the
SYCL specification refer to that revision.

The following extensions are required only for dynamic allocations:

- link:../experimental/sycl_ext_oneapi_properties.asciidoc[sycl_ext_oneapi_properties]

- link:../experimental/sycl_ext_oneapi_kernel_properties.asciidoc[sycl_ext_oneapi_kernel_properties]


== Status

This is a proposed extension specification, intended to gather community
feedback. Interfaces defined in this specification may not be implemented yet
or may be in a preliminary state. The specification itself may also change in
incompatible ways before it is finalized. *Shipping software products should
not rely on APIs defined in this specification.*


== Overview

This extension defines a `sycl::ext::oneapi::experimental::work_group_local`
class template with behavior inspired by the {cpp} `thread_local` keyword
and the CUDA `+__shared__+` keyword.

`work_group_local` variables can be allocated at global or function scope,
lifting many of the restrictions in the existing
link:../supported/sycl_ext_oneapi_local_memory.asciidoc[sycl_ext_oneapi_local_memory]
extension. Note, however, that `work_group_local` variables currently place
additional limits on the types that can be allocated, owing to differences in
constructor behavior.


== Specification

=== Feature test macro

This extension provides a feature-test macro as described in the core SYCL
specification. An implementation supporting this extension must predefine the
macro `SYCL_EXT_ONEAPI_WORK_GROUP_LOCAL` to one of the values defined in the
table below. Applications can test for the existence of this macro to
determine if the implementation supports this feature, or applications can test
the macro's value to determine which of the extension's features the
implementation supports.

[%header,cols="1,5"]
|===
|Value
|Description

|1
|The APIs of this experimental extension are not versioned, so the
feature-test macro always has this value.
|===


=== `work_group_local` class template

The `work_group_local` class template acts as a view of an
implementation-managed pointer to work-group local memory.

[source,c++]
----
namespace sycl::ext::oneapi::experimental {

template <typename T>
class work_group_local {
public:

work_group_local() = default;
work_group_local(const work_group_local&) = delete;
work_group_local& operator=(const work_group_local&) = delete;

operator T&() const noexcept;

// Available only if: std::is_array_v<T> == false
const work_group_local& operator=(const T& value) const noexcept;

T* operator&() const noexcept;

private:
T* ptr; // exposition only

};

} // namespace sycl::ext::oneapi::experimental
----

`T` must be trivially constructible and trivially destructible.

The storage for the object is allocated in work-group local memory before
calling the user's kernel lambda, and deallocated when all work-items
in the group have completed execution of the kernel.

SYCL implementations conforming to the full feature set treat
`work_group_local` similarly to the `thread_local` keyword, and when
a `work_group_local` object is declared at block scope it behaves
as if the `static` keyword was specified implicitly. SYCL implementations
conforming to the reduced feature set require the `static` keyword to be
specified explicitly.

[NOTE]
====
If a `work_group_local` object is declared at function scope, the work-group
local memory associated with the object will be identical for all usages of
that function within the kernel. In cases where a function is called multiple
times, developers must take care to avoid race conditions (e.g., by calling
`group_barrier` before and after using the memory).
====

SYCL 2020 requires that all global variables accessed by a device function are
`const` or `constexpr`. This extension lifts that restriction for
`work_group_local` variables.

[NOTE]
====
Since `work_group_local` acts as a view, wrapping an underlying pointer, a
developer may still choose to declare variables as `const`.
====

When `T` is a class type or bounded array, the size of the allocation is known
at compile-time, and a SYCL implementation may embed the size of the allocation
directly within a kernel. Each instance of `work_group_local<T>` is associated
with a unique allocation in work-group local memory.

When `T` is an unbounded array, the size of the allocation is unknown at
compile-time, and must be communicated to the SYCL implementation via the
`work_group_local_memory_size` property. Every instance of `work_group_local`
for which `T` is an unbounded array is associated with a single, shared,
allocation in work-group local memory. For example, two instances declared as
`work_group_local<int[]>` and `work_group_local<float[]>` will be associated
with the same shared allocation.

If the total amount of local memory requested (i.e., the sum of all memory
requested by `local_accessor`, `group_local_memory`,
`group_local_memory_for_overwrite` and `work_group_local`) exceeds a device's
local memory capacity (as reported by `local_mem_size`) then the implementation
must throw a synchronous `exception` with the `errc::memory_allocation` error
code from the kernel invocation command (e.g. `parallel_for`).

[source,c++]
----
operator T&() const noexcept;
----
_Returns_: A reference to the object stored in the work-group local memory
associated with this instance of `work_group_local`.

[source,c++]
----
const work_group_local<T>& operator=(const T& value) const noexcept;
----
_Constraints_: Available only if `std::is_array_v<T>>` is false.

_Effects_: Replaces the value referenced by `*ptr` with `value`.

_Returns_: A reference to this instance of `work_group_local`.

[source,c++]
----
T* operator&() const noexcept;
----
_Returns_: A pointer to the work-group local memory associated with this
instance of `work_group_local` (i.e., `ptr`).


==== Kernel properties

The `work_group_local_size` property must be passed to a kernel to determine
the run-time size of the work-group local memory allocation associated with
all `work_group_local` variables of unbounded array type.

[source,c++]
----
namespace sycl::ext::oneapi::experimental {

struct work_group_local_size {
constexpr work_group_local_size(size_t bytes) : value(bytes) {}
size_t value;
}; // work_group_local_size

using work_group_local_size_key = work_group_local_size;

template <>struct is_property_key<work_group_local_size_key> : std::true_type {};

} // namespace sycl::ext::oneapi::experimental
----

|===
|Property|Description

|`work_group_local_size`
|The `work_group_local_size` property describes the amount of dynamic
work-group local memory required by the kernel in bytes.

|===


==== Usage examples

===== Allocations with size known at compile-time

[source,c++]
----
using namespace syclex = sycl::ext::oneapi::experimental;

/* optional: static const */ syclex::work_group_local<int> program_scope_scalar;
/* optional: static const */ syclex::work_group_local<int[16]> program_scope_array;

void foo() {
/* optional: static const */ syclex::work_group_local<int> function_scope_scalar;
function_scope_scalar = 1; // assignment via overloaded = operator
function_scope_scalar += 2; // += operator via implicit conversion to int&
int* ptr = &function_scope_scalar; // conversion to pointer via overloaded & operator
}

void bar() {
/* optional: static const */ sylex::work_group_local<int[64]> function_scope_array;
function_scope_array[0] = 1; // [] operator via implicit conversion to int(&)[64]
int* ptr = function_scope_array; // conversion to pointer via implicit conversion to int(&)[64]
}
----

===== Allocations with size unknown at compile-time

[source,c++]
----
using namespace syclex = sycl::ext::oneapi::experimental;

/* optional: static const */ syclex::work_group_local<int[]> dynamic_program_scope_array;

...

q.parallel_for(sycl::nd_range<1>{N, M},
syclex::properties{syclex::work_group_local_size(M * sizeof(int))},
[=](sycl::nd_item<1> it) {
...
});
----


== Implementation notes

This non-normative section provides information about one possible
implementation of this extension. It is not part of the specification of the
extension's API.

For class types and bounded arrays, the class can be implemented on top of
the existing `__sycl_allocateLocalMemory` intrinsic:
[source,c++]
----
#ifdef __SYCL_DEVICE_ONLY__
__attribute__((opencl_local)) T *ptr = reinterpret_cast<__attribute__((opencl_local)) T *>(__sycl_allocateLocalMemory(sizeof(T), alignof(T)));
#else
T *ptr{};
#endif
----

Note, however, that implementing the correct semantics may require some
adjustment to the handling of this intrinsic. A simple class as written above
would create a separate allocation for every call to an inlined function.
Creating work-group local allocations should be handled before inlining to
prevent this.

For unbounded arrays, a separate specialization of the class will be required,
and the implementation may need to generate some additional code to
appropriately initialize the pointer(s) wrapped by `work_group_local` objects.
Alternatively, it may be possible to initialize the pointer to the beginning
of the device's local memory region (if that value is known). Either way, the
implementation must account for the existence of one or more `local_accessor`
objects (which themselves may allocate a dynamic amount of work-group local
memory).


== Issues

None.