-
Notifications
You must be signed in to change notification settings - Fork 788
[SYCL][Doc] Update sub-group extension docs #1330
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Splits sub-group functionality into three extensions: - SubGroup (sub_group class and device queries) - SubGroupAlgorithms (GroupAlgorithm support and permute) - GroupMask (sub_group::mask_type and ballot) Signed-off-by: John Pennycook <[email protected]>
Signed-off-by: John Pennycook <[email protected]>
Signed-off-by: John Pennycook <[email protected]>
@Pennycook, we probably need to do the same with the spec, or at least let's leave a temporary file in the previous location of the spec, which redirects to the new specs and provides a link to the previous version as well |
Signed-off-by: John Pennycook <[email protected]>
Signed-off-by: John Pennycook <[email protected]>
That's a good idea. I like the temporary file with redirects solution, because forcing people to go through that extra step means they can't accidentally miss that things have changed. |
|
||
== Overview | ||
|
||
A sub-group represents an implementation-defined grouping of work-items in a work-group. The work-items within a sub-group can communicate and synchronize independently of work-items in other sub-groups, and sub-groups are therefore commonly mapped to SIMD hardware where it exists. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In OpenCL, subgroups are always 1-dimensional, do we want to clearly state the same for SYCL?
Within a work-group work-items may be divided into sub-groups. The mapping of work-items to
sub-groups is implementation-defined and may be queried at runtime. While sub-groups may be
used in multi-dimensional work-groups, each sub-group is 1-dimensional and any given workitem may query which sub-group it is a member of.
However, even if this spelling theoretically allows 2d-subgroups, methods of the sub_group
class below suggest that they are 1-dimensional.
BTW, OpenCL also provides some guarantees on work-items to sub-groups mapping during kernel execution, between dispatches of a given kernel and query operations:
Work items are mapped into sub-groups through a combination of compile-time decisions and
the parameters of the dispatch. The mapping to sub-groups is invariant for the duration of a
kernel’s execution, across dispatches of a given kernel with the same work-group dimensions,
between dispatches and query operations consistent with the dispatch parameterization, and from
one work-group to another within the dispatch (excluding the trailing edge work-groups in the
presence of non-uniform work-group sizes). In addition, all sub-groups within a work-group will
be the same size, apart from the sub-group with the maximum index which may be smaller if the
size of the work-group is not evenly divisible by the size of the sub-groups.In the degenerate case, a single sub-group must be supported for each work-group. In this
situation all sub-group scope functions are equivalent to their work-group level equivalents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to partially address this. We need to consider the degenerate case more, as I'm not sure we necessarily need to follow OpenCL here.
The permute sub-group functions perform arbitrary communication between pairs of work-items in a sub-group. Common patterns -- such as shifting all values in a sub-group by a fixed number of work-items, or reversing the order of all values in a sub-group -- are exposed as specialized functions that may be accelerated in hardware. | ||
|
||
|=== | ||
|Function|Description |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previous version of the spec had equivalents of the following built-in functions from cl_intel_subgroups
OpenCL extension: intel_sub_group_shuffle_down
and intel_sub_group_shuffle_up
, while this algorithms library doesn't describe them.
I see that shift_left
and shift_right
are the most closer ones, but unlike the OpenCL extension, delta
has to be the same across all work-items in a sub-group, which basically disallows cycle shifts. I don't know how important that use-case is, but this functions will almost, I guess, result in undefined elements in some elements of sub-group
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a deliberate change. Other architectures can only implement shuffle_down
/shuffle_up
more efficiently than a general shuffle if it's known to be a shift (i.e. that delta
is uniform). Developers can mimic the original behavior by using permute
where the indices are computed from offsets.
It hadn't occurred to me that developers might be using non-uniform delta
to implement cycle shifts. If that use-case is important, we should consider adding an equivalent of std::rotate
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It hadn't occurred to me that developers might be using non-uniform delta to implement cycle shifts. If that use-case is important, we should consider adding an equivalent of std::rotate.
I don't have any data about amount of such usages, but I guess this shouldn't be a blocker for this PR - we can always add such if requested
sycl/doc/extensions/SubGroupAlgorithms/SYCL_INTEL_sub_group_algorithms.asciidoc
Show resolved
Hide resolved
Signed-off-by: John Pennycook <[email protected]>
Signed-off-by: John Pennycook <[email protected]>
Text before tables now identifies which class is being discussed. Signed-off-by: John Pennycook <[email protected]>
Signed-off-by: John Pennycook <[email protected]>
Signed-off-by: John Pennycook <[email protected]>
Unclear what this means if non-uniform work-groups are not supported. Signed-off-by: John Pennycook <[email protected]>
Signed-off-by: John Pennycook <[email protected]>
Signed-off-by: John Pennycook <[email protected]>
T inclusive_scan(T x, BinaryOp binary_op) const; | ||
|
||
template <typename T, class BinaryOp> | ||
T inclusive_scan(T x, BinaryOp binary_op, T init) const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure where is the best place to ask this, but inclusive_scan
seems inconsistent with exclusive_scan
and reduce
- init
is the second argument in the other two functions, but the third argument here. This is also true of the document here: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/GroupAlgorithms/SYCL_INTEL_group_algorithms.asciidoc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The inconsistency here is unfortunate, but we inherited it from standard C++. See std::reduce
, std::exclusive_scan
and std::inclusive_scan
.
My understanding is that this ordering is aligned with the ability to provide defaults for each argument: an inclusive scan is well defined without an initial value, but an exclusive scan is not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I wasn't aware of that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is more for a mathematical reason than a C++ inconsistency.
For example for +
the neutral element is 0 but for *
the neutral element is 1. Obviously for all other data types and other operations it it is different, not even talking about non-mathematical use of these prefix operations, such as implementing a parser with scans...
Signed-off-by: John Pennycook <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that I'm the right person to approve this PR, but at least I don't have any other questions or comments, LGTM
* sycl: (1209 commits) [SYCL] Check exit status get_device_count_by_type [SYCL][Doc] Update sub-group extension docs (intel#1330) [SYCL][Doc] Add leader to GroupAlgorithms (intel#1297) [SYCL] Add SYCL headers search path to default compilation options (intel#1347) [SYCL][PI] Add interoperability with generic handles to device and program classes (intel#1244) Move SPIR devicelib to top level (intel#1276) [SYCL][Driver] Improve fat static library support (intel#1319) [SYCL] Remove image_api LIT (intel#1349) [SYCL] Fix headers location for check-sycl-deploy target [SYCL] Allow gcc asm statements in kernel code (intel#1341) [SYCL] Add Intel FPGA force_pow2_depth attribute (intel#1284) [SPIR-V][NFC] Fix for building llvm-spirv with -DLLVM_LINK_LLVM_DYLIB=ON (intel#1323) [SYCL][NFC] Fix execution graph dump (intel#1331) [SYCL][Doc] Release SYCL_INTEL_enqueue_barrier extension document (intel#1199) [SYCL][USM] Fix USM malloc_shared and free to handle zero byte (intel#1273) [SYCL] Fix undefined symbols in async_work_group_copy (intel#1243) [SYCL] Mark calls to barrier and work-item functions as convergent [SYCL][CUDA] Fix CUDA plug-in build with enabled assertions (intel#1325) [SYCL][Test] Add OpenCL requirement to test/ordered_queue/prop.cpp (intel#1335) [SYCL][CUDA] Improve CUDA backend documentation (intel#1293) ...
|+info::kernel_sub_group::max_num_sub_groups+ | ||
|+uint32_t+ | ||
|Returns the maximum number of sub-groups for this kernel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just now realized that the list of info descriptors was reduced and previously we had:
info::kernel_sub_group::local_size_for_sub_group_count
- removed nowinfo::kernel_sub_group::sub_group_count_for_ndrange
- removed nowinfo::kernel_sub_group::max_sub_group_size_for_ndrange
- replaced withinfo::kernel_sub_group::max_sub_group_size
I agree that (1) is not really needed and can be replaced by the following sequence:
- ensure that desired number of sub-groups doesn't exceed
max_num_sub_groups
- query sub-group size selected for the kernel
- multiple desired number of sub-groups by selected sub-group size
- ensure that calculated value doesn't exceed maximum work-group size for a kernel
- if previous check passed, use calculated work-group size
(2) is even easier to replace - once you know selected sub-group size, you can easily calculate amount of sub-groups per work-group
So, (1) and (2) depend on selected sub-groups size for a kernel, which is intended to be returned by (3) and here is a problem: according to OpenCL 2.1 spec, rev. 23:
Work items are mapped into sub-groups through a combination of compile-time decisions *and
the parameters of the dispatch*. The mapping to sub-groups is invariant for the duration of a
kernel’s execution, *across dispatches of a given kernel with the same work-group dimensions*,
between dispatches and query operations consistent with the dispatch parameterization, and from
one work-group to another within the dispatch (excluding the trailing edge work-groups in the
presence of non-uniform work-group sizes).
Spec explicitly states that selected sub-groups size for a particular kernel might be changed from one dispatch to another if work-group size is changed.
Strictly speaking, we cannot query exact max_sub_group_size
from OpenCL RT without specifying exact work-group size which will be used for kernel dispatch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delayed response. I've opened #1565 to address this.
Splits sub-group functionality into three extensions:
Signed-off-by: John Pennycook [email protected]
When this moves to implementation, we should mark the existing sub-group functions as deprecated rather than suddenly removing them and replacing them with free-functions.