Skip to content

[SYCL][Doc] Update sub-group extension docs #1330

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Mar 19, 2020
Merged

Conversation

Pennycook
Copy link
Contributor

Splits sub-group functionality into three extensions:

  • SubGroup (sub_group class and device queries)
  • SubGroupAlgorithms (GroupAlgorithm support and permute)
  • GroupMask (sub_group::mask_type and ballot)

Signed-off-by: John Pennycook [email protected]


When this moves to implementation, we should mark the existing sub-group functions as deprecated rather than suddenly removing them and replacing them with free-functions.

Splits sub-group functionality into three extensions:
- SubGroup (sub_group class and device queries)
- SubGroupAlgorithms (GroupAlgorithm support and permute)
- GroupMask (sub_group::mask_type and ballot)

Signed-off-by: John Pennycook <[email protected]>
@Pennycook Pennycook added the spec extension All issues/PRs related to extensions specifications label Mar 16, 2020
mkinsner
mkinsner previously approved these changes Mar 16, 2020
@AlexeySachkov
Copy link
Contributor

we should mark the existing sub-group functions as deprecated rather than suddenly removing them and replacing them with free-functions.

@Pennycook, we probably need to do the same with the spec, or at least let's leave a temporary file in the previous location of the spec, which redirects to the new specs and provides a link to the previous version as well

@Pennycook
Copy link
Contributor Author

@Pennycook, we probably need to do the same with the spec, or at least let's leave a temporary file in the previous location of the spec, which redirects to the new specs and provides a link to the previous version as well

That's a good idea. I like the temporary file with redirects solution, because forcing people to go through that extra step means they can't accidentally miss that things have changed.


== Overview

A sub-group represents an implementation-defined grouping of work-items in a work-group. The work-items within a sub-group can communicate and synchronize independently of work-items in other sub-groups, and sub-groups are therefore commonly mapped to SIMD hardware where it exists.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In OpenCL, subgroups are always 1-dimensional, do we want to clearly state the same for SYCL?

Within a work-group work-items may be divided into sub-groups. The mapping of work-items to
sub-groups is implementation-defined and may be queried at runtime. While sub-groups may be
used in multi-dimensional work-groups, each sub-group is 1-dimensional and any given workitem may query which sub-group it is a member of.

However, even if this spelling theoretically allows 2d-subgroups, methods of the sub_group class below suggest that they are 1-dimensional.

BTW, OpenCL also provides some guarantees on work-items to sub-groups mapping during kernel execution, between dispatches of a given kernel and query operations:

Work items are mapped into sub-groups through a combination of compile-time decisions and
the parameters of the dispatch. The mapping to sub-groups is invariant for the duration of a
kernel’s execution, across dispatches of a given kernel with the same work-group dimensions,
between dispatches and query operations consistent with the dispatch parameterization, and from
one work-group to another within the dispatch (excluding the trailing edge work-groups in the
presence of non-uniform work-group sizes). In addition, all sub-groups within a work-group will
be the same size, apart from the sub-group with the maximum index which may be smaller if the
size of the work-group is not evenly divisible by the size of the sub-groups.

In the degenerate case, a single sub-group must be supported for each work-group. In this
situation all sub-group scope functions are equivalent to their work-group level equivalents.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to partially address this. We need to consider the degenerate case more, as I'm not sure we necessarily need to follow OpenCL here.

The permute sub-group functions perform arbitrary communication between pairs of work-items in a sub-group. Common patterns -- such as shifting all values in a sub-group by a fixed number of work-items, or reversing the order of all values in a sub-group -- are exposed as specialized functions that may be accelerated in hardware.

|===
|Function|Description
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous version of the spec had equivalents of the following built-in functions from cl_intel_subgroups OpenCL extension: intel_sub_group_shuffle_down and intel_sub_group_shuffle_up, while this algorithms library doesn't describe them.

I see that shift_left and shift_right are the most closer ones, but unlike the OpenCL extension, delta has to be the same across all work-items in a sub-group, which basically disallows cycle shifts. I don't know how important that use-case is, but this functions will almost, I guess, result in undefined elements in some elements of sub-group

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a deliberate change. Other architectures can only implement shuffle_down/shuffle_up more efficiently than a general shuffle if it's known to be a shift (i.e. that delta is uniform). Developers can mimic the original behavior by using permute where the indices are computed from offsets.

It hadn't occurred to me that developers might be using non-uniform delta to implement cycle shifts. If that use-case is important, we should consider adding an equivalent of std::rotate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It hadn't occurred to me that developers might be using non-uniform delta to implement cycle shifts. If that use-case is important, we should consider adding an equivalent of std::rotate.

I don't have any data about amount of such usages, but I guess this shouldn't be a blocker for this PR - we can always add such if requested

Text before tables now identifies which class is being discussed.

Signed-off-by: John Pennycook <[email protected]>
Unclear what this means if non-uniform work-groups are not supported.

Signed-off-by: John Pennycook <[email protected]>
Signed-off-by: John Pennycook <[email protected]>
T inclusive_scan(T x, BinaryOp binary_op) const;

template <typename T, class BinaryOp>
T inclusive_scan(T x, BinaryOp binary_op, T init) const;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure where is the best place to ask this, but inclusive_scan seems inconsistent with exclusive_scan and reduce - init is the second argument in the other two functions, but the third argument here. This is also true of the document here: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/GroupAlgorithms/SYCL_INTEL_group_algorithms.asciidoc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inconsistency here is unfortunate, but we inherited it from standard C++. See std::reduce, std::exclusive_scan and std::inclusive_scan.

My understanding is that this ordering is aligned with the ability to provide defaults for each argument: an inclusive scan is well defined without an initial value, but an exclusive scan is not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I wasn't aware of that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is more for a mathematical reason than a C++ inconsistency.

For example for + the neutral element is 0 but for * the neutral element is 1. Obviously for all other data types and other operations it it is different, not even talking about non-mathematical use of these prefix operations, such as implementing a parser with scans...

Copy link
Contributor

@AlexeySachkov AlexeySachkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that I'm the right person to approve this PR, but at least I don't have any other questions or comments, LGTM

@bader bader merged commit d9b178f into intel:sycl Mar 19, 2020
@Pennycook Pennycook deleted the sub-group-docs branch March 19, 2020 17:06
alexbatashev pushed a commit to alexbatashev/llvm that referenced this pull request Mar 20, 2020
* sycl: (1209 commits)
  [SYCL] Check exit status get_device_count_by_type
  [SYCL][Doc] Update sub-group extension docs (intel#1330)
  [SYCL][Doc] Add leader to GroupAlgorithms (intel#1297)
  [SYCL] Add SYCL headers search path to default compilation options (intel#1347)
  [SYCL][PI] Add interoperability with generic handles to device and program classes (intel#1244)
  Move SPIR devicelib to top level (intel#1276)
  [SYCL][Driver] Improve fat static library support (intel#1319)
  [SYCL] Remove image_api LIT (intel#1349)
  [SYCL] Fix headers location for check-sycl-deploy target
  [SYCL] Allow gcc asm statements in kernel code (intel#1341)
  [SYCL] Add Intel FPGA force_pow2_depth attribute (intel#1284)
  [SPIR-V][NFC] Fix for building llvm-spirv with -DLLVM_LINK_LLVM_DYLIB=ON (intel#1323)
  [SYCL][NFC] Fix execution graph dump (intel#1331)
  [SYCL][Doc] Release SYCL_INTEL_enqueue_barrier extension document (intel#1199)
  [SYCL][USM] Fix USM malloc_shared and free to handle zero byte (intel#1273)
  [SYCL] Fix undefined symbols in async_work_group_copy (intel#1243)
  [SYCL] Mark calls to barrier and work-item functions as convergent
  [SYCL][CUDA] Fix CUDA plug-in build with enabled assertions (intel#1325)
  [SYCL][Test] Add OpenCL requirement to test/ordered_queue/prop.cpp (intel#1335)
  [SYCL][CUDA] Improve CUDA backend documentation (intel#1293)
  ...
Comment on lines +128 to +130
|+info::kernel_sub_group::max_num_sub_groups+
|+uint32_t+
|Returns the maximum number of sub-groups for this kernel.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Pennycook,

I've just now realized that the list of info descriptors was reduced and previously we had:

  1. info::kernel_sub_group::local_size_for_sub_group_count - removed now
  2. info::kernel_sub_group::sub_group_count_for_ndrange - removed now
  3. info::kernel_sub_group::max_sub_group_size_for_ndrange - replaced with info::kernel_sub_group::max_sub_group_size

I agree that (1) is not really needed and can be replaced by the following sequence:

  • ensure that desired number of sub-groups doesn't exceed max_num_sub_groups
  • query sub-group size selected for the kernel
  • multiple desired number of sub-groups by selected sub-group size
  • ensure that calculated value doesn't exceed maximum work-group size for a kernel
  • if previous check passed, use calculated work-group size

(2) is even easier to replace - once you know selected sub-group size, you can easily calculate amount of sub-groups per work-group

So, (1) and (2) depend on selected sub-groups size for a kernel, which is intended to be returned by (3) and here is a problem: according to OpenCL 2.1 spec, rev. 23:

Work items are mapped into sub-groups through a combination of compile-time decisions *and
the parameters of the dispatch*
. The mapping to sub-groups is invariant for the duration of a
kernel’s execution, *across dispatches of a given kernel with the same work-group dimensions*,
between dispatches and query operations consistent with the dispatch parameterization, and from
one work-group to another within the dispatch (excluding the trailing edge work-groups in the
presence of non-uniform work-group sizes).

Spec explicitly states that selected sub-groups size for a particular kernel might be changed from one dispatch to another if work-group size is changed.

Strictly speaking, we cannot query exact max_sub_group_size from OpenCL RT without specifying exact work-group size which will be used for kernel dispatch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delayed response. I've opened #1565 to address this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spec extension All issues/PRs related to extensions specifications
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants