Skip to content

[SYCL][Matrix] Add get-coord API and general query example #7964

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 53 commits into from
Aug 28, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
53 commits
Select commit Hold shift + click to select a range
9628bd0
- Remove the general query from TODO list as an example is added to t…
dkhaldi Jan 9, 2023
39875df
add an other distribution example
dkhaldi Jan 9, 2023
e42ef4a
add revision history
dkhaldi Jan 10, 2023
8bb98c1
Bader comments
dkhaldi Jan 10, 2023
48386d6
better wording
dkhaldi Jan 10, 2023
1e85155
Incorporate Greg comments and other improvements, specifically:
dkhaldi Jan 12, 2023
6f91525
Update the specification document to follow the formal template
dkhaldi Jan 30, 2023
cdcab5a
add tf32 type and conversion function
dkhaldi Jan 30, 2023
04e18fe
correct the matrix types in the appendix
dkhaldi Jan 30, 2023
9403a38
correct the matrix types in the appendix
dkhaldi Jan 30, 2023
ddb87f1
remove _t from the types
dkhaldi Jan 30, 2023
8a8e0a9
Specify in Status that joint matrix is an optional kernel feature
dkhaldi Feb 4, 2023
7e610aa
Move the iteration-style EWOps to the Intel extension and introduce j…
dkhaldi Feb 9, 2023
509056c
Address Jack's comments
dkhaldi Feb 10, 2023
805630c
Add get_info runtime query
dkhaldi Feb 13, 2023
20c09c9
reword the optional device feature checking
dkhaldi Feb 14, 2023
a7494c8
Address Greg's comments
dkhaldi Feb 28, 2023
7159591
Incorporate the last batch of Greg's comments
dkhaldi Feb 28, 2023
5b9fdfc
incorporate Greg's comments: query syntax
dkhaldi Mar 2, 2023
e0f683e
use sycl::ext::oneapi::experimental::architecture and remove scope query
dkhaldi Mar 2, 2023
008dbfc
fix the comments formatting
dkhaldi Mar 2, 2023
efb103a
- Add overloads and explanation for each of the API in the tf32 section
dkhaldi Mar 6, 2023
e69ff85
typo
dkhaldi Mar 6, 2023
6868a37
Address Greg's comments in the Intel extension
dkhaldi Mar 11, 2023
fb70d27
Add overload of joint matrix apply where row and col are provided
dkhaldi Mar 20, 2023
433e65a
Address Greg's comments: change packed name, add tf32 rounding mode, …
dkhaldi Mar 23, 2023
f5694eb
fix formatting
dkhaldi Mar 23, 2023
862880e
Address Greg's comments: remove loop-based indexing, add Td and defau…
dkhaldi Apr 24, 2023
885cf09
Incorporate Greg's suggestions
dkhaldi May 23, 2023
d0a81af
Incorporate Greg's small comments in intel-specific spec
dkhaldi May 23, 2023
cd41588
Rename folder name, add primary definition of matrix_params
dkhaldi May 25, 2023
0bf47c9
Add missing const to multi_ptr
dkhaldi May 25, 2023
15306d6
- Add copy function; - Add clarification about copy constructor and a…
dkhaldi May 30, 2023
bee344e
small typo correction
dkhaldi May 31, 2023
e5648e4
Remove default copy constructor and assign op
dkhaldi Jun 7, 2023
e22d057
fixed merge conflicts without merging and add Jack's Nvidia combinati…
dkhaldi Jun 8, 2023
0b4eecc
Remove the oneapi matrix folder that is replaced here by matrix folde…
dkhaldi Jun 8, 2023
8d80ad6
Add old folder to try to fix conflicts
dkhaldi Jun 9, 2023
1059870
Merge branch 'intel:sycl' into get-coord-doc
dkhaldi Jun 9, 2023
35c8744
remove the old folder that resulted from the merge with sycl branch
dkhaldi Jun 9, 2023
d63bdb8
address Greg's comments: change Nvidia table, minor formatting
dkhaldi Jun 29, 2023
7bfb8e5
corrected two types in the Nvidia table
dkhaldi Jun 29, 2023
08fd2db
address Greg, Jack, and Alexey comments
dkhaldi Jul 28, 2023
d7d0a70
Clarify use of must when referring to the query interface
dkhaldi Jul 31, 2023
bf8e00c
Address Greg's comments: fix 2 broken lines, const multi_ptr, line wrap
dkhaldi Aug 2, 2023
84af291
Add clarifications about joint_matrix_copy
dkhaldi Aug 2, 2023
2c2af7d
Add non const overload to tf32 load as implicit conversion for multi_…
dkhaldi Aug 7, 2023
e8bde89
minor clarification
dkhaldi Aug 9, 2023
a7f92ce
fix width of query table
dkhaldi Aug 23, 2023
789b593
fix the width for the right table
dkhaldi Aug 25, 2023
ee28250
Avoid line breaks in table by using source block
gmlueck Aug 25, 2023
2d80d16
add the conflicted file first in order to resolve the conflict
dkhaldi Aug 28, 2023
901252b
Merge branch 'intel:sycl' into get-coord-doc
dkhaldi Aug 28, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 7 additions & 7 deletions sycl/ReleaseNotes.md
Original file line number Diff line number Diff line change
Expand Up @@ -293,7 +293,7 @@ extension. [1d993446] [4f7787c8]
- Implemented `ext::oneapi::experimental::radix_sorter` from the
[`sycl_ext_oneapi_group_sort`](doc/extensions/proposed/sycl_ext_oneapi_group_sort.asciidoc)
extension proposal. [86ba1809]
- Implemented a new unified interface for the [`sycl_ext_oneapi_matrix`](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc)
- Implemented a new unified interface for the [`sycl_ext_oneapi_matrix`](https://github.com/intel/llvm/blob/7dab76e1d33341b1e6bf339ab933552281abb3e2/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc)
extension for CUDA. [166bbc36]
- Added support for sorting over sub-groups. [168767c6]
- Added C++ API wrappers for the Intel math functions `ceil`, `floor`, `rint`,
Expand Down Expand Up @@ -407,7 +407,7 @@ extension proposal to allow the compiler to determine the initiation interval.
- Updated the [`sycl_ext_intel_usm_address_spaces`](doc/extensions/supported/sycl_ext_intel_usm_address_spaces.asciidoc)
extension to adhere to SYCL 2020 `multi_ptr`. [4a9e9a0e]
- Added a new matrix use parameter to `joint_matrix` from the
[`sycl_ext_oneapi_matrix`](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc)
[`sycl_ext_oneapi_matrix`](https://github.com/intel/llvm/blob/f2983fc0d8fcd7bd6022a7006ad489c591838041/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc)
extension specification. [52f34fd5]
- Removed `queue::size` and `queue::get_wait_list` functions from the
`sycl_ext_oneapi_queue_status_query` extension due to performance overhead
Expand Down Expand Up @@ -654,7 +654,7 @@ Release notes for commit range [`4043dda3..0f579bae`](https://github.com/intel/l
to mark `has_property` API as `noexcept`. [7805aa3f]
- Updated [`sycl_ext_intel_device_info`](doc/extensions/supported/sycl_ext_intel_device_info.md)
to support querying free device memory. [0eeef2b3]
- Updated [`sycl_ext_oneapi_matrix`](doc/extensions/experimental/sycl_ext_oneapi_matrix.asciidoc)
- Updated [`sycl_ext_oneapi_matrix`](https://github.com/intel/llvm/blob/770f540d8b600c8c16df12dfccbf38fa780cf77a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix.asciidoc)
with description of new matrix features. [770f540d]
- Moved [`sycl_ext_oneapi_invoke_simd`](doc/extensions/experimental/sycl_ext_oneapi_invoke_simd.asciidoc)
extensions specification from `proposed` to `experimental` because
Expand Down Expand Up @@ -1300,7 +1300,7 @@ Release notes for commit range 23ca0c2..27f59d8
Level Zero, ESIMD emulator, HIP [2b0ebab376dc]
- Added support for `sycl::ext::intel::experimental::esimd_ballot` function
[0bbb091c1baa]
- Added initial support for [Tensorcore matrix extension](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc)
- Added initial support for [Tensor Cores matrix extension](https://github.com/intel/llvm/blob/f2983fc0d8fcd7bd6022a7006ad489c591838041/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc)
[711ba58c30a8]

### Documentation
Expand Down Expand Up @@ -1692,7 +1692,7 @@ Release notes for commit range 4fc5ebe..bd68232
- Added [sRGBA support](doc/extensions/supported/sycl_ext_oneapi_srgb.asciidoc)
[e488327][191efdd]
- Added a preview feature implementation for the DPC++ experimental
[matrix extension](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc)
[matrix extension](https://github.com/intel/llvm/blob/467ef25a309ec882027052f3d4c3df58c11ee2ac/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc)
[7f218531] [a95f46d]
- Added support for SYCL 2020 exceptions [5c0f748][eef07606][5af8c43d]
- Added support for [sycl_ext_intel_bf16_conversion extension](doc/extensions/experimental/sycl_ext_intel_bf16_conversion.asciidoc)
Expand Down Expand Up @@ -1956,7 +1956,7 @@ Release notes for commit range 6a49170027fb..962909fe9e78
for querying of free device memory in LevelZero backend extension [fa428bf]
- Added [InvokeSIMD](doc/extensions/proposed/sycl_ext_oneapi_invoke_simd.asciidoc) and
[Uniform](doc/extensions/proposed/sycl_ext_oneapi_uniform.asciidoc) extensions [72e1611]
- Added [Matrix Programming Extension for DPC++ document](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc) [ace4c733]
- Added [Matrix Programming Extension for DPC++ document](https://github.com/intel/llvm/blob/ce12ec028681aa90133c518126014b0881d9e6bc/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc) [ace4c733]
- Implemented SYCL 2020 `sycl::span` [9356d53]
- Added [device-if](doc/extensions/proposed/sycl_ext_oneapi_device_if.asciidoc) extension
[4fb95fc]
Expand Down Expand Up @@ -2102,7 +2102,7 @@ Release notes for commit range 6a49170027fb..962909fe9e78
- Fixed build issue when CUDA 11 is used [f7224f1]
- Fixed caching of sub-devices in Level Zero backend[4c34f93]
- Fixed requesting of USM memory allocation info on CUDA [691f842]
- Fixed [`joint_matrix_mad`](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc)
- Fixed [`joint_matrix_mad`](https://github.com/intel/llvm/blob/ce12ec028681aa90133c518126014b0881d9e6bc/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc)
behaviour to return `A*B+C` instead of assigning the result to `C` [ea59c2b]
- Workaround an issue in Level Zero backend when event isn't waited upon its
completion but is queried for its status in an infinite loop [bfef316]
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,327 @@
= sycl_ext_intel_matrix

:source-highlighter: coderay
:coderay-linenums-mode: table

// This section needs to be after the document title.
:doctype: book
:toc2:
:toc: left
:encoding: utf-8
:lang: en
:dpcpp: pass:[DPC++]

// Set the default source code type in this document to C++,
// for syntax highlighting purposes. This is needed because
// docbook uses c++ and html5 uses cpp.
:language: {basebackend@docbook:c++:cpp}


== Notice

Copyright (c) 2022-2023 Intel Corporation. All rights reserved.

NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are
trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc.
used by permission by Khronos.

== Contact

To report problems with this extension, please open a new issue at:

https://github.com/intel/llvm/issues

== Dependencies

This extension is written against the SYCL 2020 revision 6 specification. All
references below to the "core SYCL specification" or to section numbers in the
SYCL specification refer to that revision.

This extension also depends on the following other SYCL extensions:

* link:sycl_ext_oneapi_matrix.asciidoc[sycl_ext_oneapi_matrix]

== Status
This is an experimental extension specification, intended to provide early
access to features and gather community feedback. Interfaces defined in this
specification are implemented in {dpcpp}, but they are not finalized and may
change incompatibly in future versions of {dpcpp} without prior notice.
*Shipping software products should not rely on APIs defined in this
specification.*

== Backend support status
This document describes the extra features and details for the
implementation of `joint_matrix` extension on Intel AMX and Intel
XMX.

The APIs in this extension may be used only on a device that has
`aspect::ext_intel_matrix`. The application must check that the device
has this aspect before submitting a kernel using any of the APIs in
this extension. If the application fails to do this, the
implementation throws a synchronous exception with the
`errc::kernel_not_supported` error code when the kernel is submitted to
the queue.

== Overview
This extension provides additional APIs related to the `joint_matrix`
type that can be used only on Intel devices that have Intel AMX or
Intel XMX technology. These Intel devices also support all of the
generic matrix APIs specified in `sycl_ext_oneapi_matrix`, but
applications can make use of the extended Intel specific APIs in this
extension to gain additional performance and capabilities.

== Specification

=== Feature test macro

This extension provides a feature-test macro as described in the core SYCL
specification. An implementation supporting this extension must
predefine the macro `SYCL_EXT_INTEL_MATRIX` to one of the values
defined in the table below. Applications can test for the existence of
this macro to determine if the implementation supports this feature,
or applications can test the macro's value to determine which of the
extension's APIs the implementation supports.

[%header,cols="1,5"]
|===
|Value
|Description

|1
|The APIs of this experimental extension are not versioned, so the
feature-test macro always has this value.
|===

=== New Aspect for Intel-Specific Matrix APIs
This extension adds a new device aspect:
```c++
namespace sycl {

enum class aspect : /*unspecified*/ {
ext_intel_matrix
};

} // namespace sycl
```
The `ext_intel_matrix` aspect indicates that the device is capable of
using the extended joint matrix APIs that are defined in the sections
that follow.

=== New Layout Type
This extension adds a new layout type named `ext_intel_packed` which
an application can use to indicate that the matrix data is loaded or
stored in VNNI "packed" format.

```c++
namespace sycl::ext::oneapi::experimental::matrix::layout {

enum class layout {
ext_intel_packed
};

} // namespace sycl::ext::oneapi::experimental::matrix
```

Consequently, the layout argument `layout` in `joint_matrix_load` can
take `ext_intel_packed` as argument to specify that the data has
already been transformed into VNNI format. In this case, the `stride`
argument of `joint_matrix_load` describes the number of elements
between consecutive rows for packed layouts.

In order to get maximum performance on Intel AMX and Intel XMX,
prepacking data in the memory is necessary. If users did not specify
the packed layouts, transforms done by the implementation may be slow
due to extra scatter/gather operations. Hence, we expose the
`ext_intel_packed` layout to the user to specify that A or B have
already been VNNIed. The packed or VNNI layout is introduced in the
`VNNI layout` section below.

=== Additional Store Operations
Besides store of matrix `accumulator`, the Intel implementation allows
store on matrix `a` and `b` as well.

```c++
namespace sycl::ext::intel::experimental::matrix {

template <typename Group, typename T, size_t Rows, size_t Cols,
layout Layout, access::address_space Space,
access::decorated IsDecorated>
void joint_matrix_store(Group g,
const joint_matrix<Group, T, use::a, Rows, Cols, Layout> &res,
multi_ptr<T, Space, IsDecorated> src, size_t stride);

template <typename Group, typename T, size_t Rows, size_t Cols,
layout Layout, access::address_space Space,
access::decorated IsDecorated>
void joint_matrix_store(Group g,
const joint_matrix<Group, T, use::b, Rows, Cols, Layout> &res,
multi_ptr<T, Space, IsDecorated> src, size_t stride);

} // namespace sycl::ext::intel::experimental::matrix
```

=== Per-element Access with Coordinates
The function `joint_matrix_apply` in `sycl_ext_oneapi_matrix` provides
a way for the application to apply the same operation on every element
of the matrix. However, some algorithms require the application to
know the coordinates of each element as it operates on them. In this
case, the joint matrix index must be known in order to reason about
the matrix view and extract the relevant piece such as a sum of all
elements in a row for example. For instance, quantization that is
needed for conversion between low precision types like `int8_t` and `fp32`
uses such logic.

This extension adds a new form of the `joint_matrix_apply` function in
the `sycl::ext::intel::matrix` namespace that allows the application
to perform an operation on each element of the matrix. This function
is similar to the form in `sycl_ext_oneapi_joint_matrix`, but it also
provides the matrix coordinates of each element to the callback
function:

```c++
namespace sycl::ext::intel::experimental::matrix {

template<typename Group, typename T, use Use, size_t Rows, size_t
Cols, layout Layout, typename F>
void joint_matrix_apply(Group g, joint_matrix<Group, T, Use, Rows,
Cols, Layout>& C, F&& func);

} // namespace sycl::ext::intel::experimental::matrix
```
The `func` callback is invoked with three parameters `(T& element,
size_t row, size_t col)`, where `row` and `col` tell the coordinates
of element in the joint matrix. To illustrate, the following example
shows how you can use this API to sum the rows of a matrix:

```c++
joint_matrix_apply(sg, A, [=](T &val, size_t row, size_t col) {
sum_local_rows[row] += val;
});
```
=== New Device Information Descriptor
Besides the query we provide in
link:sycl_ext_oneapi_matrix.asciidoc[sycl_ext_oneapi_matrix],
some device descriptors are Intel hardware specific. These are
provided as part of `ext::intel::experimental::info::device::matrix`
namespace:

[frame="none",options="header"]
|======================
| Device descriptors | Return type| Description
|`ext::intel::experimental::info::device::matrix::numtiles`| `int`
|If the matrix hardware in the device has separate storage (register
files or tiles) from the rest of the processing units (e.g. Intel
AMX), returns the number of tiles. For other devices, returns 0.
|======================

=== Packed Layout Format
The `ext_intel_packed` layout (aka VNNI) is a special layout for
matrix data that allows Intel AMX and Intel XMX devices to load
matrices more efficiently (packing in 32 bits). This layout applies
only to the A and B matrices, and may not be used with the accumulator
matrix. The layout is different depending on whether the matrix
element type is 8 bits or 16 bits, which are the only two element
sizes supported for the A and B matrices on Intel AMX and Intel XMX
devices.

For an 8-bit element, the first four elements of column 0 are stored
contiguously in memory, followed by the first four elements of column
1, etc. This continues until the end of the row. After all the
elements for rows 0 - 3 have been stored this way, the process
repeats, starting with the next four elements of column 0. The diagram
below illustrates this layout for a 8 x 4 matrix.

==== Example 1: 8-bit elements

// Example of a 8 row x 4 column matrix using a 8-bit data
// element, in row-major layout, rows are shown horizontally.
// Element a1 is contiguous in memory with element b1, etc.
// ---------------------------------
// a1, b1, c1, d1
// a2, b2, c2, d2
// a3, b3, c3, d3
// a4, b4, c4, d4
// a5, b5, c5, d5
// a6, b6, c6, d6
// a7, b7, c7, d7
// a8, b8, c8, d8
// ---------------------------------
// The same matrix reformatted in packed layout.
// Here, packing of 4 elements is needed to form 32 bits.
// Elements a1, a2, a3, a4 are contiguous in memory, etc.
// ---------------------------------
// a1, a2, a3, a4, b1, b2, b3, b4, c1, c2, c3, c4, d1, d2, d3, d4
// a5, a6, a7, a8, b5, b6, b7, b8, c5, c6, c7, c8, d5, d6, d7, d8

For a 16-bit element, the first two elements of column 0 are stored
contiguously in memory, followed by the first two elements of column
1, etc. This continues until the end of the row. After all the
elements for rows 0 - 1 have been stored this way, the process
repeats, starting with the next two elements of column 0. The diagram
below illustrates this layout for a 4 x 4 matrix.

==== Example 2: 16-bit elements
// Example of a 4 row x 4 column matrix using a 16-bit data
// element, in row-major layout.
// Element a1 is contiguous in memory with element b1, etc.
// ---------------------------------
// a1, b1, c1, d1
// a2, b2, c2, d2
// a3, b3, c3, d3
// a4, b4, c4, d4
// ---------------------------------
// The same matrix reformatted in packed layout.
// Here, packing of 2 elements is needed to form 32 bits.
// Element a1 is contiguous in memory with element a2, etc.
// ---------------------------------
// a1, a2, b1, b2, c1, c2, d1, d2
// a3, a4, b3, b4, c3, c4, d3, d4

=== Example using int8_t type
```c++
using namespace sycl::ext::oneapi::experimental::matrix;

queue q;
range<2> G = {M/tM, N};
range<2> L = {1, SG_SIZE};
auto bufA = sycl::buffer{memA, sycl::range{M*K}};
auto bufB = sycl::buffer{memB, sycl::range{K*N}};
auto bufC = sycl::buffer{memC, sycl::range{M*N}};
q.submit([&](sycl::handler& cgh) {
auto accA = sycl::accessor{bufA, cgh, sycl::read_only};
auto accB = sycl::accessor{bufB, cgh, sycl::read_only};
auto accC = sycl::accessor{bufC, cgh, sycl::read_write};
cgh.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item)
[[sycl::reqd_sub_group_size(SG_SIZE)]] {
const auto global_idx = item.get_global_id(0);
const auto global_idy = item.get_global_id(1);
const auto sg_startx = global_idx - item.get_local_id(0);
const auto sg_starty = global_idy - item.get_local_id(1);
sub_group sg = item.get_sub_group();
joint_matrix<sub_group, int8_t, use::a, tM, tK, layout::row_major> tA;
joint_matrix<sub_group, int8_t, use::b, tK, tN,
layout::ext_intel_packed> tB;
joint_matrix<sub_group, int32_t, use::accumulator, tM, tN> tC;
joint_matrix_fill(sg, tC, 0);
for (int k = 0; k < K; k += tK) {
joint_matrix_load(sg, tA, accA + sg_startx * tM * K + k, K);
joint_matrix_load(sg, tB, accB + k * N*4 + sg_starty/SG_SIZE*tN*4, N*4);
tC = joint_matrix_mad(sg, tA, tB, tC);
}
auto wi_data_c = ext::intel::experimental::matrix::get_wi_data(sg, tC);
for (int i = 0; i < wi_data_c.length(); i++)
wi_data_c[i] *= alpha;
joint_matrix_store(sg, tC,
accC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major);
});
});
q.wait();
```
== Revision History

[frame="none",options="header"]
|======================
|Rev |Date |Author |Changes
|1 |2022-11-07 |Dounia Khaldi |Add Intel-specific store API,
layout information, and `joint_matrix_apply` with coordinates API
|======================
Loading