[SYCL][Matrix] Add spec document for the matrix extension interface and its first implementation for AMX #3551
# Matrix Programming Extension for DPC++
`matrix` is a new experimental DPC++ extension to provide unified matrix programming on different tensor hardware. The current implementation provides support for the matrix interface using Intel(R) Advanced Matrix Extensions (AMX).
# Matrix Programming Extension for DPC++: SYCL_EXT_ONEAPI_MATRIX
:source-highlighter: coderay
:coderay-linenums-mode: table

// This section needs to be after the document title.
:doctype: book
:toc2:
:toc: left
:encoding: utf-8
:lang: en

:blank: pass:[ +]

// Set the default source code type in this document to C++,
// for syntax highlighting purposes. This is needed because
// docbook uses c++ and html5 uses cpp.
:language: {basebackend@docbook:c++:cpp}
== Notice

Copyright (c) 2021 Intel Corporation. All rights reserved.

NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are
trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc.
used by permission by Khronos.

This extension is written against the SYCL 2020 revision 3 specification. All
references below to the "core SYCL specification" or to section numbers in the
SYCL specification refer to that revision.

**_NOTE:_** _This document describes the current design and API for the matrix
extension to DPC++. This is an initial experimental version to try out functionality
and performance. The current implementation provides support for the matrix interface on Intel(R) Advanced Matrix Extensions (AMX). We are going to work with the community on incrementally improving the API to bring it closer to standard C++ (aligned with the `std::mdspan` and `std::mdarray` proposals) and SYCL over the next several months._
## Introduction
This document presents ongoing work toward defining a unified matrix interface. This interface is intended to unify different tensor hardware: AMX in Intel CPUs, Habana Gaudi and Goya tensor and GEMM cores, Nvidia TPUs, and IBM Power MMA. All of these hardware units provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefits from the maximum performance these different hardware units can offer.

## Feature test macro

This extension provides a feature-test macro as described in the core SYCL
specification section 6.3.3 "Feature test macros". Therefore, an
implementation supporting this extension must predefine the macro
`SYCL_EXT_ONEAPI_MATRIX` to one of the values defined in the table below.
Applications can test for the existence of this macro to determine if the
implementation supports this feature, or applications can test the macro's
value to determine which of the extension's APIs the implementation supports.

[frame="none",options="header"]
|======================
|Value |Description
|1 |Initial extension implementation on AMX. Base features are supported.
|======================

### New `matrix` class
We introduce a new class called `matrix`. The user needs to specify the type of the elements, the sizes, and the memory layout.

The same `matrix` class should handle both the case where sizes are constant (GPU case) and the case where sizes are variable (CPU case). Note that an AMX 2d tile register permits sizes up to 1024 bytes (16 rows x 64 cols). The ability to define only one interface for both gives the user a way to make use of the flexibility introduced by the CPU while at the same time saving resources on the GPU. We use `sycl::dynamic_extent` to differentiate between static and dynamic sizes.
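To illustrate the static/dynamic distinction, here is a minimal plain C++ sketch; the `matrix_sketch` class, its members, and the `dynamic_extent` constant are hypothetical stand-ins, not the extension's actual definitions:

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Illustrative stand-in for sycl::dynamic_extent (hypothetical).
constexpr std::size_t dynamic_extent = std::numeric_limits<std::size_t>::max();

template <typename T, std::size_t Rows = dynamic_extent,
          std::size_t Cols = dynamic_extent>
struct matrix_sketch {
  std::vector<T> data;
  std::size_t rows, cols;
  // Static sizes (GPU case) come from the template arguments; dynamic
  // sizes (CPU case) are supplied at construction time.
  matrix_sketch(std::size_t r = Rows, std::size_t c = Cols)
      : data(r * c), rows(r), cols(c) {}
  static constexpr bool is_static() {
    return Rows != dynamic_extent && Cols != dynamic_extent;
  }
};
```

A single template thus covers both cases: `matrix_sketch<int, 4, 8>` has compile-time extents, while `matrix_sketch<int>` defers the sizes to the constructor.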

The layout is part of the type so that the physical offset can be calculated if the user needs to access a single entry for some purpose. Besides row-major and column-major layouts, `matrix_layout` is flexible enough to introduce custom layouts such as symmetric or tiled layouts. The AMX hardware requires A and B to be in VNNI or 32-bit packed layout. If users do not specify these layouts, the transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the layouts `packed_a` and `packed_b` so the user can specify that A and/or B have already been packed in VNNI format. The packed or VNNI layout is described in the `VNNI layout` section below.

```c++
template <typename T, size_t Rows=std::dynamic_extent, size_t Cols=std::dynamic_extent, matrix_layout l = row_major>
struct matrix;
enum class matrix_layout {
  row_major,
  col_major,
  packed_a,
  packed_b
};
```
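As a hypothetical illustration of why the layout belongs on the type, the linear offset of a logical element (row, col) differs per layout; the `linear_offset` helper below is not part of the extension, and the `vnni` blocking factor of the packed layouts is the one described in the `VNNI layout` section below:

```cpp
#include <cassert>
#include <cstddef>

enum class matrix_layout { row_major, col_major, packed_a, packed_b };

// Sketch: physical offset of logical element (row, col) in a matrix with
// a `rows` x `cols` logical shape. For the packed layouts with VNNI
// blocking factor `vnni` (4 for 8-bit types, 2 for 16-bit types), groups
// of `vnni` consecutive rows are interleaved along the column dimension.
std::size_t linear_offset(matrix_layout l, std::size_t row, std::size_t col,
                          std::size_t rows, std::size_t cols,
                          std::size_t vnni = 4) {
  switch (l) {
  case matrix_layout::row_major:
    return row * cols + col;
  case matrix_layout::col_major:
    return col * rows + row;
  default: // packed_a / packed_b: VNNI interleaving of rows
    return (row / vnni) * cols * vnni + col * vnni + (row % vnni);
  }
}
```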
### Matrix Operations
We define three new functions needed to perform the main and common operations on matrices. This set of functions can be easily extended if the tensor hardware implements new features.
The base pointer determines the starting address of the matrix to be loaded/stored.
`layout` determines whether the data is being read/written in a row-major (`row_major`) or column-major (`col_major`) fashion, or whether the data has already been transformed into VNNI format (`packed_a`, `packed_b`).

```c++
void matrix_load(matrix<> A, T *base, unsigned stride, matrix_layout l = row_major);
```
This function loads data from memory into the AMX 2d tile registers, which provide 2d storage.
```c++
void matrix_store(matrix<> A, T *base, unsigned stride, matrix_layout l = row_major);
```
This function stores the data from the 2d tiles back to memory.
```c++
matrix<> matrix_mad(matrix<> A, matrix<> B, matrix<> C);
```
The matrix multiply-and-add function performs the multiply operation on the matrices `A` and `B`, accumulates the result with `C`, and returns the result.
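In scalar terms, the operation computes C = A*B + C. The following is a hypothetical host-side reference of those semantics for the int8/int32 case, not the AMX implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference semantics of C = A*B + C for int8 inputs with int32
// accumulation, over row-major M x K (A), K x N (B), and M x N (C).
std::vector<int32_t> matrix_mad_ref(const std::vector<int8_t> &A,
                                    const std::vector<int8_t> &B,
                                    std::vector<int32_t> C, std::size_t M,
                                    std::size_t K, std::size_t N) {
  for (std::size_t m = 0; m < M; ++m)
    for (std::size_t n = 0; n < N; ++n)
      for (std::size_t k = 0; k < K; ++k)
        // Widen each int8 product into the int32 accumulator.
        C[m * N + n] += int32_t(A[m * K + k]) * int32_t(B[k * N + n]);
  return C;
}
```

Note the accumulator type is wider than the input type, which matches the int8/int32 and bf16/float pairings listed in the restrictions section.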
## Integration with DPC++
When using the matrix interface in a DPC++ kernel, additional semantics have to be added to define the memory scope of the matrix object and the execution scope of its operations. In the context of DPC++, a `matrix` is distributed across an execution unit. In practice, this can be one work-item, the work-items in a sub-group, or the work-items in a work-group.

Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in non-divergent control flow. The `Group` template argument can be a work-group or a sub-group. These functions will be called once by all the work-items in the group.

### Memory Scope
For the memory scope, we have two solutions. The long-term solution is to use the proposed https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/LocalMemory/SYCL_INTEL_local_memory.asciidoc[`group_local_memory` extension] to allocate the matrix in the local memory associated with a SYCL group, as shown in the example below.

```c++
multi_ptr<matrix<T>, address_space::local_space> tA_ptr = group_local_memory<matrix<sub_group, int8_t, tM, tN>>(sg);
```
However, sub-group local memory is not yet well defined in DPC++. Moreover, the representation of this notion in LLVM IR and SPIR-V is not yet clear. Hence, for this proposal, we proceed by adding the memory scope as an additional constructor argument, as follows:

```c++
joint_matrix<sub_group, int8_t, tM, tN> tA(sg);
```
### Execution Scope
To be aligned with the SYCL 2020 group algorithms, an additional group argument is added to the matrix operations to designate that these functions are collective operations. The DPC++ syntax is the following:

```c++
void joint_matrix_load(Group g, matrix<> A, T *base, unsigned stride, matrix_layout l = row_major);
```
```c++
void joint_matrix_store(Group g, matrix<> A, T *base, unsigned stride, matrix_layout l = row_major);
```
```c++
matrix<> joint_matrix_mad(Group g, matrix<> A, matrix<> B, matrix<> C);
```

## Example using int8_t type
```c++
using namespace cl::sycl::ext::intel::matrix;

queue q;
range<2> G = {M, N};
// For this first implementation, SG_SIZE has to be equal to one
range<2> L = {1, SG_SIZE};
int8_t *memA = malloc_shared<int8_t>(M*K, q);
int8_t *memB = malloc_shared<int8_t>(K*N, q);
int32_t *memC = malloc_shared<int32_t>(M*N, q);
// Assuming memB has already been packed in VNNI format
q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item)
  [[sycl::reqd_sub_group_size(SG_SIZE)]] {
  const auto global_idx = item.get_global_id(0);
  const auto global_idy = item.get_global_id(1);
  const auto sg_startx = global_idx - item.get_local_id(0);
  const auto sg_starty = global_idy - item.get_local_id(1);
  sub_group sg = item.get_sub_group();
  joint_matrix<sub_group, int8_t, tM, tK> tA(sg);
  // For B, since the current implementation does not support the non-packed layout,
  // users need to specify the updated VNNI sizes along with the packed_b layout
  joint_matrix<sub_group, int8_t, tK/4, tN*4, packed_b> tB(sg);
  joint_matrix<sub_group, int32_t, tM, tN> tC(sg);
  joint_matrix_load(sg, tC, memC + sg_startx * tM * N + sg_starty, N, row_major);
  for (int k = 0; k < K; k += tK) {
    joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K, row_major); // collective load
    joint_matrix_load(sg, tB, memB + k * N + sg_starty, N, packed_b); // B is in VNNI format
    tC = joint_matrix_mad(sg, tA, tB, tC);
  }
  joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty, N, row_major);
}).wait();
```
## Implementation Status
For oneAPI release 3, an AOT implementation is available on the CPU device to target AMX hardware. We are using AMX tile intrinsics to implement the matrix load and store operations. Since we are currently emitting AMX intrinsics directly, this only enables AOT compilation. Please refer to the section below on the future unified SPIR-V path that will enable JIT compilation.
// We used the https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AMX[`_tile_`-prefixed intrinsics] defined in `immintrin.h`.

Currently, this is the compilation command line needed to target the AMX unit of the Sapphire Rapids CPU:

```c++
clang++ -fsycl -march=sapphirerapids -fsycl-targets="spir64_x86_64-unknown-linux-sycldevice" -O2 matmul-int8.cpp -o matmul-int8
```

### Current Implementation Restrictions
#### Type, Sizes, and Layouts
The types supported by this AMX implementation are restricted to the types that the AMX hardware supports. Although the AMX hardware supports 2d tiles with a maximum size of 16x64 bytes, this implementation can handle any size. If the matrix size is bigger than 1024 bytes, it will be stored in memory rather than mapped to a 2d tile, and a performance penalty may occur in this case. In order to get the best performance with this implementation, matrix sizes should be smaller than 16x64 bytes, and the A and B matrices should already be packed (put in VNNI format).

More specifically, the operation C = A*B+C can be performed on AMX with this interface where:
A(int8, any-size, row_major), B(int8, any-size, packed_b), C(int32, any-size, row_major)
or
A(bf16, any-size, row_major), B(bf16, any-size, packed_b), C(float, any-size, row_major)

#### Memory and Execution Scope
The current implementation only considers a sub-group scope. Moreover, the sub-group size has to be equal to one in this first implementation.

## Future Work: Unified LLVM IR and SPIR-V JIT Enabling
To enable JIT compilation, a unified matrix IR needs to be added. Currently, there is no matrix type in LLVM IR or SPIR-V. We are working towards adding a new matrix type to both LLVM IR and SPIR-V. This JIT enabling is expected to be part of a future compiler release.

### LLVM IR Extension
As a short-term solution, we are extending the https://llvm.org/docs/LangRef.html#llvm-matrix-transpose-intrinsic[existing LLVM IR matrix intrinsics] to include features like the VNNI layout. The current matrix intrinsics use flattened vectors to represent a matrix. Therefore, we are exploring both adding a matrix type to LLVM IR and using the MLIR `vector` dialect for this work.

### SPIR-V Extension
The current draft proposal can be found https://gitlab.devtools.intel.com/OpenCL/opencl-extension-drafts/-/blob/master/SPV_INTEL_matrix.asciidoc[here].
We are adding translation from the LLVM IR matrix to the SPIR-V matrix and vice versa in the LLVM-to-SPIR-V translator tool.

## VNNI/Packed Layout
AMX compute assumes the register for the B tile (src1) to be in VNNI format, since 32 bits of K-data in A and B need to be contiguous in memory.
The VNNI blocking factor is 2 in the case of 16-bit elements and 4 in the case of 8-bit elements. While the current implementation assumes that the matrix has already been packed by the user for performance reasons, the layout information is needed to inform the implementation about this transform.

// Example of bf16 data type:
// ---------------------------------
// a1, b1, c1, d1
// a2, b2, c2, d2
// a3, b3, c3, d3
// a4, b4, c4, d4
// ---------------------------------
// reformat to
// ---------------------------------
// a1, a2, b1, b2, c1, c2, d1, d2
// a3, a4, b3, b4, c3, c4, d3, d4
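The reformatting shown above can be sketched as a standalone packing routine. This is a hypothetical helper, not part of the extension, assuming a row-major source matrix with `rows` divisible by the blocking factor:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pack a row-major rows x cols matrix into VNNI layout: each group of
// `factor` consecutive rows is interleaved element-wise along the columns,
// producing a (rows/factor) x (cols*factor) matrix. `factor` is 2 for
// 16-bit element types and 4 for 8-bit element types.
template <typename T>
std::vector<T> vnni_pack(const std::vector<T> &src, std::size_t rows,
                         std::size_t cols, std::size_t factor) {
  std::vector<T> dst(rows * cols);
  for (std::size_t r = 0; r < rows; ++r)
    for (std::size_t c = 0; c < cols; ++c)
      dst[(r / factor) * cols * factor + c * factor + (r % factor)] =
          src[r * cols + c];
  return dst;
}
```

Running this on the 4x4 bf16-style example above with `factor = 2` produces exactly the 2x8 interleaved rows shown in the comment.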

## Open Questions
- Besides the row-major, column-major, and packed (VNNI) layouts, what additional layouts should absolutely be added?
- Are there alternative names for the `packed_a` and `packed_b` layouts that would more clearly distinguish between the VNNI layout in matrix A and the VNNI layout in matrix B of a matrix multiply-and-add operation on AMX?
- Ronan Keryell: "It would be interesting to investigate whether providing also member functions would simplify the API. Provide both so it is possible to use the best one for each use case, while waiting for https://en.wikipedia.org/wiki/Uniform_Function_Call_Syntax to land into C++?"

## TODO List
- Handle sub-group sizes that are bigger than one.
- Add support for queries that give information about the capabilities of the implementation on a particular device.
- Once the SPIR-V translator work is done, this code generation work will move to the backend, along with enabling JIT compilation.

## Revision History

[frame="none",options="header"]
|======================
|Rev |Date |Author |Changes
|1 |2021-04-13 |Dounia Khaldi |Initial public working draft.
|======================