Move the Intel specific features to a separate document. Mainly: #7307
… Intel specific document.
**_NOTE:_** This document describes extra features and details for the implementation of the `joint_matrix` extension on Intel AMX and Intel XMX.

## Introduction

The Intel backend implementations on both Intel AMX and DPAS support `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, `joint_matrix_mad`, `joint_matrix_fill`, `get_wi_data`, and the query interface, as they are defined in the sycl_ext_oneapi_matrix extension. There are extra specifics about the supported layouts, for extra performance and functionality, that are listed in this document.
Suggested change: "Intel AMX and DPAS" → "Intel AMX and XMX" in the paragraph above.
## Extra Functionality

### Layout argument in `joint_matrix` type

Layout in the `joint_matrix` type is completely optional. Intel backends do not need to know the memory layout at the moment a `joint_matrix` is created. Therefore, `layout` in the `joint_matrix` type is optional, not only for the `accumulator` matrix but also for matrices `a` and `b`. In this case, the load that takes layout as an argument must be used. If `layout` is specified on matrix `a` or `b`, the load without the `layout` argument must be used instead.
I think you should refer to the "layout template argument" here when referring to the `joint_matrix` argument. Where you say e.g. "it must then use the load without `layout` argument", this is not completely clear: you could state `joint_matrix_load` explicitly. Ideally you can be more explicit about what you mean by the `layout` argument here if possible.
// I don't think we need a specific feature test macro because there are not really additional features.
Suggested change:

The Intel backend implementations on both Intel AMX and Intel XMX support `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, `joint_matrix_mad`, `joint_matrix_fill`, `get_wi_data`, and the query interface, as they are defined in the sycl_ext_oneapi_matrix extension. There are additional specifics about the supported layouts that enable extra performance and functionality listed in this document.
The layout template argument in the `joint_matrix` type is completely optional. Intel backends do not need to know the memory layout at the moment a `joint_matrix` is created. Therefore, `layout` in the `joint_matrix` type is optional, not only for matrix `accumulator` but for matrices `a` and `b` as well. In this case, the `joint_matrix_load` function that takes layout as an argument must be used. If the template argument `layout` is specified on a `joint_matrix` type with use `a` or `b`, the `joint_matrix_load` function that does not take `layout` as an argument must be used.
Suggested change: refer to the `joint_matrix` constructor rather than the `joint_matrix` type:

The layout template argument in the `joint_matrix` constructor is completely optional. Intel backends do not need to know about memory layout at the moment of creation of `joint_matrix`. Therefore, specifying `layout` in the `joint_matrix` constructor is optional, not only for matrix `accumulator` but for matrices `a` and `b` as well. In this case, the `joint_matrix_load` function that takes layout as an argument must be used. If the template argument `layout` is specified on the `joint_matrix` type with use `a` or `b`, it must then use the `joint_matrix_load` function that does not take `layout` as an argument.
@intel/dpcpp-specification-reviewers, can you please take a look?
Hi @dkhaldi, This is not what I had in mind when I said we should split out the Intel specific functionality into its own extension. I had in mind that we would create a completely separate extension, defined in the …
- Move the packed layout to the Intel extension
- Remove the group argument from the constructor, as it is unused today and complicates the construction of an array of matrices
…cifically:
- Remove dynamic_p from the query API since this is not supported
- Extend scope to return more than one value (resolving one to-do item)
- Change float19 to tf32
- Add xmx16 to the list of TPUs; this is the PVC TPU
…nd add Group as argument
Remove the open question: "## Open Questions - Ronan Keryell: 'It would be interesting to investigate whether providing also member functions would simplify the API. Provide both so it is possible to use the best one for each use case, while waiting for https://en.wikipedia.org/wiki/Uniform_Function_Call_Syntax to land into C++?'" We discussed that the functions cannot be added as methods, because the mad function takes three joint_matrix arguments.
@JackAKirk, @gmlueck, I updated the document with what we discussed last week:
|3 |Implementation on Nvidia Tensor Cores
|======================
## New `joint_matrix` class

We introduce a new class called `joint_matrix`. The user needs to specify the memory scope, the type of the elements, the shape, the matrix use, and the memory layout of the matrix. This results in the following description:
I'm not sure "memory scope" is the correct term here for the Group template argument: isn't it "group scope"? If you keep "memory scope", it should be made clear that the options here are sub_group or group.
Suggested change: "memory scope" → "group scope" in the paragraph above.
"Memory scope" terminology exists in the SYCL spec, and it is exactly what we refer to here. But to make the description match the class definition, I will change it to "the group memory scope". So it is group scope, but related to memory.
@@ -100,41 +94,39 @@ enum class use {

#### Shape

Size of a `joint_matrix` that refers to its number of rows `Rows` and number of columns `Cols` must be constant.
Suggested change:

The shape of a `joint_matrix` refers to its number of rows `Rows` and number of columns `Cols`.
#### Layout

This specifies the memory layout and it can be row major and column major layouts
Suggested change:

This specifies the memory layout and it can be row major or column major.
Suggested change: rename the heading "#### Memory Scope" to "#### Group Scope".
#### Memory Scope

In this API, we use the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments.
Suggested change: "The memory scope is added" → "The group scope is added".
IMPORTANT: In the current implementation, only the subgroup scope is supported |
Suggested change:

IMPORTANT: In the current implementation, only the `sub_group` scope is supported
Note that in order to get maximum performance on Intel AMX and DPAS, prepacking data in memory is necessary. If users do not specify the packed layout, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the `packed` layout so the user can specify that A or B has already been VNNIed. The packed or VNNI layout is introduced in the `VNNI layout` section below.

IMPORTANT: In the current AMX and DPAS implementation, the layout in the load of matrix B (provided by the `layout memL` parameter below) must be `packed` or `row_major`. Automatic VNNI transform is supported on AMX. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C (provided by the `layout memL` parameter below) must also be `row_major`.

The base pointer determines the starting address of the matrix to be loaded/stored. `layout` determines whether the data is being read/written in a row major (`row_major`) or column major (`column_major`) fashion. `stride` describes the number of elements between consecutive rows for the row major layout, or between columns for the column major layout.
Shouldn't a version of this paragraph come after the below definitions of `joint_matrix_load` etc., instead of here?
yes, I will add this paragraph after each of the two load and store operations
@@ -148,18 +140,18 @@ namespace sycl::ext::oneapi::experimental::matrix {
template <typename Group, typename T, size_t NumRows, size_t NumCols,
Ideally I think these functions should be exposed in the simplest way possible that provides all the necessary information to the user for their proper use. There is some information that I think can be omitted, such as the Group template parameter, for example, because it is inferred from the Group argument and we never want the user to type it out. It would probably be a good idea to have a short description of the purpose of the `Space` template parameter that references shared memory, because utilizing shared memory will be very important for application users of `joint_matrix`.
multi_ptr<T, Space, IsDecorated> src, size_t stride);
}

`joint_matrix_load` loads data from memory to the 2d tiles/registers of the tensor hardware. We define two overloads of the load function depending on whether the memory layout was declared as part of the `joint_matrix` type or not. The first overload that takes memory layout as an argument is only available for a `joint_matrix` type that was declared with `layout::dynamic`.
Suggested change:

The first overload that takes memory layout as an argument is only available for a `joint_matrix` type that used the default value `layout::dynamic`.

I think all references to `layout::dynamic` should emphasize that it is only intended to be used as a default value, and users should never have to type `layout::dynamic`.
@@ -228,20 +220,15 @@ Also, note that `get_wi_data` cannot return a fixed size array length because th

2- SG size is not generally fixed.
Suggested change:

2- sub_group size is not generally fixed.

Probably not a good idea to use shorthand unless "SG" is explicitly defined.
3- AMX has the flexibility of allowing variable sizes on the matrix (`dynamic_extent`).

In the case of the CUDA backend, which is SYCL AOT compiled with a known and fixed SG size of 32, an additional `marray`-based capability is provided in the [matrix Tensor Cores additional extension](sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix_cuda.asciidoc).
This sentence can be removed now.
struct joint_matrix {
  wi_data<T, Use, NumRows, NumCols, Layout, Group> get_wi_data();
};
template <typename Group, typename T, size_t NumRows, size_t NumCols, use Use, layout Layout>
Suggested change: remove this `template <...>` line.
wi_data<group, T, Use, NumRows, NumCols, Layout> get_wi_data(Group sg, joint_matrix<Group, T, Use, Rows, Cols, Layout> Mat);
Suggested change:

wi_data get_wi_data(Group sg, joint_matrix<...> Mat);

Also below, we could remove all these references to additional CTAD templating inferred from the joint_matrix argument that the user does not need to be exposed to.
@JackAKirk, are you suggesting we remove the wi_data list of template arguments or the joint_matrix ones?
For the joint_matrix ones, the user has to provide the template args explicitly, so they should be documented. However, I think that in some cases it is appropriate to omit them when joint_matrix is used as an argument in other functions, using the `<...>` notation instead. For wi_data, I think it can just be made clear that it takes the joint_matrix as an argument in its constructor. Then it is quite clear that its construction will depend solely on the joint_matrix it takes (which is the case), and any mention of template arguments is not necessary IMO.
@@ -274,43 +261,6 @@ IMPORTANT: The WI data to joint matrix mapping coordinates information is not im

IMPORTANT: In the Tensor Cores implementation, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the Tensor Cores backend.
Suggested change: remove this paragraph.
…t on Intel AMX and Intel XMX
Minor comments to make sure the example code can compile.
The relevant part of the updated example:

joint_matrix<sub_group, int8_t, use::a, tM, tK, layout::row_major> tA();
joint_matrix<sub_group, int8_t, use::b, tK, tN, layout::row_major> tB();
joint_matrix<sub_group, int32_t, use::accumulator, tM, tN> tC();
joint_matrix_fill(sg, tC, 0);
for (int k = 0; k < K; k += tk) {
  joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K);
  joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN, N);
  tC = joint_matrix_mad(sg, tA, tB, tC);
}
auto wi_data_c = get_wi_data(sg, tC);
for (int i = 0; i < wi_data_c.length(); i++)
  wi_data_c[i] *= alpha; // The indexing here "i" is in the vector owned by a WI, not in the matrix C
joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major);

Suggested change: `for (int k = 0; k < K; k += tk) {` → `for (int k = 0; k < K; k += tK) {`

Suggested change: drop the parentheses so these declare objects rather than functions, e.g. `joint_matrix<sub_group, int8_t, use::a, tM, tK, layout::row_major> tA;` and likewise for `tB` and `tC`.

Suggested change: pass a `multi_ptr` instead of a raw pointer:

joint_matrix_load(sg, tA, multi_ptr<int8_t, sycl::access::address_space::global_space>(memA + sg_startx * tM * K + k), K);
joint_matrix_load(sg, tB, multi_ptr<int8_t, sycl::access::address_space::global_space>(memB + k * N + sg_starty/SG_SIZE*tN), N);
joint_matrix_store(sg, tC, multi_ptr<int32_t, sycl::access::address_space::global_space>(memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN), N, layout::row_major);

@jsji this is actually a bug. The implementation should be able to handle both types of pointers. Thank you for spotting this. I will report this bug and create a test for this.
@gmlueck, do you have further comments?
[frame="none",options="header"]
|======================
| A type | B type | Accumulator type | M | N | K
| (u)int8_t | (u)int8_t | int32_t | `< =` 16 | `< =` 16 | `< =` 64
Suggested change:

| (u)int8_t | (u)int8_t | int32_t | +<=+ 16 | +<=+ 16 | +<=+ 64

I saw that you had difficulty with the formatting here. The above should work, and it will look better in the HTML rendering. The `+` signs are an inline passthrough, which prevents Asciidoctor from doing special formatting.
I'm also OK either way. I think the main areas that still need work are the query APIs and the behavior when the application uses a matrix with an unsupported size and/or data type. One of our project goals is to give good error messages when an application does something wrong, so we need to pay attention to the case when the application uses an unsupported matrix size or data type. This will probably require some further discussion, so I don't mind addressing it incrementally in another PR.
Sounds good. Let's then approve and merge this one. If users choose to skip the query, elaborate errors are generated at runtime if sizes/types are unsupported.
@intel/llvm-reviewers-runtime, please help merge this.
Besides that, I also changed the name of DPAS to Intel XMX in this PR.