Move the Intel specific features to a separate document. Mainly: #7307

dkhaldi · 2022-11-07T21:30:03Z

Remove dynamic extent. I kept it in the open questions though of the Intel specific document.
Remove the extra functionality Intel implementation provides WRT layout to the Intel specific document

Besides that, I also change the name of DPAS to Intel XMX in this PR

… Intel specific document.

JackAKirk · 2022-11-09T11:59:35Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_intel_matrix.asciidoc

+**_NOTE:_** This document describes extra features and details for the implementation of `joint_matrix` extension on Intel AMX and Intel XMX.
+
+## Introduction
+The Intel backend implementations on both Intel AMX and DPAS  support `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, `joint_matrix_mad`, `joint_matrix_fill`, `get_wi_data`, and the query interface, as they are defined in the sycl_ext_oneapi_matrix extension. There are exra specifics about the supported layouts for extra performance and functionality that are listed in this document.


Suggested change

The Intel backend implementations on both Intel AMX and DPAS support `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, `joint_matrix_mad`, `joint_matrix_fill`, `get_wi_data`, and the query interface, as they are defined in the sycl_ext_oneapi_matrix extension. There are exra specifics about the supported layouts for extra performance and functionality that are listed in this document.

The Intel backend implementations on both Intel AMX and XMX support `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, `joint_matrix_mad`, `joint_matrix_fill`, `get_wi_data`, and the query interface, as they are defined in the sycl_ext_oneapi_matrix extension. There are exra specifics about the supported layouts for extra performance and functionality that are listed in this document.

JackAKirk · 2022-11-09T12:09:56Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_intel_matrix.asciidoc

+
+## Extra Functionality
+### Layout argument in `joint_matrix` type
+Layout in `joint_matrix` type is completely optional. Intel backends do not need to know about memory layout at the moment of creation of `joint_matrix`. Therefore, `layout` in `joint_matrix` type is optional, not only for matrix `accumulator` but for also Matrix `a` and `b`. In this case, the load with layout as an argument must be used. If `layout` is specified on Matrix `a` or `b`, it must then use the load without `layout` argument. 


I think you should refer to the "Layout template argument" here when referring to the joint_matrix argument. Where you say e.g. "it must then use the load without layout argument. " this is not completely clear: you could state explicitly joint_matrix_load. Ideally you can be more explicit about what you mean by layout argument here if possible.

JackAKirk · 2022-11-17T16:30:49Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_intel_matrix.asciidoc

-The Intel backend implementations on both Intel AMX and DPAS  support `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, `joint_matrix_mad`, `joint_matrix_fill`, `get_wi_data`, and the query interface, as they are defined in the sycl_ext_oneapi_matrix extension. There are exra specifics about the supported layouts for extra performance and functionality that are listed in this document.
-
-// I don't think we need a specific feature test macro because there is not really additional features. 
+The Intel backend implementations on both Intel AMX and Intel XMX  support `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, `joint_matrix_mad`, `joint_matrix_fill`, `get_wi_data`, and the query interface, as they are defined in the sycl_ext_oneapi_matrix extension. There are exra specifics about the supported layouts for extra performance and functionality that are listed in this document.


Suggested change

The Intel backend implementations on both Intel AMX and Intel XMX support `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, `joint_matrix_mad`, `joint_matrix_fill`, `get_wi_data`, and the query interface, as they are defined in the sycl_ext_oneapi_matrix extension. There are exra specifics about the supported layouts for extra performance and functionality that are listed in this document.

The Intel backend implementations on both Intel AMX and Intel XMX support `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, `joint_matrix_mad`, `joint_matrix_fill`, `get_wi_data`, and the query interface, as they are defined in the sycl_ext_oneapi_matrix extension. There are additional specifics about the supported layouts that enable extra performance and functionality listed in this document.

JackAKirk · 2022-11-17T16:34:06Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_intel_matrix.asciidoc


 ## Extra Functionality
 ### Layout argument in `joint_matrix` type
-Layout in `joint_matrix` type is completely optional. Intel backends do not need to know about memory layout at the moment of creation of `joint_matrix`. Therefore, `layout` in `joint_matrix` type is optional, not only for matrix `accumulator` but for also Matrix `a` and `b`. In this case, the load with layout as an argument must be used. If `layout` is specified on Matrix `a` or `b`, it must then use the load without `layout` argument. 
+The layout template argument in `joint_matrix` type is completely optional. Intel backends do not need to know about memory layout at the moment of creation of `joint_matrix`. Therefore, `layout` in `joint_matrix` type is optional, not only for matrix `accumulator` but for matrices `a` and `b` as well. In this case, the `joint_matrix_load` function that takes layout as an argument must be used. Ifthe template argument `layout` is specified on `joint_matrix` type with use `a` or `b`, it must then use the `joint_matrix_load` function that does not take `layout` as an argument.  


Suggested change

The layout template argument in `joint_matrix` type is completely optional. Intel backends do not need to know about memory layout at the moment of creation of `joint_matrix`. Therefore, `layout` in `joint_matrix` type is optional, not only for matrix `accumulator` but for matrices `a` and `b` as well. In this case, the `joint_matrix_load` function that takes layout as an argument must be used. Ifthe template argument `layout` is specified on `joint_matrix` type with use `a` or `b`, it must then use the `joint_matrix_load` function that does not take `layout` as an argument.

The layout template argument in the `joint_matrix` constructor is completely optional. Intel backends do not need to know about memory layout at the moment of creation of `joint_matrix`. Therefore, specifying `layout` in the `joint_matrix` constructor is optional, not only for matrix `accumulator` but for matrices `a` and `b` as well. In this case, the `joint_matrix_load` function that takes layout as an argument must be used. If the template argument `layout` is specified on the `joint_matrix` type with use `a` or `b`, it must then use the `joint_matrix_load` function that does not take `layout` as an argument.

dkhaldi · 2022-11-17T21:34:45Z

@intel/dpcpp-specification-reviewers, can you please take a look?
The next step would be to remove the core unified API out of experimental namespace so it is important to separate the Intel specific details from the core specification document.

gmlueck · 2022-11-18T14:41:52Z

Hi @dkhaldi,

This is not what I had in mind when I said we should split out the Intel specific functionality into its own extension. I had in mind that we would create a completely separate extension, defined in the ext::intel namespace. I'll send you an email with more information, and we can discuss offline.

- Remove packed layout to the Intel extension - Remove group argument from the constructor as it is unused today and complicates the construction of an array of matrices

….asciidoc

…cifically: - Remove dynamic_p from the query API since this is not supported - Extend scope to return more than one value (Resolving one to do item) - Change float19 to tf32 - Add xmx16 to the list of TPUs. This is the PVC TPU

…nd add Group as argument

Remove the open question: ## Open Questions - Ronan Keryell: "It would be interesting to investigate whether providing also member functions would simplify the API. Provide both so it is possible to use the best one for each use case, while waiting for https://en.wikipedia.org/wiki/Uniform_Function_Call_Syntax to land into C++?" Because we discussed the non possibility of adding the functions as methods because of the mad function that takes 3 joint matrix argument

dkhaldi · 2022-12-05T19:37:49Z

@JackAKirk, @gmlueck, I updated the document with what we discussed last week:

reorder Group and layout in joint_matrix type.
make get_wi_data a free function and add Group there as argument
remove the open question that talks about the possibility of making the functions as member functions because we ruled out this option last week

JackAKirk · 2022-12-06T10:05:18Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc

 |3     |Implementation on Nvidia Tensor Cores
 |======================

 ## New `joint_matrix` class
-We introduce a new class called `joint_matrix`. The user needs to specify the type of the elements, shape, the matrix use, the memory layout, and the memory scope of the matrix. This results in the following description:
+We introduce a new class called `joint_matrix`. The user needs to specify the memory scope, the type of the elements, the shape, the matrix use, and the memory layout of the matrix. This results in the following description:


I'm not sure "memory scope" is the correct term here for the Group template argument: Isn't it "group scope"?. If you keep "memory scope" it should be made clear that the options here are sub_group or group.

Suggested change

We introduce a new class called `joint_matrix`. The user needs to specify the memory scope, the type of the elements, the shape, the matrix use, and the memory layout of the matrix. This results in the following description:

We introduce a new class called `joint_matrix`. The user needs to specify the group scope, the type of the elements, the shape, the matrix use, and the memory layout of the matrix. This results in the following description:

memory scope terminology exists in the SYCL spec and it is exactly what we refers to here. But to make the description matches the class definition, I will change it to the group memory scope. So it is group scope but related to memory.

JackAKirk · 2022-12-06T10:07:34Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc

@@ -100,41 +94,39 @@ enum class use {
 }
 ```

+#### Shape
+Size of a `joint_matrix` that refers to its number of rows `Rows` and number of columns `Cols` must be constant.


Suggested change

Size of a `joint_matrix` that refers to its number of rows `Rows` and number of columns `Cols` must be constant.

The shape of a `joint_matrix` refers to its number of rows `Rows` and number of columns `Cols`.

JackAKirk · 2022-12-06T10:08:04Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc

 #### Layout
-Besides row major and column major layouts, `layout` is flexible enough to introduce custom layouts such as packed layout.
+This specifies the memory layout and it can be row major and column major layouts


Suggested change

This specifies the memory layout and it can be row major and column major layouts

This specifies the memory layout and it can be row major or column major.

JackAKirk · 2022-12-06T10:09:06Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc

  dynamic
-};
+ };
 }
 ```


 #### Memory Scope


Suggested change

#### Memory Scope

#### Group Scope

JackAKirk · 2022-12-06T10:09:23Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc

 }
 ```


 #### Memory Scope
-In this experimental API version, we used the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments.
+In this API, we use the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments.


Suggested change

In this API, we use the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments.

In this API, we use the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The group scope is added as an additional template parameter and is also part of the constructor arguments.

JackAKirk · 2022-12-06T10:09:38Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc

 }
 ```


 #### Memory Scope
-In this experimental API version, we used the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments.
+In this API, we use the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments.

 IMPORTANT: In the current implementation, only the subgroup scope is supported


Suggested change

IMPORTANT: In the current implementation, only the subgroup scope is supported

IMPORTANT: In the current implementation, only the `sub_group` scope is supported

JackAKirk · 2022-12-06T10:23:20Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc

-Note that in order to get maximum performance on Intel AMX and DPAS, prepacking data in the memory is necessary. If users did not specify the packed layouts, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the `packed` layout to the user to specify that A or B have already been VNNIed. The packed or VNNI layout is introduced in the `VNNI layout` section below.
-
-IMPORTANT: In the current AMX and DPAS implementation, the layout in the load of matrix B (provided by the `layout memL` parameter below) must be `packed` or `row_major`. Automatic VNNI transform is supported on AMX. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C (provided by the `layout memL` parameter below) must also be `row_major`.
+The base pointer determines the starting address of the matrix to be loaded/stored. `layout` determines whether the data is being read/written in a row (`row_major`), column major (`column_major`) fashion. `stride` describes the number of elements between consecutive rows for the row major layout, or between columns for the column major layout. 


Shouldn't a version of this paragraph come after the below definitions of joint_matrix_load etc, instead of here?

yes, I will add this paragraph after each of the two load and store operations

JackAKirk · 2022-12-06T10:27:25Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc

@@ -148,18 +140,18 @@ namespace sycl::ext::oneapi::experimental::matrix {
  template <typename Group, typename T, size_t NumRows, size_t NumCols,


Ideally I think these functions should be exposed in the simplest way possible that provides all the necessary information to the user for their proper use. There is some information that I think can be omitted, such as the Group template parameter for example because it is inferred from the Group argument and we never want the user to type it out.

It would probably be a good idea to have a short description for the purpose of the Space template parameter that references shared memory, because utilizing shared memory will be very important for application users of joint_matrix

JackAKirk · 2022-12-06T10:29:27Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc

    multi_ptr<T, Space, IsDecorated> src, size_t stride);
 }
 ```

-`joint_matrix_load` loads data from memory to the 2d tiles/registers of Intel AMX/DPAS.
+`joint_matrix_load` loads data from memory to the 2d tiles/registers of the tensor hardware.
 We define two overloads of the load function depending on whether the memory layout was declared as part of the `joint_matrix` type or not. 
 The first overload that takes memory layout as an argument is only available for a `joint_matrix` type that was declared with `layout::dynamic`.


Suggested change

The first overload that takes memory layout as an argument is only available for a `joint_matrix` type that was declared with `layout::dynamic`.

The first overload that takes memory layout as an argument is only available for a `joint_matrix` type that used the default value `layout::dynamic`.

I think all references to layout::dynamic should emphasize that it is only intended to be used as a default value, and users should never have to type layout::dynamic

JackAKirk · 2022-12-06T10:30:51Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc

@@ -228,20 +220,15 @@ Also, note that `get_wi_data` cannot return a fixed size array length because th

 2- SG size is not generally fixed.


Suggested change

2- SG size is not generally fixed.

2- sub_group size is not generally fixed.

Probably not a good idea to use shorthand unless "SG" is explicitly defined.

JackAKirk · 2022-12-06T10:33:12Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc

-3- AMX has the flexibility of allowing variable sizes on the matrix (`dynamic_extent`).
-
-In the case of CUDA backend which is SYCL AOT compiled and SG size = 32 known and fixed, the additional marray capability will be provided.
+In the case of CUDA backend which is SYCL AOT compiled and SG size = 32 known and fixed, an additional `marray`-based capability is provided in [matrix Tensor Cores additional extension](sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix_cuda.asciidoc).


This sentence can be removed now.

JackAKirk · 2022-12-06T10:34:10Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc

-struct joint_matrix {
-   wi_data<T, Use, NumRows, NumCols, Layout, Group> get_wi_data();
-};
+template <typename Group, typename T, size_t NumRows, size_t NumCols, use Use, layout Layout>


Suggested change

template <typename Group, typename T, size_t NumRows, size_t NumCols, use Use, layout Layout>

JackAKirk · 2022-12-06T10:34:44Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc

-   wi_data<T, Use, NumRows, NumCols, Layout, Group> get_wi_data();
-};
+template <typename Group, typename T, size_t NumRows, size_t NumCols, use Use, layout Layout>
+   wi_data<group, T, Use, NumRows, NumCols, Layout> get_wi_data(Group sg, joint_matrix<Group, T, Use, Rows, Cols, Layout> Mat);


Suggested change

wi_data<group, T, Use, NumRows, NumCols, Layout> get_wi_data(Group sg, joint_matrix<Group, T, Use, Rows, Cols, Layout> Mat);

wi_data get_wi_data(Group sg, joint_matrix<...> Mat);

Also below - we could remove all these references to additional CTAD templating inferred from the joint_matrix argument that the user does not need to be exposed to.

@JackAKirk , are you suggesting we remove the wi_data list of template arguments or joint_matrix ones?

For the joint_matrix ones the user has to provide the template args explicitly, so they should be documented. However I think that in some cases it is appropriate to omit them when joint_matrix is used as an argument in other functions: using the <...> notation instead.

For wi_data I think it can just be made clear that it takes the joint_matrix as an argument in its constructor. Then it is quite clear to me that its construction will depend solely on the joint_matrix it takes (which is the case), and actually any mention of template arguments is not necessary IMO.

JackAKirk · 2022-12-06T10:37:15Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc

@@ -274,43 +261,6 @@ IMPORTANT: The WI data to joint matrix mapping coordinates information is not im

 IMPORTANT: In the Tensor Cores implementation, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the Tensor Cores backend.


Suggested change

IMPORTANT: In the Tensor Cores implementation, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the Tensor Cores backend.

…t on Intel AMX and Intel XMX

jsji

Minor comments to make sure the example code can compile.

jsji · 2022-12-14T21:42:11Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc

-   joint_matrix<int32_t, use::accumulator, tM, tN> tC(sg);
+   joint_matrix<sub_group, int8_t, use::a, tM, tK, layout::row_major> tA();
+   joint_matrix<sub_group, int8_t, use::b, tK, tN, layout::row_major> tB();
+   joint_matrix<sub_group, int32_t, use::accumulator, tM, tN> tC();
   joint_matrix_fill(sg, tC, 0);
   for (int k = 0; k < K; k += tk) {


Suggested change

for (int k = 0; k < K; k += tk) {

for (int k = 0; k < K; k += tK) {

jsji · 2022-12-14T21:47:14Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc

-   joint_matrix<int8_t, use::a, tM, tK, layout::row_major> tA(sg);
-   joint_matrix<int8_t, use::b, tK, tN, layout::row_major> tB(sg);
-   joint_matrix<int32_t, use::accumulator, tM, tN> tC(sg);
+   joint_matrix<sub_group, int8_t, use::a, tM, tK, layout::row_major> tA();


Suggested change

joint_matrix<sub_group, int8_t, use::a, tM, tK, layout::row_major> tA();

joint_matrix<sub_group, int8_t, use::a, tM, tK, layout::row_major> tA;

jsji · 2022-12-14T21:47:24Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc

-   joint_matrix<int8_t, use::b, tK, tN, layout::row_major> tB(sg);
-   joint_matrix<int32_t, use::accumulator, tM, tN> tC(sg);
+   joint_matrix<sub_group, int8_t, use::a, tM, tK, layout::row_major> tA();
+   joint_matrix<sub_group, int8_t, use::b, tK, tN, layout::row_major> tB();


Suggested change

joint_matrix<sub_group, int8_t, use::b, tK, tN, layout::row_major> tB();

joint_matrix<sub_group, int8_t, use::b, tK, tN, layout::row_major> tB;

jsji · 2022-12-14T21:47:33Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc

-   joint_matrix<int32_t, use::accumulator, tM, tN> tC(sg);
+   joint_matrix<sub_group, int8_t, use::a, tM, tK, layout::row_major> tA();
+   joint_matrix<sub_group, int8_t, use::b, tK, tN, layout::row_major> tB();
+   joint_matrix<sub_group, int32_t, use::accumulator, tM, tN> tC();


Suggested change

joint_matrix<sub_group, int32_t, use::accumulator, tM, tN> tC();

joint_matrix<sub_group, int32_t, use::accumulator, tM, tN> tC;

jsji · 2022-12-14T22:04:43Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc

-   joint_matrix<int32_t, use::accumulator, tM, tN> tC(sg);
+   joint_matrix<sub_group, int8_t, use::a, tM, tK, layout::row_major> tA();
+   joint_matrix<sub_group, int8_t, use::b, tK, tN, layout::row_major> tB();
+   joint_matrix<sub_group, int32_t, use::accumulator, tM, tN> tC();
   joint_matrix_fill(sg, tC, 0);
   for (int k = 0; k < K; k += tk) {
     joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K);


Suggested change

joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K);

joint_matrix_load(sg, tA, multi_ptr<int8_t, sycl::access::address_space::global_space>(memA + sg_startx * tM * K + k), K);

jsji · 2022-12-14T22:04:59Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc

-   joint_matrix<int32_t, use::accumulator, tM, tN> tC(sg);
+   joint_matrix<sub_group, int8_t, use::a, tM, tK, layout::row_major> tA();
+   joint_matrix<sub_group, int8_t, use::b, tK, tN, layout::row_major> tB();
+   joint_matrix<sub_group, int32_t, use::accumulator, tM, tN> tC();
   joint_matrix_fill(sg, tC, 0);
   for (int k = 0; k < K; k += tk) {
     joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K);
     joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN, N); 


Suggested change

joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN, N);

joint_matrix_load(sg, tB, multi_ptr<int8_t, sycl::access::address_space::global_space>(memB + k * N + sg_starty/SG_SIZE*tN), N);

@jsji this is actually a bug. The implementation should be able to handle both types of pointers.
Thank you for spotting this. I will report this bug and create a test for this.

jsji · 2022-12-14T22:05:21Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc

   joint_matrix_fill(sg, tC, 0);
   for (int k = 0; k < K; k += tk) {
     joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K);
     joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN, N); 
     tC = joint_matrix_mad(sg, tA, tB, tC);
   }
-   auto wi_data_c = matC.get_wi_data();
+   auto wi_data_c = get_wi_data(sg, tC);
   for (int i = 0; i < wi_data_c.length(); i++)
     wi_data_c[i] *= alpha; // The indexing here "i" is in the vector owned by a WI, not in the matrix C
   joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major);


Suggested change

joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major);

joint_matrix_store(sg, tC, multi_ptr<int32_t, sycl::access::address_space::global_space>(memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN), N, layout::row_major);

dkhaldi · 2022-12-19T20:13:14Z

@gmlueck, do you have further comments?
I think we can merge this PR as it has the latest updates WRT the unified interface.
We can continue Query related issues in this PR or a new one. I am fine with either.

gmlueck · 2022-12-20T15:59:37Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc

+[frame="none",options="header"]
+|======================
+| A type | B type | Accumulator type | M | N | K
+| (u)int8_t  | (u)int8_t |  int32_t  |  `< =` 16 |  `< =` 16 |  `< =` 64


gmlueck · 2022-12-20T16:10:53Z

@gmlueck, do you have further comments?
I think we can merge this PR as it has the latest updates WRT the unified interface.
We can continue Query related issues in this PR or a new one. I am fine with either.

I'm also OK either way. I think the main areas that still need work are the query APIs and the behavior when the application uses a matrix with an unsupported size and/or data type. One of our project goals is to give good error messages when an application does something wrong, so we need to pay attention to the case when the application uses an unsupported matrix size or data type.

This will probably require some further discussion, so I don't mind addressing it incrementally in another PR.

dkhaldi · 2022-12-21T02:17:31Z

@gmlueck, do you have further comments?
I think we can merge this PR as it has the latest updates WRT the unified interface.
We can continue Query related issues in this PR or a new one. I am fine with either.

I'm also OK either way. I think the main areas that still need work are the query APIs and the behavior when the application uses a matrix with an unsupported size and/or data type. One of our project goals is to give good error messages when an application does something wrong, so we need to pay attention to the case when the application uses an unsupported matrix size or data type.

This will probably require some further discussion, so I don't mind addressing it incrementally in another PR.

Sounds good. Let's then approve and merge this one.
For the query interface, this is a static compile-time query. I have the AR to create a test that shows how to dispatch to different devices. This test uses an elaborate query of the device type: for instance, if the device has AMX (or other TPU user is interested in), users can start using this matrix query and parametrize their kernels with the queried sizes.
The advantage in this case is that users will get errors at compile time if they used unsupported sizes/types.

If users choose to skip the query, elaborate errors are generated at runtime if sizes/type are unsupported.

dkhaldi · 2022-12-21T19:53:34Z

@intel/llvm-reviewers-runtime, please help merge this.

dkhaldi added 2 commits November 7, 2022 13:27

-Remove dynamic extent. I kept it in the open questions though of the…

eaca6bf

… Intel specific document.

remove comment

4bf0775

dkhaldi marked this pull request as ready for review November 7, 2022 21:33

dkhaldi requested a review from a team as a code owner November 7, 2022 21:33

dkhaldi requested review from JackAKirk, yubingex007-a11y and gmlueck November 7, 2022 21:34

Add an open question for adding a constructor with no argument

cc99697

JackAKirk reviewed Nov 9, 2022

View reviewed changes

address Jack's comments

e673549

JackAKirk reviewed Nov 17, 2022

View reviewed changes

Address more comments from Jack

687ff28

JackAKirk approved these changes Nov 17, 2022

View reviewed changes

dkhaldi added 3 commits November 21, 2022 11:21

More cleanup of the non unified parts

9fb5b05

- Remove packed layout to the Intel extension - Remove group argument from the constructor as it is unused today and complicates the construction of an array of matrices

Add packed layout and store flexibility to the Intel extension

cc6f846

Rename sycl_ext_oneapi_intel_matrix.asciidoc to sycl_ext_intel_matrix…

d92ca02

….asciidoc

dkhaldi mentioned this pull request Nov 21, 2022

[SYCL] Implement matrix extension using new unified interface #7413

Merged

dkhaldi mentioned this pull request Nov 28, 2022

[SYCL] Add DPAS of PVC query and other improvements: #7562

Merged

dkhaldi added 2 commits December 5, 2022 12:56

reorder Group and layout parameters, make get_wi_data free function a…

e3ee0ea

…nd add Group as argument

JackAKirk reviewed Dec 6, 2022

View reviewed changes

dkhaldi added 5 commits December 7, 2022 12:29

Address Jack's comments

74217b9

Add information about the combinations that the implementation suppor…

96aaa2f

…t on Intel AMX and Intel XMX

format

e8d1646

format

88906a2

format

e9f5366

jsji reviewed Dec 14, 2022

View reviewed changes

Incorporate Jinsong's corrections

10477a4

gmlueck reviewed Dec 20, 2022

View reviewed changes

format <= sign in the supported sizes

4582fb3

gmlueck approved these changes Dec 21, 2022

View reviewed changes

againull merged commit c64b157 into intel:sycl Dec 21, 2022

	The Intel backend implementations on both Intel AMX and DPAS support `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, `joint_matrix_mad`, `joint_matrix_fill`, `get_wi_data`, and the query interface, as they are defined in the sycl_ext_oneapi_matrix extension. There are exra specifics about the supported layouts for extra performance and functionality that are listed in this document.
	The Intel backend implementations on both Intel AMX and XMX support `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, `joint_matrix_mad`, `joint_matrix_fill`, `get_wi_data`, and the query interface, as they are defined in the sycl_ext_oneapi_matrix extension. There are exra specifics about the supported layouts for extra performance and functionality that are listed in this document.

	We introduce a new class called `joint_matrix`. The user needs to specify the memory scope, the type of the elements, the shape, the matrix use, and the memory layout of the matrix. This results in the following description:
	We introduce a new class called `joint_matrix`. The user needs to specify the group scope, the type of the elements, the shape, the matrix use, and the memory layout of the matrix. This results in the following description:

	Size of a `joint_matrix` that refers to its number of rows `Rows` and number of columns `Cols` must be constant.
	The shape of a `joint_matrix` refers to its number of rows `Rows` and number of columns `Cols`.

	This specifies the memory layout and it can be row major and column major layouts
	This specifies the memory layout and it can be row major or column major.

	In this API, we use the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments.
	In this API, we use the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The group scope is added as an additional template parameter and is also part of the constructor arguments.

	IMPORTANT: In the current implementation, only the subgroup scope is supported
	IMPORTANT: In the current implementation, only the `sub_group` scope is supported

		@@ -148,18 +140,18 @@ namespace sycl::ext::oneapi::experimental::matrix {
		template <typename Group, typename T, size_t NumRows, size_t NumCols,

	The first overload that takes memory layout as an argument is only available for a `joint_matrix` type that was declared with `layout::dynamic`.
	The first overload that takes memory layout as an argument is only available for a `joint_matrix` type that used the default value `layout::dynamic`.

		@@ -228,20 +220,15 @@ Also, note that `get_wi_data` cannot return a fixed size array length because th

		2- SG size is not generally fixed.

	2- SG size is not generally fixed.
	2- sub_group size is not generally fixed.

	wi_data<group, T, Use, NumRows, NumCols, Layout> get_wi_data(Group sg, joint_matrix<Group, T, Use, Rows, Cols, Layout> Mat);
	wi_data get_wi_data(Group sg, joint_matrix<...> Mat);

		@@ -274,43 +261,6 @@ IMPORTANT: The WI data to joint matrix mapping coordinates information is not im

		IMPORTANT: In the Tensor Cores implementation, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the Tensor Cores backend.

	for (int k = 0; k < K; k += tk) {
	for (int k = 0; k < K; k += tK) {

	joint_matrix<sub_group, int8_t, use::a, tM, tK, layout::row_major> tA();
	joint_matrix<sub_group, int8_t, use::a, tM, tK, layout::row_major> tA;

	joint_matrix<sub_group, int8_t, use::b, tK, tN, layout::row_major> tB();
	joint_matrix<sub_group, int8_t, use::b, tK, tN, layout::row_major> tB;

	joint_matrix<sub_group, int32_t, use::accumulator, tM, tN> tC();
	joint_matrix<sub_group, int32_t, use::accumulator, tM, tN> tC;

	joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K);
	joint_matrix_load(sg, tA, multi_ptr<int8_t, sycl::access::address_space::global_space>(memA + sg_startx * tM * K + k), K);

	joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN, N);
	joint_matrix_load(sg, tB, multi_ptr<int8_t, sycl::access::address_space::global_space>(memB + k * N + sg_starty/SG_SIZE*tN), N);

	joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major);
	joint_matrix_store(sg, tC, multi_ptr<int32_t, sycl::access::address_space::global_space>(memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN), N, layout::row_major);

	\| (u)int8_t \| (u)int8_t \| int32_t \| `< =` 16 \| `< =` 16 \| `< =` 64
	\| (u)int8_t \| (u)int8_t \| int32_t \| +<=+ 16 \| +<=+ 16 \| +<=+ 64

Move the Intel specific features to a separate document. Mainly: #7307

Move the Intel specific features to a separate document. Mainly: #7307

Uh oh!

Conversation

dkhaldi commented Nov 7, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JackAKirk Nov 9, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dkhaldi commented Nov 17, 2022

Uh oh!

gmlueck commented Nov 18, 2022

Uh oh!

dkhaldi commented Dec 5, 2022

Uh oh!

JackAKirk Dec 6, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dkhaldi Dec 7, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JackAKirk Dec 6, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JackAKirk Dec 6, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JackAKirk Dec 6, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JackAKirk Dec 7, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jsji left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

JackAKirk Nov 9, 2022 •

edited

Loading

JackAKirk Dec 6, 2022 •

edited

Loading

dkhaldi Dec 7, 2022 •

edited

Loading

JackAKirk Dec 6, 2022 •

edited

Loading

JackAKirk Dec 6, 2022 •

edited

Loading

JackAKirk Dec 6, 2022 •

edited

Loading

JackAKirk Dec 7, 2022 •

edited

Loading

dkhaldi Dec 16, 2022 •

edited

Loading