
[SYCL][Matrix] Add documentation about new matrix features #6157


Merged: 5 commits from `fill-piece-wise-ops` into `sycl`, Aug 20, 2022

Conversation

@dkhaldi dkhaldi (Contributor) commented May 16, 2022

The two new features are `joint_matrix_fill` and `get_wi_data` for piece-wise operations.

@dkhaldi dkhaldi requested a review from a team as a code owner May 16, 2022 15:52
@JackAKirk JackAKirk (Contributor) left a comment

I think there are a couple of separate concepts that could be disentangled and made clearer in this document (I'm using the available information in this document and in the intel/llvm implementations of the matrix extension that this document describes):

1.
joint_matrix allocates register memory for a matrix group operation. It is notable that joint_matrix takes a group template argument and that the document states:

"In this experimental API version, we used the terminology of joint_matrix instead of plain matrix to emphasis that the matrix is shared among a group of work items and is not private to each work item"

My interpretation of this information is that joint_matrix needs to take a group argument because either
a) some backends require that parts of the register memory assigned for the matrix are "held" by a particular work_item but which work_item "holds" which matrix register memory is not known at runtime
or
b) some backends do not assign matrix register memory to a particular work_item, and again a runtime function e.g. get_wi_data is required to assign register memory to a particular work_item.

Off the top of my head the most precise definition of the phrase "memory held by a thread" is that this refers to the memory in which the thread can perform element wise operations on within the SYCL SPMD programming model. The existence of get_wi_data in this spec implies to me that it is not generally possible to know which registers are "held" by each thread in the group until runtime (due to some as yet unspecified reason: arch version dependence or parameter dependence etc). My tentative definition here of "memory held by a thread" should probably be clarified/improved, but I think that it would be very useful to have some justification for the "group" nature of joint_matrix that is more precise and to the point than "matrix is shared among a group of work items and is not private to each work item", but remains concise. The connection between the "group" nature of joint_matrix with the element wise indexing can then be made which (from my understanding) is important for motivating the inclusion of the get_wi_data method of joint_matrix.
One big motivation for clarifying this is that in the CUDA Tensor Cores backend it is precisely known at compile time which work_items "hold" which registers: in the CUDA backend implementation of joint_matrix there is a register array (termed a "fragment" in the CUDA ptx ISA) assigned to each thread, representing the register memory "held" by that thread. In CUDA runtime and Julia implementations of the CUDA Tensor Cores backend, programmers are allowed to operate on these fragments (equivalent of fragment in joint_matrix is joint_matrix.data (in cuda backend)) directly. Users coming from these languages to SYCL/DPC++ might then wonder at the meaning and motivation for get_wi_data. A couple of precise sentences would be sufficient I think. Users can be directed to backend specifications for additional backend specific information.
The other big reason for clarifying this point is that it is helpful for contributors to this extension to know the constraints of backends other than those which they are implementing since the biggest challenge of these vendor agnostic oneapi extensions is for them to be sufficiently general to be portable for all backends, without adding unnecessary complications and optimizing for performance. It is for this reason that I think it is OK for these experimental extension documents to be a little overly verbose initially (so long as they are precise), since there is plenty of time to trim the fat when the extension proposal approaches the point of being ready to be a formal extension.

2.
The second distinct point is that the parts of the matrix "held" by the work_item (the "mapping") may not be known for a given backend. I think that this extension document should simply state that some backends (e.g. CUDA WMMA (definitely) and AMX (if I remember correctly)) do not know this mapping (at least at compile time), which means that "non-linear" (I understand the meaning of non-linear that you mean here but it should be clarified in the doc) operations that depend on more than one element of the matrix are Undefined Behaviour when acting on representations of the matrix from "get_wi_data", and instead require that joint_matrix_store is called to copy the matrix data back to shared/global memory prior to such "non-linear" operations. Again the spec reader can be referred to a backend spec document to find out whether such "non-linear" operations on the registers are supported by the backend or not, and the proper mapping that the user needs to be responsible for to get expected behaviour.

Information still to add to this document
Finally I think that this document should state clearly that wi_element holds a reference to the original joint_matrix that it was constructed from, so that users can use the "=" operator to update the element of the joint_matrix represented by the wi_element after the element wise operation. E.g. in the case of the example written in this document:

auto wi_data_c = matC.get_wi_data();
for (int i = 0; i < wi_data_c.length(); i++)
        wi_data_c[i] *= alpha;

Say something explicit like
"wi_data_c[i] OP rhs updates the corresponding matrix element in the joint_matrix matC."

I also think that, as discussed previously, we need to clarify when (including which backends support this) "linear" element-wise operations can act on an array representation of the WI "held" registers, represented by an marray or otherwise, and clearly state the interface used to achieve this. (The CUDA backend would ideally just act directly on joint_matrix.data (which is already represented as an marray) as described above, which is consistent with the CUDA and Julia programming languages, so the constraints for the API depend solely on the constraints imposed by other backends.)

for (int k = 0; k < K; k += tk) {
joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K, matrix_layout::row_major);
joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN*4, N*4, matrix_layout::packed_b); // VNNI
tC = joint_matrix_mad(sg, tA, tB, tC);
}
auto wi_slice_c = matC.get_wi_data();
for (int i = 0; i < wi_slice_c.length(); i++)
wi_slice_c[i] *= alpha; // The indexing here “i” is in the vector owned by a WI, not in the matrix C
Contributor

Suggested change
wi_slice_c[i] *= alpha; // The indexing here “i” is in the vector owned by a WI, not in the matrix C
wi_slice_c[i] *= alpha; // The indexing here "i" is in the vector owned by a WI, not in the matrix C

Minor nit: it looks like you cut and pasted this code from a Word document, which introduced non-ASCII quote characters. They should be changed to standard double-quote characters.

namespace sycl::ext::oneapi::experimental::matrix {
template <typename Group, typename T, size_t NumRows, size_t NumCols,
matrix_layout L, typename Tv>
void joint_matrix_fill(Group sg, joint_matrix<T, NumRows, NumCols, L, Group> &m, const Tv v);
Contributor

Suggested change
void joint_matrix_fill(Group sg, joint_matrix<T, NumRows, NumCols, L, Group> &m, const Tv v);
void joint_matrix_fill(Group sg, joint_matrix<T, NumRows, NumCols, L, Group> &m, Tv v);

It does not make sense to use const here when passing a parameter by value.
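For context, a hedged usage sketch of the joint_matrix_fill declaration quoted above (sg, tM, tN follow the earlier GEMM example and are assumed to be compile-time constants; this is illustrative, not normative):

```c++
// Initialize the accumulator to zero instead of loading it from memory.
joint_matrix<float, tM, tN> tC(sg);
joint_matrix_fill(sg, tC, 0.0f); // every element of tC becomes 0.0f
```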

namespace sycl::ext::oneapi::experimental::matrix {
template <typename Group, typename T, size_t NumRows, size_t NumCols,
matrix_layout L, typename Tv>
void joint_matrix_fill(Group sg, joint_matrix<T, NumRows, NumCols, L, Group> &m, const Tv v);
Contributor

Why are there separate types for T and Tv? Don't you want v to have the same type T?

Contributor Author (dkhaldi)

v can have a different type

Comment on lines 181 to 191
Besides matrix multiply and add, matrices are used in linear and non linear piece-wise operations. Activation functions are an example of element-wise operations. They can be linear like `ReLU` that, for each value `z`, returns the maximum between `z` and zero, or non linear like `Sigmoid` that calculates `1/(1+ exp(-z))`. Quantization that is needed for conversion between low precision types like `int8_t` and `fp32` uses piece-wise operations. For instance, quantized GEMM for `int8_t` is calculated using `A*B + sum_rows_A + sum_cols_B + scalar_zero_point`. `sum_rows_A` and `sum_cols_B` do not operate on elements of the matrix but on pieces: row in `sum_rows_A` and columns in `sum_cols_B`.

We explored multiple options to enable this feature in the matrix interface: 1) Allowing non-restrictive element indexing on the matrix elements would result into slow indexing on the GPU, 2) Operator overloading can represent only element-wise operations and not the operations on pieces (row, column, diagonal, etc) of the matrix. 3) Providing specific functions for these piece-wise operations can resolve some of the functions we know of today like the ones involved in quantization but it is not general to any problem that may occur in the future.

In order to be able to perform any piece-wise operation in a general and in an efficient way, we provide a mapping conversion function from the matrix domain that is owned by a group of work items to the portion that is owned by a work item. Besides, the WI data to joint matrix mapping coordinates information must be known.

Nvidia wmma interface added a new member to `fragment` class to designate the WI owned part of the matrix.
While this provides fast element indexing on the GPU compared to the non-restrictive option, the user does not know the mapping of the owned data to the original matrix.
However using the `mma` ptx instructions as opposed to the `wmma` ptx instructions the mapping is known. Knowing this mapping is important for the user to implement new operations like sum of rows of a matrix for quantized algorithms.

##### Solution: Explicit conversion with mapping from SIMD to SPMD
Contributor

The remaining paragraphs above seem inappropriate for an API specification. The audience for this document wants to know what this API does and how to use it. However, these paragraphs seem more like a justification for why this API was chosen vs. some other possibility. That's not really the purpose of this document. I'd suggest either removing them or moving them to a new section towards the bottom of the document titled something like "Background on the element indexing operations".

Contributor Author (dkhaldi)

I will move these into a background subsection but keep it within this section. Let's see if it looks better.

However using the `mma` ptx instructions as opposed to the `wmma` ptx instructions the mapping is known. Knowing this mapping is important for the user to implement new operations like sum of rows of a matrix for quantized algorithms.

##### Solution: Explicit conversion with mapping from SIMD to SPMD
We introduce a new function `get_wi_data` that provides a view of the portion of the matrix that is owned by the current WI.
Contributor

This sentence should be expanded to explain the purpose of this API better. The reader needs to understand that each work-item contains only a subset of the elements in the matrix. The sentence above sort of mentions this, but I think it could be clearer. For example:

The data elements in a joint_matrix are distributed across the work-items in the Group in an implementation-defined way, such that each work-item owns a unique subset of the data elements. An application can use the APIs in this section to access the data elements owned by each work-item. This is especially useful for algorithms that operate on each data element individually.

I think this last sentence could replace the first paragraph you have "Besides matrix multiply and add, matrices are used in linear ...". However, if you think there's more to say about when these APIs are useful, you could add some more sentences here explaining it.

Then finish up by saying something like:

The code listing below shows a synopsis of these new APIs.

Contributor Author (dkhaldi)

I already added more clarifications based on Jack's first review. I will add more based on your input as well. However, note that "such that each work-item owns a unique subset of the data elements" is not always true, as in the AMX case for instance. A matrix is allocated in the 2d register tile, which is a subgroup shared memory (a register in this case).

Contributor

However, note that "such that each work-item owns a unique subset of the data elements" is not always true, as in the AMX case for instance. A matrix is allocated in the 2d register tile, which is a subgroup shared memory (a register in this case).

Are you saying that when one work-item calls get_wi_data that it might get overlapping elements that are also returned from some other work-item's call to get_wi_data? If this is the case, I don't see how this API is very useful. For example, code like this would result in some elements being incremented twice:

auto wi_data_c = matC.get_wi_data();
for (int i = 0; i < wi_data_c.length(); i++)
  wi_data_c[i] += 1;

Contributor Author (dkhaldi)

Are you saying that when one work-item calls get_wi_data that it might get overlapping elements

No, this is not possible.

##### Solution: Explicit conversion with mapping from SIMD to SPMD
We introduce a new function `get_wi_data` that provides a view of the portion of the matrix that is owned by the current WI.

```c++
Contributor

The purpose of these code synopses is to show the API, not the implementation. Therefore, remove the function bodies and all the private data members. For example:

namespace sycl::ext::oneapi::experimental::matrix {

template <typename T, size_t NumRows, size_t NumCols,
          matrix_layout Layout = matrix_layout::row_major,
          typename Group = sycl::sub_group>
struct joint_matrix {
   wi_data<T, NumRows, NumCols, Layout, Group> get_wi_data();
};

/* ... */

} // namespace sycl::ext::oneapi::experimental::matrix

Contributor Author (dkhaldi)

Will do that, thanks

return wi_data<T, NumRows, NumCols, Layout, Group>(*this);
}
};
template <typename T, size_t NumRows, size_t NumCols, matrix_layout Layout, typename Group>
Contributor

Do wi_data and wi_element really need all these template parameters? It seems like it would be easier to use if the only template parameter was T. It seems like the other template parameters are only there because there is a private data member M (a reference to matrix). However, you only seem to use M.spvm in the function bodies. Could you instead just store the spvm member directly in wi_data and wi_element?

Contributor Author (dkhaldi)

Can you elaborate on your suggestion? I get what you want to do, but not how to do it.

Contributor

Sorry for the delayed response. I was OOO for about 2 months and just got back recently.

I was thinking that these types could be simplified to have fewer template parameters, so the goal would be to have an API like this:

template <typename T>
class wi_data {
  size_t length();
  wi_element<T> operator[](size_t i);
};

template <typename T>
class wi_element {
  operator T();
  wi_element &operator=(const T &rhs);
};

Very roughly, I was thinking that you could accomplish this by changing the private data member included in wi_data and wi_element. Currently, these both contain a reference to the joint matrix M. However, it seems like they only need to use M.spvm. Therefore, could you change the implementation to hold just the spvm like:

template <typename T>
class wi_data {
  /* not sure what type */ spvm;
public:
  size_t length() {return __spirv_JointMatrixWorkItemLengthINTEL(spvm);}
  wi_element<T> operator[](size_t i) {
    return wi_element<T>(spvm, i);
  }
};

template <typename T>
class wi_element {
  /* not sure what type */ spvm;
  std::size_t idx;

public:
  operator T() {
    return __spirv_VectorExtractDynamic(spvm, idx);
  }
  wi_element &operator=(const T &rhs) {
    spvm = __spirv_VectorInsertDynamic(spvm, rhs, idx);
    return *this;
  }
};

Contributor Author (dkhaldi)

@gmlueck
It looks like the spvm type also needs these template parameters, so I don't think we can reduce them:
__spv::__spirv_JointMatrixINTEL<
    T, NumRows, NumCols, spv_matrix_layout_traits<Layout>::value> *spvm;

Contributor

That's too bad.

class wi_data {
joint_matrix<T, NumRows, NumCols, Layout, Group> &M;
public:
wi_data(joint_matrix<T, NumRows, NumCols, Layout, Group> &Mat) : M(Mat){}
Contributor

I presume we do NOT want application code to construct a wi_data (or wi_element) directly? Instead, I presume we want application to call joint_matrix::get_wi_data to get the wi_data? If that is the case, these constructors should be private in the implementation, and joint_matrix should be a friend, so that it can construct the objects.

The code synopsis, then, would only list the public member functions:

template <typename T, size_t NumRows, size_t NumCols, matrix_layout Layout, typename Group>
class wi_data {
 public:
  size_t length();
  wi_element<T, NumRows, NumCols, Layout, Group> operator[](size_t i);
};

template <typename T, size_t NumRows, size_t NumCols, matrix_layout Layout, typename Group>
class wi_element {
 public:
  operator T();
  wi_element &operator=(const T &rhs);
};
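A minimal sketch of that private-constructor/friend arrangement, keeping the members from the synopsis above (illustrative only):

```c++
template <typename T, size_t NumRows, size_t NumCols, matrix_layout Layout, typename Group>
class wi_data {
  joint_matrix<T, NumRows, NumCols, Layout, Group> &M;
  // Private: constructed only by joint_matrix (a friend), typically via get_wi_data().
  wi_data(joint_matrix<T, NumRows, NumCols, Layout, Group> &Mat) : M(Mat) {}
  friend struct joint_matrix<T, NumRows, NumCols, Layout, Group>;
public:
  size_t length();
  wi_element<T, NumRows, NumCols, Layout, Group> operator[](size_t i);
};
```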

};
}
```
Contributor

After the code synopsis, there should be some description of the member functions. I'd suggest three tables, one for each class:

  • Table describing member functions of joint_matrix (get_wi_data)
  • Table describing member functions of wi_data
  • Table describing member functions of wi_element.

You can see an example here: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/proposed/sycl_ext_oneapi_device_global.asciidoc#representation-of-device-globals

(Scroll down to the table after the code synopsis.)

Contributor Author (dkhaldi)

Since we don't have that many members, I added the description in the text. Let me know if that looks sufficient.

```c++
auto wi_data_c = matC.get_wi_data();
for (int i = 0; i < wi_data_c.length(); i++)
wi_data_c[i] *= alpha; // Note that the indexing here “i” is in the vector owned by a WI, not in the matrix C
Contributor

Suggested change
wi_data_c[i] *= alpha; // Note that the indexing here “i” is in the vector owned by a WI, not in the matrix C
wi_data_c[i] *= alpha; // Note that the indexing here "i" is in the vector owned by a WI, not in the matrix C

@dkhaldi dkhaldi (Contributor Author) commented May 18, 2022

My interpretation of this information is that joint_matrix needs to take a group argument because either
a) some backends require that parts of the register memory assigned for the matrix are "held" by a particular work_item but which work_item "holds" which matrix register memory is not known at runtime
or
b) some backends do not assign matrix register memory to a particular work_item, and again a runtime function e.g. get_wi_data is required to assign register memory to a particular work_item.

The group argument is there because it is up to the implementation how to distribute the matrix among the work items, or not. The matrix can still be a shared entity among the work items, so the partitioning among the WIs is implementation defined. The SYCL implementation for Intel CPUs/GPUs uses JIT compilation, so we don't know what the target hardware/implementation is until runtime. AMX, DPAS, and Nvidia tensorcores do things differently. For instance, the AMX tile that holds the matrix is a 2d register that has no notion of work items. Once we convert to element-wise operations on the "regular" CPU, that is where the matrix-element-to-WI mapping occurs.

@dkhaldi dkhaldi (Contributor Author) commented May 18, 2022

"memory held by a thread" is that this refers to the memory in which the thread can perform element wise operations on within the SYCL SPMD programming model.

I like this definition; I will add it to the document to clarify the get_wi_data function.

@dkhaldi dkhaldi (Contributor Author) commented May 18, 2022

Users coming from these languages to SYCL/DPC++ might then wonder at the meaning and motivation for get_wi_data.

On the contrary, I view joint matrix as a compatible direct mapping to CUDA fragment.
When the user declares a fragment, they use the shape (M,N,K), which describes the warp-wide matrix tiles participating in the multiply-accumulate operation. So the fragment is really the warp matrix and not the WI portion of it.

In CUDA, fragment<matrix_a, 16, 16, 16, half, col_major> A;
is equivalent to
joint_matrix<half, 16, 16, col_major> A(sg); in SYCL.
This assumes that a warp is what corresponds to a subgroup in SYCL.

fragment.x is the interface that is used to get the WI portion.
So get_wi_data is very similar to x. The only differences are:

  • The length of x is known at compile time; it is not in SYCL because of the reasons above.
  • We intend to provide the mapping, as I note in the todo list. In the CUDA interface, there is no way to know the exact mapping from matrix element to x element.

Thank you for raising all these clarification issues; I will update the text accordingly.
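To illustrate the correspondence above, a hedged side-by-side sketch (two separate kernels; the CUDA side uses the standard nvcuda::wmma fragment API, the SYCL side uses the extension API from this document, and alpha is assumed to be a float visible in both scopes):

```c++
// CUDA (inside a warp): element-wise scaling through the fragment's x member.
nvcuda::wmma::fragment<nvcuda::wmma::accumulator, 16, 16, 16, float> c_frag;
for (int i = 0; i < c_frag.num_elements; i++)
  c_frag.x[i] *= alpha;

// SYCL (inside a sub-group): the same operation through get_wi_data.
joint_matrix<float, 16, 16> matC(sg);
auto wi_data_c = matC.get_wi_data();
for (int i = 0; i < wi_data_c.length(); i++)
  wi_data_c[i] *= alpha;
```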

@JackAKirk (Contributor)

My interpretation of this information is that joint_matrix needs to take a group argument because either
a) some backends require that parts of the register memory assigned for the matrix are "held" by a particular work_item but which work_item "holds" which matrix register memory is not known at runtime
or
b) some backends do not assign matrix register memory to a particular work_item, and again a runtime function e.g. get_wi_data is required to assign register memory to a particular work_item.

The group argument is there because it is up to the implementation how to distribute the matrix among the work items, or not. The matrix can still be a shared entity among the work items, so the partitioning among the WIs is implementation defined. The SYCL implementation for Intel CPUs/GPUs uses JIT compilation, so we don't know what the target hardware/implementation is until runtime. AMX, DPAS, and Nvidia tensorcores do things differently. For instance, the AMX tile that holds the matrix is a 2d register that has no notion of work items. Once we convert to element-wise operations on the "regular" CPU, that is where the matrix-element-to-WI mapping occurs.

Thanks for the clarification.

@dkhaldi dkhaldi (Contributor Author) commented May 19, 2022

The second distinct point is that the parts of the matrix "held" by the work_item (the "mapping") may not be known for a given backend. I think that this extension document should simply state that some backends (e.g. CUDA WMMA (definitely) and AMX (if I remember correctly)) do not know this mapping (at least at compile time), which means that "non-linear" (I understand the meaning of non-linear that you mean here but it should be clarified in the doc) operations that depend on more than one element of the matrix are Undefined Behaviour when acting on representations of the matrix from "get_wi_data", and instead require that joint_matrix_store is called to copy the matrix data back to shared/global memory prior to such "non-linear" operations. Again the spec reader can be referred to a backend spec document to find out whether such "non-linear" operations on the registers are supported by the backend or not, and the proper mapping that the user needs to be responsible for to get expected behaviour.

The problem is not really that with some backends you will not know the mapping. The problem is that the mapping is implementation defined and changes from one backend to the other. With joint matrix, we want to write, as much as possible, one code to run on these different backends. So if backend X tells me that a WI owns exactly one row of the matrix, for instance, code that relies on this will work only on that backend, and probably only for that version. The next version will probably change and implement something else, e.g. one WI now owns half of a row because the SG size or the number of hardware units increased.
auto data = C.get_wi_data();
for (int i = 0; i < data.length(); ++i) {
  sum_of_local_rows[row] += data[i];
}

Instead of this hard-coded mapping, we want to write general code, and this is only possible if we can query the mapping, so that the code does not have to change from one version to the other:

So for the mapping problem, since this mapping is implementation-defined, one of the proposals is to add runtime functions like:
auto data = C.get_wi_data();
for (int i = 0; i < data.length(); ++i) {
  auto [row, col] = data[i].get_coord();
  sum_of_local_rows[row] += data[i];
}
I will add this to the future API section and add more clarifying text.

@@ -244,6 +249,9 @@ IMPORTANT: In the current implementation, only the subgroup scope is supported.

IMPORTANT: The WI data to joint matrix mapping coordinates information is not implemented yet.

IMPORTANT: Since the current tensorcores implementation is AOT, it is possible to know how many elements are owned by each WI at compile time. so `wi_data` in this case can be of type `marray`. An additional interface will be provided for the tensorcores AOT backend.
@JackAKirk JackAKirk (Contributor) May 20, 2022

Clarification:

The current Tensor Cores implementation isn't AOT: it is JIT just at a lower level than DPC++: DPC++ compiles down to CUDA "PTX" asm, which can then be compiled down to a lower level asm called "SASS" at execution time. Despite this, the number of data elements in joint_matrix owned by each WI is known at compile time for all compilation versions. The thing that isn't known at compile time is the mapping between these WI data elements and the subgroup matrix owned by joint_matrix as a whole.
The other point is that, from the available information in this doc, I don't see that it follows that the WI owned elements can only be returned as an array if and only if "the number of data elements in joint_matrix owned by each WI is known at compile time". But this point is not important to dwell on because I think such an explanation is not really necessary for this "IMPORTANT" note: the point here is to make the user aware that a limited number of backends (Tensor Cores only atm) can return the WI portion of the matrix as an marray (although as specified earlier in the document the order of the marray elements does not have a defined mapping to the order of the matrix owned by the joint_matrix subgroup as a whole).

Suggested change
IMPORTANT: Since the current tensorcores implementation is AOT, it is possible to know how many elements are owned by each WI at compile time. so `wi_data` in this case can be of type `marray`. An additional interface will be provided for the tensorcores AOT backend.
IMPORTANT: The Tensor Cores implementation is capable of returning the complete set of matrix elements owned by a WI as an `marray`, which can be useful in cases where the user wishes to perform identical operations on every element of the matrix efficiently and conveniently, where SYCL math functions optimized for `marray` exist. An additional interface will be provided for the Tensor Cores backend to return the `marray`.

@dkhaldi dkhaldi (Contributor Author) May 21, 2022

When I say AOT, I am referring to the SYCL compilation flow. When users have to specify -fsycl-targets, this is AOT. JIT is when the compiler generates SPIR-V; at runtime the SPIR-V is translated to the target-specific offload binaries.

I will add a comment about the fact that the get_wi_data length cannot be constexpr because:
1- SYCL JIT compilation and partitioning among WIs is implementation defined. There is no way to provide a constexpr length that is true for every backend (it has to be the return value of a SPIR-V function, so it is not constexpr).
2- The SG size is not fixed (unlike in the CUDA backend, where the warp size is always 32).
3- Even if we solve the first two, since the AMX tile size can be a runtime variable, the AMX case will still have to return a variable length.

In the case of the CUDA backend, which is SYCL AOT compiled and where the SG size is known and fixed at 32, the additional marray capability will be provided.
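A hedged sketch of what that additional marray-based path might look like on the CUDA backend (the member name `data` follows the earlier discussion in this thread; the exact interface is still to be defined, so the names here are assumptions):

```c++
// Hypothetical, CUDA backend only: the per-WI element count is known at
// compile time, so the WI portion could be exposed directly as an marray.
auto &wi_elems = matC.data;                     // assumed marray member
for (std::size_t i = 0; i < wi_elems.size(); ++i)
  wi_elems[i] = sycl::fmax(wi_elems[i], 0.0f);  // element-wise ReLU
```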

@@ -233,7 +236,9 @@ public:
}
```

Example where each WI gets a vector in `wi_data_c`. Vectorization along the subgroup dimension will get enabled automatically to vectorize the contiguous portion of the matrix.
Example where each WI gets a vector in `wi_data_c`. Vectorization along the subgroup dimension will get enabled automatically to vectorize the contiguous portion of the matrix. Since `wi_data_c` constructs a view of the joint matrix `matC`, `wi_data_c[i] OP rhs` updates the corresponding matrix element in the joint_matrix `matC`.
Contributor

This sentence:

"Vectorization along the subgroup dimension will get enabled automatically to vectorize the contiguous portion of the matrix."

implies to me that WI portion of joint_matrix represents contiguous elements of the matrix (either row or column major), but this is not generally true: In the Tensor Cores backend it is rarely true: in most cases the WI owned portion of joint_matrix includes elements that are not contiguous. I'm not sure if there is any value to this sentence and I think it can be removed. Instead there could be:

Suggested change
Example where each WI gets a vector in `wi_data_c`. Vectorization along the subgroup dimension will get enabled automatically to vectorize the contiguous portion of the matrix. Since `wi_data_c` constructs a view of the joint matrix `matC`, `wi_data_c[i] OP rhs` updates the corresponding matrix element in the joint_matrix `matC`.
In the following example `wi_data_c` is a reference to the WI owned portion of the joint matrix `matC`. As such `wi_data_c[i] OP rhs` updates the corresponding matrix element in the joint_matrix `matC`.

Contributor Author (dkhaldi)

It does not imply that the WI portions of the joint matrix are contiguous. It implies a round-robin distribution among the WIs.
Let's take a matrix example of 3 rows x 2 cols and assume an SG size of 2, just for illustration. To ensure vectorization along the SG dimension, each WI should own 1 column, so you have a SIMD width of 2 (one row) in this case and 3 vector instructions (one per row). Does this make sense?
Does the CUDA backend generate vector code for the loop around WI portions?

@@ -182,14 +182,17 @@ Besides matrix multiply and add, matrices are used in linear and non linear piec

We explored multiple options to enable this feature in the matrix interface: 1) Allowing non-restrictive element indexing on the matrix elements would result into slow indexing on the GPU, 2) Operator overloading can represent only element-wise operations and not the operations on pieces (row, column, diagonal, etc) of the matrix. 3) Providing specific functions for these piece-wise operations can resolve some of the functions we know of today like the ones involved in quantization but it is not general to any problem that may occur in the future.

In order to be able to perform any piece-wise operation in a general and in an efficient way, we provide a mapping conversion function from the matrix domain that is owned by a group of work items to the portion that is owned by a work item. Besides, the WI data to joint matrix mapping coordinates information must be known.
In order to be able to perform any piece-wise operation in a general and efficient way, we provide a mapping conversion function from the matrix domain that is owned by a group of work items to the portion that is owned by each work item. This enables the WI to perform piece-wise operations on the matrix within the SYCL SPMD programming model. In joint matrix, it is up to the implementation to distribute the matrix among the work items or keep it shared. For instance, the matrix is a shared entity among the work items in the case of AMX where the AMX tile that holds the matrix data is a 2d register that is shared among the work items. Therefore the partitioning among the WIs is implementation defined. Here, for the purpose of piece-wise operations, the conversion to the SPMD model happens using the matrix elements to WI mapping. Besides, the WI data to joint matrix mapping coordinates information must be known in order to extract the relevant piece for operations like sum of rows of a matrix.

Nvidia wmma interface added a new member to `fragment` class to designate the WI owned part of the matrix.
@JackAKirk JackAKirk (Contributor) May 20, 2022

It would be good to remove this whole paragraph beginning "Nvidia wmma interface" because it (or something similar) is more appropriate for the CUDA backend spec.

Contributor Author (dkhaldi)

Okay

IMPORTANT: In the current implementation, only the subgroup scope is supported.

#### Element Indexing and Piece-Wise Operations
Besides matrix multiply and add, matrices are used in linear and non linear piece-wise operations. Activation functions are an example of element-wise operations. They can be linear like `ReLU` that, for each value `z`, returns the maximum between `z` and zero, or non linear like `Sigmoid` that calculates `1/(1+ exp(-z))`. Quantization that is needed for conversion between low precision types like `int8_t` and `fp32` uses piece-wise operations. For instance, quantized GEMM for `int8_t` is calculated using `A*B + sum_rows_A + sum_cols_B + scalar_zero_point`. `sum_rows_A` and `sum_cols_B` do not operate on elements of the matrix but on pieces: row in `sum_rows_A` and columns in `sum_cols_B`.
@JackAKirk JackAKirk (Contributor) May 20, 2022

I think all that needs to be stated is that:

Suggested change
Besides matrix multiply and add, matrices are used in linear and non linear piece-wise operations. Activation functions are an example of element-wise operations. They can be linear like `ReLU` that, for each value `z`, returns the maximum between `z` and zero, or non linear like `Sigmoid` that calculates `1/(1+ exp(-z))`. Quantization that is needed for conversion between low precision types like `int8_t` and `fp32` uses piece-wise operations. For instance, quantized GEMM for `int8_t` is calculated using `A*B + sum_rows_A + sum_cols_B + scalar_zero_point`. `sum_rows_A` and `sum_cols_B` do not operate on elements of the matrix but on pieces: row in `sum_rows_A` and columns in `sum_cols_B`.
Besides matrix multiply and add, this extension aims to make it possible to perform piece-wise operations on matrices in a SPMD manner. The mechanisms that are recommended to perform such piece-wise operations depend upon which of the following classes the operation falls into:
Class "1". Element-wise operations that are performed identically on every element of the matrix.
Class "2". Element-wise operations that depend on the element index of the matrix or operations that take multiple elements as operands (such as a sum of all elements in a row for example).
This extension currently only supports case 1). However a proposal for supporting 2) (for some backends) in the future is provided in a later section.

Then continue with the explanation of how case 1) is dealt with. Case 2) seems to have been considered in section "### WI data to joint matrix mapping coordinates information for piece-wise operations" and requires that the backend knows the mapping from "joint_matrix Domain" to "WI Domain".

@JackAKirk JackAKirk (Contributor) May 20, 2022

Basically, case 1) doesn't require a mapping between get_wi_data and joint_matrix, but case 2) does.

Contributor Author (dkhaldi)

I will make the change, thanks

#### Element Indexing and Piece-Wise Operations
Besides matrix multiply and add, matrices are used in linear and non linear piece-wise operations. Activation functions are an example of element-wise operations. They can be linear like `ReLU` that, for each value `z`, returns the maximum between `z` and zero, or non linear like `Sigmoid` that calculates `1/(1+ exp(-z))`. Quantization that is needed for conversion between low precision types like `int8_t` and `fp32` uses piece-wise operations. For instance, quantized GEMM for `int8_t` is calculated using `A*B + sum_rows_A + sum_cols_B + scalar_zero_point`. `sum_rows_A` and `sum_cols_B` do not operate on elements of the matrix but on pieces: row in `sum_rows_A` and columns in `sum_cols_B`.

We explored multiple options to enable this feature in the matrix interface: 1) Allowing non-restrictive element indexing on the matrix elements would result into slow indexing on the GPU, 2) Operator overloading can represent only element-wise operations and not the operations on pieces (row, column, diagonal, etc) of the matrix. 3) Providing specific functions for these piece-wise operations can resolve some of the functions we know of today like the ones involved in quantization but it is not general to any problem that may occur in the future.
@JackAKirk JackAKirk (Contributor) May 20, 2022

Probably the paragraph beginning "We explored" can also be removed, I think.

Contributor Author (dkhaldi)

Okay


We explored multiple options to enable this feature in the matrix interface: 1) Allowing non-restrictive element indexing on the matrix elements would result into slow indexing on the GPU, 2) Operator overloading can represent only element-wise operations and not the operations on pieces (row, column, diagonal, etc) of the matrix. 3) Providing specific functions for these piece-wise operations can resolve some of the functions we know of today like the ones involved in quantization but it is not general to any problem that may occur in the future.

In order to be able to perform any piece-wise operation in a general and efficient way, we provide a mapping conversion function from the matrix domain that is owned by a group of work items to the portion that is owned by each work item. This enables the WI to perform piece-wise operations on the matrix within the SYCL SPMD programming model. In joint matrix, it is up to the implementation to distribute the matrix among the work items or keep it shared. For instance, the matrix is a shared entity among the work items in the case of AMX where the AMX tile that holds the matrix data is a 2d register that is shared among the work items. Therefore the partitioning among the WIs is implementation defined. Here, for the purpose of piece-wise operations, the conversion to the SPMD model happens using the matrix elements to WI mapping. Besides, the WI data to joint matrix mapping coordinates information must be known in order to extract the relevant piece for operations like sum of rows of a matrix.
@JackAKirk JackAKirk (Contributor) May 20, 2022

The AMX example is a good addition here. I think it can be made a bit more clear and precise:

Suggested change
In order to be able to perform any piece-wise operation in a general and efficient way, we provide a mapping conversion function from the matrix domain that is owned by a group of work items to the portion that is owned by each work item. This enables the WI to perform piece-wise operations on the matrix within the SYCL SPMD programming model. In joint matrix, it is up to the implementation to distribute the matrix among the work items or keep it shared. For instance, the matrix is a shared entity among the work items in the case of AMX where the AMX tile that holds the matrix data is a 2d register that is shared among the work items. Therefore the partitioning among the WIs is implementation defined. Here, for the purpose of piece-wise operations, the conversion to the SPMD model happens using the matrix elements to WI mapping. Besides, the WI data to joint matrix mapping coordinates information must be known in order to extract the relevant piece for operations like sum of rows of a matrix.
There is no fixed allocation of matrix elements held by a `joint_matrix` instance to the WIs comprising the group used to instantiate it. For instance, the matrix is a shared entity among the WIs in the case of the AMX backend because the AMX tile that holds the matrix data is a 2d register that is shared among the work items. However, it is necessary to allocate WIs to specific elements of the matrix. In order to be able to perform piece-wise operations in a general and efficient way, we provide a conversion function from the joint_matrix domain that is owned by a group of work items to the portion that is owned by each work item. This enables the WI to perform piece-wise operations on the matrix within the SYCL SPMD programming model. The mapping between the matrix data owned by a WI and the matrix data owned by the joint matrix may not be known for all backends. As such, additional conditions are necessary for class "2" piece-wise operations that we propose in a later section...

Contributor Author (dkhaldi)

Will do, thanks

@JackAKirk (Contributor)

"memory held by a thread" is that this refers to the memory in which the thread can perform element wise operations on within the SYCL SPMD programming model.

I like this definition; I will add it to the document to clarify the get_wi_data function.

Currently "owned" is used in the text with the same meaning as "held" above. Probably it would be best to use only one of "owned" or "held" consistently.

##### Background
Besides matrix multiply and add, this extension aims to make it possible to perform piece-wise operations on matrices in a SPMD manner. The mechanisms that are recommended to perform such piece-wise operations depend upon which of the following classes the operation falls into:

Class 1- Element-wise operations where the same operation is performed on every element of the matrix. Activation functions are an example of element-wise operations.
Contributor

Suggested change
Class 1- Element-wise operations where the same operation is performed on every element of the matrix. Activation functions are an example of element-wise operations.
Class 1- Unary element-wise operations where the same operation is performed on every element of the matrix, such that the operation can be performed without knowledge of the position of the element within the matrix. Activation functions or adding a constant value to every element of the matrix are two examples.

@dkhaldi dkhaldi (Contributor Author) commented Jun 8, 2022

ping for more reviews/approval

@JackAKirk (Contributor)

ping for more reviews/approval

LGTM for merging: on the understanding it is still far from a final version.

I think it would be good to merge this update because it will be very useful for anyone who wants to learn about the extension: although not perfect, it makes things a lot clearer and more complete.

It is up to date with the current state of the implementation (although some temporary differences between CUDA and Intel implementations are not mentioned: they can be added later by us). Ideally the structure and presentation of the information could still be improved but I don't think it is a big problem if this is left for a more final version.

@JackAKirk JackAKirk self-requested a review June 9, 2022 18:13
@dkhaldi dkhaldi (Contributor Author) commented Jun 14, 2022

Any more review for this?

@bader bader requested a review from rolandschulz July 4, 2022 10:13
@dkhaldi dkhaldi (Contributor Author) commented Aug 19, 2022

@gmlueck, are there more comments on this?
It would be good to merge this soon so the doc is ready for these new features.

@dkhaldi dkhaldi (Contributor Author) commented Aug 19, 2022

@bader, can you please help merge this?

@bader bader merged commit 770f540 into sycl Aug 20, 2022
@bader bader deleted the fill-piece-wise-ops branch August 20, 2022 06:36