Skip to content

[SYCL][Joint Matrix Spec] Add new API for out of bounds fill/load/store #11172

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Mar 19, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -203,6 +203,261 @@ In the case of `ext_intel_packed` matrix memory layout, `row` and
`col` represent the coordinates in the logical matrix before VNNI
transformation.

=== Load/Store/Fill With Out-of-Bounds Checks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand how this API can work. We might need to schedule a meeting where we can draw some pictures.

Wouldn't it make more sense to require the user to set Width and Height according to the remaining amount of memory in the "big matrix"? That way, the implementation simply checks to see if the template parameter Rows is greater than Height and if the template parameter Cols is greater than Width. If they are both less, the operations behave like normal loads and stores. If they are less, then the load sets the out-of-range elements to zero and the store ignores the out-of-range elements.

From the description, though, it doesn't seem like the APIs work this way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check is done compared to the big matrix size comparing the address to be loaded (ptr) with the big matrix size (Height). If ptr > Height, the element should not be loaded.
if the remainder is passed, how is the implementation going to compare with it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With your recent change that adds coordX and coordY parameters, I see how this could work for the load and store APIs. However, I still don't see how the fill API can work.

This API structure seems confusing to me, though, and probably less efficient. I suspect users will get confused that the src pointer to joint_matrix_load is completely different from the src_base pointer to joint_matrix_load_checked. It seems like it would be less confusing and more efficient to have the user pass values for Width and Height that are the remaining amount of memory in the big matrix. Maybe this would be easier to explain in a meeting where we can draw pictures?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's discuss fill and remainder approach in a meeting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I unresolve this comment as there was no follow up and I probably has missed the meeting? :) I also don't understand, why we need coordinates for fill

The APIs in this section may be used only on a device that has
`aspect::ext_intel_matrix_checked`. The application must check that
the device has this aspect before submitting a kernel using any of the
APIs in this section. If the application fails to do this, the
implementation throws a synchronous exception with the
`errc::kernel_not_supported` error code when the kernel is submitted
to the queue.

==== New Aspect for Checked Matrix APIs
This extension adds a new device aspect:
```c++
namespace sycl {

enum class aspect : /*unspecified*/ {
ext_intel_matrix_checked
};

} // namespace sycl
```
The `ext_intel_matrix_checked` aspect indicates that the device is capable of
supporting the out of bounds checked APIs that are defined in this section.

==== Introduction
In this section, we refer to the memory buffer where a `joint_matrix`
is loaded from or stored to as the global matrix. This global matrix
is also interpreted as a two-dimensional memory region as follows, where
`GlobalRows` is number of rows in the global matrix, `GlobalCols` is number of
columns in the global matrix, `Stride` is number of columns that include
the out of bounds data (depicted as x here).

```
GlobalCols
<----------->
dddddddddddddxxx ^
dddddddddddddxxx | GlobalRows
dddddddddddddxxx v
xxxxxxxxxxxxxxxx
<-------------->
Stride
```

In the diagram above, the global matrix has 13 columns and 3
rows. This is padded out to be evenly divisible by a joint matrix with
8 columns and 2 rows, which results in a stride of 16.

Note that joint matrix shape `Rows` and `Cols` represents a sub-block
of the picture above. The out of bounds data results when the global
matrix size is not evenly divisible by the joint matrix size.

==== Checked APIs
When an algorithm iterates over the global matrix, it loads or stores
elements that correspond to a joint matrix. When the global matrix
size does not evenly divide by the joint matrix size, some of these
loads or stores access the extra elements marked "x" in the diagram
above. The standard joint matrix functions (`joint_matrix_load`,
`joint_matrix_store` and `joint_matrix_fill`) do not do any bounds
checking in this case, so they simply load or store to these extra
elements. This could cause unexpected values to be loaded into the
joint matrix for these elements. These functions could also cause a
memory fault if the extra elements are not valid addresses.

The checked APIs described below do not attempt to access the extra
memory. The checked load is guaranteed to return 0 for the extra
elements, and the checked store simply ignores stores to the extra
elements. Neither function will cause a memory fault if the extra
elements correspond to invalid addresses.

These functions are similar to the existing ones without bounds
checking, namely `joint_matrix_fill`, `joint_matrix_load`, and
`joint_matrix_store`. But they are different in three ways:

* The pointer `base_src` or `base_dest` designates the base pointer of
the global memory matrix, which is different from the APIs that do not
do bounds checking. Those non-bounds-checking APIs take a pointer to
the base of the joint matrix.
* The coordinates `RowIndex` and `ColIndex` into the global matrix to
calculate the pointer offset to load/store are given as separate
arguments.
* These variants take extra arguments to determine the global bounds
`GlobalRows` and `GlobalCols` of the global matrix.

To illustrate the out-of-bounds checking, consider the global matrix
shown above which has 13 columns and 3 rows (`GlobalRows=3` and
`GlobalCols=13`), where the joint matrix size is 8 columns by 2 rows defined as
```
joint_matrix<sub_group, bfloat16, use::b, 2, 8, layout::row_major> sub_b;
```
The load of the joint matrix at coordinate [8, 2] (column number 8,
row number 2 in the global matrix), overlaps the extra elements in
both dimensions. This is shown below, where capital letters correspond
to the elements that are accessed by this joint matrix load:

```
GlobalCols
<----------->
dddddddddddddxxx ^
dddddddddddddxxx | GlobalRows
ddddddddDDDDDXXX v
xxxxxxxxXXXXXXXX
<-------------->
Stride
```

If the joint matrix is loaded via `joint_matrix_load_checked` using
```
joint_matrix_load_checked(sg, sub_b, base_src, 16, 3, 13, 2, 8);
```
the extra elements that are shown with capital `X` are not accessed in
memory, and those elements are guaranteed to have the value zero in
the joint matrix after the load operation completes.

```c++
namespace sycl::ext::intel::experimental::matrix {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fda0 does this change address the missing arguments to generate efficiently 2d load/store in IGC?

Copy link
Contributor Author

@dkhaldi dkhaldi Oct 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fda0 Yes I changed it to add the two offset coordinates separately and excluded them from the base pointer calculation.

template <typename Group, typename T, size_t Rows, size_t Cols,
use Use, layout Layout, typename Tv>
void joint_matrix_fill_checked(Group g, joint_matrix<Group, T, Use, Rows,
Cols, Layout> &m, Tv v, size_t GlobalRows, size_t GlobalCols,
size_t RowIndex, size_t ColIndex);

// Only available when std::is_same_v<T1, std::remove_const_t<T2>>
template <typename Group, typename T1, typename T2,
size_t Rows, size_t Cols,
access::address_space Space, access::decorated IsDecorated>
void joint_matrix_load_checked(Group g,
joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res,
multi_ptr<T2, Space, IsDecorated> base_src, size_t Stride,
layout Layout, size_t GlobalRows, size_t GlobalCols,
size_t RowIndex, size_t ColIndex);

// Only available when Layout != layout::dynamic
// and when std::is_same_v<T1, std::remove_const_t<T2>>
template <typename Group, typename T1, typename T2,
size_t Rows, size_t Cols,
use Use, layout Layout,
access::address_space Space, access::decorated IsDecorated>
void joint_matrix_load_checked(Group g,
joint_matrix<Group, T1, Use, Rows, Cols, Layout> &res,
multi_ptr<T2, Space, IsDecorated> base_src, size_t Stride,
size_t GlobalRows, size_t GlobalCols, size_t RowIndex, size_t ColIndex);

// Only available when std::is_same_v<T1, std::remove_const_t<T2>>
template <typename Group, typename T1, typename T2,
size_t Rows, size_t Cols, typename PropertyListT>
void joint_matrix_load_checked(Group g,
joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res,
ext::oneapi::experimental::annotated_ptr<T2, PropertyListT> base_src,
size_t Stride, layout Layout, size_t GlobalRows, size_t GlobalCols,
size_t RowIndex, size_t ColIndex);

// Only available when Layout != layout::dynamic
// and when std::is_same_v<T1, std::remove_const_t<T2>>
template <typename Group, typename T1, typename T2, size_t Rows,
size_t Cols, use Use, layout Layout, typename PropertyListT>
void joint_matrix_load_checked(Group g,
joint_matrix<Group, T1, Use, Rows, Cols, Layout> &res,
ext::oneapi::experimental::annotated_ptr<T2, PropertyListT> base_src,
size_t Stride, size_t GlobalRows, size_t GlobalCols,
size_t RowIndex, size_t ColIndex);

template <typename Group, typename T, size_t Rows, size_t Cols,
access::address_space Space, access::decorated IsDecorated>
void joint_matrix_store_checked(Group g,
const joint_matrix<Group, T, use::accumulator, Rows, Cols, layout::dynamic> &res,
multi_ptr<T, Space, IsDecorated> base_dest, size_t Stride, layout Layout,
size_t GlobalRows, size_t GlobalCols, size_t RowIndex, size_t ColIndex);

template <typename Group, typename T, size_t Rows, size_t Cols,
layout Layout, access::address_space Space,
access::decorated IsDecorated>
void joint_matrix_store_checked(Group g,
const joint_matrix<Group, T, use::a, Rows, Cols, Layout> &res,
multi_ptr<T, Space, IsDecorated> base_dest, size_t Stride,
size_t GlobalRows, size_t GlobalCols, size_t RowIndex, size_t ColIndex);

template <typename Group, typename T, size_t Rows, size_t Cols,
layout Layout, access::address_space Space,
access::decorated IsDecorated>
void joint_matrix_store_checked(Group g,
const joint_matrix<Group, T, use::b, Rows, Cols, Layout> &res,
multi_ptr<T, Space, IsDecorated> base_dest, size_t Stride,
size_t GlobalRows, size_t GlobalCols, size_t RowIndex, size_t ColIndex);

template <typename Group, typename T, size_t Rows, size_t Cols,
typename PropertyListT>
void joint_matrix_store_checked(Group g,
const joint_matrix<Group, T, use::accumulator, Rows, Cols, layout::dynamic> &res,
ext::oneapi::experimental::annotated_ptr<T, PropertyListT> base_dest,
size_t Stride, layout Layout, size_t GlobalRows, size_t GlobalCols,
size_t RowIndex, size_t ColIndex);

template <typename Group, typename T, size_t Rows, size_t Cols,
layout Layout, typename PropertyListT>
void joint_matrix_store_checked(Group g,
const joint_matrix<Group, T, use::a, Rows, Cols, Layout> &res,
ext::oneapi::experimental::annotated_ptr<T, PropertyListT> base_dest,
size_t Stride, size_t GlobalRows, size_t GlobalCols,
size_t RowIndex, size_t ColIndex);

template <typename Group, typename T, size_t Rows, size_t Cols,
layout Layout, typename PropertyListT>
void joint_matrix_store_checked(Group g,
const joint_matrix<Group, T, use::b, Rows, Cols, Layout> &res,
ext::oneapi::experimental::annotated_ptr<T, PropertyListT> base_dest,
size_t Stride, size_t GlobalRows, size_t GlobalCols,
size_t RowIndex, size_t ColIndex);

} // namespace sycl::ext::intel::experimental::matrix
```

The property list associated with the `annotated_ptr` argument
represents the compile-time constant properties for cache control included
in the SYCL extenion
link:../../proposed/sycl_ext_intel_cache_controls.asciidoc[sycl_ext_intel_cache_controls].

==== Restrictions and Device Information Descriptors
Applications must adhere to certain alignment restrictions when using
the checked APIs described in this section. This extension provides
the following queries to get these requirements:

[frame="none",options="header"]
|======================
| Device descriptors | Return type| Description
|`ext::intel::experimental::info::device::matrix_checked_alignment`| `size_t`
|Tells the required alignment (in bytes) of the base pointer for
`joint_matrix_load_checked` and `joint_matrix_store_checked`.
|`ext::intel::experimental::info::device::matrix_checked_rowindex_multiple_of<T>`|
`size_t`|Returns a value, of which `RowIndex` must be multiple of;
where `T` is the element type of the matrix. When using the matrices
with the machine learning types, `T` should be the element type
(e.g. `precision::tf32`) not the storage type.
|`ext::intel::experimental::info::device::matrix_checked_globalcols_multiple_of<T>`|
`size_t` | Returns a value, of which `GlobalCols` must be multiple of;
where `T` is the element type of the matrix. When using the matrices
with the machine learning types, `T` should be the element type
(e.g. `precision::tf32`) not the storage type.
|======================

==== Appendix: Restrictions Per Hardware
===== Intel XMX
The checked APIs are currently available in devices with the architecture
`architecture::intel_gpu_pvc`. The following restrictions apply to
these checked APIs:

- The base pointer must be 4 bytes aligned.

- For 8 bits data type, `RowIndex` must be a multiple of 4. For 16 bits
data type, `RowIndex` must be a multiple of 2. So `RowIndex` must be a
multiple of 4 divided by size of the element type (`4/sizeof(T)`).

- For 8 bits data type, `GlobalCols` must be a multiple of 4. For 16 bits
data type, `GlobalCols` must be a multiple of 2. So `GlobalCols` must be a
multiple of 4 divided by size of the element type (`4/sizeof(T)`).

=== New Device Information Descriptor
Besides the query we provide in
link:sycl_ext_oneapi_matrix.asciidoc[sycl_ext_oneapi_matrix],
Expand Down Expand Up @@ -334,4 +589,6 @@ q.wait();
|Rev |Date |Author |Changes
|1 |2022-11-07 |Dounia Khaldi |Add Intel-specific store API,
layout information, and `joint_matrix_apply` with coordinates API
|2 |2023-10-19 |Dounia Khaldi |Add Intel-specific out-of-bounds
load/store/fill APIs
|======================