-
Notifications
You must be signed in to change notification settings - Fork 787
[SYCL][Joint Matrix Spec] Add new API for out of bounds fill/load/store #11172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
36cbd73
f0c0610
d9608a2
8c197c6
d041a73
f7fdefe
0039eb0
908c321
4ae7c46
ee364e4
07e72ab
e9886ae
3354607
df2c2d4
95cdb68
166ec96
667ba34
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -203,6 +203,261 @@ In the case of `ext_intel_packed` matrix memory layout, `row` and | |
`col` represent the coordinates in the logical matrix before VNNI | ||
transformation. | ||
|
||
=== Load/Store/Fill With Out-of-Bounds Checks | ||
The APIs in this section may be used only on a device that has | ||
`aspect::ext_intel_matrix_checked`. The application must check that | ||
the device has this aspect before submitting a kernel using any of the | ||
APIs in this section. If the application fails to do this, the | ||
implementation throws a synchronous exception with the | ||
`errc::kernel_not_supported` error code when the kernel is submitted | ||
to the queue. | ||
|
||
==== New Aspect for Checked Matrix APIs | ||
This extension adds a new device aspect: | ||
```c++ | ||
namespace sycl { | ||
|
||
enum class aspect : /*unspecified*/ { | ||
ext_intel_matrix_checked | ||
}; | ||
|
||
} // namespace sycl | ||
``` | ||
The `ext_intel_matrix_checked` aspect indicates that the device is capable of | ||
supporting the out of bounds checked APIs that are defined in this section. | ||
|
||
==== Introduction | ||
In this section, we refer to the memory buffer where a `joint_matrix` | ||
is loaded from or stored to as the global matrix. This global matrix | ||
is also interpreted as a two-dimensional memory region as follows, where | ||
`GlobalRows` is number of rows in the global matrix, `GlobalCols` is number of | ||
columns in the global matrix, `Stride` is number of columns that include | ||
the out of bounds data (depicted as x here). | ||
|
||
``` | ||
GlobalCols | ||
<-----------> | ||
dddddddddddddxxx ^ | ||
dddddddddddddxxx | GlobalRows | ||
dddddddddddddxxx v | ||
xxxxxxxxxxxxxxxx | ||
<--------------> | ||
Stride | ||
``` | ||
|
||
In the diagram above, the global matrix has 13 columns and 3 | ||
rows. This is padded out to be evenly divisible by a joint matrix with | ||
8 columns and 2 rows, which results in a stride of 16. | ||
|
||
Note that joint matrix shape `Rows` and `Cols` represents a sub-block | ||
of the picture above. The out of bounds data results when the global | ||
matrix size is not evenly divisible by the joint matrix size. | ||
|
||
==== Checked APIs | ||
When an algorithm iterates over the global matrix, it loads or stores | ||
elements that correspond to a joint matrix. When the global matrix | ||
size does not evenly divide by the joint matrix size, some of these | ||
loads or stores access the extra elements marked "x" in the diagram | ||
above. The standard joint matrix functions (`joint_matrix_load`, | ||
`joint_matrix_store` and `joint_matrix_fill`) do not do any bounds | ||
checking in this case, so they simply load or store to these extra | ||
elements. This could cause unexpected values to be loaded into the | ||
joint matrix for these elements. These functions could also cause a | ||
memory fault if the extra elements are not valid addresses. | ||
|
||
The checked APIs described below do not attempt to access the extra | ||
memory. The checked load is guaranteed to return 0 for the extra | ||
elements, and the checked store simply ignores stores to the extra | ||
elements. Neither function will cause a memory fault if the extra | ||
elements correspond to invalid addresses. | ||
|
||
These functions are similar to the existing ones without bounds | ||
checking, namely `joint_matrix_fill`, `joint_matrix_load`, and | ||
`joint_matrix_store`. But they are different in three ways: | ||
|
||
* The pointer `base_src` or `base_dest` designates the base pointer of | ||
the global memory matrix, which is different from the APIs that do not | ||
do bounds checking. Those non-bounds-checking APIs take a pointer to | ||
the base of the joint matrix. | ||
* The coordinates `RowIndex` and `ColIndex` into the global matrix to | ||
calculate the pointer offset to load/store are given as separate | ||
arguments. | ||
* These variants take extra arguments to determine the global bounds | ||
`GlobalRows` and `GlobalCols` of the global matrix. | ||
|
||
To illustrate the out-of-bounds checking, consider the global matrix | ||
shown above which has 13 columns and 3 rows (`GlobalRows=3` and | ||
`GlobalCols=13`), where the joint matrix size is 8 columns by 2 rows defined as | ||
``` | ||
joint_matrix<sub_group, bfloat16, use::b, 2, 8, layout::row_major> sub_b; | ||
``` | ||
The load of the joint matrix at coordinate [8, 2] (column number 8, | ||
row number 2 in the global matrix), overlaps the extra elements in | ||
both dimensions. This is shown below, where capital letters correspond | ||
to the elements that are accessed by this joint matrix load: | ||
|
||
``` | ||
GlobalCols | ||
<-----------> | ||
dddddddddddddxxx ^ | ||
dddddddddddddxxx | GlobalRows | ||
ddddddddDDDDDXXX v | ||
xxxxxxxxXXXXXXXX | ||
<--------------> | ||
Stride | ||
``` | ||
|
||
If the joint matrix is loaded via `joint_matrix_load_checked` using | ||
``` | ||
joint_matrix_load_checked(sg, sub_b, base_src, 16, 3, 13, 2, 8); | ||
``` | ||
the extra elements that are shown with capital `X` are not accessed in | ||
memory, and those elements are guaranteed to have the value zero in | ||
the joint matrix after the load operation completes. | ||
|
||
```c++ | ||
namespace sycl::ext::intel::experimental::matrix { | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @fda0 does this change address the missing arguments to generate efficiently 2d load/store in IGC? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @fda0 Yes I changed it to add the two offset coordinates separately and excluded them from the base pointer calculation. |
||
template <typename Group, typename T, size_t Rows, size_t Cols, | ||
use Use, layout Layout, typename Tv> | ||
void joint_matrix_fill_checked(Group g, joint_matrix<Group, T, Use, Rows, | ||
Cols, Layout> &m, Tv v, size_t GlobalRows, size_t GlobalCols, | ||
size_t RowIndex, size_t ColIndex); | ||
|
||
// Only available when std::is_same_v<T1, std::remove_const_t<T2>> | ||
template <typename Group, typename T1, typename T2, | ||
size_t Rows, size_t Cols, | ||
access::address_space Space, access::decorated IsDecorated> | ||
void joint_matrix_load_checked(Group g, | ||
joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res, | ||
multi_ptr<T2, Space, IsDecorated> base_src, size_t Stride, | ||
layout Layout, size_t GlobalRows, size_t GlobalCols, | ||
size_t RowIndex, size_t ColIndex); | ||
|
||
// Only available when Layout != layout::dynamic | ||
// and when std::is_same_v<T1, std::remove_const_t<T2>> | ||
template <typename Group, typename T1, typename T2, | ||
size_t Rows, size_t Cols, | ||
use Use, layout Layout, | ||
access::address_space Space, access::decorated IsDecorated> | ||
void joint_matrix_load_checked(Group g, | ||
joint_matrix<Group, T1, Use, Rows, Cols, Layout> &res, | ||
multi_ptr<T2, Space, IsDecorated> base_src, size_t Stride, | ||
size_t GlobalRows, size_t GlobalCols, size_t RowIndex, size_t ColIndex); | ||
|
||
// Only available when std::is_same_v<T1, std::remove_const_t<T2>> | ||
template <typename Group, typename T1, typename T2, | ||
size_t Rows, size_t Cols, typename PropertyListT> | ||
void joint_matrix_load_checked(Group g, | ||
joint_matrix<Group, T1, use::accumulator, Rows, Cols, layout::dynamic> &res, | ||
ext::oneapi::experimental::annotated_ptr<T2, PropertyListT> base_src, | ||
size_t Stride, layout Layout, size_t GlobalRows, size_t GlobalCols, | ||
size_t RowIndex, size_t ColIndex); | ||
|
||
// Only available when Layout != layout::dynamic | ||
// and when std::is_same_v<T1, std::remove_const_t<T2>> | ||
template <typename Group, typename T1, typename T2, size_t Rows, | ||
size_t Cols, use Use, layout Layout, typename PropertyListT> | ||
void joint_matrix_load_checked(Group g, | ||
joint_matrix<Group, T1, Use, Rows, Cols, Layout> &res, | ||
ext::oneapi::experimental::annotated_ptr<T2, PropertyListT> base_src, | ||
size_t Stride, size_t GlobalRows, size_t GlobalCols, | ||
size_t RowIndex, size_t ColIndex); | ||
|
||
template <typename Group, typename T, size_t Rows, size_t Cols, | ||
access::address_space Space, access::decorated IsDecorated> | ||
void joint_matrix_store_checked(Group g, | ||
const joint_matrix<Group, T, use::accumulator, Rows, Cols, layout::dynamic> &res, | ||
multi_ptr<T, Space, IsDecorated> base_dest, size_t Stride, layout Layout, | ||
size_t GlobalRows, size_t GlobalCols, size_t RowIndex, size_t ColIndex); | ||
|
||
template <typename Group, typename T, size_t Rows, size_t Cols, | ||
layout Layout, access::address_space Space, | ||
access::decorated IsDecorated> | ||
void joint_matrix_store_checked(Group g, | ||
const joint_matrix<Group, T, use::a, Rows, Cols, Layout> &res, | ||
multi_ptr<T, Space, IsDecorated> base_dest, size_t Stride, | ||
size_t GlobalRows, size_t GlobalCols, size_t RowIndex, size_t ColIndex); | ||
|
||
template <typename Group, typename T, size_t Rows, size_t Cols, | ||
layout Layout, access::address_space Space, | ||
access::decorated IsDecorated> | ||
void joint_matrix_store_checked(Group g, | ||
const joint_matrix<Group, T, use::b, Rows, Cols, Layout> &res, | ||
multi_ptr<T, Space, IsDecorated> base_dest, size_t Stride, | ||
size_t GlobalRows, size_t GlobalCols, size_t RowIndex, size_t ColIndex); | ||
|
||
template <typename Group, typename T, size_t Rows, size_t Cols, | ||
typename PropertyListT> | ||
void joint_matrix_store_checked(Group g, | ||
const joint_matrix<Group, T, use::accumulator, Rows, Cols, layout::dynamic> &res, | ||
ext::oneapi::experimental::annotated_ptr<T, PropertyListT> base_dest, | ||
size_t Stride, layout Layout, size_t GlobalRows, size_t GlobalCols, | ||
size_t RowIndex, size_t ColIndex); | ||
|
||
template <typename Group, typename T, size_t Rows, size_t Cols, | ||
layout Layout, typename PropertyListT> | ||
void joint_matrix_store_checked(Group g, | ||
const joint_matrix<Group, T, use::a, Rows, Cols, Layout> &res, | ||
ext::oneapi::experimental::annotated_ptr<T, PropertyListT> base_dest, | ||
size_t Stride, size_t GlobalRows, size_t GlobalCols, | ||
size_t RowIndex, size_t ColIndex); | ||
|
||
template <typename Group, typename T, size_t Rows, size_t Cols, | ||
layout Layout, typename PropertyListT> | ||
void joint_matrix_store_checked(Group g, | ||
const joint_matrix<Group, T, use::b, Rows, Cols, Layout> &res, | ||
ext::oneapi::experimental::annotated_ptr<T, PropertyListT> base_dest, | ||
size_t Stride, size_t GlobalRows, size_t GlobalCols, | ||
size_t RowIndex, size_t ColIndex); | ||
|
||
} // namespace sycl::ext::intel::experimental::matrix | ||
``` | ||
|
||
The property list associated with the `annotated_ptr` argument | ||
represents the compile-time constant properties for cache control included | ||
in the SYCL extenion | ||
link:../../proposed/sycl_ext_intel_cache_controls.asciidoc[sycl_ext_intel_cache_controls]. | ||
|
||
==== Restrictions and Device Information Descriptors | ||
Applications must adhere to certain alignment restrictions when using | ||
the checked APIs described in this section. This extension provides | ||
the following queries to get these requirements: | ||
|
||
[frame="none",options="header"] | ||
|====================== | ||
| Device descriptors | Return type| Description | ||
|`ext::intel::experimental::info::device::matrix_checked_alignment`| `size_t` | ||
|Tells the required alignment (in bytes) of the base pointer for | ||
`joint_matrix_load_checked` and `joint_matrix_store_checked`. | ||
|`ext::intel::experimental::info::device::matrix_checked_rowindex_multiple_of<T>`| | ||
`size_t`|Returns a value, of which `RowIndex` must be multiple of; | ||
where `T` is the element type of the matrix. When using the matrices | ||
with the machine learning types, `T` should be the element type | ||
(e.g. `precision::tf32`) not the storage type. | ||
|`ext::intel::experimental::info::device::matrix_checked_globalcols_multiple_of<T>`| | ||
`size_t` | Returns a value, of which `GlobalCols` must be multiple of; | ||
where `T` is the element type of the matrix. When using the matrices | ||
with the machine learning types, `T` should be the element type | ||
(e.g. `precision::tf32`) not the storage type. | ||
|====================== | ||
|
||
==== Appendix: Restrictions Per Hardware | ||
===== Intel XMX | ||
The checked APIs are currently available in devices with the architecture | ||
`architecture::intel_gpu_pvc`. The following restrictions apply to | ||
these checked APIs: | ||
|
||
- The base pointer must be 4 bytes aligned. | ||
|
||
- For 8 bits data type, `RowIndex` must be a multiple of 4. For 16 bits | ||
data type, `RowIndex` must be a multiple of 2. So `RowIndex` must be a | ||
multiple of 4 divided by size of the element type (`4/sizeof(T)`). | ||
|
||
- For 8 bits data type, `GlobalCols` must be a multiple of 4. For 16 bits | ||
data type, `GlobalCols` must be a multiple of 2. So `GlobalCols` must be a | ||
multiple of 4 divided by size of the element type (`4/sizeof(T)`). | ||
|
||
=== New Device Information Descriptor | ||
Besides the query we provide in | ||
link:sycl_ext_oneapi_matrix.asciidoc[sycl_ext_oneapi_matrix], | ||
|
@@ -334,4 +589,6 @@ q.wait(); | |
|Rev |Date |Author |Changes | ||
|1 |2022-11-07 |Dounia Khaldi |Add Intel-specific store API, | ||
layout information, and `joint_matrix_apply` with coordinates API | ||
|2 |2023-10-19 |Dounia Khaldi |Add Intel-specific out-of-bounds | ||
load/store/fill APIs | ||
|====================== |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand how this API can work. We might need to schedule a meeting where we can draw some pictures.
Wouldn't it make more sense to require the user to set
Width
andHeight
according to the remaining amount of memory in the "big matrix"? That way, the implementation simply checks to see if the template parameterRows
is greater thanHeight
and if the template parameterCols
is greater thanWidth
. If they are both less, the operations behave like normal loads and stores. If they are less, then the load sets the out-of-range elements to zero and the store ignores the out-of-range elements.From the description, though, it doesn't seem like the APIs work this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check is done compared to the big matrix size comparing the address to be loaded (ptr) with the big matrix size (Height). If ptr > Height, the element should not be loaded.
if the remainder is passed, how is the implementation going to compare with it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With your recent change that adds
coordX
andcoordY
parameters, I see how this could work for the load and store APIs. However, I still don't see how the fill API can work.This API structure seems confusing to me, though, and probably less efficient. I suspect users will get confused that the
src
pointer tojoint_matrix_load
is completely different from thesrc_base
pointer tojoint_matrix_load_checked
. It seems like it would be less confusing and more efficient to have the user pass values forWidth
andHeight
that are the remaining amount of memory in the big matrix. Maybe this would be easier to explain in a meeting where we can draw pictures?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, let's discuss fill and remainder approach in a meeting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May I unresolve this comment as there was no follow up and I probably has missed the meeting? :) I also don't understand, why we need coordinates for fill