-
Notifications
You must be signed in to change notification settings - Fork 787
[SYCL][Joint Matrix Spec] Add new API for out of bounds fill/load/store #11172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…t to zero for load/fill, drop out when storing
sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
@gmlueck, gentle ping to review. |
@@ -203,6 +203,96 @@ In the case of `ext_intel_packed` matrix memory layout, `row` and | |||
`col` represent the coordinates in the logical matrix before VNNI | |||
transformation. | |||
|
|||
=== Load/Store/Fill With Out-of-Bounds Checks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand how this API can work. We might need to schedule a meeting where we can draw some pictures.
Wouldn't it make more sense to require the user to set Width
and Height
according to the remaining amount of memory in the "big matrix"? That way, the implementation simply checks to see if the template parameter Rows
is greater than Height
and if the template parameter Cols
is greater than Width
. If they are both less, the operations behave like normal loads and stores. If they are less, then the load sets the out-of-range elements to zero and the store ignores the out-of-range elements.
From the description, though, it doesn't seem like the APIs work this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check is done compared to the big matrix size comparing the address to be loaded (ptr) with the big matrix size (Height). If ptr > Height, the element should not be loaded.
if the remainder is passed, how is the implementation going to compare with it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With your recent change that adds coordX
and coordY
parameters, I see how this could work for the load and store APIs. However, I still don't see how the fill API can work.
This API structure seems confusing to me, though, and probably less efficient. I suspect users will get confused that the src
pointer to joint_matrix_load
is completely different from the src_base
pointer to joint_matrix_load_checked
. It seems like it would be less confusing and more efficient to have the user pass values for Width
and Height
that are the remaining amount of memory in the big matrix. Maybe this would be easier to explain in a meeting where we can draw pictures?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, let's discuss fill and remainder approach in a meeting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May I unresolve this comment as there was no follow up and I probably has missed the meeting? :) I also don't understand, why we need coordinates for fill
|
||
```c++ | ||
namespace sycl::ext::intel::experimental::matrix { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fda0 does this change address the missing arguments to generate efficiently 2d load/store in IGC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fda0 Yes I changed it to add the two offset coordinates separately and excluded them from the base pointer calculation.
Notes after online discussion with @gmlueck:
|
…d coordX/Y. Add clarification on how the bounds are checked
@gmlueck, I changed the API to stick with offset pointer, no separate coordinates are needed. We did that because we found that alignment restriction does not really make a difference whether we use offset or base pointer (4 bytes alignments must be honored regardless). However, we found an another issue if we stick with offset pointer which is the inability to support negative coordinates. There are use cases where users need to read backwards in memory and the extra elements (backward before the start of the memory pointer) need to be zeroed. If we stick with offset pointer (remainder Width and Height cannot express the negative coordinates case), we will have to restrict the use of these API to only positive coordinates which can be limiting to the real use cases. |
Hi, I was asked to describe my use case for block2d loads that popped up in a related discussion about intel_subgroup_block_* functions (in OpenCL). I am working on a kernel which requires the following sum: with the boundary condition that in(k_1, ..., k_N) = 0 if any k_n is out-of-bounds. Using the base address + coordinate format I can simply write int2 coord = (int2)(0, i + j - 1);
ushort16 tmp = intel_subgroup_block_read_u16_m8k16v2(base_address, SurfaceWidth * sizeof(short), SurfaceHeight, SurfacePitch * sizeof(short), coord); and then a 0 will be return if coord.x < 0 or coord.x >= SurfaceWidth. Suggestion about the interface: Maybe it would be nice to wrap the surface parameters in its own type. Then it would be imho clearer to what Height, Width, Stride, Layout relate to. E.g. something like this: auto big_matrix = big_matrix_ref{base_ptr, Stride, Layout, Height, Width};
void joint_matrix_load_checked(sub_group, the_joint_matrix, big_matrix, int2(coordX, coordY)); Would also require less characters to write if one has several loads on the same matrix.
There's also a 64 B alignment requirement on the base_ptr on current gen. |
Here's another thing to consider. Assume that I tile my row-major MxN matrix into 8x16 tiles. Using the current proposal, writing a loop accessing every tile looks like the following: for (int m = 0; m < M; m += 8) {
for (int n = 0; n < N; n += 16) {
joint_matrix_load_checked(sub_group, result, base_ptr + n + m * Stride, Stride, M - m, N - n);
}
} In contrast, with base_ptr + coordinate we have for (int m = 0; m < M; m += 8) {
for (int n = 0; n < N; n += 16) {
joint_matrix_load_checked(sub_group, result, base_ptr, Stride, M, N, int2{m, n});
}
} In variant 1) we have additional arithmetic going on; we need 4 add / sub and 1 mul instruction, that would otherwise be done in hardware in a single load instruction. Moreover, we have to do the offset calculation ourselves every time and must not forget to adjust the bounds - imho much more error prone and complicated than 2). |
…ew use case where coords are negative values and cannot be taken into account with the offset pointer solution
@gmlueck, as we discussed on Monday, I put back the coords arguments and base pointer as they are needed for negative coordinates case. |
use Use, layout Layout, typename Tv> | ||
void joint_matrix_fill_checked(Group g, joint_matrix<Group, T, Use, Rows, | ||
Cols, Layout> &m, Tv v, size_t Stride, size_t Height, size_t | ||
Width, size_t coordX, size_t coordY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we don't need coordinate parameters for joint_matrix_fill_checked, as we don't need to access memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MrSidims,
In the load/store, the check does the following (let's just consider one horizontal dimension to simplify):
if (base_ptr + coordY + Cols > base_ptr+Width) --> out of bounds
Since we intend to do the check just like load and store (just considering base_ptr=0), we need to add these arguments: Height, Width, Stride, CoordX, and CoordY. The check will do the following:
if (coordY + Cols > Width) --> out of bounds
@@ -203,6 +203,96 @@ In the case of `ext_intel_packed` matrix memory layout, `row` and | |||
`col` represent the coordinates in the logical matrix before VNNI | |||
transformation. | |||
|
|||
=== Load/Store/Fill With Out-of-Bounds Checks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May I unresolve this comment as there was no follow up and I probably has missed the meeting? :) I also don't understand, why we need coordinates for fill
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with the assumption, that Stride is removed form matrilx_fill API
Yes, I just removed stride. So the expectation is the following WRT joint_matrix_fill_checked:
|
in the SYCL extenion | ||
link:../../proposed/sycl_ext_intel_cache_controls.asciidoc[sycl_ext_intel_cache_controls], | ||
and for the alignment property defined in the SYCL extension | ||
link:../../experimental/sycl_ext_oneapi_kernel_arg_properties.asciidoc[sycl_ext_oneapi_kernel_arg_properties]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You also added support for annotated_ptr
in #11473, but that PR doesn't allow the alignment
property. Is that an oversight?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove alignment from the checked APIs because only cache control is what is being considered for implementation so far.
Alignments will be handled by the query like you suggested in your last comment. So users must make sure their pointers are 4 bytes aligned. Otherwise, results are undefined.
sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
I found this information in an email conversation we had back in January:
I'm not sure if these limitations still exist. If so, I think these queries need to be added to the spec. Rather than adding None of the information descriptor queries in sycl_ext_oneapi_matrix have a namespace like |
- Restructuring and more clarifications - Add Device Information Descriptors for the checked APIs restrictions - Fix store checked for A and B use
@gmlueck, I addressed your comments, specifically:
Can you please take a look again? |
sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
…he specific requirements as an appendix subsection
sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
…ying T in the query, fix punctuation
@intel/llvm-gatekeepers, please help merge. |
Code example to show usage can be found here:
https://github.com/intel/llvm/blob/sycl/sycl/test-e2e/Matrix/joint_matrix_out_bounds.cpp