Skip to content

[SYCL][Joint Matrix Spec] Add new API for out of bounds fill/load/store #11172

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Mar 19, 2024

Conversation

dkhaldi
Copy link
Contributor

@dkhaldi dkhaldi commented Sep 13, 2023

@dkhaldi dkhaldi requested a review from a team as a code owner September 13, 2023 20:56
…t to zero for load/fill, drop out when storing
@dkhaldi
Copy link
Contributor Author

dkhaldi commented Oct 19, 2023

@gmlueck, gentle ping to review.

@@ -203,6 +203,96 @@ In the case of `ext_intel_packed` matrix memory layout, `row` and
`col` represent the coordinates in the logical matrix before VNNI
transformation.

=== Load/Store/Fill With Out-of-Bounds Checks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand how this API can work. We might need to schedule a meeting where we can draw some pictures.

Wouldn't it make more sense to require the user to set Width and Height according to the remaining amount of memory in the "big matrix"? That way, the implementation simply checks to see if the template parameter Rows is greater than Height and if the template parameter Cols is greater than Width. If they are both less, the operations behave like normal loads and stores. If they are less, then the load sets the out-of-range elements to zero and the store ignores the out-of-range elements.

From the description, though, it doesn't seem like the APIs work this way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check is done compared to the big matrix size comparing the address to be loaded (ptr) with the big matrix size (Height). If ptr > Height, the element should not be loaded.
if the remainder is passed, how is the implementation going to compare with it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With your recent change that adds coordX and coordY parameters, I see how this could work for the load and store APIs. However, I still don't see how the fill API can work.

This API structure seems confusing to me, though, and probably less efficient. I suspect users will get confused that the src pointer to joint_matrix_load is completely different from the src_base pointer to joint_matrix_load_checked. It seems like it would be less confusing and more efficient to have the user pass values for Width and Height that are the remaining amount of memory in the big matrix. Maybe this would be easier to explain in a meeting where we can draw pictures?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's discuss fill and remainder approach in a meeting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I unresolve this comment as there was no follow up and I probably has missed the meeting? :) I also don't understand, why we need coordinates for fill


```c++
namespace sycl::ext::intel::experimental::matrix {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fda0 does this change address the missing arguments to generate efficiently 2d load/store in IGC?

Copy link
Contributor Author

@dkhaldi dkhaldi Oct 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fda0 Yes I changed it to add the two offset coordinates separately and excluded them from the base pointer calculation.

@dkhaldi
Copy link
Contributor Author

dkhaldi commented Nov 1, 2023

Notes after online discussion with @gmlueck:

  • Annotated ptr load/store overloads needs to be in oneapi namespace for use on targets that support cache control hints
  • Prefetch needs to be in oneapi namespace
  • Out of bounds support:
    - Do we need emulation when hardware cannot do the bounds check?
    - Do we need these for AMX?
    - Fill does not need base pointer but will need coordinates and stride to be added
    - Should we add base ptr, coord x and y to the current joint_matrix_load/store to match the checked variants?

@dkhaldi
Copy link
Contributor Author

dkhaldi commented Dec 6, 2023

@gmlueck, I changed the API to stick with offset pointer, no separate coordinates are needed. We did that because we found that alignment restriction does not really make a difference whether we use offset or base pointer (4 bytes alignments must be honored regardless).

However, we found an another issue if we stick with offset pointer which is the inability to support negative coordinates. There are use cases where users need to read backwards in memory and the extra elements (backward before the start of the memory pointer) need to be zeroed. If we stick with offset pointer (remainder Width and Height cannot express the negative coordinates case), we will have to restrict the use of these API to only positive coordinates which can be limiting to the real use cases.

@uphoffc
Copy link

uphoffc commented Dec 6, 2023

Hi, I was asked to describe my use case for block2d loads that popped up in a related discussion about intel_subgroup_block_* functions (in OpenCL).

I am working on a kernel which requires the following sum:

$$out(..., j, ...) = \sum_i in(..., j + i - 1, ...)$$

with the boundary condition that in(k_1, ..., k_N) = 0 if any k_n is out-of-bounds.

Using the base address + coordinate format I can simply write

int2 coord = (int2)(0, i + j - 1);
ushort16 tmp = intel_subgroup_block_read_u16_m8k16v2(base_address, SurfaceWidth * sizeof(short), SurfaceHeight, SurfacePitch * sizeof(short), coord);

and then a 0 will be return if coord.x < 0 or coord.x >= SurfaceWidth.

Suggestion about the interface: Maybe it would be nice to wrap the surface parameters in its own type. Then it would be imho clearer to what Height, Width, Stride, Layout relate to. E.g. something like this:

auto big_matrix = big_matrix_ref{base_ptr, Stride, Layout, Height, Width};
void joint_matrix_load_checked(sub_group, the_joint_matrix, big_matrix, int2(coordX, coordY));

Would also require less characters to write if one has several loads on the same matrix.

We did that because we found that alignment restriction does not really make a difference whether we use offset or base pointer (4 bytes alignments must be honored regardless).

There's also a 64 B alignment requirement on the base_ptr on current gen.

@uphoffc
Copy link

uphoffc commented Dec 7, 2023

Here's another thing to consider.

Assume that I tile my row-major MxN matrix into 8x16 tiles. Using the current proposal, writing a loop accessing every tile looks like the following:
1)

for (int m = 0; m < M; m += 8) {
  for (int n = 0; n < N; n += 16) {
    joint_matrix_load_checked(sub_group, result, base_ptr + n + m * Stride, Stride, M - m, N - n);
  }
}

In contrast, with base_ptr + coordinate we have
2)

for (int m = 0; m < M; m += 8) {
  for (int n = 0; n < N; n += 16) {
    joint_matrix_load_checked(sub_group, result, base_ptr, Stride, M, N, int2{m, n});
  }
}

In variant 1) we have additional arithmetic going on; we need 4 add / sub and 1 mul instruction, that would otherwise be done in hardware in a single load instruction. Moreover, we have to do the offset calculation ourselves every time and must not forget to adjust the bounds - imho much more error prone and complicated than 2).

…ew use case where coords are negative values and cannot be taken into account with the offset pointer solution
@dkhaldi
Copy link
Contributor Author

dkhaldi commented Dec 21, 2023

@gmlueck, as we discussed on Monday, I put back the coords arguments and base pointer as they are needed for negative coordinates case.
Can you please review/approve?

steffenlarsen pushed a commit that referenced this pull request Jan 4, 2024
use Use, layout Layout, typename Tv>
void joint_matrix_fill_checked(Group g, joint_matrix<Group, T, Use, Rows,
Cols, Layout> &m, Tv v, size_t Stride, size_t Height, size_t
Width, size_t coordX, size_t coordY);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we don't need coordinate parameters for joint_matrix_fill_checked, as we don't need to access memory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MrSidims,
In the load/store, the check does the following (let's just consider one horizontal dimension to simplify):
if (base_ptr + coordY + Cols > base_ptr+Width) --> out of bounds

Since we intend to do the check just like load and store (just considering base_ptr=0), we need to add these arguments: Height, Width, Stride, CoordX, and CoordY. The check will do the following:
if (coordY + Cols > Width) --> out of bounds

@@ -203,6 +203,96 @@ In the case of `ext_intel_packed` matrix memory layout, `row` and
`col` represent the coordinates in the logical matrix before VNNI
transformation.

=== Load/Store/Fill With Out-of-Bounds Checks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I unresolve this comment as there was no follow up and I probably has missed the meeting? :) I also don't understand, why we need coordinates for fill

Copy link
Contributor

@MrSidims MrSidims left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with the assumption, that Stride is removed form matrilx_fill API

@dkhaldi
Copy link
Contributor Author

dkhaldi commented Jan 16, 2024

LGTM with the assumption, that Stride is removed form matrilx_fill API

Yes, I just removed stride. So the expectation is the following WRT joint_matrix_fill_checked:

  • Check for the y dimension (columns): coordy+smallmatrixcols >? width

  • Check for the x dimension (rows): coordx + smallmatrixrows>? Height

in the SYCL extenion
link:../../proposed/sycl_ext_intel_cache_controls.asciidoc[sycl_ext_intel_cache_controls],
and for the alignment property defined in the SYCL extension
link:../../experimental/sycl_ext_oneapi_kernel_arg_properties.asciidoc[sycl_ext_oneapi_kernel_arg_properties].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also added support for annotated_ptr in #11473, but that PR doesn't allow the alignment property. Is that an oversight?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove alignment from the checked APIs because only cache control is what is being considered for implementation so far.
Alignments will be handled by the query like you suggested in your last comment. So users must make sure their pointers are 4 bytes aligned. Otherwise, results are undefined.

@gmlueck
Copy link
Contributor

gmlueck commented Feb 29, 2024

I found this information in an email conversation we had back in January:

There are two issues with this feature because we only plan on adding it on hardware that offers this capability, no emulation:

  • This will only be supported on PVC and FS1, not the CPU(AMX). So we should add a query value about OOB checking.
  • They need to be aligned by 4 bytes because we want to avoid runtime checking. I will add a note about this requirement. We may need to add this as an extension to the OOB query value. Something like:
    • bool info::device::matrix::out_of_bounds --> support for OOB or not
    • size_t info::device::matrix::out_of_bounds_alignment --> alignment requirement

I'm not sure if these limitations still exist. If so, I think these queries need to be added to the spec.

Rather than adding info::device::matrix::out_of_bounds, we should add an aspect for this. Maybe name it like ext_intel_matrix_checked, as this aligns with the names of the APIs (e.g. joint_matrix_load_checked).

None of the information descriptor queries in sycl_ext_oneapi_matrix have a namespace like info::device::matrix::FOO, so I don't think you should add that namespace just for this one query. Maybe name it like ext::intel::experimental::info::device::matrix_checked_alignment.

dkhaldi added 2 commits March 6, 2024 09:28
- Restructuring and more clarifications
- Add Device Information Descriptors for the checked APIs restrictions
- Fix store checked for A and B use
@dkhaldi
Copy link
Contributor Author

dkhaldi commented Mar 6, 2024

@gmlueck, I addressed your comments, specifically:

  • Add new aspect for the checked APIs ext_intel_matrix_checked
  • Restructuring and more clarifications: add subsections for aspect, introduction, API, restrictions and query
  • Add Device Information Descriptors for the checked APIs restrictions

Can you please take a look again?

…he specific requirements as an appendix subsection
@dkhaldi
Copy link
Contributor Author

dkhaldi commented Mar 19, 2024

@intel/llvm-gatekeepers, please help merge.

@bader bader merged commit ecd3b90 into intel:sycl Mar 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants