
[SYCL][CUDA] Draft PR for discussing matrix ext impl issues #6657


Closed
wants to merge 13 commits

Conversation

JackAKirk
Contributor

This is a move towards the future-looking joint_matrix, joint_matrix_load, and joint_matrix_store APIs. The aim is to make the CUDA and Intel implementations of the joint_matrix extension use matching interfaces, whilst enabling all functionality of both backends.
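As a rough sketch of the target usage (the enum names follow this PR's diff; the 16x16 shapes, the sub-group `sg`, and the pointers `pA`/`pB`/`pC` are placeholders for illustration, not part of the PR):

```c++
#include <sycl/sycl.hpp>
using namespace sycl::ext::oneapi::experimental::matrix;

// Inside a kernel body, with `sg` a sycl::sub_group and pA/pB/pC multi_ptrs
// into device memory (all assumptions of this sketch):
joint_matrix<sycl::half, matrix_use::a, 16, 16, layout::row_major> mA;
joint_matrix<sycl::half, matrix_use::b, 16, 16, layout::row_major> mB;
joint_matrix<float, matrix_use::accumulator, 16, 16> mC;

joint_matrix_load(sg, mA, pA, 16);
joint_matrix_load(sg, mB, pB, 16);
joint_matrix_load(sg, mC, pC, 16, layout::row_major); // accumulator layout given at load
mC = joint_matrix_mad(sg, mA, mB, mC);
joint_matrix_store(sg, mC, pC, 16, layout::row_major); // and again at store
```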

Signed-off-by: JackAKirk [email protected]

Signed-off-by: JackAKirk <[email protected]>
This is a move towards the future looking joint_matrix, joint_matrix_load, joint_matrix_store APIs.
Signed-off-by: JackAKirk <[email protected]>
@@ -16,25 +16,28 @@ namespace oneapi {
namespace experimental {
namespace matrix {

-enum class matrix_use { a, b, accumulator };
+enum class matrix_use { a, b, accumulator, unnecessary };
Contributor

@dkhaldi Aug 29, 2022


#5835 does not have the most up-to-date changes.
Also, the PR for the doc with use is ready: https://github.com/intel/llvm/pull/6659/files.
In the final version:

  • there is no "unnecessary" in use
  • packed_a and packed_b will be replaced by packed
  • we are calling "none" "unused"; but if you prefer "none", we can call it that instead.
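Putting those three bullets together, the final enums described in #6659 would look roughly like this (a sketch of the agreed shape, not the code in this diff):

```c++
namespace sycl::ext::oneapi::experimental::matrix {

enum class use { a, b, accumulator };                       // no "unnecessary" value
enum class layout { row_major, col_major, packed, unused }; // "packed" replaces packed_a/packed_b;
                                                            // "unused" is the proposed name for "none"
} // namespace sycl::ext::oneapi::experimental::matrix
```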


-enum class matrix_layout { row_major, col_major, packed_a, packed_b };
+enum class layout { row_major, col_major, packed_a, packed_b, none };
Contributor


Also, if we are calling it "layout", we should remove "matrix" from "matrix_use" as well, right?

Contributor Author

@JackAKirk Aug 29, 2022


I agree. It could be worth having a final conversation about the semantics of layout/use in each backend (and for each API: it looks like in both backends there is a distinction between "layout" and "memory layout") before deciding on the naming. I wanted to focus on making sure we are on the same page about the interfaces first, though.

@JackAKirk
Contributor Author

JackAKirk commented Aug 29, 2022

Updated usage of joint_matrix can be seen in the changes here: intel/llvm-test-suite#1183.

Also updated the impl functions used in the CUDA backend (some of these functions may also be used in the HIP AMD case when that is implemented, since the interfaces will match).

Signed-off-by: JackAKirk <[email protected]>
JackAKirk and others added 8 commits September 1, 2022 10:19
This is for illustrative purposes: to show the advantage of the proposed change in the joint_matrix_mad interface.

Signed-off-by: JackAKirk <[email protected]>
@JackAKirk changed the title from "[SYCL][CUDA] Layout accumulator is specified at load/store." to "[SYCL][CUDA] Separate matrix extension interfaces from impls" on Oct 10, 2022
@JackAKirk
Contributor Author

JackAKirk commented Oct 11, 2022

@dkhaldi @yubingex007-a11y @gmlueck

matrix-unified.hpp contains the agreed interfaces for joint_matrix_load, joint_matrix_store, and joint_matrix_mad. These functions call backend implementations depending on the compiler-defined macros __NVPTX__ and __SPIR__ (and later we can also add AMD macros). I've added the backend implementations for CUDA in the matrix-tensor-cores.hpp file.
This is just a draft aimed at finding any technical issues with the unified approach, but when #6957 is merged I will pull in those changes and update the macro usage.
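A sketch of that dispatch pattern (__NVPTX__ and __SPIR__ are the real predefined macros; the detail:: helper names are hypothetical, used only to illustrate the forwarding):

```c++
// In matrix-unified.hpp: one user-facing entry point per operation,
// forwarding to whichever backend the device compiler is targeting.
template <typename Group, typename T, size_t Rows, size_t Cols,
          typename MultiPtrT>
void joint_matrix_load(Group sg,
                       joint_matrix<T, matrix_use::accumulator, Rows, Cols> &res,
                       MultiPtrT src, size_t stride, layout mem_layout) {
#if defined(__NVPTX__)
  // CUDA implementation lives in matrix-tensor-cores.hpp
  detail::joint_matrix_load_cuda(sg, res, src, stride, mem_layout);
#elif defined(__SPIR__)
  // Intel implementation
  detail::joint_matrix_load_intel(sg, res, src, stride, mem_layout);
#endif
}
```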

The main implementation issue that I think we will face is the redefinition of partial specializations of the joint_matrix struct in the AMX/CUDA backends. These backends use completely different definitions of joint_matrix but have overlapping template parameters. Ideally, I think we can select the correct definitions depending on the backend, unless you can see another solution?
Here you can see that I have also separated the unified joint_matrix struct in joint-matrix.hpp from the CUDA backend's partial specializations of joint_matrix in joint-matrix-cuda-impl.hpp.
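One conceivable way to keep the overlapping partial specializations from colliding is to macro-guard each backend's definitions; this is a sketch of that possibility, not something this PR has settled on:

```c++
// joint-matrix.hpp: the single, unified primary template.
template <typename T, matrix_use Use, size_t Rows, size_t Cols,
          layout Layout = layout::none>
struct joint_matrix;

// joint-matrix-cuda-impl.hpp: CUDA partial specializations, only compiled
// when targeting NVPTX, so the AMX backend can define its own specializations
// for the same template parameters without redefinition errors.
#if defined(__NVPTX__)
template <size_t Rows, size_t Cols>
struct joint_matrix<sycl::half, matrix_use::a, Rows, Cols, layout::row_major> {
  // storage sized for the Tensor Core fragment (illustrative only)
};
#endif
```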

Do you think that we could use the driver to select the correct partial specializations, in a similar manner to how https://github.com/intel/llvm/pull/6524/files#diff-f8c64e36dfe3828a6f816c4550e78bb0305769ace1be53207e86ac9a3280ac9e selects the correct bfloat16 native library?

Also, you might want to check that you can call the Intel implementations from matrix-unified.hpp, replacing the CUDA joint_matrix partial specializations with the Intel ones, in order to confirm sooner rather than later that there are no other technical issues we need to consider when we unify.

Tests using the unified interface in the cuda backend: intel/llvm-test-suite#1183

@JackAKirk JackAKirk closed this Oct 14, 2022
@JackAKirk changed the title from "[SYCL][CUDA] Separate matrix extension interfaces from impls" to "[SYCL][CUDA] Draft PR for discussing matrix ext impl issues" on Oct 17, 2022