[SYCL][CUDA] Draft PR for discussing matrix ext impl issues #6657
Conversation
This is a move towards the future-looking joint_matrix, joint_matrix_load, joint_matrix_store APIs. Signed-off-by: JackAKirk <[email protected]>
```diff
@@ -16,25 +16,28 @@ namespace oneapi {
 namespace experimental {
 namespace matrix {

-enum class matrix_use { a, b, accumulator };
+enum class matrix_use { a, b, accumulator, unnecessary };
```
#5835 does not have the most up-to-date changes.
Also, the doc PR with use is ready: https://github.com/intel/llvm/pull/6659/files.
In the final version:
- there is no "unnecessary" in use
- packed_a and packed_b will be replaced by packed
- we are calling "none" "unused". But if you prefer "none", we can call it that instead.
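
For concreteness, a sketch of what the two enums would look like under the final naming described above. This is my reading of the plan (assuming "unused" is chosen over "none"), not the merged header:

```cpp
// Sketch only: the final naming as described in the comment above.
namespace sycl::ext::oneapi::experimental::matrix {
enum class use { a, b, accumulator };                       // "unnecessary" dropped
enum class layout { row_major, col_major, packed, unused }; // packed_a/packed_b merged; "none" renamed
} // namespace sycl::ext::oneapi::experimental::matrix
```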
```diff
-enum class matrix_layout { row_major, col_major, packed_a, packed_b };
+enum class layout { row_major, col_major, packed_a, packed_b, none };
```
Also, if we are calling it "layout", we should remove "matrix" from "matrix_use" as well, right?
I agree. It could be worth having a final conversation about the semantics of layout/use in each backend (and for each API: it looks like in both backends there is a distinction between "layout" and "memory layout") before deciding on the naming. I wanted to focus on making sure we are on the same page about the interfaces first, though.
Updated usage of joint_matrix can be seen in the changes here: intel/llvm-test-suite#1183.
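
A sketch of the updated usage on a single 16x16x16 tile, based on my reading of the unified proposal. The joint_matrix template-parameter order and the joint_matrix_mad signature were still under discussion in this PR, so treat the exact signatures below as assumptions rather than the merged API:

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>

namespace sm = sycl::ext::oneapi::experimental::matrix;
constexpr std::size_t M = 16, N = 16, K = 16;

void gemm_tile(sycl::queue &q, sycl::half *A, sycl::half *B, float *C,
               float *D) {
  // One 32-wide sub-group cooperatively computes D = A * B + C.
  q.parallel_for(sycl::nd_range<1>{32, 32}, [=](sycl::nd_item<1> it) {
     auto sg = it.get_sub_group();
     // joint_matrix_load/store take a multi_ptr, so cast the raw pointers.
     auto cast = [](auto *p) {
       return sycl::address_space_cast<
           sycl::access::address_space::global_space,
           sycl::access::decorated::no>(p);
     };

     sm::joint_matrix<sycl::sub_group, sycl::half, sm::use::a, M, K,
                      sm::layout::row_major> mA;
     sm::joint_matrix<sycl::sub_group, sycl::half, sm::use::b, K, N,
                      sm::layout::row_major> mB;
     sm::joint_matrix<sycl::sub_group, float, sm::use::accumulator, M, N> mC,
         mD;

     sm::joint_matrix_load(sg, mA, cast(A), K); // stride = K (row_major)
     sm::joint_matrix_load(sg, mB, cast(B), N); // stride = N (row_major)
     sm::joint_matrix_load(sg, mC, cast(C), N, sm::layout::row_major);
     sm::joint_matrix_mad(sg, mD, mA, mB, mC);  // mD = mA * mB + mC
     sm::joint_matrix_store(sg, mD, cast(D), N, sm::layout::row_major);
   }).wait();
}
```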
Also updated the impl functions used in the CUDA backend (some of these functions may also be used in the HIP AMD case when that is implemented, since the interfaces will match). Signed-off-by: JackAKirk <[email protected]>
This is for illustrative purposes: to show the advantage of the proposed change in the joint_matrix_mad interface. Signed-off-by: JackAKirk <[email protected]>
@dkhaldi @yubingex007-a11y @gmlueck matrix-unified.hpp contains the agreed interfaces for joint_matrix_load, joint_matrix_store and joint_matrix_mad. These functions call backend implementations depending on the compiler flags __NVPTX__ and __SPIR__ (and later we can also add AMD flags). I've added the backend implementations for CUDA in the matrix-tensor-cores.hpp file.

The main implementation issue that I think we will face is the redefinition of partial specializations of the joint_matrix struct in the AMX/CUDA backends. These backends use completely different definitions of joint_matrix but have overlapping template parameters. Ideally I think we can select the correct definitions depending on the backend, unless you can see another solution? Do you think that we could use the driver to select the correct partial specializations in a similar manner to how https://github.com/intel/llvm/pull/6524/files#diff-f8c64e36dfe3828a6f816c4550e78bb0305769ace1be53207e86ac9a3280ac9e selects the correct bfloat16 native library?

Also, you might want to check that you can call the Intel implementations from matrix-unified.hpp, replacing the joint_matrix CUDA partial specializations with the Intel ones, in order to check that there are no other technical issues we need to consider when we unify, sooner rather than later.

Tests using the unified interface in the CUDA backend: intel/llvm-test-suite#1183
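
Below is a minimal, self-contained sketch of the compile-time dispatch pattern described above. The names here (my_matrix, load_impl_cuda, load_impl_intel, unified_load) are hypothetical stand-ins, not the actual contents of matrix-unified.hpp or matrix-tensor-cores.hpp:

```cpp
#include <cstddef>

namespace sketch {

// Placeholder fragment type; the real backends define joint_matrix with
// completely different storage, which is the redefinition problem above.
template <typename T, std::size_t Rows, std::size_t Cols>
struct my_matrix {
  T data[Rows * Cols];
};

// Stand-in for the PTX wmma builtins used by the CUDA backend.
template <typename T, std::size_t Rows, std::size_t Cols>
void load_impl_cuda(my_matrix<T, Rows, Cols> &m, const T *src,
                    std::size_t stride) {
  for (std::size_t r = 0; r < Rows; ++r)
    for (std::size_t c = 0; c < Cols; ++c)
      m.data[r * Cols + c] = src[r * stride + c];
}

// Stand-in for the SPIR-V / AMX path.
template <typename T, std::size_t Rows, std::size_t Cols>
void load_impl_intel(my_matrix<T, Rows, Cols> &m, const T *src,
                     std::size_t stride) {
  for (std::size_t r = 0; r < Rows; ++r)
    for (std::size_t c = 0; c < Cols; ++c)
      m.data[r * Cols + c] = src[r * stride + c];
}

// The unified entry point picks a backend at compile time, as proposed
// for joint_matrix_load in matrix-unified.hpp.
template <typename T, std::size_t Rows, std::size_t Cols>
void unified_load(my_matrix<T, Rows, Cols> &m, const T *src,
                  std::size_t stride) {
#if defined(__NVPTX__)
  load_impl_cuda(m, src, stride);
#elif defined(__SPIR__)
  load_impl_intel(m, src, stride);
#else
  // Host compilation pass: no-op in this sketch.
  (void)m; (void)src; (void)stride;
#endif
}

} // namespace sketch
```

Guarding each backend's joint_matrix partial specializations behind the same macros would address the redefinition clash in the same way, since only one backend's definitions would then be visible in any given device-compilation pass.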
This is a move towards the future-looking joint_matrix, joint_matrix_load, joint_matrix_store APIs. The aim is to make the CUDA and Intel implementations of the joint_matrix extension use matching interfaces, whilst enabling all functionality of both backends.
Signed-off-by: JackAKirk [email protected]