[Matrix][SYCL] Add use argument for joint_matrix and add another feature macro for it #5835

Merged
merged 18 commits into intel:sycl on Sep 14, 2022

Conversation

yubingex007-a11y
Contributor


@yubingex007-a11y yubingex007-a11y marked this pull request as ready for review March 23, 2022 02:46
@yubingex007-a11y yubingex007-a11y requested a review from a team as a code owner March 23, 2022 02:46
@yubingex007-a11y
Contributor Author

The remaining Lit failure is: "File /__w/llvm/llvm/build/bin/../include/sycl/ext/oneapi/matrix/matrix-tensorcore.hpp Line 21: no member named 'dynamic_extent' in namespace 'sycl'"

It's strange, since I see "inline constexpr size_t dynamic_extent = SIZE_MAX;" in sycl/include/CL/sycl/sycl_span.hpp

@@ -38,7 +38,7 @@ namespace sycl {
// 2- provides JIT implementation (target agnostic) for the
// experimental matrix extension
#ifndef SYCL_EXT_ONEAPI_MATRIX
#define SYCL_EXT_ONEAPI_MATRIX 2
#define SYCL_EXT_ONEAPI_MATRIX 3
Contributor Author

change to 2

joint_matrix<Tb, defaultK, defaultN, matrix_layout::packed_b,
matrix_use::unnecessary, Group>;
template <typename Group>
using joint_matrix_c =
Contributor Author

give correct template param

Contributor
@dkhaldi dkhaldi left a comment

Besides my other comments, please add a test case that uses "use".

__spv::__spirv_JointMatrixINTEL<T2, K, N, LB, S> *B,
__spv::__spirv_JointMatrixINTEL<T3, M, N, LC, S> *C,
__spv::__spirv_JointMatrixINTEL<T1, M, K, LA, UA, S> *A,
__spv::__spirv_JointMatrixINTEL<T1, K, N, LB, UB, S> *B,
Contributor

this should be T2, right?

__spv::__spirv_JointMatrixINTEL<T2, K, N, LB, S> *B,
__spv::__spirv_JointMatrixINTEL<T3, M, N, LC, S> *C,
__spv::__spirv_JointMatrixINTEL<T1, M, K, LA, UA, S> *A,
__spv::__spirv_JointMatrixINTEL<T1, K, N, LB, UB, S> *B,
Contributor

T2

__spv::__spirv_JointMatrixINTEL<T2, K, N, LB, S> *B,
__spv::__spirv_JointMatrixINTEL<T3, M, N, LC, S> *C,
__spv::__spirv_JointMatrixINTEL<T1, M, K, LA, UA, S> *A,
__spv::__spirv_JointMatrixINTEL<T1, K, N, LB, UB, S> *B,
Contributor

T2

namespace oneapi {
namespace experimental::matrix {

enum class matrix_layout { row_major, col_major, packed_a, packed_b };
Contributor
@dkhaldi dkhaldi Mar 24, 2022

Do we need packed_a, packed_b here?
Can you change this to just packed?
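For reference, a minimal sketch of the suggested change (keeping the other enumerators from the snippet above):

// Sketch only: a single "packed" enumerator replaces packed_a / packed_b.
enum class matrix_layout { row_major, col_major, packed };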

enum class matrix_use { matrix_a, matrix_b, accumulator, unnecessary };

template <matrix_use Use> struct spv_matrix_use_traits {
static constexpr __spv::MatrixUse value = __spv::MatrixUse::MatrixA;
Contributor

If only the specialization on line 47 is the one really used, this should be empty, right?

enum class matrix_use { matrix_a, matrix_b, accumulator, unnecessary };

template <matrix_use Use> struct spv_matrix_use_traits {
static constexpr __spv::MatrixUse value = __spv::MatrixUse::MatrixA;
Contributor

same as above, this should be empty.

Contributor Author

You mean we should remove the default value? But we have a default value for matrix_layout's traits as well.
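For reference, a minimal sketch of the reviewer's suggestion: leave the primary template empty, with no default value, and rely on explicit specializations (the MatrixA specialization is taken from the snippet above; the remaining specializations are assumed to exist alongside it):

// Sketch only: empty primary template, no default value.
template <matrix_use Use> struct spv_matrix_use_traits {};

template <> struct spv_matrix_use_traits<matrix_use::matrix_a> {
  static constexpr __spv::MatrixUse value = __spv::MatrixUse::MatrixA;
};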

@@ -0,0 +1,435 @@
//===-------------- static-query.hpp - SYCL matrix ------------*- C++ -*---===//
Contributor

static-query-use.hpp

Contributor

did you correct this?

nd_range<2>({NDRangeM, NDRangeN}, {1, 1}),
[ accA, accB, accC, M, N, K ](nd_item<2> spmd_item)
[[intel::reqd_sub_group_size(1)]]
nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
Contributor

The SG size attribute is missing in this test.
Please start from a test that explicitly uses the SG size attribute, and just delete the AMX test.
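For context, a minimal sketch of the attribute being asked for, as it appears on the kernel lambda elsewhere in these tests (SG_SZ is the sub-group size constant already used in the nd_range above; the capture list is copied from the snippet):

// Sketch only: require the kernel to run with the intended sub-group size.
auto kernel = [accA, accB, accC, M, N, K](nd_item<2> spmd_item)
    [[intel::reqd_sub_group_size(SG_SZ)]] {
  // kernel body
};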

@yubingex007-a11y
Contributor Author

@bader it seems the failure in Jenkins/Precommit is a common issue; could you verify it?

Contributor
@dkhaldi dkhaldi left a comment

LGTM

@yubingex007-a11y
Contributor Author

ping? @intel/llvm-reviewers-runtime

@steffenlarsen
Contributor

Generally I think it looks good, but why do most of the tests have XFAIL: *?

Comment on lines 17 to 20
namespace sycl {
namespace ext {
namespace oneapi {
namespace experimental::matrix {
Contributor

While you are here

Suggested change
namespace sycl {
namespace ext {
namespace oneapi {
namespace experimental::matrix {
namespace sycl::ext::oneapi::experimental::matrix {

@dkhaldi
Contributor

dkhaldi commented Apr 12, 2022

@JackAKirk, in this PR we add the "use" argument to transition from using a required layout (especially packed_b) to a "use" argument. This will hopefully make it possible to remove the duplicated code in the tensorcores implementation. I don't think we will merge the two into one file, but I would like the tensorcores implementation to reuse the same definitions and not redefine the types, functions, enums, etc.
Can you please review and let me know where there is still divergence?

@JackAKirk
Contributor

JackAKirk commented Apr 12, 2022

@JackAKirk, in this PR we add the "use" argument to transition from using a required layout (especially packed_b) to a "use" argument. This will hopefully make it possible to remove the duplicated code in the tensorcores implementation. I don't think we will merge the two into one file, but I would like the tensorcores implementation to reuse the same definitions and not redefine the types, functions, enums, etc. Can you please review and let me know where there is still divergence?

Sounds good, I'll take a look. Hopefully once we make the tensor cores impl support the WI interface as well we can think about supporting them in the same SYCL_EXT_ONEAPI_MATRIX extension version.

@@ -129,10 +136,12 @@ enum class MatrixLayout : uint32_t {
// information to SPIRV translator.
// The long term solution would be to introduce a matrix type in Clang and use
// it instead of this member.
template <typename T, std::size_t R, std::size_t C, MatrixLayout U,
template <typename T, std::size_t R, std::size_t C, MatrixUse U, MatrixLayout L,
Contributor

This matches the template arguments used for joint_matrix in matrix-tensorcore.hpp (the current CUDA impl), except that the ordering is a little different (MatrixUse is the second template param in matrix-tensorcore.hpp). Trivial point, but we should align on an ordering before merging the impls into a single SYCL_EXT_ONEAPI_MATRIX version.

Contributor

This is a good point. Currently, it is at this position because it is an optional argument for now (to keep backward compatibility with the previous API, which had no use argument).
Once this API with the "use" argument is stable enough that we can remove the non-use API, we can revise the order.
My personal preference is that it is probably a good idea to keep the arguments that might become "optional" in the future last. Today, "use" is required but layout is definitely optional.
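A minimal, declaration-only sketch of that ordering rationale (the Group parameter and the defaults shown here are assumptions, not taken from this diff): parameters likely to become optional sit at the end so they can take defaults.

// Sketch only: "use" stays required; layout can default once the non-use API
// is removed.
template <typename T, size_t NumRows, size_t NumCols, use Use,
          layout Layout = layout::unused, typename Group = sycl::sub_group>
struct joint_matrix;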

Contributor

I suggest not modifying the order of the template parameters of the internal SPIR-V joint matrix type representation. The reason is that after this patch: #6535, Clang started to generate opaque matrix types like this: spirv.JointMatrixINTEL._half_10_2_0_0, and hence we can and will remove the array W/A. So it becomes crucial to keep the template parameter order internally (it can still be changed in the user-visible API).

Contributor Author

So, we will keep the workaround in SPIRVRegularizeLLVMBase::adaptStructTypes for now. Since the W/A will be removed in the future, we'd better move Use to the end of the template parameter list, right?

Contributor

Yes
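For reference, a minimal, declaration-only sketch of that conclusion (the scope parameter's type and name are assumptions; only the ordering is the point):

// Sketch only: keep the pre-existing internal parameter order (T, Rows, Cols,
// Layout, Scope) and append the new MatrixUse parameter at the end, so the
// opaque type mangling discussed above stays stable.
template <typename T, std::size_t R, std::size_t C, MatrixLayout L,
          Scope::Flag S, MatrixUse U>
struct __spirv_JointMatrixINTEL;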

struct joint_matrix {
public:
__spv::__spirv_JointMatrixINTEL<
T, NumRows, NumCols, spv_matrix_use_traits<Use>::value,
Contributor
@JackAKirk JackAKirk Apr 13, 2022

This matches the template arguments used for joint_matrix in matrix-tensorcore.hpp (the current CUDA impl), except that the ordering is a little different (matrix_use is the second template param in matrix-tensorcore.hpp). Trivial point, but we should align on an ordering before merging the impls into a single SYCL_EXT_ONEAPI_MATRIX version.


// unnecessary was introduced for backward compatibility.
// Once the use implementation is stable, "unnecessary" value will be omitted
enum class matrix_use { matrix_a, matrix_b, accumulator, unnecessary };
Contributor
@JackAKirk JackAKirk Apr 13, 2022

in matrix-tensorcores.hpp we define "a"/"b" rather than "matrix_a"/"matrix_b":

enum class matrix_use { a, b, accumulator };

We should align on a single naming for the matrix_use members, since it is a user-facing class that the user will have to provide when instantiating the joint_matrix template.

Contributor

I prefer adding "matrix" there, but then why is there "matrix" on "a" and "b" and not on "accumulator"?
I think better naming could be "left", "right", and "accumulator".
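A purely illustrative side-by-side of the naming options under discussion (the namespaces are artificial, only so the alternatives can coexist in one snippet):

namespace option_this_pr {
enum class matrix_use { matrix_a, matrix_b, accumulator, unnecessary };
}
namespace option_tensorcores {
enum class matrix_use { a, b, accumulator };
}
namespace option_left_right {
enum class matrix_use { left, right, accumulator };
}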

access::address_space Space>
inline __SYCL_ALWAYS_INLINE void
joint_matrix_store(Group sg,
joint_matrix<T, NumRows, NumCols, Use, MatL, Group> &src,
Contributor
@JackAKirk JackAKirk Apr 13, 2022

This is a more interesting point:

Are you sure that matrix_use is necessary as a parameter for joint_matrix_store? In the CUDA backend it is not necessary, because you can only call joint_matrix_store for a matrix_use::accumulator matrix: there is no use case for storing a matrix back to pointer storage other than the matrix that is the output of an MMA, which is always a matrix_use::accumulator matrix. I'd be surprised if this were different in the Intel backend!?
If this template param is removed, this function interface will align with the definition in matrix-tensorcores.hpp.
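A declaration-only sketch of what dropping the use parameter could look like here (the multi_ptr/stride/runtime-layout parameters are assumed from the usual extension signature, not taken from this diff):

// Sketch only: joint_matrix_store defined solely for accumulator matrices,
// so the matrix_use template parameter disappears from this interface.
template <typename Group, typename T, size_t NumRows, size_t NumCols,
          matrix_layout MatL, access::address_space Space>
inline __SYCL_ALWAYS_INLINE void
joint_matrix_store(Group sg,
                   joint_matrix<T, NumRows, NumCols, matrix_use::accumulator,
                                MatL, Group> &src,
                   multi_ptr<T, Space> dst, size_t stride,
                   matrix_layout memL);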

joint_matrix<ext::oneapi::sub_group, unsigned short, TM, TK> sub_a(
sg);
sycl::ext::oneapi::sub_group sg = spmd_item.get_sub_group();
joint_matrix<bfloat16, TM, TK, use::a> sub_a(sg);
// For B, since current implementation does not support non-packed
Contributor

this comment does not apply anymore

dkhaldi
dkhaldi previously approved these changes Sep 7, 2022
Contributor
@dkhaldi dkhaldi left a comment

LGTM

return;
joint_matrix<bfloat16, TK, TN, use::b> sub_b(sg);
joint_matrix<float, TM, TN, use::accumulator> sub_c(sg);

// AMX: 8 register tiles : 1k byte size, SMmaxxSKmax =16x64
Contributor

Also, let's not make this AMX-specific --> remove this comment.

};

template <typename Group, typename T, size_t NumRows, size_t NumCols, use Use,
layout Layout = layout::unused, access::address_space Space>
Contributor

I don't think that this needs layout Layout = layout::unused here, right? This template parameter can safely be removed, since it is not used in Intel/CUDA/HIP for this interface.

return;
joint_matrix<bfloat16, TK, TN, use::b> sub_b(sg);
joint_matrix<float, TM, TN, use::accumulator> sub_c(sg);

// AMX: 8 register tiles : 1k byte size, SMmaxxSKmax =16x64
// strideX = X's cols, so strideC = N, strideA = K, strideB = N*4
joint_matrix_load(sg, sub_c,
Contributor

use joint_matrix_fill instead
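A one-line sketch of the suggested replacement (the fill value is a placeholder):

// Initialize the accumulator in registers instead of loading it from memory.
joint_matrix_fill(sg, sub_c, 1.0f);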

layout Layout = layout::unused, access::address_space Space>
inline __SYCL_ALWAYS_INLINE void
joint_matrix_load(Group sg,
joint_matrix<T, NumRows, NumCols, Use, Layout, Group> &res,
Contributor

You can change this to joint_matrix<T, NumRows, NumCols, Use, layout::unused, Group> &res,

Then get rid of the layout template parameter as mentioned above.
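A declaration-only sketch of that suggestion (the multi_ptr/stride/runtime-layout parameters are assumed from the usual extension interface, not shown in this diff):

// Sketch only: the layout template parameter is gone; the matrix type is fixed
// to layout::unused and the memory layout (memL) is a runtime argument.
template <typename Group, typename T, size_t NumRows, size_t NumCols, use Use,
          access::address_space Space>
inline __SYCL_ALWAYS_INLINE void
joint_matrix_load(Group sg,
                  joint_matrix<T, NumRows, NumCols, Use, layout::unused, Group> &res,
                  multi_ptr<T, Space> src, size_t stride, layout memL);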

Contributor Author
@yubingex007-a11y yubingex007-a11y Sep 8, 2022

But for the CUDA backend, the layout must be specified for matrices A and B; CUDA users won't set it to unused for A and B, right?
According to #5835 (comment)
#5835 (review)

Contributor
@JackAKirk JackAKirk Sep 9, 2022

That's right for joint_matrix_load; however, I think it is better if there is a separate overload of joint_matrix_load for that case in the CUDA backend, like I added here:

Then we also have a version of joint_matrix_load that is only for the accumulator matrix::use here.
This immediately prevents the user from using both the template layout parameter and the non-template layout parameter at the same time. It also allows the Intel backend implementation to be simplified in the suggested way.

But for joint_matrix_store, even in the CUDA backend there are no cases that require the layout as a template parameter (joint_matrix_store is only functional for matrix::use::accumulator anyway in the CUDA backend, and this matrix::use type always relies on the non-template runtime layout parameter). So in the Intel backend the layout template parameter is not needed either.

It is a little confusing, but I hope that makes sense?
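A declaration-only sketch of the two-overload approach described above (the parameter lists are assumptions; the actual CUDA-backend signatures are not shown in this thread):

// Sketch only. Overload 1: for use::a / use::b matrices the layout is carried
// as a compile-time parameter of the matrix type, so no runtime layout is taken.
template <typename Group, typename T, size_t NumRows, size_t NumCols, use Use,
          layout Layout, access::address_space Space>
void joint_matrix_load(Group sg,
                       joint_matrix<T, NumRows, NumCols, Use, Layout, Group> &res,
                       multi_ptr<T, Space> src, size_t stride);

// Sketch only. Overload 2: for accumulator matrices the matrix type carries
// layout::unused and the memory layout is a runtime argument.
template <typename Group, typename T, size_t NumRows, size_t NumCols,
          access::address_space Space>
void joint_matrix_load(Group sg,
                       joint_matrix<T, NumRows, NumCols, use::accumulator,
                                    layout::unused, Group> &res,
                       multi_ptr<T, Space> src, size_t stride, layout memL);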

}

template <typename Group, typename T, size_t NumRows, size_t NumCols, use Use,
layout MatL = layout::unused, access::address_space Space>
Contributor

Same situation here as with joint_matrix_load. You can remove layout MatL = layout::unused from the template parameters and then use layout::unused directly in joint_matrix<T, NumRows, NumCols, Use, layout::unused, Group>.

Contributor

Otherwise it makes it legal for a user to call joint_matrix_load/store with a layout different from layout::unused, which is wrong.

@yubingex007-a11y
Contributor Author

The clang-format issue has no relationship with this PR, since we don't change clang/lib/Driver/ToolChains/SYCL.h or sycl/test/basic_tests/stdcpp_compat.cpp

@steffenlarsen
Contributor

The clang-format issue has no relationship with this PR, since we don't change clang/lib/Driver/ToolChains/SYCL.h or sycl/test/basic_tests/stdcpp_compat.cpp

Most testing won't run until lint is happy. I suspect this is a previously fixed problem that is still present here because the branch is based on a faulty intel/llvm HEAD. Could you please try and merge the sycl branch into this? My hope is that it will fix it.

@yubingex007-a11y
Contributor Author

The clang-format issue has no relationship with this PR, since we don't change clang/lib/Driver/ToolChains/SYCL.h or sycl/test/basic_tests/stdcpp_compat.cpp

Most testing won't run until lint is happy. I suspect this is a previously fixed problem that is still present here because the branch is based on a faulty intel/llvm HEAD. Could you please try and merge the sycl branch into this? My hope is that it will fix it.

ok, let me try, thanks!

@yubingex007-a11y
Contributor Author

ping?

MrSidims
MrSidims previously approved these changes Sep 9, 2022
@dkhaldi
Contributor

dkhaldi commented Sep 9, 2022

I think the only change needed here is to remove L from the template arguments of load/store and replace it with "unused", but keep memL, which is the runtime argument of the function.

@JackAKirk
Contributor

I think the only change needed here is to remove L from the template arguments of load/store and replace it with "unused", but keep memL, which is the runtime argument of the function.

Looks good now. There is also still the issue of whether Rows/Cols will take a default value in joint_matrix, right? And then, as a result, whether we have to move the Layout template parameter back before the Rows/Cols template parameters in joint_matrix.

Also, in the CUDA backend we have an additional template parameter in joint_matrix_load that is necessary for precision cases such as tf32, but you also have this in https://github.com/intel/llvm/pull/5920/files, so I guess it can wait until https://github.com/intel/llvm/pull/5920/files is merged?

@dkhaldi
Contributor

dkhaldi commented Sep 9, 2022

Looks good now. There is also still the issue of whether Rows/Cols will take a default value in joint_matrix, right? And then, as a result, whether we have to move the Layout template parameter back before the Rows/Cols template parameters in joint_matrix.

Also, in the CUDA backend we have an additional template parameter in joint_matrix_load that is necessary for precision cases such as tf32, but you also have this in https://github.com/intel/llvm/pull/5920/files, so I guess it can wait until https://github.com/intel/llvm/pull/5920/files is merged?

Yes, let's address these two in the next PR. But I will address the first in the doc PR.

#endif // __SYCL_DEVICE_ONLY__
}

template <typename Group, typename T1, typename T2, typename T3, size_t M,
Contributor

In the CUDA backend the layouts need to be provided to joint_matrix_mad because they must be known at compile time by the builtins. In the CUDA backend these layouts are inferred from the joint_matrix layout template parameters; see:

std::size_t K, std::size_t N, layout LayoutA, layout LayoutB>

So if we want interfaces that match in both backends, we need to add them here too, in the same way. For the user it should appear no different due to CTAD.
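A declaration-only sketch of what carrying the A/B layouts on joint_matrix_mad could look like here (the exact parameter list and return type are assumptions built from the fragment quoted above, not the actual CUDA-backend code):

// Sketch only: LayoutA/LayoutB are deduced automatically from the A and B
// matrix arguments, so the user never has to spell them out.
template <typename Group, typename T1, typename T2, typename T3, std::size_t M,
          std::size_t K, std::size_t N, layout LayoutA, layout LayoutB>
joint_matrix<T3, M, N, use::accumulator, layout::unused, Group>
joint_matrix_mad(Group sg,
                 joint_matrix<T1, M, K, use::a, LayoutA, Group> &A,
                 joint_matrix<T2, K, N, use::b, LayoutB, Group> &B,
                 joint_matrix<T3, M, N, use::accumulator, layout::unused, Group> &C);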

Contributor Author

Seems strange: if we want to support both the Intel and CUDA backends, why don't we need to provide a joint_matrix_load without memL?

Contributor
@dkhaldi dkhaldi left a comment

LGTM

@yubingex007-a11y
Contributor Author

@steffenlarsen would you help us merge it?

@steffenlarsen steffenlarsen merged commit 76ec14b into intel:sycl Sep 14, 2022