[Matrix][SYCL] Add use argument for joint_matrix and add another feature macro for it #5835
Conversation
The remaining Lit failure is: "File /__w/llvm/llvm/build/bin/../include/sycl/ext/oneapi/matrix/matrix-tensorcore.hpp Line 21: no member named 'dynamic_extent' in namespace 'sycl'". It's strange, since I see "inline constexpr size_t dynamic_extent = SIZE_MAX;" in sycl/include/CL/sycl/sycl_span.hpp.
@@ -38,7 +38,7 @@ namespace sycl {
// 2- provides JIT implementation (target agnostic) for the
// experimental matrix extension
#ifndef SYCL_EXT_ONEAPI_MATRIX
#define SYCL_EXT_ONEAPI_MATRIX 2
#define SYCL_EXT_ONEAPI_MATRIX 3
change to 2
joint_matrix<Tb, defaultK, defaultN, matrix_layout::packed_b,
             matrix_use::unnecessary, Group>;
template <typename Group>
using joint_matrix_c =
give correct template param
Besides my other comments, please add a test case that uses "use".
__spv::__spirv_JointMatrixINTEL<T2, K, N, LB, S> *B,
__spv::__spirv_JointMatrixINTEL<T3, M, N, LC, S> *C,
__spv::__spirv_JointMatrixINTEL<T1, M, K, LA, UA, S> *A,
__spv::__spirv_JointMatrixINTEL<T1, K, N, LB, UB, S> *B,
this should be T2, right?
__spv::__spirv_JointMatrixINTEL<T2, K, N, LB, S> *B,
__spv::__spirv_JointMatrixINTEL<T3, M, N, LC, S> *C,
__spv::__spirv_JointMatrixINTEL<T1, M, K, LA, UA, S> *A,
__spv::__spirv_JointMatrixINTEL<T1, K, N, LB, UB, S> *B,
T2
__spv::__spirv_JointMatrixINTEL<T2, K, N, LB, S> *B,
__spv::__spirv_JointMatrixINTEL<T3, M, N, LC, S> *C,
__spv::__spirv_JointMatrixINTEL<T1, M, K, LA, UA, S> *A,
__spv::__spirv_JointMatrixINTEL<T1, K, N, LB, UB, S> *B,
T2
namespace oneapi {
namespace experimental::matrix {

enum class matrix_layout { row_major, col_major, packed_a, packed_b };
do we need packed_a, packed_b here?
can you make the change to just packed?
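As a minimal sketch of the suggested change (assuming the two packed values collapse into a single one):

enum class matrix_layout { row_major, col_major, packed };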
enum class matrix_use { matrix_a, matrix_b, accumulator, unnecessary };

template <matrix_use Use> struct spv_matrix_use_traits {
  static constexpr __spv::MatrixUse value = __spv::MatrixUse::MatrixA;
If only the specialization on line 47 is the one really used, this should be empty, right?
enum class matrix_use { matrix_a, matrix_b, accumulator, unnecessary };

template <matrix_use Use> struct spv_matrix_use_traits {
  static constexpr __spv::MatrixUse value = __spv::MatrixUse::MatrixA;
same as above, this should be empty.
You mean we should remove the default value? But we have a default value for matrix_layout's traits.
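A sketch of what is being asked for, assuming the primary template is left without a value and only the specializations define it (only the MatrixA enumerator appears in the snippet above; the remaining specializations are assumptions):

// Primary template intentionally left without a value: instantiating it with an
// unhandled matrix_use then becomes a compile-time error.
template <matrix_use Use> struct spv_matrix_use_traits;

template <> struct spv_matrix_use_traits<matrix_use::matrix_a> {
  static constexpr __spv::MatrixUse value = __spv::MatrixUse::MatrixA;
};
// ... analogous specializations for matrix_b, accumulator, and unnecessary ...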
@@ -0,0 +1,435 @@
//===-------------- static-query.hpp - SYCL matrix ------------*- C++ -*---===//
static-query-use.hpp
did you correct this?
nd_range<2>({NDRangeM, NDRangeN}, {1, 1}),
[ accA, accB, accC, M, N, K ](nd_item<2> spmd_item)
[[intel::reqd_sub_group_size(1)]]
nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
SG size attribute is missing in this test.
Please start from a test that explicitly uses the SG size attribute and just delete the AMX test.
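A sketch of the kernel launch with the sub-group size attribute kept, based on the snippet above (the surrounding cgh.parallel_for call is assumed; SG_SZ and the capture list are taken from the test, and the kernel body is elided):

cgh.parallel_for(
    nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
    [accA, accB, accC, M, N, K](nd_item<2> spmd_item)
        [[intel::reqd_sub_group_size(SG_SZ)]] {
      // ... joint_matrix load/mad/store body ...
    });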
@bader it seems the failure in Jenkins/Precommit is a common issue, could you verify it?
LGTM
ping? @intel/llvm-reviewers-runtime
Generally I think it looks good, but why do most of the tests have …
namespace sycl {
namespace ext {
namespace oneapi {
namespace experimental::matrix {
While you are here, the nested namespaces could be collapsed into one:

namespace sycl::ext::oneapi::experimental::matrix {
@JackAKirk, in this PR, we add the use argument to transition from the required layout (especially packed_b) to the "use" argument. This will hopefully make it possible to remove the duplicated code in the tensorcores implementation. I don't think we will merge the two into one file, but I would like to see the tensorcores implementation reuse the same definitions and not redefine the types, functions, enums, etc.
Sounds good, I'll take a look. Hopefully once we make the tensor cores impl support the WI interface as well, we can think about supporting them in the same …
@@ -129,10 +136,12 @@ enum class MatrixLayout : uint32_t {
// information to SPIRV translator.
// The long term solution would be to introduce a matrix type in Clang and use
// it instead of this member.
template <typename T, std::size_t R, std::size_t C, MatrixLayout U,
template <typename T, std::size_t R, std::size_t C, MatrixUse U, MatrixLayout L,
This matches the template arguments used for joint_matrix in matrix-tensorcore.hpp (the current CUDA impl), except that the ordering is a little different (MatrixUse is the second template param in matrix-tensorcore.hpp). Trivial point, but we should align on an ordering before merging the impls into a single SYCL_EXT_ONEAPI_MATRIX version.
This is a good point. Currently, it is at this position because it is an optional argument for now (to keep backward compatibility with the previous API, which has no use argument). Once this API with the "use" argument is stable enough that we can remove the non-use API, we can revise the order.
My personal preference is to keep the arguments that might become "optional" in the future last. Today, "use" is required but layout is definitely optional.
I suggest not modifying the order of the template parameters of the internal SPIR-V joint matrix type representation. The reason is that after this patch: #6535, clang started to generate an opaque matrix type like this: spirv.JointMatrixINTEL._half_10_2_0_0, and hence we can and will remove the array W/A. So it becomes crucial to keep the template parameters' order internally (it can still be changed in the user-visible API).
So, we will keep the workaround in SPIRVRegularizeLLVMBase::adaptStructTypes for now. Since the W/A will be removed in the future, we'd better move Use to the end of the template param list, right?
Yes
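To make that concrete, a sketch of the two orderings of the internal type's template parameters (parameter names beyond those visible in the diff, such as the scope flag S, are assumptions):

// Ordering in this PR (Use inserted before Layout):
//   template <typename T, std::size_t R, std::size_t C, MatrixUse U,
//             MatrixLayout L, Scope::Flag S>
//   struct __spirv_JointMatrixINTEL;

// Agreed direction: keep the pre-existing order and append Use at the end, so
// the opaque SPIR-V type mangling stays stable once the W/A is removed:
template <typename T, std::size_t R, std::size_t C, MatrixLayout L,
          Scope::Flag S, MatrixUse U>
struct __spirv_JointMatrixINTEL;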
struct joint_matrix {
public:
  __spv::__spirv_JointMatrixINTEL<
      T, NumRows, NumCols, spv_matrix_use_traits<Use>::value,
This matches the template arguments used for joint_matrix in matrix-tensorcore.hpp (the current CUDA impl), except that the ordering is a little different (matrix_use is the second template param in matrix-tensorcore.hpp). Trivial point, but we should align on an ordering before merging the impls into a single SYCL_EXT_ONEAPI_MATRIX version.
// unnecessary was introduced for backward compatibility.
// Once the use implementation is stable, "unnecessary" value will be omitted
enum class matrix_use { matrix_a, matrix_b, accumulator, unnecessary };
in matrix-tensorcores.hpp we define "a"/"b" rather than "matrix_a"/"matrix_b":
enum class matrix_use { a, b, accumulator };
We should align on a single naming for the matrix_use members, since it is a user-facing class that the user will have to provide to the joint_matrix template constructor.
I prefer adding matrix there, but then why is there "matrix" on "a" and "b" and not on "accumulator"?
I think a better naming could be "left", "right", and "accumulator".
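For reference, the three spellings under discussion side by side (the wrapping namespaces below are only there so the sketch compiles; they are not part of any proposal):

namespace this_pr { enum class matrix_use { matrix_a, matrix_b, accumulator, unnecessary }; }
namespace cuda_impl { enum class matrix_use { a, b, accumulator }; }          // matrix-tensorcores.hpp
namespace alternative { enum class matrix_use { left, right, accumulator }; } // naming floated above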
                   access::address_space Space>
inline __SYCL_ALWAYS_INLINE void
joint_matrix_store(Group sg,
                   joint_matrix<T, NumRows, NumCols, Use, MatL, Group> &src,
This is a more interesting point: are you sure that matrix_use is necessary as a parameter for joint_matrix_store? In the CUDA backend it is not necessary, because you can only call joint_matrix_store for a matrix_use::accumulator matrix, since there is no use case to store a matrix back to pointer storage other than the matrix that is the output of an MMA, which is always a matrix_use::accumulator matrix. I'd be surprised if this was different in the Intel backend!?
If this template param is removed, this function interface will align with the definition in matrix-tensorcores.hpp.
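A sketch of how joint_matrix_store could look without the use template parameter, i.e. restricted to accumulator matrices as in matrix-tensorcores.hpp (the multi_ptr/stride/memL parameters are assumed from the usual extension interface rather than copied from this diff):

template <typename Group, typename T, size_t NumRows, size_t NumCols,
          access::address_space Space>
inline __SYCL_ALWAYS_INLINE void
joint_matrix_store(Group sg,
                   joint_matrix<T, NumRows, NumCols, use::accumulator,
                                layout::unused, Group> &src,
                   multi_ptr<T, Space> dst, size_t stride, layout memL);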
joint_matrix<ext::oneapi::sub_group, unsigned short, TM, TK> sub_a(
    sg);
sycl::ext::oneapi::sub_group sg = spmd_item.get_sub_group();
joint_matrix<bfloat16, TM, TK, use::a> sub_a(sg);
// For B, since current implementation does not support non-packed
this comment does not apply anymore
LGTM
return;
joint_matrix<bfloat16, TK, TN, use::b> sub_b(sg);
joint_matrix<float, TM, TN, use::accumulator> sub_c(sg);

// AMX: 8 register tiles : 1k byte size, SMmaxxSKmax =16x64
Also, let's not make this AMX-specific --> remove this comment.
};

template <typename Group, typename T, size_t NumRows, size_t NumCols, use Use,
          layout Layout = layout::unused, access::address_space Space>
I don't think that this needs layout Layout = layout::unused here, right? This template parameter can safely be removed since it is not used in intel/cuda/hip for this interface.
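A sketch of the simplified joint_matrix_load signature with that template parameter dropped (the trailing pointer/stride/memL parameters are assumed; treat this as an illustration of the suggestion, not the final interface):

template <typename Group, typename T, size_t NumRows, size_t NumCols, use Use,
          access::address_space Space>
inline __SYCL_ALWAYS_INLINE void
joint_matrix_load(Group sg,
                  joint_matrix<T, NumRows, NumCols, Use, layout::unused, Group> &res,
                  multi_ptr<T, Space> src, size_t stride, layout memL);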
return;
joint_matrix<bfloat16, TK, TN, use::b> sub_b(sg);
joint_matrix<float, TM, TN, use::accumulator> sub_c(sg);

// AMX: 8 register tiles : 1k byte size, SMmaxxSKmax =16x64
// strideX = X's cols, so strideC = N, strideA = K, strideB = N*4
joint_matrix_load(sg, sub_c,
use joint_matrix_fill instead
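A sketch of the suggested replacement, assuming the joint_matrix_fill interface takes the sub-group, the matrix, and the fill value:

// Initialize the accumulator in registers instead of loading it from memory:
joint_matrix_fill(sg, sub_c, 1.0f);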
          layout Layout = layout::unused, access::address_space Space>
inline __SYCL_ALWAYS_INLINE void
joint_matrix_load(Group sg,
                  joint_matrix<T, NumRows, NumCols, Use, Layout, Group> &res,
You can change this to joint_matrix<T, NumRows, NumCols, Use, layout::unused, Group> &res, then get rid of the layout template parameter as mentioned above.
But for the CUDA backend, layout must be specified for matrices A and B; CUDA users won't set it to unused for A and B, right?
According to #5835 (comment) and #5835 (review).
That's right for joint_matrix_load; however, I think it is better if there is a separate overload for that case of joint_matrix_load in the CUDA backend, like the one I added here: void joint_matrix_load(…)
Then we also have a version of joint_matrix_load that is only for the accumulator matrix::use, here: void joint_matrix_load(…)
This immediately prevents the user from using both the template layout parameter and the non-template layout parameter at the same time. It also allows the Intel backend implementation to be simplified in the suggested way.
But for joint_matrix_store, even in the CUDA backend there are no cases that require the layout as a template parameter (since joint_matrix_store is only functional for matrix::use::accumulator anyway in the CUDA backend, and this matrix::use type always relies on the non-template runtime layout parameter). So in the Intel backend the layout template parameter is not needed either.
It is a little confusing, but I hope that makes sense?
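A sketch of the two-overload approach being described, following the CUDA backend pattern linked above (the signatures are paraphrased rather than copied from matrix-tensorcores.hpp):

// Overload for A/B matrices: the layout is a compile-time property of the
// joint_matrix type, so no runtime layout argument is accepted.
template <typename Group, typename T, size_t NumRows, size_t NumCols, use Use,
          layout Layout, access::address_space Space>
void joint_matrix_load(Group sg,
                       joint_matrix<T, NumRows, NumCols, Use, Layout, Group> &res,
                       multi_ptr<T, Space> src, size_t stride);

// Overload for accumulator matrices only: the layout is not encoded in the
// type, so it is passed as a runtime argument instead.
template <typename Group, typename T, size_t NumRows, size_t NumCols,
          access::address_space Space>
void joint_matrix_load(Group sg,
                       joint_matrix<T, NumRows, NumCols, use::accumulator,
                                    layout::unused, Group> &res,
                       multi_ptr<T, Space> src, size_t stride, layout memL);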
}

template <typename Group, typename T, size_t NumRows, size_t NumCols, use Use,
          layout MatL = layout::unused, access::address_space Space>
Same situation here as with joint_matrix_load. You can remove layout MatL = layout::unused from the template parameters, then directly use layout::unused in joint_matrix<T, NumRows, NumCols, Use, layout::unused, Group>.
Otherwise it makes it legal for a user to try to call joint_matrix_load/store with a layout different from layout::unused, which is wrong.
The clang-format issue has no relationship with this PR, since we don't change clang/lib/Driver/ToolChains/SYCL.h or sycl/test/basic_tests/stdcpp_compat.cpp.
Most testing won't run until lint is happy. I suspect this is a previously fixed problem that is still present here because the PR is based on a faulty intel/llvm HEAD. Could you please try to merge the sycl branch into this? My hope is that will fix it.
ok, let me try, thanks!
ping?
I think the only change needed here is to remove L from the template arguments of load/store and replace it with "unused". But keep the memL that is the argument of the function.
Looks good now. There is also still the issue of whether Rows/Cols will take a default value in … Also in the CUDA backend we have an additional template parameter in …
Yes, let's address these two in the next PR. But I will address the first in the doc PR.
#endif // __SYCL_DEVICE_ONLY__
}

template <typename Group, typename T1, typename T2, typename T3, size_t M,
In the CUDA backend the layouts need to be provided to joint_matrix_mad because they must be known at compile time by the builtins. In the CUDA backend these layouts are inferred from the joint_matrix layout template parameters, see: std::size_t K, std::size_t N, layout LayoutA, layout LayoutB>
So if we want interfaces that match in both backends, we need to add them here too, in the same way. For the user it should appear no different due to CTAD.
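A sketch of what aligning joint_matrix_mad would look like, with the A/B layouts carried as template parameters and deduced from the joint_matrix arguments (the exact signature is an assumption based on the pattern quoted above, not the merged interface):

template <typename Group, typename T1, typename T2, typename T3, std::size_t M,
          std::size_t K, std::size_t N, layout LayoutA, layout LayoutB>
joint_matrix<T3, M, N, use::accumulator, layout::unused, Group>
joint_matrix_mad(Group sg,
                 joint_matrix<T1, M, K, use::a, LayoutA, Group> &A,
                 joint_matrix<T2, K, N, use::b, LayoutB, Group> &B,
                 joint_matrix<T3, M, N, use::accumulator, layout::unused, Group> &C);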
Seems strange: if we want to support the Intel and CUDA backends, why don't we need to provide a matrix_load without memL?
LGTM
@steffenlarsen would you help us merge it?