-
Notifications
You must be signed in to change notification settings - Fork 787
[SYCL] Implement matrix extension using new unified interface #7413
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL] Implement matrix extension using new unified interface #7413
Conversation
// uint16_t, this interpretation is possible. This design choice was made before | ||
// the introduction of SYCL experimental bfloat16 type. Our plan is to move | ||
// towards using the SYCL bfloat16. But since it is still experimental, we will | ||
// probably keep both uint16 interpretation and SYCL bfloat16. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR moving bfloat16 out of experimental namespace is about to be merged: #6524
In my PR I have removed the uint16_t cases now. I think it is recommended to remove it because we want people to use bfloat16 and not be confused by the existence of uint16_t.
// fp32=>bf16). This is a workaround until we are able to use | ||
// __spirv_ConvertFToBF16INTEL and __spirv_ConvertBF16ToFINTEL once these are | ||
// supported in the CPU backend | ||
static float make_fp32(uint16_t x) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is no longer required: see #6524
return *res; | ||
} | ||
|
||
static uint16_t make_bf16(float x) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is no longer required: see #6524
// with no member variables. Morally, it is equivalent to an enumeration--it | ||
// just uses the type system to communicate the desired accuracy of arithmetic | ||
// computations. Users can't construct a tf32 | ||
namespace precision { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to move precision::tf32
etc and layout
enum to spirv_types also? @dkhaldi had some ideas on how it should be organized. I don't mind how we deal with this, I'll leave it up to you.
|
||
// unnecessary was introduced for backward compatibility. | ||
// Once the use implementation is stable, "unnecessary" value will be omitted | ||
enum class use { a, b, accumulator, unnecessary }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should not need to have unecessary
@@ -29,3 +29,6 @@ | |||
#if (SYCL_EXT_ONEAPI_MATRIX_VERSION == 3) | |||
#include <sycl/ext/oneapi/matrix/matrix-tensorcore.hpp> | |||
#endif // SYCL_EXT_ONEAPI_MATRIX_VERSION | |||
#if (SYCL_EXT_ONEAPI_MATRIX_VERSION == 4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move unified to use 2.
Remove use implementation .
We need to wait for https://github.com/intel/llvm/pull/7077to get merged first
joint_matrix<int8_t, TK, TN, matrix_layout::packed_b> sub_b(sg); | ||
joint_matrix<int32_t, TM, TN> sub_c(sg); | ||
joint_matrix<int8_t, use::b, TK, TN, layout::packed> sub_b(sg); | ||
joint_matrix<int32_t, use::accumulator, TM, TN, layout::dynamic> sub_c(sg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove layout::dynamic
In matrix-intel.hpp, we need to add store of A and B, |
Packed = 2, | ||
Dynamic = 3 | ||
}; | ||
#else | ||
enum class MatrixLayout : uint32_t { | ||
RowMajor = 0, | ||
ColumnMajor = 1, | ||
PackedA = 2, | ||
PackedB = 3, | ||
Unused = 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused is not needed
Packed = 2, | ||
Dynamic = 3 | ||
}; | ||
#else | ||
enum class MatrixLayout : uint32_t { | ||
RowMajor = 0, | ||
ColumnMajor = 1, | ||
PackedA = 2, | ||
PackedB = 3, | ||
Unused = 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for use, Dmitry is removing unecessary in this patch #7335
it will be merged soon. Once it is merged, this PR should be updated as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yubingex007-a11y #7335 has been merged. You can rebase your patch to remove all "unnecessary" use occurrences.
template <typename Group> | ||
using joint_matrix_c = joint_matrix<Tc, defaultM, defaultN, use::accumulator, | ||
using joint_matrix_c = joint_matrix<Tc, use::accumulator, defaultM, defaultN, | ||
layout::row_major, Group>; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tpu_params will have to take layout of A and layout of B as template arguments. Then you pass them here.
Layout if the accumulator matrix should be layout::dynamic
template <typename Group> | ||
using joint_matrix_b = | ||
joint_matrix<Tb, defaultK, defaultN, use::b, layout::packed_b, Group>; | ||
joint_matrix<Tb, use::b, defaultK, defaultN, layout::packed, Group>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please take a look at #7307, packed now should be part of an Intel specific name space.
namespace sycl::ext::intel::experimental::matrix {
enum class layout {
packed
};
@@ -60,13 +60,13 @@ void matrix_multiply(big_matrix<T1, NUM_ROWS_C, NUM_COLS_C> &C, | |||
const auto sg_starty = global_idy - spmd_item.get_local_id(1); | |||
|
|||
sycl::ext::oneapi::sub_group sg = spmd_item.get_sub_group(); | |||
joint_matrix<bfloat16, TM, TK, use::a> sub_a(sg); | |||
joint_matrix<bfloat16, use::a, TM, TK, layout::row_major> sub_a(sg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove use from the name
@@ -23,7 +23,7 @@ | |||
#include <sycl/ext/oneapi/matrix/static-query.hpp> | |||
#endif // SYCL_EXT_ONEAPI_MATRIX_VERSION | |||
#if (SYCL_EXT_ONEAPI_MATRIX_VERSION == 2) | |||
#include <sycl/ext/oneapi/matrix/matrix-jit-use.hpp> | |||
#include <sycl/ext/oneapi/matrix/matrix-unified.hpp> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove matrix-use.hpp
template <typename Group> | ||
using joint_matrix_c = joint_matrix<Tc, defaultM, defaultN, use::accumulator, | ||
using joint_matrix_c = joint_matrix<Tc, use::accumulator, defaultM, defaultN, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rebase this patch to use the changes in
#6981
it was merged.
These are mainly just changing names. So no big deal
@@ -206,12 +206,12 @@ struct tpu_params< | |||
|
|||
template <typename Group> | |||
using joint_matrix_a = | |||
joint_matrix<Ta, defaultM, defaultK, use::a, layout::row_major, Group>; | |||
joint_matrix<Ta, use::a, defaultM, defaultK, layout::row_major, Group>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tpu_params should now take layout A and layout B as template arguments so we can pass them here.
…_data a free function and move packed into intel namespace
@@ -60,13 +60,19 @@ void matrix_multiply(big_matrix<T1, NUM_ROWS_C, NUM_COLS_C> &C, | |||
const auto sg_starty = global_idy - spmd_item.get_local_id(1); | |||
|
|||
sycl::ext::oneapi::sub_group sg = spmd_item.get_sub_group(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sub_group should be part of sycl namespace so no need for sycl::ext::oneapi
remove use from the name of the tests
@@ -60,13 +60,19 @@ void matrix_multiply(big_matrix<T1, NUM_ROWS_C, NUM_COLS_C> &C, | |||
const auto sg_starty = global_idy - spmd_item.get_local_id(1); | |||
|
|||
sycl::ext::oneapi::sub_group sg = spmd_item.get_sub_group(); | |||
joint_matrix<bfloat16, TM, TK, use::a> sub_a(sg); | |||
joint_matrix<sycl::ext::oneapi::sub_group, bfloat16, use::a, TM, TK, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replace with sub_group
@@ -152,13 +152,14 @@ struct tpu_params<tpu::amx, Ta, Tb, Tc, 0, 0, 0, | |||
|
|||
template <typename Group> | |||
using joint_matrix_a = | |||
joint_matrix<Ta, defaultM, defaultK, use::a, layout::row_major, Group>; | |||
joint_matrix<Group, Ta, use::a, defaultM, defaultK, layout::row_major>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should take layout as argument. we should not only assume row major layout
@@ -23,7 +23,7 @@ | |||
#include <sycl/ext/oneapi/matrix/static-query.hpp> | |||
#endif // SYCL_EXT_ONEAPI_MATRIX_VERSION | |||
#if (SYCL_EXT_ONEAPI_MATRIX_VERSION == 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leave it as 4 so it does not break users'code of Jack's implementation
/verify with intel/llvm-test-suite#1334 |
I am taking care of the testing and my changes do not break CUDA tests in intel/llvm-test-suite#1334 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, remove a.out.
😂 sorry i should have noticed it. |
ping? @dkhaldi @steffenlarsen @hdelan |
/verify with intel/llvm-test-suite#1391 |
@@ -24,11 +24,11 @@ | |||
#endif // SYCL_EXT_ONEAPI_MATRIX_VERSION | |||
#if (SYCL_EXT_ONEAPI_MATRIX_VERSION == 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove matrix-jit-use. it does not work anymore because we removed SPIRV and codegen support for this.
|
||
enum class use { a, b, accumulator }; | ||
|
||
enum class layout { row_major = 0, col_major = 1, dynamic = 3 }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what was the reason behind this change:
enum class layout { row_major, col_major, dynamic };
to
enum class layout { row_major = 0, col_major = 1, dynamic = 3 };
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because layout::packed carry the value 2
@@ -1,4 +1,4 @@ | |||
// RUN: %clangxx -fsycl -O2 -DSYCL_EXT_ONEAPI_MATRIX_VERSION=2 %s -o %t.out | |||
// RUN: %clangxx -fsycl -O2 -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 %s -o %t.out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove "-use" from the name of the tests that use the unified API
__SYCL_INLINE_VER_NAMESPACE(_V1) { | ||
namespace ext { | ||
namespace intel::experimental::matrix::layout { | ||
constexpr sycl::ext::oneapi::experimental::matrix::layout packed = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we introducing the layout in a new namespace here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hdelan, I see you are not in sync with the changes we made ;) to make this a unified API: write one code, run on Intel AMX, Intel XMX and Nvidia Tensor Cores.
The PR for documentation is in #7307
The basic idea is that anything that is Intel specific (like packed which is the VNNI layout) should go to a new Intel extension with a new namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct I am not fully up to date with joint matrix atm! OK thanks for explanation, that makes sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Ping? @intel/llvm-reviewers-runtime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny nit, but LGTM.
#endif // defined(__SYCL_DEVICE_ONLY__) | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#endif // defined(__SYCL_DEVICE_ONLY__) | |
#endif | |
#endif // defined(__NVPTX__) | |
#endif // defined(__SYCL_DEVICE_ONLY__) |
would you please help merge it? @steffenlarsen |
No description provided.