Skip to content

[SYCL] Implement matrix extension using new unified interface #7413

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Dec 16, 2022

Conversation

yubingex007-a11y
Copy link
Contributor

No description provided.

// uint16_t, this interpretation is possible. This design choice was made before
// the introduction of SYCL experimental bfloat16 type. Our plan is to move
// towards using the SYCL bfloat16. But since it is still experimental, we will
// probably keep both uint16 interpretation and SYCL bfloat16.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR moving bfloat16 out of experimental namespace is about to be merged: #6524
In my PR I have removed the uint16_t cases now. I think it is recommended to remove it because we want people to use bfloat16 and not be confused by the existence of uint16_t.

// fp32=>bf16). This is a workaround until we are able to use
// __spirv_ConvertFToBF16INTEL and __spirv_ConvertBF16ToFINTEL once these are
// supported in the CPU backend
static float make_fp32(uint16_t x) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is no longer required: see #6524

return *res;
}

static uint16_t make_bf16(float x) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is no longer required: see #6524

// with no member variables. Morally, it is equivalent to an enumeration--it
// just uses the type system to communicate the desired accuracy of arithmetic
// computations. Users can't construct a tf32
namespace precision {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to move precision::tf32 etc and layout enum to spirv_types also? @dkhaldi had some ideas on how it should be organized. I don't mind how we deal with this, I'll leave it up to you.


// unnecessary was introduced for backward compatibility.
// Once the use implementation is stable, "unnecessary" value will be omitted
enum class use { a, b, accumulator, unnecessary };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not need to have unecessary

@@ -29,3 +29,6 @@
#if (SYCL_EXT_ONEAPI_MATRIX_VERSION == 3)
#include <sycl/ext/oneapi/matrix/matrix-tensorcore.hpp>
#endif // SYCL_EXT_ONEAPI_MATRIX_VERSION
#if (SYCL_EXT_ONEAPI_MATRIX_VERSION == 4)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move unified to use 2.
Remove use implementation .
We need to wait for https://github.com/intel/llvm/pull/7077to get merged first

joint_matrix<int8_t, TK, TN, matrix_layout::packed_b> sub_b(sg);
joint_matrix<int32_t, TM, TN> sub_c(sg);
joint_matrix<int8_t, use::b, TK, TN, layout::packed> sub_b(sg);
joint_matrix<int32_t, use::accumulator, TM, TN, layout::dynamic> sub_c(sg);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove layout::dynamic

@dkhaldi
Copy link
Contributor

dkhaldi commented Nov 16, 2022

In matrix-intel.hpp, we need to add store of A and B,
In unified: load of A and B has no layout argument.
In unified: load of C must have layout
In Intel version, users can load A and B with layout argument.
These cases are needed for the tests for corner cases we made for element wise operations (ewo on B and the store B, we don't need load on B, so there is no need to specify layout in matrix B.)

Packed = 2,
Dynamic = 3
};
#else
enum class MatrixLayout : uint32_t {
RowMajor = 0,
ColumnMajor = 1,
PackedA = 2,
PackedB = 3,
Unused = 4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused is not needed

Packed = 2,
Dynamic = 3
};
#else
enum class MatrixLayout : uint32_t {
RowMajor = 0,
ColumnMajor = 1,
PackedA = 2,
PackedB = 3,
Unused = 4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for use, Dmitry is removing unecessary in this patch #7335
it will be merged soon. Once it is merged, this PR should be updated as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yubingex007-a11y #7335 has been merged. You can rebase your patch to remove all "unnecessary" use occurrences.

template <typename Group>
using joint_matrix_c = joint_matrix<Tc, defaultM, defaultN, use::accumulator,
using joint_matrix_c = joint_matrix<Tc, use::accumulator, defaultM, defaultN,
layout::row_major, Group>;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tpu_params will have to take layout of A and layout of B as template arguments. Then you pass them here.
Layout if the accumulator matrix should be layout::dynamic

template <typename Group>
using joint_matrix_b =
joint_matrix<Tb, defaultK, defaultN, use::b, layout::packed_b, Group>;
joint_matrix<Tb, use::b, defaultK, defaultN, layout::packed, Group>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take a look at #7307, packed now should be part of an Intel specific name space.
namespace sycl::ext::intel::experimental::matrix {
enum class layout {
packed
};

@@ -60,13 +60,13 @@ void matrix_multiply(big_matrix<T1, NUM_ROWS_C, NUM_COLS_C> &C,
const auto sg_starty = global_idy - spmd_item.get_local_id(1);

sycl::ext::oneapi::sub_group sg = spmd_item.get_sub_group();
joint_matrix<bfloat16, TM, TK, use::a> sub_a(sg);
joint_matrix<bfloat16, use::a, TM, TK, layout::row_major> sub_a(sg);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove use from the name

@@ -23,7 +23,7 @@
#include <sycl/ext/oneapi/matrix/static-query.hpp>
#endif // SYCL_EXT_ONEAPI_MATRIX_VERSION
#if (SYCL_EXT_ONEAPI_MATRIX_VERSION == 2)
#include <sycl/ext/oneapi/matrix/matrix-jit-use.hpp>
#include <sycl/ext/oneapi/matrix/matrix-unified.hpp>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove matrix-use.hpp

template <typename Group>
using joint_matrix_c = joint_matrix<Tc, defaultM, defaultN, use::accumulator,
using joint_matrix_c = joint_matrix<Tc, use::accumulator, defaultM, defaultN,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rebase this patch to use the changes in
#6981
it was merged.
These are mainly just changing names. So no big deal

@@ -206,12 +206,12 @@ struct tpu_params<

template <typename Group>
using joint_matrix_a =
joint_matrix<Ta, defaultM, defaultK, use::a, layout::row_major, Group>;
joint_matrix<Ta, use::a, defaultM, defaultK, layout::row_major, Group>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tpu_params should now take layout A and layout B as template arguments so we can pass them here.

…_data a free function and move packed into intel namespace
@@ -60,13 +60,19 @@ void matrix_multiply(big_matrix<T1, NUM_ROWS_C, NUM_COLS_C> &C,
const auto sg_starty = global_idy - spmd_item.get_local_id(1);

sycl::ext::oneapi::sub_group sg = spmd_item.get_sub_group();
Copy link
Contributor

@dkhaldi dkhaldi Dec 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sub_group should be part of sycl namespace so no need for sycl::ext::oneapi
remove use from the name of the tests

@@ -60,13 +60,19 @@ void matrix_multiply(big_matrix<T1, NUM_ROWS_C, NUM_COLS_C> &C,
const auto sg_starty = global_idy - spmd_item.get_local_id(1);

sycl::ext::oneapi::sub_group sg = spmd_item.get_sub_group();
joint_matrix<bfloat16, TM, TK, use::a> sub_a(sg);
joint_matrix<sycl::ext::oneapi::sub_group, bfloat16, use::a, TM, TK,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace with sub_group

@@ -152,13 +152,14 @@ struct tpu_params<tpu::amx, Ta, Tb, Tc, 0, 0, 0,

template <typename Group>
using joint_matrix_a =
joint_matrix<Ta, defaultM, defaultK, use::a, layout::row_major, Group>;
joint_matrix<Group, Ta, use::a, defaultM, defaultK, layout::row_major>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should take layout as argument. we should not only assume row major layout

@@ -23,7 +23,7 @@
#include <sycl/ext/oneapi/matrix/static-query.hpp>
#endif // SYCL_EXT_ONEAPI_MATRIX_VERSION
#if (SYCL_EXT_ONEAPI_MATRIX_VERSION == 2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leave it as 4 so it does not break users'code of Jack's implementation

@yubingex007-a11y
Copy link
Contributor Author

/verify with intel/llvm-test-suite#1334

@yubingex007-a11y
Copy link
Contributor Author

yubingex007-a11y commented Dec 14, 2022

I am taking care of the testing and my changes do not break CUDA tests in intel/llvm-test-suite#1334
take notes for myself "/verify xxx" can't work for cuda's testcases, we verify cuda's testcase in local machine.

Copy link
Contributor

@bader bader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, remove a.out.

@yubingex007-a11y
Copy link
Contributor Author

Please, remove a.out.

😂 sorry i should have noticed it.

@yubingex007-a11y
Copy link
Contributor Author

ping? @dkhaldi @steffenlarsen @hdelan

@yubingex007-a11y
Copy link
Contributor Author

/verify with intel/llvm-test-suite#1391

@@ -24,11 +24,11 @@
#endif // SYCL_EXT_ONEAPI_MATRIX_VERSION
#if (SYCL_EXT_ONEAPI_MATRIX_VERSION == 2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove matrix-jit-use. it does not work anymore because we removed SPIRV and codegen support for this.


enum class use { a, b, accumulator };

enum class layout { row_major = 0, col_major = 1, dynamic = 3 };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what was the reason behind this change:
enum class layout { row_major, col_major, dynamic };
to
enum class layout { row_major = 0, col_major = 1, dynamic = 3 };
?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because layout::packed carry the value 2

@@ -1,4 +1,4 @@
// RUN: %clangxx -fsycl -O2 -DSYCL_EXT_ONEAPI_MATRIX_VERSION=2 %s -o %t.out
// RUN: %clangxx -fsycl -O2 -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 %s -o %t.out
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove "-use" from the name of the tests that use the unified API

__SYCL_INLINE_VER_NAMESPACE(_V1) {
namespace ext {
namespace intel::experimental::matrix::layout {
constexpr sycl::ext::oneapi::experimental::matrix::layout packed =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we introducing the layout in a new namespace here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hdelan, I see you are not in sync with the changes we made ;) to make this a unified API: write one code, run on Intel AMX, Intel XMX and Nvidia Tensor Cores.
The PR for documentation is in #7307

The basic idea is that anything that is Intel specific (like packed which is the VNNI layout) should go to a new Intel extension with a new namespace.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct I am not fully up to date with joint matrix atm! OK thanks for explanation, that makes sense to me.

Copy link
Contributor

@dkhaldi dkhaldi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yubingex007-a11y
Copy link
Contributor Author

Ping? @intel/llvm-reviewers-runtime

Copy link
Contributor

@steffenlarsen steffenlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny nit, but LGTM.

Comment on lines +31 to +32
#endif // defined(__SYCL_DEVICE_ONLY__)
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#endif // defined(__SYCL_DEVICE_ONLY__)
#endif
#endif // defined(__NVPTX__)
#endif // defined(__SYCL_DEVICE_ONLY__)

@yubingex007-a11y
Copy link
Contributor Author

would you please help merge it? @steffenlarsen

@steffenlarsen steffenlarsen changed the title [SYCL][INTEL] Implementation of matrix ext using new unified interface [SYCL] Implement matrix extension using new unified interface Dec 16, 2022
@steffenlarsen steffenlarsen merged commit f4a9ef1 into intel:sycl Dec 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants