
[CUDA][Matrix][Doc] Introduced sycl_ext_oneapi_matrix_cuda extension. #6968


Closed
wants to merge 16 commits into from

Conversation

JackAKirk
Contributor

This document details the CUDA-only features of the matrix extension for DPC++.
This extension is built on top of the backend-agnostic matrix extension that is being updated here: #6662.

@JackAKirk JackAKirk requested a review from a team as a code owner October 5, 2022 15:58
@JackAKirk JackAKirk requested review from dkhaldi and gmlueck October 5, 2022 15:59
@zjin-lcf
Contributor

zjin-lcf commented Oct 6, 2022

Could you point me to an equivalent CUDA example? Thanks.

@JackAKirk
Contributor Author

JackAKirk commented Oct 6, 2022

Could you point me to an equivalent CUDA example? Thanks.

For the marray functionality, I don't think there is a direct analogue in the CUDA Runtime API.

For BMAD, see the section on sub-byte operations here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma-subbyte
joint_matrix_bmad has the same functionality as bmma_sync in the CUDA Runtime API. The implementation of joint_matrix_bmad is here: #5363. However, this will not be merged until we merge the unified matrix extension from #6662.

For an example of how binary MADs can be leveraged on current Nvidia® hardware, see (A. Li and S. Su. IEEE Transactions on Parallel and Distributed Systems, 32(7):1878-1891, 2021).
See this paper also for further details of the interfaces used in the Nvidia® CUDA Runtime API.

The XOR operator is deprecated on Nvidia hardware and will be unsupported on sm_90, and new operators (e.g. XNOR, as a guess) may be added in the future as research finds them to be more useful - see below.

More motivation for BMAD (there are some other applications beyond deep learning, but I believe this was the initial motivation):

Single-bit MADs can be used as part of Binarized Neural Networks (BNNs) in the case that both the activations and weights are binarized. "Quantizing" a network to form a BNN represents the extreme limit of reducing the precision of the network's degrees of freedom in order to gain performance and improve efficiency.
Hubara et al. (I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized Neural Networks, Advances in Neural Information Processing Systems 29 (NIPS 2016)) first demonstrated the utility of an algorithm that could use both binarized activations and weights with backpropagation, by keeping track of real-valued weights that are mapped to the binarized weights. In the backward pass, the real-valued weights are updated according to a heuristic named the "Straight-Through Estimator", whereby the gradient of the loss function with respect to the real weights is set equal to the gradient of the loss function with respect to the binarized weights.
This implies that the precision of the data type used in the matrix multiplications can be single bit, with the necessary addition of forward and backward element-wise mappings between the binarized and real-valued representations of the matrices.
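In symbols (a restatement of the description above, assuming the usual sign-based binarization), the Straight-Through Estimator keeps real-valued weights w_r alongside the binarized weights w_b and copies the gradient across the binarization step:

```latex
w_b = \mathrm{sign}(w_r), \qquad
\frac{\partial L}{\partial w_r} := \frac{\partial L}{\partial w_b}
```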
This could prove a significant advantage for large models, since the computational cost of matrix multiplication scales with the number of elements per dimension, N, as O(N^3) for square matrices, whereas the corresponding element-wise operations scale as O(N^2).
Further algorithms based on this binarized approach have been proposed; e.g. see Rastegari et al. (M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Computer Vision – ECCV 2016, 525-542), who compared a binarized version of a CNN (using an XNOR binary dot product) against corresponding full-precision models, for both the accuracy and performance of image classification using the ImageNet data set.

@JackAKirk
Contributor Author

Hi @dkhaldi @gmlueck

I've updated this document following the merge of #6662

Do you want me to change anything in the document?

Do you want me to move this file to extensions/experimental/sycl_ext_oneapi_matrix_cuda/ or somewhere else?

Thanks

@dkhaldi
Contributor

dkhaldi commented Nov 1, 2022

Do you want me to move this file to extensions/experimental/sycl_ext_oneapi_matrix_cuda/ or somewhere else?

I think this file and a new file for the Intel-specific extension should be added to https://github.com/intel/llvm/tree/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix

So this same location will contain the matrix-specific APIs: the unified one + CUDA-specific + Intel-specific.


#### Valid `joint_matrix` types and shapes

The complete set of matrix data types and shapes that are supported by the CUDA backend is represented in the following table. Tm indicates the matrix element data type held by a "multiplicand" `joint_matrix`, i.e. one requiring `use::a` or `use::b`. Tc indicates the matrix element data type held by an "accumulator" `joint_matrix`, i.e. one requiring `use::accumulator`.
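As a hedged illustration (not part of the document under review; the template parameter order `<Group, T, Use, Rows, Cols, Layout>` and the m16n16k16 half/float shape are assumptions based on the unified `sycl_ext_oneapi_matrix` extension), one Tm/Tc combination might map onto declarations inside a kernel like this:

```cpp
// Inside a kernel body, where sg is the sycl::sub_group of the work-item.
using namespace sycl::ext::oneapi::experimental::matrix;

// Tm = half: "multiplicand" matrices, which require use::a or use::b.
joint_matrix<sycl::sub_group, sycl::half, use::a, 16, 16, layout::row_major> tA;
joint_matrix<sycl::sub_group, sycl::half, use::b, 16, 16, layout::row_major> tB;

// Tc = float: the "accumulator" matrix, which requires use::accumulator.
joint_matrix<sycl::sub_group, float, use::accumulator, 16, 16> tC;
joint_matrix_fill(sg, tC, 0.0f); // zero-initialize the accumulator
```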
Contributor

Should these be added to the query interface?

Contributor Author

Yes, they can be, although we will not do this immediately, so I documented them here. Also, there is some additional useful information here that cannot be represented in the query interface as it currently stands: minimum Compute Capability, etc.

Contributor

It would be good to know the limitations of the current query interface so we can make a final version of it, in order to move the matrix API (unified, including the query interface) out of experimental.

Contributor Author

@JackAKirk JackAKirk Nov 1, 2022

Sure, we can look into that. For the moment our priority is getting the main functionality correct and portable, so that our libraries can be consistent with the interfaces used by the DPC++ compiler. That way things are more likely to work out of the box for people, and we minimize any changes that we will have to make in the future to the main minimum viable product that users will need for accelerated GEMM.

Contributor Author

@JackAKirk JackAKirk Nov 1, 2022

FYI, we also plan to start work on the MVP for HIP AMD soon. Although it is expected to be very similar to the CUDA backend, it has some differences and will need a first implementation to iron out any unforeseen issues we may come across, which could also influence experimental supplementary features such as the query interface, in the same way as you suggest for CUDA.

Contributor

Sounds good. Please add a TODO list that includes adapting the query interface to the Tensor Cores TPU, which is currently missing.

Contributor Author

OK, I was just looking over the query interface. It looks like it is quite important if you want to provide a large matrix and you aren't sure how to break it up into submatrices, so you introduce these:

constexpr int msize = break_dimension(params, M);
constexpr int msize_remainder = break_dimension_remainder(params, M);

Is this more important for AMX, or for both AMX and DPAS? It sounds very interesting.
I'm just guessing, but I thought it might be more important for AMX because there are more options for sizes, since there is a continuous range.
Library teams doing similar things might be interested in using this to pick good parameters.

BTW, do you think you should also add a table similar to the one I added here, but for DPAS/AMX, in sycl_ext_oneapi_matrix_intel? Even with the query interface, perhaps the information should also be written down somewhere for users?

Contributor

The query interface was added specifically to avoid adding such tables in the documentation.
The query interface will let the user write portable code across different generations of hardware and implementations.
I usually keep this kind of table in slides for presentation purposes.

Contributor Author

@JackAKirk JackAKirk Nov 7, 2022

What if the user wants the information before writing any code at all?

This extension provides a feature-test macro as described in the core SYCL
specification section 6.3.3 "Feature test macros". Therefore, an
implementation supporting this extension must predefine the macro
`SYCL_EXT_ONEAPI_MATRIX_CUDA` to one of the values defined in the table below.
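For example (a minimal sketch, not from the extension document; it uses the macro name as spelled in this revision, which the comment below suggests may be renamed), an application could test for the extension like this:

```cpp
#include <iostream>
#include <sycl/sycl.hpp>

int main() {
#if defined(SYCL_EXT_ONEAPI_MATRIX_CUDA)
  // CUDA-specific joint_matrix code paths could be selected here.
  std::cout << "CUDA matrix extension revision: "
            << SYCL_EXT_ONEAPI_MATRIX_CUDA << '\n';
#else
  std::cout << "CUDA matrix extension not available\n";
#endif
  return 0;
}
```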
Contributor

@dkhaldi dkhaldi Nov 7, 2022

What is the convention used here for the naming: should this be called SYCL_EXT_ONEAPI_MATRIX_CUDA or SYCL_EXT_ONEAPI_CUDA_MATRIX?

I just looked at a CUDA-only feature. It is called SYCL_EXT_ONEAPI_CUDA_ASYNC_BARRIER.

Contributor Author

Good point, I'll make this change.

Contributor Author

@JackAKirk JackAKirk Nov 9, 2022

BTW, I will also update #5363 to reflect this extension. However, I think I will wait until #7077 is merged before adding the final BMAD implementation in #5363 on top of it.

@JackAKirk
Contributor Author

JackAKirk commented Dec 5, 2022

I've removed the get_wi_marray function since I don't think we plan to support it anymore. I've also removed joint_matrix_bmad because, if we still want to include it, there are quite a few issues to resolve first with regard to how we connect it to the main matrix extension. For the time being I think we will simply shelve it.

Note that the document added in this PR now only documents additional constraints on the core features of the main matrix extension when using the ext_oneapi_cuda backend: this covers not only the supported joint_matrix types, but also some constraints on the parameters that may be passed to joint_matrix_load and joint_matrix_store. Perhaps this should not really be presented as an extension document per se, but rather as a backend implementation reference for the main matrix extension: there could be such a document for each backend, listing the supported parameters for the joint_matrix API in that backend?

Then, if there are any backend-specific APIs (such as joint_matrix_bmad), they could be added to a separate extension document.

@zjin-lcf
Contributor

zjin-lcf commented Dec 5, 2022

Are there some SYCL examples showing the usage of the sycl_ext_oneapi_matrix_cuda extension?

@JackAKirk
Contributor Author

JackAKirk commented Dec 5, 2022

Are there some SYCL examples showing the usage of the sycl_ext_oneapi_matrix_cuda extension?

See my message above: this PR now just provides additional details for users of sycl_ext_oneapi_matrix on the ext_oneapi_cuda backend. No features outside sycl_ext_oneapi_matrix are mentioned, so I plan to remove the references to sycl_ext_oneapi_matrix_cuda. The currently published examples are available here: https://github.com/intel/llvm-test-suite/pull/1334/files#diff-baabe8a5f5eb1de4f775c5370e698bdddbddbf321bf9ab3ae43f76b5e634185f

Ideally, for the 2023.1 release we will provide some sample code for a larger-scale GEMM problem employing some optimizations: using shared memory, and padding to minimize shared memory bank conflicts.

Here is sample code for the proposed new feature that originally constituted sycl_ext_oneapi_matrix_cuda, joint_matrix_bmad: intel/llvm-test-suite#760. However, this feature will not be added at the moment.

@gmlueck
Contributor

gmlueck commented Dec 6, 2022

It looks to me like there are three categories of limitations documented in this PR now:

  1. You can't use the "packed" layout.
  2. A list of restrictions on the component type and the M, N, and K sizes (per compute capability).
  3. A limitation on the stride parameter to joint_matrix_load and joint_matrix_store.

For (1), I think @dkhaldi said she would remove "packed" from the portable extension, so this will no longer need to be documented as a CUDA limitation.

All of the (2) restrictions can be covered by the "query" API in the portable extension. I think this is on our list of things to discuss further.

Can the (3) limitation be written into the portable API? I think our goal for the portable extension is that it is portable to all devices, including Nvidia.

### Additional constraints in the `ext_oneapi_cuda` backend

IMPORTANT: The `stride` argument to `joint_matrix_load` and `joint_matrix_store` must be a multiple of 8 when `T` is `half`, and a multiple of 4 when `T` is `float`; where `T` is the type of the `joint_matrix` elements.
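As a purely illustrative sketch (not taken from the document; the m16n16k16 half/float shape, the use of USM with `address_space_cast`, and the value-returning `joint_matrix_mad` form follow my reading of the `sycl_ext_oneapi_matrix` revision current at the time and may differ from the final API), the row-major leading dimensions below double as the `stride` arguments and are chosen to satisfy the constraint above: `K` and `N` are multiples of 8 for the `half` matrices, and `N` is a multiple of 4 for the `float` accumulator.

```cpp
#include <sycl/sycl.hpp>

using namespace sycl;
using namespace sycl::ext::oneapi::experimental::matrix;

// One hypothetical shape from the m16n16k16 half/float combination.
constexpr size_t M = 16, N = 16, K = 16;

int main() {
  queue q;
  half  *A = malloc_shared<half>(M * K, q);
  half  *B = malloc_shared<half>(K * N, q);
  float *C = malloc_shared<float>(M * N, q);
  for (size_t i = 0; i < M * K; ++i) A[i] = 1.0f;
  for (size_t i = 0; i < K * N; ++i) B[i] = 1.0f;

  q.parallel_for(nd_range<2>({1, 32}, {1, 32}), [=](nd_item<2> it) {
     sub_group sg = it.get_sub_group();

     joint_matrix<sub_group, half, use::a, M, K, layout::row_major> tA;
     joint_matrix<sub_group, half, use::b, K, N, layout::row_major> tB;
     joint_matrix<sub_group, float, use::accumulator, M, N> tC;
     joint_matrix_fill(sg, tC, 0.0f);

     auto pA = address_space_cast<access::address_space::global_space,
                                  access::decorated::no>(A);
     auto pB = address_space_cast<access::address_space::global_space,
                                  access::decorated::no>(B);
     auto pC = address_space_cast<access::address_space::global_space,
                                  access::decorated::no>(C);

     // Strides: K = 16 and N = 16 are multiples of 8, as required for half;
     // N = 16 is a multiple of 4, as required for the float accumulator.
     joint_matrix_load(sg, tA, pA, K);
     joint_matrix_load(sg, tB, pB, N);
     tC = joint_matrix_mad(sg, tA, tB, tC);
     joint_matrix_store(sg, tC, pC, N, layout::row_major);
   }).wait();

  sycl::free(A, q);
  sycl::free(B, q);
  sycl::free(C, q);
  return 0;
}
```

A leading dimension such as 20 for the `half` matrices would break the multiple-of-8 requirement on the CUDA backend.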

Contributor

Is this a functional or performance requirement?
If functional, can there be a workaround to support other strides (like some sort of padding at the load level)?

Contributor Author

@JackAKirk JackAKirk Dec 7, 2022

This is functional. The PTX builtin requires this constraint. A workaround isn't possible.

@JackAKirk
Contributor Author

JackAKirk commented Dec 7, 2022

Can the (3) limitation be written into the portable API? I think our goal for the portable extension is that it is portable to all devices, including Nvidia.

Do you mean to write the (3) limitation into the query interface?

The point is that the user won't be able to use a stride argument if it doesn't satisfy the constraint when targeting the CUDA backend. I don't see that there is anything more we can do about this beyond documenting it properly.
This doesn't affect the portable interface; it just reduces the set of acceptable stride parameter values that are supported on all backends. There already isn't a single complete parameter set (meaning the complete set of parameters required by a single functional invocation of joint_matrix_load/store/mad) supported by both XMX and Tensor Cores.

I'll remove the reference to packed.

@gmlueck
Contributor

gmlueck commented Dec 7, 2022

Do you mean to write the (3) limitation into the query interface?

This is not what I had in mind originally, but I agree that it is an option.

I do fear that the matrix API will be very hard to use in a portable manner because so many of the parameters have device-specific limitations.

@JackAKirk
Contributor Author

JackAKirk commented Dec 7, 2022

Do you mean to write the (3) limitation into the query interface?

This is not what I had in mind originally, but I agree that it is an option.

I do fear that the matrix API will be very hard to use in a portable manner because so many of the parameters have device-specific limitations.

Yes, and this is just considering three backends (AMX, XMX, Tensor Cores)! AMD cases will be added next, and there is also little overlap with Tensor Cores in terms of the parameter space (but the API will be a perfect fit). I think we just have to do the best we can; already, a single API combined with good docs/queries is better than several different APIs, IMO.

My understanding is that making different backends fit for GEMM kernels is of greater importance in the realm of libraries and frameworks using well-established algorithms. Also, since we do not know the specific algorithm any given user is targeting, and since, if they aren't relying on libraries, the algorithm is likely to be novel anyway, I imagine that generally we just have to leave porting up to the user in any case. I don't really see how we can do better than choosing an API that fully exposes the functionality of the different vendors' matrix multiplication hardware, and then properly documenting it so that users can make correct decisions.

Signed-off-by: JackAKirk <[email protected]>
@JackAKirk
Contributor Author

Closed because #9019 puts this information into the correct place.

@JackAKirk JackAKirk closed this Apr 11, 2023