
[CUDA][Matrix][Doc] Introduced sycl_ext_oneapi_matrix_cuda extension. #6968


Closed
wants to merge 16 commits into from

Conversation

JackAKirk
Contributor

This document details the CUDA-only features of the matrix extension for DPC++.
This extension is built on top of the backend-agnostic matrix extension that is being updated here: #6662.

@JackAKirk JackAKirk requested a review from a team as a code owner October 5, 2022 15:58
@JackAKirk JackAKirk requested review from dkhaldi and gmlueck October 5, 2022 15:59
@zjin-lcf
Contributor

zjin-lcf commented Oct 6, 2022

Could you point me to an equivalent CUDA example? Thanks.

@JackAKirk
Contributor Author

JackAKirk commented Oct 6, 2022

Could you point me to an equivalent CUDA example? Thanks.

For the marray functionality, I don't think there is a direct analogue in the CUDA Runtime API.

For BMAD, see the section on sub-byte operations here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma-subbyte
joint_matrix_bmad has the same functionality as bmma_sync in the CUDA Runtime API. The implementation of joint_matrix_bmad is here: #5363. However, this will not be merged until we merge the unified matrix extension from #6662.

For an example of how binary MADs can be leveraged on current Nvidia® hardware, see (A. Li and S. Su. IEEE Transactions on Parallel and Distributed Systems, 32(7):1878-1891, 2021).
See this paper also for further details of the interfaces used in the Nvidia® CUDA Runtime API.

The XOR operator is deprecated on Nvidia hardware and will be unsupported on sm_90, and new operators (e.g. XNOR, as a guess) may be added in the future as research finds them to be more useful - see below.

More motivation for BMAD (there are some other applications beyond deep learning, but I believe this was the initial motivation):

Single-bit MADs can be used as part of Binarized Neural Networks (BNNs) in the case that both the activations and weights are binarized. "Quantizing" a network to form a BNN represents the extreme limit of reducing the precision of the network's degrees of freedom in order to gain performance and improve efficiency.
Hubara et al. (I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized Neural Networks, Advances in Neural Information Processing Systems 29 (NIPS 2016)) first demonstrated the utility of an algorithm that could use both binarized activations and weights with backpropagation, by keeping track of real-valued weights that are mapped to the binarized weights. In the backward pass, the real-valued weights are updated according to a heuristic named the "Straight-Through Estimator", whereby the gradient of the loss function with respect to the real weights is set equal to the gradient of the loss function with respect to the binarized weights.
This implies that the precision of the data type used in the matrix multiplications can be single bit, with the necessary addition of forward and backward element-wise mappings between the binarized and real-valued representations of the matrices.
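In symbols (a restatement of the description above, assuming the usual sign-based binarization), the Straight-Through Estimator keeps real-valued weights w_r alongside the binarized weights w_b and copies the gradient across the binarization step:

```latex
w_b = \mathrm{sign}(w_r), \qquad
\frac{\partial L}{\partial w_r} := \frac{\partial L}{\partial w_b}
```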
This could prove a significant advantage for large models, since the computational cost of matrix multiplication scales with the number of elements per dimension, N, as O(N^3) for square matrices, whereas the corresponding element-wise operations scale as O(N^2).
Further algorithms based on this binarized approach have been proposed; e.g. see Rastegari et al. (M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Computer Vision – ECCV 2016, 525-542), who compared a binarized version of a CNN (using an XNOR binary dot product) against corresponding full-precision models, for both the accuracy and performance of image classification using the ImageNet data set.

@JackAKirk
Contributor Author

Hi @dkhaldi @gmlueck

I've updated this document following the merge of #6662

Do you want me to change anything in the document?

Do you want me to move this file to extensions/experimental/sycl_ext_oneapi_matrix_cuda/ or somewhere else?

Thanks

@dkhaldi
Contributor

dkhaldi commented Nov 1, 2022

Do you want me to move this file to extensions/experimental/sycl_ext_oneapi_matrix_cuda/ or somewhere else?

I think this file and a new file for the Intel-specific extension should be added to https://github.com/intel/llvm/tree/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix

So this same location will contain the matrix-specific APIs: the unified one + CUDA-specific + Intel-specific.


#### Valid `joint_matrix` types and shapes

The complete set of matrix data types and shapes that are supported by the CUDA backend is represented in the following table. Tm indicates the matrix element data type held by a "multiplicand" `joint_matrix`, i.e. one requiring `use::a` or `use::b`. Tc indicates the matrix element data type held by an "accumulator" `joint_matrix`, i.e. one requiring `use::accumulator`.
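As a hedged illustration (not part of the document under review; the template parameter order `<Group, T, Use, Rows, Cols, Layout>` and the m16n16k16 half/float shape are assumptions based on the unified `sycl_ext_oneapi_matrix` extension), one Tm/Tc combination might map onto declarations inside a kernel like this:

```cpp
// Inside a kernel body, where sg is the sycl::sub_group of the work-item.
using namespace sycl::ext::oneapi::experimental::matrix;

// Tm = half: "multiplicand" matrices, which require use::a or use::b.
joint_matrix<sycl::sub_group, sycl::half, use::a, 16, 16, layout::row_major> tA;
joint_matrix<sycl::sub_group, sycl::half, use::b, 16, 16, layout::row_major> tB;

// Tc = float: the "accumulator" matrix, which requires use::accumulator.
joint_matrix<sycl::sub_group, float, use::accumulator, 16, 16> tC;
joint_matrix_fill(sg, tC, 0.0f); // zero-initialize the accumulator
```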
Contributor

Should these be added to the query interface?

Contributor Author

Yes, they can be, although we will not do this immediately, so I documented them here. Also, there is some additional useful information here that cannot be represented in the query interface as it currently stands: minimum Compute Capability, etc.

Contributor

It would be good to know the limitations of the current query interface so we can make a final version of it, in order to move the matrix API (unified, including the query interface) out of experimental.

Contributor Author

@JackAKirk JackAKirk Nov 1, 2022

Sure, we can look into that. For the moment our priority is getting the main functionality correct and portable, so that our libraries can be consistent with the interfaces used by the DPC++ compiler. That way things are more likely to work out of the box for people, and we minimize any changes that we will have to make in the future to the main minimum viable product that users will need for accelerated GEMM.

Contributor Author

@JackAKirk JackAKirk Nov 1, 2022

FYI, we also plan to start work on the MVP for HIP AMD soon. Although it is expected to be very similar to the CUDA backend, it has some differences and will need a first implementation to iron out any unforeseen issues we may come across, which could also influence experimental supplementary features such as the query interface, in the same way as you suggest for CUDA.

Contributor

Sounds good. Please add a TODO list that includes adapting the query interface to the Tensor Cores TPU, which is currently missing.

Contributor Author

OK, I was just looking over the query interface. It looks like it is quite important if you want to provide a large matrix and you aren't sure how to break it up into submatrices, so you introduce these:

constexpr int msize = break_dimension(params, M);
constexpr int msize_remainder = break_dimension_remainder(params, M);

Is this more important for AMX, or for both AMX and DPAS? It sounds very interesting.
I'm just guessing, but I thought it might be more important for AMX because there are more options for sizes, since there is a continuous range.
Library teams doing similar things might be interested in using this to pick good parameters.

BTW, do you think you should also add a table similar to the one I added here, but for DPAS/AMX, in sycl_ext_oneapi_matrix_intel? Even with the query interface, perhaps the information should also be written down somewhere for users?

Contributor

The query interface was added specifically to avoid adding such tables in the documentation.
The query interface will let the user write portable code across different generations of hardware and implementations.
I usually keep this kind of table in slides for presentation purposes.

Contributor Author

@JackAKirk JackAKirk Nov 7, 2022

What if the user wants the information before writing any code at all?

This extension provides a feature-test macro as described in the core SYCL
specification section 6.3.3 "Feature test macros". Therefore, an
implementation supporting this extension must predefine the macro
`SYCL_EXT_ONEAPI_MATRIX_CUDA` to one of the values defined in the table below.
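For example (a minimal sketch, not from the extension document; it uses the macro name as spelled in this revision, which the comment below suggests may be renamed), an application could test for the extension like this:

```cpp
#include <iostream>
#include <sycl/sycl.hpp>

int main() {
#if defined(SYCL_EXT_ONEAPI_MATRIX_CUDA)
  // CUDA-specific joint_matrix code paths could be selected here.
  std::cout << "CUDA matrix extension revision: "
            << SYCL_EXT_ONEAPI_MATRIX_CUDA << '\n';
#else
  std::cout << "CUDA matrix extension not available\n";
#endif
  return 0;
}
```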
Contributor

@dkhaldi dkhaldi Nov 7, 2022

What is the convention used here for the naming: should this be called SYCL_EXT_ONEAPI_MATRIX_CUDA or SYCL_EXT_ONEAPI_CUDA_MATRIX?

I just looked at a CUDA-only feature. It is called SYCL_EXT_ONEAPI_CUDA_ASYNC_BARRIER.

Contributor Author

Good point, I'll make this change.

Contributor Author

@JackAKirk JackAKirk Nov 9, 2022

BTW, I will also update #5363 to reflect this extension. However, I think I will wait until #7077 is merged before adding the final BMAD implementation in #5363 on top of it.

@JackAKirk
Contributor Author

JackAKirk commented Dec 5, 2022

I've removed the get_wi_marray function since I don't think we plan to support it anymore. I've also removed joint_matrix_bmad because, if we still want to include it, there are quite a few issues to resolve first with regard to how we connect it to the main matrix extension. For the time being I think we will simply shelve it.

Note that the document added in this PR now only documents additional constraints on the core features of the main matrix extension when using the ext_oneapi_cuda backend: this covers not only the supported joint_matrix types, but also some constraints on the parameters that may be passed to joint_matrix_load and joint_matrix_store. Perhaps this should not really be presented as an extension document per se, but rather as a backend implementation reference for the main matrix extension: there could be such a document for each backend, listing the supported parameters for the joint_matrix API in that backend?

Then, if there are any backend-specific APIs (such as joint_matrix_bmad), they could be added to a separate extension document.

@zjin-lcf
Contributor

zjin-lcf commented Dec 5, 2022

Are there some SYCL examples showing the usage of the sycl_ext_oneapi_matrix_cuda extension?

@JackAKirk
Contributor Author

JackAKirk commented Dec 5, 2022

Are there some SYCL examples showing the usage of the sycl_ext_oneapi_matrix_cuda extension?

See my message above: this PR now just provides additional details for users of sycl_ext_oneapi_matrix on the ext_oneapi_cuda backend. No features outside sycl_ext_oneapi_matrix are mentioned, so I plan to remove the references to sycl_ext_oneapi_matrix_cuda. The currently published examples are available here: https://github.com/intel/llvm-test-suite/pull/1334/files#diff-baabe8a5f5eb1de4f775c5370e698bdddbddbf321bf9ab3ae43f76b5e634185f

Ideally, for the 2023.1 release we will provide some sample code for a larger-scale GEMM problem employing some optimizations: using shared memory, and padding to minimize shared memory bank conflicts.

Here is sample code for the proposed new feature that originally constituted sycl_ext_oneapi_matrix_cuda, joint_matrix_bmad: intel/llvm-test-suite#760. However, this feature will not be added at the moment.

@gmlueck
Contributor

gmlueck commented Dec 6, 2022

It looks to me like there are three categories of limitations documented in this PR now:

  1. You can't use the "packed" layout.
  2. A list of restrictions on the component type and the M, N, and K sizes (per compute capability).
  3. A limitation on the stride parameter to joint_matrix_load and joint_matrix_store.

For (1), I think @dkhaldi said she would remove "packed" from the portable extension, so this will no longer need to be documented as a CUDA limitation.

All of the (2) restrictions can be covered by the "query" API in the portable extension. I think this is on our list of things to discuss further.

Can the (3) limitation be written into the portable API? I think our goal for the portable extension is that it is portable to all devices, including Nvidia.

### Additional constraints in the `ext_oneapi_cuda` backend

IMPORTANT: The `stride` argument to `joint_matrix_load` and `joint_matrix_store` must be a multiple of 8 when `T` is `half`, and a multiple of 4 when `T` is `float`; where `T` is the type of the `joint_matrix` elements.
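As a purely illustrative sketch (not taken from the document; the m16n16k16 half/float shape, the use of USM with `address_space_cast`, and the value-returning `joint_matrix_mad` form follow my reading of the `sycl_ext_oneapi_matrix` revision current at the time and may differ from the final API), the row-major leading dimensions below double as the `stride` arguments and are chosen to satisfy the constraint above: `K` and `N` are multiples of 8 for the `half` matrices, and `N` is a multiple of 4 for the `float` accumulator.

```cpp
#include <sycl/sycl.hpp>

using namespace sycl;
using namespace sycl::ext::oneapi::experimental::matrix;

// One hypothetical shape from the m16n16k16 half/float combination.
constexpr size_t M = 16, N = 16, K = 16;

int main() {
  queue q;
  half  *A = malloc_shared<half>(M * K, q);
  half  *B = malloc_shared<half>(K * N, q);
  float *C = malloc_shared<float>(M * N, q);
  for (size_t i = 0; i < M * K; ++i) A[i] = 1.0f;
  for (size_t i = 0; i < K * N; ++i) B[i] = 1.0f;

  q.parallel_for(nd_range<2>({1, 32}, {1, 32}), [=](nd_item<2> it) {
     sub_group sg = it.get_sub_group();

     joint_matrix<sub_group, half, use::a, M, K, layout::row_major> tA;
     joint_matrix<sub_group, half, use::b, K, N, layout::row_major> tB;
     joint_matrix<sub_group, float, use::accumulator, M, N> tC;
     joint_matrix_fill(sg, tC, 0.0f);

     auto pA = address_space_cast<access::address_space::global_space,
                                  access::decorated::no>(A);
     auto pB = address_space_cast<access::address_space::global_space,
                                  access::decorated::no>(B);
     auto pC = address_space_cast<access::address_space::global_space,
                                  access::decorated::no>(C);

     // Strides: K = 16 and N = 16 are multiples of 8, as required for half;
     // N = 16 is a multiple of 4, as required for the float accumulator.
     joint_matrix_load(sg, tA, pA, K);
     joint_matrix_load(sg, tB, pB, N);
     tC = joint_matrix_mad(sg, tA, tB, tC);
     joint_matrix_store(sg, tC, pC, N, layout::row_major);
   }).wait();

  sycl::free(A, q);
  sycl::free(B, q);
  sycl::free(C, q);
  return 0;
}
```

A leading dimension such as 20 for the `half` matrices would break the multiple-of-8 requirement on the CUDA backend.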

Contributor

Is this a functional or performance requirement?
If functional, can there be a workaround to support other strides (like some sort of padding at the load level)?

Contributor Author

@JackAKirk JackAKirk Dec 7, 2022

This is functional. The PTX builtin requires this constraint. A workaround isn't possible.

@JackAKirk
Contributor Author

JackAKirk commented Dec 7, 2022

Can the (3) limitation be written into the portable API? I think our goal for the portable extension is that it is portable to all devices, including Nvidia.

Do you mean to write the (3) limitation into the query interface?

The point is that the user won't be able to use a stride argument if it doesn't satisfy the constraint when targeting the CUDA backend. I don't see that there is anything more we can do about this beyond documenting it properly.
This doesn't affect the portable interface; it just reduces the set of acceptable stride parameter values that are supported on all backends. There already isn't a single complete parameter set (meaning the complete set of parameters required by a single functional invocation of joint_matrix_load/store/mad) supported by both XMX and Tensor Cores.

I'll remove the reference to packed.

@gmlueck
Contributor

gmlueck commented Dec 7, 2022

Do you mean to write the (3) limitation into the query interface?

This is not what I had in mind originally, but I agree that it is an option.

I do fear that the matrix API will be very hard to use in a portable manner because so many of the parameters have device-specific limitations.

@JackAKirk
Contributor Author

JackAKirk commented Dec 7, 2022

Do you mean to write the (3) limitation into the query interface?

This is not what I had in mind originally, but I agree that it is an option.

I do fear that the matrix API will be very hard to use in a portable manner because so many of the parameters have device-specific limitations.

Yes, and this is just considering three backends (AMX, XMX, Tensor Cores)! AMD cases will be added next, and there is also little overlap with Tensor Cores in terms of the parameter space (but the API will be a perfect fit). I think we just have to do the best we can; already, a single API combined with good docs/queries is better than several different APIs, IMO.

My understanding is that making different backends fit for GEMM kernels is of greater importance in the realm of libraries and frameworks using well-established algorithms. Also, since we do not know the specific algorithm any given user is targeting, and since, if they aren't relying on libraries, the algorithm is likely to be novel anyway, I imagine that generally we just have to leave porting up to the user in any case. I don't really see how we can do better than choosing an API that fully exposes the functionality of the different vendors' matrix multiplication hardware, and then properly documenting it so that users can make correct decisions.

Signed-off-by: JackAKirk <[email protected]>
@JackAKirk
Contributor Author

Closed because #9019 puts this information into the correct place.

@JackAKirk JackAKirk closed this Apr 11, 2023