[CUDA][Matrix][Doc] Introduced sycl_ext_oneapi_matrix_cuda extension. #6968
Conversation
Could you point me to an equivalent CUDA example? Thanks.
For the marray functionality I don't think there is a direct analogue in the CUDA Runtime API. For BMAD, see the section on sub-byte operations here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma-subbyte. For an example of how binary MADs can be leveraged on current Nvidia® hardware, see (A. Li and S. Su, IEEE Transactions on Parallel and Distributed Systems, 32(7):1878-1891, 2021).

The XOR operator is deprecated on Nvidia hardware and will be unsupported in sm_90; new operators (e.g. XNOR, as a guess) may be added in the future as research finds them to be more useful - see below.

More motivation for BMAD (there are some other applications beyond deep learning, but I believe this was the initial motivation): single-bit MADs can be used as part of Binarized Neural Networks (BNNs) in the case that both the activations and weights are binarized. "Quantizing" a network to form a BNN represents the extreme limit of reducing the precision of the network's degrees of freedom in order to gain performance and improve efficiency.
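To make the BNN connection concrete, here is a minimal plain-C++ sketch of the arithmetic that a single-bit (XOR-based) MAD performs; this is generic BNN math, not the proposed extension API:

```cpp
#include <bitset>
#include <cstdint>

// Illustration of the BNN arithmetic behind a single-bit MAD (not the
// proposed SYCL API). Values in {-1, +1} are encoded as single bits
// (0 -> -1, 1 -> +1); the dot product of two 32-element binarized
// vectors then reduces to an XOR followed by a population count:
//   dot(x, w) = 32 - 2 * popcount(x XOR w)
int binarized_dot(std::uint32_t x, std::uint32_t w) {
  int n_differing = static_cast<int>(std::bitset<32>(x ^ w).count());
  return 32 - 2 * n_differing;
}
```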
I've updated this document following the merge of #6662. Do you want me to change anything in the document? Do you want me to move this file to extensions/experimental/sycl_ext_oneapi_matrix_cuda/? Thanks
I think this file and a new file for the Intel-specific extension should be added to https://github.com/intel/llvm/tree/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix so this same location will contain the matrix-specific APIs: the unified one + CUDA-specific + Intel-specific.
#### Valid `joint_matrix` types and shapes
The complete set of matrix data types and shapes that are supported by the CUDA backend is represented in the following table. Tm indicates the matrix element data type held by a "multiplicand" `joint_matrix`, i.e. one requiring `use::a` or `use::b`. Tc indicates the matrix element data type held by an "accumulator" `joint_matrix`, i.e. one requiring `use::accumulator`.
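For concreteness, a declaration of one such tile set might look like this (a sketch, assuming the `joint_matrix` template from the unified extension #6662, with the 16x16x16 half/float combination taken as one CUDA-supported shape):

```cpp
#include <sycl/sycl.hpp>

namespace matx = sycl::ext::oneapi::experimental::matrix;

// Sketch only: declares one tile set following the unified joint_matrix
// extension (#6662). The 16x16x16 half/float combination is assumed here
// as one CUDA-supported shape (Tm = half, Tc = float).
void declare_fragments(sycl::sub_group sg) {
  // Multiplicands: Tm = half, with use::a / use::b.
  matx::joint_matrix<sycl::sub_group, sycl::half, matx::use::a, 16, 16,
                     matx::layout::row_major> tA;
  matx::joint_matrix<sycl::sub_group, sycl::half, matx::use::b, 16, 16,
                     matx::layout::row_major> tB;
  // Accumulator: Tc = float, with use::accumulator (layout is dynamic).
  matx::joint_matrix<sycl::sub_group, float, matx::use::accumulator, 16, 16>
      tC;
  (void)sg; (void)tA; (void)tB; (void)tC;
}
```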
Should these be added to the query interface?
Yes, they can be, although we will not do this immediately, so I documented them here. Also, there is some additional useful information here that cannot be represented in the query interface as it currently stands: minimum Compute Capability, etc.
It will be good to know the limitations of the current query interface so we can make a final version of it, in order to move the matrix API (the unified one, including the query interface) out of experimental.
Sure, we can look into that. For the moment, our priority is getting the main functionality correct and portable, so that our libraries can be consistent with the interfaces used by the DPC++ compiler. That way things are more likely to work out of the box for people, and we minimize any future changes to the main minimum viable product that users will need for accelerated GEMM.
FYI, we also plan to start work on the MVP for HIP AMD soon. Although it is expected to be very similar to the CUDA backend, it has some differences, and a first implementation will be needed to iron out any unforeseen issues we may come across. That could also influence experimental supplementary features such as the query interface, in the same way as you suggest for CUDA.
Sounds good. Please add a TODO list item about adapting the query interface to the Tensor Cores TPU, which is currently missing.
OK, I was just looking over the query interface. It looks like it is quite important if you want to provide a large matrix and you aren't sure how to break it up into submatrices, so you introduce these:
```cpp
constexpr int msize = break_dimension(params, M);
constexpr int msize_remainder = break_dimension_remainder(params, M);
```
Is this more important for AMX, or for both AMX and DPAS? Sounds very interesting.
I'm just guessing, but I thought it might be more important for AMX because there are more options for sizes, since there is a continuous range.
Library teams doing similar things might be interested in this to pick good parameters.
BTW, do you think you should also add a table similar to the one I added here, but for DPAS/AMX, in sycl_ext_oneapi_matrix_intel? Even with the query interface, perhaps the information should also be written down somewhere for users?
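Not knowing the actual implementation, here is a hypothetical sketch of the tiling arithmetic such a query might perform; `max_msize` is a made-up stand-in for the largest row dimension the device supports, and these stand-in functions are illustrative, not the real query interface:

```cpp
#include <cstdio>

// Hypothetical stand-ins illustrating what break_dimension /
// break_dimension_remainder might compute: split a large dimension M
// into tiles of the largest supported size plus a remainder tile.
// max_msize is invented for this sketch; the real query interface
// would derive it from the device parameters.
constexpr int max_msize = 16;
constexpr int break_dimension(int M) { return M >= max_msize ? max_msize : M; }
constexpr int break_dimension_remainder(int M) { return M % max_msize; }

int main() {
  constexpr int M = 100;
  constexpr int msize = break_dimension(M);                     // 16
  constexpr int msize_remainder = break_dimension_remainder(M); // 4
  std::printf("%d full tiles of %d rows plus a %d-row remainder\n",
              M / msize, msize, msize_remainder);
}
```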
The query interface was added specifically to avoid adding such tables in the documentation.
The query interface will let the user write portable code across different generations of hardware and implementations.
I usually keep this kind of table in slides for presentation purposes.
What if the user wants the information before writing any code at all?
This extension provides a feature-test macro as described in the core SYCL
specification section 6.3.3 "Feature test macros". Therefore, an
implementation supporting this extension must predefine the macro
`SYCL_EXT_ONEAPI_MATRIX_CUDA` to one of the values defined in the table below.
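For illustration, user code would typically gate on this macro in the usual way (a minimal sketch; the macro name follows the quoted text, though the naming discussion below suggests SYCL_EXT_ONEAPI_CUDA_MATRIX instead):

```cpp
// Minimal sketch of the standard feature-test-macro pattern. The macro
// name is taken from the quoted text above; the discussion below
// proposes renaming it to SYCL_EXT_ONEAPI_CUDA_MATRIX.
#if defined(SYCL_EXT_ONEAPI_MATRIX_CUDA)
constexpr bool have_cuda_matrix_ext = true;  // CUDA-specific paths available
#else
constexpr bool have_cuda_matrix_ext = false; // portable fallback only
#endif
```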
What is the naming convention used here: should this be called SYCL_EXT_ONEAPI_MATRIX_CUDA or SYCL_EXT_ONEAPI_CUDA_MATRIX?
I just looked at a CUDA-only feature; it is called SYCL_EXT_ONEAPI_CUDA_ASYNC_BARRIER.
Good point, I'll make this change.
I've removed the Note. Now the document added in this PR only documents additional constraints on the core features of the main matrix extension when using the `ext_oneapi_cuda` backend. Then, if there are any backend-specific APIs (such as BMAD), they can be documented here as well.
Are there some SYCL examples of the usage of the sycl_ext_oneapi_matrix_cuda extension?
See my above message: this PR is now just additional details for users of the `ext_oneapi_cuda` backend. Ideally, for the 2023.1 release we will provide some sample codes for a larger-scale GEMM problem employing some optimizations: using shared memory and padding to minimize shared-memory bank conflicts. Here is a sample code for the proposed new feature (BMAD) that originally constituted this extension.
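As a sketch of the padding idea mentioned here: over-allocate the minor dimension of a local-memory tile so that consecutive rows start in different banks. The tile sizes and the pad of 8 half elements below are assumed values for illustration, not taken from the referenced sample code:

```cpp
#include <sycl/sycl.hpp>

// Hypothetical tile sizes; the pad of 8 half elements is an assumed
// value chosen so that consecutive rows map to different banks.
constexpr int TM = 16, TK = 16, pad = 8;

void declare_padded_tile(sycl::handler &cgh) {
  // Local (shared) memory tile with a padded leading dimension.
  sycl::local_accessor<sycl::half, 2> tileA({TM, TK + pad}, cgh);
  (void)tileA;
}
```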
It looks to me like there are three categories of limitations documented in this PR now:

1. The "packed" layout is not supported.
2. Restrictions on the combinations of matrix types and shapes that are supported.
3. The restriction that the `stride` argument to `joint_matrix_load` / `joint_matrix_store` must be a multiple of 8 or 4.

For (1), I think @dkhaldi said she would remove "packed" from the portable extension, so this will no longer need to be documented as a CUDA limitation. All of the (2) restrictions can be covered by the "query" API in the portable extension; I think this is on our list of things to discuss further. Can the (3) limitation be written into the portable API? I think our goal for the portable extension is that it is portable to all devices, including Nvidia.
### Additional constraints in the `ext_oneapi_cuda` backend
IMPORTANT: The `stride` argument to `joint_matrix_load` and `joint_matrix_store` must be a multiple of 8 when `T` is `half`, and a multiple of 4 when `T` is `float`, where `T` is the type of the `joint_matrix` elements.
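For illustration, a load satisfying this constraint for `T = half` might look like the following sketch, assuming the unified `joint_matrix_load` signature from #6662 and an assumed 16x16 row-major tile:

```cpp
#include <cstddef>
#include <sycl/sycl.hpp>

namespace matx = sycl::ext::oneapi::experimental::matrix;

// Sketch of a load that satisfies the constraint above for T = half:
// the stride (leading dimension) of 24 is a multiple of 8. The 16x16
// shape and row-major layout are assumed for illustration.
void load_tile(sycl::sub_group sg, sycl::half *src /* device pointer */) {
  constexpr std::size_t stride = 24; // multiple of 8, as required for half
  matx::joint_matrix<sycl::sub_group, sycl::half, matx::use::a, 16, 16,
                     matx::layout::row_major> tA;
  auto ptr = sycl::address_space_cast<
      sycl::access::address_space::global_space,
      sycl::access::decorated::no>(src);
  matx::joint_matrix_load(sg, tA, ptr, stride);
}
```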
Is this a functional or performance requirement?
If functional, can there be a workaround to support other strides (like some sort of padding at the load level)?
This is functional. The PTX builtin requires this constraint. A workaround isn't possible.
Do you mean to write the (3) limitation into the query interface? The point is that the user won't be able to use a stride argument that doesn't satisfy the constraint when targeting the CUDA backend. I don't see that there is anything more we can do about this beyond documenting it properly. I'll remove the reference to packed.
This is not what I had in mind originally, but I agree that it is an option. I do fear that the matrix API will be very hard to use in a portable manner because so many of the parameters have device-specific limitations.
Yes, and this is just considering three backends (AMX, XMX, Tensor Cores)! AMD cases will be added next, and there is also little overlap with Tensor Cores in terms of the parameter space (but the API will be a perfect fit). I think we just have to do the best we can; a single API combined with good docs/queries is already better than several different APIs, IMO.

My understanding is that making different backends fit for GEMM kernels matters most in the realm of libraries and frameworks using well-established algorithms. If we do not know the specific algorithm any given user is targeting, and if they aren't relying on libraries, the algorithm is likely to be novel anyway, so I imagine we generally have to leave porting up to the user in any case. I don't really see how we can do better than choosing an API that fully exposes the functionality of different vendors' matrix-multiplication hardware, and then properly documenting it so that users can make correct decisions.
Closed because #9019 puts this information into the correct place.
This document details the CUDA-only features of the matrix extension for DPC++.
This extension is built on top of the backend-agnostic matrix extension that is being updated here: #6662.