[CUDA][Matrix][Doc] Introduced sycl_ext_oneapi_matrix_cuda extension. #6968

Closed
wants to merge 16 commits into from
@@ -0,0 +1,95 @@
# `sycl_ext_oneapi_matrix` extension constraints specific to the `ext_oneapi_cuda` backend.
:source-highlighter: coderay
:coderay-linenums-mode: table
:dpcpp: pass:[DPC++]

// This section needs to be after the document title.
:doctype: book
:toc2:
:toc: left
:encoding: utf-8
:lang: en

:blank: pass:[ +]

// Set the default source code type in this document to C++,
// for syntax highlighting purposes. This is needed because
// docbook uses c++ and html5 uses cpp.
:language: {basebackend@docbook:c++:cpp}


== Notice

Copyright (c) 2022 Intel Corporation. All rights reserved.

NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are
trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc.
used by permission by Khronos.

This extension is written against the SYCL 2020 revision 6 specification. All
references below to the "core SYCL specification" or to section numbers in the
SYCL specification refer to that revision.


**_NOTE:_** This document describes the current design and API of the matrix extension features in {dpcpp}
that are specific to the `ext_oneapi_cuda` backend. This is an initial experimental version to try out functionality
and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**.

## Introduction
The `ext_oneapi_cuda` backend supports `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, `joint_matrix_mad` and `joint_matrix_fill` as they are defined in the `sycl_ext_oneapi_matrix` extension. The complete set of `joint_matrix` types and shapes that is valid in the `ext_oneapi_cuda` backend is listed in this document.
This extension also describes the constraints that apply specifically when using the `ext_oneapi_cuda` backend and that may not apply to the `sycl_ext_oneapi_matrix` extension in general.

### Valid `joint_matrix` types and shapes

The complete set of matrix data types and shapes supported by the `ext_oneapi_cuda` backend is given in the following table. Tm is the matrix element data type held by a "multiplicand" `joint_matrix`, i.e. one declared with `use::a` or `use::b`. Tc is the matrix element data type held by an "accumulator" `joint_matrix`, i.e. one declared with `use::accumulator`.
--
[.center]
|======================
|Tm (`use::a` or `use::b`) |Tc (`use::accumulator`) |M |N |K | Minimum Compute Capability
.3+|half .3+|float
|16 |16 |16| sm_70
|8 |32 |16| sm_70
|32 |8 |16| sm_70
.3+|half .3+|half
|16 |16 |16| sm_70
|8 |32 |16| sm_70
|32 |8 |16| sm_70
.3+|int8_t .3+|int32_t
|16 |16 |16| sm_72
|8 |32 |16| sm_72
|32 |8 |16| sm_72
.3+|uint8_t .3+|int32_t
|16 |16 |16| sm_72
|8 |32 |16| sm_72
|32 |8 |16| sm_72
|precision::tf32 |float |16 |16 |8| sm_80
.3+|bfloat16 .3+|float
|16 |16 |16 |sm_80
|8 |32 |16 |sm_80
|32 |8 |16 |sm_80
|double |double |8 |8 |4 |sm_80
|======================
--
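
As an illustration only, the table above can be encoded as a small compile-time lookup. This is a sketch in plain C++ and is not part of the extension: `Elem`, `Combo`, `kSupported` and `is_supported` are hypothetical names, and compute capabilities are written as integers (70 for sm_70) purely for convenience.

```cpp
// Hypothetical names throughout; this only mirrors the table above.
enum class Elem { half, f32, f64, i8, u8, i32, tf32, bf16 };

struct Combo {
  Elem tm;     // element type for use::a / use::b (Tm)
  Elem tc;     // element type for use::accumulator (Tc)
  int m, n, k; // shape triple
  int min_sm;  // minimum compute capability (70 means sm_70)
};

// One entry per row of the table.
constexpr Combo kSupported[] = {
    {Elem::half, Elem::f32, 16, 16, 16, 70},
    {Elem::half, Elem::f32, 8, 32, 16, 70},
    {Elem::half, Elem::f32, 32, 8, 16, 70},
    {Elem::half, Elem::half, 16, 16, 16, 70},
    {Elem::half, Elem::half, 8, 32, 16, 70},
    {Elem::half, Elem::half, 32, 8, 16, 70},
    {Elem::i8, Elem::i32, 16, 16, 16, 72},
    {Elem::i8, Elem::i32, 8, 32, 16, 72},
    {Elem::i8, Elem::i32, 32, 8, 16, 72},
    {Elem::u8, Elem::i32, 16, 16, 16, 72},
    {Elem::u8, Elem::i32, 8, 32, 16, 72},
    {Elem::u8, Elem::i32, 32, 8, 16, 72},
    {Elem::tf32, Elem::f32, 16, 16, 8, 80},
    {Elem::bf16, Elem::f32, 16, 16, 16, 80},
    {Elem::bf16, Elem::f32, 8, 32, 16, 80},
    {Elem::bf16, Elem::f32, 32, 8, 16, 80},
    {Elem::f64, Elem::f64, 8, 8, 4, 80},
};

// True if the (Tm, Tc, M, N, K) combination is in the table and the
// device compute capability meets the listed minimum.
constexpr bool is_supported(Elem tm, Elem tc, int m, int n, int k,
                            int device_sm) {
  for (const Combo &c : kSupported) {
    if (c.tm == tm && c.tc == tc && c.m == m && c.n == n && c.k == k)
      return device_sm >= c.min_sm;
  }
  return false;
}

// Usage: checks evaluated at compile time.
static_assert(is_supported(Elem::half, Elem::f32, 16, 16, 16, 70),
              "half/float 16x16x16 is listed for sm_70");
static_assert(!is_supported(Elem::i8, Elem::i32, 16, 16, 16, 70),
              "int8_t multiplicands need at least sm_72");
static_assert(!is_supported(Elem::tf32, Elem::f32, 16, 16, 16, 80),
              "tf32 is only listed with K = 8");
```

A lookup of this kind could be used in host code to reject unsupported combinations early; how the backend itself diagnoses them is not specified here.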

Each M, N, K triple in the table above determines the shape of the `joint_matrix` that can be constructed for each `use`:
--
[.center]
|======================
|use |NumRows | NumCols
|a |M |K
|b |K |N
|accumulator | M| N
|======================
--
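
The mapping from a triple to per-`use` dimensions can be sketched as follows. This is plain C++ for illustration: `Use`, `Shape` and `shape_for` are hypothetical names and are not the extension's own `use` enumeration or API.

```cpp
// Hypothetical stand-in for the extension's use enumeration.
enum class Use { a, b, accumulator };

struct Shape {
  int rows; // NumRows
  int cols; // NumCols
};

// Maps an (M, N, K) triple to the matrix dimensions for a given use,
// following the table above.
constexpr Shape shape_for(Use use, int m, int n, int k) {
  switch (use) {
  case Use::a:
    return {m, k};
  case Use::b:
    return {k, n};
  case Use::accumulator:
    return {m, n};
  }
  return {0, 0}; // not reached for valid enumerators
}

// Usage: the (8, 32, 16) triple gives an 8x16 "a", a 16x32 "b"
// and an 8x32 accumulator.
static_assert(shape_for(Use::a, 8, 32, 16).rows == 8 &&
                  shape_for(Use::a, 8, 32, 16).cols == 16,
              "a is M x K");
static_assert(shape_for(Use::b, 8, 32, 16).rows == 16 &&
                  shape_for(Use::b, 8, 32, 16).cols == 32,
              "b is K x N");
static_assert(shape_for(Use::accumulator, 8, 32, 16).rows == 8 &&
                  shape_for(Use::accumulator, 8, 32, 16).cols == 32,
              "accumulator is M x N");
```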

### Additional constraints in the `ext_oneapi_cuda` backend

IMPORTANT: The `stride` argument to `joint_matrix_load` and `joint_matrix_store` must be a multiple of 8 when `T` is `half`, and a multiple of 4 when `T` is `float`; where `T` is the type of the `joint_matrix` elements.
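
A caller could check this constraint before issuing a load or store. The sketch below is plain C++ for illustration only: `half_placeholder` is a hypothetical stand-in for the half-precision element type, and `stride_multiple` and `stride_is_valid` are hypothetical helpers. Only the two element types named above are covered, because no stride requirement is stated here for other types.

```cpp
#include <cstddef>

// Hypothetical stand-in for the half-precision element type.
struct half_placeholder {};

// Required stride multiple per element type. Deliberately left undefined
// for types whose requirement is not stated in this document.
template <typename T> struct stride_multiple;
template <> struct stride_multiple<half_placeholder> {
  static constexpr std::size_t value = 8;
};
template <> struct stride_multiple<float> {
  static constexpr std::size_t value = 4;
};

// True if `stride` satisfies the stated multiple for element type T.
template <typename T> constexpr bool stride_is_valid(std::size_t stride) {
  return stride % stride_multiple<T>::value == 0;
}

// Usage: a stride of 12 is acceptable for float but not for half.
static_assert(stride_is_valid<float>(12), "12 is a multiple of 4");
static_assert(!stride_is_valid<half_placeholder>(12),
              "12 is not a multiple of 8");
static_assert(stride_is_valid<half_placeholder>(16), "16 is a multiple of 8");
```

Instantiating `stride_is_valid` with any other type fails to compile, which keeps the sketch from implying a rule the text does not give.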

Contributor

Is this a functional or performance requirement?
If functional, can there be a workaround to support other strides (like some sort of padding at the load level)?

Contributor Author

@JackAKirk JackAKirk Dec 7, 2022

This is functional. The PTX builtin requires this constraint. A workaround isn't possible.

## Revision History

[frame="none",options="header"]
|======================
|Rev |Date |Author |Changes
|1 |2022-10-05 |Jack Kirk |Initial public working draft.
|======================