Skip to content

Add intrinsics for FEAT_SME_MOP4 #381

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Mar 19, 2025
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
310 changes: 307 additions & 3 deletions main/acle.md
Original file line number Diff line number Diff line change
Expand Up @@ -445,12 +445,14 @@ Armv8.4-A [[ARMARMv84]](#ARMARMv84). Support is added for the Dot Product intrin
* Added `svdot[_n_f16_mf8]_fpm` and `svdot[_n_f32_mf8]_fpm`.
* Added Guarded Control Stack (GCS) at
[**Beta**](#current-status-and-anticipated-changes) quality level.
* Added [**Alpha**](#current-status-and-anticipated-changes) support
for structured sparsity outer product intrinsics

#### Changes for next release

* Changed the Function Multi Versioning default version rules to be more explicit.
* Added [**Alpha**](#current-status-and-anticipated-changes) support
for structured sparsity outer product intrinsics
* Added [**Alpha**](#current-status-and-anticipated-changes) support
for quarter-tile outer product intrinsics.

### References

Expand Down Expand Up @@ -2388,6 +2390,17 @@ support for the SME structured sparsity outer product (FEAT_SME_TMOP)
instructions and if their associated intrinsics are
available. This implies that `__ARM_FEATURE_SME2` is nonzero.

#### Quarter-tile outer product intrinsics

The specification for SME is in
[**Alpha** state](#current-status-and-anticipated-changes) and may change or be
extended in the future.

`__ARM_FEATURE_SME_MOP4` is defined to `1` if there is hardware
support for the SME quarter-tile outer product (FEAT_SME_MOP4)
instructions and if their associated intrinsics are
available. This implies that `__ARM_FEATURE_SME2` is nonzero.

## Floating-point model

These macros test the floating-point model implemented by the compiler
Expand Down Expand Up @@ -2584,7 +2597,8 @@ be found in [[BA]](#BA).
| [`__ARM_FEATURE_SME_F8F16`](#modal-8-bit-floating-point-extensions) | Modal 8-bit floating-point extensions | 1 |
| [`__ARM_FEATURE_SME_F8F32`](#modal-8-bit-floating-point-extensions) | Modal 8-bit floating-point extensions | 1 |
| [`__ARM_FEATURE_SME_I16I64`](#16-bit-to-64-bit-integer-widening-outer-product-intrinsics) | 16-bit to 64-bit integer widening outer product intrinsics (FEAT_SME_I16I64) | 1 |
| [`__ARM_FEATURE_SME_TMOP`](#structured-sparsity-outer-product-intrinsics) | Structured sparsity outer product intrinsics (FEAT_SME_TMOP) | 1 |
| [`__ARM_FEATURE_SME_TMOP`](#structured-sparsity-outer-product-intrinsics) | Structured sparsity outer product intrinsics (FEAT_SME_TMOP) | 1 |
| [`__ARM_FEATURE_SME_MOP4`](#quarter-tile-outer-product-intrinsics) | quarter-tile outer product intrinsics (FEAT_SME_MOP4) | 1 |
| [`__ARM_FEATURE_SME_LOCALLY_STREAMING`](#scalable-matrix-extension-sme) | Support for the `arm_locally_streaming` attribute | 1 |
| [`__ARM_FEATURE_SME_LUTv2`](#lookup-table-extensions) | Lookup table extensions (FEAT_SME_LUTv2) | 1 |
| [`__ARM_FEATURE_SSVE_FP8DOT2`](#modal-8-bit-floating-point-extensions) | Modal 8-bit floating-point extensions | 1 |
Expand Down Expand Up @@ -13780,6 +13794,296 @@ Multi-vector 8-bit floating-point multiply-add long.
__arm_streaming __arm_inout("za");
```

### SME2 mop4 intrinsics

The intrinsics in this section are defined by the header file
[`<arm_sme.h>`](#arm_sme.h) when `__ARM_FEATURE_SME2` and
`__ARM_FEATURE_SME_MOP4` are defined. Individual intrinsics may have
additional target feature requirements.

These intrinsics have an additional '_{1,2}x{1,2}' suffix in
non-overloaded names to indicate which vector argument is a vector register pair.

#### FMOP4A (non-FP8), BFMOP4A, SMOP4A, UMOP4A

``` c
// Variants are also available for:
// _za16[_f16_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
// _za16[_bf16_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
// _za32[_f16_f16]
// _za32[_bf16_bf16]
// _za32[_s16_s16]
// _za32[_u16_u16]
// _za32[_s8_s8]
// _za32[_u8_u8]
// _za64[_f64_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
// _za64[_s16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
// _za64[_u16_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4a[_1x1]_za32[_f32_f32](uint64_t tile, svfloat32_t zn,
svfloat32_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za16[_f16_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
// _za16[_bf16_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
// _za32[_f16_f16]
// _za32[_bf16_bf16]
// _za32[_s16_s16]
// _za32[_u16_u16]
// _za32[_s8_s8]
// _za32[_u8_u8]
// _za64[_f64_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
// _za64[_s16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
// _za64[_u16_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4a[_2x2]_za32[_f32_f32](uint64_t tile, svfloat32x2_t zn,
svfloat32x2_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za16[_f16_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
// _za16[_bf16_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
// _za32[_f16_f16]
// _za32[_bf16_bf16]
// _za32[_s16_s16]
// _za32[_u16_u16]
// _za32[_s8_s8]
// _za32[_u8_u8]
// _za64[_f64_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
// _za64[_s16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
// _za64[_u16_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4a[_2x1]_za32[_f32_f32](uint64_t tile, svfloat32x2_t zn,
svfloat32_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za16[_f16_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
// _za16[_bf16_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
// _za32[_f16_f16]
// _za32[_bf16_bf16]
// _za32[_s16_s16]
// _za32[_u16_u16]
// _za32[_s8_s8]
// _za32[_u8_u8]
// _za64[_f64_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
// _za64[_s16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
// _za64[_u16_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4a[_1x2]_za32[_f32_f32](uint64_t tile, svfloat32_t zn,
svfloat32x2_t zm)
__arm_streaming __arm_inout("za");
```

#### FMOP4S (non-FP8), BFMOP4S, SMOP4S, UMOP4S

``` c
// Variants are also available for:
// _za16[_f16_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
// _za16[_bf16_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
// _za32[_f16_f16]
// _za32[_bf16_bf16]
// _za32[_s16_s16]
// _za32[_u16_u16]
// _za32[_s8_s8]
// _za32[_u8_u8]
// _za64[_f64_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
// _za64[_s16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
// _za64[_u16_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4s[_1x1]_za32[_f32_f32](uint64_t tile, svfloat32_t zn,
svfloat32_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za16[_f16_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
// _za16[_bf16_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
// _za32[_f16_f16]
// _za32[_bf16_bf16]
// _za32[_s16_s16]
// _za32[_u16_u16]
// _za32[_s8_s8]
// _za32[_u8_u8]
// _za64[_f64_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
// _za64[_s16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
// _za64[_s16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4s[_2x2]_za32[_f32_f32](uint64_t tile, svfloat32x2_t zn,
svfloat32x2_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za16[_f16_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
// _za16[_bf16_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
// _za32[_f16_f16]
// _za32[_bf16_bf16]
// _za32[_s16_s16]
// _za32[_u16_u16]
// _za32[_s8_s8]
// _za32[_u8_u8]
// _za64[_f64_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
// _za64[_s16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
// _za64[_u16_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4s[_2x1]_za32[_f32_f32](uint64_t tile, svfloat32x2_t zn,
svfloat32_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za16[_f16_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
// _za16[_bf16_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
// _za32[_f16_f16]
// _za32[_bf16_bf16]
// _za32[_s16_s16]
// _za32[_u16_u16]
// _za32[_s8_s8]
// _za32[_u8_u8]
// _za64[_f64_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
// _za64[_s16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
// _za64[_u16_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4s[_1x2]_za32[_f32_f32](uint64_t tile, svfloat32_t zn,
svfloat32x2_t zm)
__arm_streaming __arm_inout("za");
```

#### SUMOP4A

``` c
// Variants are also available for:
// _za64[_s16_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4a[_1x1]_za32[_s8_u8](uint64_t tile, svint8_t zn,
svuint8_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za64[_s16_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4a[_2x2]_za32[_s8_u8](uint64_t tile, svint8x2_t zn,
svuint8x2_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za64[_s16_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4a[_2x1]_za32[_s8_u8](uint64_t tile, svint8x2_t zn,
svuint8__t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za64[_s16_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4a[_1x2]_za32[_s8_u8](uint64_t tile, svint8_t zn,
svuint8x2_t zm)
__arm_streaming __arm_inout("za");
```

#### SUMOP4S

``` c
// Variants are also available for:
// _za64[_s16_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4s[_1x1]_za32[_s8_u8](uint64_t tile, svint8_t zn,
svuint8_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za64[_s16_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4s[_2x2]_za32[_s8_u8](uint64_t tile, svint8x2_t zn,
svuint8x2_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za64[_s16_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4s[_2x1]_za32[_s8_u8](uint64_t tile, svint8x2_t zn,
svuint8__t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za64[_s16_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4s[_1x2]_za32[_s8_u8](uint64_t tile, svint8_t zn,
svuint8x2_t zm)
__arm_streaming __arm_inout("za");
```

#### USMOP4A

``` c
// Variants are also available for:
// _za64[_u16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4a[_1x1]_za32[_u8_s8](uint64_t tile, svuint8_t zn,
svint8_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za64[_u16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4a[_2x2]_za32[_u8_s8](uint64_t tile, svuint8x2_t zn,
svint8x2_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za64[_u16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4a[_2x1]_za32[_u8_s8](uint64_t tile, svuint8x2_t zn,
svint8__t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za64[_u16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4a[_1x2]_za32[_u8_s8](uint64_t tile, svuint8_t zn,
svint8x2_t zm)
__arm_streaming __arm_inout("za");
```

#### USMOP4S

``` c
// Variants are also available for:
// _za64[_u16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4s[_1x1]_za32[_u8_s8](uint64_t tile, svuint8_t zn,
svint8_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za64[_u16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4s[_2x2]_za32[_u8_s8](uint64_t tile, svuint8x2_t zn,
svint8x2_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za64[_u16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4s[_2x1]_za32[_u8_s8](uint64_t tile, svuint8x2_t zn,
svint8__t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za64[_u16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4s[_1x2]_za32[_u8_s8](uint64_t tile, svuint8_t zn,
svint8x2_t zm)
__arm_streaming __arm_inout("za");
```

#### FMOP4A (FP8)

``` c
// Variants are also available for:
// _za16[_mf8_mf8] (only if __ARM_FEATURE_SME_F8F16 != 0)
// _za32[_mf8_mf8] (only if __ARM_FEATURE_SME_F8F32 != 0)
void svmop4a[_1x1]_za32[_mf8_mf8]_fpm(uint64_t tile, svmfloat8_t zn,
svmfloat8_t zm, fpm_t fpm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za16[_mf8_mf8] (only if __ARM_FEATURE_SME_F8F16 != 0)
// _za32[_mf8_mf8] (only if __ARM_FEATURE_SME_F8F32 != 0)
void svmop4a[_2x2]_za32[_mf8_mf8]_fpm(uint64_t tile, svmfloat8x2_t zn,
svmfloat8x2_t zm, fpm_t fpm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za16[_mf8_mf8] (only if __ARM_FEATURE_SME_F8F16 != 0)
// _za32[_mf8_mf8] (only if __ARM_FEATURE_SME_F8F32 != 0)
void svmop4a[_2x1]_za32[_mf8_mf8]_fpm(uint64_t tile, svmfloat8x2_t zn,
svmfloat8_t zm, fpm_t fpm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za16[_mf8_mf8] (only if __ARM_FEATURE_SME_F8F16 != 0)
// _za32[_mf8_mf8] (only if __ARM_FEATURE_SME_F8F32 != 0)
void svmop4a[_1x2]_za32[_mf8_mf8]_fpm(uint64_t tile, svmfloat8_t zn,
svmfloat8x2_t zm, fpm_t fpm)
__arm_streaming __arm_inout("za");
```

# M-profile Vector Extension (MVE) intrinsics

The M-profile Vector Extension (MVE) [[MVE-spec]](#MVE-spec) instructions provide packed Single
Expand Down