ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel #13053


Merged: 3 commits merged into ggml-org:master on May 12, 2025

Conversation

@eddnjjn (Contributor) commented Apr 21, 2025

This commit adds an fp32 = bf16 x bf16 GEMM SME kernel to the KleidiAI integration and enables the kernel for KV-cache matmul operations.

The integration currently requires the LHS to be F32 and the RHS to be F16. Both LHS and RHS are converted at runtime into the format expected by the KleidiAI kernel (BF16) before the kernel is executed.
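
A rough sketch of that runtime path, for orientation only (the three helper names, pack_lhs_f32_as_bf16, pack_rhs_f16_as_bf16 and run_sme_gemm_bf16bf16_f32, are hypothetical stand-ins for the KleidiAI packing routines and SME micro-kernel, which the integration reaches through its kernel descriptors):

// Sketch of the data flow described above, not the merged code.
// The helpers are placeholders; both packed buffers live in the per-op scratch (params->wdata).
static void mul_mat_f32_f16_via_bf16_sme(const float * lhs_f32,        // LHS activations, F32
                                         const ggml_fp16_t * rhs_f16,  // RHS data, F16
                                         float * dst_f32,              // output, F32
                                         int64_t m, int64_t n, int64_t k,
                                         void * lhs_packed, void * rhs_packed) {
    pack_lhs_f32_as_bf16(lhs_packed, lhs_f32, m, k);     // F32 -> packed BF16
    pack_rhs_f16_as_bf16(rhs_packed, rhs_f16, n, k);     // F16 -> F32 -> packed BF16
    // fp32 = bf16 x bf16 GEMM on SME; accumulation and output stay in F32.
    run_sme_gemm_bf16bf16_f32(m, n, k, lhs_packed, rhs_packed, dst_f32);
}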

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Apr 21, 2025
m_to_process = m - m_start;
}

if(m_start < m) {
Member:

Suggested change
if(m_start < m) {
if (m_start < m) {

variant_call<void>(kernel->run_kernel, m_to_process, n_to_process, k, lhs_ptr, rhs_ptr, dst_ptr, dst_stride, sizeof(float), -FLT_MAX, FLT_MAX);
}

ggml_barrier(params->threadpool);
Member:

It shouldn't be necessary to add a barrier at the end of an operation; there is always a barrier between ops.

Contributor (author):

The barrier is necessary in case the batch size (ne02) is larger than 1 since it separates the processing of batches from each other. The processing of each batch uses the work data (params->wdata) as temporary storage and they must therefore not be interleaved.

However, the code can be updated to avoid waiting at the barrier if batch_idx == batch_size - 1. I'll make this change.
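
For reference, the batching structure in question looks roughly like this (illustrative only; batch_size corresponds to ne02, and the two helpers are placeholders for the cooperative pack and run steps):

// Illustrative sketch: all threads of an op share params->wdata as scratch, so one
// batch must not start repacking into it before every thread is done with the
// previous batch. The wait can be skipped after the last batch because the
// scheduler already places a barrier between ops.
for (int64_t batch_idx = 0; batch_idx < batch_size; ++batch_idx) {
    pack_batch_into_wdata(params, batch_idx);    // placeholder: threads fill params->wdata
    run_sme_kernel_on_batch(params, batch_idx);  // placeholder: threads read params->wdata

    if (batch_idx != batch_size - 1) {
        ggml_barrier(params->threadpool);
    }
}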

Member:

Should add a comment in the code, or a reference to this discussion.

},
/* .required_cpu = */ CPU_FEATURE_SME,
/* .lhs_type = */ GGML_TYPE_F32,
/* .rhs_type = */ GGML_TYPE_F16,
Member:

GGML_TYPE_F16 is not BF16; it is standard float16. For BF16 you should use GGML_TYPE_BF16.

Reply:

This is defining the input types for this SME operation. In this case the RHS input type is GGML_TYPE_F16.

@ggerganov (Member) left a comment:

Just wondering, wouldn't it be more straightforward to use BF16 for the KV cache type and avoid the conversion from F16 -> BF16 for the RHS?


GGML_TENSOR_BINARY_OP_LOCALS
bool compute_forward_kv_cache(ggml_compute_params * params, struct ggml_tensor * dst) {
Member:

You probably would want to call these functions something more generic, like compute_forward_f16 and add comments that they are typically used during KV cache ops. Is there something that makes this work only for the KV cache matrix multiplications?

@damhog01 commented Apr 24, 2025:

Agreed on the function naming and comment.
This function will be called for any MUL_MAT for which the RHS is GGML_TYPE_F16 - and this is the default case for the KV-cache matrix multiplications.

Edit: Correction - This currently only handles KV-cache ops since others are filtered out in

ggml::cpu::tensor_traits * get_tensor_traits(const struct ggml_tensor * op) override {
  ...
  else if (ggml_kleidiai_select_kernels(ctx.features, op) &&
           op->src[0]->op == GGML_OP_VIEW &&
           (op->src[1]->op == GGML_OP_PERMUTE || op->src[1]->op == GGML_OP_SOFT_MAX) &&
           op->src[1]->ne[1] > 1) {

So this function will not handle generic FP32 x FP16 ops. It will only handle KV-cache FP32 x FP16.

Contributor (author):

Regarding BF16 and the KV-cache: we have currently only enabled the GEMM part of the KleidiAI kernel (m > 1) and let the default CPU implementation handle GEMV. As long as we don't accelerate GEMV I suspect switching to BF16 will harm performance. I'm currently out of office but will look into this when I get back at the end of next week!
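
For context, the current split is roughly the following (a sketch; both helper names are placeholders and the real integration selects the path via its kernel descriptors):

// Sketch: only GEMM shapes (m > 1) go through the KleidiAI SME bf16 kernel;
// GEMV shapes (m == 1, e.g. single-token decode) stay on ggml's default CPU path.
if (m > 1) {
    run_kleidiai_sme_path(params, dst);   // placeholder for the accelerated bf16 path
} else {
    run_default_cpu_path(params, dst);    // placeholder for the generic F16 fallback
}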

Collaborator:

That's a good point about defaulting the KV cache tensor to BF16. However, since this optimization is only active when KleidiAI is enabled, SME is present, and the environment variable is explicitly set, we opted to keep the KV-cache in F16. This avoids introducing type inconsistencies across the model and maintains compatibility with other backends. Additionally, because KleidiAI kernels are still evolving, keeping the tensor in F16 provides flexibility in case future versions expect different input formats or drop BF16 requirements altogether.

Member:

Yes, it's OK to support F16 KV cache. I just think you should also consider supporting BF16 KV cache in the future since it would require less memory copies and conversions and might give you even better perf. It should not bring any incompatibilities with other backends or models. The model can still be F16, only the cache can be changed to BF16 via user config (e.g. llama-cli -ctk bf16 -ctv bf16).

> As long as we don't accelerate GEMV I suspect switching to BF16 will harm performance.

Yes, that might be true, though atm I am not sure if GEMV is ever used for the KV cache matrix multiplication. It's still worth an experiment.

In any case, this is just a recommendation for the future - no action needed for this PR specifically.

Comment on lines 152 to 153
const size_t nth = params->nth;
const size_t ith = params->ith;
Member:

nit: for consistency

Suggested change
const size_t nth = params->nth;
const size_t ith = params->ith;
const int nth = params->nth;
const int ith = params->ith;

Generally when counting items, enumerating elements, dimensions, etc. we tend to use int or int64_t. size_t is used only for computing memory sizes, strides and offsets.

It's OK as it is and no change is needed, but for consistency with the rest of the ggml code, you can consider following this convention.

Comment on lines +230 to +231
variant_call<void>(kernels->rhs_info.pack_func, 1, n, k, nr, kr, sr, n * sizeof(float),
rhs_kxn, bias, nullptr, rhs_packed, 0, nullptr);
Member:

Just want to make sure that I'm following the logic:

  • First we take the F16 RHS data and cast+transpose it to F32
  • Next the F32 data is converted back to BF16?

Reply:

Yes, the RHS packing expects k x n F32 input, so we cast and transpose.
Following this, the data is packed into BF16 for use in the SME kernel.
The SME kernel evaluates BF16 * BF16 and returns F32.
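
A sketch of that two-step RHS preparation (the loop and its indexing are illustrative and assume a contiguous F16 source named rhs_f16; the pack_func call is the one quoted above):

// Step 1 (illustrative): cast the F16 RHS to F32 and transpose it into a k x n buffer,
// because the packing routine expects k x n F32 input.
for (int64_t y = 0; y < k; ++y) {
    for (int64_t x = 0; x < n; ++x) {
        rhs_kxn[y * n + x] = GGML_FP16_TO_FP32(rhs_f16[x * k + y]);  // indexing is illustrative
    }
}

// Step 2: the packing routine converts the F32 data to BF16 and lays it out in the
// blocked format the SME kernel consumes (the pack_func call shown above).
variant_call<void>(kernels->rhs_info.pack_func, 1, n, k, nr, kr, sr, n * sizeof(float),
                   rhs_kxn, bias, nullptr, rhs_packed, 0, nullptr);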

@ggerganov (Member) left a comment:

Apart from the overly-specific naming of the compute_forward_kv_cache() function, I don't think I have any other general advice. Also, I haven't tested the changes.

Should be good to merge.

@ggerganov ggerganov requested a review from slaren May 12, 2025 08:18
@slaren slaren merged commit a71a407 into ggml-org:master May 12, 2025
40 of 44 checks passed
Labels: ggml (changes relating to the ggml tensor library for machine learning)
Participants: 5