
ggml: move fp16/bf16 conversion optimizations to CPU backend + export conversion APIs #13107


Merged: 3 commits into ggml-org:master on Apr 26, 2025

Conversation

SongXiaoXi (Contributor) commented on Apr 25, 2025

This PR makes three main changes to the x86_64 implementations of the low-precision floating-point conversions in GGML:

  • Added runtime detection of CPU capabilities (e.g., AVX2, AVX-512) for ggml_bf16_to_fp32_row() and ggml_fp32_to_fp16_row() (see the dispatch sketch below).
  • Added an AVX-512-optimized version of ggml_fp32_to_fp16_row().
  • Moved the floating-point conversion routines to ggml-cpu.
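
For context, a minimal sketch of the runtime-dispatch pattern described in the first bullet, assuming GCC/Clang's `__builtin_cpu_supports()`; the function names, the simplified scalar conversion, and the placeholder AVX-512 kernel are illustrative, not the exact code added by this PR:

```c
#include <stdint.h>
#include <string.h>

typedef uint16_t fp16_t;  // stand-in for ggml_fp16_t

// Portable fallback: simplified fp32 -> fp16 (truncating mantissa; the real
// ggml implementation handles rounding, subnormals and NaN properly).
static void fp32_to_fp16_row_scalar(const float * x, fp16_t * y, int64_t n) {
    for (int64_t i = 0; i < n; ++i) {
        uint32_t u; memcpy(&u, &x[i], sizeof(u));
        uint16_t sign = (uint16_t)((u >> 16) & 0x8000);
        int32_t  e    = (int32_t)((u >> 23) & 0xff) - 127 + 15;
        uint16_t m    = (uint16_t)((u >> 13) & 0x03ff);
        y[i] = (e <= 0) ? sign : (e >= 31) ? (sign | 0x7c00)
                                           : (sign | (uint16_t)(e << 10) | m);
    }
}

// In the real code an AVX-512 kernel (converting 16 floats per iteration)
// would go here; reusing the scalar body keeps this sketch compilable.
static void fp32_to_fp16_row_avx512(const float * x, fp16_t * y, int64_t n) {
    fp32_to_fp16_row_scalar(x, y, n);
}

// Pick the best kernel once, then route all calls through a function pointer.
static void (*fp32_to_fp16_row)(const float *, fp16_t *, int64_t) = fp32_to_fp16_row_scalar;

static void fp16_dispatch_init(void) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("avx512f")) {
        fp32_to_fp16_row = fp32_to_fp16_row_avx512;
    }
#endif
}
```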

Benchmark (Ryzen 9950X)

build: 13be08d (5186)
Before Optimization:

| model        | size      | params | backend | threads | test  | t/s           |
| ------------ | --------- | ------ | ------- | ------- | ----- | ------------- |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU     | 16      | pp512 | 213.78 ± 0.24 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU     | 16      | tg128 | 4.16 ± 0.00   |

After Optimization:

| model        | size      | params | backend | threads | test  | t/s           |
| ------------ | --------- | ------ | ------- | ------- | ----- | ------------- |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU     | 16      | pp512 | 226.08 ± 0.09 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU     | 16      | tg128 | 4.15 ± 0.00   |

pp512: ~6% throughput increase (213.78 → 226.08 t/s)

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Apr 25, 2025
slaren (Member) commented on Apr 25, 2025

It would be better to simply move the accelerated versions to the CPU backend, since that's the only place where these functions are going to affect evaluation performance. Keep the basic C-only functions in ggml.c so that applications can use this functionality. It is not good to add this much complexity and use compiler-specific features.

SongXiaoXi (Contributor, Author) commented
@slaren Hi.

You're right that moving the accelerated implementations to the CPU backend would make the separation cleaner. However, there's a practical issue here:

ggml.c is compiled into libggml-base.so, which does not link against libggml-cpu.so. As a result, unless the application uses dlopen with manual symbol resolution, it cannot access the optimized AVX versions of functions like ggml_fp32_to_fp16_row().

Unless we explicitly register the accelerated implementations with libggml-base (e.g., by having ggml_cpu_init() hand over function pointers), the functions in libggml-base won't be able to see or invoke the optimized versions.
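
For illustration only, a hypothetical sketch of what such registration could look like; `ggml_set_fp32_to_fp16_row_impl` and the surrounding names do not exist in ggml and are only meant to show the idea being discussed:

```c
#include "ggml.h"

// Hypothetical registration hook that libggml-base could expose.  The default
// implementation is the existing plain-C ggml_fp32_to_fp16_row().
typedef void (*ggml_fp32_to_fp16_row_t)(const float *, ggml_fp16_t *, int64_t);

static ggml_fp32_to_fp16_row_t g_fp32_to_fp16_row_impl = ggml_fp32_to_fp16_row;

void ggml_set_fp32_to_fp16_row_impl(ggml_fp32_to_fp16_row_t fn) {  // hypothetical
    g_fp32_to_fp16_row_impl = fn ? fn : ggml_fp32_to_fp16_row;
}

// libggml-cpu would then call, from ggml_cpu_init():
//     ggml_set_fp32_to_fp16_row_impl(/* pointer to its vectorized kernel */);
```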

Let me know if you'd prefer this approach.

slaren (Member) commented on Apr 25, 2025

Yes, the optimized version will not be available to applications, but that's not really a problem since these functions are not generally used by applications in performance-sensitive paths.

So, to be clear:

  • Move the vectorized implementations to the CPU backend
  • Modify the CPU backend to use its own version of these functions, rather than the version from ggml-base

Applications can continue to use the basic C implementation from ggml-base. Alternatively, applications can use the faster CPU backend implementation by doing the conversion in a graph with a ggml_cast operation.
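
A minimal sketch of the `ggml_cast` route, assuming in-context allocation (`no_alloc = false`) and the CPU-only `ggml_graph_compute_with_ctx()` helper from `ggml-cpu.h`; the memory size and thread count are placeholder values:

```c
#include <string.h>
#include "ggml.h"
#include "ggml-cpu.h"

// Convert n FP32 values to FP16 by running a one-op graph on the CPU backend.
static void cast_f32_to_f16(const float * src, ggml_fp16_t * dst, int64_t n) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,  // sized generously for this example
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n);
    memcpy(a->data, src, n*sizeof(float));

    struct ggml_tensor * b = ggml_cast(ctx, a, GGML_TYPE_F16);  // the conversion op

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, b);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 4);

    memcpy(dst, b->data, n*sizeof(ggml_fp16_t));
    ggml_free(ctx);
}
```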

SongXiaoXi (Contributor, Author) commented
Got it — I understand your point now.

I've updated the implementation accordingly: the vectorized versions have been moved to the CPU backend, and the CPU backend now uses its own optimized functions instead of relying on the ones from ggml-base. It is indeed much cleaner this way.

Additionally, I added four new exported functions:

```c
GGML_BACKEND_API void ggml_cpu_fp32_to_fp16(const float *, ggml_fp16_t *, int64_t);
GGML_BACKEND_API void ggml_cpu_fp16_to_fp32(const ggml_fp16_t *, float *, int64_t);
GGML_BACKEND_API void ggml_cpu_fp32_to_bf16(const float *, ggml_bf16_t *, int64_t);
GGML_BACKEND_API void ggml_cpu_bf16_to_fp32(const ggml_bf16_t *, float *, int64_t);
```
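
A small usage sketch for these exports, assuming the caller links against ggml-cpu; the `ggml_cpu_init()` call is included on the assumption that it sets up the CPU backend's lookup tables before conversion:

```c
#include "ggml.h"
#include "ggml-cpu.h"

int main(void) {
    float       src[8]  = { 0.0f, 0.5f, 1.0f, -1.0f, 2.0f, -2.0f, 3.14f, -3.14f };
    ggml_fp16_t f16[8];
    float       back[8];

    ggml_cpu_init();                      // assumption: initialize CPU backend tables

    ggml_cpu_fp32_to_fp16(src, f16, 8);   // vectorized where the CPU supports it
    ggml_cpu_fp16_to_fp32(f16, back, 8);  // round-trip back to FP32

    return 0;
}
```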

SongXiaoXi changed the title from "ggml: dynamic x86_64 feature detection for FP32 <-> FP16/BF16 conversion" to "ggml: move fp16/bf16 conversion optimizations to CPU backend + export conversion APIs" on Apr 26, 2025
slaren (Member) commented on Apr 26, 2025

ggml_compute_forward_get_rows_f16 and ggml_compute_forward_get_rows_bf16 could also be changed to use these functions.

slaren merged commit 77d5e9a into ggml-org:master on Apr 26, 2025 · 48 checks passed
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Apr 28, 2025
… conversion APIs (ggml-org#13107)

* ggml: dynamic x86_64 feature detection for FP32 <-> FP16/BF16 conversion

* move fp converter to ggml-cpu

* Switch ggml_compute_forward_get_rows_f16/bf16 to new ggml_cpu_fp16/bf16_to_fp32