
ggml: move fp16/bf16 conversion optimizations to CPU backend + export conversion APIs #13107


Merged: 3 commits into ggml-org:master on Apr 26, 2025

Conversation

SongXiaoXi (Contributor) commented on Apr 25, 2025

This PR makes three main changes to the x86_64 implementations of the low-precision floating-point conversions in GGML:

  • Added runtime detection of CPU capabilities (e.g., AVX2, AVX-512) for ggml_bf16_to_fp32_row() and ggml_fp32_to_fp16_row() (see the dispatch sketch below).
  • Added an AVX-512-optimized version of ggml_fp32_to_fp16_row().
  • Moved the floating-point conversion routines to ggml-cpu.
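
For context, a minimal sketch of the runtime-dispatch pattern described in the first bullet, assuming GCC/Clang's `__builtin_cpu_supports()`; the function names, the simplified scalar conversion, and the placeholder AVX-512 kernel are illustrative, not the exact code added by this PR:

```c
#include <stdint.h>
#include <string.h>

typedef uint16_t fp16_t;  // stand-in for ggml_fp16_t

// Portable fallback: simplified fp32 -> fp16 (truncating mantissa; the real
// ggml implementation handles rounding, subnormals and NaN properly).
static void fp32_to_fp16_row_scalar(const float * x, fp16_t * y, int64_t n) {
    for (int64_t i = 0; i < n; ++i) {
        uint32_t u; memcpy(&u, &x[i], sizeof(u));
        uint16_t sign = (uint16_t)((u >> 16) & 0x8000);
        int32_t  e    = (int32_t)((u >> 23) & 0xff) - 127 + 15;
        uint16_t m    = (uint16_t)((u >> 13) & 0x03ff);
        y[i] = (e <= 0) ? sign : (e >= 31) ? (sign | 0x7c00)
                                           : (sign | (uint16_t)(e << 10) | m);
    }
}

// In the real code an AVX-512 kernel (converting 16 floats per iteration)
// would go here; reusing the scalar body keeps this sketch compilable.
static void fp32_to_fp16_row_avx512(const float * x, fp16_t * y, int64_t n) {
    fp32_to_fp16_row_scalar(x, y, n);
}

// Pick the best kernel once, then route all calls through a function pointer.
static void (*fp32_to_fp16_row)(const float *, fp16_t *, int64_t) = fp32_to_fp16_row_scalar;

static void fp16_dispatch_init(void) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("avx512f")) {
        fp32_to_fp16_row = fp32_to_fp16_row_avx512;
    }
#endif
}
```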

Benchmark (Ryzen 9950X)

build: 13be08d (5186)
Before Optimization:

| model        | size      | params | backend | threads | test  | t/s           |
| ------------ | --------- | ------ | ------- | ------- | ----- | ------------- |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU     | 16      | pp512 | 213.78 ± 0.24 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU     | 16      | tg128 | 4.16 ± 0.00   |

After Optimization:

| model        | size      | params | backend | threads | test  | t/s           |
| ------------ | --------- | ------ | ------- | ------- | ----- | ------------- |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU     | 16      | pp512 | 226.08 ± 0.09 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU     | 16      | tg128 | 4.15 ± 0.00   |

pp512: ~6% throughput increase (213.78 → 226.08 t/s)

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Apr 25, 2025
slaren (Member) commented on Apr 25, 2025

It would be better to simply move the accelerated versions to the CPU backend, since that's the only place where these functions are going to affect evaluation performance. Keep the basic C-only functions in ggml.c so that applications can use this functionality. It is not good to add this much complexity and use compiler-specific features.

SongXiaoXi (Contributor, Author) commented
@slaren Hi.

You're right that moving the accelerated implementations to the CPU backend would make the separation cleaner. However, there's a practical issue here:

ggml.c is compiled into libggml-base.so, which does not link against libggml-cpu.so. As a result, unless the application uses dlopen with manual symbol resolution, it cannot access the optimized AVX versions of functions like ggml_fp32_to_fp16_row().

Unless we explicitly register the accelerated implementations with libggml-base (e.g., by having ggml_cpu_init() hand over function pointers), the functions in libggml-base won't be able to see or invoke the optimized versions.
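
For illustration only, a hypothetical sketch of what such registration could look like; `ggml_set_fp32_to_fp16_row_impl` and the surrounding names do not exist in ggml and are only meant to show the idea being discussed:

```c
#include "ggml.h"

// Hypothetical registration hook that libggml-base could expose.  The default
// implementation is the existing plain-C ggml_fp32_to_fp16_row().
typedef void (*ggml_fp32_to_fp16_row_t)(const float *, ggml_fp16_t *, int64_t);

static ggml_fp32_to_fp16_row_t g_fp32_to_fp16_row_impl = ggml_fp32_to_fp16_row;

void ggml_set_fp32_to_fp16_row_impl(ggml_fp32_to_fp16_row_t fn) {  // hypothetical
    g_fp32_to_fp16_row_impl = fn ? fn : ggml_fp32_to_fp16_row;
}

// libggml-cpu would then call, from ggml_cpu_init():
//     ggml_set_fp32_to_fp16_row_impl(/* pointer to its vectorized kernel */);
```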

Let me know if you'd prefer this approach.

slaren (Member) commented on Apr 25, 2025

Yes, the optimized version will not be available to applications, but that's not really a problem since these functions are not generally used by applications in performance-sensitive paths.

So, to be clear:

  • Move the vectorized implementations to the CPU backend
  • Modify the CPU backend to use its own version of these functions, rather than the version from ggml-base

Applications can continue to use the basic C implementation from ggml-base. Alternatively, applications can use the faster CPU backend implementation by doing the conversion in a graph with a ggml_cast operation.
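
A minimal sketch of the `ggml_cast` route, assuming in-context allocation (`no_alloc = false`) and the CPU-only `ggml_graph_compute_with_ctx()` helper from `ggml-cpu.h`; the memory size and thread count are placeholder values:

```c
#include <string.h>
#include "ggml.h"
#include "ggml-cpu.h"

// Convert n FP32 values to FP16 by running a one-op graph on the CPU backend.
static void cast_f32_to_f16(const float * src, ggml_fp16_t * dst, int64_t n) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,  // sized generously for this example
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n);
    memcpy(a->data, src, n*sizeof(float));

    struct ggml_tensor * b = ggml_cast(ctx, a, GGML_TYPE_F16);  // the conversion op

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, b);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 4);

    memcpy(dst, b->data, n*sizeof(ggml_fp16_t));
    ggml_free(ctx);
}
```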

SongXiaoXi (Contributor, Author) commented
Got it — I understand your point now.

I've updated the implementation accordingly: the vectorized versions have been moved to the CPU backend, and the CPU backend now uses its own optimized functions instead of relying on the ones from ggml-base. It is indeed much cleaner this way.

Additionally, I added four new exported functions:

```c
GGML_BACKEND_API void ggml_cpu_fp32_to_fp16(const float *, ggml_fp16_t *, int64_t);
GGML_BACKEND_API void ggml_cpu_fp16_to_fp32(const ggml_fp16_t *, float *, int64_t);
GGML_BACKEND_API void ggml_cpu_fp32_to_bf16(const float *, ggml_bf16_t *, int64_t);
GGML_BACKEND_API void ggml_cpu_bf16_to_fp32(const ggml_bf16_t *, float *, int64_t);
```
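
A small usage sketch for these exports, assuming the caller links against ggml-cpu; the `ggml_cpu_init()` call is included on the assumption that it sets up the CPU backend's lookup tables before conversion:

```c
#include "ggml.h"
#include "ggml-cpu.h"

int main(void) {
    float       src[8]  = { 0.0f, 0.5f, 1.0f, -1.0f, 2.0f, -2.0f, 3.14f, -3.14f };
    ggml_fp16_t f16[8];
    float       back[8];

    ggml_cpu_init();                      // assumption: initialize CPU backend tables

    ggml_cpu_fp32_to_fp16(src, f16, 8);   // vectorized where the CPU supports it
    ggml_cpu_fp16_to_fp32(f16, back, 8);  // round-trip back to FP32

    return 0;
}
```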

SongXiaoXi changed the title from "ggml: dynamic x86_64 feature detection for FP32 <-> FP16/BF16 conversion" to "ggml: move fp16/bf16 conversion optimizations to CPU backend + export conversion APIs" on Apr 26, 2025
slaren (Member) commented on Apr 26, 2025

ggml_compute_forward_get_rows_f16 and ggml_compute_forward_get_rows_bf16 could also be changed to use these functions.

slaren merged commit 77d5e9a into ggml-org:master on Apr 26, 2025 · 48 checks passed
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Apr 28, 2025
… conversion APIs (ggml-org#13107)

* ggml: dynamic x86_64 feature detection for FP32 <-> FP16/BF16 conversion

* move fp converter to ggml-cpu

* Switch ggml_compute_forward_get_rows_f16/bf16 to new ggml_cpu_fp16/bf16_to_fp32