
llamafile: support s390x SIMD instruction set #14273


Merged
10 commits merged into ggml-org:master on Jun 19, 2025

Conversation

taronaeo
Contributor

This pull request aims to integrate the SIMD instruction set into Llamafile for the s390x platform.

Currently, only the F32 and F16 data types are activated. Quantised data types are a work in progress and will not be part of this PR.

Verification

To ensure that this implementation did not break anything, the SIMD instruction set has been tested on the following models:

  • Tested IBM Granite 3.3 (F32, F16, Q4_0, Q4_1, Q8_0, Q3_K, Q4_K, Q5_K, Q6_K, IQ4_NL, IQ4_XS)
  • Please request any additional models you would like tested in this PR

Performance Results

I used IBM Granite 3.3 for the performance tests. We observe an average performance improvement of 8.14% in Prompt Processing.

Before Llamafile SIMD Instruction Set

| model              | size     | params | backend | threads | test  | t/s          |
| ------------------ | -------- | ------ | ------- | ------- | ----- | ------------ |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS    | 16      | pp512 | 57.45 ± 0.28 |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS    | 16      | tg128 | 2.33 ± 0.09  |
| granite 3B F16     | 4.72 GiB | 2.53 B | BLAS    | 16      | pp512 | 56.54 ± 0.37 |
| granite 3B F16     | 4.72 GiB | 2.53 B | BLAS    | 16      | tg128 | 4.19 ± 0.31  |

After Llamafile SIMD Instruction Set

| model              | size     | params | backend | threads | test  | t/s          |
| ------------------ | -------- | ------ | ------- | ------- | ----- | ------------ |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS    | 16      | pp512 | 62.33 ± 0.45 |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS    | 16      | tg128 | 2.34 ± 0.04  |
| granite 3B F16     | 4.72 GiB | 2.53 B | BLAS    | 16      | pp512 | 61.98 ± 0.10 |
| granite 3B F16     | 4.72 GiB | 2.53 B | BLAS    | 16      | tg128 | 4.46 ± 0.14  |
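
These tables follow the usual llama-bench output format. Assuming the numbers were collected with llama.cpp's llama-bench tool, a comparable run would look roughly like the sketch below; the model path is a placeholder, not the exact file used here.

```sh
# Hypothetical reproduction of the pp512 / tg128 rows above with llama-bench.
# The GGUF path is illustrative; substitute the actual Granite 3.3 model file.
./llama-bench -m granite-3.3-f16.gguf -t 16 -p 512 -n 128
```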

Note

Tests were conducted on an IBM z15 mainframe with 16 IFLs (cores) and 160 GB of memory on a shared R&D LPAR.

Please review this pull request and consider merging into the main repository. Thank you!

The github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jun 19, 2025
@CISC merged commit faed5a5 into ggml-org:master on Jun 19, 2025
47 checks passed
@zhouwg
Contributor

zhouwg commented Jun 20, 2025

Thanks for your excellent PR. As you did in #14037, we can see that the horizontal summation of vector elements is one of the bottlenecks of mulmat.

If the horizontal summation of vector elements is skipped, the computational result is incorrect.

@taronaeo
Contributor Author

Hi @zhouwg, it's been a while :). While I'm unsure what is going on on your end, I've simplified hsum from three element additions to two by adding the original vector to a reversed copy of itself, which simplifies the assembly calls while producing the same result.

Visualisation:

Original Vector (F32): { 9, 6, 3, 8 }
Reversed Vector (F32): { 8, 3, 6, 9 }

Original + Reversed Vector (F32): { 17, 9, 9, 17 } (we call this Sum Vector)

Finally, we add the Sum Vector's first and second elements together to get the hsum scalar value (17 + 9 = 26).
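
To make the idea concrete, here is a minimal sketch of the reverse-and-add horizontal sum in portable C. It uses generic GCC/Clang vector extensions rather than the s390x intrinsics actually used in this PR, and the function name is made up for this illustration.

```c
#include <stdio.h>

/* Four packed 32-bit floats, via GCC/Clang vector extensions. */
typedef float v4f32 __attribute__((vector_size(16)));

static float hsum_f32x4(v4f32 v) {
    /* Reversed copy of the vector: { a, b, c, d } -> { d, c, b, a }. */
    v4f32 rev = { v[3], v[2], v[1], v[0] };
    /* One lane-wise add: { a+d, b+c, c+b, d+a }. */
    v4f32 sum = v + rev;
    /* Lanes 0 and 1 together already contain all four elements. */
    return sum[0] + sum[1];
}

int main(void) {
    v4f32 v = { 9.0f, 6.0f, 3.0f, 8.0f };
    printf("hsum = %.1f\n", hsum_f32x4(v)); /* prints hsum = 26.0 */
    return 0;
}
```

The reversed copy collapses the reduction to a single vector add plus one scalar add, which is the simplification described above.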

Have you tried something similar in your implementation? :)

@zhouwg
Contributor

zhouwg commented Jun 20, 2025

Not yet; it seems better than the existing approaches I have tried. Thanks so much!


This elegant approach works fine on the Hexagon cDSP. I'm not sure why it is a little slower than other approaches (e.g. loop unrolling); it might be the lack of some special SIMD instructions (it seems that extracting float data from an HVX vector needs more cycles, and memory-to-memory transfers are more efficient on the Hexagon cDSP, though I'm not sure why).

thanks again!
