llamafile: support s390x SIMD instruction set #14273
Conversation
Signed-off-by: Aaron Teo <[email protected]>
Thanks for your excellent PR. As in #14037, we can see that the horizontal summation of vector elements is one of the bottlenecks of mulmat; if the horizontal summation is skipped, the computational result is incorrect.
Hi @zhouwg, it's been a while :). While I'm unsure what is going on on your end, I've simplified hsum from 3 element additions to 2 by adding the original vector to a reversed copy of itself. This simplifies the assembly while producing the same result. Visualisation:
Have you tried something similar in your implementation? :)
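The reversed-vector reduction described above can be sketched in portable C (this is an illustration of the idea, not the actual s390x intrinsics; the function name `hsum4` is hypothetical). For `v = [a, b, c, d]`, adding the reversed vector yields `[a+d, b+c, c+b, d+a]`, so lanes 0 and 1 already hold the two partial sums and only one more addition is needed:

```c
#include <assert.h>

// Hypothetical sketch: horizontal sum of 4 floats via a reversed-vector add.
// One vector-style add produces pairwise sums; a single scalar add finishes.
static float hsum4(const float v[4]) {
    float rev[4] = { v[3], v[2], v[1], v[0] };   // reverse the lanes
    float t[4];
    for (int i = 0; i < 4; i++)
        t[i] = v[i] + rev[i];                    // maps to one SIMD add
    return t[0] + t[1];                          // (a+d) + (b+c) == a+b+c+d
}
```

Compared with the naive three-addition chain, the additions in the vector step are independent, which shortens the dependency chain the CPU has to wait on.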
Not yet; it seems better than the existing approaches I have tried. Thanks so much! This elegant approach works fine on Hexagon cDSP, though I'm not sure why it is slightly slower there than other approaches (e.g. loop unrolling). It might be due to the lack of certain SIMD instructions (extracting float data from an HVX vector seems to take more cycles, and memory-to-memory transfer appears to be more efficient on Hexagon cDSP, though I'm not sure why). Thanks again!
This pull request integrates the llamafile SIMD kernels for the s390x platform.
Currently, only the F32 and F16 data types are enabled. Quantised data types are a work in progress and will not be part of this PR.
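The general shape of such an F32 kernel can be sketched in portable C (an illustration of the structure only, not the s390x vector intrinsics this PR actually uses; `dot_f32` is a hypothetical name): accumulate several lanes in parallel in the loop body, reduce horizontally once at the end, and handle the remainder with a scalar tail.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical sketch of a lane-parallel F32 dot product.
// The inner 4-lane loop stands in for one SIMD multiply-add per iteration;
// the horizontal reduction happens exactly once, outside the hot loop.
static float dot_f32(const float *x, const float *y, size_t n) {
    float acc[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (int l = 0; l < 4; l++)
            acc[l] += x[i + l] * y[i + l];       // vectorized body
    float sum = (acc[0] + acc[3]) + (acc[1] + acc[2]);  // horizontal reduction
    for (; i < n; i++)
        sum += x[i] * y[i];                      // scalar tail
    return sum;
}
```

Keeping the reduction out of the inner loop is what makes the hsum cost discussed in the comments above a one-time overhead per row rather than a per-iteration one.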
Verification
To ensure that this implementation does not break anything, the SIMD kernels have been tested against the following models:
Performance Results
I used IBM Granite 3.3 for the performance tests. We observe an average 8.14% improvement in prompt processing.
Before Llamafile SIMD Instruction Set
After Llamafile SIMD Instruction Set
Note
Tests were conducted on an IBM z15 Mainframe with 16 IFLs (cores) and 160 GB Memory on a shared R&D LPAR.
Please review this pull request and consider merging it into the main repository. Thank you!