llamafile: support s390x SIMD instruction set #14273
Conversation
Signed-off-by: Aaron Teo <[email protected]>
Thanks for your excellent PR. As in #14037, we can see that the horizontal summation of vector elements is one of the bottlenecks of mulmat; if the horizontal summation is skipped, the computational result is incorrect.
Hi @zhouwg, it's been a while :). While I'm unsure what is going on on your end, I've simplified hsum from 3 element additions to 2 by adding the original vector to a reversed copy of itself. This simplifies the assembly while producing the same result. Visualisation:
Have you tried something similar in your implementation? :)
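The reversed-vector reduction described above can be sketched in portable C (this is an illustration of the idea, not the actual s390x intrinsics; the function name `hsum4` is hypothetical). For `v = [a, b, c, d]`, adding the reversed vector yields `[a+d, b+c, c+b, d+a]`, so lanes 0 and 1 already hold the two partial sums and only one more addition is needed:

```c
#include <assert.h>

// Hypothetical sketch: horizontal sum of 4 floats via a reversed-vector add.
// One vector-style add produces pairwise sums; a single scalar add finishes.
static float hsum4(const float v[4]) {
    float rev[4] = { v[3], v[2], v[1], v[0] };   // reverse the lanes
    float t[4];
    for (int i = 0; i < 4; i++)
        t[i] = v[i] + rev[i];                    // maps to one SIMD add
    return t[0] + t[1];                          // (a+d) + (b+c) == a+b+c+d
}
```

Compared with the naive three-addition chain, the additions in the vector step are independent, which shortens the dependency chain the CPU has to wait on.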
Not yet; it seems better than the existing approaches I have tried. Thanks so much! This elegant approach works fine on Hexagon cDSP, though I'm not sure why it is slightly slower there than other approaches (e.g. loop unrolling). It might be due to the lack of certain SIMD instructions (extracting float data from an HVX vector seems to take more cycles, and memory-to-memory transfer appears to be more efficient on Hexagon cDSP, though I'm not sure why). Thanks again!
This pull request integrates the llamafile SIMD kernels for the s390x platform.
Currently, only the F32 and F16 data types are enabled. Quantised data types are a work in progress and will not be part of this PR.
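The general shape of such an F32 kernel can be sketched in portable C (an illustration of the structure only, not the s390x vector intrinsics this PR actually uses; `dot_f32` is a hypothetical name): accumulate several lanes in parallel in the loop body, reduce horizontally once at the end, and handle the remainder with a scalar tail.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical sketch of a lane-parallel F32 dot product.
// The inner 4-lane loop stands in for one SIMD multiply-add per iteration;
// the horizontal reduction happens exactly once, outside the hot loop.
static float dot_f32(const float *x, const float *y, size_t n) {
    float acc[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (int l = 0; l < 4; l++)
            acc[l] += x[i + l] * y[i + l];       // vectorized body
    float sum = (acc[0] + acc[3]) + (acc[1] + acc[2]);  // horizontal reduction
    for (; i < n; i++)
        sum += x[i] * y[i];                      // scalar tail
    return sum;
}
```

Keeping the reduction out of the inner loop is what makes the hsum cost discussed in the comments above a one-time overhead per row rather than a per-iteration one.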
Verification
To ensure that this implementation does not break anything, the SIMD kernels have been tested against the following models:
Performance Results
I used IBM Granite 3.3 for the performance tests. We observe an average 8.14% improvement in prompt processing.
Before Llamafile SIMD Instruction Set
After Llamafile SIMD Instruction Set
Note
Tests were conducted on an IBM z15 Mainframe with 16 IFLs (cores) and 160 GB Memory on a shared R&D LPAR.
Please review this pull request and consider merging it into the main repository. Thank you!