
ggml : Implementations for Q4_0_8_8 quantization based functions - RISC-V vector version #10029


Merged
merged 7 commits into ggml-org:master on Oct 30, 2024

Conversation

xctan
Collaborator

@xctan xctan commented Oct 24, 2024

I've tested Mistral 7B in qemu, and it just worked. I'm still choosing a suitable 3B model for my dev board, which has only 4 GB of RAM (Banana Pi BPI-F3), so I can't give a performance evaluation yet, and any help is welcome! BTW, Mistral 7B could run on the BPI-F3 with an extra 4 GB of swap enabled, but it was much slower than even qemu.

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Oct 24, 2024
@xctan
Collaborator Author

xctan commented Oct 24, 2024

Model: https://huggingface.co/CobraMamba/mamba-gpt-3b-v4
Compiler: GCC 13.2.0

| model | size | params | backend | threads | test | t/s | speedup | commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | pp512 | 1.82 ± 0.00 | 271% | 78c78e2 |
| llama 3B Q4_0 | 1.84 GiB | 3.43 B | CPU | 8 | pp512 | 1.04 ± 0.00 | 112% | 66c2c93 |
| llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | pp512 | 0.49 ± 0.00 | | 66c2c93 |
| llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | tg128 | 2.25 ± 0.10 | 350% | 78c78e2 |
| llama 3B Q4_0 | 1.84 GiB | 3.43 B | CPU | 8 | tg128 | 1.27 ± 0.03 | 154% | 66c2c93 |
| llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | tg128 | 0.50 ± 0.01 | | 66c2c93 |

@xctan
Collaborator Author

xctan commented Oct 30, 2024

Is there anything wrong with this PR? Should I provide more test results, such as llama-perplexity? @ggerganov
I'm trying `./llama-perplexity -m <model_name> -f wikitext-2-raw/wiki.test.raw` for the Q4_0_8_8 GEMM implementations both with and without this PR, but it would take ~400 hours in total to finish.

@ggerganov ggerganov merged commit fc83a9e into ggml-org:master Oct 30, 2024
51 of 52 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
* ggml : RISC-V vector gemv for q4_0_8x8

* ggml : Added WIP rvv q4_0_8x8 gemm

* ggml : Added initial implementation of rvv gemm

* ggml : optimize gemm to avoid register spillover

* ggml : Fix GCC rvv load alignment issue

* ggml : Format gemm rvv code

* ggml : Fix a typo in RVV q4_0_8_8 GEMM
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024