 
 Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++
 
-**Warnings**
-
-- `Q4_2` and `Q4_3` are still in development. Do not expect any kind of backward compatibility until they are finalized
-
 **Hot topics:**
 
+- [New quantization methods](https://github.com/ggerganov/llama.cpp#quantization)
 - [Added LoRA support](https://github.com/ggerganov/llama.cpp/pull/820)
 - [Add GPU support to ggml](https://github.com/ggerganov/llama.cpp/discussions/915)
 - [Roadmap Apr 2023](https://github.com/ggerganov/llama.cpp/discussions/784)
 
 ## Description
 
-The main goal of llama.cpp is to run the llama model using 4-bit quantization on a MacBook.
+The main goal of `llama.cpp` is to run the LLaMA model using 4-bit integer quantization on a MacBook
 
 - Plain C/C++ implementation without dependencies
 - Apple silicon first-class citizen - optimized via ARM NEON and Accelerate framework
 - AVX2 support for x86 architectures
 - Mixed F16 / F32 precision
-- 4-bit quantization support
+- 4-bit integer quantization support
 - Runs on the CPU
 
-This was [hacked in an evening](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022) - I have no idea if it works correctly.
-Please do not make conclusions about the models based on the results from this implementation.
-For all I know, it can be completely wrong. This project is for educational purposes.
-New features will probably be added mostly through community contributions.
+The original implementation of `llama.cpp` was [hacked in an evening](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022).
+Since then, the project has improved significantly thanks to many contributions. This project is for educational purposes and serves
+as the main playground for developing new features for the [ggml](https://github.com/ggerganov/ggml) library.
 
 **Supported platforms:**
 
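The 4-bit integer quantization mentioned in the description works block-wise: weights are grouped into small blocks, each block stores a floating-point scale together with one low-bit integer code per weight, and values are dequantized on the fly inside the matrix-multiplication kernels. Here is a minimal sketch of that idea in the spirit of the `Q4_0` format. It is not the ggml implementation (the real code packs two 4-bit codes per byte, runs NEON/AVX2 kernels, and the exact block layouts vary per format), so the struct and the `amax / 7` scale choice below are purely illustrative.

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define QK 32  /* number of weights per quantization block */

/* Illustrative block layout (not the exact ggml struct):
 * one float scale plus QK signed 4-bit values, kept
 * unpacked in int8_t here for clarity. */
typedef struct {
    float  d;      /* scale: dequantized value = d * q */
    int8_t q[QK];  /* quantized codes in [-8, 7] */
} block_q4_sketch;

/* Quantize QK floats into one block. */
static void quantize_block(const float *x, block_q4_sketch *b) {
    float amax = 0.0f;
    for (int i = 0; i < QK; i++) {
        if (fabsf(x[i]) > amax) amax = fabsf(x[i]);
    }
    b->d = amax / 7.0f;                 /* map the largest magnitude to +/-7 */
    const float id = b->d ? 1.0f / b->d : 0.0f;
    for (int i = 0; i < QK; i++) {
        int q = (int)roundf(x[i] * id); /* nearest 4-bit level */
        if (q >  7) q =  7;
        if (q < -8) q = -8;
        b->q[i] = (int8_t)q;
    }
}

/* Dequantize one block back to floats. */
static void dequantize_block(const block_q4_sketch *b, float *x) {
    for (int i = 0; i < QK; i++) {
        x[i] = b->d * (float)b->q[i];
    }
}

int main(void) {
    float x[QK], y[QK];
    for (int i = 0; i < QK; i++) x[i] = sinf((float)i); /* fake weights */

    block_q4_sketch b;
    quantize_block(x, &b);
    dequantize_block(&b, y);

    for (int i = 0; i < 4; i++) {
        printf("x[%d] = %+.4f  ->  %+.4f\n", i, x[i], y[i]);
    }
    return 0;
}
```

Built with something like `cc -O2 q4_sketch.c -lm` (the file name is made up for this example), it round-trips a block of fake weights and prints a few original values next to their 4-bit reconstructions, which gives a feel for the accuracy cost behind the perplexity numbers in the quantization table further down.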
@@ -294,6 +290,24 @@ As the models are currently fully loaded into memory, you will need adequate dis
 | 30B | 60 GB | 19.5 GB |
 | 65B | 120 GB | 38.5 GB |
 
+### Quantization
+
+Several quantization methods are supported. They differ in the resulting model disk size and inference speed.
+
+Model | F16 | Q4_0 | Q4_1 | Q4_2 | Q4_3 | Q5_0 | Q5_1 | Q8_0
+-- | -- | -- | -- | -- | -- | -- | -- | --
+7B (ppl) | 5.9565 | 6.2103 | 6.1286 | 6.1698 | 6.0617 | 6.0139 | 5.9934 | 5.9571
+7B (size) | 13.0G | 4.0G | 4.8G | 4.0G | 4.8G | 4.4G | 4.8G | 7.1G
+7B (ms/tok @ 4th) | 128 | 56 | 61 | 84 | 91 | 91 | 95 | 75
+7B (ms/tok @ 8th) | 128 | 47 | 55 | 48 | 53 | 53 | 59 | 75
+7B (bpw) | 16.0 | 5.0 | 6.0 | 5.0 | 6.0 | 5.5 | 6.0 | 9.0
+-- | -- | -- | -- | -- | -- | -- | -- | --
+13B (ppl) | 5.2455 | 5.3748 | 5.3471 | 5.3433 | 5.3234 | 5.2768 | 5.2582 | 5.2458
+13B (size) | 25.0G | 7.6G | 9.1G | 7.6G | 9.1G | 8.4G | 9.1G | 14G
+13B (ms/tok @ 4th) | 239 | 104 | 113 | 160 | 175 | 176 | 185 | 141
+13B (ms/tok @ 8th) | 240 | 85 | 99 | 97 | 114 | 108 | 117 | 147
+13B (bpw) | 16.0 | 5.0 | 6.0 | 5.0 | 6.0 | 5.5 | 6.0 | 9.0
+
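A few notes on reading the table: `ppl` is perplexity (the exponential of the average negative log-likelihood per token on a held-out text; lower is better), `ms/tok @ 4th` and `@ 8th` are milliseconds per token with 4 and 8 threads, and `bpw` is bits per weight on disk. The `bpw` column is plain per-block storage arithmetic. The snippet below reproduces it under assumed block layouts (32 weights per block, an fp32 scale for `Q4_0` and `Q8_0`, an additional fp32 minimum for `Q4_1`, an fp16 scale plus one extra high bit per weight for `Q5_0`); these layouts are consistent with the table but are stated as assumptions rather than as a spec of the ggml structs.

```c
#include <stdio.h>

/* Bits-per-weight (bpw) derived from the per-block storage cost.
 * The block layouts encoded below are assumptions about the formats at
 * the time of writing; they show where the table's bpw numbers come
 * from, not the exact ggml struct definitions. */
static double bpw(int block_bytes, int weights_per_block) {
    return 8.0 * block_bytes / weights_per_block;
}

int main(void) {
    /* Q4_0: fp32 scale (4 B) + 32 x 4-bit codes (16 B) per 32 weights */
    printf("Q4_0: %.1f bpw\n", bpw(4 + 16, 32));      /* 5.0 */
    /* Q4_1: adds an fp32 per-block minimum (4 B)                      */
    printf("Q4_1: %.1f bpw\n", bpw(4 + 4 + 16, 32));  /* 6.0 */
    /* Q5_0: fp16 scale (2 B) + 32 high bits (4 B) + 16 B of low bits  */
    printf("Q5_0: %.1f bpw\n", bpw(2 + 4 + 16, 32));  /* 5.5 */
    /* Q8_0: fp32 scale (4 B) + 32 x 8-bit codes (32 B)                */
    printf("Q8_0: %.1f bpw\n", bpw(4 + 32, 32));      /* 9.0 */
    return 0;
}
```

The same arithmetic roughly accounts for the size rows and for the disk figures higher up: LLaMA-7B has about 6.7B parameters, so 6.7e9 * 5.0 bits is roughly 4.2 GB for `Q4_0` (listed as 4.0G), and at 16 bits the 65B model needs about 130 GB, in line with the 120 GB original size in the memory/disk table.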
 ### Interactive mode
 
 If you want a more ChatGPT-like experience, you can run in interactive mode by passing `-i` as a parameter.