README.md: 75 additions & 43 deletions
@@ -12,26 +12,26 @@ Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others)

### Recent API changes

-- [2024 Apr 21] `llama_token_to_piece` can now optionally render special tokens https://github.com/ggerganov/llama.cpp/pull/6807
-- [2024 Apr 4] State and session file functions reorganized under `llama_state_*` https://github.com/ggerganov/llama.cpp/pull/6341
-- [2024 Mar 26] Logits and embeddings API updated for compactness https://github.com/ggerganov/llama.cpp/pull/6122
-- [2024 Mar 13] Add `llama_synchronize()` + `llama_context_params.n_ubatch` https://github.com/ggerganov/llama.cpp/pull/6017
-- [2024 Mar 8] `llama_kv_cache_seq_rm()` returns a `bool` instead of `void`, and new `llama_n_seq_max()` returns the upper limit of acceptable `seq_id` in batches (relevant when dealing with multiple sequences) https://github.com/ggerganov/llama.cpp/pull/5328
-- [2024 Mar 4] Embeddings API updated https://github.com/ggerganov/llama.cpp/pull/5796
-- [2024 Mar 3] `struct llama_context_params` https://github.com/ggerganov/llama.cpp/pull/5849
+- [2024 Apr 21] `llama_token_to_piece` can now optionally render special tokens <https://github.com/ggerganov/llama.cpp/pull/6807>
+- [2024 Apr 4] State and session file functions reorganized under `llama_state_*` <https://github.com/ggerganov/llama.cpp/pull/6341>
+- [2024 Mar 26] Logits and embeddings API updated for compactness <https://github.com/ggerganov/llama.cpp/pull/6122>
+- [2024 Mar 13] Add `llama_synchronize()` + `llama_context_params.n_ubatch` <https://github.com/ggerganov/llama.cpp/pull/6017>
+- [2024 Mar 8] `llama_kv_cache_seq_rm()` returns a `bool` instead of `void`, and new `llama_n_seq_max()` returns the upper limit of acceptable `seq_id` in batches (relevant when dealing with multiple sequences) <https://github.com/ggerganov/llama.cpp/pull/5328>
+- [2024 Mar 4] Embeddings API updated <https://github.com/ggerganov/llama.cpp/pull/5796>
+- [2024 Mar 3] `struct llama_context_params` <https://github.com/ggerganov/llama.cpp/pull/5849>

### Hot topics

-- **`convert.py` has been deprecated and moved to `examples/convert-legacy-llama.py`, please use `convert-hf-to-gguf.py`** https://github.com/ggerganov/llama.cpp/pull/7430
- Initial Mamba support has been added: https://github.com/ggerganov/llama.cpp/pull/5328
+- **`convert.py` has been deprecated and moved to `examples/convert-legacy-llama.py`, please use `convert-hf-to-gguf.py`** <https://github.com/ggerganov/llama.cpp/pull/7430>
- BPE pre-tokenization support has been added: <https://github.com/ggerganov/llama.cpp/pull/6920>
+- MoE memory layout has been updated - reconvert models for `mmap` support and regenerate `imatrix` <https://github.com/ggerganov/llama.cpp/pull/6387>
+- Model sharding instructions using `gguf-split` <https://github.com/ggerganov/llama.cpp/discussions/6404>
+- Fix major bug in Metal batched inference <https://github.com/ggerganov/llama.cpp/pull/6225>
+- Multi-GPU pipeline parallelism support <https://github.com/ggerganov/llama.cpp/pull/6017>
+- Looking for contributions to add Deepseek support: <https://github.com/ggerganov/llama.cpp/issues/5981>
@@ -328,6 +328,7 @@ In order to build llama.cpp you have four different options.
3. Run `w64devkit.exe`.
4. Use the `cd` command to reach the `llama.cpp` folder.
5. From here you can run:
+
```bash
make
```
@@ -346,9 +347,9 @@ In order to build llama.cpp you have four different options.

**Notes**:

-- For faster compilation, add the `-j` argument to run multiple jobs in parallel. For example, `cmake --build build --config Release -j 8` will run 8 jobs in parallel.
-- For faster repeated compilation, install [ccache](https://ccache.dev/).
-- For debug builds, there are two cases:
+- For faster compilation, add the `-j` argument to run multiple jobs in parallel. For example, `cmake --build build --config Release -j 8` will run 8 jobs in parallel.
+- For faster repeated compilation, install [ccache](https://ccache.dev/).
+- For debug builds, there are two cases:

1. Single-config generators (e.g. default = `Unix Makefiles`; note that they just ignore the `--config` flag):
@@ -364,7 +365,7 @@ In order to build llama.cpp you have four different options.
cmake --build build --config Debug
```

-- Using `gmake` (FreeBSD):
+- Using `gmake` (FreeBSD):

1. Install and activate [DRM in FreeBSD](https://wiki.freebsd.org/Graphics)
2. Add your user to **video** group
@@ -379,10 +380,12 @@ In order to build llama.cpp you have four different options.
### Homebrew

On Mac and Linux, the homebrew package manager can be used via
+
```
brew install llama.cpp
```
-The formula is automatically updated with new `llama.cpp` releases. More info: https://github.com/ggerganov/llama.cpp/discussions/7668
+
+The formula is automatically updated with new `llama.cpp` releases. More info: <https://github.com/ggerganov/llama.cpp/discussions/7668>

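To see what the formula installed, or to pick up one of those newer releases later, the standard Homebrew commands are enough (nothing below is specific to this formula):

```bash
# Show the installed formula's version and what it provides
brew info llama.cpp

# Pick up a newer release of the formula when one is published
brew upgrade llama.cpp
```
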
### Metal Build
@@ -396,16 +399,17 @@ argument.

Building the program with BLAS support may lead to some performance improvements in prompt processing using batch sizes higher than 32 (the default is 512). Support with CPU-only BLAS implementations doesn't affect the normal generation performance. We may see generation performance improvements with GPU-involved BLAS implementations, e.g. cuBLAS, hipBLAS. There are currently several different BLAS implementations available for build and use:

-- #### Accelerate Framework:
+- #### Accelerate Framework

This is only available on Mac PCs and it's enabled by default. You can just build using the normal instructions.

-- #### OpenBLAS:
+- #### OpenBLAS

This provides BLAS acceleration using only the CPU. Make sure to have OpenBLAS installed on your machine.

- Using `make`:
  - On Linux:
+
    ```bash
    make LLAMA_OPENBLAS=1
    ```
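For the CMake route, a sketch of the equivalent OpenBLAS build follows; it assumes the `LLAMA_BLAS` and `LLAMA_BLAS_VENDOR` options mentioned in the oneMKL notes below, so verify the option names against your checkout:

```bash
# Configure with BLAS enabled and OpenBLAS selected as the vendor
cmake -B build -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS
# Build in Release mode with 8 parallel jobs
cmake --build build --config Release -j 8
```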
@@ -437,17 +441,20 @@ Building the program with BLAS support may lead to some performance improvements
Check [BLIS.md](docs/BLIS.md) for more information.

- #### SYCL
+
SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators.

llama.cpp based on SYCL is used to **support Intel GPU** (Data Center Max series, Flex series, Arc series, Built-in GPU and iGPU).

For detailed info, please refer to [llama.cpp for SYCL](README-sycl.md).

- #### Intel oneMKL
+
Building through oneAPI compilers will make the avx_vnni instruction set available for Intel processors that do not support avx512 and avx512_vnni. Please note that this build config **does not support Intel GPU**. For Intel GPU support, please refer to [llama.cpp for SYCL](./README-sycl.md).

- Using manual oneAPI installation:
By default, `LLAMA_BLAS_VENDOR` is set to `Generic`, so if you have already sourced the Intel environment script and pass `-DLLAMA_BLAS=ON` to cmake, the MKL version of BLAS will automatically be selected. Otherwise, please install oneAPI and follow the steps below:
+
```bash
source /opt/intel/oneapi/setvars.sh # You can skip this step if in oneapi-basekit docker image, only required for manual installation
@@ -466,9 +473,11 @@ Building the program with BLAS support may lead to some performance improvements
For Jetson users: if you have a Jetson Orin, you can try this: [Official Support](https://www.jetson-ai-lab.com/tutorial_text-generation.html). If you are using an older model (Nano/TX2), some additional steps are needed before compiling.

- Using `make`:
+
```bash
make LLAMA_CUDA=1
```
+
- Using `CMake`:

```bash
@@ -496,26 +505,33 @@ Building the program with BLAS support may lead to some performance improvements
You can download it from your Linux distro's package manager or from here: [ROCm Quick Start (Linux)](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html#rocm-install-quick).

- Using `make`:
+
```bash
make LLAMA_HIPBLAS=1
```
+
- Using `CMake` for Linux (assuming a gfx1030-compatible AMD GPU):

On Linux it is also possible to use unified memory architecture (UMA) to share main memory between the CPU and integrated GPU by setting `-DLLAMA_HIP_UMA=ON`.
However, this hurts performance for non-integrated GPUs (but enables working with integrated GPUs).
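As an illustrative sketch of that CMake route, not a verified recipe: `LLAMA_HIPBLAS` and `AMDGPU_TARGETS` are the options referenced in this section, while the compiler paths below are assumptions to adapt to your ROCm install.

```bash
# Configure a HIP build for a gfx1030-class discrete GPU (compiler paths are illustrative)
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ \
  cmake -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release
# Build in Release mode with 8 parallel jobs
cmake --build build --config Release -j 8
```
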
Note that if you get the following error:
+
```
clang: error: cannot find ROCm device library; provide its path via '--rocm-path' or '--rocm-device-lib-path', or pass '-nogpulib' to build without ROCm device library
```
+
Try searching for a directory under `HIP_PATH` that contains the file
`oclc_abi_version_400.bc`. Then, add the following to the start of the
command: `HIP_DEVICE_LIB_PATH=<directory-you-just-found>`, so something

Make sure that `AMDGPU_TARGETS` is set to the GPU arch you want to compile for. The above example uses `gfx1100` that corresponds to Radeon RX 7900XTX/XT/GRE. You can find a list of targets [here](https://llvm.org/docs/AMDGPUUsage.html#processors)
Find your gpu version string by matching the most significant version information from `rocminfo | grep gfx | head -1 | awk '{print $2}'` with the list of processors, e.g. `gfx1035` maps to `gfx1030`.

-
The environment variable [`HIP_VISIBLE_DEVICES`](https://rocm.docs.amd.com/en/latest/understand/gpu_isolation.html#hip-visible-devices) can be used to specify which GPU(s) will be used.
If your GPU is not officially supported you can use the environment variable [`HSA_OVERRIDE_GFX_VERSION`] set to a similar GPU, for example 10.3.0 on RDNA2 (e.g. gfx1030, gfx1031, or gfx1035) or 11.0.0 on RDNA3.
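Both are ordinary environment variables, so they can simply be prefixed to the run command. A hedged example, with the `./main` binary name and model path as placeholders:

```bash
# Pretend an RDNA2 card is gfx1030 (10.3.0) and restrict execution to the first GPU;
# the binary name and model path are placeholders
HSA_OVERRIDE_GFX_VERSION=10.3.0 HIP_VISIBLE_DEVICES=0 ./main -m models/7B/ggml-model-q4_0.gguf -p "Hello" -n 64
```
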
The following compilation options are also available to tweak performance (yes, they refer to CUDA, not HIP, because it uses the same code as the cuBLAS version above):
@@ -577,7 +595,9 @@ Building the program with BLAS support may lead to some performance improvements
vulkaninfo
```

-Alternatively your package manager might be able to provide the appropiate libraries. For example for Ubuntu 22.04 you can install `libvulkan-dev` instead.
+Alternatively your package manager might be able to provide the appropriate libraries.
+For example for Ubuntu 22.04 you can install `libvulkan-dev` instead.
+For Fedora 40, you may install `vulkan-devel`, `glslc` and `glslang` packages.

Then, build llama.cpp using the cmake command below:
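As a sketch of what that command typically looks like (the `LLAMA_VULKAN` option name is an assumption to verify against the build documentation):

```bash
# Configure with the Vulkan backend enabled, then build in Release mode
cmake -B build -DLLAMA_VULKAN=1
cmake --build build --config Release -j 8
```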
@@ -701,19 +721,21 @@ Several quantization methods are supported. They differ in the resulting model d
You can use the `perplexity` example to measure perplexity over a given prompt (lower perplexity is better).
For more information, see [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity).

-The perplexity measurements in table above are done against the `wikitext2` test dataset (https://paperswithcode.com/dataset/wikitext-2), with context length of 512.
+The perplexity measurements in table above are done against the `wikitext2` test dataset (<https://paperswithcode.com/dataset/wikitext-2>), with context length of 512.
The time per token is measured on a MacBook M1 Pro 32GB RAM using 4 and 8 threads.
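In practice the measurement amounts to pointing the `perplexity` tool at a model and the raw wikitext-2 test file; the paths below are placeholders:

```bash
# Compute perplexity over the wikitext-2 test set with a 512-token context
./perplexity -m models/7B/ggml-model-q4_0.gguf -f wikitext-2-raw/wiki.test.raw -c 512
```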
The `grammars/` folder contains a handful of sample grammars. To write your own, check out the [GBNF Guide](./grammars/README.md).

-For authoring more complex JSON grammars, you can also check out https://grammar.intrinsiclabs.ai/, a browser app that lets you write TypeScript interfaces which it compiles to GBNF grammars that you can save for local use. Note that the app is built and maintained by members of the community, please file any issues or FRs on [its repo](http://github.com/intrinsiclabsai/gbnfgen) and not this one.
+For authoring more complex JSON grammars, you can also check out <https://grammar.intrinsiclabs.ai/>, a browser app that lets you write TypeScript interfaces which it compiles to GBNF grammars that you can save for local use. Note that the app is built and maintained by members of the community, please file any issues or FRs on [its repo](http://github.com/intrinsiclabsai/gbnfgen) and not this one.
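To constrain generation with one of those sample grammars, the grammar file is passed to the `main` example; the model path and prompt below are placeholders:

```bash
# Constrain the output to valid JSON using the bundled GBNF grammar
./main -m models/7B/ggml-model-q4_0.gguf --grammar-file grammars/json.gbnf -p "Describe a cat as a JSON object:"
```
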
If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
+
- LLaMA:
-- [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
-- [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
+- [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
+- [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
- GPT-3
-- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
+- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
- GPT-3.5 / InstructGPT / ChatGPT:
-- [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
-- [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
+- [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
+- [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)

### Android

#### Build on Android using Termux
+
[Termux](https://github.com/termux/termux-app#installation) is a method to execute `llama.cpp` on an Android device (no root required).
+
```
apt update && apt upgrade -y
apt install git make cmake
```

It's recommended to move your model inside the `~/` directory for best performance:
+
```
cd storage/downloads
mv model.gguf ~/
@@ -838,22 +864,25 @@ mv model.gguf ~/
[Get the code](https://github.com/ggerganov/llama.cpp#get-the-code) & [follow the Linux build instructions](https://github.com/ggerganov/llama.cpp#build) to build `llama.cpp`.

#### Building the Project using Android NDK
+
Obtain the [Android NDK](https://developer.android.com/ndk) and then build with CMake.

Execute the following commands on your computer to avoid downloading the NDK to your mobile. Alternatively, you can also do this in Termux:
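A minimal sketch of what that CMake invocation can look like, with the NDK path, ABI, and platform level as illustrative placeholders:

```bash
# Configure an arm64 Android build with the NDK's CMake toolchain file (all values are illustrative)
export NDK=/path/to/android-ndk
cmake -B build-android \
  -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23
cmake --build build-android --config Release -j 8
```
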
Install [termux](https://github.com/termux/termux-app#installation) on your device and run `termux-setup-storage` to get access to your SD card (if Android 11+ then run the command twice).

Finally, copy these built `llama` binaries and the model file to your device storage. Because the file permissions in the Android sdcard cannot be changed, you can copy the executable files to the `/data/data/com.termux/files/home/bin` path, and then execute the following commands in Termux to add executable permission:

(Assuming that you have pushed the built executable files to the /sdcard/llama.cpp/bin path using `adb push`.)
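Those commands boil down to a copy plus `chmod`; a sketch, assuming the paths described above:

```bash
# Copy the binaries out of the sdcard (where permissions cannot be changed) and mark them executable
cp -r /sdcard/llama.cpp/bin /data/data/com.termux/files/home/
chmod +x /data/data/com.termux/files/home/bin/*
```
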
-* Docker must be installed and running on your system.
-* Create a folder to store big models & intermediate files (ex. /llama/models)
+
+- Docker must be installed and running on your system.
+- Create a folder to store big models & intermediate files (ex. /llama/models)

#### Images
+
We have three Docker images available for this project:

1. `ghcr.io/ggerganov/llama.cpp:full`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization. (platforms: `linux/amd64`, `linux/arm64`)
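As a usage sketch for the `full` image, mounting the models folder created above; the model filename and prompt are placeholders, and the `--run` entrypoint argument should be checked against the image's own help output:

```bash
# Run inference with the all-in-one image, mounting the host models folder
docker run -v /llama/models:/models ghcr.io/ggerganov/llama.cpp:full \
  --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 256
```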