
Commit 563d8d2

Fix a typo + add Fedora packages for Vulkan
* "appropriate" has a mistake * append Feodra packages to install to be able to compile with Vulkan support
1 parent f5d7b26 commit 563d8d2

File tree: 1 file changed (+75, -43 lines)

README.md

Lines changed: 75 additions & 43 deletions
@@ -12,26 +12,26 @@ Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others)
 ### Recent API changes
 
-- [2024 Apr 21] `llama_token_to_piece` can now optionally render special tokens https://github.com/ggerganov/llama.cpp/pull/6807
-- [2024 Apr 4] State and session file functions reorganized under `llama_state_*` https://github.com/ggerganov/llama.cpp/pull/6341
-- [2024 Mar 26] Logits and embeddings API updated for compactness https://github.com/ggerganov/llama.cpp/pull/6122
-- [2024 Mar 13] Add `llama_synchronize()` + `llama_context_params.n_ubatch` https://github.com/ggerganov/llama.cpp/pull/6017
-- [2024 Mar 8] `llama_kv_cache_seq_rm()` returns a `bool` instead of `void`, and new `llama_n_seq_max()` returns the upper limit of acceptable `seq_id` in batches (relevant when dealing with multiple sequences) https://github.com/ggerganov/llama.cpp/pull/5328
-- [2024 Mar 4] Embeddings API updated https://github.com/ggerganov/llama.cpp/pull/5796
-- [2024 Mar 3] `struct llama_context_params` https://github.com/ggerganov/llama.cpp/pull/5849
+- [2024 Apr 21] `llama_token_to_piece` can now optionally render special tokens <https://github.com/ggerganov/llama.cpp/pull/6807>
+- [2024 Apr 4] State and session file functions reorganized under `llama_state_*` <https://github.com/ggerganov/llama.cpp/pull/6341>
+- [2024 Mar 26] Logits and embeddings API updated for compactness <https://github.com/ggerganov/llama.cpp/pull/6122>
+- [2024 Mar 13] Add `llama_synchronize()` + `llama_context_params.n_ubatch` <https://github.com/ggerganov/llama.cpp/pull/6017>
+- [2024 Mar 8] `llama_kv_cache_seq_rm()` returns a `bool` instead of `void`, and new `llama_n_seq_max()` returns the upper limit of acceptable `seq_id` in batches (relevant when dealing with multiple sequences) <https://github.com/ggerganov/llama.cpp/pull/5328>
+- [2024 Mar 4] Embeddings API updated <https://github.com/ggerganov/llama.cpp/pull/5796>
+- [2024 Mar 3] `struct llama_context_params` <https://github.com/ggerganov/llama.cpp/pull/5849>
 
 ### Hot topics
 
-- **`convert.py` has been deprecated and moved to `examples/convert-legacy-llama.py`, please use `convert-hf-to-gguf.py`** https://github.com/ggerganov/llama.cpp/pull/7430
-- Initial Flash-Attention support: https://github.com/ggerganov/llama.cpp/pull/5021
-- BPE pre-tokenization support has been added: https://github.com/ggerganov/llama.cpp/pull/6920
-- MoE memory layout has been updated - reconvert models for `mmap` support and regenerate `imatrix` https://github.com/ggerganov/llama.cpp/pull/6387
-- Model sharding instructions using `gguf-split` https://github.com/ggerganov/llama.cpp/discussions/6404
-- Fix major bug in Metal batched inference https://github.com/ggerganov/llama.cpp/pull/6225
-- Multi-GPU pipeline parallelism support https://github.com/ggerganov/llama.cpp/pull/6017
-- Looking for contributions to add Deepseek support: https://github.com/ggerganov/llama.cpp/issues/5981
-- Quantization blind testing: https://github.com/ggerganov/llama.cpp/discussions/5962
-- Initial Mamba support has been added: https://github.com/ggerganov/llama.cpp/pull/5328
+- **`convert.py` has been deprecated and moved to `examples/convert-legacy-llama.py`, please use `convert-hf-to-gguf.py`** <https://github.com/ggerganov/llama.cpp/pull/7430>
+- Initial Flash-Attention support: <https://github.com/ggerganov/llama.cpp/pull/5021>
+- BPE pre-tokenization support has been added: <https://github.com/ggerganov/llama.cpp/pull/6920>
+- MoE memory layout has been updated - reconvert models for `mmap` support and regenerate `imatrix` <https://github.com/ggerganov/llama.cpp/pull/6387>
+- Model sharding instructions using `gguf-split` <https://github.com/ggerganov/llama.cpp/discussions/6404>
+- Fix major bug in Metal batched inference <https://github.com/ggerganov/llama.cpp/pull/6225>
+- Multi-GPU pipeline parallelism support <https://github.com/ggerganov/llama.cpp/pull/6017>
+- Looking for contributions to add Deepseek support: <https://github.com/ggerganov/llama.cpp/issues/5981>
+- Quantization blind testing: <https://github.com/ggerganov/llama.cpp/discussions/5962>
+- Initial Mamba support has been added: <https://github.com/ggerganov/llama.cpp/pull/5328>
 
 ----

@@ -297,7 +297,7 @@ llama_print_timings: total time = 25431.49 ms
 And here is another demo of running both LLaMA-7B and [whisper.cpp](https://github.com/ggerganov/whisper.cpp) on a single M1 Pro MacBook:
 
-https://user-images.githubusercontent.com/1991296/224442907-7693d4be-acaa-4e01-8b4f-add84093ffff.mp4
+<https://user-images.githubusercontent.com/1991296/224442907-7693d4be-acaa-4e01-8b4f-add84093ffff.mp4>
 
 ## Usage

@@ -328,6 +328,7 @@ In order to build llama.cpp you have four different options.
 3. Run `w64devkit.exe`.
 4. Use the `cd` command to reach the `llama.cpp` folder.
 5. From here you can run:
+
 ```bash
 make
 ```
@@ -346,9 +347,9 @@ In order to build llama.cpp you have four different options.
 **Notes**:
 
-- For faster compilation, add the `-j` argument to run multiple jobs in parallel. For example, `cmake --build build --config Release -j 8` will run 8 jobs in parallel.
-- For faster repeated compilation, install [ccache](https://ccache.dev/).
-- For debug builds, there are two cases:
+- For faster compilation, add the `-j` argument to run multiple jobs in parallel. For example, `cmake --build build --config Release -j 8` will run 8 jobs in parallel.
+- For faster repeated compilation, install [ccache](https://ccache.dev/).
+- For debug builds, there are two cases:
 
 1. Single-config generators (e.g. default = `Unix Makefiles`; note that they just ignore the `--config` flag):

@@ -364,7 +365,7 @@ In order to build llama.cpp you have four different options.
 cmake --build build --config Debug
 ```
 
-- Using `gmake` (FreeBSD):
+- Using `gmake` (FreeBSD):
 
 1. Install and activate [DRM in FreeBSD](https://wiki.freebsd.org/Graphics)
 2. Add your user to **video** group
@@ -379,10 +380,12 @@ In order to build llama.cpp you have four different options.
 ### Homebrew
 
 On Mac and Linux, the homebrew package manager can be used via
+
 ```
 brew install llama.cpp
 ```
-The formula is automatically updated with new `llama.cpp` releases. More info: https://github.com/ggerganov/llama.cpp/discussions/7668
+
+The formula is automatically updated with new `llama.cpp` releases. More info: <https://github.com/ggerganov/llama.cpp/discussions/7668>
 
 ### Metal Build
@@ -396,16 +399,17 @@ argument.
 Building the program with BLAS support may lead to some performance improvements in prompt processing using batch sizes higher than 32 (the default is 512). Support with CPU-only BLAS implementations doesn't affect the normal generation performance. We may see generation performance improvements with GPU-involved BLAS implementations, e.g. cuBLAS, hipBLAS. There are currently several different BLAS implementations available for build and use:
 
-- #### Accelerate Framework:
+- #### Accelerate Framework
 
 This is only available on Mac PCs and it's enabled by default. You can just build using the normal instructions.
 
-- #### OpenBLAS:
+- #### OpenBLAS
 
 This provides BLAS acceleration using only the CPU. Make sure to have OpenBLAS installed on your machine.
 
 - Using `make`:
 - On Linux:
+
 ```bash
 make LLAMA_OPENBLAS=1
 ```
@@ -437,17 +441,20 @@ Building the program with BLAS support may lead to some performance improvements
 Check [BLIS.md](docs/BLIS.md) for more information.
 
 - #### SYCL
+
 SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators.
 
 llama.cpp based on SYCL is used to **support Intel GPU** (Data Center Max series, Flex series, Arc series, Built-in GPU and iGPU).
 
 For detailed info, please refer to [llama.cpp for SYCL](README-sycl.md).
 
 - #### Intel oneMKL
+
 Building through oneAPI compilers will make avx_vnni instruction set available for intel processors that do not support avx512 and avx512_vnni. Please note that this build config **does not support Intel GPU**. For Intel GPU support, please refer to [llama.cpp for SYCL](./README-sycl.md).
 
 - Using manual oneAPI installation:
 By default, `LLAMA_BLAS_VENDOR` is set to `Generic`, so if you already sourced intel environment script and assign `-DLLAMA_BLAS=ON` in cmake, the mkl version of Blas will automatically been selected. Otherwise please install oneAPI and follow the below steps:
+
 ```bash
 source /opt/intel/oneapi/setvars.sh # You can skip this step if in oneapi-basekit docker image, only required for manual installation
 cmake -B build -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_NATIVE=ON
@@ -466,9 +473,11 @@ Building the program with BLAS support may lead to some performance improvements
 For Jetson user, if you have Jetson Orin, you can try this: [Offical Support](https://www.jetson-ai-lab.com/tutorial_text-generation.html). If you are using an old model(nano/TX2), need some additional operations before compiling.
 
 - Using `make`:
+
 ```bash
 make LLAMA_CUDA=1
 ```
+
 - Using `CMake`:
 
 ```bash
@@ -496,26 +505,33 @@ Building the program with BLAS support may lead to some performance improvements
 You can download it from your Linux distro's package manager or from here: [ROCm Quick Start (Linux)](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html#rocm-install-quick).
 
 - Using `make`:
+
 ```bash
 make LLAMA_HIPBLAS=1
 ```
+
 - Using `CMake` for Linux (assuming a gfx1030-compatible AMD GPU):
+
 ```bash
 HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
 cmake -S . -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release \
 && cmake --build build --config Release -- -j 16
 ```
+
 On Linux it is also possible to use unified memory architecture (UMA) to share main memory between the CPU and integrated GPU by setting `-DLLAMA_HIP_UMA=ON`.
 However, this hurts performance for non-integrated GPUs (but enables working with integrated GPUs).
 
 Note that if you get the following error:
+
 ```
 clang: error: cannot find ROCm device library; provide its path via '--rocm-path' or '--rocm-device-lib-path', or pass '-nogpulib' to build without ROCm device library
 ```
+
 Try searching for a directory under `HIP_PATH` that contains the file
 `oclc_abi_version_400.bc`. Then, add the following to the start of the
 command: `HIP_DEVICE_LIB_PATH=<directory-you-just-found>`, so something
 like:
+
 ```bash
 HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -p)" \
 HIP_DEVICE_LIB_PATH=<directory-you-just-found> \
@@ -524,20 +540,22 @@ Building the program with BLAS support may lead to some performance improvements
 ```
 
 - Using `make` (example for target gfx1030, build with 16 CPU threads):
+
 ```bash
 make -j16 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx1030
 ```
 
 - Using `CMake` for Windows (using x64 Native Tools Command Prompt for VS, and assuming a gfx1100-compatible AMD GPU):
+
 ```bash
 set PATH=%HIP_PATH%\bin;%PATH%
 cmake -S . -B build -G Ninja -DAMDGPU_TARGETS=gfx1100 -DLLAMA_HIPBLAS=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release
 cmake --build build
 ```
+
 Make sure that `AMDGPU_TARGETS` is set to the GPU arch you want to compile for. The above example uses `gfx1100` that corresponds to Radeon RX 7900XTX/XT/GRE. You can find a list of targets [here](https://llvm.org/docs/AMDGPUUsage.html#processors)
 Find your gpu version string by matching the most significant version information from `rocminfo | grep gfx | head -1 | awk '{print $2}'` with the list of processors, e.g. `gfx1035` maps to `gfx1030`.
 
-
 The environment variable [`HIP_VISIBLE_DEVICES`](https://rocm.docs.amd.com/en/latest/understand/gpu_isolation.html#hip-visible-devices) can be used to specify which GPU(s) will be used.
 If your GPU is not officially supported you can use the environment variable [`HSA_OVERRIDE_GFX_VERSION`] set to a similar GPU, for example 10.3.0 on RDNA2 (e.g. gfx1030, gfx1031, or gfx1035) or 11.0.0 on RDNA3.
 The following compilation options are also available to tweak performance (yes, they refer to CUDA, not HIP, because it uses the same code as the cuBLAS version above):
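As a quick, illustrative sketch of the two environment variables described above (the model path, prompt, and `-ngl` value are placeholders, not part of this diff):

```bash
# Run on the first ROCm device only; if the GPU is not officially supported,
# report a similar supported arch (10.3.0 = RDNA2, as noted above).
export HIP_VISIBLE_DEVICES=0
export HSA_OVERRIDE_GFX_VERSION=10.3.0
./main -m ./models/7B/ggml-model-q4_0.gguf -p "Hello" -ngl 32
```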
@@ -577,7 +595,9 @@ Building the program with BLAS support may lead to some performance improvements
 vulkaninfo
 ```
 
-Alternatively your package manager might be able to provide the appropiate libraries. For example for Ubuntu 22.04 you can install `libvulkan-dev` instead.
+Alternatively your package manager might be able to provide the appropriate libraries.
+For example for Ubuntu 22.04 you can install `libvulkan-dev` instead.
+For Fedora 40, you may install `vulkan-devel`, `glslc` and `glslang` packages.
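As a minimal sketch of the Fedora note added above (assuming Fedora 40 and `dnf`), those packages could be installed with:

```bash
# Vulkan build dependencies on Fedora 40, per the line added in this commit
sudo dnf install vulkan-devel glslc glslang
```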
 
 Then, build llama.cpp using the cmake command below:

@@ -701,19 +721,21 @@ Several quantization methods are supported. They differ in the resulting model d
 You can use the `perplexity` example to measure perplexity over a given prompt (lower perplexity is better).
 For more information, see [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity).
 
-The perplexity measurements in table above are done against the `wikitext2` test dataset (https://paperswithcode.com/dataset/wikitext-2), with context length of 512.
+The perplexity measurements in table above are done against the `wikitext2` test dataset (<https://paperswithcode.com/dataset/wikitext-2>), with context length of 512.
 The time per token is measured on a MacBook M1 Pro 32GB RAM using 4 and 8 threads.
 
 #### How to run
 
-1. Download/extract: https://huggingface.co/datasets/ggml-org/ci/resolve/main/wikitext-2-raw-v1.zip
+1. Download/extract: <https://huggingface.co/datasets/ggml-org/ci/resolve/main/wikitext-2-raw-v1.zip>
 2. Run `./perplexity -m models/7B/ggml-model-q4_0.gguf -f wiki.test.raw`
 3. Output:
+
 ```
 perplexity : calculating perplexity over 655 chunks
 24.43 seconds per pass - ETA 4.45 hours
 [1]4.5970,[2]5.1807,[3]6.0382,...
 ```
+
 And after 4.45 hours, you will have the final perplexity.
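For reference, the ETA in the sample output follows directly from the numbers shown: 655 chunks × 24.43 s per pass ≈ 16,000 s ≈ 4.45 hours.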
 
 ### Interactive mode
@@ -767,7 +789,7 @@ PROMPT_TEMPLATE=./prompts/chat-with-bob.txt PROMPT_CACHE_FILE=bob.prompt.bin \
 The `grammars/` folder contains a handful of sample grammars. To write your own, check out the [GBNF Guide](./grammars/README.md).
 
-For authoring more complex JSON grammars, you can also check out https://grammar.intrinsiclabs.ai/, a browser app that lets you write TypeScript interfaces which it compiles to GBNF grammars that you can save for local use. Note that the app is built and maintained by members of the community, please file any issues or FRs on [its repo](http://github.com/intrinsiclabsai/gbnfgen) and not this one.
+For authoring more complex JSON grammars, you can also check out <https://grammar.intrinsiclabs.ai/>, a browser app that lets you write TypeScript interfaces which it compiles to GBNF grammars that you can save for local use. Note that the app is built and maintained by members of the community, please file any issues or FRs on [its repo](http://github.com/intrinsiclabsai/gbnfgen) and not this one.
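As an illustrative sketch only (not from the diff, and assuming the `--grammar` flag of `main` accepts an inline GBNF string), a tiny grammar that constrains the model to a yes/no answer might look like:

```bash
# Hypothetical example: constrain output with an inline GBNF grammar
./main -m models/7B/ggml-model-q4_0.gguf -p "Is the sky blue? " \
  --grammar 'root ::= ("yes" | "no") "\n"' -n 8
```

See the GBNF Guide linked above for the authoritative syntax.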
 
 ### Instruct mode

@@ -811,25 +833,29 @@ cadaver, cauliflower, cabbage (vegetable), catalpa (tree) and Cailleach.
 ### Seminal papers and background on the models
 
 If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
+
 - LLaMA:
-  - [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
-  - [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
+  - [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
+  - [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
 - GPT-3
-  - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
+  - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
 - GPT-3.5 / InstructGPT / ChatGPT:
-  - [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
-  - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
+  - [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
+  - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
 
 ### Android
 
 #### Build on Android using Termux
+
 [Termux](https://github.com/termux/termux-app#installation) is a method to execute `llama.cpp` on an Android device (no root required).
+
 ```
 apt update && apt upgrade -y
 apt install git make cmake
 ```
 
 It's recommended to move your model inside the `~/` directory for best performance:
+
 ```
 cd storage/downloads
 mv model.gguf ~/
@@ -838,22 +864,25 @@ mv model.gguf ~/
 [Get the code](https://github.com/ggerganov/llama.cpp#get-the-code) & [follow the Linux build instructions](https://github.com/ggerganov/llama.cpp#build) to build `llama.cpp`.
 
 #### Building the Project using Android NDK
+
 Obtain the [Android NDK](https://developer.android.com/ndk) and then build with CMake.
 
 Execute the following commands on your computer to avoid downloading the NDK to your mobile. Alternatively, you can also do this in Termux:
+
 ```
-$ mkdir build-android
-$ cd build-android
-$ export NDK=<your_ndk_directory>
-$ cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod ..
-$ make
+mkdir build-android
+cd build-android
+export NDK=<your_ndk_directory>
+cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod ..
+make
 ```
 
 Install [termux](https://github.com/termux/termux-app#installation) on your device and run `termux-setup-storage` to get access to your SD card (if Android 11+ then run the command twice).
 
 Finally, copy these built `llama` binaries and the model file to your device storage. Because the file permissions in the Android sdcard cannot be changed, you can copy the executable files to the `/data/data/com.termux/files/home/bin` path, and then execute the following commands in Termux to add executable permission:
 
 (Assumed that you have pushed the built executable files to the /sdcard/llama.cpp/bin path using `adb push`)
+
 ```
 $cp -r /sdcard/llama.cpp/bin /data/data/com.termux/files/home/
 $cd /data/data/com.termux/files/home/bin
@@ -867,22 +896,25 @@ $mv /sdcard/llama.cpp/llama-2-7b-chat.Q4_K_M.gguf /data/data/com.termux/files/ho
 ```
 
 Now, you can start chatting:
+
 ```
 $cd /data/data/com.termux/files/home/bin
 $./main -m ../model/llama-2-7b-chat.Q4_K_M.gguf -n 128 -cml
 ```
 
 Here's a demo of an interactive session running on Pixel 5 phone:
 
-https://user-images.githubusercontent.com/271616/225014776-1d567049-ad71-4ef2-b050-55b0b3b9274c.mp4
+<https://user-images.githubusercontent.com/271616/225014776-1d567049-ad71-4ef2-b050-55b0b3b9274c.mp4>
 
 ### Docker
 
 #### Prerequisites
-* Docker must be installed and running on your system.
-* Create a folder to store big models & intermediate files (ex. /llama/models)
+
+- Docker must be installed and running on your system.
+- Create a folder to store big models & intermediate files (ex. /llama/models)
 
 #### Images
+
 We have three Docker images available for this project:
 
 1. `ghcr.io/ggerganov/llama.cpp:full`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization. (platforms: `linux/amd64`, `linux/arm64`)
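As an illustrative sketch of how the `full` image above might be invoked (the model path and prompt are placeholders, and `/llama/models` is the folder suggested in the prerequisites; check the Docker section of the full README for the actual entrypoint arguments):

```bash
# Mount the models folder and run inference with the full image
docker run -v /llama/models:/models ghcr.io/ggerganov/llama.cpp:full \
  --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512
```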
