README.md: 75 additions & 43 deletions
@@ -12,26 +12,26 @@ Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others)

### Recent API changes

-- [2024 Apr 21] `llama_token_to_piece` can now optionally render special tokens https://github.com/ggerganov/llama.cpp/pull/6807
-- [2024 Apr 4] State and session file functions reorganized under `llama_state_*` https://github.com/ggerganov/llama.cpp/pull/6341
-- [2024 Mar 26] Logits and embeddings API updated for compactness https://github.com/ggerganov/llama.cpp/pull/6122
-- [2024 Mar 13] Add `llama_synchronize()` + `llama_context_params.n_ubatch` https://github.com/ggerganov/llama.cpp/pull/6017
-- [2024 Mar 8] `llama_kv_cache_seq_rm()` returns a `bool` instead of `void`, and new `llama_n_seq_max()` returns the upper limit of acceptable `seq_id` in batches (relevant when dealing with multiple sequences) https://github.com/ggerganov/llama.cpp/pull/5328
-- [2024 Mar 4] Embeddings API updated https://github.com/ggerganov/llama.cpp/pull/5796
-- [2024 Mar 3] `struct llama_context_params` https://github.com/ggerganov/llama.cpp/pull/5849
+- [2024 Apr 21] `llama_token_to_piece` can now optionally render special tokens <https://github.com/ggerganov/llama.cpp/pull/6807>
+- [2024 Apr 4] State and session file functions reorganized under `llama_state_*` <https://github.com/ggerganov/llama.cpp/pull/6341>
+- [2024 Mar 26] Logits and embeddings API updated for compactness <https://github.com/ggerganov/llama.cpp/pull/6122>
+- [2024 Mar 13] Add `llama_synchronize()` + `llama_context_params.n_ubatch` <https://github.com/ggerganov/llama.cpp/pull/6017>
+- [2024 Mar 8] `llama_kv_cache_seq_rm()` returns a `bool` instead of `void`, and new `llama_n_seq_max()` returns the upper limit of acceptable `seq_id` in batches (relevant when dealing with multiple sequences) <https://github.com/ggerganov/llama.cpp/pull/5328>
+- [2024 Mar 4] Embeddings API updated <https://github.com/ggerganov/llama.cpp/pull/5796>
+- [2024 Mar 3] `struct llama_context_params` <https://github.com/ggerganov/llama.cpp/pull/5849>

### Hot topics

-- **`convert.py` has been deprecated and moved to `examples/convert-legacy-llama.py`, please use `convert-hf-to-gguf.py`** https://github.com/ggerganov/llama.cpp/pull/7430
- Initial Mamba support has been added: https://github.com/ggerganov/llama.cpp/pull/5328
+- **`convert.py` has been deprecated and moved to `examples/convert-legacy-llama.py`, please use `convert-hf-to-gguf.py`** <https://github.com/ggerganov/llama.cpp/pull/7430>
- BPE pre-tokenization support has been added: <https://github.com/ggerganov/llama.cpp/pull/6920>
+- MoE memory layout has been updated - reconvert models for `mmap` support and regenerate `imatrix` <https://github.com/ggerganov/llama.cpp/pull/6387>
+- Model sharding instructions using `gguf-split` <https://github.com/ggerganov/llama.cpp/discussions/6404>
+- Fix major bug in Metal batched inference <https://github.com/ggerganov/llama.cpp/pull/6225>
+- Multi-GPU pipeline parallelism support <https://github.com/ggerganov/llama.cpp/pull/6017>
+- Looking for contributions to add Deepseek support: <https://github.com/ggerganov/llama.cpp/issues/5981>
@@ -328,6 +328,7 @@ In order to build llama.cpp you have four different options.
3. Run `w64devkit.exe`.
4. Use the `cd` command to reach the `llama.cpp` folder.
5. From here you can run:
+
```bash
make
```
@@ -346,9 +347,9 @@ In order to build llama.cpp you have four different options.

**Notes**:

-- For faster compilation, add the `-j` argument to run multiple jobs in parallel. For example, `cmake --build build --config Release -j 8` will run 8 jobs in parallel.
-- For faster repeated compilation, install [ccache](https://ccache.dev/).
-- For debug builds, there are two cases:
+- For faster compilation, add the `-j` argument to run multiple jobs in parallel. For example, `cmake --build build --config Release -j 8` will run 8 jobs in parallel.
+- For faster repeated compilation, install [ccache](https://ccache.dev/).
+- For debug builds, there are two cases:

1. Single-config generators (e.g. default = `Unix Makefiles`; note that they just ignore the `--config` flag):
@@ -364,7 +365,7 @@ In order to build llama.cpp you have four different options.
cmake --build build --config Debug
```

-- Using `gmake` (FreeBSD):
+- Using `gmake` (FreeBSD):

1. Install and activate [DRM in FreeBSD](https://wiki.freebsd.org/Graphics)
2. Add your user to **video** group
@@ -379,10 +380,12 @@ In order to build llama.cpp you have four different options.
### Homebrew

On Mac and Linux, the homebrew package manager can be used via
+
```
brew install llama.cpp
```
-The formula is automatically updated with new `llama.cpp` releases. More info: https://github.com/ggerganov/llama.cpp/discussions/7668
+
+The formula is automatically updated with new `llama.cpp` releases. More info: <https://github.com/ggerganov/llama.cpp/discussions/7668>

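To see what the formula installed, or to pick up one of those newer releases later, the standard Homebrew commands are enough (nothing below is specific to this formula):

```bash
# Show the installed formula's version and what it provides
brew info llama.cpp

# Pick up a newer release of the formula when one is published
brew upgrade llama.cpp
```
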
### Metal Build
@@ -396,16 +399,17 @@ argument.

Building the program with BLAS support may lead to some performance improvements in prompt processing using batch sizes higher than 32 (the default is 512). Support with CPU-only BLAS implementations doesn't affect the normal generation performance. We may see generation performance improvements with GPU-involved BLAS implementations, e.g. cuBLAS, hipBLAS. There are currently several different BLAS implementations available for build and use:

-- #### Accelerate Framework:
+- #### Accelerate Framework

This is only available on Mac PCs and it's enabled by default. You can just build using the normal instructions.

-- #### OpenBLAS:
+- #### OpenBLAS

This provides BLAS acceleration using only the CPU. Make sure to have OpenBLAS installed on your machine.

- Using `make`:
  - On Linux:
+
    ```bash
    make LLAMA_OPENBLAS=1
    ```
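For the CMake route, a sketch of the equivalent OpenBLAS build follows; it assumes the `LLAMA_BLAS` and `LLAMA_BLAS_VENDOR` options mentioned in the oneMKL notes below, so verify the option names against your checkout:

```bash
# Configure with BLAS enabled and OpenBLAS selected as the vendor
cmake -B build -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS
# Build in Release mode with 8 parallel jobs
cmake --build build --config Release -j 8
```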
@@ -437,17 +441,20 @@ Building the program with BLAS support may lead to some performance improvements
Check [BLIS.md](docs/BLIS.md) for more information.

- #### SYCL
+
SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators.

llama.cpp based on SYCL is used to **support Intel GPU** (Data Center Max series, Flex series, Arc series, Built-in GPU and iGPU).

For detailed info, please refer to [llama.cpp for SYCL](README-sycl.md).

- #### Intel oneMKL
+
Building through oneAPI compilers will make the avx_vnni instruction set available for Intel processors that do not support avx512 and avx512_vnni. Please note that this build config **does not support Intel GPU**. For Intel GPU support, please refer to [llama.cpp for SYCL](./README-sycl.md).

- Using manual oneAPI installation:
By default, `LLAMA_BLAS_VENDOR` is set to `Generic`, so if you have already sourced the Intel environment script and pass `-DLLAMA_BLAS=ON` to cmake, the MKL version of BLAS will automatically be selected. Otherwise, please install oneAPI and follow the steps below:
+
```bash
source /opt/intel/oneapi/setvars.sh # You can skip this step if in oneapi-basekit docker image, only required for manual installation
@@ -466,9 +473,11 @@ Building the program with BLAS support may lead to some performance improvements
For Jetson users: if you have a Jetson Orin, you can try this: [Official Support](https://www.jetson-ai-lab.com/tutorial_text-generation.html). If you are using an older model (Nano/TX2), some additional steps are needed before compiling.

- Using `make`:
+
```bash
make LLAMA_CUDA=1
```
+
- Using `CMake`:

```bash
@@ -496,26 +505,33 @@ Building the program with BLAS support may lead to some performance improvements
You can download it from your Linux distro's package manager or from here: [ROCm Quick Start (Linux)](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html#rocm-install-quick).

- Using `make`:
+
```bash
make LLAMA_HIPBLAS=1
```
+
- Using `CMake` for Linux (assuming a gfx1030-compatible AMD GPU):

On Linux it is also possible to use unified memory architecture (UMA) to share main memory between the CPU and integrated GPU by setting `-DLLAMA_HIP_UMA=ON`.
However, this hurts performance for non-integrated GPUs (but enables working with integrated GPUs).
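As an illustrative sketch of that CMake route, not a verified recipe: `LLAMA_HIPBLAS` and `AMDGPU_TARGETS` are the options referenced in this section, while the compiler paths below are assumptions to adapt to your ROCm install.

```bash
# Configure a HIP build for a gfx1030-class discrete GPU (compiler paths are illustrative)
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ \
  cmake -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release
# Build in Release mode with 8 parallel jobs
cmake --build build --config Release -j 8
```
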
Note that if you get the following error:
+
```
clang: error: cannot find ROCm device library; provide its path via '--rocm-path' or '--rocm-device-lib-path', or pass '-nogpulib' to build without ROCm device library
```
+
Try searching for a directory under `HIP_PATH` that contains the file
`oclc_abi_version_400.bc`. Then, add the following to the start of the
command: `HIP_DEVICE_LIB_PATH=<directory-you-just-found>`, so something

Make sure that `AMDGPU_TARGETS` is set to the GPU arch you want to compile for. The above example uses `gfx1100` that corresponds to Radeon RX 7900XTX/XT/GRE. You can find a list of targets [here](https://llvm.org/docs/AMDGPUUsage.html#processors)
Find your gpu version string by matching the most significant version information from `rocminfo | grep gfx | head -1 | awk '{print $2}'` with the list of processors, e.g. `gfx1035` maps to `gfx1030`.

-
The environment variable [`HIP_VISIBLE_DEVICES`](https://rocm.docs.amd.com/en/latest/understand/gpu_isolation.html#hip-visible-devices) can be used to specify which GPU(s) will be used.
If your GPU is not officially supported you can use the environment variable [`HSA_OVERRIDE_GFX_VERSION`] set to a similar GPU, for example 10.3.0 on RDNA2 (e.g. gfx1030, gfx1031, or gfx1035) or 11.0.0 on RDNA3.
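Both are ordinary environment variables, so they can simply be prefixed to the run command. A hedged example, with the `./main` binary name and model path as placeholders:

```bash
# Pretend an RDNA2 card is gfx1030 (10.3.0) and restrict execution to the first GPU;
# the binary name and model path are placeholders
HSA_OVERRIDE_GFX_VERSION=10.3.0 HIP_VISIBLE_DEVICES=0 ./main -m models/7B/ggml-model-q4_0.gguf -p "Hello" -n 64
```
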
The following compilation options are also available to tweak performance (yes, they refer to CUDA, not HIP, because it uses the same code as the cuBLAS version above):
@@ -577,7 +595,9 @@ Building the program with BLAS support may lead to some performance improvements
vulkaninfo
```

-Alternatively your package manager might be able to provide the appropiate libraries. For example for Ubuntu 22.04 you can install `libvulkan-dev` instead.
+Alternatively your package manager might be able to provide the appropriate libraries.
+For example for Ubuntu 22.04 you can install `libvulkan-dev` instead.
+For Fedora 40, you may install `vulkan-devel`, `glslc` and `glslang` packages.

Then, build llama.cpp using the cmake command below:
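As a sketch of what that command typically looks like (the `LLAMA_VULKAN` option name is an assumption to verify against the build documentation):

```bash
# Configure with the Vulkan backend enabled, then build in Release mode
cmake -B build -DLLAMA_VULKAN=1
cmake --build build --config Release -j 8
```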
@@ -701,19 +721,21 @@ Several quantization methods are supported. They differ in the resulting model d
You can use the `perplexity` example to measure perplexity over a given prompt (lower perplexity is better).
For more information, see [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity).

-The perplexity measurements in table above are done against the `wikitext2` test dataset (https://paperswithcode.com/dataset/wikitext-2), with context length of 512.
+The perplexity measurements in table above are done against the `wikitext2` test dataset (<https://paperswithcode.com/dataset/wikitext-2>), with context length of 512.
The time per token is measured on a MacBook M1 Pro 32GB RAM using 4 and 8 threads.
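In practice the measurement amounts to pointing the `perplexity` tool at a model and the raw wikitext-2 test file; the paths below are placeholders:

```bash
# Compute perplexity over the wikitext-2 test set with a 512-token context
./perplexity -m models/7B/ggml-model-q4_0.gguf -f wikitext-2-raw/wiki.test.raw -c 512
```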
The `grammars/` folder contains a handful of sample grammars. To write your own, check out the [GBNF Guide](./grammars/README.md).

-For authoring more complex JSON grammars, you can also check out https://grammar.intrinsiclabs.ai/, a browser app that lets you write TypeScript interfaces which it compiles to GBNF grammars that you can save for local use. Note that the app is built and maintained by members of the community, please file any issues or FRs on [its repo](http://github.com/intrinsiclabsai/gbnfgen) and not this one.
+For authoring more complex JSON grammars, you can also check out <https://grammar.intrinsiclabs.ai/>, a browser app that lets you write TypeScript interfaces which it compiles to GBNF grammars that you can save for local use. Note that the app is built and maintained by members of the community, please file any issues or FRs on [its repo](http://github.com/intrinsiclabsai/gbnfgen) and not this one.
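To constrain generation with one of those sample grammars, the grammar file is passed to the `main` example; the model path and prompt below are placeholders:

```bash
# Constrain the output to valid JSON using the bundled GBNF grammar
./main -m models/7B/ggml-model-q4_0.gguf --grammar-file grammars/json.gbnf -p "Describe a cat as a JSON object:"
```
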
If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
+
- LLaMA:
-- [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
-- [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
+- [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
+- [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
- GPT-3
-- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
+- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
- GPT-3.5 / InstructGPT / ChatGPT:
-- [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
-- [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
+- [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
+- [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)

### Android

#### Build on Android using Termux
+
[Termux](https://github.com/termux/termux-app#installation) is a method to execute `llama.cpp` on an Android device (no root required).
+
```
apt update && apt upgrade -y
apt install git make cmake
```

It's recommended to move your model inside the `~/` directory for best performance:
+
```
cd storage/downloads
mv model.gguf ~/
@@ -838,22 +864,25 @@ mv model.gguf ~/
[Get the code](https://github.com/ggerganov/llama.cpp#get-the-code) & [follow the Linux build instructions](https://github.com/ggerganov/llama.cpp#build) to build `llama.cpp`.

#### Building the Project using Android NDK
+
Obtain the [Android NDK](https://developer.android.com/ndk) and then build with CMake.

Execute the following commands on your computer to avoid downloading the NDK to your mobile. Alternatively, you can also do this in Termux:
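A minimal sketch of what that CMake invocation can look like, with the NDK path, ABI, and platform level as illustrative placeholders:

```bash
# Configure an arm64 Android build with the NDK's CMake toolchain file (all values are illustrative)
export NDK=/path/to/android-ndk
cmake -B build-android \
  -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23
cmake --build build-android --config Release -j 8
```
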
Install [termux](https://github.com/termux/termux-app#installation) on your device and run `termux-setup-storage` to get access to your SD card (if Android 11+ then run the command twice).

Finally, copy these built `llama` binaries and the model file to your device storage. Because the file permissions in the Android sdcard cannot be changed, you can copy the executable files to the `/data/data/com.termux/files/home/bin` path, and then execute the following commands in Termux to add executable permission:

(Assuming that you have pushed the built executable files to the /sdcard/llama.cpp/bin path using `adb push`.)
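Those commands boil down to a copy plus `chmod`; a sketch, assuming the paths described above:

```bash
# Copy the binaries out of the sdcard (where permissions cannot be changed) and mark them executable
cp -r /sdcard/llama.cpp/bin /data/data/com.termux/files/home/
chmod +x /data/data/com.termux/files/home/bin/*
```
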
-* Docker must be installed and running on your system.
-* Create a folder to store big models & intermediate files (ex. /llama/models)
+
+- Docker must be installed and running on your system.
+- Create a folder to store big models & intermediate files (ex. /llama/models)

#### Images
+
We have three Docker images available for this project:

1. `ghcr.io/ggerganov/llama.cpp:full`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization. (platforms: `linux/amd64`, `linux/arm64`)
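As a usage sketch for the `full` image, mounting the models folder created above; the model filename and prompt are placeholders, and the `--run` entrypoint argument should be checked against the image's own help output:

```bash
# Run inference with the all-in-one image, mounting the host models folder
docker run -v /llama/models:/models ghcr.io/ggerganov/llama.cpp:full \
  --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 256
```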