
Commit 447a3d2

Merge branch 'main' into setup

2 parents bebe771 + 030fafe

22 files changed: +1091 additions, -250 deletions

.gitmodules

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 [submodule "vendor/llama.cpp"]
     path = vendor/llama.cpp
-    url = git@github.com:ggerganov/llama.cpp.git
+    url = https://github.com/ggerganov/llama.cpp.git

CHANGELOG.md

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [Unreleased]
+
+### Added
+
+- Added first version of the changelog

CMakeLists.txt

Lines changed: 1 addition & 1 deletion
@@ -28,4 +28,4 @@ else()
     LIBRARY DESTINATION llama_cpp
     RUNTIME DESTINATION llama_cpp
 )
-endif(UNIX)
+endif()

README.md

Lines changed: 19 additions & 5 deletions
@@ -15,6 +15,8 @@ This package provides:
 - OpenAI-like API
 - LangChain compatibility
 
+Documentation is available at [https://abetlen.github.io/llama-cpp-python](https://abetlen.github.io/llama-cpp-python).
+
 ## Installation from PyPI (recommended)
 
 Install from PyPI (requires a c compiler):
@@ -26,6 +28,18 @@ pip install llama-cpp-python
 The above command will attempt to install the package and build build `llama.cpp` from source.
 This is the recommended installation method as it ensures that `llama.cpp` is built with the available optimizations for your system.
 
+If you have previously installed `llama-cpp-python` through pip and want to upgrade your version or rebuild the package with different compiler options, please add the following flags to ensure that the package is rebuilt correctly:
+
+```bash
+pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
+```
+
+Note: If you are using Apple Silicon (M1) Mac, make sure you have installed a version of Python that supports arm64 architecture. For example:
+```
+wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
+bash Miniforge3-MacOSX-arm64.sh
+```
+Otherwise, while installing it will build the llama.ccp x86 version which will be 10x slower on Apple Silicon (M1) Mac.
 
 ### Installation with OpenBLAS / cuBLAS / CLBlast
 
@@ -35,19 +49,19 @@ Use the `FORCE_CMAKE=1` environment variable to force the use of `cmake` and ins
 To install with OpenBLAS, set the `LLAMA_OPENBLAS=1` environment variable before installing:
 
 ```bash
-LLAMA_OPENBLAS=1 FORCE_CMAKE=1 pip install llama-cpp-python
+CMAKE_ARGS="-DLLAMA_OPENBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python
 ```
 
 To install with cuBLAS, set the `LLAMA_CUBLAS=1` environment variable before installing:
 
 ```bash
-LLAMA_CUBLAS=1 FORCE_CMAKE=1 pip install llama-cpp-python
+CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python
 ```
 
 To install with CLBlast, set the `LLAMA_CLBLAST=1` environment variable before installing:
 
 ```bash
-LLAMA_CLBLAST=1 FORCE_CMAKE=1 pip install llama-cpp-python
+CMAKE_ARGS="-DLLAMA_CLBLAST=on" FORCE_CMAKE=1 pip install llama-cpp-python
 ```
 
 
@@ -102,7 +116,7 @@ Navigate to [http://localhost:8000/docs](http://localhost:8000/docs) to see the
 A Docker image is available on [GHCR](https://ghcr.io/abetlen/llama-cpp-python). To run the server:
 
 ```bash
-docker run --rm -it -p8000:8000 -v /path/to/models:/models -eMODEL=/models/ggml-model-name.bin ghcr.io/abetlen/llama-cpp-python:latest
+docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/ggml-model-name.bin ghcr.io/abetlen/llama-cpp-python:latest
 ```
 
 ## Low-level API
@@ -120,7 +134,7 @@ Below is a short example demonstrating how to use the low-level API to tokenize
 >>> ctx = llama_cpp.llama_init_from_file(b"./models/7b/ggml-model.bin", params)
 >>> max_tokens = params.n_ctx
 # use ctypes arrays for array params
->>> tokens = (llama_cppp.llama_token * int(max_tokens))()
+>>> tokens = (llama_cpp.llama_token * int(max_tokens))()
 >>> n_tokens = llama_cpp.llama_tokenize(ctx, b"Q: Name the planets in the solar system? A: ", tokens, max_tokens, add_bos=llama_cpp.c_bool(True))
 >>> llama_cpp.llama_free(ctx)
 ```
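As an aside (not part of this commit), the snippet below sketches how the ctypes token array filled by `llama_tokenize` in the README example above might be inspected after the typo fix. It assumes `params` comes from `llama_cpp.llama_context_default_params()` as in the full README example; everything else reuses names already shown in the diff.

```python
import llama_cpp

# Assumed setup, mirroring the README's low-level example.
params = llama_cpp.llama_context_default_params()
ctx = llama_cpp.llama_init_from_file(b"./models/7b/ggml-model.bin", params)
max_tokens = params.n_ctx

# ctypes array for the array parameter, exactly as in the example above.
tokens = (llama_cpp.llama_token * int(max_tokens))()
n_tokens = llama_cpp.llama_tokenize(
    ctx,
    b"Q: Name the planets in the solar system? A: ",
    tokens,
    max_tokens,
    add_bos=llama_cpp.c_bool(True),
)

# The first n_tokens entries hold the token ids; slicing a ctypes array
# returns a plain Python list.
print(list(tokens[:n_tokens]))

llama_cpp.llama_free(ctx)
```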

docker/Dockerfile

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
+# Define the image argument and provide a default value
+ARG IMAGE=python:3-slim-bullseye
+
+# Use the image as specified
+FROM ${IMAGE}
+
+# Re-declare the ARG after FROM
+ARG IMAGE
+
+# Update and upgrade the existing packages
+RUN apt-get update && apt-get upgrade -y && apt-get install -y --no-install-recommends \
+    python3 \
+    python3-pip \
+    ninja-build \
+    build-essential
+
+RUN python3 -m pip install --upgrade pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette
+
+# Perform the conditional installations based on the image
+RUN echo "Image: ${IMAGE}" && \
+    if [ "${IMAGE}" = "python:3-slim-bullseye" ] ; then \
+    echo "OpenBLAS install:" && \
+    apt-get install -y --no-install-recommends libopenblas-dev && \
+    LLAMA_OPENBLAS=1 pip install llama-cpp-python --verbose; \
+else \
+    echo "CuBLAS install:" && \
+    LLAMA_CUBLAS=1 pip install llama-cpp-python --verbose; \
+fi
+
+# Clean up apt cache
+RUN rm -rf /var/lib/apt/lists/*
+
+# Set a working directory for better clarity
+WORKDIR /app
+
+# Copy files to the app directory
+RUN echo "Installing model...this can take some time..."
+COPY ./model.bin /app/model.bin
+COPY ./start_server.sh /app/start_server.sh
+
+# Make the server start script executable
+RUN chmod +x /app/start_server.sh
+
+# Set environment variable for the host
+ENV HOST=0.0.0.0
+
+# Expose a port for the server
+EXPOSE 8000
+
+# Run the server start script
+CMD ["/bin/sh", "/app/start_server.sh"]

Dockerfile.cuda renamed to docker/Dockerfile.cuda_simple

Lines changed: 3 additions & 2 deletions
@@ -1,4 +1,5 @@
-FROM nvidia/cuda:12.1.1-devel-ubuntu20.04
+ARG CUDA_IMAGE="12.1.1-devel-ubuntu22.04"
+FROM ${CUDA_IMAGE}
 
 # We need to set the host to 0.0.0.0 to allow outside access
 ENV HOST 0.0.0.0
@@ -12,4 +13,4 @@ RUN python3 -m pip install --upgrade pip pytest cmake scikit-build setuptools fa
 RUN LLAMA_CUBLAS=1 python3 setup.py develop
 
 # Run the server
-CMD python3 -m llama_cpp.server
+CMD python3 -m llama_cpp.server
File renamed without changes.

docker/README.md

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+# Dockerfiles for building the llama-cpp-python server
+- `Dockerfile.openblas_simple` - a simple Dockerfile for non-GPU OpenBLAS
+- `Dockerfile.cuda_simple` - a simple Dockerfile for CUDA accelerated CuBLAS
+- `hug_model.py` - a Python utility for interactively choosing and downloading the latest `5_1` quantized models from [huggingface.co/TheBloke]( https://huggingface.co/TheBloke)
+- `Dockerfile` - a single OpenBLAS and CuBLAS combined Dockerfile that automatically installs a previously downloaded model `model.bin`
+
+# Get model from Hugging Face
+`python3 ./hug_model.py`
+
+You should now have a model in the current directory and `model.bin` symlinked to it for the subsequent Docker build and copy step. e.g.
+```
+docker $ ls -lh *.bin
+-rw-rw-r-- 1 user user 4.8G May 23 18:30 <downloaded-model-file>.q5_1.bin
+lrwxrwxrwx 1 user user   24 May 23 18:30 model.bin -> <downloaded-model-file>.q5_1.bin
+```
+**Note #1:** Make sure you have enough disk space to download the model. As the model is then copied into the image you will need at least
+**TWICE** as much disk space as the size of the model:
+
+| Model | Quantized size |
+|------:|----------------:|
+|    7B |           5 GB |
+|   13B |          10 GB |
+|   30B |          25 GB |
+|   65B |          50 GB |
+
+**Note #2:** If you want to pass or tune additional parameters, customise `./start_server.sh` before running `docker build ...`
+
+# Install Docker Server
+
+**Note #3:** This was tested with Docker running on Linux. If you can get it working on Windows or MacOS, please update this `README.md` with a PR!
+
+[Install Docker Engine](https://docs.docker.com/engine/install)
+
+# Use OpenBLAS
+Use if you don't have a NVidia GPU. Defaults to `python:3-slim-bullseye` Docker base image and OpenBLAS:
+## Build:
+`docker build --build-arg -t openblas .`
+## Run:
+`docker run --cap-add SYS_RESOURCE -t openblas`
+
+# Use CuBLAS
+Requires a NVidia GPU with sufficient VRAM (approximately as much as the size above) and Docker NVidia support (see [container-toolkit/install-guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html))
+## Build:
+`docker build --build-arg IMAGE=nvidia/cuda:12.1.1-devel-ubuntu22.04 -t cublas .`
+## Run:
+`docker run --cap-add SYS_RESOURCE -t cublas`
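As a quick smoke test (not covered by this README), once one of the containers above is running and its port 8000 is published to the host (e.g. by adding `-p 8000:8000` to the `docker run` commands), something like the sketch below could exercise the server from Python. The `/v1/completions` route and the response shape follow the OpenAI-style API the main README mentions, so treat them as assumptions here rather than documented behaviour.

```python
import requests

# Assumes the llama-cpp-python server container is reachable on localhost:8000
# and exposes an OpenAI-compatible completions route.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "prompt": "Q: Name the planets in the solar system? A: ",
        "max_tokens": 48,
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```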

docker/hug_model.py

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
+import requests
+import json
+import os
+import struct
+
+def make_request(url, params=None):
+    print(f"Making request to {url}...")
+    response = requests.get(url, params=params)
+    if response.status_code == 200:
+        return json.loads(response.text)
+    else:
+        print(f"Request failed with status code {response.status_code}")
+        return None
+
+def check_magic_and_version(filename):
+    with open(filename, 'rb') as f:
+        # Read the first 6 bytes from the file
+        data = f.read(6)
+
+        # Unpack the binary data, interpreting the first 4 bytes as a little-endian unsigned int
+        # and the next 2 bytes as a little-endian unsigned short
+        magic, version = struct.unpack('<I H', data)
+
+        print(f"magic: 0x{magic:08x}, version: 0x{version:04x}, file: {filename}")
+
+        return magic, version
+
+def download_file(url, destination):
+    print(f"Downloading {url} to {destination}...")
+    response = requests.get(url, stream=True)
+    if response.status_code == 200:
+        with open(destination, 'wb') as f:
+            total_downloaded = 0
+            for chunk in response.iter_content(chunk_size=1024):
+                if chunk:  # filter out keep-alive new chunks
+                    f.write(chunk)
+                    total_downloaded += len(chunk)
+                    if total_downloaded >= 10485760:  # 10 MB
+                        print('.', end='', flush=True)
+                        total_downloaded = 0
+        print("\nDownload complete.")
+
+        # Creating a symbolic link from destination to "model.bin"
+        if os.path.isfile("model.bin"):
+            os.remove("model.bin")  # remove the existing link if any
+        os.symlink(destination, "model.bin")
+    else:
+        print(f"Download failed with status code {response.status_code}")
+
+def get_user_choice(model_list):
+    # Print the enumerated list
+    print("\n")
+    for i, (model_id, rfilename) in enumerate(model_list):
+        print(f"{i+1}: Model ID: {model_id}, RFilename: {rfilename}")
+
+    # Get user's choice
+    choice = input("Choose a model to download by entering the corresponding number: ")
+    try:
+        index = int(choice) - 1
+        if 0 <= index < len(model_list):
+            # Return the chosen model
+            return model_list[index]
+        else:
+            print("Invalid choice.")
+    except ValueError:
+        print("Invalid input. Please enter a number corresponding to a model.")
+    except IndexError:
+        print("Invalid choice. Index out of range.")
+
+    return None
+
+import argparse
+
+def main():
+    # Create an argument parser
+    parser = argparse.ArgumentParser(description='Process the model version.')
+    parser.add_argument('-v', '--version', type=int, default=0x0003,
+                        help='an integer for the version to be used')
+
+    # Parse the arguments
+    args = parser.parse_args()
+
+    # Define the parameters
+    params = {
+        "author": "TheBloke",  # Filter by author
+        "tags": "llama"
+    }
+
+    models = make_request('https://huggingface.co/api/models', params=params)
+    if models is None:
+        return
+
+    model_list = []
+    # Iterate over the models
+    for model in models:
+        model_id = model['id']
+        model_info = make_request(f'https://huggingface.co/api/models/{model_id}')
+        if model_info is None:
+            continue
+
+        for sibling in model_info.get('siblings', []):
+            rfilename = sibling.get('rfilename')
+            if rfilename and 'q5_1' in rfilename:
+                model_list.append((model_id, rfilename))
+
+    model_choice = get_user_choice(model_list)
+    if model_choice is not None:
+        model_id, rfilename = model_choice
+        url = f"https://huggingface.co/{model_id}/resolve/main/{rfilename}"
+        download_file(url, rfilename)
+        _, version = check_magic_and_version(rfilename)
+        if version != args.version:
+            print(f"Warning: Expected version {args.version}, but found different version in the file.")
+
+if __name__ == '__main__':
+    main()
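For readers unfamiliar with the `'<I H'` format string used by `check_magic_and_version` above, here is a small, self-contained sketch of the same unpacking step run against an in-memory header. The byte values are invented for illustration and are not a real GGML magic or version.

```python
import struct

# Build a fake 6-byte header: a little-endian uint32 "magic" followed by a
# little-endian uint16 "version" (values are arbitrary, illustration only).
fake_header = struct.pack('<I H', 0x12345678, 0x0003)

# Mirrors check_magic_and_version(): take 6 bytes, unpack uint32 + uint16.
magic, version = struct.unpack('<I H', fake_header)
print(f"magic: 0x{magic:08x}, version: 0x{version:04x}")
# -> magic: 0x12345678, version: 0x0003
```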

docker/start_server.sh

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+#!/bin/sh
+
+# For mmap support
+ulimit -l unlimited
+
+if [ "$IMAGE" = "python:3-slim-bullseye" ]; then
+    python3 -B -m llama_cpp.server --model /app/model.bin
+else
+    # You may have to reduce --n_gpu_layers=1000 to 20 or less if you don't have enough VRAM
+    python3 -B -m llama_cpp.server --model /app/model.bin --n_gpu_layers=1000
+fi

docs/index.md

Lines changed: 4 additions & 0 deletions
@@ -112,8 +112,12 @@ python3 setup.py develop
         show_root_heading: true
 
 ::: llama_cpp.LlamaCache
+    options:
+        show_root_heading: true
 
 ::: llama_cpp.LlamaState
+    options:
+        show_root_heading: true
 
 ::: llama_cpp.llama_cpp
     options:

examples/low_level_api/low_level_api_chat_cpp.py

Lines changed: 11 additions & 8 deletions
@@ -368,10 +368,10 @@ def generate(self):
                     id = llama_cpp.llama_sample_token_mirostat_v2(self.ctx, candidates_p, llama_cpp.c_float(self.params.mirostat_tau), llama_cpp.c_float(self.params.mirostat_eta), llama_cpp.c_float(mirostat_mu))
                 else:
                     # Temperature sampling
-                    llama_cpp.llama_sample_top_k(self.ctx, candidates_p, top_k)
-                    llama_cpp.llama_sample_tail_free(self.ctx, candidates_p, llama_cpp.c_float(self.params.tfs_z))
-                    llama_cpp.llama_sample_typical(self.ctx, candidates_p, llama_cpp.c_float(self.params.typical_p))
-                    llama_cpp.llama_sample_top_p(self.ctx, candidates_p, llama_cpp.c_float(self.params.top_p))
+                    llama_cpp.llama_sample_top_k(self.ctx, candidates_p, top_k, min_keep=llama_cpp.c_size_t(1))
+                    llama_cpp.llama_sample_tail_free(self.ctx, candidates_p, llama_cpp.c_float(self.params.tfs_z), min_keep=llama_cpp.c_size_t(1))
+                    llama_cpp.llama_sample_typical(self.ctx, candidates_p, llama_cpp.c_float(self.params.typical_p), min_keep=llama_cpp.c_size_t(1))
+                    llama_cpp.llama_sample_top_p(self.ctx, candidates_p, llama_cpp.c_float(self.params.top_p), min_keep=llama_cpp.c_size_t(1))
                     llama_cpp.llama_sample_temperature(self.ctx, candidates_p, llama_cpp.c_float(self.params.temp))
                     id = llama_cpp.llama_sample_token(self.ctx, candidates_p)
             # print("`{}`".format(candidates_p.size))
@@ -382,12 +382,15 @@ def generate(self):
             # replace end of text token with newline token when in interactive mode
             if (id == llama_cpp.llama_token_eos() and self.params.interactive and not self.params.instruct):
                 id = self.llama_token_newline[0]
+                self.embd.append(id)
                 if (self.use_antiprompt()):
                     # tokenize and inject first reverse prompt
                     self.embd_inp += self.first_antiprompt[0]
-
-            # add it to the context
-            self.embd.append(id)
+                    for id in self.first_antiprompt[0]:
+                        self.embd.append(id)
+            else:
+                # add it to the context
+                self.embd.append(id)
 
             # echo this to console
             self.output_echo = True
@@ -493,7 +496,7 @@ def output(self):
                 # Contains multi-byte UTF8
                 for num, pattern in [(2, 192), (3, 224), (4, 240)]:
                     # Bitwise AND check
-                    if pattern & int.from_bytes(cur_char) == pattern:
+                    if pattern & int.from_bytes(cur_char, 'little') == pattern:
                         self.multibyte_fix = [cur_char] + ([None] * (num-1))
 
                 # Stop incomplete bytes from passing
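To make the `int.from_bytes(cur_char, 'little')` change above easier to follow, here is a standalone sketch (not part of the commit) of the same lead-byte mask test. It assumes `cur_char` is a single byte, as in the surrounding code; the helper name is made up for illustration.

```python
def expected_utf8_length(lead: bytes) -> int:
    """Return the sequence length implied by a UTF-8 lead byte (1 for ASCII)."""
    # The explicit byteorder is what the fix above adds; for a single byte the
    # result is the same either way, but older Python versions require it.
    value = int.from_bytes(lead, 'little')
    length = 1
    # 110xxxxx -> 2 bytes, 1110xxxx -> 3 bytes, 11110xxx -> 4 bytes; as in the
    # loop above, a later (more specific) pattern overrides an earlier match.
    for num, pattern in [(2, 192), (3, 224), (4, 240)]:
        if pattern & value == pattern:
            length = num
    return length

print(expected_utf8_length(b"\xe2"))  # 3 -- lead byte of a 3-byte sequence such as U+2026
print(expected_utf8_length(b"A"))     # 1 -- plain ASCII byte
```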
