
Commit 14dc2e2: bug fixing (#925)
Parent: 7f524d9

File tree: 2 files changed (+86, -8 lines)


examples/low_level_api/low_level_api_llama_cpp.py

Lines changed: 25 additions & 8 deletions
```diff
@@ -11,20 +11,34 @@
 
 prompt = b"\n\n### Instruction:\nWhat is the capital of France?\n\n### Response:\n"
 
-lparams = llama_cpp.llama_context_default_params()
+lparams = llama_cpp.llama_model_default_params()
+cparams = llama_cpp.llama_context_default_params()
 model = llama_cpp.llama_load_model_from_file(MODEL_PATH.encode('utf-8'), lparams)
-ctx = llama_cpp.llama_new_context_with_model(model, lparams)
+ctx = llama_cpp.llama_new_context_with_model(model, cparams)
 
 # determine the required inference memory per token:
 tmp = [0, 1, 2, 3]
-llama_cpp.llama_eval(ctx, (llama_cpp.c_int * len(tmp))(*tmp), len(tmp), 0, N_THREADS)
+llama_cpp.llama_eval(
+    ctx = ctx,
+    tokens=(llama_cpp.c_int * len(tmp))(*tmp),
+    n_tokens=len(tmp),
+    n_past=0
+)# Deprecated
 
 n_past = 0
 
 prompt = b" " + prompt
 
 embd_inp = (llama_cpp.llama_token * (len(prompt) + 1))()
-n_of_tok = llama_cpp.llama_tokenize(ctx, prompt, embd_inp, len(embd_inp), True)
+n_of_tok = llama_cpp.llama_tokenize(
+    model=model,
+    text=bytes(str(prompt),'utf-8'),
+    text_len=len(embd_inp),
+    tokens=embd_inp,
+    n_max_tokens=len(embd_inp),
+    add_bos=False,
+    special=False
+)
 embd_inp = embd_inp[:n_of_tok]
 
 n_ctx = llama_cpp.llama_n_ctx(ctx)
@@ -49,8 +63,11 @@
 while remaining_tokens > 0:
     if len(embd) > 0:
         llama_cpp.llama_eval(
-            ctx, (llama_cpp.c_int * len(embd))(*embd), len(embd), n_past, N_THREADS
-        )
+            ctx = ctx,
+            tokens=(llama_cpp.c_int * len(embd))(*embd),
+            n_tokens=len(embd),
+            n_past=n_past
+        )# Deprecated
 
         n_past += len(embd)
         embd = []
@@ -93,7 +110,7 @@
         for id in embd:
             size = 32
             buffer = (ctypes.c_char * size)()
-            n = llama_cpp.llama_token_to_piece_with_model(
+            n = llama_cpp.llama_token_to_piece(
                 model, llama_cpp.llama_token(id), buffer, size)
             assert n <= size
             print(
@@ -109,4 +126,4 @@
 
 llama_cpp.llama_print_timings(ctx)
 
-llama_cpp.llama_free(ctx)
+llama_cpp.llama_free(ctx)
```
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
# Low-Level API for Llama_cpp

## Overview

This Python script, low_level_api_llama_cpp.py, demonstrates how to use the low-level API for interacting with the llama_cpp library. The script runs an inference loop that generates a completion for a given prompt using a .gguf model.
### Prerequisites

Before running the script, ensure that you have the following dependencies installed:

- Python 3.6 or higher
- llama_cpp: Python bindings for the llama.cpp C/C++ library, used to work with .gguf models
- NumPy: a fundamental package for scientific computing with Python
- multiprocessing: a Python standard-library module for parallel computing
### Usage

Install the dependencies (ctypes, os, and multiprocessing ship with the Python standard library and do not need to be installed separately):

```bash
python -m pip install llama-cpp-python numpy
```

Run the script:

```bash
python low_level_api_llama_cpp.py
```
## Code Structure

The script is organized as follows:
### 1. Initialization

Load the model from the specified path and create a context for model evaluation.
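A minimal sketch of this step, mirroring the calls shown in the diff above (the model path is a placeholder):

```python
import llama_cpp

MODEL_PATH = "./models/model.gguf"  # placeholder: point this at your .gguf file

# Load the model weights with default model parameters...
lparams = llama_cpp.llama_model_default_params()
model = llama_cpp.llama_load_model_from_file(MODEL_PATH.encode("utf-8"), lparams)

# ...then create an evaluation context on top of the loaded model.
cparams = llama_cpp.llama_context_default_params()
ctx = llama_cpp.llama_new_context_with_model(model, cparams)
```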
### 2. Tokenization

Tokenize the input prompt using the llama_tokenize function and prepare the input tokens for model evaluation.
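A sketch of this step, continuing from the initialization sketch above and using the keyword-argument form of llama_tokenize from the updated example; the buffer sizing and the byte-length argument here are illustrative simplifications:

```python
prompt = b"\n\n### Instruction:\nWhat is the capital of France?\n\n### Response:\n"

# Token buffer with one extra slot, as in the example script.
embd_inp = (llama_cpp.llama_token * (len(prompt) + 1))()

# Tokenize the prompt; the return value is the number of tokens written.
n_of_tok = llama_cpp.llama_tokenize(
    model=model,
    text=prompt,               # the prompt is already a bytes object
    text_len=len(prompt),      # byte length of the text
    tokens=embd_inp,
    n_max_tokens=len(embd_inp),
    add_bos=False,
    special=False,
)
embd_inp = embd_inp[:n_of_tok]
```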
### 3. Inference

Perform model evaluation to generate responses, sampling from the model's output with various strategies (top-k, top-p, temperature).
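A simplified sketch of the evaluation loop. It reuses the llama_eval call from the example (marked deprecated there) and replaces the script's top-k / top-p / temperature sampling with a plain greedy argmax for brevity; llama_get_logits and llama_n_vocab(model) are assumed to behave as in this version of the low-level bindings:

```python
n_past = 0
embd = list(embd_inp)   # tokens waiting to be evaluated
budget = 16             # small generation budget for the sketch

for _ in range(budget):
    # Feed the pending tokens to the model.
    llama_cpp.llama_eval(
        ctx=ctx,
        tokens=(llama_cpp.c_int * len(embd))(*embd),
        n_tokens=len(embd),
        n_past=n_past,
    )
    n_past += len(embd)

    # Greedy argmax over the logits of the last evaluated token; the real
    # script samples with top-k / top-p / temperature instead.
    logits = llama_cpp.llama_get_logits(ctx)
    n_vocab = llama_cpp.llama_n_vocab(model)
    next_id = max(range(n_vocab), key=lambda i: logits[i])
    embd = [next_id]
```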
### 4. Output

Print the generated tokens and the corresponding decoded text.
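A sketch of the decoding step, built around the llama_token_to_piece call that this commit switches to (the errors="ignore" flag is an added robustness choice, not part of the example):

```python
import ctypes

# Decode each generated token id into its text piece and print it.
for token_id in embd:
    size = 32
    buffer = (ctypes.c_char * size)()
    n = llama_cpp.llama_token_to_piece(
        model, llama_cpp.llama_token(token_id), buffer, size)
    assert n <= size
    print(buffer[:n].decode("utf-8", errors="ignore"), end="", flush=True)
```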
### 5. Cleanup

Free resources and print timing information.
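A sketch of the cleanup step; the example itself only frees the context, while llama_free_model is an additional call available in the bindings:

```python
# Print timing statistics for the run, then release native resources.
llama_cpp.llama_print_timings(ctx)
llama_cpp.llama_free(ctx)
llama_cpp.llama_free_model(model)  # not in the example, but releases the model too
```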
## Configuration

Customize the inference behavior by adjusting the following variables:

- N_THREADS: number of CPU threads to use for model evaluation.
- MODEL_PATH: path to the model file.
- prompt: input prompt for the chatbot.
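A sketch of how these variables might be set near the top of the script; deriving N_THREADS from multiprocessing.cpu_count() and the default model path are assumptions, only the prompt value is taken from the example:

```python
import multiprocessing
import os

# Number of CPU threads used for evaluation (assumption: all available cores).
N_THREADS = multiprocessing.cpu_count()

# Path to the .gguf model file (placeholder; override via the MODEL env var).
MODEL_PATH = os.environ.get("MODEL", "./models/model.gguf")

# Prompt fed to the model (taken from the example).
prompt = b"\n\n### Instruction:\nWhat is the capital of France?\n\n### Response:\n"
```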
## Notes

- Ensure that the llama_cpp library is built and available on the system. Follow the instructions in the llama_cpp repository for building and installing the library.
- This script is designed to work with .gguf models and may require modifications for compatibility with other model formats.
## Acknowledgments

This code is based on the llama_cpp library developed by the community. Special thanks to the contributors for their efforts.

## License

This project is licensed under the MIT License - see the LICENSE file for details.
