
Commit bcd6167

improve docs and example
1 parent 7cebe2e commit bcd6167

File tree

2 files changed: +109 / -227 lines changed


examples/server/README.md

Lines changed: 48 additions & 227 deletions
@@ -2,32 +2,45 @@
 
 This example allows you to have a llama.cpp HTTP server to interact with from a web page or to consume the API.
 
-## Table of Contents
+Command line options:
+
+- `--threads N`, `-t N`: use N threads.
+- `-m FNAME`, `--model FNAME`: Specify the path to the LLaMA model file (e.g., `models/7B/ggml-model.bin`).
+- `-c N`, `--ctx-size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference.
+- `-ngl N`, `--n-gpu-layers N`: When compiled with appropriate support (currently CLBlast or cuBLAS), this option allows offloading some layers to the GPU for computation. Generally results in increased performance.
+- `--embedding`: Enable embedding mode. **The completion function doesn't work in this mode.**
+- `--host`: Set the hostname or IP address to listen on. Default: `127.0.0.1`.
+- `--port`: Set the port to listen on. Default: `8080`.
 
-1. [Quick Start](#quick-start)
-2. [Node JS Test](#node-js-test)
-3. [API Endpoints](#api-endpoints)
-4. [More examples](#more-examples)
-5. [Common Options](#common-options)
-6. [Performance Tuning and Memory Options](#performance-tuning-and-memory-options)
 
 ## Quick Start
 
 To get started right away, run the following command, making sure to use the correct path for the model you have:
 
-#### Unix-based systems (Linux, macOS, etc.):
+### Unix-based systems (Linux, macOS, etc.):
 
 ```bash
-./server -m models/7B/ggml-model.bin --ctx_size 2048
+./server -m models/7B/ggml-model.bin -c 2048
 ```
 
-#### Windows:
+### Windows:
 
 ```powershell
-server.exe -m models\7B\ggml-model.bin --ctx_size 2048
+server.exe -m models\7B\ggml-model.bin -c 2048
 ```
 
-That will start a server that by default listens on `127.0.0.1:8080`. You can consume the endpoints with Postman or NodeJS with axios library.
+That will start a server that by default listens on `127.0.0.1:8080`.
+You can consume the endpoints with Postman or with NodeJS and the axios library.
+
+## Testing with CURL
+
+Using [curl](https://curl.se/). On Windows, `curl.exe` should be available in the base OS.
+
+```sh
+curl --request POST \
+    --url http://localhost:8080/completion \
+    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
+```
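The completion endpoint also accepts a `stream` option (documented under API Endpoints below). A minimal streaming sketch, assuming the server emits `data: `-prefixed chunks as consumed in [chat.mjs](chat.mjs); `--no-buffer` keeps curl from buffering the output:

```sh
curl --no-buffer --request POST \
    --url http://localhost:8080/completion \
    --data '{"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 128, "stream": true}'
```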
 
 ## Node JS Test
 
@@ -50,7 +63,6 @@ const prompt = `Building a website can be done in 10 simple steps:`;
 async function Test() {
     let result = await axios.post("http://127.0.0.1:8080/completion", {
         prompt,
-        batch_size: 128,
         n_predict: 512,
     });
 
@@ -69,244 +81,53 @@ node .
 
 ## API Endpoints
 
-You can interact with this API Endpoints. This implementations just support chat style interaction.
+You can interact with these API endpoints.
+This implementation only supports chat-style interaction.
 
 - **POST** `hostname:port/completion`: Setting up the Llama context to begin the completion tasks.
 
-*Options:*
-
-`batch_size`: Set the batch size for prompt processing (default: 512).
-
-`temperature`: Adjust the randomness of the generated text (default: 0.8).
-
-`top_k`: Limit the next token selection to the K most probable tokens (default: 40).
+*Options:*
 
-`top_p`: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P (default: 0.9).
+`temperature`: Adjust the randomness of the generated text (default: 0.8).
 
-`n_predict`: Set the number of tokens to predict when generating text (default: 128, -1 = infinity).
+`top_k`: Limit the next token selection to the K most probable tokens (default: 40).
 
-`threads`: Set the number of threads to use during computation.
+`top_p`: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P (default: 0.9).
 
-`n_keep`: Specify the number of tokens from the initial prompt to retain when the model resets its internal context. By default, this value is set to 0 (meaning no tokens are kept). Use `-1` to retain all tokens from the initial prompt.
+`n_predict`: Set the number of tokens to predict when generating text (default: 128, -1 = infinity).
 
-`as_loop`: It allows receiving each predicted token in real-time instead of waiting for the completion to finish. To enable this, set to `true`.
+`n_keep`: Specify the number of tokens from the initial prompt to retain when the model resets its internal context.
+By default, this value is set to 0 (meaning no tokens are kept). Use `-1` to retain all tokens from the initial prompt.
 
-`interactive`: It allows interacting with the completion, and the completion stops as soon as it encounters a `stop word`. To enable this, set to `true`.
+`stream`: Receive each predicted token in real time instead of waiting for the completion to finish. To enable this, set it to `true`.
 
-`prompt`: Provide a prompt. Internally, the prompt is compared, and it detects if a part has already been evaluated, and the remaining part will be evaluate.
+`prompt`: Provide a prompt. Internally, the prompt is compared with the previous one; any part that has already been evaluated is detected, and only the remaining part is evaluated.
 
-`stop`: Specify the words or characters that indicate a stop. These words will not be included in the completion, so make sure to add them to the prompt for the next iteration.
-
-`exclude`: Specify the words or characters you do not want to appear in the completion. These words will not be included in the completion, so make sure to add them to the prompt for the next iteration.
+`stop`: Specify the strings that indicate a stop.
+These strings will not be included in the completion, so make sure to add them to the prompt for the next iteration.
+Default: `[]`
 
 - **POST** `hostname:port/embedding`: Generate the embedding of a given text
 
-*Options:*
-
-`content`: Set the text to get generate the embedding.
+*Options:*
 
-`threads`: Set the number of threads to use during computation.
+`content`: Set the text to generate the embedding for.
 
-To use this endpoint, you need to start the server with the `--embedding` option added.
+To use this endpoint, you need to start the server with the `--embedding` option added.
 
 - **POST** `hostname:port/tokenize`: Tokenize a given text
 
-*Options:*
-
-`content`: Set the text to tokenize.
+*Options:*
 
-- **GET** `hostname:port/next-token`: Receive the next token predicted, execute this request in a loop. Make sure set `as_loop` as `true` in the completion request.
-
-*Options:*
-
-`stop`: Set `hostname:port/next-token?stop=true` to stop the token generation.
+`content`: Set the text to tokenize.
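For illustration, the endpoints above can be exercised with curl in the same way as the completion example earlier in this file. The request bodies below only use the documented options, and the embedding call assumes the server was started with `--embedding`:

```sh
# completion with several of the documented options set
curl --request POST \
    --url http://localhost:8080/completion \
    --data '{"prompt": "Building a website can be done in 10 simple steps:", "temperature": 0.8, "top_k": 40, "top_p": 0.9, "n_keep": -1, "n_predict": 128}'

# embedding (requires the server to be started with --embedding)
curl --request POST \
    --url http://localhost:8080/embedding \
    --data '{"content": "Hello"}'

# tokenize
curl --request POST \
    --url http://localhost:8080/tokenize \
    --data '{"content": "Hello"}'
```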
 
 ## More examples
 
 ### Interactive mode
 
-This mode allows interacting in a chat-like manner. It is recommended for models designed as assistants such as `Vicuna`, `WizardLM`, `Koala`, among others. Make sure to add the correct stop word for the corresponding model.
-
-The prompt should be generated by you, according to the model's guidelines. You should keep adding the model's completions to the context as well.
-
-This example works well for `Vicuna - version 1`.
-
-```javascript
-const axios = require("axios");
-
-let prompt = `A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
-### Human: Hello, Assistant.
-### Assistant: Hello. How may I help you today?
-### Human: Please tell me the largest city in Europe.
-### Assistant: Sure. The largest city in Europe is Moscow, the capital of Russia.`;
-
-async function ChatCompletion(answer) {
-    // the user's next question to the prompt
-    prompt += `\n### Human: ${answer}\n`
-
-    result = await axios.post("http://127.0.0.1:8080/completion", {
-        prompt,
-        batch_size: 128,
-        temperature: 0.2,
-        top_k: 40,
-        top_p: 0.9,
-        n_keep: -1,
-        n_predict: 2048,
-        stop: ["\n### Human:"], // when detect this, stop completion
-        exclude: ["### Assistant:"], // no show in the completion
-        threads: 8,
-        as_loop: true, // use this to request the completion token by token
-        interactive: true, // enable the detection of a stop word
-    });
-
-    // create a loop to receive every token predicted
-    // note: this operation is blocking, avoid use this in a ui thread
-
-    let message = "";
-    while (true) {
-        // you can stop the inference adding '?stop=true' like this http://127.0.0.1:8080/next-token?stop=true
-        result = await axios.get("http://127.0.0.1:8080/next-token");
-        process.stdout.write(result.data.content);
-        message += result.data.content;
-
-        // to avoid an infinite loop
-        if (result.data.stop) {
-            console.log("Completed");
-            // make sure to add the completion to the prompt.
-            prompt += `### Assistant: ${message}`;
-            break;
-        }
-    }
-}
-
-// This function should be called every time a question to the model is needed.
-async function Test() {
-    // the server can't inference in paralell
-    await ChatCompletion("Write a long story about a time magician in a fantasy world");
-    await ChatCompletion("Summary the story");
-}
-
-Test();
-```
-
-### Alpaca example
-
-**Temporaly note:** no tested, if you have the model, please test it and report me some issue
-
-```javascript
-const axios = require("axios");
-
-let prompt = `Below is an instruction that describes a task. Write a response that appropriately completes the request.
-`;
-
-async function DoInstruction(instruction) {
-    prompt += `\n\n### Instruction:\n\n${instruction}\n\n### Response:\n\n`;
-    result = await axios.post("http://127.0.0.1:8080/completion", {
-        prompt,
-        batch_size: 128,
-        temperature: 0.2,
-        top_k: 40,
-        top_p: 0.9,
-        n_keep: -1,
-        n_predict: 2048,
-        stop: ["### Instruction:\n\n"], // when detect this, stop completion
-        exclude: [], // no show in the completion
-        threads: 8,
-        as_loop: true, // use this to request the completion token by token
-        interactive: true, // enable the detection of a stop word
-    });
-
-    // create a loop to receive every token predicted
-    // note: this operation is blocking, avoid use this in a ui thread
-
-    let message = "";
-    while (true) {
-        result = await axios.get("http://127.0.0.1:8080/next-token");
-        process.stdout.write(result.data.content);
-        message += result.data.content;
-
-        // to avoid an infinite loop
-        if (result.data.stop) {
-            console.log("Completed");
-            // make sure to add the completion and the user's next question to the prompt.
-            prompt += message;
-            break;
-        }
-    }
-}
-
-// This function should be called every time a instruction to the model is needed.
-DoInstruction("Destroy the world"); // as joke
-```
-
-### Embeddings
-
-First, run the server with `--embedding` option:
-
-```bash
-server -m models/7B/ggml-model.bin --ctx_size 2048 --embedding
-```
-
-Run this code in NodeJS:
-
-```javascript
-const axios = require('axios');
-
-async function Test() {
-    let result = await axios.post("http://127.0.0.1:8080/embedding", {
-        content: `Hello`,
-        threads: 5
-    });
-    // print the embedding array
-    console.log(result.data.embedding);
-}
+Check the sample in [chat.mjs](chat.mjs).
+Run with node:
 
-Test();
+```sh
+node chat.mjs
 ```
-
-### Tokenize
-
-Run this code in NodeJS:
-
-```javascript
-const axios = require('axios');
-
-async function Test() {
-    let result = await axios.post("http://127.0.0.1:8080/tokenize", {
-        content: `Hello`
-    });
-    // print the embedding array
-    console.log(result.data.tokens);
-}
-
-Test();
-```
-
-## Common Options
-
-- `-m FNAME, --model FNAME`: Specify the path to the LLaMA model file (e.g., `models/7B/ggml-model.bin`).
-- `-c N, --ctx-size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference.
-- `-ngl N, --n-gpu-layers N`: When compiled with appropriate support (currently CLBlast or cuBLAS), this option allows offloading some layers to the GPU for computation. Generally results in increased performance.
-- `--embedding`: Enable the embedding mode. **Completion function doesn't work in this mode**.
-- `--host`: Set the hostname or ip address to listen. Default `127.0.0.1`;
-- `--port`: Set the port to listen. Default: `8080`.
-
-### RNG Seed
-
-- `-s SEED, --seed SEED`: Set the random number generator (RNG) seed (default: -1, < 0 = random seed).
-
-The RNG seed is used to initialize the random number generator that influences the text generation process. By setting a specific seed value, you can obtain consistent and reproducible results across multiple runs with the same input and settings. This can be helpful for testing, debugging, or comparing the effects of different options on the generated text to see when they diverge. If the seed is set to a value less than 0, a random seed will be used, which will result in different outputs on each run.
-
-## Performance Tuning and Memory Options
-
-### No Memory Mapping
-
-- `--no-mmap`: Do not memory-map the model. By default, models are mapped into memory, which allows the system to load only the necessary parts of the model as needed. However, if the model is larger than your total amount of RAM or if your system is low on available memory, using mmap might increase the risk of pageouts, negatively impacting performance.
-
-### Memory Float 32
-
-- `--memory-f32`: Use 32-bit floats instead of 16-bit floats for memory key+value. This doubles the context memory requirement but does not appear to increase generation quality in a measurable way. Not recommended.
-
-## Limitations:
-
-- The actual implementation of llama.cpp need a `llama-state` for handle multiple contexts and clients, but this could require more powerful hardware.
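For reference, a server invocation that combines several of the command line options documented at the top of this file might look like the following; the thread count, GPU layer count and model path here are placeholders, not recommendations:

```sh
./server -t 8 -ngl 32 -m models/7B/ggml-model.bin -c 2048 --host 0.0.0.0 --port 8080
```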

examples/server/chat.mjs

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
+import * as readline from 'node:readline/promises';
+import { stdin as input, stdout as output } from 'node:process';
+
+const chat = [
+    { human: "Hello, Assistant.",
+      assistant: "Hello. How may I help you today?" },
+    { human: "Please tell me the largest city in Europe.",
+      assistant: "Sure. The largest city in Europe is Moscow, the capital of Russia." },
+]
+
+function format_prompt(question) {
+    return "A chat between a curious human and an artificial intelligence assistant. "
+        + "The assistant gives helpful, detailed, and polite answers to the human's questions.\n"
+        + chat.map(m => `### Human: ${m.human}\n### Assistant: ${m.assistant}`).join("\n")
+        + `\n### Human: ${question}\n### Assistant:`
+}
+
+async function ChatCompletion(question) {
+    const result = await fetch("http://127.0.0.1:8080/completion", {
+        method: 'POST',
+        body: JSON.stringify({
+            prompt: format_prompt(question),
+            temperature: 0.2,
+            top_k: 40,
+            top_p: 0.9,
+            n_keep: 29,
+            n_predict: 256,
+            stop: ["\n### Human:"], // when detect this, stop completion
+            stream: true,
+        })
+    })
+
+    if (!result.ok) {
+        return;
+    }
+
+    let answer = ''
+
+    for await (var chunk of result.body) {
+        const t = Buffer.from(chunk).toString('utf8')
+        if (t.startsWith('data: ')) {
+            const message = JSON.parse(t.substring(6))
+            answer += message.content
+            process.stdout.write(message.content)
+            if (message.stop) break;
+        }
+    }
+
+    process.stdout.write('\n')
+    chat.push({ human: question, assistant: answer })
+}
+
+const rl = readline.createInterface({ input, output });
+
+while(true) {
+
+    const question = await rl.question('> ')
+    await ChatCompletion(question);
+
+}
+
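To try the sample end to end, the server from the Quick Start section above has to be running first. Note that the script relies on Node's built-in `fetch` and top-level `await`, so a reasonably recent Node.js (version 18 or later) is assumed:

```sh
# terminal 1: start the server
./server -m models/7B/ggml-model.bin -c 2048

# terminal 2: run the chat client
node chat.mjs
```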
