
Commit 4537839

ggerganov authored and arthw committed

readme : update the usage section with examples (ggml-org#10596)

* readme : update the usage section with examples
* readme : more examples

1 parent 12a744d · commit 4537839

File tree: 1 file changed, +202 −74 lines

README.md: 202 additions, 74 deletions
@@ -76,9 +76,9 @@ The `llama.cpp` project is the main playground for developing new features for t
 
 Typically finetunes of the base models below are supported as well.
 
-Instructions for adding support for new models: [HOWTO-add-model.md](./docs/development/HOWTO-add-model.md)
+Instructions for adding support for new models: [HOWTO-add-model.md](docs/development/HOWTO-add-model.md)
 
-**Text-only:**
+#### Text-only
 
 - [X] LLaMA 🦙
 - [x] LLaMA 2 🦙🦙
@@ -133,7 +133,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](./docs/deve
 - [x] [Bielik-11B-v2.3](https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238d9b526a072408a)
 - [x] [RWKV-6](https://github.com/BlinkDL/RWKV-LM)
 
-**Multimodal:**
+#### Multimodal
 
 - [x] [LLaVA 1.5 models](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e), [LLaVA 1.6 models](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2)
 - [x] [BakLLaVA](https://huggingface.co/models?search=SkunkworksAI/Bakllava)
@@ -247,27 +247,27 @@ Instructions for adding support for new models: [HOWTO-add-model.md](./docs/deve
 
 | Backend | Target devices |
 | --- | --- |
-| [Metal](./docs/build.md#metal-build) | Apple Silicon |
-| [BLAS](./docs/build.md#blas-build) | All |
-| [BLIS](./docs/backend/BLIS.md) | All |
-| [SYCL](./docs/backend/SYCL.md) | Intel and Nvidia GPU |
-| [MUSA](./docs/build.md#musa) | Moore Threads MTT GPU |
-| [CUDA](./docs/build.md#cuda) | Nvidia GPU |
-| [hipBLAS](./docs/build.md#hipblas) | AMD GPU |
-| [Vulkan](./docs/build.md#vulkan) | GPU |
-| [CANN](./docs/build.md#cann) | Ascend NPU |
-
-## Building and usage
+| [Metal](docs/build.md#metal-build) | Apple Silicon |
+| [BLAS](docs/build.md#blas-build) | All |
+| [BLIS](docs/backend/BLIS.md) | All |
+| [SYCL](docs/backend/SYCL.md) | Intel and Nvidia GPU |
+| [MUSA](docs/build.md#musa) | Moore Threads MTT GPU |
+| [CUDA](docs/build.md#cuda) | Nvidia GPU |
+| [hipBLAS](docs/build.md#hipblas) | AMD GPU |
+| [Vulkan](docs/build.md#vulkan) | GPU |
+| [CANN](docs/build.md#cann) | Ascend NPU |
+
+## Building the project
 
 The main product of this project is the `llama` library. Its C-style interface can be found in [include/llama.h](include/llama.h).
 The project also includes many example programs and tools using the `llama` library. The examples range from simple, minimal code snippets to sophisticated sub-projects such as an OpenAI-compatible HTTP server. Possible methods for obtaining the binaries:
 
-- Clone this repository and build locally, see [how to build](./docs/build.md)
-- On MacOS or Linux, install `llama.cpp` via [brew, flox or nix](./docs/install.md)
-- Use a Docker image, see [documentation for Docker](./docs/docker.md)
+- Clone this repository and build locally, see [how to build](docs/build.md) (a minimal flow is sketched below)
+- On MacOS or Linux, install `llama.cpp` via [brew, flox or nix](docs/install.md)
+- Use a Docker image, see [documentation for Docker](docs/docker.md)
 - Download pre-built binaries from [releases](https://github.com/ggerganov/llama.cpp/releases)
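For the clone-and-build route in the list above, a minimal sketch of the usual CMake flow (default CPU backend; backend-specific flags are covered in [docs/build.md](docs/build.md)):

```bash
# clone and build with CMake
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# the resulting binaries (llama-cli, llama-server, ...) land in build/bin
./build/bin/llama-cli --help
```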
 
-### Obtaining and quantizing models
+## Obtaining and quantizing models
 
 The [Hugging Face](https://huggingface.co) platform hosts a [number of LLMs](https://huggingface.co/models?library=gguf&sort=trending) compatible with `llama.cpp`:
 
@@ -285,79 +285,204 @@ The Hugging Face platform provides a variety of online tools for converting, qua
 - Use the [GGUF-editor space](https://huggingface.co/spaces/CISCai/gguf-editor) to edit GGUF meta data in the browser (more info: https://github.com/ggerganov/llama.cpp/discussions/9268)
 - Use the [Inference Endpoints](https://ui.endpoints.huggingface.co/) to directly host `llama.cpp` in the cloud (more info: https://github.com/ggerganov/llama.cpp/discussions/9669)
 
-To learn more about model quantization, [read this documentation](./examples/quantize/README.md)
+To learn more about model quantization, [read this documentation](examples/quantize/README.md)
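As a quick sketch of the quantization step itself (the file names here are placeholders; the supported quantization types are listed in [examples/quantize](examples/quantize/README.md)):

```bash
# quantize an FP16 GGUF model to Q4_K_M, a common size/quality trade-off
llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```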

-### Using the `llama-cli` tool
+## [`llama-cli`](examples/main)
 
-Run a basic text completion:
+#### A CLI tool for accessing and experimenting with most of `llama.cpp`'s functionality.
 
-```bash
-llama-cli -m your_model.gguf -p "I believe the meaning of life is" -n 128
+- <details open>
+    <summary>Run simple text completion</summary>
 
-# Output:
-# I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
-```
+    ```bash
+    llama-cli -m model.gguf -p "I believe the meaning of life is" -n 128
 
-See [this page](./examples/main/README.md) for a full list of parameters.
+    # I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
+    ```
 
-### Conversation mode
+    </details>
 
-Run `llama-cli` in conversation/chat mode by passing the `-cnv` parameter:
+- <details>
+    <summary>Run in conversation mode</summary>
 
-```bash
-llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv
+    ```bash
+    llama-cli -m model.gguf -p "You are a helpful assistant" -cnv
 
-# Output:
-# > hi, who are you?
-# Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
-#
-# > what is 1+1?
-# Easy peasy! The answer to 1+1 is... 2!
-```
+    # > hi, who are you?
+    # Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
+    #
+    # > what is 1+1?
+    # Easy peasy! The answer to 1+1 is... 2!
+    ```
 
-By default, the chat template will be taken from the input model. If you want to use another chat template, pass `--chat-template NAME` as a parameter. See the list of [supported templates](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template)
+    </details>
 
-```bash
-llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --chat-template chatml
-```
+- <details>
+    <summary>Run with custom chat template</summary>
 
-You can also use your own template via in-prefix, in-suffix and reverse-prompt parameters:
+    ```bash
+    # use the "chatml" template
+    llama-cli -m model.gguf -p "You are a helpful assistant" -cnv --chat-template chatml
 
-```bash
-llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
-```
+    # use a custom template
+    llama-cli -m model.gguf -p "You are a helpful assistant" -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
+    ```
 
-### Constrained output with grammars
+    [Supported templates](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template)
 
-`llama.cpp` can constrain the output of the model via custom grammars. For example, you can force the model to output only JSON:
+    </details>
 
-```bash
-llama-cli -m your_model.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'
-```
+- <details>
+    <summary>Constrain the output with a custom grammar</summary>
 
-The `grammars/` folder contains a handful of sample grammars. To write your own, check out the [GBNF Guide](./grammars/README.md).
+    ```bash
+    llama-cli -m model.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'
 
-For authoring more complex JSON grammars, check out https://grammar.intrinsiclabs.ai/
+    # {"appointmentTime": "8pm", "appointmentDetails": "schedule a a call"}
+    ```
 
+    The [grammars/](grammars/) folder contains a handful of sample grammars. To write your own, check out the [GBNF Guide](grammars/README.md).
+
+    For authoring more complex JSON grammars, check out https://grammar.intrinsiclabs.ai/
+
+    </details>
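To go beyond the bundled `grammars/json.gbnf`, a grammar can be authored in a few lines. A minimal, hypothetical sketch (GBNF syntax per the [GBNF Guide](grammars/README.md)):

```bash
# write a tiny grammar that only admits "yes" or "no", then apply it
cat > yesno.gbnf <<'EOF'
root ::= ("yes" | "no")
EOF

llama-cli -m model.gguf --grammar-file yesno.gbnf -p 'Is the sky blue? Answer:'
```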
 
-### Web server (`llama-server`)
-
-The [llama-server](./examples/server/README.md) is a lightweight [OpenAI API](https://github.com/openai/openai-openapi) compatible HTTP server that can be used to serve local models and easily connect them to existing clients.
-
-Example usage:
-
-```bash
-llama-server -m your_model.gguf --port 8080
-
-# Basic web UI can be accessed via browser: http://localhost:8080
-# Chat completion endpoint: http://localhost:8080/v1/chat/completions
-```
-
-### Perplexity (measuring model quality)
-
-Use the `llama-perplexity` tool to measure perplexity over a given prompt (lower perplexity is better).
-For more information, see [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity).
+## [`llama-server`](examples/server)
+
+#### A lightweight, [OpenAI API](https://github.com/openai/openai-openapi) compatible, HTTP server for serving LLMs.
+
+- <details open>
+    <summary>Start a local HTTP server with default configuration on port 8080</summary>
+
+    ```bash
+    llama-server -m model.gguf --port 8080
+
+    # Basic web UI can be accessed via browser: http://localhost:8080
+    # Chat completion endpoint: http://localhost:8080/v1/chat/completions
+    ```
+
+    </details>
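Because the server speaks the OpenAI API, the chat completion endpoint above can be exercised with any HTTP client, for example `curl`:

```bash
# send a chat request to the running server
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            { "role": "system", "content": "You are a helpful assistant." },
            { "role": "user",   "content": "Hello!" }
        ]
    }'
```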
+
+- <details>
+    <summary>Support multiple users and parallel decoding</summary>
+
+    ```bash
+    # up to 4 concurrent requests, each with 4096 max context
+    llama-server -m model.gguf -c 16384 -np 4
+    ```
+
+    </details>
+
+- <details>
+    <summary>Enable speculative decoding</summary>
+
+    ```bash
+    # the draft.gguf model should be a small variant of the target model.gguf
+    llama-server -m model.gguf -md draft.gguf
+    ```
+
+    </details>
+
+- <details>
+    <summary>Serve an embedding model</summary>
+
+    ```bash
+    # use the /embedding endpoint
+    llama-server -m model.gguf --embedding --pooling cls -ub 8192
+    ```
+
+    </details>
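A request against the `/embedding` endpoint mentioned above might then look like this (request shape per [examples/server](examples/server/README.md); treat it as a sketch):

```bash
# embed a single piece of text
curl http://localhost:8080/embedding \
    -H "Content-Type: application/json" \
    -d '{ "content": "Hello, world!" }'
```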
+
+- <details>
+    <summary>Serve a reranking model</summary>
+
+    ```bash
+    # use the /reranking endpoint
+    llama-server -m model.gguf --reranking
+    ```
+
+    </details>
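Similarly, a `/reranking` request pairs a query with candidate documents (field names per [examples/server](examples/server/README.md); a sketch):

```bash
# score candidate documents against a query
curl http://localhost:8080/reranking \
    -H "Content-Type: application/json" \
    -d '{
        "query": "What is the capital of France?",
        "documents": ["Paris is the capital of France.", "Berlin is the capital of Germany."]
    }'
```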
+
+- <details>
+    <summary>Constrain all outputs with a grammar</summary>
+
+    ```bash
+    # custom grammar
+    llama-server -m model.gguf --grammar-file grammar.gbnf
+
+    # JSON
+    llama-server -m model.gguf --grammar-file grammars/json.gbnf
+    ```
+
+    </details>
+
+
+## [`llama-perplexity`](examples/perplexity)
+
+#### A tool for measuring the perplexity [^1][^2] (and other quality metrics) of a model over a given text.
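For reference, the perplexity reported below is the exponentiated average negative log-likelihood of the text's N tokens (lower is better) [^2]:

```math
\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\left(x_i \mid x_{<i}\right)\right)
```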
+
+- <details open>
+    <summary>Measure the perplexity over a text file</summary>
+
+    ```bash
+    llama-perplexity -m model.gguf -f file.txt
+
+    # [1]15.2701,[2]5.4007,[3]5.3073,[4]6.2965,[5]5.8940,[6]5.6096,[7]5.7942,[8]4.9297, ...
+    # Final estimate: PPL = 5.4007 +/- 0.67339
+    ```
+
+    </details>
+
+- <details>
+    <summary>Measure KL divergence</summary>
+
+    ```bash
+    # TODO
+    ```
+
+    </details>
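As a rough sketch of the KL-divergence flow (assuming the `--kl-divergence-base` and `--kl-divergence` options described in [examples/perplexity](examples/perplexity/README.md); file names are placeholders):

```bash
# 1) save reference logits from the full-precision model
llama-perplexity -m model-f16.gguf -f file.txt --kl-divergence-base logits-f16.bin

# 2) score a quantized model against the saved logits
llama-perplexity -m model-q4_0.gguf --kl-divergence-base logits-f16.bin --kl-divergence
```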
+
+[^1]: [examples/perplexity/README.md](examples/perplexity/README.md)
+[^2]: [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity)
+
+## [`llama-bench`](examples/llama-bench)
+
+#### Benchmark the performance of the inference for various parameters.
+
+- <details open>
+    <summary>Run default benchmark</summary>
+
+    ```bash
+    llama-bench -m model.gguf
+
+    # Output:
+    # | model               |       size |     params | backend    | threads |          test |                  t/s |
+    # | ------------------- | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
+    # | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         pp512 |      5765.41 ± 20.55 |
+    # | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         tg128 |        197.71 ± 0.81 |
+    #
+    # build: 3e0ba0e60 (4229)
+    ```
+
+    </details>
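Beyond the default run, `llama-bench` accepts comma-separated value lists to sweep several configurations in one invocation (flag spellings per [examples/llama-bench](examples/llama-bench/README.md); a sketch):

```bash
# sweep prompt length, generation length and thread count
llama-bench -m model.gguf -p 512,1024 -n 128 -t 8,16
```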
+
+
+## [`llama-simple`](examples/simple)
+
+#### A minimal example for implementing apps with `llama.cpp`. Useful for developers.
+
+- <details>
+    <summary>Basic text completion</summary>
+
+    ```bash
+    llama-simple -m model.gguf
+
+    # Hello my name is Kaitlyn and I am a 16 year old girl. I am a junior in high school and I am currently taking a class called "The Art of
+    ```
+
+    </details>
 
-To learn more how to measure perplexity using llama.cpp, [read this documentation](./examples/perplexity/README.md)
 
 ## Contributing
 
@@ -372,19 +497,19 @@ To learn more how to measure perplexity using llama.cpp, [read this documentatio
 
 ## Other documentation
 
-- [main (cli)](./examples/main/README.md)
-- [server](./examples/server/README.md)
-- [GBNF grammars](./grammars/README.md)
+- [main (cli)](examples/main/README.md)
+- [server](examples/server/README.md)
+- [GBNF grammars](grammars/README.md)
 
-**Development documentation**
+#### Development documentation
 
-- [How to build](./docs/build.md)
-- [Running on Docker](./docs/docker.md)
-- [Build on Android](./docs/android.md)
-- [Performance troubleshooting](./docs/development/token_generation_performance_tips.md)
+- [How to build](docs/build.md)
+- [Running on Docker](docs/docker.md)
+- [Build on Android](docs/android.md)
+- [Performance troubleshooting](docs/development/token_generation_performance_tips.md)
 - [GGML tips & tricks](https://github.com/ggerganov/llama.cpp/wiki/GGML-Tips-&-Tricks)
 
-**Seminal papers and background on the models**
+#### Seminal papers and background on the models
 
 If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
 - LLaMA:
@@ -395,3 +520,6 @@ If your issue is with model generation quality, then please at least scan the fo
 - GPT-3.5 / InstructGPT / ChatGPT:
     - [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
     - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
+
+#### References
+