
Commit 9c4fabf

phymbert authored and ggerganov committed
server: docs - refresh and tease a little bit more the http server (ggml-org#5718)
* server: docs - refresh and tease a little bit more the http server

* Rephrase README.md server doc

Co-authored-by: Georgi Gerganov <[email protected]>

* Update examples/server/README.md

Co-authored-by: Georgi Gerganov <[email protected]>

* Update examples/server/README.md

Co-authored-by: Georgi Gerganov <[email protected]>

* Update README.md

---------

Co-authored-by: Georgi Gerganov <[email protected]>
1 parent b6c4e55 · commit 9c4fabf

2 files changed: 18 additions and 3 deletions


README.md

Lines changed: 3 additions & 0 deletions
```diff
@@ -114,6 +114,9 @@ Typically finetunes of the base models below are supported as well.
 - [x] [MobileVLM 1.7B/3B models](https://huggingface.co/models?search=mobileVLM)
 - [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL)
 
+**HTTP server**
+
+[llama.cpp web server](./examples/server) is a lightweight [OpenAI API](https://github.com/openai/openai-openapi) compatible HTTP server that can be used to serve local models and easily connect them to existing clients.
 
 **Bindings:**
 
```
examples/server/README.md

Lines changed: 15 additions & 3 deletions
```diff
@@ -1,8 +1,20 @@
-# llama.cpp/example/server
+# LLaMA.cpp HTTP Server
 
-This example demonstrates a simple HTTP API server and a simple web front end to interact with llama.cpp.
+Fast, lightweight, pure C/C++ HTTP server based on [httplib](https://github.com/yhirose/cpp-httplib), [nlohmann::json](https://github.com/nlohmann/json) and **llama.cpp**.
 
-Command line options:
+Set of LLM REST APIs and a simple web front end to interact with llama.cpp.
+
+**Features:**
+* LLM inference of F16 and quantum models on GPU and CPU
+* [OpenAI API](https://github.com/openai/openai-openapi) compatible chat completions and embeddings routes
+* Parallel decoding with multi-user support
+* Continuous batching
+* Multimodal (wip)
+* Monitoring endpoints
+
+The project is under active development, and we are [looking for feedback and contributors](https://github.com/ggerganov/llama.cpp/issues/4216).
+
+**Command line options:**
 
 - `--threads N`, `-t N`: Set the number of threads to use during generation.
 - `-tb N, --threads-batch N`: Set the number of threads to use during batch and prompt processing. If not specified, the number of threads will be set to the number of threads used for generation.
```
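To illustrate the REST APIs the rewritten README advertises, here is a small sketch (again not from the commit) that calls a monitoring route and the server's native completion route with plain `requests`. The endpoint paths and the default port 8080 are assumptions based on the server documentation of this era.

```python
# Minimal sketch: poll a monitoring endpoint, then request a completion.
# Assumes a llama.cpp server is running locally on port 8080.
import requests

BASE = "http://localhost:8080"

# Monitoring endpoint: reports whether the model has finished loading.
print(requests.get(f"{BASE}/health").json())

# Native completion endpoint: prompt in, generated text out.
resp = requests.post(
    f"{BASE}/completion",
    json={
        "prompt": "Building a website can be done in 10 simple steps:",
        "n_predict": 64,  # cap the number of generated tokens
    },
)
print(resp.json()["content"])
```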
