## Overview

`llama-rpc-server` allows running a `ggml` backend on a remote host.
The RPC backend communicates with one or several instances of `llama-rpc-server` and offloads computations to them.
This can be used for distributed LLM inference with `llama.cpp` in the following way:

```mermaid
flowchart TD
rpcb---|TCP|srva
rpcb---|TCP|srvb
rpcb-.-|TCP|srvn
subgraph hostn[Host N]
srvn[llama-rpc-server]-.-backend3["Backend (CUDA,Metal,etc.)"]
end
subgraph hostb[Host B]
srvb[llama-rpc-server]---backend2["Backend (CUDA,Metal,etc.)"]
end
subgraph hosta[Host A]
srva[llama-rpc-server]---backend["Backend (CUDA,Metal,etc.)"]
end
subgraph host[Main Host]
ggml[llama.cpp]---rpcb[RPC backend]
end
```

Each host can run a different backend, e.g. one with CUDA and another with Metal.
You can also run multiple `llama-rpc-server` instances on the same host, each with a different backend.
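For instance, once you have built both a CUDA-enabled and a CPU-only `llama-rpc-server` (see the Usage section below for the build steps), a rough sketch of serving both from one machine could look like the following; the build directory names and the second port are illustrative assumptions:

```bash
# CUDA-backed instance on port 50052 (build-rpc-cuda as in the Usage section)
build-rpc-cuda/bin/llama-rpc-server -p 50052 &

# CPU-only instance on port 50053 (build-rpc-cpu and the port are assumptions)
build-rpc-cpu/bin/llama-rpc-server -p 50053 &
```

Each instance is then addressed by its own `host:port` pair from the main host.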
## Usage

On each host, build the corresponding backend with `cmake` and add `-DLLAMA_RPC=ON` to the build options.
For example, to build the CUDA backend with RPC support:

```bash
cmake -B build-rpc-cuda -DLLAMA_CUDA=ON -DLLAMA_RPC=ON
cmake --build build-rpc-cuda --config Release
```
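As a sketch of the same recipe for another backend, a Metal build with RPC support would presumably just swap the backend option; `-DLLAMA_METAL=ON` and the `build-rpc-metal` directory name are assumptions, not taken from this document:

```bash
# Hypothetical Metal build with RPC support (flag name assumed, see note above)
cmake -B build-rpc-metal -DLLAMA_METAL=ON -DLLAMA_RPC=ON
cmake --build build-rpc-metal --config Release
```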

Then, start `llama-rpc-server` with the backend:

```bash
$ bin/llama-rpc-server -p 50052
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
...
Starting RPC server on 0.0.0.0:50052
```

When using the CUDA backend, you can specify the device with the `CUDA_VISIBLE_DEVICES` environment variable, e.g.:
```bash
$ CUDA_VISIBLE_DEVICES=0 bin/llama-rpc-server -p 50052
```
This way you can run multiple `llama-rpc-server` instances on the same host, each with a different CUDA device.
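For example, two instances pinned to different GPUs could be started as below; the device indices and the second port are chosen purely for illustration:

```bash
# Instance bound to GPU 0, listening on port 50052
CUDA_VISIBLE_DEVICES=0 bin/llama-rpc-server -p 50052 &

# Instance bound to GPU 1, listening on port 50053 (port chosen for illustration)
CUDA_VISIBLE_DEVICES=1 bin/llama-rpc-server -p 50053 &
```

Both instances can then be listed in the `--rpc` option described below.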

On the main host, build `llama.cpp` with only `-DLLAMA_RPC=ON`:
```bash
cmake -B build-rpc -DLLAMA_RPC=ON
cmake --build build-rpc --config Release -j
```

Finally, use the `--rpc` option to specify the host and port of each `llama-rpc-server`:
```bash
$ bin/llama-cli -m ../models/tinyllama-1b/ggml-model-f16.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 --rpc 192.168.88.10:50052,192.168.88.11:50052 -ngl 99
```