> [!NOTE]
> We recommend trying out the [LLM Deployment Examples](./README.md) before benchmarking.

## Prerequisites

> [!Important]
> At least one 8xH100-80GB node is required for the following instructions.

1. Build benchmarking image

   ```bash
   ./container/build.sh
   ```

2. Download model

   ```bash
   huggingface-cli download neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
   ```
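
   If you want to confirm the model landed in the local Hugging Face cache (a quick sanity check, not a required step), you can scan the cache:

   ```bash
   # List cached repos; the downloaded checkpoint should appear in the output.
   huggingface-cli scan-cache | grep DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
   ```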

3. Start NATS and ETCD

   ```bash
   docker compose -f deploy/docker_compose.yml up -d
   ```
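
   To verify both services came up, you can check the compose status and probe ETCD's health endpoint. Port 2379 is the standard ETCD client port and is an assumption here; adjust it if the compose file maps ports differently:

   ```bash
   # Show container status for the services started by the compose file.
   docker compose -f deploy/docker_compose.yml ps

   # ETCD exposes an HTTP health endpoint; a healthy node answers {"health":"true"}.
   curl -s http://localhost:2379/health
   ```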

> [!NOTE]
> This guide was tested on node(s) with the following hardware configuration:
>
> * **GPUs**:
>   8xH100-80GB-HBM3 (GPU Memory Bandwidth 3.2 TB/s)
>
> * **CPU**:
>   2 x Intel Sapphire Rapids, Intel(R) Xeon(R) Platinum 8480CL E5, 112 cores (56 cores per CPU), 2.00 GHz (Base), 3.8 GHz (Max boost), PCIe Gen5
>
> * **NVLink**:
>   NVLink 4th Generation, 900 GB/s (GPU to GPU NVLink bidirectional bandwidth), 18 Links per GPU
>
> * **InfiniBand**:
>   8x400Gbit/s (Compute Links), 2x400Gbit/s (Storage Links)
>
> Benchmarking with a different hardware configuration may yield suboptimal results.

## Disaggregated Single Node Benchmarking

> [!Important]
> One 8xH100-80GB node is required for the following instructions.

In the following setup we compare Dynamo disaggregated vLLM performance to
[native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on a single node. These were chosen to optimize

With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started**, perform the following steps:

1. Run benchmarking container

   ```bash
   ./container/run.sh --mount-workspace
   ```

   > [!Tip]
   > The Hugging Face home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.

2. Start disaggregated services

   ```bash
   cd /workspace/examples/llm
   dynamo serve benchmarks.disagg:Frontend -f benchmarks/disagg.yaml 1> disagg.log 2>&1 &
   ```

   > [!Tip]
   > Check `disagg.log` to make sure the service is fully started before collecting performance numbers.
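
   Once the log shows the service is up, a short smoke test against the frontend can catch wiring issues before a long benchmark run. The port (8000), endpoint path, and model name below are assumptions based on the usual OpenAI-compatible defaults; check `disagg.yaml` for the actual values:

   ```bash
   # One short request through the frontend; a JSON completion in the response
   # indicates the prefill and decode workers are wired up end to end.
   curl -s http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
           "model": "neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic",
           "messages": [{"role": "user", "content": "Say hello in one word."}],
           "max_tokens": 16
         }'
   ```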

3. Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.

## Disaggregated Multinode Benchmarking

> [!Important]
> Two 8xH100-80GB nodes are required for the following instructions.

In the following steps we compare Dynamo disaggregated vLLM performance to
[native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on two nodes. These were chosen to optimize

With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started on node 0**, perform the following steps:

1. Run benchmarking container (nodes 0 & 1)

   ```bash
   ./container/run.sh --mount-workspace
   ```

   > [!Tip]
   > The Hugging Face home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.

2. Configure NATS and ETCD (node 1)

   ```bash
   export NATS_SERVER="nats://<node_0_ip_addr>"
   export ETCD_ENDPOINTS="<node_0_ip_addr>:2379"
   ```

   > [!Important]
   > Node 1 must be able to reach Node 0 over the network for the above services.
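
   Before starting workers on node 1, it can save debugging time to confirm node 0's services are reachable. Port 4222 is the default NATS client port and 2379 the default ETCD client port; both are assumptions based on the standard defaults:

   ```bash
   # Probe NATS and ETCD on node 0 from node 1; both checks should succeed.
   nc -zv <node_0_ip_addr> 4222
   curl -s http://<node_0_ip_addr>:2379/health
   ```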

3. Start workers (node 0)

   ```bash
   cd /workspace/examples/llm
   dynamo serve benchmarks.disagg_multinode:Frontend -f benchmarks/disagg_multinode.yaml 1> disagg_multinode.log 2>&1 &
   ```

   > [!Tip]
   > Check `disagg_multinode.log` to make sure the service is fully started before collecting performance numbers.

4. Start workers (node 1)

   ```bash
   cd /workspace/examples/llm
   dynamo serve components.prefill_worker:PrefillWorker -f benchmarks/disagg_multinode.yaml 1> prefill_multinode.log 2>&1 &
   ```

   > [!Tip]
   > Check `prefill_multinode.log` to make sure the service is fully started before collecting performance numbers.

5. Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.

## vLLM Aggregated Baseline Benchmarking

> [!Important]
> One (or two) 8xH100-80GB nodes are required for the following instructions.

With the Dynamo repository and the benchmarking image available, perform the following steps:

1. Run benchmarking container

   ```bash
   ./container/run.sh --mount-workspace
   ```

   > [!Tip]
   > The Hugging Face home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.

2. Start vLLM serve

   ```bash
   CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
     --block-size 128 \
     --max-model-len 3500 \
     --max-num-batched-tokens 3500 \
     --tensor-parallel-size 4 \
     --gpu-memory-utilization 0.95 \
     --disable-log-requests \
     --port 8001 1> vllm_0.log 2>&1 &
   CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
     --block-size 128 \
     --max-model-len 3500 \
     --max-num-batched-tokens 3500 \
     --tensor-parallel-size 4 \
     --gpu-memory-utilization 0.95 \
     --disable-log-requests \
     --port 8002 1> vllm_1.log 2>&1 &
   ```

   > [!Tip]
   > Check `vllm_0.log` and `vllm_1.log` to make sure the services are fully started before collecting performance numbers.
   >
   > If benchmarking with two or more nodes, use `--tensor-parallel-size 8` and run only one `vllm serve` instance per node.
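
   As an additional readiness check, vLLM's OpenAI-compatible server answers on `/v1/models` once the model has loaded; the ports below match the two commands above:

   ```bash
   # Each call should list the served model once the corresponding instance is ready.
   curl -s http://localhost:8001/v1/models
   curl -s http://localhost:8002/v1/models
   ```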

3. Use NGINX as load balancer

   ```bash
   apt update && apt install -y nginx
   cp /workspace/benchmarks/llm/nginx.conf /etc/nginx/nginx.conf
   service nginx restart
   ```

   > [!Note]
   > If benchmarking over 2 nodes, the `upstream` configuration will need to be updated to point to the `vllm serve` instance on the second node.
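
   For the two-node case the exact edit depends on the shipped `nginx.conf`, but the idea is one `server` entry per `vllm serve` instance. A minimal sketch of applying and checking such a change (the backend address is a placeholder, not taken from the provided file):

   ```bash
   # After adding the second node's backend to the upstream block
   # (for example a line like "server <node_1_ip_addr>:8001;"),
   # validate the configuration and reload NGINX.
   nginx -t && service nginx restart
   ```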

4. Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.


## Collecting Performance Numbers

Run the benchmarking script:

```bash
bash -x /workspace/benchmarks/llm/perf.sh
```
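
If you want to keep a record of a run for later comparison, tee the output to a file; this assumes the script prints its measurements to stdout/stderr:

```bash
# Capture the full trace and the reported measurements alongside the console output.
bash -x /workspace/benchmarks/llm/perf.sh 2>&1 | tee perf_run.log
```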

> [!Tip]
> See the [GenAI-Perf tutorial](https://github.com/triton-inference-server/perf_analyzer/blob/main/genai-perf/docs/tutorial.md)
> on [GitHub](https://github.com/triton-inference-server/perf_analyzer) for additional information about how to run GenAI-Perf
> and how to interpret results.

## Supporting Additional Models

The instructions above can be used for nearly any model desired.
More complex setup instructions might be required for certain models.
The above instructions regarding ETCD, NATS, NGINX, dynamo-serve, and GenAI-Perf still apply and can be reused.
The specifics of deploying with different hardware, in a unique environment, or using another model framework can be adapted using the links below.

Regardless of the deployment mechanism, the GenAI-Perf tool will report the same metrics and measurements so long as an accessible endpoint is available for it to interact with. Use the provided [perf.sh](../../../benchmarks/llm/perf.sh) script to automate the measurement of model throughput and latency against multiple request concurrencies.
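
Whichever serving stack you use, it is worth confirming the endpoint is reachable and OpenAI-compatible before pointing the benchmark at it; the host and port below are placeholders for wherever your frontend or load balancer listens:

```bash
# List the models the endpoint advertises; GenAI-Perf drives the same OpenAI-compatible API surface.
curl -s http://<endpoint_host>:<port>/v1/models
```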

### Deployment Examples

- [Dynamo Multinode Deployments](../../../docs/examples/multinode.md)
- [Dynamo TensorRT LLM Deployments](../../../docs/examples/trtllm.md)
- [Aggregated Deployment of Very Large Models](../../../docs/examples/multinode.md#aggregated-deployment)
- [Dynamo vLLM Deployments](../../../docs/examples/llm_deployment.md)

## Metrics and Visualization

For instructions on how to acquire per-worker metrics and visualize them using Grafana,
please see the provided [Visualization with Prometheus and Grafana](../../../deploy/metrics/README.md) guide.