
Commit 2ae9ab9

whoisj, tanmayv25, and kthui authored
chore: Move Benchmarking to Top Level (#1461)
Signed-off-by: Tanmay Verma <[email protected]>
Co-authored-by: Tanmay Verma <[email protected]>
Co-authored-by: Jacky <[email protected]>
1 parent 08355da commit 2ae9ab9

File tree

4 files changed (+184 / -97 lines changed)


benchmarks/llm/README.md

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+[../../examples/llm/benchmarks/README.md](../../examples/llm/benchmarks/README.md)
File renamed without changes.
File renamed without changes.

examples/llm/benchmarks/README.md

Lines changed: 169 additions & 97 deletions
@@ -22,37 +22,52 @@ This guide provides detailed steps on benchmarking Large Language Models (LLMs)
> [!NOTE]
> We recommend trying out the [LLM Deployment Examples](./README.md) before benchmarking.

+
## Prerequisites

-H100 80GB x8 node(s) are required for benchmarking.
+> [!Important]
+> At least one 8xH100-80GB node is required for the following instructions.
+
+1. Build benchmarking image
+
+```bash
+./container/build.sh
+```
+
+2. Download model
+
+```bash
+huggingface-cli download neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
+```
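Not part of the steps above, but a quick way to confirm the weights actually landed in the local Hugging Face cache before any container is started (`huggingface-cli scan-cache` belongs to the same CLI used for the download):

```bash
# The DeepSeek-R1-Distill-Llama-70B-FP8-dynamic repo should be listed once the download completes
huggingface-cli scan-cache | grep DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
```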
+
+3. Start NATS and ETCD
+
+```bash
+docker compose -f deploy/docker_compose.yml up -d
+```
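Before continuing, it can help to confirm that both services actually came up. A minimal sketch, assuming the compose file exposes etcd's client port (2379) and the NATS monitoring port (8222) on the host; adjust if the compose file maps them differently:

```bash
# List the containers started by the compose file
docker compose -f deploy/docker_compose.yml ps

# etcd serves a /health endpoint on its client port
curl -s http://localhost:2379/health

# NATS reports liveness on its monitoring port, when that port is exposed
curl -s http://localhost:8222/healthz
```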

> [!NOTE]
> This guide was tested on node(s) with the following hardware configuration:
-> * **GPUs**: 8xH100 80GB HBM3 (GPU Memory Bandwidth 3.2 TBs)
-> * **CPU**: 2x Intel Saphire Rapids, Intel(R) Xeon(R) Platinum 8480CL E5, 112 cores (56 cores per CPU), 2.00 GHz (Base), 3.8 Ghz (Max boost), PCIe Gen5
-> * **NVLink**: NVLink 4th Generation, 900 GB/s (GPU to GPU NVLink bidirectional bandwidth), 18 Links per GPU
-> * **InfiniBand**: 8X400Gbit/s (Compute Links), 2X400Gbit/s (Storage Links)
+>
+> * **GPUs**:
+> 8xH100-80GB-HBM3 (GPU Memory Bandwidth 3.2 TB/s)
+>
+> * **CPU**:
+> 2 x Intel Sapphire Rapids, Intel(R) Xeon(R) Platinum 8480CL E5, 112 cores (56 cores per CPU), 2.00 GHz (Base), 3.8 GHz (Max boost), PCIe Gen5
+>
+> * **NVLink**:
+> NVLink 4th Generation, 900 GB/s (GPU to GPU NVLink bidirectional bandwidth), 18 Links per GPU
+>
+> * **InfiniBand**:
+> 8x400Gbit/s (Compute Links), 2x400Gbit/s (Storage Links)
>
> Benchmarking with a different hardware configuration may yield suboptimal results.

-1\. Build benchmarking image
-```bash
-./container/build.sh
-```
-
-2\. Download model
-```bash
-huggingface-cli download neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
-```
-
-3\. Start NATS and ETCD
-```bash
-docker compose -f deploy/docker_compose.yml up -d
-```

## Disaggregated Single Node Benchmarking

-One H100 80GB x8 node is required for this setup.
+> [!Important]
+> One 8xH100-80GB node is required for the following instructions.

In the following setup we compare Dynamo disaggregated vLLM performance to
[native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on a single node. These were chosen to optimize
@@ -64,24 +79,32 @@ Each prefill worker will use tensor parallel 1 and the decode worker will use te

With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started**, perform the following steps:

-1\. Run benchmarking container
-```bash
-./container/run.sh --mount-workspace
-```
-Note: The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
+1. Run benchmarking container

-2\. Start disaggregated services
-```bash
-cd /workspace/examples/llm
-dynamo serve benchmarks.disagg:Frontend -f benchmarks/disagg.yaml 1> disagg.log 2>&1 &
-```
-Note: Check the `disagg.log` to make sure the service is fully started before collecting performance numbers.
+```bash
+./container/run.sh --mount-workspace
+```
+
+> [!Tip]
+> The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
+
+2. Start disaggregated services
+
+```bash
+cd /workspace/examples/llm
+dynamo serve benchmarks.disagg:Frontend -f benchmarks/disagg.yaml 1> disagg.log 2>&1 &
+```

-Collect the performance numbers as shown on the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
+> [!Tip]
+> Check the `disagg.log` to make sure the service is fully started before collecting performance numbers.
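One concrete way to follow that Tip is to watch the log and then send a single smoke-test request before benchmarking. The port, path, and model name below are assumptions (an OpenAI-compatible frontend on localhost:8000 serving the downloaded model); adjust them to whatever `disagg.yaml` configures:

```bash
# Watch the service log until startup completes
tail -f disagg.log

# Then send one request to confirm the frontend answers
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic",
       "messages": [{"role": "user", "content": "Hello"}],
       "max_tokens": 16}'
```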

-## Disaggregated Multi Node Benchmarking
+3. Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.

-Two H100 80GB x8 nodes are required for this setup.
+
+## Disaggregated Multinode Benchmarking
+
+> [!Important]
+> Two 8xH100-80GB nodes are required for the following instructions.

In the following steps we compare Dynamo disaggregated vLLM performance to
[native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on two nodes. These were chosen to optimize
@@ -93,87 +116,136 @@ Each prefill worker will use tensor parallel 1 and the decode worker will use te

With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started on node 0**, perform the following steps:

-1\. Run benchmarking container (node 0 & 1)
-```bash
-./container/run.sh --mount-workspace
-```
-Note: The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
+1. Run benchmarking container (nodes 0 & 1)

-2\. Config NATS and ETCD (node 1)
-```bash
-export NATS_SERVER="nats://<node_0_ip_addr>"
-export ETCD_ENDPOINTS="<node_0_ip_addr>:2379"
-```
-Note: Node 1 must be able to reach Node 0 over the network for the above services.
+```bash
+./container/run.sh --mount-workspace
+```

-3\. Start workers (node 0)
-```bash
-cd /workspace/examples/llm
-dynamo serve benchmarks.disagg_multinode:Frontend -f benchmarks/disagg_multinode.yaml 1> disagg_multinode.log 2>&1 &
-```
-Note: Check the `disagg_multinode.log` to make sure the service is fully started before collecting performance numbers.
+> [!Tip]
+> The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.

-4\. Start workers (node 1)
-```bash
-cd /workspace/examples/llm
-dynamo serve components.prefill_worker:PrefillWorker -f benchmarks/disagg_multinode.yaml 1> prefill_multinode.log 2>&1 &
-```
-Note: Check the `prefill_multinode.log` to make sure the service is fully started before collecting performance numbers.
+2. Config NATS and ETCD (node 1)
+
+```bash
+export NATS_SERVER="nats://<node_0_ip_addr>"
+export ETCD_ENDPOINTS="<node_0_ip_addr>:2379"
+```
+
+> [!Important]
+> Node 1 must be able to reach Node 0 over the network for the above services.
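A simple way to check that reachability from node 1 before starting any workers; port 2379 comes from `ETCD_ENDPOINTS` above, while 4222 is assumed to be the NATS client port (the NATS default):

```bash
# Run on node 1; both probes should report the port as open
nc -zv <node_0_ip_addr> 4222   # NATS client port (assumed default)
nc -zv <node_0_ip_addr> 2379   # ETCD client port
```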
+
+3. Start workers (node 0)
+
+```bash
+cd /workspace/examples/llm
+dynamo serve benchmarks.disagg_multinode:Frontend -f benchmarks/disagg_multinode.yaml 1> disagg_multinode.log 2>&1 &
+```
+
+> [!Tip]
+> Check the `disagg_multinode.log` to make sure the service is fully started before collecting performance numbers.
+
+4. Start workers (node 1)
+
+```bash
+cd /workspace/examples/llm
+dynamo serve components.prefill_worker:PrefillWorker -f benchmarks/disagg_multinode.yaml 1> prefill_multinode.log 2>&1 &
+```
+
+> [!Tip]
+> Check the `prefill_multinode.log` to make sure the service is fully started before collecting performance numbers.
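Both Tips above amount to watching the worker logs on their respective nodes until startup completes, for example:

```bash
# On node 0 (frontend and decode workers)
tail -f disagg_multinode.log

# On node 1 (prefill workers)
tail -f prefill_multinode.log
```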
+
+5. Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.

-Collect the performance numbers as shown on the [Collecting Performance Numbers](#collecting-performance-numbers) section above.

## vLLM Aggregated Baseline Benchmarking

-One (or two) H100 80GB x8 nodes are required for this setup.
+> [!Important]
+> One (or two) 8xH100-80GB nodes are required for the following instructions.

With the Dynamo repository and the benchmarking image available, perform the following steps:

-1\. Run benchmarking container
-```bash
-./container/run.sh --mount-workspace
-```
-Note: The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
+1. Run benchmarking container

-2\. Start vLLM serve
-```bash
-CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
---block-size 128 \
---max-model-len 3500 \
---max-num-batched-tokens 3500 \
---tensor-parallel-size 4 \
---gpu-memory-utilization 0.95 \
---disable-log-requests \
---port 8001 1> vllm_0.log 2>&1 &
-CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
---block-size 128 \
---max-model-len 3500 \
---max-num-batched-tokens 3500 \
---tensor-parallel-size 4 \
---gpu-memory-utilization 0.95 \
---disable-log-requests \
---port 8002 1> vllm_1.log 2>&1 &
-```
-Notes:
-* Check the `vllm_0.log` and `vllm_1.log` to make sure the service is fully started before collecting performance numbers.
-* If benchmarking over 2 nodes, `--tensor-parallel-size 8` should be used and only run one `vllm serve` instance per node.
+```bash
+./container/run.sh --mount-workspace
+```

-3\. Use NGINX as load balancer
-```bash
-apt update && apt install -y nginx
-cp /workspace/examples/llm/benchmarks/nginx.conf /etc/nginx/nginx.conf
-service nginx restart
-```
-Note: If benchmarking over 2 nodes, the `upstream` configuration will need to be updated to link to the `vllm serve` on the second node.
+> [!Tip]
+> The Hugging Face home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
+
+2. Start vLLM serve
+
+```bash
+CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
+--block-size 128 \
+--max-model-len 3500 \
+--max-num-batched-tokens 3500 \
+--tensor-parallel-size 4 \
+--gpu-memory-utilization 0.95 \
+--disable-log-requests \
+--port 8001 1> vllm_0.log 2>&1 &
+CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
+--block-size 128 \
+--max-model-len 3500 \
+--max-num-batched-tokens 3500 \
+--tensor-parallel-size 4 \
+--gpu-memory-utilization 0.95 \
+--disable-log-requests \
+--port 8002 1> vllm_1.log 2>&1 &
+```
+
+> [!Tip]
+> Check the `vllm_0.log` and `vllm_1.log` to make sure the service is fully started before collecting performance numbers.
+>
+> If benchmarking with two or more nodes, `--tensor-parallel-size 8` should be used and only run one `vllm serve` instance per node.
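One way to act on that Tip is to poll each instance's model listing; `/v1/models` is part of vLLM's OpenAI-compatible server and only responds once the model has finished loading:

```bash
# Each vllm serve instance answers on its own port when it is ready
curl -s http://localhost:8001/v1/models
curl -s http://localhost:8002/v1/models
```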
+
+3. Use NGINX as load balancer
+
+```bash
+apt update && apt install -y nginx
+cp /workspace/benchmarks/llm/nginx.conf /etc/nginx/nginx.conf
+service nginx restart
+```
+
+> [!Note]
+> If benchmarking over 2 nodes, the `upstream` configuration will need to be updated to link to the `vllm serve` on the second node.
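The repository's `nginx.conf` is the authoritative configuration; purely as an illustration of the `upstream` block that the Note refers to, a minimal sketch along these lines balances the two local instances, and for the two-node case one `server` entry would instead point at the second node's address (the listen port 8000 here is an assumption):

```bash
# Illustrative only -- prefer the nginx.conf shipped in the repository.
cat > /etc/nginx/nginx.conf <<'EOF'
events {}
http {
    upstream vllm_backends {
        least_conn;
        server 127.0.0.1:8001;  # first vllm serve instance
        server 127.0.0.1:8002;  # second instance (or <node_1_ip_addr>:8001 across two nodes)
    }
    server {
        listen 8000;            # assumed front-door port; match what the benchmarking script targets
        location / {
            proxy_pass http://vllm_backends;
        }
    }
}
EOF
service nginx restart
```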
+
+4. Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.

-Collect the performance numbers as shown on the [Collecting Performance Numbers](#collecting-performance-numbers) section below.

## Collecting Performance Numbers

Run the benchmarking script
+
```bash
-bash -x /workspace/examples/llm/benchmarks/perf.sh
+bash -x /workspace/benchmarks/llm/perf.sh
```

-## Future Roadmap
+> [!Tip]
+> See the [GenAI-Perf tutorial](https://github.com/triton-inference-server/perf_analyzer/blob/main/genai-perf/docs/tutorial.md)
+> on [GitHub](https://github.com/triton-inference-server/perf_analyzer) for additional information about how to run GenAI-Perf
+> and how to interpret results.
+
+
+## Supporting Additional Models
+
+The instructions above can be used for nearly any model desired.
+More complex setup instructions might be required for certain models.
+The above instructions regarding ETCD, NATS, nginx, dynamo-serve, and GenAI-Perf still apply and can be reused.
+The specifics of deploying with different hardware, in a unique environment, or using another model framework can be adapted using the links below.
+
+Regardless of the deployment mechanism, the GenAI-Perf tool will report the same metrics and measurements so long as an accessible endpoint is available for it to interact with. Use the provided [perf.sh](../../../benchmarks/llm/perf.sh) script to automate the measurement of model throughput and latency against multiple request concurrencies.
+
+
+### Deployment Examples
+
+- [Dynamo Multinode Deployments](../../../docs/examples/multinode.md)
+- [Dynamo TensorRT LLM Deployments](../../../docs/examples/trtllm.md)
+- [Aggregated Deployment of Very Large Models](../../../docs/examples/multinode.md#aggregated-deployment)
+- [Dynamo vLLM Deployments](../../../docs/examples/llm_deployment.md)
+
+
+## Metrics and Visualization

-* Results Interpretation
+For instructions on how to acquire per worker metrics and visualize them using Grafana,
+please see the provided [Visualization with Prometheus and Grafana](../../../deploy/metrics/README.md).
