examples/tensorrt_llm/README.md: 7 additions, 5 deletions
@@ -129,14 +129,15 @@ cd /workspace/examples/tensorrt_llm
 dynamo serve graphs.disagg_router:Frontend -f ./configs/disagg_router.yaml
 ```

-#### Aggregated serving with Multi-Token Prediction(MTP) and DeepSeek R1
+#### Aggregated serving with Multi-Token Prediction(MTP) and DeepSeek R1
 ```bash
 cd /workspace/examples/tensorrt_llm
 dynamo serve graphs.agg:Frontend -f configs/deepseek_r1/mtp/mtp_agg.yaml
 ```
+
 Notes:
 - There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
-- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking
+- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.

 #### Multi-Node Disaggregated Serving

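As a quick illustration of the warm-up note above: a minimal sketch for sending a couple of warm-up requests before any measured run, assuming the Frontend exposes an OpenAI-compatible endpoint on `localhost:8000` (host, port, model name, and prompt are placeholders to adjust for your deployment).

```bash
# Warm-up sketch (assumed endpoint, port, and model name; adjust to your deployment).
# The first couple of requests after startup are slow, so send them before benchmarking.
for i in 1 2; do
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "deepseek-ai/DeepSeek-R1",
          "messages": [{"role": "user", "content": "Warm-up request"}],
          "max_tokens": 32
        }' > /dev/null
done
```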
@@ -233,7 +234,7 @@ Notes:
 unset SLURM_JOBID SLURM_JOB_ID SLURM_NODELIST
 ```

-#### Multi-Node Disaggregated Serving with Multi-Token Prediction(MTP) and DeepSeek R1
+#### Multi-Node Disaggregated Serving with Multi-Token Prediction(MTP) and DeepSeek R1

 Most of the steps remain the same as the above example, but this time we will have `dynamo serve` point to different config files that contain the MTP configurations.

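For context on what those MTP config files typically change: TensorRT-LLM enables MTP through its speculative-decoding options, so the MTP variants of the configs generally carry a block along these lines. This is a hedged sketch of the TensorRT-LLM-style settings, not the literal contents of the files under `configs/deepseek_r1/mtp/`; treat those files as the source of truth.

```yaml
# Hedged sketch of the kind of TensorRT-LLM speculative-decoding settings an MTP config
# carries; key names and nesting may differ in this repo's actual config files.
speculative_config:
  decoding_type: MTP            # enable Multi-Token Prediction
  num_nextn_predict_layers: 1   # number of MTP draft tokens predicted per step
```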
@@ -268,8 +269,9 @@ dynamo serve components.prefill_worker:TensorRTLLMPrefillWorker -f configs/deeps
 ```

 Notes:
-- There is a noticeable latency for the first four inference requests. Please send warm-up requests before starting the benchmark.
-- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking
+- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
+- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
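
To make the `ignore_eos` note concrete: a hedged example of a benchmark-style request in which `ignore_eos` is left at `false` (omitting it entirely is equally fine). The endpoint, port, and model name are assumptions to adapt to your deployment; the point is only that generation is allowed to stop at EOS so the measured MTP acceptance rate stays meaningful.

```bash
# Benchmark-style request sketch (assumed endpoint, port, and model name).
# With MTP enabled, omit "ignore_eos" or keep it false: forcing generation past EOS
# makes the MTP heads speculate on garbage tokens and skews the acceptance rate.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1",
        "messages": [{"role": "user", "content": "Explain multi-token prediction in two sentences."}],
        "max_tokens": 256,
        "ignore_eos": false
      }'
```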