@@ -203,7 +203,7 @@ you need to specify a different `shm-region-prefix-name` for each server. See
for more information.
## Triton Metrics
- Starting with the 24.08 release of Triton, users can now obtain partial
+ Starting with the 24.08 release of Triton, users can now obtain specific
vLLM metrics by querying the Triton metrics endpoint (see the complete vLLM metrics
[here](https://docs.vllm.ai/en/latest/serving/metrics.html)). This can be
accomplished by launching a Triton server in any of the ways described above
@@ -213,16 +213,42 @@ the following:
```bash
curl localhost:8002/metrics
```
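Since all vLLM-specific fields share the `vllm:` prefix (described next), the raw scrape can be narrowed with a simple filter; a minimal sketch, assuming the default metrics port used above:
```bash
# Keep only the vLLM metric samples; "# HELP" / "# TYPE" comment lines
# and Triton's own metrics are filtered out.
curl -s localhost:8002/metrics | grep "^vllm:"
```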
- VLLM stats are reported by the metrics endpoint in fields that
- are prefixed with `vllm:`. Your output for these fields should look
- similar to the following:
+ vLLM stats are reported by the metrics endpoint in fields that are prefixed with
+ `vllm:`. Triton currently supports reporting the following metrics from vLLM:
+ ```bash
+ # Number of prefill tokens processed.
+ counter_prompt_tokens
+ # Number of generation tokens processed.
+ counter_generation_tokens
+ # Histogram of time to first token in seconds.
+ histogram_time_to_first_token
+ # Histogram of time per output token in seconds.
+ histogram_time_per_output_token
+ ```
+ Your output for these fields should look similar to the following:
```bash
# HELP vllm:prompt_tokens_total Number of prefill tokens processed.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model="vllm_model",version="1"} 10
# HELP vllm:generation_tokens_total Number of generation tokens processed.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model="vllm_model",version="1"} 16
+ # HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
+ # TYPE vllm:time_to_first_token_seconds histogram
+ vllm:time_to_first_token_seconds_count{model="vllm_model",version="1"} 1
+ vllm:time_to_first_token_seconds_sum{model="vllm_model",version="1"} 0.03233122825622559
+ vllm:time_to_first_token_seconds_bucket{model="vllm_model",version="1",le="0.001"} 0
+ vllm:time_to_first_token_seconds_bucket{model="vllm_model",version="1",le="0.005"} 0
+ ...
+ vllm:time_to_first_token_seconds_bucket{model="vllm_model",version="1",le="+Inf"} 1
+ # HELP vllm:time_per_output_token_seconds Histogram of time per output token in seconds.
+ # TYPE vllm:time_per_output_token_seconds histogram
+ vllm:time_per_output_token_seconds_count{model="vllm_model",version="1"} 15
+ vllm:time_per_output_token_seconds_sum{model="vllm_model",version="1"} 0.04501533508300781
+ vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.01"} 14
+ vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.025"} 15
+ ...
+ vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="+Inf"} 15
```
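As a worked example of reading the histograms above: the mean time per output token is the histogram's `_sum` divided by its `_count` (0.04501533508300781 / 15 ≈ 0.003 s here). A minimal sketch of computing it from a live scrape, assuming the same endpoint:
```bash
# Derive mean seconds per output token from the histogram samples.
# The metric value is the second whitespace-separated field on each line.
curl -s localhost:8002/metrics | awk '
  /^vllm:time_per_output_token_seconds_sum/   { sum = $2 }
  /^vllm:time_per_output_token_seconds_count/ { count = $2 }
  END { if (count > 0) print "mean seconds per output token:", sum / count }'
```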
To enable vLLM engine metrics collection, the "disable_log_stats" option needs to be either
set to false or left empty (false by default) in [model.json](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/1/model.json).
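For reference, a minimal sketch of checking the shipped sample; the JSON shown in the comments is illustrative of the file's shape, not its exact contents:
```bash
# Inspect the sample model.json; metrics stay enabled as long as
# "disable_log_stats" is omitted or explicitly false.
cat samples/model_repository/vllm_model/1/model.json
# Illustrative shape (field values are hypothetical):
# {
#     "model": "facebook/opt-125m",
#     "disable_log_stats": false,
#     "gpu_memory_utilization": 0.5
# }
```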