
test(perf): Add remaining Llama-Nemotron perftests (nano, super, ultra) + extras ✨ #5066


Open

wants to merge 8 commits into base: main

Conversation

@venkywonka (Collaborator) commented Jun 10, 2025

Description

  • Refine the logic that overrides test-harness settings via extra-llm-api-args; the previous approach was verbose because every exception had to be listed explicitly (a hedged sketch follows this list).
  • Add super pyt backend tests (BF16)
  • Add super pyt backend tests (FP8 pre-quantized)
  • Add ultra pyt backend tests (FP8 pre-quantized)
  • Add nano pyt backend tests (FP8 pre-quantized)
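
The pattern-based override idea is sketched below. This is a minimal illustration with hypothetical names (`PATTERN_OVERRIDES`, `apply_extra_llm_api_args`); the actual helper in `tests/integration/defs/perf/pytorch_model_config.py` may be structured differently.

```python
import fnmatch
from typing import Any, Dict, List, Tuple

# Hypothetical pattern table: each glob applies its extra LLM-API args to every
# matching model label, instead of hard-coding a verbose list of exceptions.
PATTERN_OVERRIDES: List[Tuple[str, Dict[str, Any]]] = [
    ("llama_v3*_nemotron*", {"enable_attention_dp": False}),
]

def apply_extra_llm_api_args(model_label: str,
                             base_config: Dict[str, Any]) -> Dict[str, Any]:
    """Merge the overrides of every pattern that matches the model label."""
    config = dict(base_config)
    for pattern, overrides in PATTERN_OVERRIDES:
        if fnmatch.fnmatch(model_label, pattern):
            config.update(overrides)
    return config

# Example: the nano FP8 perf test picks up the attention-dp override.
print(apply_extra_llm_api_args("llama_v3.1_nemotron_nano_8b_fp8", {}))
```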

Performance Summary

Llama v3.1 Nemotron Nano 8B (FP8)

Invariants:

  • Model: llama_v3.1_nemotron_nano_8b_fp8
  • Quantization: FP8 pre-quantized
  • GPUs: 1 H100
  • Parallelism: None (single GPU)

| ISL/OSL | Total Requests | Concurrency | Request Throughput (req/sec) | Total Token Throughput (tokens/sec) | Total Output Throughput (tokens/sec) | Avg. Request Latency (ms) | Per User Output Throughput (tps/user) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 5000/500 | 8 | 1 (low-latency) | 0.3114 | 1712.5529 | 155.6866 | 3211.5057 | 155.6902 |
| 500/2000 | 8 | 1 (low-latency) | 0.0829 | 207.2239 | 165.7791 | 12064.1526 | 165.7804 |
| 1000/1000 | 8 | 1 (low-latency) | 0.1654 | 330.7232 | 165.3616 | 6047.2742 | 165.3638 |
| 20000/2000 | 8 | 1 (low-latency) | 0.0699 | 1536.8189 | 139.7108 | 14315.1906 | 139.7117 |
| 5000/500 | 500 | 250 (high-throughput) | 0.3114 | 1712.8874 | 155.7170 | 602893.2472 | 2.1568 |
| 500/2000 | 500 | 250 (high-throughput) | 5.3177 | 13294.1857 | 10635.3486 | 46703.0254 | 42.8700 |
| 1000/1000 | 500 | 250 (high-throughput) | 8.8740 | 17747.9144 | 8873.9572 | 27362.7071 | 37.0793 |
| 20000/2000 | 500 | 250 (high-throughput) | 0.0699 | 1537.6918 | 139.7902 | 2686479.5368 | 1.9492 |

Llama v3.3 Nemotron Super 49B (FP8)

Invariants:

  • Model: llama_v3.3_nemotron_super_49b_fp8
  • Quantization: FP8 pre-quantized
  • GPUs: 4 H100s
  • Parallelism: TP=4

| ISL/OSL | Total Requests | Concurrency | Request Throughput (req/sec) | Total Token Throughput (tokens/sec) | Total Output Throughput (tokens/sec) | Avg. Request Latency (ms) | Per User Output Throughput (tps/user) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 5000/500 | 8 | 1 (low-latency) | 0.2004 | 1102.3076 | 100.2098 | 4989.3921 | 100.2126 |
| 500/2000 | 8 | 1 (low-latency) | 0.0529 | 132.2748 | 105.8199 | 18899.8642 | 105.8209 |
| 5000/500 | 250 | 250 (high-throughput) | 0.1991 | 1095.1056 | 99.5551 | 633542.7594 | 2.3312 |
| 500/2000 | 250 | 250 (high-throughput) | 3.6579 | 9144.6989 | 7315.7591 | 67539.0391 | 29.6147 |

Llama v3.3 Nemotron Super 49B (BF16)

Invariants:

  • Model: llama_v3.3_nemotron_super_49b
  • Quantization: None (BFloat16)
  • GPUs: 4 H100s
  • Parallelism: TP=4

| ISL/OSL | Total Requests | Concurrency | Request Throughput (req/sec) | Total Token Throughput (tokens/sec) | Total Output Throughput (tokens/sec) | Avg. Request Latency (ms) | Per User Output Throughput (tps/user) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 5000/500 | 8 | 1 (low-latency) | 0.1201 | 660.4628 | 60.0421 | 8327.3392 | 60.0432 |
| 500/2000 | 8 | 1 (low-latency) | 0.0315 | 78.6292 | 62.9034 | 31794.6209 | 62.9037 |
| 5000/500 | 250 | 250 (high-throughput) | 0.1201 | 660.8213 | 60.0747 | 1047295.6879 | 1.4272 |
| 500/2000 | 250 | 250 (high-throughput) | 2.6072 | 6517.9790 | 5214.3832 | 94645.0469 | 21.1334 |

Llama v3.1 Nemotron Ultra 253B (FP8)

Invariants:

  • Model: llama_v3.1_nemotron_ultra_253b_fp8
  • Quantization: FP8 pre-quantized
  • GPUs: 8 H100s
  • Parallelism: TP=8

| ISL/OSL | Total Requests | Concurrency | Request Throughput (req/sec) | Total Token Throughput (tokens/sec) | Total Output Throughput (tokens/sec) | Avg. Request Latency (ms) | Per User Output Throughput (tps/user) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 5000/500 | 8 | 1 (low-latency) | 0.0732 | 402.6274 | 36.6025 | 13660.0937 | 36.6078 |
| 500/2000 | 8 | 1 (low-latency) | 0.0204 | 50.9418 | 40.7534 | 49075.3022 | 40.7543 |
| 5000/500 | 250 | 250 (high-throughput) | 0.0699 | 384.3698 | 34.9427 | 1812039.2223 | 0.8413 |
| 500/2000 | 250 | 250 (high-throughput) | 0.8270 | 2067.5576 | 1654.0461 | 299370.8895 | 6.6809 |

NOTES

  • FP8 vs BF16 (Super 49B): FP8 delivers roughly 40-70% higher total token throughput than BF16 across the configurations above.
  • All of the above models were run with the enable_attention_dp=False override; enabling it caused hangs that are tracked by several bugs. (A hedged sketch of how such an override can be expressed follows this list.)
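
For context, a minimal sketch of how such an override can be materialized as an extra-LLM-API-args YAML file for the benchmark to consume. The function name and file handling are hypothetical, not the harness's actual code, and the exact CLI option that accepts the file depends on the benchmark version.

```python
import tempfile

import yaml  # PyYAML


def write_extra_llm_api_args(overrides: dict) -> str:
    """Dump override key/values to a temporary YAML file and return its path."""
    f = tempfile.NamedTemporaryFile(mode="w", suffix=".yaml",
                                    prefix="extra_llm_api_args_", delete=False)
    yaml.safe_dump(overrides, f)
    f.close()
    return f.name


# Disable attention data parallelism, which was hanging for these models.
args_path = write_extra_llm_api_args({"enable_attention_dp": False})
print(args_path)  # pass this path to the harness's extra-LLM-API-args option
```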

@venkywonka venkywonka marked this pull request as ready for review June 10, 2025 04:10
@Copilot (Copilot AI, Contributor) left a comment


Pull Request Overview

This PR adds additional performance tests for Llama-Nemotron models, refining test harness logic and updating configuration mappings to support FP8 backends for nano, super, and ultra variants.

  • Added FP8 prequantized performance tests for nano, super, and ultra models in the YAML test list.
  • Updated test mapping in test_perf.py to include FP8 variants.
  • Enhanced pattern-based model configuration in pytorch_model_config.py to support new FP8 tests and disable attention_dp for certain models.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tests/integration/test_lists/qa/trt_llm_release_perf_test.yml | Added new FP8 and extended performance test cases with clarifying comments. |
| tests/integration/defs/perf/test_perf.py | Updated model mapping to include FP8 variants for performance tests. |
| tests/integration/defs/perf/pytorch_model_config.py | Introduced pattern-based configuration updates and adjustments for FP8 model tests. |

Comments suppressed due to low confidence (3)

tests/integration/defs/perf/pytorch_model_config.py:82

  • Consider adding a reference, such as a ticket or issue ID, in the comment regarding the hang issue to provide better context for future maintainability.
'enable_attention_dp': False,

tests/integration/defs/perf/test_perf.py:59

  • [nitpick] Ensure that the updated FP8 model naming is consistent across all mapping dictionaries to avoid potential confusion during usage.
"llama_v3.1_nemotron_nano_8b_fp8": "Llama-3.1-Nemotron-Nano-8B-v1-FP8",

tests/integration/test_lists/qa/trt_llm_release_perf_test.yml:294

  • [nitpick] Consider expanding the inline comments to provide more context on the test categories, which can improve clarity and maintainability.
# pyt

@venkywonka (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #8197 [ run ] triggered by Bot

Signed-off-by: Venky Ganesh <[email protected]>
@venkywonka (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #8207 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #8197 [ run ] completed with state ABORTED

Signed-off-by: Venky Ganesh <[email protected]>
@venkywonka (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #8209 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #8207 [ run ] completed with state ABORTED

@venkywonka changed the title from "test(perf): Add remaining Llama-Nemotron perftests (nano, super, ultra)" to "test(perf): Add remaining Llama-Nemotron perftests (nano, super, ultra) + extras ✨" on Jun 10, 2025
Signed-off-by: Venky Ganesh <[email protected]>
@venkywonka (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #8216 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #8209 [ run ] completed with state ABORTED

@tensorrt-cicd (Collaborator)

PR_Github #8216 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5959 completed with status: 'SUCCESS'

@venkywonka (Collaborator, Author)

/bot reuse-pipeline --number 5959

@tensorrt-cicd (Collaborator)

PR_Github #8539 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #8539 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #8216 for commit ab29fb7
