
test(perf): Add remaining Llama-Nemotron perftests (nano, super, ultra) + extras ✨ #5066


Open

wants to merge 8 commits into base: main

Conversation

@venkywonka (Collaborator) commented Jun 10, 2025

Description

  • Refine the logic that overrides test-harness settings via extra-llm-api-args; the previous approach was verbose because every exception had to be listed explicitly (a hedged sketch follows this list).
  • Add super pyt backend tests (BF16)
  • Add super pyt backend tests (FP8 pre-quantized)
  • Add ultra pyt backend tests (FP8 pre-quantized)
  • Add nano pyt backend tests (FP8 pre-quantized)
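
The pattern-based override idea is sketched below. This is a minimal illustration with hypothetical names (`PATTERN_OVERRIDES`, `apply_extra_llm_api_args`); the actual helper in `tests/integration/defs/perf/pytorch_model_config.py` may be structured differently.

```python
import fnmatch
from typing import Any, Dict, List, Tuple

# Hypothetical pattern table: each glob applies its extra LLM-API args to every
# matching model label, instead of hard-coding a verbose list of exceptions.
PATTERN_OVERRIDES: List[Tuple[str, Dict[str, Any]]] = [
    ("llama_v3*_nemotron*", {"enable_attention_dp": False}),
]

def apply_extra_llm_api_args(model_label: str,
                             base_config: Dict[str, Any]) -> Dict[str, Any]:
    """Merge the overrides of every pattern that matches the model label."""
    config = dict(base_config)
    for pattern, overrides in PATTERN_OVERRIDES:
        if fnmatch.fnmatch(model_label, pattern):
            config.update(overrides)
    return config

# Example: the nano FP8 perf test picks up the attention-dp override.
print(apply_extra_llm_api_args("llama_v3.1_nemotron_nano_8b_fp8", {}))
```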

Performance Summary

Llama v3.1 Nemotron Nano 8B (FP8)

Invariants:

  • Model: llama_v3.1_nemotron_nano_8b_fp8
  • Quantization: FP8 pre-quantized
  • GPUs: 1 H100
  • Parallelism: None (single GPU)

| ISL/OSL | Total Requests | Concurrency | Request Throughput (req/sec) | Total Token Throughput (tokens/sec) | Total Output Throughput (tokens/sec) | Avg. Request Latency (ms) | Per User Output Throughput (tps/user) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 5000/500 | 8 | 1 (low-latency) | 0.3114 | 1712.5529 | 155.6866 | 3211.5057 | 155.6902 |
| 500/2000 | 8 | 1 (low-latency) | 0.0829 | 207.2239 | 165.7791 | 12064.1526 | 165.7804 |
| 1000/1000 | 8 | 1 (low-latency) | 0.1654 | 330.7232 | 165.3616 | 6047.2742 | 165.3638 |
| 20000/2000 | 8 | 1 (low-latency) | 0.0699 | 1536.8189 | 139.7108 | 14315.1906 | 139.7117 |
| 5000/500 | 500 | 250 (high-throughput) | 0.3114 | 1712.8874 | 155.7170 | 602893.2472 | 2.1568 |
| 500/2000 | 500 | 250 (high-throughput) | 5.3177 | 13294.1857 | 10635.3486 | 46703.0254 | 42.8700 |
| 1000/1000 | 500 | 250 (high-throughput) | 8.8740 | 17747.9144 | 8873.9572 | 27362.7071 | 37.0793 |
| 20000/2000 | 500 | 250 (high-throughput) | 0.0699 | 1537.6918 | 139.7902 | 2686479.5368 | 1.9492 |

Llama v3.3 Nemotron Super 49B (FP8)

Invariants:

  • Model: llama_v3.3_nemotron_super_49b_fp8
  • Quantization: FP8 pre-quantized
  • GPUs: 4 H100s
  • Parallelism: TP=4

| ISL/OSL | Total Requests | Concurrency | Request Throughput (req/sec) | Total Token Throughput (tokens/sec) | Total Output Throughput (tokens/sec) | Avg. Request Latency (ms) | Per User Output Throughput (tps/user) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 5000/500 | 8 | 1 (low-latency) | 0.2004 | 1102.3076 | 100.2098 | 4989.3921 | 100.2126 |
| 500/2000 | 8 | 1 (low-latency) | 0.0529 | 132.2748 | 105.8199 | 18899.8642 | 105.8209 |
| 5000/500 | 250 | 250 (high-throughput) | 0.1991 | 1095.1056 | 99.5551 | 633542.7594 | 2.3312 |
| 500/2000 | 250 | 250 (high-throughput) | 3.6579 | 9144.6989 | 7315.7591 | 67539.0391 | 29.6147 |

Llama v3.3 Nemotron Super 49B (BF16)

Invariants:

  • Model: llama_v3.3_nemotron_super_49b
  • Quantization: None (BFloat16)
  • GPUs: 4 H100s
  • Parallelism: TP=4

| ISL/OSL | Total Requests | Concurrency | Request Throughput (req/sec) | Total Token Throughput (tokens/sec) | Total Output Throughput (tokens/sec) | Avg. Request Latency (ms) | Per User Output Throughput (tps/user) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 5000/500 | 8 | 1 (low-latency) | 0.1201 | 660.4628 | 60.0421 | 8327.3392 | 60.0432 |
| 500/2000 | 8 | 1 (low-latency) | 0.0315 | 78.6292 | 62.9034 | 31794.6209 | 62.9037 |
| 5000/500 | 250 | 250 (high-throughput) | 0.1201 | 660.8213 | 60.0747 | 1047295.6879 | 1.4272 |
| 500/2000 | 250 | 250 (high-throughput) | 2.6072 | 6517.9790 | 5214.3832 | 94645.0469 | 21.1334 |

Llama v3.1 Nemotron Ultra 253B (FP8)

Invariants:

  • Model: llama_v3.1_nemotron_ultra_253b_fp8
  • Quantization: FP8 pre-quantized
  • GPUs: 8 H100s
  • Parallelism: TP=8

| ISL/OSL | Total Requests | Concurrency | Request Throughput (req/sec) | Total Token Throughput (tokens/sec) | Total Output Throughput (tokens/sec) | Avg. Request Latency (ms) | Per User Output Throughput (tps/user) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 5000/500 | 8 | 1 (low-latency) | 0.0732 | 402.6274 | 36.6025 | 13660.0937 | 36.6078 |
| 500/2000 | 8 | 1 (low-latency) | 0.0204 | 50.9418 | 40.7534 | 49075.3022 | 40.7543 |
| 5000/500 | 250 | 250 (high-throughput) | 0.0699 | 384.3698 | 34.9427 | 1812039.2223 | 0.8413 |
| 500/2000 | 250 | 250 (high-throughput) | 0.8270 | 2067.5576 | 1654.0461 | 299370.8895 | 6.6809 |

NOTES

  • FP8 vs BF16 (Super 49B): FP8 delivers roughly 40-70% higher total token throughput than BF16 across the configurations above.
  • All of the above models were run with the enable_attention_dp=False override; enabling it caused hangs that are tracked by several bugs. (A hedged sketch of how such an override can be expressed follows this list.)
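
For context, a minimal sketch of how such an override can be materialized as an extra-LLM-API-args YAML file for the benchmark to consume. The function name and file handling are hypothetical, not the harness's actual code, and the exact CLI option that accepts the file depends on the benchmark version.

```python
import tempfile

import yaml  # PyYAML


def write_extra_llm_api_args(overrides: dict) -> str:
    """Dump override key/values to a temporary YAML file and return its path."""
    f = tempfile.NamedTemporaryFile(mode="w", suffix=".yaml",
                                    prefix="extra_llm_api_args_", delete=False)
    yaml.safe_dump(overrides, f)
    f.close()
    return f.name


# Disable attention data parallelism, which was hanging for these models.
args_path = write_extra_llm_api_args({"enable_attention_dp": False})
print(args_path)  # pass this path to the harness's extra-LLM-API-args option
```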

@venkywonka venkywonka marked this pull request as ready for review June 10, 2025 04:10
@Copilot (Copilot AI, Contributor) left a comment


Pull Request Overview

This PR adds additional performance tests for Llama-Nemotron models, refining test harness logic and updating configuration mappings to support FP8 backends for nano, super, and ultra variants.

  • Added FP8 prequantized performance tests for nano, super, and ultra models in the YAML test list.
  • Updated test mapping in test_perf.py to include FP8 variants.
  • Enhanced pattern-based model configuration in pytorch_model_config.py to support new FP8 tests and disable attention_dp for certain models.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tests/integration/test_lists/qa/trt_llm_release_perf_test.yml | Added new FP8 and extended performance test cases with clarifying comments. |
| tests/integration/defs/perf/test_perf.py | Updated model mapping to include FP8 variants for performance tests. |
| tests/integration/defs/perf/pytorch_model_config.py | Introduced pattern-based configuration updates and adjustments for FP8 model tests. |

Comments suppressed due to low confidence (3)

tests/integration/defs/perf/pytorch_model_config.py:82

  • Consider adding a reference, such as a ticket or issue ID, in the comment regarding the hang issue to provide better context for future maintainability.
'enable_attention_dp': False,

tests/integration/defs/perf/test_perf.py:59

  • [nitpick] Ensure that the updated FP8 model naming is consistent across all mapping dictionaries to avoid potential confusion during usage.
"llama_v3.1_nemotron_nano_8b_fp8": "Llama-3.1-Nemotron-Nano-8B-v1-FP8",

tests/integration/test_lists/qa/trt_llm_release_perf_test.yml:294

  • [nitpick] Consider expanding the inline comments to provide more context on the test categories, which can improve clarity and maintainability.
# pyt

@venkywonka (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #8197 [ run ] triggered by Bot

Signed-off-by: Venky Ganesh <[email protected]>
@venkywonka (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #8207 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #8197 [ run ] completed with state ABORTED

Signed-off-by: Venky Ganesh <[email protected]>
@venkywonka (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #8209 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #8207 [ run ] completed with state ABORTED

@venkywonka changed the title from "test(perf): Add remaining Llama-Nemotron perftests (nano, super, ultra)" to "test(perf): Add remaining Llama-Nemotron perftests (nano, super, ultra) + extras ✨" on Jun 10, 2025
Signed-off-by: Venky Ganesh <[email protected]>
@venkywonka (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #8216 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #8209 [ run ] completed with state ABORTED

@tensorrt-cicd (Collaborator)

PR_Github #8216 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5959 completed with status: 'SUCCESS'

@venkywonka (Collaborator, Author)

/bot reuse-pipeline --number 5959

@tensorrt-cicd (Collaborator)

PR_Github #8539 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #8539 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #8216 for commit ab29fb7
