Add script to fetch benchmark results for execuTorch #11734

yangw-dev · 2025-06-16T18:29:30Z

Summary

Provide methods and a script to fetch all ExecuTorch benchmark data from the HUD API into two datasets, private and public. The script will:

fetch all data from the HUD API for the input time range (UTC)
clean out records and tables that contain only FAILURE_REPORT due to job-level failures
get all private table metrics, generate table_name, and find the intersecting public table metrics
generate private and public table groups
output data
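
The cleaning and matching steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the record fields (`model`, `backend`, `device`, `metric`), the `model|backend|device` table-name scheme, and all function names are hypothetical, and the sketch assumes device names are already normalized so that a private table and its public counterpart produce the same name.

```python
from collections import defaultdict

FAILURE_METRIC = "FAILURE_REPORT"  # metric name taken from the PR description


def group_by_table(records):
    """Group raw records under a hypothetical table name: model|backend|device."""
    tables = defaultdict(list)
    for rec in records:
        name = f"{rec['model']}|{rec['backend']}|{rec['device']}"
        tables[name].append(rec)
    return dict(tables)


def drop_failed(tables):
    """Remove FAILURE_REPORT rows and drop tables that have nothing left."""
    cleaned = {}
    for name, rows in tables.items():
        kept = [r for r in rows if r["metric"] != FAILURE_METRIC]
        if kept:
            cleaned[name] = kept
    return cleaned


def split_private_public(private_records, public_records):
    """Return (private tables, public tables that have a private counterpart)."""
    private = drop_failed(group_by_table(private_records))
    public = drop_failed(group_by_table(public_records))
    matched_public = {n: rows for n, rows in public.items() if n in private}
    return private, matched_public
```

Keying both groups by the same generated name keeps the intersection step a plain dictionary lookup, which is one reason a stable `table_name` is useful here.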

OutputType:

run with excel-sheet export
run with csv export
run with dataframe format print
run with json format print

See more guidance in README.md

The data is similar to the Excel sheet generated manually in #10982.
The result should be the same as the hud per model datatable:

helper methods: common.py

Provides a common.py helper method to convert CSV files and Excel sheets back to the {"groupInfo":{}, "df":df.DataFrame} format.

run with

python3 .ci/scripts/benchmark_tooling/get_benchmark_analysis_data.py \
--startTime "2025-04-29T09:48:57" \
--endTime "2025-05-13T22:00:00" \
--outputType "excel" \
--models "mv3"

python3 .ci/scripts/benchmark_tooling/analyze_benchmark_stability.py \
--primary-file private.xlsx \
--reference-file public.xlsx

Generate excel files:
private.xlsx
public.xlsx

For instance, you can find the result for mv3 xnnpack_q8 on S22 Ultra, Android 14:


Latency Stability Analysis: table10 (Primary)
================================================================================
Model: mv3(xnnpack_q8)
Device: Samsung Galaxy S22 Ultra 5G (private)(Android 14)

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
  - Mean latency: 2.91 ms
  - Median latency (P50): 2.54 ms
  - Mean trimmed latency: 2.41 ms
  - Median trimmed latency: 2.15 ms

Dispersion Metrics:
  - Standard deviation: 1.14 ms
  - Coefficient of variation (CV): 39.08%
  - Interquartile range (IQR): 0.82 ms
  - Trimmed standard deviation: 0.76 ms
  - Trimmed coefficient of variation: 31.60%

Percentile Metrics:
  - P50 (median): 2.54 ms
  - P90: 3.88 ms
  - P95: 4.60 ms
  - P99: 5.91 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 5.6103
  - P99/P50 ratio: 2.3319
  - Mean rolling std (window=5): 0.79 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 15.37%
  - Max trimming effect ratio: 38.83%

Stability Assessment:
  - Overall stability score: 0.0/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 0.0/100) with significant
  variation between runs (CV: 39.08%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The significant difference between raw and trimmed means suggests
  considerable intra-run jitter (15.4%) with occasional outliers within benchmark runs.

  The max/min ratio of 5.61 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 2.33 suggests
  occasional latency spikes that could affect tail latency sensitive applications.
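
For readers who want to sanity-check the dispersion and inter-run figures in a report like the one above, here is a minimal sketch using only the standard library. It assumes the usual textbook definitions (CV = standard deviation / mean, IQR = P75 - P25). The analysis script's exact formulas, such as sample versus population standard deviation, percentile interpolation, trimming, and the stability score, are not shown in this thread, so small differences from its output are expected. The function name `stability_summary` is hypothetical.

```python
import statistics


def stability_summary(latencies_ms):
    """Summarize run-to-run spread for a list of per-run latencies (ms).

    Needs at least two samples. Uses the sample standard deviation and
    inclusive percentile interpolation; both choices are assumptions.
    """
    mean = statistics.mean(latencies_ms)
    std = statistics.stdev(latencies_ms)
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile
    pct = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    p25, p50, p75, p99 = pct[24], pct[49], pct[74], pct[98]
    return {
        "mean": mean,
        "std": std,
        "cv_percent": 100.0 * std / mean,
        "iqr": p75 - p25,
        "p50": p50,
        "p99": p99,
        "max_min_ratio": max(latencies_ms) / min(latencies_ms),
        "p99_p50_ratio": p99 / p50,
    }
```

As a rough cross-check against the report, 1.14 / 2.91 is about 39% and 5.91 / 2.54 is about 2.33, which agrees with the listed CV and P99/P50 ratio up to rounding of the displayed values.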

Signed-off-by: Yang Wang <[email protected]>

pytorch-bot · 2025-06-16T18:29:34Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/11734

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Cancelled Job, 2 Pending, 3 Unrelated Failures

As of commit 1a3795e with merge base da36d8a:

CANCELLED JOB - The following job was cancelled. Please retry:

pull / unittest-arm-backend-with-no-fvp (test_pytest_models) / linux-job (gh)

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

pull / test-models-linux (add_mul, portable, linux.2xlarge) / linux-job (gh) (matched linux rule in flaky-rules.json)
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
pull / test-moshi-linux / linux-job (gh) (matched linux rule in flaky-rules.json)
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

trunk / test-qnn-optimum-model (fp32, bert) / linux-job (gh) (trunk failure)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

yangw-dev · 2025-06-17T00:54:40Z

FYI, this method could be more general, but since only ExecuTorch is using it, I just made it ExecuTorch-specific. @huydhn

yangw-dev · 2025-06-17T01:15:18Z

Excel limits sheet names to 31 characters, which can easily break in the future. @huydhn @guangy10, I think that instead of generating one file per category, maybe we can generate a list of Excel files stored in folders [private, public].

But right now, with the hard-coded abbreviations, this works fine. The Excel sheet option is there in case people want to use it.
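
One way to guard against the sheet-name limit without hard-coded abbreviations is to sanitize and truncate names generically. The sketch below is an illustration only, not what this PR implements: `safe_sheet_name` and the `~N` suffix scheme are hypothetical. It relies on two Excel rules: worksheet names are capped at 31 characters, and they may not contain `\ / ? * [ ] :`.

```python
import re

MAX_SHEET_NAME_LEN = 31  # Excel's limit on worksheet name length
INVALID_CHARS = re.compile(r"[\[\]:*?/\\]")  # characters Excel rejects in names


def safe_sheet_name(name, used):
    """Return a unique, Excel-safe sheet name and record it in `used`.

    `used` is a set of lowercased names already taken in the workbook.
    """
    base = INVALID_CHARS.sub("_", name)[:MAX_SHEET_NAME_LEN]
    candidate = base
    counter = 1
    # Excel treats sheet names as case-insensitive, so compare lowercased
    while candidate.lower() in used:
        suffix = f"~{counter}"
        candidate = base[: MAX_SHEET_NAME_LEN - len(suffix)] + suffix
        counter += 1
    used.add(candidate.lower())
    return candidate
```

Truncation loses information, so a real exporter would probably also write the full table name somewhere inside the sheet so that the shortened name can be mapped back.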

.ci/scripts/benchmark_tooling/README.md

huydhn

Stamped to unblock! Let's start using the script and improve it along the way

guangy10 · 2025-06-17T19:45:12Z

> We configured a fixed list of matching names to list limited tables

I think we should make it more flexible, as new models, recipes, and devices are always being added. For example, we recently added more models (see on dash) from huggingface/optimum-executorch to the benchmark infra, and the list will keep expanding.

Similarly with new "devices" or "backends" available, we want to be able to query the results via the script as well.

> the excel sheet has limit of sheet name len < 31, which can be easy to break in the future. @huydhn @guangy10 , I think instead of generate one file per category, maybe we can generate list of excel files stored in folders [private, public]

Yeah noticed the limits when I manually created the excel sheet. Ideally I'd like to get rid of the excel sheet by wiring the outputs from db to the analysis script directly. Given what is currently supported in this PR, what does the workflow look like if I want to rerun the analysis? That is, how is this script interfaced to the analysis script?

.ci/scripts/benchmark_tooling/README.md

.ci/scripts/benchmark_tooling/get_benchmark_analysis_data.py

yangw-dev · 2025-06-21T05:01:47Z

Just to clarify, the whole purpose of this PR is to make it easier to analyze the stability of the benchmark by the end of H1, and that should be part of the core of the benchmark infra. The work is incomplete if the newly added script cannot work with the analysis script or requires additional changes to be handled separately. The DevX is not getting better.

Please review this again. The script is now synced with the analysis script via the Excel output. I tested it with mv3 data and posted the analysis example in the comment; for the samples I saw, the results generated by the new script are similar to the results posted in #10982.

.ci/scripts/benchmark_tooling/README.md

guangy10 · 2025-06-23T19:36:30Z

.ci/scripts/benchmark_tooling/README.md

+- `--device-pools`: Filter by private device pool names (e.g., "samsung-galaxy-s22-5g", "samsung-galaxy-s22plus-5g")
+- `--backends`: Filter by specific backend names (e.g.,"xnnpack_q8")
+- `--models`: Filter by specific model names (e.g., "mv3", "meta-llama-llama-3.2-1b-instruct-qlora-int4-eo8")


Note that the example names are still incorrect
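
To make the filter options concrete, here is a hedged sketch of how flags like these might be parsed. Only the flag names come from this thread (`--startTime`, `--endTime`, `--outputType`, `--models`, `--backends`, `--device-pools`). Everything else is an assumption for illustration: that each filter accepts one or more space-separated values, that timestamps use the format shown in the example command, and the default output type. The actual script may parse its arguments differently.

```python
import argparse
from datetime import datetime, timezone


def parse_utc(value):
    """Parse 'YYYY-MM-DDTHH:MM:SS' and mark the result as UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def build_parser():
    parser = argparse.ArgumentParser(description="Illustrative filter parsing")
    parser.add_argument("--startTime", type=parse_utc, required=True)
    parser.add_argument("--endTime", type=parse_utc, required=True)
    parser.add_argument("--outputType", default="excel")  # default is assumed
    # nargs="*" (zero or more values per flag) is an assumption
    parser.add_argument("--models", nargs="*", default=[])
    parser.add_argument("--backends", nargs="*", default=[])
    parser.add_argument("--device-pools", nargs="*", default=[])
    return parser
```

For example, `build_parser().parse_args(["--startTime", "2025-04-29T09:48:57", "--endTime", "2025-05-13T22:00:00", "--models", "mv3"])` yields timezone-aware datetimes and `models == ["mv3"]`; an empty filter list would naturally mean "no filtering" for that dimension.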

guangy10

linter error to fix

final

3e8fa8f

Signed-off-by: Yang Wang <[email protected]>

facebook-github-bot added the CLA Signed label Jun 16, 2025

final

07896ea

Signed-off-by: Yang Wang <[email protected]>

yangw-dev changed the title ~~final~~ Add script to fetch benchmark results for execuTorch Jun 17, 2025

yangw-dev added 4 commits June 16, 2025 17:21

final

79f7788

Signed-off-by: Yang Wang <[email protected]>

final

101d631

Signed-off-by: Yang Wang <[email protected]>

final

83e76fe

Signed-off-by: Yang Wang <[email protected]>

final

3b4047f

Signed-off-by: Yang Wang <[email protected]>

yangw-dev requested review from guangy10 and huydhn June 17, 2025 00:54

final

1da3e87

Signed-off-by: Yang Wang <[email protected]>

yangw-dev marked this pull request as ready for review June 17, 2025 01:02

huydhn reviewed Jun 17, 2025

.ci/scripts/benchmark_tooling/README.md (outdated, resolved)

yangw-dev requested a review from huydhn June 17, 2025 17:41

yangw-dev self-assigned this Jun 17, 2025

yangw-dev added 9 commits June 17, 2025 10:58

final

2f604a0

Signed-off-by: Yang Wang <[email protected]>

final

9fa50a4

Signed-off-by: Yang Wang <[email protected]>

final

b56863d

Signed-off-by: Yang Wang <[email protected]>

final

b2ad5b6

Signed-off-by: Yang Wang <[email protected]>

final

4aced24

Signed-off-by: Yang Wang <[email protected]>

final

ab6e6cf

Signed-off-by: Yang Wang <[email protected]>

final

8e6956d

Signed-off-by: Yang Wang <[email protected]>

final

87ba460

Signed-off-by: Yang Wang <[email protected]>

final

5d22567

Signed-off-by: Yang Wang <[email protected]>

huydhn approved these changes Jun 17, 2025

guangy10 requested changes Jun 17, 2025

yangw-dev added 3 commits June 20, 2025 15:59

setup link

7be520e

Signed-off-by: Yang Wang <[email protected]>

setup link

600bb1a

Signed-off-by: Yang Wang <[email protected]>

setup link

4c0fdd2

Signed-off-by: Yang Wang <[email protected]>

yangw-dev requested review from jackzhxng, larryliu0820, swolchok, mergennachin, digantdesai and mcr229 as code owners June 21, 2025 03:43

yangw-dev requested a review from guangy10 June 21, 2025 03:53

yangw-dev added 3 commits June 20, 2025 20:56

setup link

faa2012

Signed-off-by: Yang Wang <[email protected]>

setup link

82c72eb

Signed-off-by: Yang Wang <[email protected]>

setup link

9e2ee88

Signed-off-by: Yang Wang <[email protected]>

yangw-dev removed request for swolchok, digantdesai, mergennachin, larryliu0820, jackzhxng and mcr229 June 21, 2025 05:05

Merge branch 'main' into addScript

e97850b

guangy10 approved these changes Jun 23, 2025

.ci/scripts/benchmark_tooling/README.md (outdated, resolved)

yangw-dev added 2 commits June 23, 2025 11:54

setup link

26dec1e

Signed-off-by: Yang Wang <[email protected]>

Merge branch 'main' into addScript

12b4ed7

guangy10 reviewed Jun 23, 2025

guangy10 approved these changes Jun 23, 2025

setup link

1a3795e

Signed-off-by: Yang Wang <[email protected]>

yangw-dev merged commit 7f2fcb0 into main Jun 23, 2025
197 of 201 checks passed

yangw-dev deleted the addScript branch June 23, 2025 21:11
