Script for benchmark stability assessment #10982
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/10982
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 7b6d907 with merge base 0c9a4f5.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
I recommend adding the pip dependencies to a requirements.txt next to analyze_latency_stability.py.
Maybe it would be good for the script to have its own folder.
cae229e to 3b8aa35
3b8aa35 to dd44b4e
dd44b4e to cd676b8
Fixed linter
As discussed with @yangw-dev offline, to make the stability assessment part of the benchmark infra as suggested in this post, I will merge this script under .ci/scripts together with other scripts used by CI and benchmark infra. @yangw-dev will take over from there and rework on the interface to
cd676b8 to 7b6d907
# Summary

Provide methods and a script to fetch all ExecuTorch benchmark data from the HUD API into two datasets, private and public. The script will:

- fetch all data from the HUD API for the input time range in UTC
- clean out records and tables with only FAILURE_REPORT due to job-level failures
- get all private table metrics, generate `table_name`, and find the intersecting public table metrics
- generate private and public table groups
- output data

OutputType:

- run with excel-sheet export
- run with csv export
- run with dataframe format print
- run with json format print

See more guidance in README.md.

The data is similar to the excel sheet generated manually in #10982.

The result should be the same as the HUD per-model datatable:

<img width="1480" alt="image" src="https://github.com/user-attachments/assets/7c6cc12e-50c5-4ce2-ac87-5cac650486e3" />

## helper methods: common.py

Provide a common.py helper method to convert csv and excel sheets back to the {"groupInfo":{}, "df":df.DataFrame} format.

# run with

``` bash
python3 .ci/scripts/benchmark_tooling/get_benchmark_analysis_data.py \
  --startTime "2025-04-29T09:48:57" \
  --endTime "2025-05-13T22:00:00" \
  --outputType "excel" \
  --models "mv3"

python3 .ci/scripts/benchmark_tooling/analyze_benchmark_stability.py \
  --primary-file private.xlsx \
  --reference-file public.xlsx
```

Generated excel files:
[private.xlsx](https://github.com/user-attachments/files/20844977/private.xlsx)
[public.xlsx](https://github.com/user-attachments/files/20844978/public.xlsx)

For instance, you can find the result for mv3 xnnpack_q8 on S22 Ultra, Android 14:

```
Latency Stability Analysis: table10 (Primary)
================================================================================
Model: mv3(xnnpack_q8)
Device: Samsung Galaxy S22 Ultra 5G (private)(Android 14)

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
  - Mean latency: 2.91 ms
  - Median latency (P50): 2.54 ms
  - Mean trimmed latency: 2.41 ms
  - Median trimmed latency: 2.15 ms

Dispersion Metrics:
  - Standard deviation: 1.14 ms
  - Coefficient of variation (CV): 39.08%
  - Interquartile range (IQR): 0.82 ms
  - Trimmed standard deviation: 0.76 ms
  - Trimmed coefficient of variation: 31.60%

Percentile Metrics:
  - P50 (median): 2.54 ms
  - P90: 3.88 ms
  - P95: 4.60 ms
  - P99: 5.91 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 5.6103
  - P99/P50 ratio: 2.3319
  - Mean rolling std (window=5): 0.79 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 15.37%
  - Max trimming effect ratio: 38.83%

Stability Assessment:
  - Overall stability score: 0.0/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 0.0/100) with significant variation between runs (CV: 39.08%). Performance is unpredictable and may lead to inconsistent user experience.
  The significant difference between raw and trimmed means suggests considerable intra-run jitter (15.4%) with occasional outliers within benchmark runs.
  The max/min ratio of 5.61 indicates substantial performance differences between the best and worst runs.
  The P99/P50 ratio of 2.33 suggests occasional latency spikes that could affect tail latency sensitive applications.
```

---------

Signed-off-by: Yang Wang <[email protected]>
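
For readers who want to sanity-check the dispersion and inter-jitter numbers in a report like the one above, here is a minimal sketch of how such metrics could be derived from a list of per-run latencies. It is not the code in `analyze_benchmark_stability.py`: the function names `percentile` and `stability_metrics`, the linear-interpolation percentile, and the use of the sample standard deviation are all assumptions made for illustration. The trimmed metrics, the trimming effect ratio, and the overall stability score are not covered, since their definitions are not given here.

```python
import statistics


def percentile(sorted_vals, q):
    """Percentile q (0..100) of a pre-sorted list, using linear interpolation.

    Assumption: the real script may use a different interpolation rule.
    """
    if not sorted_vals:
        raise ValueError("need at least one value")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def stability_metrics(latencies_ms, window=5):
    """Hypothetical helper: summary statistics for a series of run latencies.

    latencies_ms must be in chronological order so the rolling window
    measures variability between neighbouring runs.
    """
    if len(latencies_ms) < 2:
        raise ValueError("need at least two samples")
    vals = sorted(latencies_ms)
    mean = statistics.fmean(latencies_ms)
    std = statistics.stdev(latencies_ms)  # sample standard deviation (assumed)
    p50 = percentile(vals, 50)
    p99 = percentile(vals, 99)
    # Standard deviation of each consecutive window of runs.
    rolling = [
        statistics.stdev(latencies_ms[i:i + window])
        for i in range(len(latencies_ms) - window + 1)
    ]
    return {
        "mean": mean,
        "p50": p50,
        "std": std,
        "cv_pct": std / mean * 100.0,  # coefficient of variation
        "iqr": percentile(vals, 75) - percentile(vals, 25),
        "max_min_ratio": vals[-1] / vals[0],
        "p99_p50_ratio": p99 / p50,
        "mean_rolling_std": statistics.fmean(rolling) if rolling else float("nan"),
    }


if __name__ == "__main__":
    # Made-up latencies, only to show the call shape.
    for name, value in stability_metrics([1.0, 2.0, 3.0, 4.0, 5.0]).items():
        print(f"{name}: {value:.4f}")
```

The figures in the report are consistent with these definitions up to rounding: 1.14 ms / 2.91 ms is roughly 39%, and 5.91 ms / 2.54 ms is roughly 2.33.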
Summary
The custom script for ET benchmark stability assessment.
Then
Datasets:
The generated analysis: