
Commit 7f2fcb0

Add script to fetch benchmark results for execuTorch (#11734)
# Summary

Provide methods and a script to fetch all ExecuTorch benchmark data from the HUD API into two datasets, private and public. The script will:

- fetch all data from the HUD API for the input time range in UTC
- clean out records and tables that contain only FAILURE_REPORT due to job-level failures
- get all private table metrics, generate `table_name`, and find the intersecting public table metrics
- generate private and public table groups
- output data

OutputType:

- run with excel-sheet export
- run with csv export
- run with dataframe format print
- run with json format print

See more guidance in README.md. The data is similar to the excel sheet generated manually in #10982.

The result should be the same as the HUD per-model data table:

<img width="1480" alt="image" src="https://github.com/user-attachments/assets/7c6cc12e-50c5-4ce2-ac87-5cac650486e3" />

## Helper methods: common.py

Provide `common.py` helper methods to convert CSV and Excel sheets back to the `{"groupInfo": {}, "df": df.DataFrame}` format.

# Run with

```bash
python3 .ci/scripts/benchmark_tooling/get_benchmark_analysis_data.py \
  --startTime "2025-04-29T09:48:57" \
  --endTime "2025-05-13T22:00:00" \
  --outputType "excel" \
  --models "mv3"

python3 .ci/scripts/benchmark_tooling/analyze_benchmark_stability.py \
  --primary-file private.xlsx \
  --reference-file public.xlsx
```

Generated excel files:
[private.xlsx](https://github.com/user-attachments/files/20844977/private.xlsx)
[public.xlsx](https://github.com/user-attachments/files/20844978/public.xlsx)

For instance, you can find the result for mv3 xnnpack_q8 on the S22 Ultra (Android 14):

```
Latency Stability Analysis: table10 (Primary)
================================================================================
Model: mv3(xnnpack_q8)
Device: Samsung Galaxy S22 Ultra 5G (private)(Android 14)

Dataset Overview:
- Number of samples: 88
- Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
- Mean latency: 2.91 ms
- Median latency (P50): 2.54 ms
- Mean trimmed latency: 2.41 ms
- Median trimmed latency: 2.15 ms

Dispersion Metrics:
- Standard deviation: 1.14 ms
- Coefficient of variation (CV): 39.08%
- Interquartile range (IQR): 0.82 ms
- Trimmed standard deviation: 0.76 ms
- Trimmed coefficient of variation: 31.60%

Percentile Metrics:
- P50 (median): 2.54 ms
- P90: 3.88 ms
- P95: 4.60 ms
- P99: 5.91 ms

Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 5.6103
- P99/P50 ratio: 2.3319
- Mean rolling std (window=5): 0.79 ms

Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 15.37%
- Max trimming effect ratio: 38.83%

Stability Assessment:
- Overall stability score: 0.0/100
- Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 0.0/100) with significant variation between runs (CV: 39.08%). Performance is unpredictable and may lead to inconsistent user experience.
The significant difference between raw and trimmed means suggests considerable intra-run jitter (15.4%) with occasional outliers within benchmark runs.
The max/min ratio of 5.61 indicates substantial performance differences between the best and worst runs.
The P99/P50 ratio of 2.33 suggests occasional latency spikes that could affect tail latency sensitive applications.
```

---------

Signed-off-by: Yang Wang <[email protected]>
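For reference, the stability metrics quoted in the report above (mean, trimmed mean, coefficient of variation, percentiles, and the max/min and P99/P50 ratios) can be computed along the following lines. This is a minimal illustrative sketch using `numpy`/`scipy`, not the implementation in `analyze_benchmark_stability.py`; the function name is hypothetical.

```python
# Minimal sketch of the kinds of stability metrics reported above, using numpy
# and scipy; illustrative only, not the script's actual implementation.
import numpy as np
from scipy import stats


def stability_summary(latencies_ms):
    """Summarize latency samples (in ms) collected across repeated benchmark runs."""
    x = np.asarray(latencies_ms, dtype=float)
    mean = x.mean()
    p50, p90, p95, p99 = np.percentile(x, [50, 90, 95, 99])
    std = x.std(ddof=1)
    trimmed_mean = stats.trim_mean(x, proportiontocut=0.1)  # drop top/bottom 10%
    return {
        "mean_ms": mean,
        "p50_ms": p50,
        "p90_ms": p90,
        "p95_ms": p95,
        "p99_ms": p99,
        "std_ms": std,
        "cv_pct": 100.0 * std / mean,        # coefficient of variation
        "trimmed_mean_ms": trimmed_mean,
        "max_min_ratio": x.max() / x.min(),  # inter-run jitter
        "p99_p50_ratio": p99 / p50,          # tail-latency spikes
    }
```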
1 parent be07160 commit 7f2fcb0

File tree

8 files changed: +2048 −115 lines


.ci/docker/requirements-ci.txt

Lines changed: 3 additions & 0 deletions
@@ -28,3 +28,6 @@ matplotlib>=3.9.4
 myst-parser==0.18.1
 sphinx_design==0.4.1
 sphinx-copybutton==0.5.0
+
+# script unit test requirements
+yaspin==3.1.0
.ci/scripts/benchmark_tooling/README.md

Lines changed: 172 additions & 0 deletions

@@ -0,0 +1,172 @@
# Executorch Benchmark Tooling

A library providing tools for fetching, processing, and analyzing ExecutorchBenchmark data from the HUD Open API. This tooling helps compare performance metrics between private and public devices with identical settings.

## Table of Contents

- [Overview](#overview)
- [Installation](#installation)
- [Tools](#tools)
  - [get_benchmark_analysis_data.py](#get_benchmark_analysis_datapy)
    - [Quick Start](#quick-start)
    - [Command Line Options](#command-line-options)
    - [Example Usage](#example-usage)
    - [Working with Output Files](#working-with-output-files-csv-and-excel)
    - [Python API Usage](#python-api-usage)
- [Running Unit Tests](#running-unit-tests)

## Overview

The Executorch Benchmark Tooling provides a suite of utilities designed to:

- Fetch benchmark data from HUD Open API for specified time ranges
- Clean and process data by filtering out failures
- Compare metrics between private and public devices with matching configurations
- Generate analysis reports in various formats (CSV, Excel, JSON)
- Support filtering by device pools, backends, and models

This tooling is particularly useful for performance analysis, regression testing, and cross-device comparisons.

## Installation

Install dependencies:

```bash
pip install -r requirements.txt
```
## Tools

### get_benchmark_analysis_data.py

This script is mainly used to generate analysis data comparing private devices with public devices using the same settings.

It fetches benchmark data from the HUD Open API for a specified time range, cleans the data by removing entries with FAILURE indicators, and retrieves all private device metrics along with the equivalent public device metrics based on matching [model, backend, device_pool_names, arch] configurations. Users can filter the data by specifying private device_pool_names, backends, and models.
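To illustrate the matching step described above, the sketch below groups records by their configuration and keeps only the configurations present in both the private and public pools. The record field names (`model`, `backend`, `device`, `arch`) and helper names are illustrative assumptions, not the script's actual internals.

```python
# Illustrative sketch of pairing private and public records on matching
# settings; the field names here are assumptions, not the script's schema.
from collections import defaultdict


def group_by_config(records):
    groups = defaultdict(list)
    for record in records:
        key = (record["model"], record["backend"], record["device"], record["arch"])
        groups[key].append(record)
    return groups


def pair_private_with_public(private_records, public_records):
    private_groups = group_by_config(private_records)
    public_groups = group_by_config(public_records)
    # Keep only configurations that exist in both device pools.
    shared = private_groups.keys() & public_groups.keys()
    return {key: (private_groups[key], public_groups[key]) for key in shared}
```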
#### Quick Start

```bash
# generate excel sheets for all private devices with public devices using the same settings
python3 .ci/scripts/benchmark_tooling/get_benchmark_analysis_data.py \
  --startTime "2025-06-11T00:00:00" \
  --endTime "2025-06-17T18:00:00" \
  --outputType "excel"

# generate the benchmark stability analysis
python3 .ci/scripts/benchmark_tooling/analyze_benchmark_stability.py \
  --primary-file private.xlsx \
  --reference-file public.xlsx
```
#### Command Line Options

##### Basic Options:

- `--startTime`: Start time in ISO format (e.g., "2025-06-11T00:00:00") (required)
- `--endTime`: End time in ISO format (e.g., "2025-06-17T18:00:00") (required)
- `--env`: Choose environment ("local" or "prod", default: "prod")
- `--no-silent`: Show processing logs (default: only show results & minimum logging)

##### Output Options:

- `--outputType`: Choose output format (default: "print")
  - `print`: Display results in console
  - `json`: Generate JSON file
  - `df`: Display results in DataFrame format: `{'private': List[{'groupInfo': Dict, 'df': DF}, ...], 'public': List[{'groupInfo': Dict, 'df': DF}, ...]}`
  - `excel`: Generate Excel files with multiple sheets; the first cell (first row, first column) of each sheet contains the JSON string of the raw metadata (see the sketch after this list)
  - `csv`: Generate CSV files in separate folders; the first cell (first row, first column) contains the JSON string of the raw metadata
- `--outputDir`: Directory to save output files (default: current directory)
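As a rough illustration of the layout mentioned above, the metadata cell can also be parsed by hand with pandas. This is a hedged sketch that assumes each sheet stores the JSON metadata in its first cell and uses a hypothetical `private.xlsx` produced by an earlier run; in practice, prefer the `common.py` helpers described under [Working with Output Files](#working-with-output-files-csv-and-excel).

```python
# Illustrative only: read the raw-metadata JSON stored in the first cell of
# each sheet of an Excel file produced with --outputType "excel".
# The common.py helpers shown later do this parsing for you.
import json

import pandas as pd

sheets = pd.read_excel("private.xlsx", sheet_name=None, header=None)  # hypothetical output file
for sheet_name, sheet in sheets.items():
    group_info = json.loads(sheet.iloc[0, 0])  # JSON metadata in the first row/column
    data_rows = sheet.iloc[1:]                 # benchmark rows follow the metadata
    print(sheet_name, group_info, len(data_rows))
```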
##### Filtering Options:

- `--device-pools`: Filter by private device pool names (e.g., "samsung-galaxy-s22-5g", "samsung-galaxy-s22plus-5g")
- `--backends`: Filter by specific backend names (e.g., "xnnpack_q8")
- `--models`: Filter by specific model names (e.g., "mv3", "meta-llama-llama-3.2-1b-instruct-qlora-int4-eo8")
#### Example Usage

Filter by multiple private device pools and models:

```bash
# This fetches all private table data for models 'llama-3.2-1B' and 'mv3'
python3 get_benchmark_analysis_data.py \
  --startTime "2025-06-01T00:00:00" \
  --endTime "2025-06-11T00:00:00" \
  --device-pools 'apple_iphone_15_private' 'samsung_s22_private' \
  --models 'meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8' 'mv3'
```

Filter by a specific device pool and models:

```bash
# This fetches all private iPhone table data for models 'llama-3.2-1B' and 'mv3',
# and the associated public iPhone data
python3 get_benchmark_analysis_data.py \
  --startTime "2025-06-01T00:00:00" \
  --endTime "2025-06-11T00:00:00" \
  --device-pools 'apple_iphone_15_private' \
  --models 'meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8' 'mv3'
```
#### Working with Output Files (CSV and Excel)

You can use methods in `common.py` to convert the file data back to DataFrame format. These methods read the first row in CSV/Excel files and return results in the format `List[{"groupInfo": Dict, "df": pd.DataFrame}]`.

```python
import logging
import sys

logging.basicConfig(level=logging.INFO)

# Make the tooling importable when running from the executorch repo root
# (the module lives under .ci/scripts/benchmark_tooling/).
sys.path.append(".ci/scripts/benchmark_tooling")
from common import read_all_csv_with_metadata, read_excel_with_json_header

# For CSV files (assuming the 'private' folder is in the current directory)
folder_path = "./private"
res = read_all_csv_with_metadata(folder_path)
logging.info(res)

# For Excel files (assuming the Excel file is in the current directory)
file_path = "./private.xlsx"
res = read_excel_with_json_header(file_path)
logging.info(res)
```
#### Python API Usage

To use the benchmark fetcher in your own scripts:

```python
import sys

# Make the tooling importable when running from the executorch repo root
# (the module lives under .ci/scripts/benchmark_tooling/).
sys.path.append(".ci/scripts/benchmark_tooling")
from get_benchmark_analysis_data import ExecutorchBenchmarkFetcher

# Initialize the fetcher
fetcher = ExecutorchBenchmarkFetcher(env="prod", disable_logging=False)

# Fetch data for a specific time range
fetcher.run(
    start_time="2025-06-11T00:00:00",
    end_time="2025-06-17T18:00:00",
)

# Get results in different formats
# As DataFrames
df_results = fetcher.to_df()

# Export to Excel
fetcher.to_excel(output_dir="./results")

# Export to CSV
fetcher.to_csv(output_dir="./results")

# Export to JSON
json_path = fetcher.to_json(output_dir="./results")

# Get raw dictionary results
dict_results = fetcher.to_dict()

# Use the output_data method for flexible output
results = fetcher.output_data(output_type="excel", output_dir="./results")
```
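The objects returned by `to_df()` follow the layout described under [Command Line Options](#command-line-options). A minimal sketch of walking that structure, assuming the fetcher has been initialized and run as in the example above:

```python
# Minimal sketch: iterate the DataFrame-format results returned by to_df().
# Assumes `fetcher` was created and run as shown above, and that the layout is
# {'private': [{'groupInfo': dict, 'df': DataFrame}, ...], 'public': [...]}.
df_results = fetcher.to_df()

for kind in ("private", "public"):
    for entry in df_results.get(kind, []):
        group_info, df = entry["groupInfo"], entry["df"]
        print(kind, group_info, df.shape)
```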
## Running Unit Tests

The benchmark tooling includes unit tests to ensure functionality.

### Using pytest for unit tests

```bash
# From the executorch root directory
pytest -c /dev/null .ci/scripts/tests/test_get_benchmark_analysis_data.py
```

.ci/scripts/benchmark_tooling/__init__.py

Whitespace-only changes.

0 commit comments
