Script for benchmark stability assessment #10982

Merged: 2 commits merged on Jun 4, 2025

@guangy10 commented May 19, 2025

Summary

This PR adds a custom script for assessing the stability of ExecuTorch (ET) benchmark results. Install its dependencies first:

pip install openpyxl tabulate matplotlib

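Then run the script, passing the private-device dataset as the primary file and the public-device dataset as the reference file: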

python .ci/scripts/analyze_benchmark_stability.py \
    Benchmark\ Dataset\ with\ Private\ AWS\ Devices.xlsx \
    --reference_file Benchmark\ Dataset\ with\ Public\ AWS\ Devices.xlsx

Datasets:

The generated analysis:

Analyzing latency stability from primary file: /Users/guangyang/Desktop/Benchmark Dataset with Private AWS Devices.xlsx
Using reference file for comparison: /Users/guangyang/Desktop/Benchmark Dataset with Public AWS Devices.xlsx


====================================================================================================
===== LOADING PRIMARY DATASETS (Private) ==========================================================
====================================================================================================

Loading dataset: llama3_qlora+s22_android13
Loading dataset: llama3_spinq+s22_android13
Loading dataset: mv3_qnn+s22_android13
Loading dataset: mv3_xnnq8+s22_android13
Loading dataset: llama3_qlora+s22ultra_android14
Loading dataset: llama3_spinq+s22ultra_android14
Loading dataset: mv3_qnn+s22ultra_android14
Loading dataset: mv3_xnnq8+s22ultra_android14
Loading dataset: mv3_xnnq8+pixel3_rooted_android
Loading dataset: llama3_qlora+iphone15max_ios17
Loading dataset: llama3_spinq+iphone15max_ios17
Loading dataset: mv3_xnnq8+iphone15max_ios17
Loading dataset: mv3_coreml+iphone15max_ios17
Loading dataset: mv3_mps+iphone15max_ios17
Loading dataset: llama3_qlora+iphone15_ios18
Loading dataset: llama3_spinq+iphone15_ios18
Loading dataset: mv3_xnnq8+iphone15_ios18
Loading dataset: mv3_coreml+iphone15_ios18
Loading dataset: mv3_mps+iphone15_ios18


====================================================================================================
===== LOADING REFERENCE DATASETS (Public) =========================================================
====================================================================================================

Loading reference dataset: llama3_qlora+s22_android13
Loading reference dataset: llama3_spinq+s22_android13
Loading reference dataset: mv3_qnn+s22_android13
Loading reference dataset: mv3_xnnq8+s22_android13
Loading reference dataset: llama3_spinq+s22_android12
Loading reference dataset: llama3_qlora+s22Ultra5G_android
Loading reference dataset: llama3_spinq+s22ultra_android12
Loading reference dataset: mv3_xnnq8+s22ultra_android12
Loading reference dataset: mv3_qnn+s22ultra_android12
Loading reference dataset: llama3_qlora+iphone15max_ios17
Loading reference dataset: llama3_spinq+iphone15max_ios17
Loading reference dataset: mv3_xnnq8+iphone15max_ios17
Loading reference dataset: mv3_coreml+iphone15max_ios17
Loading reference dataset: mv3_mps+iphone15max_ios17
Loading reference dataset: llama3_qlora+iphone15_ios18
Loading reference dataset: llama3_spinq+iphone15_ios18
Loading reference dataset: mv3_xnnq8+iphone15_ios18
Loading reference dataset: mv3_coreml+iphone15_ios18
Loading reference dataset: mv3_mps+iphone15_ios18


====================================================================================================
===== ANALYZING PRIMARY DATASETS ==================================================================
====================================================================================================


Latency Stability Analysis: llama3_qlora+s22_android13 (Primary)
================================================================================
Model: llama3_qlora
Device: s22_android13

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
  - Mean latency: 22502.10 ms
  - Median latency (P50): 22447.56 ms
  - Mean trimmed latency: 22388.87 ms
  - Median trimmed latency: 22343.47 ms

Dispersion Metrics:
  - Standard deviation: 595.01 ms
  - Coefficient of variation (CV): 2.64%
  - Interquartile range (IQR): 858.26 ms
  - Trimmed standard deviation: 596.25 ms
  - Trimmed coefficient of variation: 2.66%

Percentile Metrics:
  - P50 (median): 22447.56 ms
  - P90: 23231.99 ms
  - P95: 23518.35 ms
  - P99: 23910.11 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.1423
  - P99/P50 ratio: 1.0652
  - Mean rolling std (window=5): 539.36 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 0.50%
  - Max trimming effect ratio: 0.81%

Throughput Metrics:
  - Mean TPS: 33.07
  - TPS coefficient of variation: 6.92%

Stability Assessment:
  - Overall stability score: 83.4/100
  - Overall stability rating: Good

Interpretation:
  The benchmark shows good stability (score: 83.4/100) with low
  variation between runs (CV: 2.64%).
  Performance is consistent and predictable for most use cases.

================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+s22_android13_primary_time_series.png
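
For readers unfamiliar with the dispersion and inter-jitter figures in the report above, the following is a minimal sketch of how such values could be derived from a list of per-run latencies. It is not taken from the script: the function name `dispersion_summary`, the use of the sample standard deviation, and the linear-interpolation percentile method are all assumptions made for illustration.

```python
import statistics


def dispersion_summary(latencies_ms):
    """Summarize run-to-run variation for per-run latencies in ms.

    Hypothetical helper, not the script's own code.
    """
    mean = statistics.fmean(latencies_ms)
    # Assumption: sample (n - 1) standard deviation.
    std = statistics.stdev(latencies_ms)
    # 99 cut points; index k - 1 holds the k-th percentile.
    # Assumption: linear interpolation ("inclusive" method).
    pct = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    p25, p50, p75, p99 = pct[24], pct[49], pct[74], pct[98]
    return {
        "mean": mean,
        "median": p50,
        "std": std,
        "cv_percent": 100.0 * std / mean,  # coefficient of variation
        "iqr": p75 - p25,                  # interquartile range
        "max_min_ratio": max(latencies_ms) / min(latencies_ms),
        "p99_p50_ratio": p99 / p50,
    }


summary = dispersion_summary([100.0, 102.0, 98.0, 101.0, 99.0])
```

The coefficient of variation divides the standard deviation by the mean, which is what lets a 595 ms spread on a 22.5 s benchmark and a 0.02 ms spread on a 1 ms benchmark be compared on the same percentage scale.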

Latency Stability Analysis: llama3_spinq+s22_android13 (Primary)
================================================================================
Model: llama3_spinq
Device: s22_android13

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
  - Mean latency: 21771.59 ms
  - Median latency (P50): 21668.24 ms
  - Mean trimmed latency: 21662.53 ms
  - Median trimmed latency: 21559.89 ms

Dispersion Metrics:
  - Standard deviation: 514.89 ms
  - Coefficient of variation (CV): 2.36%
  - Interquartile range (IQR): 602.75 ms
  - Trimmed standard deviation: 515.03 ms
  - Trimmed coefficient of variation: 2.38%

Percentile Metrics:
  - P50 (median): 21668.24 ms
  - P90: 22438.74 ms
  - P95: 22542.42 ms
  - P99: 23104.76 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.1452
  - P99/P50 ratio: 1.0663
  - Mean rolling std (window=5): 449.10 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 0.50%
  - Max trimming effect ratio: 0.89%

Throughput Metrics:
  - Mean TPS: 33.76
  - TPS coefficient of variation: 4.70%

Stability Assessment:
  - Overall stability score: 84.7/100
  - Overall stability rating: Good

Interpretation:
  The benchmark shows good stability (score: 84.7/100) with low
  variation between runs (CV: 2.36%).
  Performance is consistent and predictable for most use cases.

================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+s22_android13_primary_time_series.png
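
The "Mean rolling std (window=5)" line in the reports can be read as the average of the standard deviations taken over each run of five consecutive samples. A sketch under that reading follows; the helper name `mean_rolling_std`, the choice of sample standard deviation, and the handling of short series are assumptions, not the script's confirmed behavior.

```python
import statistics


def mean_rolling_std(latencies_ms, window=5):
    """Average the standard deviation over every full sliding window.

    Hypothetical helper; assumes samples are already in time order.
    """
    if len(latencies_ms) < window:
        raise ValueError("need at least `window` samples")
    stds = [
        statistics.stdev(latencies_ms[i:i + window])
        for i in range(len(latencies_ms) - window + 1)
    ]
    return statistics.fmean(stds)


flat = mean_rolling_std([5.0] * 8)                       # no local variation
ramp = mean_rolling_std([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # steady drift
```

Because each window only sees neighboring runs, this figure reflects short-term noise; a slow drift over the whole date range raises the overall standard deviation more than it raises the rolling one.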

Latency Stability Analysis: mv3_qnn+s22_android13 (Primary)
================================================================================
Model: mv3_qnn
Device: s22_android13

Dataset Overview:
  - Number of samples: 100
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-15 21:14:41+00:00

Central Tendency Metrics:
  - Mean latency: 1.01 ms
  - Median latency (P50): 1.00 ms
  - Mean trimmed latency: 1.00 ms
  - Median trimmed latency: 1.00 ms

Dispersion Metrics:
  - Standard deviation: 0.02 ms
  - Coefficient of variation (CV): 2.34%
  - Interquartile range (IQR): 0.01 ms
  - Trimmed standard deviation: 0.02 ms
  - Trimmed coefficient of variation: 2.27%

Percentile Metrics:
  - P50 (median): 1.00 ms
  - P90: 1.01 ms
  - P95: 1.01 ms
  - P99: 1.14 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.1919
  - P99/P50 ratio: 1.1404
  - Mean rolling std (window=5): 0.01 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 0.19%
  - Max trimming effect ratio: 1.00%

Stability Assessment:
  - Overall stability score: 82.4/100
  - Overall stability rating: Good

Interpretation:
  The benchmark shows good stability (score: 82.4/100) with low
  variation between runs (CV: 2.34%).
  Performance is consistent and predictable for most use cases.

  The P99/P50 ratio of 1.14 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_qnn+s22_android13_primary_time_series.png

Latency Stability Analysis: mv3_xnnq8+s22_android13 (Primary)
================================================================================
Model: mv3_xnnq8
Device: s22_android13

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
  - Mean latency: 2.73 ms
  - Median latency (P50): 2.65 ms
  - Mean trimmed latency: 2.22 ms
  - Median trimmed latency: 2.10 ms

Dispersion Metrics:
  - Standard deviation: 0.63 ms
  - Coefficient of variation (CV): 23.03%
  - Interquartile range (IQR): 0.95 ms
  - Trimmed standard deviation: 0.36 ms
  - Trimmed coefficient of variation: 15.98%

Percentile Metrics:
  - P50 (median): 2.65 ms
  - P90: 3.59 ms
  - P95: 3.74 ms
  - P99: 4.46 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 2.4427
  - P99/P50 ratio: 1.6812
  - Mean rolling std (window=5): 0.60 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 16.52%
  - Max trimming effect ratio: 36.96%

Stability Assessment:
  - Overall stability score: 14.9/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 14.9/100) with significant
  variation between runs (CV: 23.03%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The significant difference between raw and trimmed means suggests
  considerable intra-run jitter (16.5%) with occasional outliers within benchmark runs.

  The max/min ratio of 2.44 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.68 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+s22_android13_primary_time_series.png
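
The intra-jitter figures compare each run's raw latency with its trimmed latency. The log does not state the formula, so the sketch below assumes the trimming effect ratio is `(raw - trimmed) / raw` computed per run and then aggregated; the names `trimming_effect_ratio` and `intra_jitter` are hypothetical.

```python
import statistics


def trimming_effect_ratio(raw_ms, trimmed_ms):
    """Fraction of the raw latency removed by trimming (assumed definition)."""
    return (raw_ms - trimmed_ms) / raw_ms


def intra_jitter(runs):
    """Aggregate per-run ratios; `runs` is a list of (raw_ms, trimmed_ms)."""
    ratios = [trimming_effect_ratio(raw, trimmed) for raw, trimmed in runs]
    return {
        "mean_percent": 100.0 * statistics.fmean(ratios),
        "max_percent": 100.0 * max(ratios),
    }


# Made-up values for illustration only, not taken from the datasets.
jitter = intra_jitter([(2.0, 1.8), (4.0, 3.0)])
```

Under this definition the mean is an average of per-run ratios, so it need not equal the ratio computed from the reported mean raw and mean trimmed latencies.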

Latency Stability Analysis: llama3_qlora+s22ultra_android14 (Primary)
================================================================================
Model: llama3_qlora
Device: s22ultra_android14

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
  - Mean latency: 25022.84 ms
  - Median latency (P50): 25427.33 ms
  - Mean trimmed latency: 24748.06 ms
  - Median trimmed latency: 25062.01 ms

Dispersion Metrics:
  - Standard deviation: 1545.62 ms
  - Coefficient of variation (CV): 6.18%
  - Interquartile range (IQR): 2844.11 ms
  - Trimmed standard deviation: 1467.60 ms
  - Trimmed coefficient of variation: 5.93%

Percentile Metrics:
  - P50 (median): 25427.33 ms
  - P90: 26581.31 ms
  - P95: 27184.07 ms
  - P99: 28668.97 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.2710
  - P99/P50 ratio: 1.1275
  - Mean rolling std (window=5): 1560.71 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 1.08%
  - Max trimming effect ratio: 4.80%

Throughput Metrics:
  - Mean TPS: 28.35
  - TPS coefficient of variation: 7.88%

Stability Assessment:
  - Overall stability score: 62.5/100
  - Overall stability rating: Moderate

Interpretation:
  The benchmark shows moderate stability (score: 62.5/100) with noticeable
  variation between runs (CV: 6.18%).
  While average performance is acceptable, occasional latency spikes may occur.

  The max/min ratio of 1.27 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.13 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+s22ultra_android14_primary_time_series.png

Latency Stability Analysis: llama3_spinq+s22ultra_android14 (Primary)
================================================================================
Model: llama3_spinq
Device: s22ultra_android14

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
  - Mean latency: 24761.78 ms
  - Median latency (P50): 25043.89 ms
  - Mean trimmed latency: 24466.21 ms
  - Median trimmed latency: 24731.04 ms

Dispersion Metrics:
  - Standard deviation: 1552.25 ms
  - Coefficient of variation (CV): 6.27%
  - Interquartile range (IQR): 1931.42 ms
  - Trimmed standard deviation: 1466.19 ms
  - Trimmed coefficient of variation: 5.99%

Percentile Metrics:
  - P50 (median): 25043.89 ms
  - P90: 26163.60 ms
  - P95: 26948.68 ms
  - P99: 28868.51 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.3648
  - P99/P50 ratio: 1.1527
  - Mean rolling std (window=5): 1451.05 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 1.17%
  - Max trimming effect ratio: 4.90%

Throughput Metrics:
  - Mean TPS: 29.85
  - TPS coefficient of variation: 8.24%

Stability Assessment:
  - Overall stability score: 60.3/100
  - Overall stability rating: Moderate

Interpretation:
  The benchmark shows moderate stability (score: 60.3/100) with noticeable
  variation between runs (CV: 6.27%).
  While average performance is acceptable, occasional latency spikes may occur.

  The max/min ratio of 1.36 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.15 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+s22ultra_android14_primary_time_series.png

Latency Stability Analysis: mv3_qnn+s22ultra_android14 (Primary)
================================================================================
Model: mv3_qnn
Device: s22ultra_android14

Dataset Overview:
  - Number of samples: 100
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-15 21:14:41+00:00

Central Tendency Metrics:
  - Mean latency: 1.01 ms
  - Median latency (P50): 1.01 ms
  - Mean trimmed latency: 1.01 ms
  - Median trimmed latency: 1.01 ms

Dispersion Metrics:
  - Standard deviation: 0.01 ms
  - Coefficient of variation (CV): 0.91%
  - Interquartile range (IQR): 0.01 ms
  - Trimmed standard deviation: 0.01 ms
  - Trimmed coefficient of variation: 0.70%

Percentile Metrics:
  - P50 (median): 1.01 ms
  - P90: 1.02 ms
  - P95: 1.02 ms
  - P99: 1.03 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.0900
  - P99/P50 ratio: 1.0204
  - Mean rolling std (window=5): 0.01 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 0.19%
  - Max trimming effect ratio: 1.94%

Stability Assessment:
  - Overall stability score: 93.8/100
  - Overall stability rating: Excellent

Interpretation:
  The benchmark shows excellent stability (score: 93.8/100) with very low
  variation between runs (CV: 0.91%).
  This indicates highly consistent performance suitable for latency-sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_qnn+s22ultra_android14_primary_time_series.png

Latency Stability Analysis: mv3_xnnq8+s22ultra_android14 (Primary)
================================================================================
Model: mv3_xnnq8
Device: s22ultra_android14

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
  - Mean latency: 2.91 ms
  - Median latency (P50): 2.54 ms
  - Mean trimmed latency: 2.41 ms
  - Median trimmed latency: 2.15 ms

Dispersion Metrics:
  - Standard deviation: 1.14 ms
  - Coefficient of variation (CV): 39.08%
  - Interquartile range (IQR): 0.82 ms
  - Trimmed standard deviation: 0.76 ms
  - Trimmed coefficient of variation: 31.60%

Percentile Metrics:
  - P50 (median): 2.54 ms
  - P90: 3.88 ms
  - P95: 4.60 ms
  - P99: 5.91 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 5.6103
  - P99/P50 ratio: 2.3319
  - Mean rolling std (window=5): 0.79 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 15.37%
  - Max trimming effect ratio: 38.83%

Stability Assessment:
  - Overall stability score: 0.0/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 0.0/100) with significant
  variation between runs (CV: 39.08%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The significant difference between raw and trimmed means suggests
  considerable intra-run jitter (15.4%) with occasional outliers within benchmark runs.

  The max/min ratio of 5.61 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 2.33 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+s22ultra_android14_primary_time_series.png

Latency Stability Analysis: mv3_xnnq8+pixel3_rooted_android (Primary)
================================================================================
Model: mv3_xnnq8
Device: pixel3_rooted_android

Dataset Overview:
  - Number of samples: 148
  - Date range: 2025-04-16 02:47:21+00:00 to 2025-04-29 01:17:49+00:00

Central Tendency Metrics:
  - Mean latency: 5.93 ms
  - Median latency (P50): 5.87 ms
  - Mean trimmed latency: 5.51 ms
  - Median trimmed latency: 5.45 ms

Dispersion Metrics:
  - Standard deviation: 0.46 ms
  - Coefficient of variation (CV): 7.68%
  - Interquartile range (IQR): 0.56 ms
  - Trimmed standard deviation: 0.27 ms
  - Trimmed coefficient of variation: 4.84%

Percentile Metrics:
  - P50 (median): 5.87 ms
  - P90: 6.44 ms
  - P95: 6.57 ms
  - P99: 7.26 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.6964
  - P99/P50 ratio: 1.2386
  - Mean rolling std (window=5): 0.41 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 6.66%
  - Max trimming effect ratio: 26.67%

Stability Assessment:
  - Overall stability score: 46.9/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 46.9/100) with significant
  variation between runs (CV: 7.68%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The significant difference between raw and trimmed means suggests
  considerable intra-run jitter (6.7%) with occasional outliers within benchmark runs.

  The max/min ratio of 1.70 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.24 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+pixel3_rooted_android_primary_time_series.png

Latency Stability Analysis: llama3_qlora+iphone15max_ios17 (Primary)
================================================================================
Model: llama3_qlora
Device: iphone15max_ios17

Dataset Overview:
  - Number of samples: 54
  - Date range: 2025-04-29 21:26:38+00:00 to 2025-05-10 09:24:40+00:00

Central Tendency Metrics:
  - Mean latency: 12972.80 ms
  - Median latency (P50): 12774.50 ms

Dispersion Metrics:
  - Standard deviation: 483.26 ms
  - Coefficient of variation (CV): 3.73%
  - Interquartile range (IQR): 624.00 ms

Percentile Metrics:
  - P50 (median): 12774.50 ms
  - P90: 13389.70 ms
  - P95: 13736.05 ms
  - P99: 14730.49 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.1916
  - P99/P50 ratio: 1.1531
  - Mean rolling std (window=5): 431.32 ms

Throughput Metrics:
  - Mean TPS: 10.18
  - TPS coefficient of variation: 11.47%

Stability Assessment:
  - Overall stability score: 75.2/100
  - Overall stability rating: Moderate

Interpretation:
  The benchmark shows moderate stability (score: 75.2/100) with noticeable
  variation between runs (CV: 3.73%).
  While average performance is acceptable, occasional latency spikes may occur.

  The P99/P50 ratio of 1.15 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+iphone15max_ios17_primary_time_series.png

Latency Stability Analysis: llama3_spinq+iphone15max_ios17 (Primary)
================================================================================
Model: llama3_spinq
Device: iphone15max_ios17

Dataset Overview:
  - Number of samples: 54
  - Date range: 2025-04-29 21:26:38+00:00 to 2025-05-10 09:24:40+00:00

Central Tendency Metrics:
  - Mean latency: 12195.41 ms
  - Median latency (P50): 12104.50 ms

Dispersion Metrics:
  - Standard deviation: 461.27 ms
  - Coefficient of variation (CV): 3.78%
  - Interquartile range (IQR): 154.25 ms

Percentile Metrics:
  - P50 (median): 12104.50 ms
  - P90: 12567.20 ms
  - P95: 12760.05 ms
  - P99: 14052.31 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.3331
  - P99/P50 ratio: 1.1609
  - Mean rolling std (window=5): 365.79 ms

Throughput Metrics:
  - Mean TPS: 13.89
  - TPS coefficient of variation: 16.58%

Stability Assessment:
  - Overall stability score: 72.9/100
  - Overall stability rating: Moderate

Interpretation:
  The benchmark shows moderate stability (score: 72.9/100) with noticeable
  variation between runs (CV: 3.78%).
  While average performance is acceptable, occasional latency spikes may occur.

  The max/min ratio of 1.33 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.16 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+iphone15max_ios17_primary_time_series.png

Latency Stability Analysis: mv3_xnnq8+iphone15max_ios17 (Primary)
================================================================================
Model: mv3_xnnq8
Device: iphone15max_ios17

Dataset Overview:
  - Number of samples: 54
  - Date range: 2025-04-29 21:26:38+00:00 to 2025-05-10 09:24:40+00:00

Central Tendency Metrics:
  - Mean latency: 13.98 ms
  - Median latency (P50): 14.00 ms

Dispersion Metrics:
  - Standard deviation: 3.44 ms
  - Coefficient of variation (CV): 24.60%
  - Interquartile range (IQR): 4.00 ms

Percentile Metrics:
  - P50 (median): 14.00 ms
  - P90: 18.00 ms
  - P95: 20.00 ms
  - P99: 21.94 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 3.2857
  - P99/P50 ratio: 1.5671
  - Mean rolling std (window=5): 3.40 ms

Stability Assessment:
  - Overall stability score: 10.8/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 10.8/100) with significant
  variation between runs (CV: 24.60%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The max/min ratio of 3.29 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.57 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+iphone15max_ios17_primary_time_series.png

Latency Stability Analysis: mv3_coreml+iphone15max_ios17 (Primary)
================================================================================
Model: mv3_coreml
Device: iphone15max_ios17

Dataset Overview:
  - Number of samples: 50
  - Date range: 2025-04-30 05:23:09+00:00 to 2025-05-10 09:24:40+00:00

Central Tendency Metrics:
  - Mean latency: 1.00 ms
  - Median latency (P50): 1.00 ms

Dispersion Metrics:
  - Standard deviation: 0.00 ms
  - Coefficient of variation (CV): 0.00%
  - Interquartile range (IQR): 0.00 ms

Percentile Metrics:
  - P50 (median): 1.00 ms
  - P90: 1.00 ms
  - P95: 1.00 ms
  - P99: 1.00 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.0000
  - P99/P50 ratio: 1.0000
  - Mean rolling std (window=5): 0.00 ms

Stability Assessment:
  - Overall stability score: 100.0/100
  - Overall stability rating: Excellent

Interpretation:
  The benchmark shows excellent stability (score: 100.0/100) with very low
  variation between runs (CV: 0.00%).
  This indicates highly consistent performance suitable for latency-sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_coreml+iphone15max_ios17_primary_time_series.png

Latency Stability Analysis: mv3_mps+iphone15max_ios17 (Primary)
================================================================================
Model: mv3_mps
Device: iphone15max_ios17

Dataset Overview:
  - Number of samples: 51
  - Date range: 2025-04-29 21:26:38+00:00 to 2025-05-10 09:24:40+00:00

Central Tendency Metrics:
  - Mean latency: 1.25 ms
  - Median latency (P50): 1.00 ms

Dispersion Metrics:
  - Standard deviation: 0.44 ms
  - Coefficient of variation (CV): 35.07%
  - Interquartile range (IQR): 0.50 ms

Percentile Metrics:
  - P50 (median): 1.00 ms
  - P90: 2.00 ms
  - P95: 2.00 ms
  - P99: 2.00 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 2.0000
  - P99/P50 ratio: 2.0000
  - Mean rolling std (window=5): 0.39 ms

Stability Assessment:
  - Overall stability score: 12.5/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 12.5/100) with significant
  variation between runs (CV: 35.07%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The max/min ratio of 2.00 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 2.00 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_mps+iphone15max_ios17_primary_time_series.png

Latency Stability Analysis: llama3_qlora+iphone15_ios18 (Primary)
================================================================================
Model: llama3_qlora
Device: iphone15_ios18

Dataset Overview:
  - Number of samples: 121
  - Date range: 2025-04-29 21:26:38+00:00 to 2025-05-22 22:41:19+00:00

Central Tendency Metrics:
  - Mean latency: 23169.07 ms
  - Median latency (P50): 21328.00 ms

Dispersion Metrics:
  - Standard deviation: 5889.20 ms
  - Coefficient of variation (CV): 25.42%
  - Interquartile range (IQR): 8558.00 ms

Percentile Metrics:
  - P50 (median): 21328.00 ms
  - P90: 31324.00 ms
  - P95: 33057.00 ms
  - P99: 40256.40 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 3.0072
  - P99/P50 ratio: 1.8875
  - Mean rolling std (window=5): 4851.03 ms

Throughput Metrics:
  - Mean TPS: 3.32
  - TPS coefficient of variation: 34.24%

Stability Assessment:
  - Overall stability score: 2.8/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 2.8/100) with significant
  variation between runs (CV: 25.42%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The max/min ratio of 3.01 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.89 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+iphone15_ios18_primary_time_series.png

Latency Stability Analysis: llama3_spinq+iphone15_ios18 (Primary)
================================================================================
Model: llama3_spinq
Device: iphone15_ios18

Dataset Overview:
  - Number of samples: 116
  - Date range: 2025-04-29 21:26:38+00:00 to 2025-05-22 22:41:19+00:00

Central Tendency Metrics:
  - Mean latency: 22076.03 ms
  - Median latency (P50): 20174.00 ms

Dispersion Metrics:
  - Standard deviation: 6076.94 ms
  - Coefficient of variation (CV): 27.53%
  - Interquartile range (IQR): 7826.00 ms

Percentile Metrics:
  - P50 (median): 20174.00 ms
  - P90: 32507.00 ms
  - P95: 34673.00 ms
  - P99: 37690.75 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 2.7320
  - P99/P50 ratio: 1.8683
  - Mean rolling std (window=5): 4837.19 ms

Throughput Metrics:
  - Mean TPS: 4.90
  - TPS coefficient of variation: 35.91%

Stability Assessment:
  - Overall stability score: 6.6/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 6.6/100) with significant
  variation between runs (CV: 27.53%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The max/min ratio of 2.73 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.87 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+iphone15_ios18_primary_time_series.png

Latency Stability Analysis: mv3_xnnq8+iphone15_ios18 (Primary)
================================================================================
Model: mv3_xnnq8
Device: iphone15_ios18

Dataset Overview:
  - Number of samples: 121
  - Date range: 2025-04-29 21:26:38+00:00 to 2025-05-22 22:41:19+00:00

Central Tendency Metrics:
  - Mean latency: 48.23 ms
  - Median latency (P50): 47.00 ms

Dispersion Metrics:
  - Standard deviation: 6.19 ms
  - Coefficient of variation (CV): 12.84%
  - Interquartile range (IQR): 6.00 ms

Percentile Metrics:
  - P50 (median): 47.00 ms
  - P90: 55.00 ms
  - P95: 57.00 ms
  - P99: 64.40 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 2.2973
  - P99/P50 ratio: 1.3702
  - Mean rolling std (window=5): 5.53 ms

Stability Assessment:
  - Overall stability score: 24.5/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 24.5/100) with significant
  variation between runs (CV: 12.84%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The max/min ratio of 2.30 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.37 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+iphone15_ios18_primary_time_series.png

Latency Stability Analysis: mv3_coreml+iphone15_ios18 (Primary)
================================================================================
Model: mv3_coreml
Device: iphone15_ios18

Dataset Overview:
  - Number of samples: 114
  - Date range: 2025-04-30 05:23:09+00:00 to 2025-05-22 22:41:19+00:00

Central Tendency Metrics:
  - Mean latency: 1.00 ms
  - Median latency (P50): 1.00 ms

Dispersion Metrics:
  - Standard deviation: 0.00 ms
  - Coefficient of variation (CV): 0.00%
  - Interquartile range (IQR): 0.00 ms

Percentile Metrics:
  - P50 (median): 1.00 ms
  - P90: 1.00 ms
  - P95: 1.00 ms
  - P99: 1.00 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.0000
  - P99/P50 ratio: 1.0000
  - Mean rolling std (window=5): 0.00 ms

Stability Assessment:
  - Overall stability score: 100.0/100
  - Overall stability rating: Excellent

Interpretation:
  The benchmark shows excellent stability (score: 100.0/100) with very low
  variation between runs (CV: 0.00%).
  This indicates highly consistent performance suitable for latency-sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_coreml+iphone15_ios18_primary_time_series.png

Latency Stability Analysis: mv3_mps+iphone15_ios18 (Primary)
================================================================================
Model: mv3_mps
Device: iphone15_ios18

Dataset Overview:
  - Number of samples: 118
  - Date range: 2025-04-29 21:26:38+00:00 to 2025-05-22 22:41:19+00:00

Central Tendency Metrics:
  - Mean latency: 4.01 ms
  - Median latency (P50): 4.00 ms

Dispersion Metrics:
  - Standard deviation: 0.16 ms
  - Coefficient of variation (CV): 3.99%
  - Interquartile range (IQR): 0.00 ms

Percentile Metrics:
  - P50 (median): 4.00 ms
  - P90: 4.00 ms
  - P95: 4.00 ms
  - P99: 4.83 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.6667
  - P99/P50 ratio: 1.2075
  - Mean rolling std (window=5): 0.06 ms

Stability Assessment:
  - Overall stability score: 66.5/100
  - Overall stability rating: Moderate

Interpretation:
  The benchmark shows moderate stability (score: 66.5/100) with noticeable
  variation between runs (CV: 3.99%).
  While average performance is acceptable, occasional latency spikes may occur.

  The max/min ratio of 1.67 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.21 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_mps+iphone15_ios18_primary_time_series.png
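
The "Mean rolling std (window=5)" line can be read as the average of the standard deviations taken over each run of five consecutive samples. The sketch below shows one way to compute that. The helper name `mean_rolling_std` is hypothetical, and using only full windows with the sample standard deviation is an assumption that may differ from the script.

```python
import statistics


def mean_rolling_std(values, window=5):
    """Average the sample std over every full sliding window (hypothetical helper)."""
    if window < 2:
        raise ValueError("window must be at least 2 for a sample std")
    if len(values) < window:
        raise ValueError("need at least `window` samples")
    # Assumption: only full windows are used; partial windows at the
    # start of the series are skipped.
    stds = [
        statistics.stdev(values[i:i + window])
        for i in range(len(values) - window + 1)
    ]
    return statistics.mean(stds)


if __name__ == "__main__":
    # Made-up series with a single spike, for illustration only.
    series = [4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 6.0, 4.0, 4.0, 4.0, 4.0, 4.0]
    print(mean_rolling_std(series))
    print(statistics.stdev(series))
```

Under this reading, a rolling std well below the overall standard deviation (0.06 ms against 0.16 ms for mv3_mps+iphone15_ios18) would point to a few isolated spikes rather than sustained run-to-run noise.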


====================================================================================================
===== ANALYZING REFERENCE DATASETS ================================================================
====================================================================================================


Latency Stability Analysis: llama3_qlora+s22_android13 (Reference)
================================================================================
Model: llama3_qlora
Device: s22_android13

Dataset Overview:
  - Number of samples: 48
  - Date range: 2025-04-29 09:14:21+00:00 to 2025-05-16 01:48:22+00:00

Central Tendency Metrics:
  - Mean latency: 23841.98 ms
  - Median latency (P50): 23381.83 ms
  - Mean trimmed latency: 23727.32 ms
  - Median trimmed latency: 23286.98 ms

Dispersion Metrics:
  - Standard deviation: 2079.97 ms
  - Coefficient of variation (CV): 8.72%
  - Interquartile range (IQR): 3183.16 ms
  - Trimmed standard deviation: 2068.95 ms
  - Trimmed coefficient of variation: 8.72%

Percentile Metrics:
  - P50 (median): 23381.83 ms
  - P90: 26530.88 ms
  - P95: 27370.45 ms
  - P99: 28001.62 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.4300
  - P99/P50 ratio: 1.1976
  - Mean rolling std (window=5): 1967.20 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 0.48%
  - Max trimming effect ratio: 1.00%

Throughput Metrics:
  - Mean TPS: 32.18
  - TPS coefficient of variation: 7.85%

Stability Assessment:
  - Overall stability score: 46.1/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 46.1/100) with significant
  variation between runs (CV: 8.72%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The max/min ratio of 1.43 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.20 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+s22_android13_reference_time_series.png
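
The intra-jitter "trimming effect ratio" appears to measure how far the trimmed latency of a run sits from its raw latency, relative to the raw value. The sketch below assumes that definition. Both function names are hypothetical, and taking the absolute difference is an assumption rather than something this output confirms.

```python
def trimming_effect_percent(raw_ms, trimmed_ms):
    """Relative gap between raw and trimmed latency for one run, in percent."""
    # Assumption: absolute difference normalised by the raw latency.
    return 100.0 * abs(raw_ms - trimmed_ms) / raw_ms


def trimming_effect_summary(runs):
    """Return (mean, max) of the per-run ratios for (raw_ms, trimmed_ms) pairs."""
    ratios = [trimming_effect_percent(raw, trimmed) for raw, trimmed in runs]
    return sum(ratios) / len(ratios), max(ratios)


if __name__ == "__main__":
    # Made-up (raw, trimmed) pairs, for illustration only.
    print(trimming_effect_summary([(100.0, 99.0), (120.0, 119.4), (95.0, 94.8)]))
```

Applying the same formula to the aggregate means reported above gives 100 × (23841.98 − 23727.32) / 23841.98 ≈ 0.48, in line with the reported mean trimming effect ratio of 0.48%. The script's figure is presumably an average of per-run ratios, so the two need not agree exactly on other datasets.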

Latency Stability Analysis: llama3_spinq+s22_android13 (Reference)
================================================================================
Model: llama3_spinq
Device: s22_android13

Dataset Overview:
  - Number of samples: 48
  - Date range: 2025-04-29 09:14:21+00:00 to 2025-05-16 01:48:22+00:00

Central Tendency Metrics:
  - Mean latency: 22774.60 ms
  - Median latency (P50): 22491.89 ms
  - Mean trimmed latency: 22648.15 ms
  - Median trimmed latency: 22393.30 ms

Dispersion Metrics:
  - Standard deviation: 1947.04 ms
  - Coefficient of variation (CV): 8.55%
  - Interquartile range (IQR): 3455.61 ms
  - Trimmed standard deviation: 1930.79 ms
  - Trimmed coefficient of variation: 8.53%

Percentile Metrics:
  - P50 (median): 22491.89 ms
  - P90: 25323.67 ms
  - P95: 25925.82 ms
  - P99: 26148.53 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.3483
  - P99/P50 ratio: 1.1626
  - Mean rolling std (window=5): 1745.98 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 0.55%
  - Max trimming effect ratio: 2.26%

Throughput Metrics:
  - Mean TPS: 32.96
  - TPS coefficient of variation: 8.16%

Stability Assessment:
  - Overall stability score: 48.8/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 48.8/100) with significant
  variation between runs (CV: 8.55%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The max/min ratio of 1.35 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.16 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+s22_android13_reference_time_series.png

Latency Stability Analysis: mv3_qnn+s22_android13 (Reference)
================================================================================
Model: mv3_qnn
Device: s22_android13

Dataset Overview:
  - Number of samples: 175
  - Date range: 2025-04-16 01:35:32+00:00 to 2025-05-15 17:15:03+00:00

Central Tendency Metrics:
  - Mean latency: 1.44 ms
  - Median latency (P50): 1.00 ms
  - Mean trimmed latency: 1.35 ms
  - Median trimmed latency: 1.00 ms

Dispersion Metrics:
  - Standard deviation: 0.83 ms
  - Coefficient of variation (CV): 57.29%
  - Interquartile range (IQR): 0.06 ms
  - Trimmed standard deviation: 0.65 ms
  - Trimmed coefficient of variation: 48.32%

Percentile Metrics:
  - P50 (median): 1.00 ms
  - P90: 2.71 ms
  - P95: 3.25 ms
  - P99: 3.95 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 4.5354
  - P99/P50 ratio: 3.9482
  - Mean rolling std (window=5): 0.70 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 3.01%
  - Max trimming effect ratio: 32.04%

Stability Assessment:
  - Overall stability score: 0.0/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 0.0/100) with significant
  variation between runs (CV: 57.29%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The max/min ratio of 4.54 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 3.95 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_qnn+s22_android13_reference_time_series.png

Latency Stability Analysis: mv3_xnnq8+s22_android13 (Reference)
================================================================================
Model: mv3_xnnq8
Device: s22_android13

Dataset Overview:
  - Number of samples: 175
  - Date range: 2025-04-16 01:35:32+00:00 to 2025-05-15 17:15:03+00:00

Central Tendency Metrics:
  - Mean latency: 1.92 ms
  - Median latency (P50): 1.06 ms
  - Mean trimmed latency: 1.74 ms
  - Median trimmed latency: 1.06 ms

Dispersion Metrics:
  - Standard deviation: 1.06 ms
  - Coefficient of variation (CV): 55.09%
  - Interquartile range (IQR): 1.63 ms
  - Trimmed standard deviation: 0.85 ms
  - Trimmed coefficient of variation: 48.75%

Percentile Metrics:
  - P50 (median): 1.06 ms
  - P90: 3.45 ms
  - P95: 3.85 ms
  - P99: 4.63 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 6.1313
  - P99/P50 ratio: 4.3683
  - Mean rolling std (window=5): 1.08 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 5.85%
  - Max trimming effect ratio: 32.08%

Stability Assessment:
  - Overall stability score: 0.0/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 0.0/100) with significant
  variation between runs (CV: 55.09%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The significant difference between raw and trimmed means suggests
  considerable intra-run jitter (5.8%) with occasional outliers within benchmark runs.

  The max/min ratio of 6.13 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 4.37 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+s22_android13_reference_time_series.png

Latency Stability Analysis: llama3_spinq+s22_android12 (Reference)
================================================================================
Model: llama3_spinq
Device: s22_android12

Dataset Overview:
  - Number of samples: 48
  - Date range: 2025-04-29 09:14:21+00:00 to 2025-05-16 01:48:22+00:00

Central Tendency Metrics:
  - Mean latency: 23902.04 ms
  - Median latency (P50): 22762.35 ms
  - Mean trimmed latency: 23743.12 ms
  - Median trimmed latency: 22590.46 ms

Dispersion Metrics:
  - Standard deviation: 2609.94 ms
  - Coefficient of variation (CV): 10.92%
  - Interquartile range (IQR): 4958.35 ms
  - Trimmed standard deviation: 2588.36 ms
  - Trimmed coefficient of variation: 10.90%

Percentile Metrics:
  - P50 (median): 22762.35 ms
  - P90: 27325.35 ms
  - P95: 27425.17 ms
  - P99: 27527.28 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.3689
  - P99/P50 ratio: 1.2093
  - Mean rolling std (window=5): 2739.23 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 0.66%
  - Max trimming effect ratio: 1.58%

Throughput Metrics:
  - Mean TPS: 30.86
  - TPS coefficient of variation: 10.84%

Stability Assessment:
  - Overall stability score: 40.2/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 40.2/100) with significant
  variation between runs (CV: 10.92%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The max/min ratio of 1.37 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.21 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+s22_android12_reference_time_series.png

Latency Stability Analysis: llama3_qlora+s22Ultra5G_android (Reference)
================================================================================
Model: llama3_qlora
Device: s22Ultra5G_android

Dataset Overview:
  - Number of samples: 50
  - Date range: 2025-04-29 09:14:21+00:00 to 2025-05-16 17:28:34+00:00

Central Tendency Metrics:
  - Mean latency: 24685.50 ms
  - Median latency (P50): 23145.09 ms
  - Mean trimmed latency: 24531.08 ms
  - Median trimmed latency: 22945.87 ms

Dispersion Metrics:
  - Standard deviation: 2677.07 ms
  - Coefficient of variation (CV): 10.84%
  - Interquartile range (IQR): 5112.26 ms
  - Trimmed standard deviation: 2657.25 ms
  - Trimmed coefficient of variation: 10.83%

Percentile Metrics:
  - P50 (median): 23145.09 ms
  - P90: 28096.67 ms
  - P95: 28195.43 ms
  - P99: 29486.39 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.4421
  - P99/P50 ratio: 1.2740
  - Mean rolling std (window=5): 2527.53 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 0.62%
  - Max trimming effect ratio: 1.43%

Throughput Metrics:
  - Mean TPS: 30.61
  - TPS coefficient of variation: 10.01%

Stability Assessment:
  - Overall stability score: 37.6/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 37.6/100) with significant
  variation between runs (CV: 10.84%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The max/min ratio of 1.44 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.27 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+s22Ultra5G_android_reference_time_series.png

Latency Stability Analysis: llama3_spinq+s22ultra_android12 (Reference)
================================================================================
Model: llama3_spinq
Device: s22ultra_android12

Dataset Overview:
  - Number of samples: 41
  - Date range: 2025-04-30 01:33:50+00:00 to 2025-05-13 17:16:32+00:00

Central Tendency Metrics:
  - Mean latency: 24769.21 ms
  - Median latency (P50): 23249.93 ms
  - Mean trimmed latency: 24611.41 ms
  - Median trimmed latency: 22998.15 ms

Dispersion Metrics:
  - Standard deviation: 2714.46 ms
  - Coefficient of variation (CV): 10.96%
  - Interquartile range (IQR): 5002.67 ms
  - Trimmed standard deviation: 2691.09 ms
  - Trimmed coefficient of variation: 10.93%

Percentile Metrics:
  - P50 (median): 23249.93 ms
  - P90: 28126.42 ms
  - P95: 28225.43 ms
  - P99: 29591.36 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.4421
  - P99/P50 ratio: 1.2728
  - Mean rolling std (window=5): 2490.40 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 0.63%
  - Max trimming effect ratio: 1.43%

Throughput Metrics:
  - Mean TPS: 30.58
  - TPS coefficient of variation: 10.08%

Stability Assessment:
  - Overall stability score: 37.7/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 37.7/100) with significant
  variation between runs (CV: 10.96%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The max/min ratio of 1.44 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.27 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+s22ultra_android12_reference_time_series.png

Latency Stability Analysis: mv3_xnnq8+s22ultra_android12 (Reference)
================================================================================
Model: mv3_xnnq8
Device: s22ultra_android12

Dataset Overview:
  - Number of samples: 87
  - Date range: 2025-04-16 01:35:32+00:00 to 2025-05-15 17:15:03+00:00

Central Tendency Metrics:
  - Mean latency: 3.63 ms
  - Median latency (P50): 3.62 ms
  - Mean trimmed latency: 2.94 ms
  - Median trimmed latency: 2.87 ms

Dispersion Metrics:
  - Standard deviation: 0.81 ms
  - Coefficient of variation (CV): 22.35%
  - Interquartile range (IQR): 0.94 ms
  - Trimmed standard deviation: 0.60 ms
  - Trimmed coefficient of variation: 20.24%

Percentile Metrics:
  - P50 (median): 3.62 ms
  - P90: 4.87 ms
  - P95: 5.15 ms
  - P99: 5.50 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 2.7228
  - P99/P50 ratio: 1.5193
  - Mean rolling std (window=5): 0.77 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 17.69%
  - Max trimming effect ratio: 45.14%

Stability Assessment:
  - Overall stability score: 15.5/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 15.5/100) with significant
  variation between runs (CV: 22.35%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The significant difference between raw and trimmed means suggests
  considerable intra-run jitter (17.7%) with occasional outliers within benchmark runs.

  The max/min ratio of 2.72 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.52 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+s22ultra_android12_reference_time_series.png

Latency Stability Analysis: mv3_qnn+s22ultra_android12 (Reference)
================================================================================
Model: mv3_qnn
Device: s22ultra_android12

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-16 01:35:32+00:00 to 2025-05-15 17:15:03+00:00

Central Tendency Metrics:
  - Mean latency: 1.02 ms
  - Median latency (P50): 1.01 ms
  - Mean trimmed latency: 1.01 ms
  - Median trimmed latency: 1.01 ms

Dispersion Metrics:
  - Standard deviation: 0.01 ms
  - Coefficient of variation (CV): 1.35%
  - Interquartile range (IQR): 0.01 ms
  - Trimmed standard deviation: 0.01 ms
  - Trimmed coefficient of variation: 1.15%

Percentile Metrics:
  - P50 (median): 1.01 ms
  - P90: 1.02 ms
  - P95: 1.03 ms
  - P99: 1.08 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.0990
  - P99/P50 ratio: 1.0646
  - Mean rolling std (window=5): 0.01 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 0.16%
  - Max trimming effect ratio: 1.94%

Stability Assessment:
  - Overall stability score: 90.4/100
  - Overall stability rating: Excellent

Interpretation:
  The benchmark shows excellent stability (score: 90.4/100) with very low
  variation between runs (CV: 1.35%).
  This indicates highly consistent performance suitable for latency-sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_qnn+s22ultra_android12_reference_time_series.png

Latency Stability Analysis: llama3_qlora+iphone15max_ios17 (Reference)
================================================================================
Model: llama3_qlora
Device: iphone15max_ios17

Dataset Overview:
  - Number of samples: 74
  - Date range: 2025-02-21 03:12:32+00:00 to 2025-05-15 02:43:34+00:00

Central Tendency Metrics:
  - Mean latency: 14133.01 ms
  - Median latency (P50): 13132.50 ms

Dispersion Metrics:
  - Standard deviation: 3019.85 ms
  - Coefficient of variation (CV): 21.37%
  - Interquartile range (IQR): 527.50 ms

Percentile Metrics:
  - P50 (median): 13132.50 ms
  - P90: 17308.70 ms
  - P95: 21197.30 ms
  - P99: 25167.92 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 2.3216
  - P99/P50 ratio: 1.9165
  - Mean rolling std (window=5): 1535.43 ms

Throughput Metrics:
  - Mean TPS: 8.81
  - TPS coefficient of variation: 27.97%

Stability Assessment:
  - Overall stability score: 10.6/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 10.6/100) with significant
  variation between runs (CV: 21.37%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The max/min ratio of 2.32 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.92 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+iphone15max_ios17_reference_time_series.png

Latency Stability Analysis: llama3_spinq+iphone15max_ios17 (Reference)
================================================================================
Model: llama3_spinq
Device: iphone15max_ios17

Dataset Overview:
  - Number of samples: 72
  - Date range: 2025-02-21 03:12:32+00:00 to 2025-05-15 02:43:34+00:00

Central Tendency Metrics:
  - Mean latency: 13118.40 ms
  - Median latency (P50): 12382.50 ms

Dispersion Metrics:
  - Standard deviation: 2853.94 ms
  - Coefficient of variation (CV): 21.76%
  - Interquartile range (IQR): 680.50 ms

Percentile Metrics:
  - P50 (median): 12382.50 ms
  - P90: 14481.00 ms
  - P95: 15865.05 ms
  - P99: 26265.08 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 2.7878
  - P99/P50 ratio: 2.1211
  - Mean rolling std (window=5): 1464.57 ms

Throughput Metrics:
  - Mean TPS: 12.30
  - TPS coefficient of variation: 21.24%

Stability Assessment:
  - Overall stability score: 2.7/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 2.7/100) with significant
  variation between runs (CV: 21.76%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The max/min ratio of 2.79 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 2.12 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+iphone15max_ios17_reference_time_series.png

Latency Stability Analysis: mv3_xnnq8+iphone15max_ios17 (Reference)
================================================================================
Model: mv3_xnnq8
Device: iphone15max_ios17

Dataset Overview:
  - Number of samples: 73
  - Date range: 2025-02-22 03:11:03+00:00 to 2025-05-22 02:53:58+00:00

Central Tendency Metrics:
  - Mean latency: 13.97 ms
  - Median latency (P50): 13.00 ms

Dispersion Metrics:
  - Standard deviation: 4.74 ms
  - Coefficient of variation (CV): 33.93%
  - Interquartile range (IQR): 7.00 ms

Percentile Metrics:
  - P50 (median): 13.00 ms
  - P90: 21.80 ms
  - P95: 22.00 ms
  - P99: 25.40 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 4.1429
  - P99/P50 ratio: 1.9538
  - Mean rolling std (window=5): 4.51 ms

Stability Assessment:
  - Overall stability score: 1.2/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 1.2/100) with significant
  variation between runs (CV: 33.93%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The max/min ratio of 4.14 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.95 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+iphone15max_ios17_reference_time_series.png

Latency Stability Analysis: mv3_coreml+iphone15max_ios17 (Reference)
================================================================================
Model: mv3_coreml
Device: iphone15max_ios17

Dataset Overview:
  - Number of samples: 21
  - Date range: 2025-05-01 03:29:21+00:00 to 2025-05-22 02:53:58+00:00

Central Tendency Metrics:
  - Mean latency: 1.00 ms
  - Median latency (P50): 1.00 ms

Dispersion Metrics:
  - Standard deviation: 0.00 ms
  - Coefficient of variation (CV): 0.00%
  - Interquartile range (IQR): 0.00 ms

Percentile Metrics:
  - P50 (median): 1.00 ms
  - P90: 1.00 ms
  - P95: 1.00 ms
  - P99: 1.00 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.0000
  - P99/P50 ratio: 1.0000
  - Mean rolling std (window=5): 0.00 ms

Stability Assessment:
  - Overall stability score: 100.0/100
  - Overall stability rating: Excellent

Interpretation:
  The benchmark shows excellent stability (score: 100.0/100) with very low
  variation between runs (CV: 0.00%).
  This indicates highly consistent performance suitable for latency-sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_coreml+iphone15max_ios17_reference_time_series.png

Latency Stability Analysis: mv3_mps+iphone15max_ios17 (Reference)
================================================================================
Model: mv3_mps
Device: iphone15max_ios17

Dataset Overview:
  - Number of samples: 72
  - Date range: 2025-02-22 03:11:03+00:00 to 2025-05-22 02:53:58+00:00

Central Tendency Metrics:
  - Mean latency: 1.03 ms
  - Median latency (P50): 1.00 ms

Dispersion Metrics:
  - Standard deviation: 0.17 ms
  - Coefficient of variation (CV): 16.10%
  - Interquartile range (IQR): 0.00 ms

Percentile Metrics:
  - P50 (median): 1.00 ms
  - P90: 1.00 ms
  - P95: 1.00 ms
  - P99: 2.00 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 2.0000
  - P99/P50 ratio: 2.0000
  - Mean rolling std (window=5): 0.07 ms

Stability Assessment:
  - Overall stability score: 12.5/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 12.5/100) with significant
  variation between runs (CV: 16.10%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The max/min ratio of 2.00 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 2.00 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_mps+iphone15max_ios17_reference_time_series.png

Latency Stability Analysis: llama3_qlora+iphone15_ios18 (Reference)
================================================================================
Model: llama3_qlora
Device: iphone15_ios18

Dataset Overview:
  - Number of samples: 70
  - Date range: 2025-02-22 03:11:03+00:00 to 2025-05-22 02:53:58+00:00

Central Tendency Metrics:
  - Mean latency: 14429.20 ms
  - Median latency (P50): 14401.00 ms

Dispersion Metrics:
  - Standard deviation: 593.06 ms
  - Coefficient of variation (CV): 4.11%
  - Interquartile range (IQR): 637.25 ms

Percentile Metrics:
  - P50 (median): 14401.00 ms
  - P90: 14970.00 ms
  - P95: 15441.85 ms
  - P99: 16444.58 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.2195
  - P99/P50 ratio: 1.1419
  - Mean rolling std (window=5): 540.47 ms

Throughput Metrics:
  - Mean TPS: 5.47
  - TPS coefficient of variation: 13.24%

Stability Assessment:
  - Overall stability score: 73.2/100
  - Overall stability rating: Moderate

Interpretation:
  The benchmark shows moderate stability (score: 73.2/100) with noticeable
  variation between runs (CV: 4.11%).
  While average performance is acceptable, occasional latency spikes may occur.

  The max/min ratio of 1.22 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.14 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+iphone15_ios18_reference_time_series.png

Latency Stability Analysis: llama3_spinq+iphone15_ios18 (Reference)
================================================================================
Model: llama3_spinq
Device: iphone15_ios18

Dataset Overview:
  - Number of samples: 74
  - Date range: 2025-02-22 03:11:03+00:00 to 2025-05-22 02:53:58+00:00

Central Tendency Metrics:
  - Mean latency: 13820.34 ms
  - Median latency (P50): 13724.00 ms

Dispersion Metrics:
  - Standard deviation: 662.49 ms
  - Coefficient of variation (CV): 4.79%
  - Interquartile range (IQR): 683.50 ms

Percentile Metrics:
  - P50 (median): 13724.00 ms
  - P90: 14527.80 ms
  - P95: 14992.20 ms
  - P99: 15822.16 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.3302
  - P99/P50 ratio: 1.1529
  - Mean rolling std (window=5): 542.03 ms

Throughput Metrics:
  - Mean TPS: 7.96
  - TPS coefficient of variation: 14.45%

Stability Assessment:
  - Overall stability score: 68.1/100
  - Overall stability rating: Moderate

Interpretation:
  The benchmark shows moderate stability (score: 68.1/100) with noticeable
  variation between runs (CV: 4.79%).
  While average performance is acceptable, occasional latency spikes may occur.

  The max/min ratio of 1.33 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.15 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+iphone15_ios18_reference_time_series.png

Latency Stability Analysis: mv3_xnnq8+iphone15_ios18 (Reference)
================================================================================
Model: mv3_xnnq8
Device: iphone15_ios18

Dataset Overview:
  - Number of samples: 73
  - Date range: 2025-02-22 03:11:03+00:00 to 2025-05-22 02:53:58+00:00

Central Tendency Metrics:
  - Mean latency: 49.85 ms
  - Median latency (P50): 44.00 ms

Dispersion Metrics:
  - Standard deviation: 20.47 ms
  - Coefficient of variation (CV): 41.06%
  - Interquartile range (IQR): 12.00 ms

Percentile Metrics:
  - P50 (median): 44.00 ms
  - P90: 82.00 ms
  - P95: 100.20 ms
  - P99: 121.28 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 3.9355
  - P99/P50 ratio: 2.7564
  - Mean rolling std (window=5): 16.45 ms

Stability Assessment:
  - Overall stability score: 0.0/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 0.0/100) with significant
  variation between runs (CV: 41.06%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The max/min ratio of 3.94 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 2.76 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+iphone15_ios18_reference_time_series.png

Latency Stability Analysis: mv3_coreml+iphone15_ios18 (Reference)
================================================================================
Model: mv3_coreml
Device: iphone15_ios18

Dataset Overview:
  - Number of samples: 21
  - Date range: 2025-05-01 03:29:21+00:00 to 2025-05-22 02:53:58+00:00

Central Tendency Metrics:
  - Mean latency: 1.00 ms
  - Median latency (P50): 1.00 ms

Dispersion Metrics:
  - Standard deviation: 0.00 ms
  - Coefficient of variation (CV): 0.00%
  - Interquartile range (IQR): 0.00 ms

Percentile Metrics:
  - P50 (median): 1.00 ms
  - P90: 1.00 ms
  - P95: 1.00 ms
  - P99: 1.00 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.0000
  - P99/P50 ratio: 1.0000
  - Mean rolling std (window=5): 0.00 ms

Stability Assessment:
  - Overall stability score: 100.0/100
  - Overall stability rating: Excellent

Interpretation:
  The benchmark shows excellent stability (score: 100.0/100) with very low
  variation between runs (CV: 0.00%).
  This indicates highly consistent performance suitable for latency-sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_coreml+iphone15_ios18_reference_time_series.png

Latency Stability Analysis: mv3_mps+iphone15_ios18 (Reference)
================================================================================
Model: mv3_mps
Device: iphone15_ios18

Dataset Overview:
  - Number of samples: 72
  - Date range: 2025-02-22 03:11:03+00:00 to 2025-05-22 02:53:58+00:00

Central Tendency Metrics:
  - Mean latency: 3.75 ms
  - Median latency (P50): 4.00 ms

Dispersion Metrics:
  - Standard deviation: 0.67 ms
  - Coefficient of variation (CV): 17.76%
  - Interquartile range (IQR): 0.00 ms

Percentile Metrics:
  - P50 (median): 4.00 ms
  - P90: 4.00 ms
  - P95: 4.00 ms
  - P99: 4.00 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 2.0000
  - P99/P50 ratio: 1.0000
  - Mean rolling std (window=5): 0.44 ms

Stability Assessment:
  - Overall stability score: 37.5/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 37.5/100) with significant
  variation between runs (CV: 17.76%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The max/min ratio of 2.00 indicates
  substantial performance differences between the best and worst runs.

================================================================================
Generated time series plot: stability_analysis_results/mv3_mps+iphone15_ios18_reference_time_series.png
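
The mapping from stability score to rating is not shown in this output. A minimal sketch, assuming cut-offs of 90, 80 and 60, is given below; these thresholds are guesses chosen only because they agree with the score/rating pairs printed in this output (for example 100.0 is Excellent and 37.5 is Poor), and the script may use different values.

```python
def stability_rating(score):
    """Map a 0-100 stability score to a rating label.

    The cut-offs (90, 80, 60) are assumptions, not values confirmed by
    the script; they merely agree with the pairs printed in this output.
    """
    if score >= 90:
        return "Excellent"
    if score >= 80:
        return "Good"
    if score >= 60:
        return "Moderate"
    return "Poor"


# Scores taken from the reports above.
ratings = [stability_rating(s) for s in (100.0, 37.5, 0.0)]
```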


====================================================================================================
===== PRIVATE VS PUBLIC STABILITY COMPARISON ======================================================
====================================================================================================

Matched: llama3_qlora+s22_android13 (Private) with llama3_qlora+s22_android13 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: llama3_qlora+s22_android13
Public Dataset: llama3_qlora+s22_android13
Model: llama3_qlora
Private Device: s22_android13
Public Device: s22_android13

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 22502.10 ms         | 23841.98 ms          | -1339.88 ms  | -5.6%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 22447.56 ms         | 23381.83 ms          | -934.27 ms   | -4.0%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 595.01 ms           | 2079.97 ms           | -1484.97 ms  | -71.4%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 2.64%               | 8.72%                | -6.08%       | -69.7%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 858.26 ms           | 3183.16 ms           | -2324.90 ms  | -73.0%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 23910.11 ms         | 28001.62 ms          | -4091.51 ms  | -14.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 1.1423              | 1.4300               | -0.2877      | -20.1%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.0652              | 1.1976               | -0.1324      | -11.1%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 83.4/100            | 46.1/100             | 37.3         | 81.0%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Good                | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
  Private environment shows better stability with an 81.0% higher stability score.
  (Private: 83.4/100 vs Public: 46.1/100)
  Private environment has 69.7% lower coefficient of variation, indicating more consistent performance.
  Private environment has 5.6% lower mean latency, indicating better performance.

Recommendation:
  The private environment provides better stability for this model+device combination.
  It is recommended for applications where consistent performance is critical.

================================================================================
Matched: llama3_spinq+s22_android13 (Private) with llama3_spinq+s22_android13 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: llama3_spinq+s22_android13
Public Dataset: llama3_spinq+s22_android13
Model: llama3_spinq
Private Device: s22_android13
Public Device: s22_android13

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 21771.59 ms         | 22774.60 ms          | -1003.01 ms  | -4.4%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 21668.24 ms         | 22491.89 ms          | -823.65 ms   | -3.7%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 514.89 ms           | 1947.04 ms           | -1432.15 ms  | -73.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 2.36%               | 8.55%                | -6.18%       | -72.3%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 602.75 ms           | 3455.61 ms           | -2852.87 ms  | -82.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 23104.76 ms         | 26148.53 ms          | -3043.77 ms  | -11.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 1.1452              | 1.3483               | -0.2031      | -15.1%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.0663              | 1.1626               | -0.0963      | -8.3%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 84.7/100            | 48.8/100             | 35.9         | 73.4%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Good                | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
  Private environment shows better stability with a 73.4% higher stability score.
  (Private: 84.7/100 vs Public: 48.8/100)
  Private environment has 72.3% lower coefficient of variation, indicating more consistent performance.
  Private environment has 4.4% lower mean latency, indicating better performance.

Recommendation:
  The private environment provides better stability for this model+device combination.
  It is recommended for applications where consistent performance is critical.

================================================================================
Matched: mv3_qnn+s22_android13 (Private) with mv3_qnn+s22_android13 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: mv3_qnn+s22_android13
Public Dataset: mv3_qnn+s22_android13
Model: mv3_qnn
Private Device: s22_android13
Public Device: s22_android13

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 1.01 ms             | 1.44 ms              | -0.44 ms     | -30.3%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 1.00 ms             | 1.00 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 0.02 ms             | 0.83 ms              | -0.80 ms     | -97.2%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 2.34%               | 57.29%               | -54.95%      | -95.9%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 0.01 ms             | 0.06 ms              | -0.05 ms     | -83.3%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 1.14 ms             | 3.95 ms              | -2.81 ms     | -71.1%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 1.1919              | 4.5354               | -3.3434      | -73.7%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.1404              | 3.9482               | -2.8078      | -71.1%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 82.4/100            | 0.0/100              | 82.4         | Infinity   |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Good                | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
  Private environment shows better stability.
  (Private: 82.4/100 vs Public: 0.0/100)
  Private environment has 95.9% lower coefficient of variation, indicating more consistent performance.
  Private environment has 30.3% lower mean latency, indicating better performance.

Recommendation:
  The private environment provides better stability for this model+device combination.
  It is recommended for applications where consistent performance is critical.

================================================================================
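
The Difference and % Change columns are consistent with a change measured relative to the public (reference) value, which is why a reference score of 0.0 produces "Infinity". The sketch below shows one way to compute and format such a cell. The function names and the handling of a negative difference against a zero baseline are assumptions, not the script's confirmed behavior.

```python
import math


def compare_metric(primary, reference):
    """Return (difference, percent_change) of primary relative to reference."""
    diff = primary - reference
    if reference == 0:
        # A relative change against a zero baseline is undefined; report
        # a signed infinity, or 0 when the two values are equal (assumed).
        pct = math.copysign(math.inf, diff) if diff != 0 else 0.0
    else:
        pct = 100.0 * diff / reference
    return diff, pct


def format_pct(pct):
    """Render a percent change as a table cell, e.g. '-5.6%' or 'Infinity'."""
    if math.isinf(pct):
        return "Infinity" if pct > 0 else "-Infinity"
    return f"{pct:.1f}%"


# Mean latency and stability score pairs taken from the tables above.
_, latency_pct = compare_metric(22502.10, 23841.98)
_, score_pct = compare_metric(82.4, 0.0)
cells = [format_pct(latency_pct), format_pct(score_pct)]
```

Because an infinite percentage carries no usable magnitude, the absolute difference in score points (82.4 here) is the more informative figure whenever the reference score is 0.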
Matched: mv3_xnnq8+s22_android13 (Private) with mv3_xnnq8+s22_android13 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: mv3_xnnq8+s22_android13
Public Dataset: mv3_xnnq8+s22_android13
Model: mv3_xnnq8
Private Device: s22_android13
Public Device: s22_android13

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 2.73 ms             | 1.92 ms              | 0.81 ms      | 42.1%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 2.65 ms             | 1.06 ms              | 1.59 ms      | 150.0%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 0.63 ms             | 1.06 ms              | -0.43 ms     | -40.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 23.03%              | 55.09%               | -32.06%      | -58.2%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 0.95 ms             | 1.63 ms              | -0.68 ms     | -41.9%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 4.46 ms             | 4.63 ms              | -0.18 ms     | -3.8%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 2.4427              | 6.1313               | -3.6886      | -60.2%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.6812              | 4.3683               | -2.6871      | -61.5%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 14.9/100            | 0.0/100              | 14.9         | Infinity   |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Poor                | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
  Private environment shows better stability.
  (Private: 14.9/100 vs Public: 0.0/100)
  Private environment has 58.2% lower coefficient of variation, indicating more consistent performance.
  Private environment has 42.1% higher mean latency, indicating better performance in the public environment.

Recommendation:
  The private environment provides better stability for this model+device combination.
  It is recommended for applications where consistent performance is critical.

================================================================================
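
A percent change measured against the public baseline (the % Change column) is not the same number as "percent lower" measured against the slower environment. The sketch below derives the "lower mean latency" sentence from the two means directly; the function name and wording are hypothetical and the values are the rounded means from the table above.

```python
def describe_mean_latency(private_ms, public_ms):
    """Say which environment is faster and by how much, relative to the slower one."""
    if private_ms == public_ms:
        return "Both environments have the same mean latency."
    if private_ms < public_ms:
        name, faster, slower = "Private", private_ms, public_ms
    else:
        name, faster, slower = "Public", public_ms, private_ms
    pct_lower = 100.0 * (slower - faster) / slower
    return f"{name} environment has {pct_lower:.1f}% lower mean latency."


# Rounded means from the mv3_xnnq8+s22_android13 table (private, public).
sentence = describe_mean_latency(2.73, 1.92)
```

With these inputs the private mean is about 42% higher than the public mean, while the public mean is about 30% lower than the private mean; both statements describe the same 0.81 ms gap.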
Warning: No matching reference dataset for llama3_qlora+s22ultra_android14
Matched: llama3_spinq+s22ultra_android14 (Private) with llama3_spinq+s22ultra_android12 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: llama3_spinq+s22ultra_android14
Public Dataset: llama3_spinq+s22ultra_android12
Model: llama3_spinq
Private Device: s22ultra_android14
Public Device: s22ultra_android12

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 24761.78 ms         | 24769.21 ms          | -7.43 ms     | -0.0%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 25043.89 ms         | 23249.93 ms          | 1793.96 ms   | 7.7%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 1552.25 ms          | 2714.46 ms           | -1162.21 ms  | -42.8%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 6.27%               | 10.96%               | -4.69%       | -42.8%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 1931.42 ms          | 5002.67 ms           | -3071.25 ms  | -61.4%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 28868.51 ms         | 29591.36 ms          | -722.85 ms   | -2.4%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 1.3648              | 1.4421               | -0.0773      | -5.4%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.1527              | 1.2728               | -0.1200      | -9.4%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 60.3/100            | 37.7/100             | 22.6         | 60.1%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Moderate            | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
  Private environment shows better stability with a 60.1% higher stability score.
  (Private: 60.3/100 vs Public: 37.7/100)
  Private environment has 42.8% lower coefficient of variation, indicating more consistent performance.
  Mean latency is effectively the same in both environments (private is 7.43 ms lower, under 0.1%).

  Note: This comparison is between s22ultra with android14 (Private) and
  s22ultra with android12 (Public). OS version differences may
  contribute to observed stability variations.

Recommendation:
  The private environment provides better stability for this model+device combination.
  It is recommended for applications where consistent performance is critical.

================================================================================
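
The "Matched" and "Warning" lines suggest that a private dataset is paired with a public one by exact name first, and otherwise by the same model and device with a different OS version. The sketch below illustrates that fallback. The function names and the key-splitting rule are assumptions inferred from the dataset names in this output, not the script's actual logic.

```python
def split_key(key):
    """Split 'model+device_os' into (model, device, os)."""
    model, _, device_os = key.partition("+")
    device, _, os_name = device_os.rpartition("_")
    return model, device, os_name


def find_reference(primary_key, reference_keys):
    """Return (matched_key, os_differs), or (None, False) when nothing matches."""
    if primary_key in reference_keys:
        return primary_key, False
    model, device, _ = split_key(primary_key)
    # Fall back to the same model and device on a different OS version.
    for candidate in reference_keys:
        cand_model, cand_device, _ = split_key(candidate)
        if (cand_model, cand_device) == (model, device):
            return candidate, True
    return None, False


# Dataset names taken from the output above.
public = ["llama3_spinq+s22ultra_android12", "mv3_qnn+s22_android13"]
match = find_reference("llama3_spinq+s22ultra_android14", public)
```

When the second element of the result is true, the comparison spans two OS versions, which is the situation the note in the report calls out.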
Matched: mv3_qnn+s22ultra_android14 (Private) with mv3_qnn+s22ultra_android12 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: mv3_qnn+s22ultra_android14
Public Dataset: mv3_qnn+s22ultra_android12
Model: mv3_qnn
Private Device: s22ultra_android14
Public Device: s22ultra_android12

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 1.01 ms             | 1.02 ms              | -0.00 ms     | -0.1%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 1.01 ms             | 1.01 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 0.01 ms             | 0.01 ms              | -0.00 ms     | -32.7%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 0.91%               | 1.35%                | -0.44%       | -32.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 0.01 ms             | 0.01 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 1.03 ms             | 1.08 ms              | -0.04 ms     | -4.1%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 1.0900              | 1.0990               | -0.0090      | -0.8%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.0204              | 1.0646               | -0.0442      | -4.1%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 93.8/100            | 90.4/100             | 3.4          | 3.8%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Excellent           | Excellent            | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
  Private environment shows better stability with a 3.8% higher stability score.
  (Private: 93.8/100 vs Public: 90.4/100)
  Private environment has 32.6% lower coefficient of variation, indicating more consistent performance.
  Mean latency is effectively the same in both environments (private is 0.1% lower).

  Note: This comparison is between s22ultra with android14 (Private) and
  s22ultra with android12 (Public). OS version differences may
  contribute to observed stability variations.

Recommendation:
  The private environment provides better stability for this model+device combination.
  It is recommended for applications where consistent performance is critical.

================================================================================
Matched: mv3_xnnq8+s22ultra_android14 (Private) with mv3_xnnq8+s22ultra_android12 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: mv3_xnnq8+s22ultra_android14
Public Dataset: mv3_xnnq8+s22ultra_android12
Model: mv3_xnnq8
Private Device: s22ultra_android14
Public Device: s22ultra_android12

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 2.91 ms             | 3.63 ms              | -0.72 ms     | -20.0%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 2.54 ms             | 3.62 ms              | -1.08 ms     | -30.0%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 1.14 ms             | 0.81 ms              | 0.32 ms      | 39.9%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 39.08%              | 22.35%               | 16.73%       | 74.8%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 0.82 ms             | 0.94 ms              | -0.12 ms     | -12.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 5.91 ms             | 5.50 ms              | 0.41 ms      | 7.5%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 5.6103              | 2.7228               | 2.8875       | 106.0%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 2.3319              | 1.5193               | 0.8126       | 53.5%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 0.0/100             | 15.5/100             | -15.5        | -100.0%    |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Poor                | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
  Public environment shows better stability.
  (Private: 0.0/100 vs Public: 15.5/100)
  Private environment has 74.8% higher coefficient of variation, indicating less consistent performance.
  Private environment has 20.0% lower mean latency, indicating better performance.

  Note: This comparison is between s22ultra with android14 (Private) and
  s22ultra with android12 (Public). OS version differences may
  contribute to observed stability variations.

Recommendation:
  The public environment provides better stability for this model+device combination.
  Consider investigating factors affecting stability in the private environment.

================================================================================
Warning: No matching reference dataset for mv3_xnnq8+pixel3_rooted_android
Matched: llama3_qlora+iphone15max_ios17 (Private) with llama3_qlora+iphone15max_ios17 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: llama3_qlora+iphone15max_ios17
Public Dataset: llama3_qlora+iphone15max_ios17
Model: llama3_qlora
Private Device: iphone15max_ios17
Public Device: iphone15max_ios17

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 12972.80 ms         | 14133.01 ms          | -1160.22 ms  | -8.2%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 12774.50 ms         | 13132.50 ms          | -358.00 ms   | -2.7%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 483.26 ms           | 3019.85 ms           | -2536.58 ms  | -84.0%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 3.73%               | 21.37%               | -17.64%      | -82.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 624.00 ms           | 527.50 ms            | 96.50 ms     | 18.3%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 14730.49 ms         | 25167.92 ms          | -10437.43 ms | -41.5%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 1.1916              | 2.3216               | -1.1300      | -48.7%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.1531              | 1.9165               | -0.7633      | -39.8%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 75.2/100            | 10.6/100             | 64.6         | 611.1%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Moderate            | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
  Private environment shows better stability with a 611.1% higher stability score.
  (Private: 75.2/100 vs Public: 10.6/100)
  Private environment has 82.6% lower coefficient of variation, indicating more consistent performance.
  Private environment has 8.2% lower mean latency, indicating better performance.

Recommendation:
  The private environment provides better stability for this model+device combination.
  It is recommended for applications where consistent performance is critical.

================================================================================
Matched: llama3_spinq+iphone15max_ios17 (Private) with llama3_spinq+iphone15max_ios17 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: llama3_spinq+iphone15max_ios17
Public Dataset: llama3_spinq+iphone15max_ios17
Model: llama3_spinq
Private Device: iphone15max_ios17
Public Device: iphone15max_ios17

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 12195.41 ms         | 13118.40 ms          | -923.00 ms   | -7.0%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 12104.50 ms         | 12382.50 ms          | -278.00 ms   | -2.2%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 461.27 ms           | 2853.94 ms           | -2392.67 ms  | -83.8%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 3.78%               | 21.76%               | -17.97%      | -82.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 154.25 ms           | 680.50 ms            | -526.25 ms   | -77.3%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 14052.31 ms         | 26265.08 ms          | -12212.77 ms | -46.5%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 1.3331              | 2.7878               | -1.4546      | -52.2%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.1609              | 2.1211               | -0.9602      | -45.3%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 72.9/100            | 2.7/100              | 70.2         | 2648.0%    |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Moderate            | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
  Private environment shows better stability with a 2648.0% higher stability score.
  (Private: 72.9/100 vs Public: 2.7/100)
  Private environment has 82.6% lower coefficient of variation, indicating more consistent performance.
  Private environment has 7.0% lower mean latency, indicating better performance.

Recommendation:
  The private environment provides better stability for this model+device combination.
  It is recommended for applications where consistent performance is critical.
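
The Difference and % Change columns above appear to be computed with the public (reference) value as the baseline. A minimal sketch of that arithmetic follows; the function name `compare_metric` is hypothetical and not taken from the script. Because the table prints rounded inputs, recomputing from them can differ from the printed difference in the last digit.

```python
def compare_metric(private: float, public: float) -> tuple:
    """Return (difference, percent_change) of private relative to public.

    Assumption: the public (reference) value is the baseline and is non-zero.
    """
    difference = private - public
    percent_change = difference / public * 100.0
    return difference, percent_change


# Mean latency values copied from the llama3_spinq+iphone15max_ios17 table.
diff, pct = compare_metric(12195.41, 13118.40)
print(f"Difference: {diff:.2f} ms, % Change: {pct:.1f}%")
```

A negative % Change on a latency or variability metric means the private environment is lower than public by that percentage of the public value.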

================================================================================
Matched: mv3_xnnq8+iphone15max_ios17 (Private) with mv3_xnnq8+iphone15max_ios17 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: mv3_xnnq8+iphone15max_ios17
Public Dataset: mv3_xnnq8+iphone15max_ios17
Model: mv3_xnnq8
Private Device: iphone15max_ios17
Public Device: iphone15max_ios17

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 13.98 ms            | 13.97 ms             | 0.01 ms      | 0.1%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 14.00 ms            | 13.00 ms             | 1.00 ms      | 7.7%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 3.44 ms             | 4.74 ms              | -1.30 ms     | -27.4%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 24.60%              | 33.93%               | -9.33%       | -27.5%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 4.00 ms             | 7.00 ms              | -3.00 ms     | -42.9%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 21.94 ms            | 25.40 ms             | -3.46 ms     | -13.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 3.2857              | 4.1429               | -0.8571      | -20.7%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.5671              | 1.9538               | -0.3867      | -19.8%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 10.8/100            | 1.2/100              | 9.7          | 837.9%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Poor                | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
  Private environment shows better stability with an 837.9% higher stability score.
  (Private: 10.8/100 vs Public: 1.2/100)
  Private environment has 27.5% lower coefficient of variation, indicating more consistent performance.
  Public environment has 0.1% lower mean latency, indicating better performance.

Recommendation:
  The private environment provides better stability for this model+device combination.
  It is recommended for applications where consistent performance is critical.
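
The metrics in these tables can be derived from a list of raw latency samples. The sketch below is one way to do it with the standard library only. The names `percentile` and `stability_metrics` are hypothetical, and the choice of sample standard deviation and linearly interpolated percentiles is an assumption; the script may use a different convention.

```python
import statistics


def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_values:
        raise ValueError("need at least one value")
    rank = (len(sorted_values) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def stability_metrics(latencies_ms):
    """Compute the per-dataset metrics shown in the comparison tables."""
    values = sorted(latencies_ms)
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # assumption: sample (n - 1) standard deviation
    p50 = percentile(values, 50)
    p99 = percentile(values, 99)
    return {
        "mean": mean,
        "median": p50,
        "std": std,
        "cv_percent": std / mean * 100.0,  # coefficient of variation
        "iqr": percentile(values, 75) - percentile(values, 25),
        "p99": p99,
        "max_min_ratio": values[-1] / values[0],
        "p99_p50_ratio": p99 / p50,
    }


# Made-up sample, only to exercise the function.
metrics = stability_metrics([10.0, 12.0, 11.0, 13.0, 14.0])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The CV row is consistent with this definition: for the private mv3_xnnq8 run, 3.44 / 13.98 is about 24.6%.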

================================================================================
Matched: mv3_coreml+iphone15max_ios17 (Private) with mv3_coreml+iphone15max_ios17 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: mv3_coreml+iphone15max_ios17
Public Dataset: mv3_coreml+iphone15max_ios17
Model: mv3_coreml
Private Device: iphone15max_ios17
Public Device: iphone15max_ios17

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 1.00 ms             | 1.00 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 1.00 ms             | 1.00 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 0.00 ms             | 0.00 ms              | 0.00 ms      | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 0.00%               | 0.00%                | 0.00%        | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 0.00 ms             | 0.00 ms              | 0.00 ms      | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 1.00 ms             | 1.00 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 1.0000              | 1.0000               | 0.0000       | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.0000              | 1.0000               | 0.0000       | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 100.0/100           | 100.0/100            | 0.0          | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Excellent           | Excellent            | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
  Both environments show identical stability scores.

Recommendation:
  Both environments provide similar stability. Other factors like cost or availability
  may be considered for choosing between them.
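
When both environments report 0.00 for a metric, as in the standard deviation, CV, and IQR rows above, the percent change is 0/0 and therefore undefined rather than infinite. A minimal sketch of a formatter that separates the two cases follows; `format_percent_change` is a hypothetical helper, not the script's own function.

```python
import math


def format_percent_change(private: float, public: float) -> str:
    """Format the percent change of private relative to public.

    Returns "N/A" when both values are zero (0/0 is undefined) and a signed
    "Infinity%" when only the public baseline is zero.
    """
    difference = private - public
    if math.isclose(public, 0.0, abs_tol=1e-12):
        if math.isclose(difference, 0.0, abs_tol=1e-12):
            return "N/A"
        return "Infinity%" if difference > 0 else "-Infinity%"
    return f"{difference / public * 100.0:.1f}%"


print(format_percent_change(0.0, 0.0))
print(format_percent_change(0.5, 0.0))
print(format_percent_change(12195.41, 13118.40))
```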

================================================================================
Matched: mv3_mps+iphone15max_ios17 (Private) with mv3_mps+iphone15max_ios17 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: mv3_mps+iphone15max_ios17
Public Dataset: mv3_mps+iphone15max_ios17
Model: mv3_mps
Private Device: iphone15max_ios17
Public Device: iphone15max_ios17

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 1.25 ms             | 1.03 ms              | 0.23 ms      | 22.1%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 1.00 ms             | 1.00 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 0.44 ms             | 0.17 ms              | 0.27 ms      | 166.0%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 35.07%              | 16.10%               | 18.97%       | 117.8%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 0.50 ms             | 0.00 ms              | 0.50 ms      | Infinity%  |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 2.00 ms             | 2.00 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 2.0000              | 2.0000               | 0.0000       | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 2.0000              | 2.0000               | 0.0000       | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 12.5/100            | 12.5/100             | 0.0          | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Poor                | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
  Both environments show identical stability scores.
  Public environment has lower coefficient of variation (private is 117.8% higher), indicating more consistent performance.
  Public environment has lower mean latency (private is 22.1% higher), indicating better performance.

Recommendation:
  Both environments provide similar stability. Other factors like cost or availability
  may be considered for choosing between them.
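
A percent change measured against the public baseline describes how much higher or lower the private value is; the figure in the other direction differs. For example, going from a CV of 35.07% to 16.10% is a 54.1% reduction, not 117.8%. The sketch below builds an interpretation sentence that keeps the direction consistent with the % Change column. The helper `describe_change` and its `lower_is_better` parameter are hypothetical.

```python
def describe_change(metric_name, private, public, lower_is_better=True):
    """Describe the private value relative to the public baseline."""
    if public == 0:
        return f"{metric_name}: percent change undefined (public value is 0)."
    pct = (private - public) / public * 100.0
    if pct == 0:
        return f"Both environments have the same {metric_name}."
    direction = "higher" if pct > 0 else "lower"
    # Private wins when it moved in the favorable direction for this metric.
    private_is_better = (pct < 0) == lower_is_better
    better = "Private" if private_is_better else "Public"
    return (
        f"Private environment has {abs(pct):.1f}% {direction} {metric_name} "
        f"than public; {better} environment is better on this metric."
    )


# CV values copied from the mv3_mps+iphone15max_ios17 table.
cv_sentence = describe_change("coefficient of variation", 35.07, 16.10)
print(cv_sentence)
# Stability score is a higher-is-better metric.
score_sentence = describe_change("stability score", 72.9, 2.7, lower_is_better=False)
print(score_sentence)
```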

================================================================================
Matched: llama3_qlora+iphone15_ios18 (Private) with llama3_qlora+iphone15_ios18 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: llama3_qlora+iphone15_ios18
Public Dataset: llama3_qlora+iphone15_ios18
Model: llama3_qlora
Private Device: iphone15_ios18
Public Device: iphone15_ios18

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 23169.07 ms         | 14429.20 ms          | 8739.87 ms   | 60.6%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 21328.00 ms         | 14401.00 ms          | 6927.00 ms   | 48.1%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 5889.20 ms          | 593.06 ms            | 5296.15 ms   | 893.0%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 25.42%              | 4.11%                | 21.31%       | 518.4%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 8558.00 ms          | 637.25 ms            | 7920.75 ms   | 1243.0%    |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 40256.40 ms         | 16444.58 ms          | 23811.82 ms  | 144.8%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 3.0072              | 1.2195               | 1.7877       | 146.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.8875              | 1.1419               | 0.7456       | 65.3%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 2.8/100             | 73.2/100             | -70.3        | -96.2%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Poor                | Moderate             | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
  Public environment shows better stability; the private stability score is 96.2% lower.
  (Private: 2.8/100 vs Public: 73.2/100)
  Public environment has lower coefficient of variation (private is 518.4% higher), indicating more consistent performance.
  Public environment has lower mean latency (private is 60.6% higher), indicating better performance.

Recommendation:
  The public environment provides better stability for this model+device combination.
  Consider investigating factors affecting stability in the private environment.

================================================================================
Matched: llama3_spinq+iphone15_ios18 (Private) with llama3_spinq+iphone15_ios18 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: llama3_spinq+iphone15_ios18
Public Dataset: llama3_spinq+iphone15_ios18
Model: llama3_spinq
Private Device: iphone15_ios18
Public Device: iphone15_ios18

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 22076.03 ms         | 13820.34 ms          | 8255.70 ms   | 59.7%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 20174.00 ms         | 13724.00 ms          | 6450.00 ms   | 47.0%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 6076.94 ms          | 662.49 ms            | 5414.45 ms   | 817.3%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 27.53%              | 4.79%                | 22.73%       | 474.3%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 7826.00 ms          | 683.50 ms            | 7142.50 ms   | 1045.0%    |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 37690.75 ms         | 15822.16 ms          | 21868.59 ms  | 138.2%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 2.7320              | 1.3302               | 1.4018       | 105.4%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.8683              | 1.1529               | 0.7154       | 62.1%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 6.6/100             | 68.1/100             | -61.4        | -90.2%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Poor                | Moderate             | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
  Public environment shows better stability; the private stability score is 90.2% lower.
  (Private: 6.6/100 vs Public: 68.1/100)
  Public environment has lower coefficient of variation (private is 474.3% higher), indicating more consistent performance.
  Public environment has lower mean latency (private is 59.7% higher), indicating better performance.

Recommendation:
  The public environment provides better stability for this model+device combination.
  Consider investigating factors affecting stability in the private environment.

================================================================================
Matched: mv3_xnnq8+iphone15_ios18 (Private) with mv3_xnnq8+iphone15_ios18 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: mv3_xnnq8+iphone15_ios18
Public Dataset: mv3_xnnq8+iphone15_ios18
Model: mv3_xnnq8
Private Device: iphone15_ios18
Public Device: iphone15_ios18

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 48.23 ms            | 49.85 ms             | -1.62 ms     | -3.2%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 47.00 ms            | 44.00 ms             | 3.00 ms      | 6.8%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 6.19 ms             | 20.47 ms             | -14.28 ms    | -69.7%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 12.84%              | 41.06%               | -28.22%      | -68.7%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 6.00 ms             | 12.00 ms             | -6.00 ms     | -50.0%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 64.40 ms            | 121.28 ms            | -56.88 ms    | -46.9%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 2.2973              | 3.9355               | -1.6382      | -41.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.3702              | 2.7564               | -1.3862      | -50.3%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 24.5/100            | 0.0/100              | 24.5         | Infinity   |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Poor                | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
  Private environment shows better stability.
  (Private: 24.5/100 vs Public: 0.0/100)
  Private environment has 68.7% lower coefficient of variation, indicating more consistent performance.
  Private environment has 3.2% lower mean latency, indicating better performance.

Recommendation:
  The private environment provides better stability for this model+device combination.
  It is recommended for applications where consistent performance is critical.

================================================================================
Matched: mv3_coreml+iphone15_ios18 (Private) with mv3_coreml+iphone15_ios18 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: mv3_coreml+iphone15_ios18
Public Dataset: mv3_coreml+iphone15_ios18
Model: mv3_coreml
Private Device: iphone15_ios18
Public Device: iphone15_ios18

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 1.00 ms             | 1.00 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 1.00 ms             | 1.00 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 0.00 ms             | 0.00 ms              | 0.00 ms      | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 0.00%               | 0.00%                | 0.00%        | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 0.00 ms             | 0.00 ms              | 0.00 ms      | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 1.00 ms             | 1.00 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 1.0000              | 1.0000               | 0.0000       | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.0000              | 1.0000               | 0.0000       | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 100.0/100           | 100.0/100            | 0.0          | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Excellent           | Excellent            | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
  Both environments show identical stability scores.

Recommendation:
  Both environments provide similar stability. Other factors like cost or availability
  may be considered for choosing between them.

================================================================================
Matched: mv3_mps+iphone15_ios18 (Private) with mv3_mps+iphone15_ios18 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: mv3_mps+iphone15_ios18
Public Dataset: mv3_mps+iphone15_ios18
Model: mv3_mps
Private Device: iphone15_ios18
Public Device: iphone15_ios18

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 4.01 ms             | 3.75 ms              | 0.26 ms      | 6.9%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 4.00 ms             | 4.00 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 0.16 ms             | 0.67 ms              | -0.51 ms     | -76.0%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 3.99%               | 17.76%               | -13.77%      | -77.5%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 0.00 ms             | 0.00 ms              | 0.00 ms      | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 4.83 ms             | 4.00 ms              | 0.83 ms      | 20.7%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 1.6667              | 2.0000               | -0.3333      | -16.7%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.2075              | 1.0000               | 0.2075       | 20.7%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 66.5/100            | 37.5/100             | 29.0         | 77.4%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Moderate            | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
  Private environment shows better stability with a 77.4% higher stability score.
  (Private: 66.5/100 vs Public: 37.5/100)
  Private environment has 77.5% lower coefficient of variation, indicating more consistent performance.
  Public environment has lower mean latency (private is 6.9% higher), indicating better performance.

Recommendation:
  The private environment provides better stability for this model+device combination.
  It is recommended for applications where consistent performance is critical.

================================================================================


====================================================================================================
===== INTRA-PRIMARY STABILITY COMPARISON ==========================================================
====================================================================================================


Intra-Primary Stability Comparison
================================================================================

Overall Summary:
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| Sheet                           | Model        | Device                |   Mean Latency (ms) |   CV (%) |   Stability Score | Stability Rating   |   Max/Min Ratio |   P99/P50 Ratio |
+=================================+==============+=======================+=====================+==========+===================+====================+=================+=================+
| mv3_coreml+iphone15_ios18       | mv3_coreml   | iphone15_ios18        |                1.00 |     0.00 |            100.00 | Excellent          |            1.00 |            1.00 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| mv3_coreml+iphone15max_ios17    | mv3_coreml   | iphone15max_ios17     |                1.00 |     0.00 |            100.00 | Excellent          |            1.00 |            1.00 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| mv3_qnn+s22ultra_android14      | mv3_qnn      | s22ultra_android14    |                1.01 |     0.91 |             93.81 | Excellent          |            1.09 |            1.02 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| llama3_spinq+s22_android13      | llama3_spinq | s22_android13         |            21771.59 |     2.36 |             84.70 | Good               |            1.15 |            1.07 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| llama3_qlora+s22_android13      | llama3_qlora | s22_android13         |            22502.10 |     2.64 |             83.37 | Good               |            1.14 |            1.07 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| mv3_qnn+s22_android13           | mv3_qnn      | s22_android13         |                1.01 |     2.34 |             82.41 | Good               |            1.19 |            1.14 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| llama3_qlora+iphone15max_ios17  | llama3_qlora | iphone15max_ios17     |            12972.80 |     3.73 |             75.15 | Moderate           |            1.19 |            1.15 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| llama3_spinq+iphone15max_ios17  | llama3_spinq | iphone15max_ios17     |            12195.41 |     3.78 |             72.90 | Moderate           |            1.33 |            1.16 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| mv3_mps+iphone15_ios18          | mv3_mps      | iphone15_ios18        |                4.01 |     3.99 |             66.53 | Moderate           |            1.67 |            1.21 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| llama3_qlora+s22ultra_android14 | llama3_qlora | s22ultra_android14    |            25022.84 |     6.18 |             62.54 | Moderate           |            1.27 |            1.13 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| llama3_spinq+s22ultra_android14 | llama3_spinq | s22ultra_android14    |            24761.78 |     6.27 |             60.28 | Moderate           |            1.36 |            1.15 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| mv3_xnnq8+pixel3_rooted_android | mv3_xnnq8    | pixel3_rooted_android |                5.93 |     7.68 |             46.93 | Poor               |            1.70 |            1.24 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| mv3_xnnq8+iphone15_ios18        | mv3_xnnq8    | iphone15_ios18        |               48.23 |    12.84 |             24.53 | Poor               |            2.30 |            1.37 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| mv3_xnnq8+s22_android13         | mv3_xnnq8    | s22_android13         |                2.73 |    23.03 |             14.94 | Poor               |            2.44 |            1.68 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| mv3_mps+iphone15max_ios17       | mv3_mps      | iphone15max_ios17     |                1.25 |    35.07 |             12.50 | Poor               |            2.00 |            2.00 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| mv3_xnnq8+iphone15max_ios17     | mv3_xnnq8    | iphone15max_ios17     |               13.98 |    24.60 |             10.82 | Poor               |            3.29 |            1.57 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| llama3_spinq+iphone15_ios18     | llama3_spinq | iphone15_ios18        |            22076.03 |    27.53 |              6.64 | Poor               |            2.73 |            1.87 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| llama3_qlora+iphone15_ios18     | llama3_qlora | iphone15_ios18        |            23169.07 |    25.42 |              2.81 | Poor               |            3.01 |            1.89 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| mv3_xnnq8+s22ultra_android14    | mv3_xnnq8    | s22ultra_android14    |                2.91 |    39.08 |              0.00 | Poor               |            5.61 |            2.33 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+

Best and Worst Performers:
  Best stability: mv3_coreml+iphone15_ios18 (Score: 100.0/100)
  Worst stability: mv3_xnnq8+s22ultra_android14 (Score: 0.0/100)

Model-based Comparison:
+--------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
| Model        |   ('Stability Score', 'mean') |   ('Stability Score', 'min') |   ('Stability Score', 'max') |   ('CV (%)', 'mean') |   ('CV (%)', 'min') |   ('CV (%)', 'max') |
+==============+===============================+==============================+==============================+======================+=====================+=====================+
| mv3_coreml   |                        100.00 |                       100.00 |                       100.00 |                 0.00 |                0.00 |                0.00 |
+--------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
| mv3_qnn      |                         88.11 |                        82.41 |                        93.81 |                 1.62 |                0.91 |                2.34 |
+--------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
| llama3_spinq |                         56.13 |                         6.64 |                        84.70 |                 9.99 |                2.36 |               27.53 |
+--------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
| llama3_qlora |                         55.97 |                         2.81 |                        83.37 |                 9.49 |                2.64 |               25.42 |
+--------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
| mv3_mps      |                         39.52 |                        12.50 |                        66.53 |                19.53 |                3.99 |               35.07 |
+--------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
| mv3_xnnq8    |                         19.44 |                         0.00 |                        46.93 |                21.45 |                7.68 |               39.08 |
+--------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
  Most stable model: mv3_coreml (Avg. Score: 100.0/100)

Device-based Comparison (Grouped by Base Device):
+---------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
| Device Base   |   ('Stability Score', 'mean') |   ('Stability Score', 'min') |   ('Stability Score', 'max') |   ('CV (%)', 'mean') |   ('CV (%)', 'min') |   ('CV (%)', 'max') |
+===============+===============================+==============================+==============================+======================+=====================+=====================+
| s22           |                         66.36 |                        14.94 |                        84.70 |                 7.59 |                2.34 |               23.03 |
+---------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
| iphone15max   |                         54.27 |                        10.82 |                       100.00 |                13.44 |                0.00 |               35.07 |
+---------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
| s22ultra      |                         54.16 |                         0.00 |                        93.81 |                13.11 |                0.91 |               39.08 |
+---------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
| pixel3        |                         46.93 |                        46.93 |                        46.93 |                 7.68 |                7.68 |                7.68 |
+---------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
| iphone15      |                         40.10 |                         2.81 |                       100.00 |                13.95 |                0.00 |               27.53 |
+---------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
  Most stable device: s22 (Avg. Score: 66.4/100)

OS Version Comparison:
+-----------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
| OS Version      |   ('Stability Score', 'mean') |   ('Stability Score', 'min') |   ('Stability Score', 'max') |   ('CV (%)', 'mean') |   ('CV (%)', 'min') |   ('CV (%)', 'max') |
+=================+===============================+==============================+==============================+======================+=====================+=====================+
| _android13      |                         66.36 |                        14.94 |                        84.70 |                 7.59 |                2.34 |               23.03 |
+-----------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
| _ios17          |                         54.27 |                        10.82 |                       100.00 |                13.44 |                0.00 |               35.07 |
+-----------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
| _android14      |                         54.16 |                         0.00 |                        93.81 |                13.11 |                0.91 |               39.08 |
+-----------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
| _rooted_android |                         46.93 |                        46.93 |                        46.93 |                 7.68 |                7.68 |                7.68 |
+-----------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
| _ios18          |                         40.10 |                         2.81 |                       100.00 |                13.95 |                0.00 |               27.53 |
+-----------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
  Most stable OS version: _android13 (Avg. Score: 66.4/100)

Insights and Recommendations:
  - mv3_coreml shows the most consistent performance across devices.
  - mv3_xnnq8 shows more variability and may need further optimization.
  - s22 provides the most stable environment for model execution.
  - iphone15 shows higher variability and may not be ideal for latency-sensitive applications.
  - _android13 provides better stability than _ios18 across tested devices.
  - For critical applications requiring consistent performance, prefer:
    * Model: mv3_coreml
    * Device: s22
    * OS Version: _android13

================================================================================


====================================================================================================
===== COMPREHENSIVE STABILITY SUMMARY =============================================================
====================================================================================================


Comprehensive Latency Stability Analysis Summary
================================================================================

Primary (Private) Datasets Summary:
+---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+
| Dataset                         | Model        | Device                   |   Mean Latency (ms) |   CV (%) |   Stability Score | Stability Rating   |
+=================================+==============+==========================+=====================+==========+===================+====================+
| mv3_coreml+iphone15_ios18       | mv3_coreml   | iphone15 (_ios18)        |                1.00 |     0.00 |            100.00 | Excellent          |
+---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+
| mv3_coreml+iphone15max_ios17    | mv3_coreml   | iphone15max (_ios17)     |                1.00 |     0.00 |            100.00 | Excellent          |
+---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+
| mv3_qnn+s22ultra_android14      | mv3_qnn      | s22ultra (_android14)    |                1.01 |     0.91 |             93.81 | Excellent          |
+---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+
| llama3_spinq+s22_android13      | llama3_spinq | s22 (_android13)         |            21771.59 |     2.36 |             84.70 | Good               |
+---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+
| llama3_qlora+s22_android13      | llama3_qlora | s22 (_android13)         |            22502.10 |     2.64 |             83.37 | Good               |
+---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+
| mv3_qnn+s22_android13           | mv3_qnn      | s22 (_android13)         |                1.01 |     2.34 |             82.41 | Good               |
+---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+
| llama3_qlora+iphone15max_ios17  | llama3_qlora | iphone15max (_ios17)     |            12972.80 |     3.73 |             75.15 | Moderate           |
+---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+
| llama3_spinq+iphone15max_ios17  | llama3_spinq | iphone15max (_ios17)     |            12195.41 |     3.78 |             72.90 | Moderate           |
+---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+
| mv3_mps+iphone15_ios18          | mv3_mps      | iphone15 (_ios18)        |                4.01 |     3.99 |             66.53 | Moderate           |
+---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+
| llama3_qlora+s22ultra_android14 | llama3_qlora | s22ultra (_android14)    |            25022.84 |     6.18 |             62.54 | Moderate           |
+---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+
| llama3_spinq+s22ultra_android14 | llama3_spinq | s22ultra (_android14)    |            24761.78 |     6.27 |             60.28 | Moderate           |
+---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+
| mv3_xnnq8+pixel3_rooted_android | mv3_xnnq8    | pixel3 (_rooted_android) |                5.93 |     7.68 |             46.93 | Poor               |
+---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+
| mv3_xnnq8+iphone15_ios18        | mv3_xnnq8    | iphone15 (_ios18)        |               48.23 |    12.84 |             24.53 | Poor               |
+---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+
| mv3_xnnq8+s22_android13         | mv3_xnnq8    | s22 (_android13)         |                2.73 |    23.03 |             14.94 | Poor               |
+---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+
| mv3_mps+iphone15max_ios17       | mv3_mps      | iphone15max (_ios17)     |                1.25 |    35.07 |             12.50 | Poor               |
+---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+
| mv3_xnnq8+iphone15max_ios17     | mv3_xnnq8    | iphone15max (_ios17)     |               13.98 |    24.60 |             10.82 | Poor               |
+---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+
| llama3_spinq+iphone15_ios18     | llama3_spinq | iphone15 (_ios18)        |            22076.03 |    27.53 |              6.64 | Poor               |
+---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+
| llama3_qlora+iphone15_ios18     | llama3_qlora | iphone15 (_ios18)        |            23169.07 |    25.42 |              2.81 | Poor               |
+---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+
| mv3_xnnq8+s22ultra_android14    | mv3_xnnq8    | s22ultra (_android14)    |                2.91 |    39.08 |              0.00 | Poor               |
+---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+

Reference (Public) Datasets Summary:
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| Dataset                         | Model        | Device                |   Mean Latency (ms) |   CV (%) |   Stability Score | Stability Rating   |
+=================================+==============+=======================+=====================+==========+===================+====================+
| mv3_coreml+iphone15max_ios17    | mv3_coreml   | iphone15max (_ios17)  |                1.00 |     0.00 |            100.00 | Excellent          |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| mv3_coreml+iphone15_ios18       | mv3_coreml   | iphone15 (_ios18)     |                1.00 |     0.00 |            100.00 | Excellent          |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| mv3_qnn+s22ultra_android12      | mv3_qnn      | s22ultra (_android12) |                1.02 |     1.35 |             90.39 | Excellent          |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| llama3_qlora+iphone15_ios18     | llama3_qlora | iphone15 (_ios18)     |            14429.20 |     4.11 |             73.16 | Moderate           |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| llama3_spinq+iphone15_ios18     | llama3_spinq | iphone15 (_ios18)     |            13820.34 |     4.79 |             68.08 | Moderate           |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| llama3_spinq+s22_android13      | llama3_spinq | s22 (_android13)      |            22774.60 |     8.55 |             48.84 | Poor               |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| llama3_qlora+s22_android13      | llama3_qlora | s22 (_android13)      |            23841.98 |     8.72 |             46.07 | Poor               |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| llama3_spinq+s22_android12      | llama3_spinq | s22 (_android12)      |            23902.04 |    10.92 |             40.15 | Poor               |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| llama3_spinq+s22ultra_android12 | llama3_spinq | s22ultra (_android12) |            24769.21 |    10.96 |             37.66 | Poor               |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| llama3_qlora+s22Ultra5G_android | llama3_qlora | s22Ultra5G (_android) |            24685.50 |    10.84 |             37.62 | Poor               |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| mv3_mps+iphone15_ios18          | mv3_mps      | iphone15 (_ios18)     |                3.75 |    17.76 |             37.50 | Poor               |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| mv3_xnnq8+s22ultra_android12    | mv3_xnnq8    | s22ultra (_android12) |                3.63 |    22.35 |             15.48 | Poor               |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| mv3_mps+iphone15max_ios17       | mv3_mps      | iphone15max (_ios17)  |                1.03 |    16.10 |             12.50 | Poor               |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| llama3_qlora+iphone15max_ios17  | llama3_qlora | iphone15max (_ios17)  |            14133.01 |    21.37 |             10.57 | Poor               |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| llama3_spinq+iphone15max_ios17  | llama3_spinq | iphone15max (_ios17)  |            13118.40 |    21.76 |              2.65 | Poor               |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| mv3_xnnq8+iphone15max_ios17     | mv3_xnnq8    | iphone15max (_ios17)  |               13.97 |    33.93 |              1.15 | Poor               |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| mv3_xnnq8+s22_android13         | mv3_xnnq8    | s22 (_android13)      |                1.92 |    55.09 |              0.00 | Poor               |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| mv3_xnnq8+iphone15_ios18        | mv3_xnnq8    | iphone15 (_ios18)     |               49.85 |    41.06 |              0.00 | Poor               |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| mv3_qnn+s22_android13           | mv3_qnn      | s22 (_android13)      |                1.44 |    57.29 |              0.00 | Poor               |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+

Private vs Public Comparison:
+-----------------------------+-----------------------+-----------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| Dataset                     | Private Device        | Public Device         |   Private Score |   Public Score |   Score Diff |   Private CV (%) |   Public CV (%) |   CV Diff (%) |
+=============================+=======================+=======================+=================+================+==============+==================+=================+===============+
| mv3_qnn on s22              | s22 (_android13)      | s22 (_android13)      |           82.41 |           0.00 |        82.41 |             2.34 |           57.29 |        -54.95 |
+-----------------------------+-----------------------+-----------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| llama3_spinq on iphone15max | iphone15max (_ios17)  | iphone15max (_ios17)  |           72.90 |           2.65 |        70.25 |             3.78 |           21.76 |        -17.97 |
+-----------------------------+-----------------------+-----------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| llama3_qlora on iphone15max | iphone15max (_ios17)  | iphone15max (_ios17)  |           75.15 |          10.57 |        64.58 |             3.73 |           21.37 |        -17.64 |
+-----------------------------+-----------------------+-----------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| llama3_qlora on s22         | s22 (_android13)      | s22 (_android13)      |           83.37 |          46.07 |        37.31 |             2.64 |            8.72 |         -6.08 |
+-----------------------------+-----------------------+-----------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| llama3_spinq on s22         | s22 (_android13)      | s22 (_android13)      |           84.70 |          48.84 |        35.87 |             2.36 |            8.55 |         -6.18 |
+-----------------------------+-----------------------+-----------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| mv3_mps on iphone15         | iphone15 (_ios18)     | iphone15 (_ios18)     |           66.53 |          37.50 |        29.03 |             3.99 |           17.76 |        -13.77 |
+-----------------------------+-----------------------+-----------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| mv3_xnnq8 on iphone15       | iphone15 (_ios18)     | iphone15 (_ios18)     |           24.53 |           0.00 |        24.53 |            12.84 |           41.06 |        -28.22 |
+-----------------------------+-----------------------+-----------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| llama3_spinq on s22ultra    | s22ultra (_android14) | s22ultra (_android12) |           60.28 |          37.66 |        22.62 |             6.27 |           10.96 |         -4.69 |
+-----------------------------+-----------------------+-----------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| mv3_xnnq8 on s22            | s22 (_android13)      | s22 (_android13)      |           14.94 |           0.00 |        14.94 |            23.03 |           55.09 |        -32.06 |
+-----------------------------+-----------------------+-----------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| mv3_xnnq8 on iphone15max    | iphone15max (_ios17)  | iphone15max (_ios17)  |           10.82 |           1.15 |         9.67 |            24.60 |           33.93 |         -9.33 |
+-----------------------------+-----------------------+-----------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| mv3_qnn on s22ultra         | s22ultra (_android14) | s22ultra (_android12) |           93.81 |          90.39 |         3.42 |             0.91 |            1.35 |         -0.44 |
+-----------------------------+-----------------------+-----------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| mv3_coreml on iphone15max   | iphone15max (_ios17)  | iphone15max (_ios17)  |          100.00 |         100.00 |         0.00 |             0.00 |            0.00 |          0.00 |
+-----------------------------+-----------------------+-----------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| mv3_mps on iphone15max      | iphone15max (_ios17)  | iphone15max (_ios17)  |           12.50 |          12.50 |         0.00 |            35.07 |           16.10 |         18.97 |
+-----------------------------+-----------------------+-----------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| mv3_coreml on iphone15      | iphone15 (_ios18)     | iphone15 (_ios18)     |          100.00 |         100.00 |         0.00 |             0.00 |            0.00 |          0.00 |
+-----------------------------+-----------------------+-----------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| mv3_xnnq8 on s22ultra       | s22ultra (_android14) | s22ultra (_android12) |            0.00 |          15.48 |       -15.48 |            39.08 |           22.35 |         16.73 |
+-----------------------------+-----------------------+-----------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| llama3_spinq on iphone15    | iphone15 (_ios18)     | iphone15 (_ios18)     |            6.64 |          68.08 |       -61.44 |            27.53 |            4.79 |         22.73 |
+-----------------------------+-----------------------+-----------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| llama3_qlora on iphone15    | iphone15 (_ios18)     | iphone15 (_ios18)     |            2.81 |          73.16 |       -70.35 |            25.42 |            4.11 |         21.31 |
+-----------------------------+-----------------------+-----------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
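
For readers who want to re-derive the diff columns in the comparison table, here is a minimal sketch. It is not the code from the script. It assumes CV is the sample standard deviation divided by the mean (times 100), and that private and public datasets are matched on a "model on base device" key, ignoring the OS version, as the s22ultra rows suggest. The helper names `cv_percent` and `compare` are hypothetical. The stability score formula is not shown in the output, so scores are taken as given.

```python
import statistics


def cv_percent(latencies_ms):
    """Coefficient of variation in percent (assumed: sample std / mean * 100)."""
    mean = statistics.mean(latencies_ms)
    if mean == 0:
        return 0.0
    return statistics.stdev(latencies_ms) / mean * 100.0


def compare(private, public):
    """Match datasets by key and compute private-minus-public differences."""
    rows = []
    for key in private.keys() & public.keys():
        p, q = private[key], public[key]
        rows.append(
            {
                "dataset": key,
                "score_diff": round(p["score"] - q["score"], 2),
                "cv_diff": round(p["cv"] - q["cv"], 2),
            }
        )
    # Largest score improvement first, matching the ordering of the table above.
    rows.sort(key=lambda r: r["score_diff"], reverse=True)
    return rows


# Scores and CVs copied from the Private and Public summary tables above.
private = {
    "mv3_qnn on s22": {"score": 82.41, "cv": 2.34},
    "mv3_qnn on s22ultra": {"score": 93.81, "cv": 0.91},
    "llama3_qlora on iphone15": {"score": 2.81, "cv": 25.42},
}
public = {
    "mv3_qnn on s22": {"score": 0.00, "cv": 57.29},
    "mv3_qnn on s22ultra": {"score": 90.39, "cv": 1.35},
    "llama3_qlora on iphone15": {"score": 73.16, "cv": 4.11},
}

rows = compare(private, public)
for row in rows:
    print(row)
```

A positive Score Diff (or a negative CV Diff) means the private device pool was more stable for that model and device. Some rows in the table can differ from this sketch by 0.01, presumably because the script subtracts unrounded values.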


pytorch-bot bot commented May 19, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/10982

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7b6d907 with merge base 0c9a4f5 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed label May 19, 2025
@guangy10 guangy10 requested review from huydhn and yangw-dev May 19, 2025 22:13
Contributor

@yangw-dev yangw-dev left a comment

I recommend adding the pip dependencies to a requirements.txt next to analyze_latency_stability.py.

It might also be good for the script to have its own folder.

@guangy10 guangy10 changed the title Script for benchmark satbility assessment [Not To Land] Script for benchmark satbility assessment May 19, 2025
@guangy10 guangy10 force-pushed the benchmark_assessment branch 2 times, most recently from cae229e to 3b8aa35 Compare May 27, 2025 20:27
@guangy10 guangy10 force-pushed the benchmark_assessment branch from 3b8aa35 to dd44b4e Compare June 4, 2025 17:45
@guangy10 guangy10 changed the title [Not To Land] Script for benchmark satbility assessment Script for benchmark stability assessment Jun 4, 2025
@guangy10 guangy10 force-pushed the benchmark_assessment branch from dd44b4e to cd676b8 Compare June 4, 2025 18:09
guangy10 commented Jun 4, 2025

Fixed linter

@guangy10 guangy10 added the release notes: none Do not include this in the release notes label Jun 4, 2025
@guangy10 guangy10 marked this pull request as ready for review June 4, 2025 18:17
guangy10 commented Jun 4, 2025

As discussed with @yangw-dev offline, to make the stability assessment part of the benchmark infra as suggested in this post, I will merge this script under .ci/scripts together with the other scripts used by CI and the benchmark infra. @yangw-dev will take over from there and rework the interface to:

  1. pipe the data directly from the DB instead of requiring a manual dump to .xlsx first
  2. support running the stability assessment on any combination of time frame, devices, models, backends, etc.
  3. add cron jobs that run this stability assessment and visualize the results in the dashboard UI

@guangy10 guangy10 force-pushed the benchmark_assessment branch from cd676b8 to 7b6d907 Compare June 4, 2025 18:30
@guangy10 guangy10 requested a review from yangw-dev June 4, 2025 18:31
@guangy10 guangy10 merged commit 2269160 into main Jun 4, 2025
190 checks passed
@guangy10 guangy10 deleted the benchmark_assessment branch June 4, 2025 23:41
yangw-dev added a commit that referenced this pull request Jun 23, 2025
# Summary
Provide methods and a script to fetch all ExecuTorch benchmark data from
the HUD API into two datasets, private and public. The script will:
- fetch all data from the HUD API for the input time range (UTC)
- clean out records and tables with only FAILURE_REPORT due to job-level
failures
- get all private table metrics, generate `table_name`, and find the
intersecting public table metrics
- generate private and public table groups
- output the data

OutputType:
- run with excel-sheet export
- run with csv export
- run with dataframe format print
- run with json format print

See more guidance in README.md

The data is similar to the Excel sheet generated manually in
#10982.
The result should be the same as the HUD per-model data table:
<img width="1480" alt="image"
src="https://github.com/user-attachments/assets/7c6cc12e-50c5-4ce2-ac87-5cac650486e3"
/>

## helper methods: common.py
common.py provides helper methods to convert CSV files and Excel sheets
back to the {"groupInfo":{}, "df":df.DataFrame} format.

# run with
``` bash
python3 .ci/scripts/benchmark_tooling/get_benchmark_analysis_data.py \
--startTime "2025-04-29T09:48:57" \
--endTime "2025-05-13T22:00:00" \
--outputType "excel" \
--models "mv3"

python3 .ci/scripts/benchmark_tooling/analyze_benchmark_stability.py \
--primary-file private.xlsx \
--reference-file public.xlsx
```
Generated Excel files:

[private.xlsx](https://github.com/user-attachments/files/20844977/private.xlsx)

[public.xlsx](https://github.com/user-attachments/files/20844978/public.xlsx)


For instance, here is the result for mv3 (xnnpack_q8) on S22 Ultra, Android 14:
```

Latency Stability Analysis: table10 (Primary)
================================================================================
Model: mv3(xnnpack_q8)
Device: Samsung Galaxy S22 Ultra 5G (private)(Android 14)

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
  - Mean latency: 2.91 ms
  - Median latency (P50): 2.54 ms
  - Mean trimmed latency: 2.41 ms
  - Median trimmed latency: 2.15 ms

Dispersion Metrics:
  - Standard deviation: 1.14 ms
  - Coefficient of variation (CV): 39.08%
  - Interquartile range (IQR): 0.82 ms
  - Trimmed standard deviation: 0.76 ms
  - Trimmed coefficient of variation: 31.60%

Percentile Metrics:
  - P50 (median): 2.54 ms
  - P90: 3.88 ms
  - P95: 4.60 ms
  - P99: 5.91 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 5.6103
  - P99/P50 ratio: 2.3319
  - Mean rolling std (window=5): 0.79 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 15.37%
  - Max trimming effect ratio: 38.83%

Stability Assessment:
  - Overall stability score: 0.0/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 0.0/100) with significant
  variation between runs (CV: 39.08%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The significant difference between raw and trimmed means suggests
  considerable intra-run jitter (15.4%) with occasional outliers within benchmark runs.

  The max/min ratio of 5.61 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 2.33 suggests
  occasional latency spikes that could affect tail latency sensitive applications.
```
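The dispersion and jitter figures in the report can be roughly reproduced from the printed values, for example CV ≈ 1.14 / 2.91 ≈ 39% and P99/P50 ≈ 5.91 / 2.54 ≈ 2.33. The sketch below shows one way such metrics could be computed. It is not the script's implementation: the function names, the use of the sample standard deviation, the linear-interpolation percentile, and the definition of the trimming effect as `(raw - trimmed) / raw` are all assumptions.

```python
import statistics


def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100] (assumed method)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def summarize_latency(raw_ms, trimmed_ms):
    """Summarize per-run raw and trimmed latencies (hypothetical helper)."""
    mean = statistics.mean(raw_ms)
    std = statistics.stdev(raw_ms)  # sample standard deviation (assumption)
    p50 = percentile(raw_ms, 50)
    p99 = percentile(raw_ms, 99)
    # Assumed definition: share of the raw latency removed by trimming, per run.
    trimming = [(r - t) / r * 100 for r, t in zip(raw_ms, trimmed_ms)]
    return {
        "cv_pct": std / mean * 100,
        "iqr_ms": percentile(raw_ms, 75) - percentile(raw_ms, 25),
        "max_min_ratio": max(raw_ms) / min(raw_ms),
        "p99_p50_ratio": p99 / p50,
        "mean_trimming_effect_pct": statistics.mean(trimming),
        "max_trimming_effect_pct": max(trimming),
    }


# Toy data for illustration only, not taken from the benchmark dataset.
summary = summarize_latency([2.0, 2.0, 4.0, 4.0], [1.0, 2.0, 3.0, 4.0])
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A CV near 39% means the run-to-run standard deviation is more than a third of the mean latency, which is in line with the "Poor" rating in the report.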

---------

Signed-off-by: Yang Wang <[email protected]>
hinriksnaer pushed a commit to hinriksnaer/executorch that referenced this pull request Jun 26, 2025