Skip to content

Commit d07ca42

Browse files
committed
Change significance to be determined by IQR fencing
1 parent f58033a commit d07ca42

File tree

2 files changed

+60
-23
lines changed

2 files changed

+60
-23
lines changed

docs/comparison-analysis.md

Lines changed: 12 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -18,11 +18,11 @@ At the core of comparison analysis are the collection of test results for the tw
1818
100 * ((statisticForArtifactB - statisticForArtifactA) / statisticForArtifactA)
1919
```
2020

21-
## High-level analysis description
21+
## High-level analysis description
2222

2323
Analysis of the changes is performed in order to determine whether artifact B represents a performance change over artifact A. At a high level the analysis performed takes the following form:
2424

25-
How many _significant_ test results indicate performance changes and what is the magnitude of the changes (i.e., how large are the changes regardless of the direction of change)?
25+
How many _significant_ test results indicate performance changes and what is the magnitude of the changes (i.e., how large are the changes regardless of the direction of change)?
2626

2727
* If there are improvements and regressions with magnitude of medium or above then the comparison is mixed.
2828
* If there are only either improvements or regressions then the comparison is labeled with that kind.
@@ -37,7 +37,16 @@ Whether we actually _report_ an analysis or not depends on the context and how _
3737

3838
### What makes a test result significant?
3939

40-
A test result is significant if the relative change percentage meets some threshold. What the threshold is depends of whether the test case is "dodgy" or not (see below for an examination of "dodginess"). For dodgy test cases, the threshold is set at 1%. For non-dodgy test cases, the threshold is set to 0.1%.
40+
A test result is significant if the relative change percentage is considered an outlier against historical data. Determining whether a value is an outlier is done through interquartile range "fencing" (i.e., whether a value exceeds a threshold equal to the third quartile plus 1.5 times the interquartile range):
41+
42+
```
43+
interquartile_range = Q3 - Q1
44+
result > Q3 + (interquartile_range * 1.5)
45+
```
46+
47+
(Assuming the data is ordered, Q3 is the median of the upper half of the data while Q1 is the median of the lower half.)
48+
49+
We ignore the lower fence, because result data is bounded by 0.
4150

4251
### What makes a test case "dodgy"?
4352

site/src/comparison.rs

Lines changed: 48 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -639,24 +639,30 @@ impl BenchmarkVariance {
639639
self.data.iter().sum::<f64>() / self.data.len() as f64
640640
}
641641

642+
/// The percent change of the deltas sorted from smallest delta to largest
643+
fn percent_changes(&self) -> Vec<f64> {
644+
let mut deltas = self
645+
.deltas()
646+
.zip(self.data.iter())
647+
.map(|(d, &r)| d / r)
648+
.collect::<Vec<_>>();
649+
deltas.sort_by(|d1, d2| d1.partial_cmp(d2).unwrap_or(std::cmp::Ordering::Equal));
650+
deltas
651+
}
652+
642653
fn calculate_description(&mut self) {
643654
self.description = BenchmarkVarianceDescription::Normal;
644655

645656
let results_mean = self.mean();
646-
let mut deltas = self
647-
.data
648-
.windows(2)
649-
.map(|window| (window[0] - window[1]).abs())
650-
.collect::<Vec<_>>();
651-
deltas.sort_by(|d1, d2| d1.partial_cmp(d2).unwrap_or(std::cmp::Ordering::Equal));
652-
let non_significant = deltas
657+
let percent_changes = self.percent_changes();
658+
let non_significant = percent_changes
653659
.iter()
654-
.zip(self.data.iter())
655-
.take_while(|(&d, &r)| d / r < Self::SIGNFICANT_DELTA_THRESHOLD)
660+
.take_while(|&&c| c < Self::SIGNFICANT_DELTA_THRESHOLD)
656661
.collect::<Vec<_>>();
657662

658-
let percent_significant_changes =
659-
((deltas.len() - non_significant.len()) as f64 / deltas.len() as f64) * 100.0;
663+
let percent_significant_changes = ((percent_changes.len() - non_significant.len()) as f64
664+
/ percent_changes.len() as f64)
665+
* 100.0;
660666
debug!(
661667
"Percent significant changes: {:.1}%",
662668
percent_significant_changes
@@ -668,14 +674,20 @@ impl BenchmarkVariance {
668674
}
669675

670676
let delta_mean =
671-
non_significant.iter().map(|(&d, _)| d).sum::<f64>() / (non_significant.len() as f64);
677+
non_significant.iter().cloned().sum::<f64>() / (non_significant.len() as f64);
672678
let ratio_change = delta_mean / results_mean;
673-
debug!("Ratio change: {:.3}", ratio_change);
674679
if ratio_change > Self::NOISE_THRESHOLD {
675680
self.description = BenchmarkVarianceDescription::Noisy;
676681
}
677682
}
678683

684+
// Absolute deltas between adjacent results
685+
fn deltas(&self) -> impl Iterator<Item = f64> + '_ {
686+
self.data
687+
.windows(2)
688+
.map(|window| (window[0] - window[1]).abs())
689+
}
690+
679691
/// Whether we can trust this benchmark or not
680692
fn is_dodgy(&self) -> bool {
681693
matches!(
@@ -740,13 +752,9 @@ pub struct TestResultComparison {
740752

741753
impl TestResultComparison {
742754
/// The amount of relative change considered significant when
743-
/// the test case is not dodgy
755+
/// we cannot determine from historical data
744756
const SIGNIFICANT_RELATIVE_CHANGE_THRESHOLD: f64 = 0.002;
745757

746-
/// The amount of relative change considered significant when
747-
/// the test case is dodgy
748-
const SIGNIFICANT_RELATIVE_CHANGE_THRESHOLD_DODGY: f64 = 0.008;
749-
750758
fn is_regression(&self) -> bool {
751759
let (a, b) = self.results;
752760
b > a
@@ -761,8 +769,28 @@ impl TestResultComparison {
761769
}
762770

763771
fn signifcance_threshold(&self) -> f64 {
764-
if self.is_dodgy() {
765-
Self::SIGNIFICANT_RELATIVE_CHANGE_THRESHOLD_DODGY
772+
if let Some(pcs) = self.variance.as_ref().map(|s| s.percent_changes()) {
773+
fn median(data: &[f64]) -> f64 {
774+
if data.len() % 2 == 0 {
775+
(data[(data.len() - 1) / 2] + data[data.len() / 2]) / 2.0
776+
} else {
777+
data[data.len() / 2]
778+
}
779+
}
780+
781+
let len = pcs.len();
782+
let (h1_end, h2_begin) = if len % 2 == 0 {
783+
(len / 2 - 2, len / 2 + 1)
784+
} else {
785+
(len / 2 - 1, len / 2 + 1)
786+
};
787+
// significance is determined by the upper
788+
// interquartile range fence
789+
let q1 = median(&pcs[..=h1_end]);
790+
let q3 = median(&pcs[h2_begin..]);
791+
let iqr = q3 - q1;
792+
let upper_fence = q3 + (iqr * 1.5);
793+
upper_fence
766794
} else {
767795
Self::SIGNIFICANT_RELATIVE_CHANGE_THRESHOLD
768796
}

0 commit comments

Comments
 (0)