Skip to content

Commit cd2cd93

Browse files
authored
Merge pull request #996 from rylev/significance
Change significance to be determined by IQR fencing
2 parents f58033a + b351e06 commit cd2cd93

File tree

4 files changed

+156
-56
lines changed

4 files changed

+156
-56
lines changed

docs/comparison-analysis.md

Lines changed: 26 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# Comparison Analysis
22

3-
The following is a detailed explanation of the process undertaken to automate the analysis of test results for two artifacts of interest (artifact A and B).
3+
The following is a detailed explanation of the process undertaken to automate the analysis of test results for two artifacts of interest (artifact A and B).
44

55
This analysis can be done by hand, by using the [comparison page](https://perf.rust-lang.org/compare.html) and entering the two artifacts of interest in the form at the top.
66

@@ -18,11 +18,11 @@ At the core of comparison analysis are the collection of test results for the tw
1818
100 * ((statisticForArtifactB - statisticForArtifactA) / statisticForArtifactA)
1919
```
2020

21-
## High-level analysis description
21+
## High-level analysis description
2222

2323
Analysis of the changes is performed in order to determine whether artifact B represents a performance change over artifact A. At a high level the analysis performed takes the following form:
2424

25-
How many _significant_ test results indicate performance changes and what is the magnitude of the changes (i.e., how large are the changes regardless of the direction of change)?
25+
How many _significant_ test results indicate performance changes and what is the magnitude of the changes (i.e., how large are the changes regardless of the direction of change)?
2626

2727
* If there are improvements and regressions with magnitude of medium or above then the comparison is mixed.
2828
* If there are only either improvements or regressions then the comparison is labeled with that kind.
@@ -37,29 +37,38 @@ Whether we actually _report_ an analysis or not depends on the context and how _
3737

3838
### What makes a test result significant?
3939

40-
A test result is significant if the relative change percentage meets some threshold. What the threshold is depends of whether the test case is "dodgy" or not (see below for an examination of "dodginess"). For dodgy test cases, the threshold is set at 1%. For non-dodgy test cases, the threshold is set to 0.1%.
41-
42-
### What makes a test case "dodgy"?
43-
44-
A test case is "dodgy" if it shows signs of either being noisy or highly variable.
45-
46-
To determine noise and high variability, the previous 100 test results for the test case in question are examined by calculating relative delta changes between adjacent test results. This is done with the following formula (where `testResult1` is the test result immediately proceeding `testResult2`):
40+
A test result is significant if the relative change percentage is considered an outlier against historical data. Determining whether a value is an outlier is done through interquartile range "fencing" (i.e., whether a value exceeds a threshold equal to the third quartile plus 1.5 times the interquartile range):
4741

4842
```
49-
testResult2 - testResult1 / testResult1
43+
interquartile_range = Q3 - Q1
44+
result > Q3 + (interquartile_range * 1.5)
5045
```
5146

52-
Any relative delta change that is above a threshold (currently 0.1) is considered "significant" for the purposes of dodginess detection.
47+
(Assuming the data is ordered, Q3 is the median of the upper half of the data while Q1 is the median of the lower half.)
5348

54-
A highly variable test case is one where a certain percentage (currently 5%) of relative delta changes are significant. The logic being that test cases should only display significant relative delta changes a small percentage of the time.
49+
We ignore the lower fence, because result data is bounded by 0.
5550

56-
A noisy test case is one where of all the non-significant relative delta changes, the average delta change is still above some threshold (0.001). The logic being that non-significant changes should, on average, being very close to 0. If they are not close to zero, then they are noisy.
51+
This upper fence is often called the "significance threshold".
5752

5853
### How is confidence in whether a test analysis is "relevant" determined?
5954

60-
The confidence in whether a test analysis is relevant depends on the number of significant test results and their magnitude (how large a change is regardless of the direction of the change).
55+
The confidence in whether a test analysis is relevant depends on the number of significant test results and their magnitude.
56+
57+
#### Magnitude
58+
59+
Magnitude is a combination of two factors:
60+
* how large a change is regardless of the direction of the change
61+
* how much that change went over the significance threshold
62+
63+
If a large change only happens to go over the significance threshold by a small factor, then the over magnitude of the change is considered small.
64+
65+
#### Confidence algorithm
6166

6267
The actual algorithm for determining confidence may change, but in general the following rules apply:
63-
* Definitely relevant: any number of very large changes, a small amount of large and/or medium changes, or a large amount of small changes.
64-
* Probably relevant: any number of large changes, more than 1 medium change, or smaller but still substantial amount of small changes.
68+
* Definitely relevant: any number of very large or large changes, a small amount of medium changes, or a large amount of small or very small changes.
69+
* Probably relevant: any number of very large or large changes, any medium change, or smaller but still substantial amount of small or very small changes.
6570
* Maybe relevant: if it doesn't fit into the above two categories, it ends in this category.
71+
72+
### "Dodgy" Test Cases
73+
74+
"Dodgy" test cases are test cases that tend to produce unreliable results (i.e., noise). A test case is considered "dodgy" if its significance threshold is sufficiently far enough away from 0.

site/src/api.rs

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -147,10 +147,12 @@ pub mod comparison {
147147
use std::collections::HashMap;
148148

149149
#[derive(Debug, PartialEq, Clone, Serialize, Deserialize)]
150+
#[allow(non_snake_case)]
150151
pub struct Request {
151152
pub start: Bound,
152153
pub end: Bound,
153154
pub stat: String,
155+
pub calcNewSig: Option<bool>,
154156
}
155157

156158
#[derive(Debug, Clone, Serialize)]
@@ -358,9 +360,11 @@ pub mod triage {
358360
use serde::{Deserialize, Serialize};
359361

360362
#[derive(Debug, Clone, Serialize, Deserialize)]
363+
#[allow(non_snake_case)]
361364
pub struct Request {
362365
pub start: Bound,
363366
pub end: Option<Bound>,
367+
pub calcNewSig: Option<bool>,
364368
}
365369

366370
#[derive(Debug, Clone, Serialize, Deserialize)]

site/src/comparison.rs

Lines changed: 125 additions & 39 deletions
Original file line numberDiff line numberDiff line change
@@ -45,6 +45,7 @@ pub async fn handle_triage(
4545
"instructions:u".to_owned(),
4646
ctxt,
4747
&master_commits,
48+
body.calcNewSig.unwrap_or(false),
4849
)
4950
.await?
5051
{
@@ -89,10 +90,16 @@ pub async fn handle_compare(
8990
) -> Result<api::comparison::Response, BoxedError> {
9091
let master_commits = collector::master_commits().await?;
9192
let end = body.end;
92-
let comparison =
93-
compare_given_commits(body.start, end.clone(), body.stat, ctxt, &master_commits)
94-
.await?
95-
.ok_or_else(|| format!("could not find end commit for bound {:?}", end))?;
93+
let comparison = compare_given_commits(
94+
body.start,
95+
end.clone(),
96+
body.stat,
97+
ctxt,
98+
&master_commits,
99+
body.calcNewSig.unwrap_or(false),
100+
)
101+
.await?
102+
.ok_or_else(|| format!("could not find end commit for bound {:?}", end))?;
96103

97104
let conn = ctxt.conn().await;
98105
let prev = comparison.prev(&master_commits);
@@ -338,7 +345,7 @@ pub async fn compare(
338345
ctxt: &SiteCtxt,
339346
) -> Result<Option<Comparison>, BoxedError> {
340347
let master_commits = collector::master_commits().await?;
341-
compare_given_commits(start, end, stat, ctxt, &master_commits).await
348+
compare_given_commits(start, end, stat, ctxt, &master_commits, false).await
342349
}
343350

344351
/// Compare two bounds on a given stat
@@ -348,6 +355,7 @@ async fn compare_given_commits(
348355
stat: String,
349356
ctxt: &SiteCtxt,
350357
master_commits: &[collector::MasterCommit],
358+
calc_new_sig: bool,
351359
) -> Result<Option<Comparison>, BoxedError> {
352360
let a = ctxt
353361
.artifact_id_for_bound(start.clone(), true)
@@ -387,6 +395,7 @@ async fn compare_given_commits(
387395
.as_ref()
388396
.and_then(|v| v.data.get(&test_case).cloned()),
389397
results: (a, b),
398+
calc_new_sig,
390399
})
391400
})
392401
.collect();
@@ -635,28 +644,53 @@ impl BenchmarkVariance {
635644
self.data.push(value);
636645
}
637646

638-
fn mean(&self) -> f64 {
639-
self.data.iter().sum::<f64>() / self.data.len() as f64
647+
/// The percent change of the deltas sorted from smallest delta to largest
648+
fn percent_changes(&self) -> Vec<f64> {
649+
let mut deltas = self
650+
.deltas()
651+
.zip(self.data.iter())
652+
.map(|(d, &r)| d / r)
653+
.collect::<Vec<_>>();
654+
deltas.sort_by(|d1, d2| d1.partial_cmp(d2).unwrap_or(std::cmp::Ordering::Equal));
655+
deltas
656+
}
657+
658+
fn upper_fence(&self) -> f64 {
659+
let pcs = self.percent_changes();
660+
fn median(data: &[f64]) -> f64 {
661+
if data.len() % 2 == 0 {
662+
(data[(data.len() - 1) / 2] + data[data.len() / 2]) / 2.0
663+
} else {
664+
data[data.len() / 2]
665+
}
666+
}
667+
668+
let len = pcs.len();
669+
let (h1_end, h2_begin) = if len % 2 == 0 {
670+
(len / 2 - 2, len / 2 + 1)
671+
} else {
672+
(len / 2 - 1, len / 2 + 1)
673+
};
674+
// significance is determined by the upper
675+
// interquartile range fence
676+
let q1 = median(&pcs[..=h1_end]);
677+
let q3 = median(&pcs[h2_begin..]);
678+
let iqr = q3 - q1;
679+
q3 + (iqr * 1.5)
640680
}
641681

642682
fn calculate_description(&mut self) {
643683
self.description = BenchmarkVarianceDescription::Normal;
644684

645-
let results_mean = self.mean();
646-
let mut deltas = self
647-
.data
648-
.windows(2)
649-
.map(|window| (window[0] - window[1]).abs())
650-
.collect::<Vec<_>>();
651-
deltas.sort_by(|d1, d2| d1.partial_cmp(d2).unwrap_or(std::cmp::Ordering::Equal));
652-
let non_significant = deltas
685+
let percent_changes = self.percent_changes();
686+
let non_significant = percent_changes
653687
.iter()
654-
.zip(self.data.iter())
655-
.take_while(|(&d, &r)| d / r < Self::SIGNFICANT_DELTA_THRESHOLD)
688+
.take_while(|&&c| c < Self::SIGNFICANT_DELTA_THRESHOLD)
656689
.collect::<Vec<_>>();
657690

658-
let percent_significant_changes =
659-
((deltas.len() - non_significant.len()) as f64 / deltas.len() as f64) * 100.0;
691+
let percent_significant_changes = ((percent_changes.len() - non_significant.len()) as f64
692+
/ percent_changes.len() as f64)
693+
* 100.0;
660694
debug!(
661695
"Percent significant changes: {:.1}%",
662696
percent_significant_changes
@@ -668,20 +702,30 @@ impl BenchmarkVariance {
668702
}
669703

670704
let delta_mean =
671-
non_significant.iter().map(|(&d, _)| d).sum::<f64>() / (non_significant.len() as f64);
672-
let ratio_change = delta_mean / results_mean;
673-
debug!("Ratio change: {:.3}", ratio_change);
674-
if ratio_change > Self::NOISE_THRESHOLD {
705+
non_significant.iter().cloned().sum::<f64>() / (non_significant.len() as f64);
706+
debug!("Ratio change: {:.4}", delta_mean);
707+
if delta_mean > Self::NOISE_THRESHOLD {
675708
self.description = BenchmarkVarianceDescription::Noisy;
676709
}
677710
}
678711

712+
// Absolute deltas between adjacent results
713+
fn deltas(&self) -> impl Iterator<Item = f64> + '_ {
714+
self.data
715+
.windows(2)
716+
.map(|window| (window[0] - window[1]).abs())
717+
}
718+
679719
/// Whether we can trust this benchmark or not
680-
fn is_dodgy(&self) -> bool {
681-
matches!(
682-
self.description,
683-
BenchmarkVarianceDescription::Noisy | BenchmarkVarianceDescription::HighlyVariable
684-
)
720+
fn is_dodgy(&self, calc_new_sig: bool) -> bool {
721+
if !calc_new_sig {
722+
matches!(
723+
self.description,
724+
BenchmarkVarianceDescription::Noisy | BenchmarkVarianceDescription::HighlyVariable
725+
)
726+
} else {
727+
self.upper_fence() > 0.002
728+
}
685729
}
686730
}
687731

@@ -736,11 +780,12 @@ pub struct TestResultComparison {
736780
scenario: Scenario,
737781
variance: Option<BenchmarkVariance>,
738782
results: (f64, f64),
783+
calc_new_sig: bool,
739784
}
740785

741786
impl TestResultComparison {
742787
/// The amount of relative change considered significant when
743-
/// the test case is not dodgy
788+
/// we cannot determine from historical data
744789
const SIGNIFICANT_RELATIVE_CHANGE_THRESHOLD: f64 = 0.002;
745790

746791
/// The amount of relative change considered significant when
@@ -761,33 +806,74 @@ impl TestResultComparison {
761806
}
762807

763808
fn signifcance_threshold(&self) -> f64 {
764-
if self.is_dodgy() {
765-
Self::SIGNIFICANT_RELATIVE_CHANGE_THRESHOLD_DODGY
809+
if !self.calc_new_sig {
810+
if self.is_dodgy() {
811+
Self::SIGNIFICANT_RELATIVE_CHANGE_THRESHOLD_DODGY
812+
} else {
813+
Self::SIGNIFICANT_RELATIVE_CHANGE_THRESHOLD
814+
}
766815
} else {
767-
Self::SIGNIFICANT_RELATIVE_CHANGE_THRESHOLD
816+
self.variance
817+
.as_ref()
818+
.map(|s| s.upper_fence())
819+
.unwrap_or(Self::SIGNIFICANT_RELATIVE_CHANGE_THRESHOLD)
768820
}
769821
}
770822

771823
fn magnitude(&self) -> Magnitude {
772-
let mag = self.relative_change().abs();
824+
let change = self.relative_change().abs();
773825
let threshold = self.signifcance_threshold();
774-
if mag < threshold * 1.5 {
826+
let over_threshold = if change < threshold * 1.5 {
775827
Magnitude::VerySmall
776-
} else if mag < threshold * 3.0 {
828+
} else if change < threshold * 3.0 {
777829
Magnitude::Small
778-
} else if mag < threshold * 10.0 {
830+
} else if change < threshold * 10.0 {
779831
Magnitude::Medium
780-
} else if mag < threshold * 25.0 {
832+
} else if change < threshold * 25.0 {
781833
Magnitude::Large
782834
} else {
783835
Magnitude::VeryLarge
836+
};
837+
if !self.calc_new_sig {
838+
return over_threshold;
784839
}
840+
let change_magnitude = if change < 0.002 {
841+
Magnitude::VerySmall
842+
} else if change < 0.01 {
843+
Magnitude::Small
844+
} else if change < 0.02 {
845+
Magnitude::Medium
846+
} else if change < 0.05 {
847+
Magnitude::Large
848+
} else {
849+
Magnitude::VeryLarge
850+
};
851+
fn as_u8(m: Magnitude) -> u8 {
852+
match m {
853+
Magnitude::VerySmall => 1,
854+
Magnitude::Small => 2,
855+
Magnitude::Medium => 3,
856+
Magnitude::Large => 4,
857+
Magnitude::VeryLarge => 5,
858+
}
859+
}
860+
fn from_u8(m: u8) -> Magnitude {
861+
match m {
862+
1 => Magnitude::VerySmall,
863+
2 => Magnitude::Small,
864+
3 => Magnitude::Medium,
865+
4 => Magnitude::Large,
866+
_ => Magnitude::VeryLarge,
867+
}
868+
}
869+
870+
from_u8((as_u8(over_threshold) + as_u8(change_magnitude)) / 2)
785871
}
786872

787873
fn is_dodgy(&self) -> bool {
788874
self.variance
789875
.as_ref()
790-
.map(|v| v.is_dodgy())
876+
.map(|v| v.is_dodgy(self.calc_new_sig))
791877
.unwrap_or(false)
792878
}
793879

@@ -867,7 +953,7 @@ impl std::fmt::Display for Direction {
867953
}
868954

869955
/// The relative size of a performance change
870-
#[derive(Debug, PartialOrd, PartialEq, Ord, Eq)]
956+
#[derive(Clone, Copy, Debug, PartialOrd, PartialEq, Ord, Eq)]
871957
pub enum Magnitude {
872958
VerySmall,
873959
Small,

site/static/compare.html

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -766,6 +766,7 @@ <h2>Comparing <span id="stat-header">{{stat}}</span> between <span id="before">{
766766
end: "",
767767
stat: "instructions:u",
768768
}, state);
769+
values["calcNewSig"] = values.calcNewSig === 'true';
769770
makeRequest("/get", values).then(function (data) {
770771
app.data = data;
771772
});

0 commit comments

Comments
 (0)