Skip to content

Commit 87ba918

Browse files
committed
Clarify relevance
1 parent bab9355 commit 87ba918

File tree

3 files changed

+47
-36
lines changed

3 files changed

+47
-36
lines changed

docs/glossary.md

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -25,20 +25,20 @@ The following is a glossary of domain specific terminology. Although benchmarks
2525
* **test case**: a combination of a benchmark, a profile, and a scenario.
2626
* **test**: the act of running an artifact under a test case. Each test result is composed of many iterations.
2727
* **test iteration**: a single iteration that makes up a test. Note: we currently normally run 2 test iterations for each test.
28-
* **test result**: the result of the collection of all statistics from running a test. Currently the minimum of the statistics.
28+
* **test result**: the result of the collection of all statistics from running a test. Currently, the minimum value of a statistic from all the test iterations is used.
2929
* **statistic**: a single value of a metric in a test result
3030
* **statistic description**: the combination of a metric and a test case which describes a statistic.
3131
* **statistic series**: statistics for the same statistic description over time.
3232
* **run**: a collection of test results for all currently available test cases run on a given artifact.
33-
* **test result delta**: the delta between two test results for the same test case but (optionally) different artifacts. The [comparison page](https://perf.rust-lang.org/compare.html) lists all the test result deltas as percentages comparing two runs.
3433

3534
## Analysis
3635

37-
* **test result delta**: the difference between two test results for the same metric and test case.
38-
* **significance threshold**: the threshold at which a test result delta is considered "significant" (i.e., a real change in performance and not just noise). This is calculated using [the upper IQR fence](https://www.statisticshowto.com/upper-and-lower-fences/#:~:text=Upper%20and%20lower%20fences%20cordon,%E2%80%93%20(1.5%20*%20IQR)) as seen [here](https://github.com/rust-lang/rustc-perf/blob/8ba845644b4cfcffd96b909898d7225931b55557/site/src/comparison.rs#L935-L941).
39-
* **significant test result delta**: a test result delta above the significance threshold. Significant test result deltas can be thought of as "statistically significant".
36+
* **test result comparison**: the delta between two test results for the same test case but (optionally) different artifacts. The [comparison page](https://perf.rust-lang.org/compare.html) lists all the test result comparisons as percentages between two runs.
37+
* **significance threshold**: the threshold at which a test result comparison is considered "significant" (i.e., a real change in performance and not just noise). This is calculated using [the upper IQR fence](https://www.statisticshowto.com/upper-and-lower-fences/#:~:text=Upper%20and%20lower%20fences%20cordon,%E2%80%93%20(1.5%20*%20IQR)) as seen [here](https://github.com/rust-lang/rustc-perf/blob/8ba845644b4cfcffd96b909898d7225931b55557/site/src/comparison.rs#L935-L941).
38+
* **significant test result comparison**: a test result comparison above the significance threshold. Significant test result comparisons can be thought of as being "statistically significant".
39+
* **relevant test result comparison**: a test result comparison can be significant but still not be relevant (i.e., worth paying attention to). Relevance is a factor of the test result comparison's *magnitude*. Comparisons are considered relevant if they have a small magnitude or more. This term is often used to mean "significant *and* relevant" since relevant changes are necessarily also significant.
40+
* **test result comparison magnitude**: how "large" the delta is between the two test result's under comparison. This is determined by the average of two factors: the absolute size of the change (i.e., a change of 5% is larger than a change of 1%) and the amount above the significance threshold (i.e., a change that is 5x the significance threshold is larger than a change 1.5x the significance threshold).
4041
* **dodgy test case**: a test case for which the significance threshold is significantly large indicating a high amount of variability in the test and thus making it necessary to be somewhat skeptical of any results too close to the significance threshold.
41-
* **relevant test result delta**: a synonym for *significant test result delta* in situations where the term "significant" might be ambiguous and readers may potentially interpret *significant* as "large" or "statistically significant". For example, in try run results, we use the term relevant instead of significant.
4242

4343
## Other
4444

site/src/comparison.rs

Lines changed: 36 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -145,11 +145,11 @@ async fn populate_report(
145145
report: &mut HashMap<Option<Direction>, Vec<String>>,
146146
) {
147147
if let Some(summary) = ComparisonSummary::summarize_comparison(comparison) {
148-
let confidence = summary.confidence();
149-
if confidence.is_atleast_probably_relevant() {
148+
let relevance = summary.relevance_level();
149+
if relevance.at_least_somewhat_relevant() {
150150
if let Some(direction) = summary.direction() {
151151
let entry = report
152-
.entry(confidence.is_definitely_relevant().then(|| direction))
152+
.entry(relevance.very_relevant().then(|| direction))
153153
.or_default();
154154

155155
entry.push(summary.write(comparison).await)
@@ -160,7 +160,7 @@ async fn populate_report(
160160

161161
/// A summary of a given comparison
162162
///
163-
/// This summary only includes changes that are significant and relevant (as determined by a changes magnitude).
163+
/// This summary only includes changes that are significant and relevant (as determined by a change's magnitude).
164164
pub struct ComparisonSummary {
165165
/// Significant comparisons of magnitude small and above
166166
/// and ordered by magnitude from largest to smallest
@@ -182,7 +182,7 @@ impl ComparisonSummary {
182182
.statistics
183183
.iter()
184184
.filter(|c| c.is_significant())
185-
.filter(|c| c.magnitude().is_small_or_above())
185+
.filter(|c| c.is_relevant())
186186
.inspect(|c| {
187187
if c.is_improvement() {
188188
num_improvements += 1;
@@ -361,24 +361,25 @@ impl ComparisonSummary {
361361
self.comparisons.iter().find(|s| s.is_regression())
362362
}
363363

364-
pub fn confidence(&self) -> ComparisonConfidence {
364+
/// The relevance level of the entire comparison
365+
pub fn relevance_level(&self) -> RelevanceLevel {
365366
let mut num_small_changes = 0;
366367
let mut num_medium_changes = 0;
367368
for c in self.comparisons.iter() {
368369
match c.magnitude() {
369370
Magnitude::Small => num_small_changes += 1,
370371
Magnitude::Medium => num_medium_changes += 1,
371-
Magnitude::Large => return ComparisonConfidence::DefinitelyRelevant,
372-
Magnitude::VeryLarge => return ComparisonConfidence::DefinitelyRelevant,
372+
Magnitude::Large => return RelevanceLevel::High,
373+
Magnitude::VeryLarge => return RelevanceLevel::High,
373374
Magnitude::VerySmall => unreachable!(),
374375
}
375376
}
376377

377378
match (num_small_changes, num_medium_changes) {
378-
(_, m) if m > 1 => ComparisonConfidence::DefinitelyRelevant,
379-
(_, 1) => ComparisonConfidence::ProbablyRelevant,
380-
(s, 0) if s > 10 => ComparisonConfidence::ProbablyRelevant,
381-
_ => ComparisonConfidence::MaybeRelevant,
379+
(_, m) if m > 1 => RelevanceLevel::High,
380+
(_, 1) => RelevanceLevel::Medium,
381+
(s, 0) if s > 10 => RelevanceLevel::Medium,
382+
_ => RelevanceLevel::Low,
382383
}
383384
}
384385

@@ -553,22 +554,21 @@ pub fn write_summary_table(
553554
.unwrap();
554555
}
555556

556-
/// The amount of confidence we have that a comparison actually represents a real
557-
/// change in the performance characteristics.
557+
/// How relevant a set of test result comparisons are.
558558
#[derive(Clone, Copy, Debug)]
559-
pub enum ComparisonConfidence {
560-
MaybeRelevant,
561-
ProbablyRelevant,
562-
DefinitelyRelevant,
559+
pub enum RelevanceLevel {
560+
Low,
561+
Medium,
562+
High,
563563
}
564564

565-
impl ComparisonConfidence {
566-
pub fn is_definitely_relevant(self) -> bool {
567-
matches!(self, Self::DefinitelyRelevant)
565+
impl RelevanceLevel {
566+
pub fn very_relevant(self) -> bool {
567+
matches!(self, Self::High)
568568
}
569569

570-
pub fn is_atleast_probably_relevant(self) -> bool {
571-
matches!(self, Self::DefinitelyRelevant | Self::ProbablyRelevant)
570+
pub fn at_least_somewhat_relevant(self) -> bool {
571+
matches!(self, Self::High | Self::Medium)
572572
}
573573
}
574574

@@ -1040,6 +1040,15 @@ impl TestResultComparison {
10401040
Some(change.abs() / threshold)
10411041
}
10421042

1043+
/// Whether the comparison is relevant or not
1044+
fn is_relevant(&self) -> bool {
1045+
self.magnitude().is_small_or_above()
1046+
}
1047+
1048+
/// The magnitude of the change.
1049+
///
1050+
/// This is the average of the absolute magnitude of the change
1051+
/// and the amount above the significance threshold.
10431052
fn magnitude(&self) -> Magnitude {
10441053
let change = self.relative_change().abs();
10451054
let threshold = self.significance_threshold();
@@ -1054,7 +1063,7 @@ impl TestResultComparison {
10541063
} else {
10551064
Magnitude::VeryLarge
10561065
};
1057-
let change_magnitude = if change < 0.002 {
1066+
let absolute_magnitude = if change < 0.002 {
10581067
Magnitude::VerySmall
10591068
} else if change < 0.01 {
10601069
Magnitude::Small
@@ -1084,7 +1093,9 @@ impl TestResultComparison {
10841093
}
10851094
}
10861095

1087-
from_u8((as_u8(over_threshold) + as_u8(change_magnitude)) / 2)
1096+
// Take the average of the absolute magnitude and the magnitude
1097+
// above the significance threshold.
1098+
from_u8((as_u8(over_threshold) + as_u8(absolute_magnitude)) / 2)
10881099
}
10891100

10901101
fn is_dodgy(&self) -> bool {

site/src/github.rs

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
use crate::api::github::Issue;
22
use crate::comparison::{
3-
write_summary_table, Comparison, ComparisonConfidence, ComparisonSummary, Direction,
3+
write_summary_table, Comparison, ComparisonSummary, Direction, RelevanceLevel,
44
};
55
use crate::load::{Config, SiteCtxt, TryCommit};
66

@@ -741,7 +741,7 @@ fn generate_short_summary(
741741

742742
match summary {
743743
Some(summary) => {
744-
if comparison_is_relevant(summary.confidence(), is_master_commit) {
744+
if comparison_is_relevant(summary.relevance_level(), is_master_commit) {
745745
let direction = summary.direction().unwrap();
746746
let num_improvements = summary.number_of_improvements();
747747
let num_regressions = summary.number_of_regressions();
@@ -811,12 +811,12 @@ fn split_comparison(
811811
}
812812

813813
/// Whether a comparison is relevant enough to show
814-
fn comparison_is_relevant(confidence: ComparisonConfidence, is_master_commit: bool) -> bool {
814+
fn comparison_is_relevant(relevance: RelevanceLevel, is_master_commit: bool) -> bool {
815815
if is_master_commit {
816-
confidence.is_definitely_relevant()
816+
relevance.very_relevant()
817817
} else {
818818
// is try run
819-
confidence.is_atleast_probably_relevant()
819+
relevance.at_least_somewhat_relevant()
820820
}
821821
}
822822

0 commit comments

Comments
 (0)