|
1 | 1 | # Comparison Analysis
|
2 | 2 |
|
3 |
| -The following is a detailed explanation of the process undertaken to automate the analysis of test results for two artifacts of interest (artifact A and B). |
| 3 | +The following is a detailed explanation of the process undertaken to automate the analysis of test results for two artifacts of interest (artifact A and B). |
4 | 4 |
|
5 | 5 | This analysis can be done by hand, by using the [comparison page](https://perf.rust-lang.org/compare.html) and entering the two artifacts of interest in the form at the top.
|
6 | 6 |
|
@@ -48,27 +48,27 @@ result > Q3 + (interquartile_range * 1.5)
|
48 | 48 |
|
49 | 49 | We ignore the lower fence, because result data is bounded by 0.
|
50 | 50 |
|
51 |
| -### What makes a test case "dodgy"? |
| 51 | +This upper fence is often called the "significance threshold". |
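The upper fence described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names, the choice of Python, and the quartile interpolation method (`method="inclusive"`) are all assumptions, and `results` stands for whatever historical values the analysis draws its quartiles from.

```python
import statistics


def significance_threshold(results):
    """Return the upper Tukey fence, Q3 + 1.5 * IQR (hypothetical helper)."""
    # quantiles(n=4) returns the three quartile cut points [Q1, Q2, Q3].
    # The interpolation method is an assumption; other methods give
    # slightly different quartiles on small samples.
    q1, _, q3 = statistics.quantiles(results, n=4, method="inclusive")
    interquartile_range = q3 - q1
    return q3 + interquartile_range * 1.5


def is_significant(result, results):
    """A result is significant only if it lies strictly above the fence."""
    return result > significance_threshold(results)
```

Because the data is bounded below by 0, only the upper fence is computed; no lower fence appears in the sketch.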
52 | 52 |
|
53 |
| -A test case is "dodgy" if it shows signs of either being noisy or highly variable. |
54 |
| - |
55 |
| -To determine noise and high variability, the previous 100 test results for the test case in question are examined by calculating relative delta changes between adjacent test results. This is done with the following formula (where `testResult1` is the test result immediately proceeding `testResult2`): |
56 |
| - |
57 |
| -``` |
58 |
| -testResult2 - testResult1 / testResult1 |
59 |
| -``` |
| 53 | +### How is confidence in whether a test analysis is "relevant" determined? |
60 | 54 |
|
61 |
| -Any relative delta change that is above a threshold (currently 0.1) is considered "significant" for the purposes of dodginess detection. |
| 55 | +The confidence in whether a test analysis is relevant depends on the number of significant test results and their magnitude. |
62 | 56 |
|
63 |
| -A highly variable test case is one where a certain percentage (currently 5%) of relative delta changes are significant. The logic being that test cases should only display significant relative delta changes a small percentage of the time. |
| 57 | +#### Magnitude |
64 | 58 |
|
65 |
| -A noisy test case is one where of all the non-significant relative delta changes, the average delta change is still above some threshold (0.001). The logic being that non-significant changes should, on average, being very close to 0. If they are not close to zero, then they are noisy. |
| 59 | +Magnitude is a combination of two factors: |
| 60 | +* how large a change is regardless of the direction of the change |
| 61 | +* how much that change went over the significance threshold |
66 | 62 |
|
67 |
| -### How is confidence in whether a test analysis is "relevant" determined? |
| 63 | +If a large change only happens to go over the significance threshold by a small factor, then the overall magnitude of the change is considered small. |
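One way to combine the two factors is sketched below. Everything specific here is hypothetical: the cutoff values, the helper names, and the rule of taking the smaller of the two buckets are assumptions chosen only to reproduce the behaviour described in the text, where a large change that barely clears the threshold ends up with a small overall magnitude.

```python
from enum import IntEnum


class Magnitude(IntEnum):
    VERY_SMALL = 0
    SMALL = 1
    MEDIUM = 2
    LARGE = 3
    VERY_LARGE = 4


# Hypothetical cutoffs, not taken from the source.
ABSOLUTE_CUTOFFS = (0.002, 0.01, 0.02, 0.05)      # size of the relative change
OVER_THRESHOLD_CUTOFFS = (1.5, 3.0, 10.0, 25.0)   # change / significance threshold


def bucket(value, cutoffs):
    """Map a value to a Magnitude by counting how many cutoffs it reaches."""
    return Magnitude(sum(value >= cutoff for cutoff in cutoffs))


def magnitude(relative_change, significance_threshold):
    """Combine both factors; assumes significance_threshold > 0."""
    change = abs(relative_change)  # direction of the change is ignored
    absolute = bucket(change, ABSOLUTE_CUTOFFS)
    over = bucket(change / significance_threshold, OVER_THRESHOLD_CUTOFFS)
    # Assumption: the overall magnitude is the smaller of the two buckets.
    return min(absolute, over)
```

With these made-up cutoffs, a 6% change against a 5% threshold is bucketed as very small, while the same 6% change against a 0.1% threshold is bucketed as very large.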
68 | 64 |
|
69 |
| -The confidence in whether a test analysis is relevant depends on the number of significant test results and their magnitude (how large a change is regardless of the direction of the change). |
| 65 | +#### Confidence algorithm |
70 | 66 |
|
71 | 67 | The actual algorithm for determining confidence may change, but in general the following rules apply:
|
72 |
| -* Definitely relevant: any number of very large changes, a small amount of large and/or medium changes, or a large amount of small changes. |
73 |
| -* Probably relevant: any number of large changes, more than 1 medium change, or smaller but still substantial amount of small changes. |
| 68 | +* Definitely relevant: any number of very large or large changes, a small number of medium changes, or a large number of small or very small changes. |
| 69 | +* Probably relevant: any number of very large or large changes, any medium change, or a smaller but still substantial number of small or very small changes. |
74 | 70 | * Maybe relevant: if it doesn't fit into the above two categories, it ends in this category.
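The rules above could be sketched roughly as follows. The text does not say what counts as a "small", "large", or "substantial" number of changes, so the three constants are placeholders, and the function name and string labels are hypothetical. The input is the list of magnitudes of the significant test results only.

```python
from collections import Counter

# Placeholder counts; the source does not specify these values.
SMALL_AMOUNT = 3        # "a small number" of medium changes
LARGE_AMOUNT = 10       # "a large number" of small or very small changes
SUBSTANTIAL_AMOUNT = 5  # "smaller but still substantial"


def confidence(magnitudes):
    """Classify a comparison from the magnitudes of its significant results."""
    counts = Counter(magnitudes)  # missing keys count as 0
    big = counts["very large"] + counts["large"]
    medium = counts["medium"]
    minor = counts["small"] + counts["very small"]

    if big >= 1 or medium >= SMALL_AMOUNT or minor >= LARGE_AMOUNT:
        return "definitely relevant"
    # Large and very large changes were already handled above,
    # so only the remaining conditions are checked here.
    if medium >= 1 or minor >= SUBSTANTIAL_AMOUNT:
        return "probably relevant"
    return "maybe relevant"
```

Checking the categories from most to least confident means a comparison always lands in the strongest category it qualifies for.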
|
| 71 | + |
| 72 | +### "Dodgy" Test Cases |
| 73 | + |
| 74 | +"Dodgy" test cases are test cases that tend to produce unreliable results (i.e., noise). A test case is considered "dodgy" if its significance threshold is sufficiently far from 0. |