
Commit 82f8de8

George Spelvin authored and Somasundaram Krishnasamy committed
lib/list_sort: optimize number of calls to comparison function
[ Upstream commit b5c56e0 ]

CONFIG_RETPOLINE has severely degraded indirect function call performance, so it's worth putting some effort into reducing the number of times cmp() is called.

This patch avoids badly unbalanced merges on unlucky input sizes. It slightly increases the code size, but saves an average of 0.2*n calls to cmp().

x86-64 code size 739 -> 803 bytes (+64)

Unfortunately, there's not a lot of low-hanging fruit in a merge sort; it already performs only n*log2(n) - K*n + O(1) compares. The leading coefficient is already at the theoretical limit (log2(n!) corresponds to K=1.4427), so we're fighting over the linear term, and the best mergesort can do is K=1.2645, achieved when n is a power of 2.

The differences between mergesort variants appear when n is *not* a power of 2; K is a function of the fractional part of log2(n). Top-down mergesort does best of all, achieving a minimum K=1.2408, and an average (over all sizes) K=1.248. However, that requires knowing the number of entries to be sorted ahead of time, and making a full pass over the input to count it conflicts with a second performance goal, which is cache blocking.

Obviously, we have to read the entire list into L1 cache at some point, and performance is best if it fits. But if it doesn't fit, each full pass over the input causes a cache miss per element, which is undesirable.

While textbooks explain bottom-up mergesort as a succession of merging passes, practical implementations do merging in depth-first order: as soon as two lists of the same size are available, they are merged. This allows as many merge passes as possible to fit into L1; only the final few merges force cache misses.

This cache-friendly depth-first merge order depends on us merging the beginning of the input as much as possible before we've even seen the end of the input (and thus know its size).

The simple eager merge pattern causes bad performance when n is just over a power of 2.
If n=1028, the final merge is between 1024- and 4-element lists, which is wasteful of comparisons. (This is actually worse on average than n=1025, because a 1024:1 merge will, on average, end after 512 compares, while 1024:4 will walk 4/5 of the list.)

Because of this, bottom-up mergesort achieves K < 0.5 for such sizes, and has an average (over all sizes) K of around 1. (My experiments show K=1.01, while theory predicts K=0.965.)

There are "worst-case optimal" variants of bottom-up mergesort which avoid this bad performance, but the algorithms given in the literature, such as queue-mergesort and boustrophedonic mergesort, depend on the breadth-first multi-pass structure that we are trying to avoid.

This implementation is as eager as possible while ensuring that all merge passes are at worst 1:2 unbalanced. This achieves the same average K=1.207 as queue-mergesort, which is 0.2*n better than bottom-up, and only 0.04*n behind top-down mergesort.

Specifically, it defers merging two lists of size 2^k until it is known that there are 2^k additional inputs following. This ensures that the final uneven merges triggered by reaching the end of the input will be at worst 2:1. This will avoid cache misses as long as 3*2^k elements fit into the cache.

(I confess to being more than a little bit proud of how clean this code turned out. It took a lot of thinking, but the resultant inner loop is very simple and efficient.)

Refs:
  Bottom-up Mergesort: A Detailed Analysis
  Wolfgang Panny, Helmut Prodinger
  Algorithmica 14(4):340--354, October 1995
  https://doi.org/10.1007/BF01294131
  https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.5260

  The cost distribution of queue-mergesort, optimal mergesorts, and
  power-of-two rules
  Wei-Mei Chen, Hsien-Kuei Hwang, Gen-Huey Chen
  Journal of Algorithms 30(2):423--448, February 1999
  https://doi.org/10.1006/jagm.1998.0986
  https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.5380

  Queue-Mergesort
  Mordecai J. Golin, Robert Sedgewick
  Information Processing Letters, 48(5):253--259, 10 December 1993
  https://doi.org/10.1016/0020-0190(93)90088-q
  https://sci-hub.tw/10.1016/0020-0190(93)90088-Q

Feedback from Rasmus Villemoes <[email protected]>.

Link: http://lkml.kernel.org/r/fd560853cc4dca0d0f02184ffa888b4c1be89abc.1552704200.git.lkml@sdf.org
Signed-off-by: George Spelvin <[email protected]>
Acked-by: Andrey Abramov <[email protected]>
Acked-by: Rasmus Villemoes <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Cc: Daniel Wagner <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Don Mullis <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
(cherry picked from commit b5c56e0)

Orabug: 28894138

Signed-off-by: Thomas Tai <[email protected]>
Reviewed-by: Tom Saeger <[email protected]>
Signed-off-by: Somasundaram Krishnasamy <[email protected]>
1 parent cfad9ce commit 82f8de8


1 file changed: +91 -22 lines


lib/list_sort.c

Lines changed: 91 additions & 22 deletions
@@ -107,11 +107,6 @@ static void merge_final(void *priv, cmp_func cmp, struct list_head *head,
  * @head: the list to sort
  * @cmp: the elements comparison function
  *
- * This function implements a bottom-up merge sort, which has O(nlog(n))
- * complexity. We use depth-first order to take advantage of cacheing.
- * (E.g. when we get to the fourth element, we immediately merge the
- * first two 2-element lists.)
- *
  * The comparison funtion @cmp must return > 0 if @a should sort after
  * @b ("@a > @b" if you want an ascending sort), and <= 0 if @a should
  * sort before @b *or* their original order should be preserved. It is
@@ -131,6 +126,60 @@ static void merge_final(void *priv, cmp_func cmp, struct list_head *head,
  *	if (a->middle != b->middle)
  *		return a->middle > b->middle;
  *	return a->low > b->low;
+ *
+ *
+ * This mergesort is as eager as possible while always performing at least
+ * 2:1 balanced merges.  Given two pending sublists of size 2^k, they are
+ * merged to a size-2^(k+1) list as soon as we have 2^k following elements.
+ *
+ * Thus, it will avoid cache thrashing as long as 3*2^k elements can
+ * fit into the cache.  Not quite as good as a fully-eager bottom-up
+ * mergesort, but it does use 0.2*n fewer comparisons, so is faster in
+ * the common case that everything fits into L1.
+ *
+ *
+ * The merging is controlled by "count", the number of elements in the
+ * pending lists.  This is beautifully simple code, but rather subtle.
+ *
+ * Each time we increment "count", we set one bit (bit k) and clear
+ * bits k-1 .. 0.  Each time this happens (except the very first time
+ * for each bit, when count increments to 2^k), we merge two lists of
+ * size 2^k into one list of size 2^(k+1).
+ *
+ * This merge happens exactly when the count reaches an odd multiple of
+ * 2^k, which is when we have 2^k elements pending in smaller lists,
+ * so it's safe to merge away two lists of size 2^k.
+ *
+ * After this happens twice, we have created two lists of size 2^(k+1),
+ * which will be merged into a list of size 2^(k+2) before we create
+ * a third list of size 2^(k+1), so there are never more than two pending.
+ *
+ * The number of pending lists of size 2^k is determined by the
+ * state of bit k of "count" plus two extra pieces of information:
+ * - The state of bit k-1 (when k == 0, consider bit -1 always set), and
+ * - Whether the higher-order bits are zero or non-zero (i.e.
+ *   is count >= 2^(k+1)).
+ * There are six states we distinguish.  "x" represents some arbitrary
+ * bits, and "y" represents some arbitrary non-zero bits:
+ * 0:  00x: 0 pending of size 2^k;           x pending of sizes < 2^k
+ * 1:  01x: 0 pending of size 2^k; 2^(k-1) + x pending of sizes < 2^k
+ * 2: x10x: 0 pending of size 2^k;     2^k + x pending of sizes < 2^k
+ * 3: x11x: 1 pending of size 2^k; 2^(k-1) + x pending of sizes < 2^k
+ * 4: y00x: 1 pending of size 2^k;     2^k + x pending of sizes < 2^k
+ * 5: y01x: 2 pending of size 2^k; 2^(k-1) + x pending of sizes < 2^k
+ *          (merge and loop back to state 2)
+ *
+ * We gain lists of size 2^k in the 2->3 and 4->5 transitions (because
+ * bit k-1 is set while the more significant bits are non-zero) and
+ * merge them away in the 5->2 transition.  Note in particular that just
+ * before the 5->2 transition, all lower-order bits are 11 (state 3),
+ * so there is one list of each smaller size.
+ *
+ * When we reach the end of the input, we merge all the pending
+ * lists, from smallest to largest.  If you work through cases 2 to
+ * 5 above, you can see that the number of elements we merge with a list
+ * of size 2^k varies from 2^(k-1) (cases 3 and 5 when x == 0) to
+ * 2^(k+1) - 1 (second merge of case 5 when x == 2^(k-1) - 1).
  */
 __attribute__((nonnull(2,3)))
 void list_sort(void *priv, struct list_head *head,
@@ -152,33 +201,53 @@ void list_sort(void *priv, struct list_head *head,
 	 * pointers are not maintained.
 	 * - pending is a prev-linked "list of lists" of sorted
 	 *   sublists awaiting further merging.
-	 * - Each of the sorted sublists is power-of-two in size,
-	 *   corresponding to bits set in "count".
+	 * - Each of the sorted sublists is power-of-two in size.
 	 * - Sublists are sorted by size and age, smallest & newest at front.
+	 * - There are zero to two sublists of each size.
+	 * - A pair of pending sublists are merged as soon as the number
+	 *   of following pending elements equals their size (i.e.
+	 *   each time count reaches an odd multiple of that size).
+	 *   That ensures each later final merge will be at worst 2:1.
+	 * - Each round consists of:
+	 *   - Merging the two sublists selected by the highest bit
+	 *     which flips when count is incremented, and
+	 *   - Adding an element from the input as a size-1 sublist.
 	 */
 	do {
 		size_t bits;
-		struct list_head *cur = list;
+		struct list_head **tail = &pending;
 
-		/* Extract the head of "list" as a single-element list "cur" */
-		list = list->next;
-		cur->next = NULL;
+		/* Find the least-significant clear bit in count */
+		for (bits = count; bits & 1; bits >>= 1)
+			tail = &(*tail)->prev;
+		/* Do the indicated merge */
+		if (likely(bits)) {
+			struct list_head *a = *tail, *b = a->prev;
 
-		/* Do merges corresponding to set lsbits in count */
-		for (bits = count; bits & 1; bits >>= 1) {
-			cur = merge(priv, (cmp_func)cmp, pending, cur);
-			pending = pending->prev;	/* Untouched by merge() */
+			a = merge(priv, (cmp_func)cmp, b, a);
+			/* Install the merged result in place of the inputs */
+			a->prev = b->prev;
+			*tail = a;
 		}
-		/* And place the result at the head of "pending" */
-		cur->prev = pending;
-		pending = cur;
+
+		/* Move one element from input list to pending */
+		list->prev = pending;
+		pending = list;
+		list = list->next;
+		pending->next = NULL;
 		count++;
-	} while (list->next);
+	} while (list);
+
+	/* End of input; merge together all the pending lists. */
+	list = pending;
+	pending = pending->prev;
+	for (;;) {
+		struct list_head *next = pending->prev;
 
-	/* Now merge together last element with all pending lists */
-	while (pending->prev) {
+		if (!next)
+			break;
 		list = merge(priv, (cmp_func)cmp, pending, list);
-		pending = pending->prev;
+		pending = next;
 	}
 	/* The final merge, rebuilding prev links */
 	merge_final(priv, (cmp_func)cmp, head, pending, list);
