Commit d0ecd89

netoptimizer authored and torvalds committed
slub: optimize bulk slowpath free by detached freelist
This change focuses on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.

The calls slab_free (fastpath) and __slab_free (slowpath) have been
extended with support for bulk free, which amortizes the overhead of
the (locked) cmpxchg_double.

To use the new bulking feature, we build what I call a detached
freelist.  The detached freelist takes advantage of three properties:

 1) the free function call owns the object that is about to be freed,
    thus writing into this memory is synchronization-free.

 2) many freelists can co-exist side-by-side in the same slab-page,
    each with a separate head pointer.

 3) it is the visibility of the head pointer that needs synchronization.

Given these properties, the brilliant part is that the detached
freelist can be constructed directly in the page objects, without any
need for synchronization.  The detached freelist is allocated on the
stack of the function call kmem_cache_free_bulk.  Thus, the freelist
head pointer is not visible to other CPUs.

All objects in a SLUB freelist must belong to the same slab-page.
Thus, constructing the detached freelist is about matching objects
that belong to the same slab-page.  The bulk free array is scanned in
a progressive manner with a limited look-ahead facility.

Kmem debug support is handled in the call of slab_free().

Notice kmem_cache_free_bulk no longer needs to disable IRQs.  This
only slowed down single free bulk by approx 3 cycles.

Performance data:
 Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz

SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns

To get stable and comparable numbers, the kernel has been booted with
"slab_nomerge" (this also improves performance for larger bulk sizes).

Performance data, compared against fallback bulking:

bulk -  fallback bulk            - improvement with this patch
   1 -  62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns - improved 21.0%
   2 -  55 cycles(tsc) 13.935 ns - 30 cycles(tsc)  7.506 ns - improved 45.5%
   3 -  53 cycles(tsc) 13.341 ns - 23 cycles(tsc)  5.865 ns - improved 56.6%
   4 -  52 cycles(tsc) 13.081 ns - 20 cycles(tsc)  5.048 ns - improved 61.5%
   8 -  50 cycles(tsc) 12.627 ns - 18 cycles(tsc)  4.659 ns - improved 64.0%
  16 -  49 cycles(tsc) 12.412 ns - 17 cycles(tsc)  4.495 ns - improved 65.3%
  30 -  49 cycles(tsc) 12.484 ns - 18 cycles(tsc)  4.533 ns - improved 63.3%
  32 -  50 cycles(tsc) 12.627 ns - 18 cycles(tsc)  4.707 ns - improved 64.0%
  34 -  96 cycles(tsc) 24.243 ns - 23 cycles(tsc)  5.976 ns - improved 76.0%
  48 -  83 cycles(tsc) 20.818 ns - 21 cycles(tsc)  5.329 ns - improved 74.7%
  64 -  74 cycles(tsc) 18.700 ns - 20 cycles(tsc)  5.127 ns - improved 73.0%
 128 -  90 cycles(tsc) 22.734 ns - 27 cycles(tsc)  6.833 ns - improved 70.0%
 158 -  99 cycles(tsc) 24.776 ns - 30 cycles(tsc)  7.583 ns - improved 69.7%
 250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc)  9.280 ns - improved 64.4%

Performance data, compared against current in-kernel bulking:

bulk - curr in-kernel  - improvement with this patch
   1 -  46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3)  -6.5%
   2 -  27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
   3 -  21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2)  -9.5%
   4 -  18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
   8 -  17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1)  -5.9%
  16 -  18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1)   5.6%
  30 -  18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0)   0.0%
  32 -  18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0)   0.0%
  34 -  78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55)  70.5%
  48 -  60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39)  65.0%
  64 -  49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29)  59.2%
 128 -  69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42)  60.9%
 158 -  79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49)  62.0%
 250 -  86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49)  57.0%

Performance with normal SLUB merging is significantly slower for
larger bulking.  This is believed to (primarily) be an effect of not
having to share the per-CPU data-structures, as tuning per-CPU size
can achieve similar performance.

bulk - slab_nomerge   - normal SLUB merge
   1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
   2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
   3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
   4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
   8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
  16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
  30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
  32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
  34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
  48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
  64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
 128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
 158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
 250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19

Joint work with Alexander Duyck.

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c

[[email protected]: BUG_ON -> WARN_ON;return]
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
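The scan described in the commit message can be tried outside the kernel. The following is a minimal user-space sketch, not the kernel code: it assumes a 4096-byte page, derives the "page" of an object by masking its address (a stand-in for virt_to_head_page), and stores the free pointer in the first word of each object (a stand-in for set_freepointer). All `model_*` names and the two-page test buffer are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096 /* assumed page size for this model */
#define MODEL_OBJ_SIZE   256 /* object size used in the benchmark */

/* Two fake slab pages; page alignment lets us derive the page from an address. */
static _Alignas(MODEL_PAGE_SIZE) unsigned char model_pages[2 * MODEL_PAGE_SIZE];

struct detached_freelist {
	uintptr_t page; /* stand-in for struct page *; 0 means no list was built */
	void *tail;
	void *freelist;
	int cnt;
};

/* Hypothetical stand-in for virt_to_head_page(): round down to the page. */
static uintptr_t model_page_of(const void *obj)
{
	return (uintptr_t)obj & ~(uintptr_t)(MODEL_PAGE_SIZE - 1);
}

/* Hypothetical stand-in for set_freepointer(): link lives in the object. */
static void model_set_freepointer(void *obj, void *next)
{
	memcpy(obj, &next, sizeof(next));
}

static void *model_get_freepointer(const void *obj)
{
	void *next;

	memcpy(&next, obj, sizeof(next));
	return next;
}

/* Same control flow as build_detached_freelist() in the diff. */
static size_t model_build_detached_freelist(size_t size, void **p,
					    struct detached_freelist *df)
{
	size_t first_skipped_index = 0;
	int lookahead = 3;
	void *object;

	df->page = 0;
	do {
		object = p[--size];
	} while (!object && size);
	if (!object)
		return 0;

	/* Start a new list; the owner writes into the object without locks. */
	model_set_freepointer(object, NULL);
	df->page = model_page_of(object);
	df->tail = object;
	df->freelist = object;
	df->cnt = 1;
	p[size] = NULL; /* mark object processed */

	while (size) {
		object = p[--size];
		if (!object)
			continue; /* already processed */
		if (df->page == model_page_of(object)) {
			model_set_freepointer(object, df->freelist);
			df->freelist = object;
			df->cnt++;
			p[size] = NULL;
			continue;
		}
		if (!--lookahead)
			break; /* limit the look-ahead */
		if (!first_skipped_index)
			first_skipped_index = size + 1;
	}
	return first_skipped_index;
}

/* Frees six objects interleaved across two pages; returns the number of lists. */
static int model_free_bulk_demo(int *objs)
{
	unsigned char *a = model_pages, *b = model_pages + MODEL_PAGE_SIZE;
	void *p[] = { a, b, a + MODEL_OBJ_SIZE, b + MODEL_OBJ_SIZE,
		      a + 2 * MODEL_OBJ_SIZE, a + 3 * MODEL_OBJ_SIZE };
	size_t size = sizeof(p) / sizeof(p[0]);
	int lists = 0;

	*objs = 0;
	do {
		struct detached_freelist df;
		int walked = 0;
		void *o;

		size = model_build_detached_freelist(size, p, &df);
		if (!df.page)
			continue;
		for (o = df.freelist; o; o = model_get_freepointer(o)) {
			assert(model_page_of(o) == df.page);
			walked++;
		}
		assert(walked == df.cnt);
		lists++; /* the kernel would call slab_free() once here */
		*objs += df.cnt;
	} while (size);
	return lists;
}
```

In this model the six interleaved objects collapse into two detached lists (four objects from the first page, two from the second), so the expensive synchronized step would run twice instead of six times.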
1 parent 8108465 commit d0ecd89


1 file changed, +79 -30 lines changed


mm/slub.c

Lines changed: 79 additions & 30 deletions
@@ -2807,44 +2807,93 @@ void kmem_cache_free(struct kmem_cache *s, void *x)
 }
 EXPORT_SYMBOL(kmem_cache_free);
 
-/* Note that interrupts must be enabled when calling this function. */
-void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
-{
-	struct kmem_cache_cpu *c;
+struct detached_freelist {
 	struct page *page;
-	int i;
+	void *tail;
+	void *freelist;
+	int cnt;
+};
 
-	local_irq_disable();
-	c = this_cpu_ptr(s->cpu_slab);
+/*
+ * This function progressively scans the array with free objects (with
+ * a limited look ahead) and extract objects belonging to the same
+ * page.  It builds a detached freelist directly within the given
+ * page/objects.  This can happen without any need for
+ * synchronization, because the objects are owned by running process.
+ * The freelist is build up as a single linked list in the objects.
+ * The idea is, that this detached freelist can then be bulk
+ * transferred to the real freelist(s), but only requiring a single
+ * synchronization primitive.  Look ahead in the array is limited due
+ * to performance reasons.
+ */
+static int build_detached_freelist(struct kmem_cache *s, size_t size,
+				   void **p, struct detached_freelist *df)
+{
+	size_t first_skipped_index = 0;
+	int lookahead = 3;
+	void *object;
 
-	for (i = 0; i < size; i++) {
-		void *object = p[i];
+	/* Always re-init detached_freelist */
+	df->page = NULL;
 
-		BUG_ON(!object);
-		/* kmem cache debug support */
-		s = cache_from_obj(s, object);
-		if (unlikely(!s))
-			goto exit;
-		slab_free_hook(s, object);
+	do {
+		object = p[--size];
+	} while (!object && size);
 
-		page = virt_to_head_page(object);
+	if (!object)
+		return 0;
 
-		if (c->page == page) {
-			/* Fastpath: local CPU free */
-			set_freepointer(s, object, c->freelist);
-			c->freelist = object;
-		} else {
-			c->tid = next_tid(c->tid);
-			local_irq_enable();
-			/* Slowpath: overhead locked cmpxchg_double_slab */
-			__slab_free(s, page, object, object, 1, _RET_IP_);
-			local_irq_disable();
-			c = this_cpu_ptr(s->cpu_slab);
+	/* Start new detached freelist */
+	set_freepointer(s, object, NULL);
+	df->page = virt_to_head_page(object);
+	df->tail = object;
+	df->freelist = object;
+	p[size] = NULL; /* mark object processed */
+	df->cnt = 1;
+
+	while (size) {
+		object = p[--size];
+		if (!object)
+			continue; /* Skip processed objects */
+
+		/* df->page is always set at this point */
+		if (df->page == virt_to_head_page(object)) {
+			/* Opportunity build freelist */
+			set_freepointer(s, object, df->freelist);
+			df->freelist = object;
+			df->cnt++;
+			p[size] = NULL; /* mark object processed */
+
+			continue;
 		}
+
+		/* Limit look ahead search */
+		if (!--lookahead)
+			break;
+
+		if (!first_skipped_index)
+			first_skipped_index = size + 1;
 	}
-exit:
-	c->tid = next_tid(c->tid);
-	local_irq_enable();
+
+	return first_skipped_index;
+}
+
+
+/* Note that interrupts must be enabled when calling this function. */
+void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
+{
+	if (WARN_ON(!size))
+		return;
+
+	do {
+		struct detached_freelist df;
+
+		size = build_detached_freelist(s, size, p, &df);
+		if (unlikely(!df.page))
+			continue;
+
+		slab_free(s, df.page, df.freelist, df.tail, df.cnt, _RET_IP_);
+	} while (likely(size));
 }
 EXPORT_SYMBOL(kmem_cache_free_bulk);
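The comment on build_detached_freelist says the detached freelist can be transferred to the real freelist with a single synchronization primitive. The sketch below illustrates only that idea in user space with C11 atomics, under simplifying assumptions: the shared list is a plain singly linked stack with one atomic head pointer, whereas SLUB updates the page freelist together with its counters via cmpxchg_double. The names `node`, `shared_head`, `splice_detached` and `splice_demo` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node {
	struct node *next;
};

/* Shared list head: the only location other threads can observe. */
static _Atomic(struct node *) shared_head;

/*
 * Publish a privately built chain [head..tail] with one compare-exchange.
 * Writing tail->next needs no synchronization because the chain is not
 * reachable from shared_head until the compare-exchange succeeds.
 */
static void splice_detached(struct node *head, struct node *tail)
{
	struct node *old = atomic_load_explicit(&shared_head,
						memory_order_relaxed);

	do {
		tail->next = old; /* on failure, old holds the fresh head */
	} while (!atomic_compare_exchange_weak_explicit(&shared_head, &old,
							head,
							memory_order_release,
							memory_order_relaxed));
}

/* Splices three private nodes in front of one existing node. */
static int splice_demo(void)
{
	static struct node existing, n[3];
	struct node *it;
	int len = 0;

	existing.next = NULL;
	atomic_store(&shared_head, &existing);

	/* Private chain n[2] -> n[1] -> n[0]; n[0] is the tail. */
	n[0].next = NULL;
	n[1].next = &n[0];
	n[2].next = &n[1];

	splice_detached(&n[2], &n[0]);

	assert(atomic_load(&shared_head) == &n[2]);
	for (it = atomic_load(&shared_head); it; it = it->next)
		len++;
	return len;
}
```

However long the private chain is, the cost of publishing it stays at one successful compare-exchange, which is the amortization the benchmark tables measure.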
