mm/slub: optimize alloc/free fastpath by removing preemption on/off

JoonsooKim · torvalds · commit 9aabf810a67c · 2015-02-10T14:30:30.000-08:00
We had to insert a preempt enable/disable in the fastpath a while ago in
order to guarantee that tid and kmem_cache_cpu are retrieved on the same
cpu.  This is a problem only with CONFIG_PREEMPT, where the scheduler can
move the process to another cpu while the data is being retrieved.

I have now found a way to remove the preempt enable/disable from the
fastpath.  If tid matches kmem_cache_cpu's tid after tid and
kmem_cache_cpu have been retrieved by separate this_cpu operations, it
means that they were retrieved on the same cpu.  If they do not match,
we just have to retry.

With this guarantee, preemption enable/disable isn't needed at all, even
with CONFIG_PREEMPT, so this patch removes it.
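
The retry idea can be tried out in userspace with a small simulation.
This is only a sketch under stated assumptions, not the kernel code:
struct cpu_slab_sim, sim_reset(), sim_this_cpu_read_tid(),
sim_raw_cpu_ptr() and fetch_consistent() are hypothetical names made up
for illustration, and the per-cpu tids are assumed to never collide
across cpus (here simply tid == cpu number).

```c
#include <assert.h>

#define NR_CPUS 2

/* Hypothetical stand-in for struct kmem_cache_cpu; only the tid matters. */
struct cpu_slab_sim {
	unsigned long tid;
};

/* One slot per simulated cpu. The tids are distinct (tid == cpu). */
static struct cpu_slab_sim cpu_slab[NR_CPUS];

static int current_cpu;       /* cpu the simulated task is running on */
static int migrate_countdown; /* N > 0: migrate right after the Nth access */

/* Reset the simulation: start on 'cpu', optionally schedule one migration. */
static void sim_reset(int cpu, int migrate_after)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	for (int i = 0; i < NR_CPUS; i++)
		cpu_slab[i].tid = (unsigned long)i;
	current_cpu = cpu;
	migrate_countdown = migrate_after;
}

/* Simulated preemption point after every per-cpu access. */
static void maybe_migrate(void)
{
	if (migrate_countdown > 0 && --migrate_countdown == 0)
		current_cpu = (current_cpu + 1) % NR_CPUS;
}

/* Assumed stand-in for this_cpu_read(s->cpu_slab->tid). */
static unsigned long sim_this_cpu_read_tid(void)
{
	unsigned long tid = cpu_slab[current_cpu].tid;

	maybe_migrate();
	return tid;
}

/* Assumed stand-in for raw_cpu_ptr(s->cpu_slab). */
static struct cpu_slab_sim *sim_raw_cpu_ptr(void)
{
	struct cpu_slab_sim *c = &cpu_slab[current_cpu];

	maybe_migrate();
	return c;
}

/*
 * Fetch tid and the per-cpu pointer with two separate accesses and retry
 * until they agree. '*retries' reports how many extra passes were needed.
 */
static struct cpu_slab_sim *fetch_consistent(unsigned long *tid, int *retries)
{
	struct cpu_slab_sim *c;

	*retries = -1;
	do {
		(*retries)++;
		*tid = sim_this_cpu_read_tid();
		c = sim_raw_cpu_ptr();
	} while (*tid != c->tid);

	return c;
}
```

With sim_reset(0, 1) the task migrates between the two accesses, so the
first pass compares cpu 0's tid against cpu 1's slot and the loop takes
one extra pass.  With sim_reset(0, 0) nothing migrates and the loop
exits on the first pass.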

I saw a roughly 5% win in a fast-path loop over kmem_cache_alloc/free
with CONFIG_PREEMPT.  (14.821 ns -> 14.049 ns)

Below is the result of Christoph's slab_test reported by Jesper Dangaard
Brouer.

* Before

 Single thread testing
 =====================
 1. Kmalloc: Repeatedly allocate then free test
 10000 times kmalloc(8) -> 49 cycles kfree -> 62 cycles
 10000 times kmalloc(16) -> 48 cycles kfree -> 64 cycles
 10000 times kmalloc(32) -> 53 cycles kfree -> 70 cycles
 10000 times kmalloc(64) -> 64 cycles kfree -> 77 cycles
 10000 times kmalloc(128) -> 74 cycles kfree -> 84 cycles
 10000 times kmalloc(256) -> 84 cycles kfree -> 114 cycles
 10000 times kmalloc(512) -> 83 cycles kfree -> 116 cycles
 10000 times kmalloc(1024) -> 81 cycles kfree -> 120 cycles
 10000 times kmalloc(2048) -> 104 cycles kfree -> 136 cycles
 10000 times kmalloc(4096) -> 142 cycles kfree -> 165 cycles
 10000 times kmalloc(8192) -> 238 cycles kfree -> 226 cycles
 10000 times kmalloc(16384) -> 403 cycles kfree -> 264 cycles
 2. Kmalloc: alloc/free test
 10000 times kmalloc(8)/kfree -> 68 cycles
 10000 times kmalloc(16)/kfree -> 68 cycles
 10000 times kmalloc(32)/kfree -> 69 cycles
 10000 times kmalloc(64)/kfree -> 68 cycles
 10000 times kmalloc(128)/kfree -> 68 cycles
 10000 times kmalloc(256)/kfree -> 68 cycles
 10000 times kmalloc(512)/kfree -> 74 cycles
 10000 times kmalloc(1024)/kfree -> 75 cycles
 10000 times kmalloc(2048)/kfree -> 74 cycles
 10000 times kmalloc(4096)/kfree -> 74 cycles
 10000 times kmalloc(8192)/kfree -> 75 cycles
 10000 times kmalloc(16384)/kfree -> 510 cycles

* After

 Single thread testing
 =====================
 1. Kmalloc: Repeatedly allocate then free test
 10000 times kmalloc(8) -> 46 cycles kfree -> 61 cycles
 10000 times kmalloc(16) -> 46 cycles kfree -> 63 cycles
 10000 times kmalloc(32) -> 49 cycles kfree -> 69 cycles
 10000 times kmalloc(64) -> 57 cycles kfree -> 76 cycles
 10000 times kmalloc(128) -> 66 cycles kfree -> 83 cycles
 10000 times kmalloc(256) -> 84 cycles kfree -> 110 cycles
 10000 times kmalloc(512) -> 77 cycles kfree -> 114 cycles
 10000 times kmalloc(1024) -> 80 cycles kfree -> 116 cycles
 10000 times kmalloc(2048) -> 102 cycles kfree -> 131 cycles
 10000 times kmalloc(4096) -> 135 cycles kfree -> 163 cycles
 10000 times kmalloc(8192) -> 238 cycles kfree -> 218 cycles
 10000 times kmalloc(16384) -> 399 cycles kfree -> 262 cycles
 2. Kmalloc: alloc/free test
 10000 times kmalloc(8)/kfree -> 65 cycles
 10000 times kmalloc(16)/kfree -> 66 cycles
 10000 times kmalloc(32)/kfree -> 65 cycles
 10000 times kmalloc(64)/kfree -> 66 cycles
 10000 times kmalloc(128)/kfree -> 66 cycles
 10000 times kmalloc(256)/kfree -> 71 cycles
 10000 times kmalloc(512)/kfree -> 72 cycles
 10000 times kmalloc(1024)/kfree -> 71 cycles
 10000 times kmalloc(2048)/kfree -> 71 cycles
 10000 times kmalloc(4096)/kfree -> 71 cycles
 10000 times kmalloc(8192)/kfree -> 65 cycles
 10000 times kmalloc(16384)/kfree -> 511 cycles

Most of the results are better than before.

Note that this change slightly worsens performance with !CONFIG_PREEMPT,
by roughly 0.3%.  Implementing each case separately would help
performance, but since the difference is so marginal, I didn't do that.
Keeping the same code for all cases also helps maintenance.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff --git a/mm/slub.c b/mm/slub.c
@@ -2398,22 +2398,31 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
 	 * reading from one cpu area. That does not matter as long
 	 * as we end up on the original cpu again when doing the cmpxchg.
 	 *
-	 * Preemption is disabled for the retrieval of the tid because that
-	 * must occur from the current processor. We cannot allow rescheduling
-	 * on a different processor between the determination of the pointer
-	 * and the retrieval of the tid.
+	 * We should guarantee that tid and kmem_cache are retrieved on
+	 * the same cpu. It could be different if CONFIG_PREEMPT so we need
+	 * to check if it is matched or not.
 	 */
-	preempt_disable();
-	c = this_cpu_ptr(s->cpu_slab);
+	do {
+		tid = this_cpu_read(s->cpu_slab->tid);
+		c = raw_cpu_ptr(s->cpu_slab);
+	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
+
+	/*
+	 * Irqless object alloc/free algorithm used here depends on sequence
+	 * of fetching cpu_slab's data. tid should be fetched before anything
+	 * on c to guarantee that object and page associated with previous tid
+	 * won't be used with current tid. If we fetch tid first, object and
+	 * page could be one associated with next tid and our alloc/free
+	 * request will be failed. In this case, we will retry. So, no problem.
+	 */
+	barrier();
 
 	/*
 	 * The transaction ids are globally unique per cpu and per operation on
 	 * a per cpu queue. Thus they can be guarantee that the cmpxchg_double
 	 * occurs on the right processor and that there was no operation on the
 	 * linked list in between.
 	 */
-	tid = c->tid;
-	preempt_enable();
 
 	object = c->freelist;
 	page = c->page;
@@ -2659,11 +2668,13 @@ static __always_inline void slab_free(struct kmem_cache *s,
 	 * data is retrieved via this pointer. If we are on the same cpu
 	 * during the cmpxchg then the free will succedd.
 	 */
-	preempt_disable();
-	c = this_cpu_ptr(s->cpu_slab);
+	do {
+		tid = this_cpu_read(s->cpu_slab->tid);
+		c = raw_cpu_ptr(s->cpu_slab);
+	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
 
-	tid = c->tid;
-	preempt_enable();
+	/* Same with comment on barrier() in slab_alloc_node() */
+	barrier();
 
 	if (likely(page == c->page)) {
 		set_freepointer(s, object, c->freelist);