
Commit 64489e7

Waiman Long authored and Ingo Molnar committed
locking/rwsem: Implement a new locking scheme
The current way of using various reader, writer and waiting biases in the rwsem code is confusing and hard to understand. I have to reread the rwsem count guide in the rwsem-xadd.c file from time to time to remind myself how this whole thing works. It also makes the rwsem code harder to optimize.

To make rwsem more sane, a new locking scheme similar to the one in qrwlock is now being used. The atomic long count has the following bit definitions:

  Bit 0    - writer locked bit
  Bit 1    - waiters present bit
  Bits 2-7 - reserved for future extension
  Bits 8-X - reader count (24/56 bits)

The cmpxchg instruction is now used to acquire the write lock. The read lock is still acquired with the xadd instruction, so there is no change there. This scheme allows up to 16M/64P active readers, which should be more than enough. We can always use some of the reserved bits if necessary.

With that change, we can deterministically know whether a rwsem has been write-locked. Looking at the count alone, however, one cannot determine for certain whether a rwsem is owned by readers, as the readers that set the reader count bits may be in the process of backing out. So we still need the reader-owned bit in the owner field to be sure.

With a locking microbenchmark running on a 5.1-based kernel, the total locking rates (in kops/s) of the benchmark on an 8-socket 120-core IvyBridge-EX system before and after the patch were as follows:

                 Before Patch       After Patch
  # of Threads   wlock    rlock    wlock    rlock
  ------------   -----    -----    -----    -----
       1         30,659   31,341   31,055   31,283
       2          8,909   16,457    9,884   17,659
       4          9,028   15,823    8,933   20,233
       8          8,410   14,212    7,230   17,140
      16          8,217   25,240    7,479   24,607

The locking rates of the benchmark on a Power8 system were as follows:

                 Before Patch       After Patch
  # of Threads   wlock    rlock    wlock    rlock
  ------------   -----    -----    -----    -----
       1         12,963   13,647   13,275   13,601
       2          7,570   11,569    7,902   10,829
       4          5,232    5,516    5,466    5,435
       8          5,233    3,386    5,467    3,168

The locking rates of the benchmark on a 2-socket ARM64 system were as follows:

                 Before Patch       After Patch
  # of Threads   wlock    rlock    wlock    rlock
  ------------   -----    -----    -----    -----
       1         21,495   21,046   21,524   21,074
       2          5,293   10,502    5,333   10,504
       4          5,325   11,463    5,358   11,631
       8          5,391   11,712    5,470   11,680

Performance is roughly the same before and after the patch. There are run-to-run variations in performance; runs with higher variance usually have higher throughput.

Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: huang ying <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
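The bit layout above maps naturally onto a handful of mask macros. As a hedged sketch only: the macro names below are the ones used in the diff that follows, but their definitions live in the second changed file, which is not shown in this excerpt, so the exact values (and RWSEM_READER_SHIFT) are an illustration of the stated layout rather than the commit's literal code.

#define RWSEM_WRITER_LOCKED	(1UL << 0)	/* Bit 0 - writer locked bit */
#define RWSEM_FLAG_WAITERS	(1UL << 1)	/* Bit 1 - waiters present bit */
						/* Bits 2-7 - reserved */
#define RWSEM_READER_SHIFT	8		/* Bits 8-X - reader count */
#define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
#define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
#define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
#define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK | RWSEM_READER_MASK)

With such a layout, count & RWSEM_LOCK_MASK is non-zero whenever any reader or writer holds the lock, which is the test the reworked rwsem_try_write_lock() in the diff below relies on.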
1 parent 5c1ec49 commit 64489e7

2 files changed, +85 -136 lines


kernel/locking/rwsem-xadd.c

Lines changed: 49 additions & 98 deletions
@@ -9,6 +9,8 @@
  *
  * Optimistic spinning by Tim Chen <[email protected]>
  * and Davidlohr Bueso <[email protected]>. Based on mutexes.
+ *
+ * Rwsem count bit fields re-definition by Waiman Long <[email protected]>.
  */
 #include <linux/rwsem.h>
 #include <linux/init.h>
@@ -22,52 +24,20 @@
 #include "rwsem.h"
 
 /*
- * Guide to the rw_semaphore's count field for common values.
- * (32-bit case illustrated, similar for 64-bit)
- *
- * 0x0000000X	(1) X readers active or attempting lock, no writer waiting
- *		    X = #active_readers + #readers attempting to lock
- *		    (X*ACTIVE_BIAS)
- *
- * 0x00000000	rwsem is unlocked, and no one is waiting for the lock or
- *		attempting to read lock or write lock.
- *
- * 0xffff000X	(1) X readers active or attempting lock, with waiters for lock
- *		    X = #active readers + # readers attempting lock
- *		    (X*ACTIVE_BIAS + WAITING_BIAS)
- *		(2) 1 writer attempting lock, no waiters for lock
- *		    X-1 = #active readers + #readers attempting lock
- *		    ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
- *		(3) 1 writer active, no waiters for lock
- *		    X-1 = #active readers + #readers attempting lock
- *		    ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
- *
- * 0xffff0001	(1) 1 reader active or attempting lock, waiters for lock
- *		    (WAITING_BIAS + ACTIVE_BIAS)
- *		(2) 1 writer active or attempting lock, no waiters for lock
- *		    (ACTIVE_WRITE_BIAS)
+ * Guide to the rw_semaphore's count field.
  *
- * 0xffff0000	(1) There are writers or readers queued but none active
- *		    or in the process of attempting lock.
- *		    (WAITING_BIAS)
- *		Note: writer can attempt to steal lock for this count by adding
- *		ACTIVE_WRITE_BIAS in cmpxchg and checking the old count
+ * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
+ * by a writer.
  *
- * 0xfffe0001	(1) 1 writer active, or attempting lock. Waiters on queue.
- *		    (ACTIVE_WRITE_BIAS + WAITING_BIAS)
- *
- * Note: Readers attempt to lock by adding ACTIVE_BIAS in down_read and checking
- *	 the count becomes more than 0 for successful lock acquisition,
- *	 i.e. the case where there are only readers or nobody has lock.
- *	 (1st and 2nd case above).
- *
- *	 Writers attempt to lock by adding ACTIVE_WRITE_BIAS in down_write and
- *	 checking the count becomes ACTIVE_WRITE_BIAS for successful lock
- *	 acquisition (i.e. nobody else has lock or attempts lock). If
- *	 unsuccessful, in rwsem_down_write_failed, we'll check to see if there
- *	 are only waiters but none active (5th case above), and attempt to
- *	 steal the lock.
+ * The lock is owned by readers when
+ * (1) the RWSEM_WRITER_LOCKED isn't set in count,
+ * (2) some of the reader bits are set in count, and
+ * (3) the owner field has RWSEM_READ_OWNED bit set.
  *
+ * Having some reader bits set is not enough to guarantee a readers owned
+ * lock as the readers may be in the process of backing out from the count
+ * and a writer has just released the lock. So another writer may steal
+ * the lock immediately after that.
  */
 
 /*
@@ -113,9 +83,8 @@ enum rwsem_wake_type {
 
 /*
  * handle the lock release when processes blocked on it that can now run
- * - if we come here from up_xxxx(), then:
- *   - the 'active part' of count (&0x0000ffff) reached 0 (but may have changed)
- *   - the 'waiting part' of count (&0xffff0000) is -ve (and will still be so)
+ * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
+ *   have been set.
  * - there must be someone on the queue
  * - the wait_lock must be held by the caller
  * - tasks are marked for wakeup, the caller must later invoke wake_up_q()
@@ -160,22 +129,11 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
	 * so we can bail out early if a writer stole the lock.
	 */
	if (wake_type != RWSEM_WAKE_READ_OWNED) {
-		adjustment = RWSEM_ACTIVE_READ_BIAS;
- try_reader_grant:
+		adjustment = RWSEM_READER_BIAS;
		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
-		if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
-			/*
-			 * If the count is still less than RWSEM_WAITING_BIAS
-			 * after removing the adjustment, it is assumed that
-			 * a writer has stolen the lock. We have to undo our
-			 * reader grant.
-			 */
-			if (atomic_long_add_return(-adjustment, &sem->count) <
-			    RWSEM_WAITING_BIAS)
-				return;
-
-			/* Last active locker left. Retry waking readers. */
-			goto try_reader_grant;
+		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
+			atomic_long_sub(adjustment, &sem->count);
+			return;
		}
		/*
		 * Set it to reader-owned to give spinners an early
@@ -209,11 +167,11 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
	}
	list_cut_before(&wlist, &sem->wait_list, &waiter->list);
 
-	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
+	adjustment = woken * RWSEM_READER_BIAS - adjustment;
	lockevent_cond_inc(rwsem_wake_reader, woken);
	if (list_empty(&sem->wait_list)) {
		/* hit end of list above */
-		adjustment -= RWSEM_WAITING_BIAS;
+		adjustment -= RWSEM_FLAG_WAITERS;
	}
 
	if (adjustment)
@@ -248,22 +206,15 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
  */
 static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
 {
-	/*
-	 * Avoid trying to acquire write lock if count isn't RWSEM_WAITING_BIAS.
-	 */
-	if (count != RWSEM_WAITING_BIAS)
+	long new;
+
+	if (count & RWSEM_LOCK_MASK)
		return false;
 
-	/*
-	 * Acquire the lock by trying to set it to ACTIVE_WRITE_BIAS. If there
-	 * are other tasks on the wait list, we need to add on WAITING_BIAS.
-	 */
-	count = list_is_singular(&sem->wait_list) ?
-			RWSEM_ACTIVE_WRITE_BIAS :
-			RWSEM_ACTIVE_WRITE_BIAS + RWSEM_WAITING_BIAS;
+	new = count + RWSEM_WRITER_LOCKED -
+	     (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
 
-	if (atomic_long_cmpxchg_acquire(&sem->count, RWSEM_WAITING_BIAS, count)
-						== RWSEM_WAITING_BIAS) {
+	if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
		rwsem_set_owner(sem);
		return true;
	}
@@ -279,9 +230,9 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
 {
	long count = atomic_long_read(&sem->count);
 
-	while (!count || count == RWSEM_WAITING_BIAS) {
+	while (!(count & RWSEM_LOCK_MASK)) {
		if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
-					count + RWSEM_ACTIVE_WRITE_BIAS)) {
+					count + RWSEM_WRITER_LOCKED)) {
			rwsem_set_owner(sem);
			lockevent_inc(rwsem_opt_wlock);
			return true;
@@ -424,7 +375,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 static inline struct rw_semaphore __sched *
 __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 {
-	long count, adjustment = -RWSEM_ACTIVE_READ_BIAS;
+	long count, adjustment = -RWSEM_READER_BIAS;
	struct rwsem_waiter waiter;
	DEFINE_WAKE_Q(wake_q);
 
@@ -436,16 +387,16 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
		/*
		 * In case the wait queue is empty and the lock isn't owned
		 * by a writer, this reader can exit the slowpath and return
-		 * immediately as its RWSEM_ACTIVE_READ_BIAS has already
-		 * been set in the count.
+		 * immediately as its RWSEM_READER_BIAS has already been
+		 * set in the count.
		 */
-		if (atomic_long_read(&sem->count) >= 0) {
+		if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
			raw_spin_unlock_irq(&sem->wait_lock);
			rwsem_set_reader_owned(sem);
			lockevent_inc(rwsem_rlock_fast);
			return sem;
		}
-		adjustment += RWSEM_WAITING_BIAS;
+		adjustment += RWSEM_FLAG_WAITERS;
	}
	list_add_tail(&waiter.list, &sem->wait_list);
 
@@ -458,9 +409,8 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
	 * If there are no writers and we are first in the queue,
	 * wake our own waiter to join the existing active readers !
	 */
-	if (count == RWSEM_WAITING_BIAS ||
-	    (count > RWSEM_WAITING_BIAS &&
-	     adjustment != -RWSEM_ACTIVE_READ_BIAS))
+	if (!(count & RWSEM_LOCK_MASK) ||
+	    (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 
	raw_spin_unlock_irq(&sem->wait_lock);
@@ -488,7 +438,7 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 out_nolock:
	list_del(&waiter.list);
	if (list_empty(&sem->wait_list))
-		atomic_long_add(-RWSEM_WAITING_BIAS, &sem->count);
+		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
	raw_spin_unlock_irq(&sem->wait_lock);
	__set_current_state(TASK_RUNNING);
	lockevent_inc(rwsem_rlock_fail);
@@ -521,9 +471,6 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
	struct rw_semaphore *ret = sem;
	DEFINE_WAKE_Q(wake_q);
 
-	/* undo write bias from down_write operation, stop active locking */
-	count = atomic_long_sub_return(RWSEM_ACTIVE_WRITE_BIAS, &sem->count);
-
	/* do optimistic spinning and steal lock if possible */
	if (rwsem_optimistic_spin(sem))
		return sem;
@@ -543,16 +490,18 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 
	list_add_tail(&waiter.list, &sem->wait_list);
 
-	/* we're now waiting on the lock, but no longer actively locking */
+	/* we're now waiting on the lock */
	if (waiting) {
		count = atomic_long_read(&sem->count);
 
		/*
		 * If there were already threads queued before us and there are
-		 * no active writers, the lock must be read owned; so we try to
-		 * wake any read locks that were queued ahead of us.
+		 * no active writers and some readers, the lock must be read
+		 * owned; so we try to wake any read locks that were queued
+		 * ahead of us.
		 */
-		if (count > RWSEM_WAITING_BIAS) {
+		if (!(count & RWSEM_WRITER_MASK) &&
+				(count & RWSEM_READER_MASK)) {
			__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
			/*
			 * The wakeup is normally called _after_ the wait_lock
@@ -569,8 +518,9 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
			wake_q_init(&wake_q);
		}
 
-	} else
-		count = atomic_long_add_return(RWSEM_WAITING_BIAS, &sem->count);
+	} else {
+		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
+	}
 
	/* wait until we successfully acquire the lock */
	set_current_state(state);
@@ -587,7 +537,8 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
			schedule();
			lockevent_inc(rwsem_sleep_writer);
			set_current_state(state);
-		} while ((count = atomic_long_read(&sem->count)) & RWSEM_ACTIVE_MASK);
+			count = atomic_long_read(&sem->count);
+		} while (count & RWSEM_LOCK_MASK);
 
		raw_spin_lock_irq(&sem->wait_lock);
	}
@@ -603,7 +554,7 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
	raw_spin_lock_irq(&sem->wait_lock);
	list_del(&waiter.list);
	if (list_empty(&sem->wait_list))
-		atomic_long_add(-RWSEM_WAITING_BIAS, &sem->count);
+		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
	else
		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
	raw_spin_unlock_irq(&sem->wait_lock);
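To see how the two fast paths described in the commit message (cmpxchg for writers, xadd for readers) fit the new count layout, here is a minimal, self-contained userspace sketch using C11 atomics. It is not the kernel code shown above: my_rwsem, write_trylock_fast and read_trylock_fast are hypothetical names, and the macros repeat the illustrative bit layout sketched after the commit message.

#include <stdatomic.h>
#include <stdbool.h>

#define RWSEM_WRITER_LOCKED	(1UL << 0)	/* bit 0: a writer owns the lock */
#define RWSEM_FLAG_WAITERS	(1UL << 1)	/* bit 1: waiters are queued */
#define RWSEM_READER_BIAS	(1UL << 8)	/* bits 8+: reader count */
#define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
#define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
#define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK | RWSEM_READER_MASK)

struct my_rwsem {
	atomic_ulong count;
};

/* Writer fast path: one cmpxchg from "no reader/writer bits set" to
 * "writer locked"; it fails if anyone currently holds the lock. */
static bool write_trylock_fast(struct my_rwsem *sem)
{
	unsigned long old = atomic_load(&sem->count) & ~RWSEM_LOCK_MASK;

	return atomic_compare_exchange_strong(&sem->count, &old,
					      old | RWSEM_WRITER_LOCKED);
}

/* Reader fast path: xadd the reader bias, then back it out again if a
 * writer turns out to hold the lock. */
static bool read_trylock_fast(struct my_rwsem *sem)
{
	unsigned long old = atomic_fetch_add(&sem->count, RWSEM_READER_BIAS);

	if (old & RWSEM_WRITER_MASK) {
		atomic_fetch_sub(&sem->count, RWSEM_READER_BIAS);
		return false;
	}
	return true;
}

This also illustrates the caveat in the commit message: the RWSEM_WRITER_MASK bit alone deterministically says a writer owns the lock, while a non-zero reader count is not proof of reader ownership, since a reader that lost the race in read_trylock_fast() may still be backing its bias out.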
