Commit 5ccc387

Merge branch 'for-linus' of git://git.kernel.dk/linux-block
* 'for-linus' of git://git.kernel.dk/linux-block: (23 commits)
  Revert "cfq: Remove special treatment for metadata rqs."
  block: fix flush machinery for stacking drivers with differring flush flags
  block: improve rq_affinity placement
  blktrace: add FLUSH/FUA support
  Move some REQ flags to the common bio/request area
  allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH
  xen/blkback: Make description more obvious.
  cfq-iosched: Add documentation about idling
  block: Make rq_affinity = 1 work as expected
  block: swim3: fix unterminated of_device_id table
  block/genhd.c: remove useless cast in diskstats_show()
  drivers/cdrom/cdrom.c: relax check on dvd manufacturer value
  drivers/block/drbd/drbd_nl.c: use bitmap_parse instead of __bitmap_parse
  bsg-lib: add module.h include
  cfq-iosched: Reduce linked group count upon group destruction
  blk-throttle: correctly determine sync bio
  loop: fix deadlock when sysfs and LOOP_CLR_FD race against each other
  loop: add BLK_DEV_LOOP_MIN_COUNT=%i to allow distros 0 pre-allocated loop devices
  loop: add management interface for on-demand device allocation
  loop: replace linked list of allocated devices with an idr index
  ...
2 parents 0c3bef6 + b53d1ed commit 5ccc387

File tree

26 files changed, +800 -135 lines changed


Documentation/block/cfq-iosched.txt

Lines changed: 71 additions & 0 deletions
@@ -43,3 +43,74 @@ If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches
 to IOPS mode and starts providing fairness in terms of number of requests
 dispatched. Note that this mode switching takes effect only for group
 scheduling. For non-cgroup users nothing should change.
+
+CFQ IO scheduler Idling Theory
+===============================
+Idling on a queue is primarily about waiting for the next request to come
+on same queue after completion of a request. In this process CFQ will not
+dispatch requests from other cfq queues even if requests are pending there.
+
+The rationale behind idling is that it can cut down on number of seeks
+on rotational media. For example, if a process is doing dependent
+sequential reads (next read will come on only after completion of previous
+one), then not dispatching request from other queue should help as we
+did not move the disk head and kept on dispatching sequential IO from
+one queue.
+
+CFQ has following service trees and various queues are put on these trees.
+
+        sync-idle       sync-noidle     async
+
+All cfq queues doing synchronous sequential IO go on to sync-idle tree.
+On this tree we idle on each queue individually.
+
+All synchronous non-sequential queues go on sync-noidle tree. Also any
+request which are marked with REQ_NOIDLE go on this service tree. On this
+tree we do not idle on individual queues instead idle on the whole group
+of queues or the tree. So if there are 4 queues waiting for IO to dispatch
+we will idle only once last queue has dispatched the IO and there is
+no more IO on this service tree.
+
+All async writes go on async service tree. There is no idling on async
+queues.
+
+CFQ has some optimizations for SSDs and if it detects a non-rotational
+media which can support higher queue depth (multiple requests at in
+flight at a time), then it cuts down on idling of individual queues and
+all the queues move to sync-noidle tree and only tree idle remains. This
+tree idling provides isolation with buffered write queues on async tree.
+
+FAQ
+===
+Q1. Why to idle at all on queues marked with REQ_NOIDLE.
+
+A1. We only do tree idle (all queues on sync-noidle tree) on queues marked
+    with REQ_NOIDLE. This helps in providing isolation with all the sync-idle
+    queues. Otherwise in presence of many sequential readers, other
+    synchronous IO might not get fair share of disk.
+
+    For example, if there are 10 sequential readers doing IO and they get
+    100ms each. If a REQ_NOIDLE request comes in, it will be scheduled
+    roughly after 1 second. If after completion of REQ_NOIDLE request we
+    do not idle, and after a couple of milli seconds a another REQ_NOIDLE
+    request comes in, again it will be scheduled after 1second. Repeat it
+    and notice how a workload can lose its disk share and suffer due to
+    multiple sequential readers.
+
+    fsync can generate dependent IO where bunch of data is written in the
+    context of fsync, and later some journaling data is written. Journaling
+    data comes in only after fsync has finished its IO (atleast for ext4
+    that seemed to be the case). Now if one decides not to idle on fsync
+    thread due to REQ_NOIDLE, then next journaling write will not get
+    scheduled for another second. A process doing small fsync, will suffer
+    badly in presence of multiple sequential readers.
+
+    Hence doing tree idling on threads using REQ_NOIDLE flag on requests
+    provides isolation from multiple sequential readers and at the same
+    time we do not idle on individual threads.
+
+Q2. When to specify REQ_NOIDLE
+A2. I would think whenever one is doing synchronous write and not expecting
+    more writes to be dispatched from same context soon, should be able
+    to specify REQ_NOIDLE on writes and that probably should work well for
+    most of the cases.
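
As a rough illustration of the Q2 guidance above, here is a minimal sketch. It is not part of this commit, the helper name is hypothetical, and the flag names and submit_bio() interface are those of this kernel generation:

    /*
     * Hypothetical helper illustrating the Q2 answer: a one-off synchronous
     * write that expects no follow-up IO from the same context tags itself
     * with REQ_NOIDLE, so CFQ places its queue on the sync-noidle service
     * tree (tree idling) instead of idling on the queue individually.
     */
    static void submit_oneshot_sync_write(struct bio *bio)
    {
            submit_bio(WRITE | REQ_SYNC | REQ_NOIDLE, bio);
    }

At this point in the tree the WRITE_SYNC shorthand expands to a similar flag combination, so most filesystems should get the sync-noidle behaviour without open-coding the flags.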

Documentation/kernel-parameters.txt

Lines changed: 6 additions & 3 deletions
@@ -1350,9 +1350,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
                         it is equivalent to "nosmp", which also disables
                         the IO APIC.
 
-        max_loop=       [LOOP] Maximum number of loopback devices that can
-                        be mounted
-                        Format: <1-256>
+        max_loop=       [LOOP] The number of loop block devices that get
+        (loop.max_loop) unconditionally pre-created at init time. The default
+                        number is configured by BLK_DEV_LOOP_MIN_COUNT. Instead
+                        of statically allocating a predefined number, loop
+                        devices can be requested on-demand with the
+                        /dev/loop-control interface.
 
         mcatest=        [IA-64]
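
The /dev/loop-control interface mentioned above can be exercised from userspace roughly as follows. This is a minimal sketch, not taken from this commit, assuming headers new enough to define LOOP_CTL_GET_FREE (introduced by the loop patches in this merge):

    /* Illustrative sketch: ask loop-control for a free loop device. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/loop.h>         /* LOOP_CTL_GET_FREE */

    int main(void)
    {
            int nr, ctl = open("/dev/loop-control", O_RDWR);

            if (ctl < 0) {
                    perror("open /dev/loop-control");
                    return 1;
            }

            /* Index of a free /dev/loopN on success, -1 with errno on failure. */
            nr = ioctl(ctl, LOOP_CTL_GET_FREE);
            if (nr < 0)
                    perror("LOOP_CTL_GET_FREE");
            else
                    printf("/dev/loop%d\n", nr);

            close(ctl);
            return nr < 0;
    }

With this interface in place a distribution can set BLK_DEV_LOOP_MIN_COUNT (and max_loop) to 0 and let loop devices appear only when something actually asks for one.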

block/Kconfig

Lines changed: 10 additions & 0 deletions
@@ -65,6 +65,16 @@ config BLK_DEV_BSG
 
           If unsure, say Y.
 
+config BLK_DEV_BSGLIB
+        bool "Block layer SG support v4 helper lib"
+        default n
+        select BLK_DEV_BSG
+        help
+          Subsystems will normally enable this if needed. Users will not
+          normally need to manually enable this.
+
+          If unsure, say N.
+
 config BLK_DEV_INTEGRITY
         bool "Block layer data integrity support"
         ---help---

block/Makefile

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
                         blk-iopoll.o blk-lib.o ioctl.o genhd.o scsi_ioctl.o
 
 obj-$(CONFIG_BLK_DEV_BSG)        += bsg.o
+obj-$(CONFIG_BLK_DEV_BSGLIB)     += bsg-lib.o
 obj-$(CONFIG_BLK_CGROUP)         += blk-cgroup.o
 obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o
 obj-$(CONFIG_IOSCHED_NOOP)       += noop-iosched.o

block/blk-core.c

Lines changed: 6 additions & 2 deletions
@@ -1702,6 +1702,7 @@ EXPORT_SYMBOL_GPL(blk_rq_check_limits);
 int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
 {
         unsigned long flags;
+        int where = ELEVATOR_INSERT_BACK;
 
         if (blk_rq_check_limits(q, rq))
                 return -EIO;
@@ -1718,7 +1719,10 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
          */
         BUG_ON(blk_queued_rq(rq));
 
-        add_acct_request(q, rq, ELEVATOR_INSERT_BACK);
+        if (rq->cmd_flags & (REQ_FLUSH|REQ_FUA))
+                where = ELEVATOR_INSERT_FLUSH;
+
+        add_acct_request(q, rq, where);
         spin_unlock_irqrestore(q->queue_lock, flags);
 
         return 0;
@@ -2275,7 +2279,7 @@ static bool blk_end_bidi_request(struct request *rq, int error,
  * %false - we are done with this request
  * %true - still buffers pending for this request
  **/
-static bool __blk_end_bidi_request(struct request *rq, int error,
+bool __blk_end_bidi_request(struct request *rq, int error,
                                    unsigned int nr_bytes, unsigned int bidi_bytes)
 {
         if (blk_update_bidi_request(rq, error, nr_bytes, bidi_bytes))

block/blk-flush.c

Lines changed: 19 additions & 6 deletions
@@ -95,11 +95,12 @@ static unsigned int blk_flush_policy(unsigned int fflags, struct request *rq)
 {
         unsigned int policy = 0;
 
+        if (blk_rq_sectors(rq))
+                policy |= REQ_FSEQ_DATA;
+
         if (fflags & REQ_FLUSH) {
                 if (rq->cmd_flags & REQ_FLUSH)
                         policy |= REQ_FSEQ_PREFLUSH;
-                if (blk_rq_sectors(rq))
-                        policy |= REQ_FSEQ_DATA;
                 if (!(fflags & REQ_FUA) && (rq->cmd_flags & REQ_FUA))
                         policy |= REQ_FSEQ_POSTFLUSH;
         }
@@ -122,7 +123,7 @@ static void blk_flush_restore_request(struct request *rq)
 
         /* make @rq a normal request */
         rq->cmd_flags &= ~REQ_FLUSH_SEQ;
-        rq->end_io = NULL;
+        rq->end_io = rq->flush.saved_end_io;
 }
 
 /**
@@ -300,9 +301,6 @@ void blk_insert_flush(struct request *rq)
         unsigned int fflags = q->flush_flags;   /* may change, cache */
         unsigned int policy = blk_flush_policy(fflags, rq);
 
-        BUG_ON(rq->end_io);
-        BUG_ON(!rq->bio || rq->bio != rq->biotail);
-
         /*
          * @policy now records what operations need to be done. Adjust
          * REQ_FLUSH and FUA for the driver.
@@ -311,6 +309,19 @@ void blk_insert_flush(struct request *rq)
         if (!(fflags & REQ_FUA))
                 rq->cmd_flags &= ~REQ_FUA;
 
+        /*
+         * An empty flush handed down from a stacking driver may
+         * translate into nothing if the underlying device does not
+         * advertise a write-back cache. In this case, simply
+         * complete the request.
+         */
+        if (!policy) {
+                __blk_end_bidi_request(rq, 0, 0, 0);
+                return;
+        }
+
+        BUG_ON(!rq->bio || rq->bio != rq->biotail);
+
         /*
          * If there's data but flush is not necessary, the request can be
          * processed directly without going through flush machinery. Queue
@@ -319,6 +330,7 @@ void blk_insert_flush(struct request *rq)
         if ((policy & REQ_FSEQ_DATA) &&
             !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
                 list_add_tail(&rq->queuelist, &q->queue_head);
+                blk_run_queue_async(q);
                 return;
         }
 
@@ -329,6 +341,7 @@ void blk_insert_flush(struct request *rq)
         memset(&rq->flush, 0, sizeof(rq->flush));
         INIT_LIST_HEAD(&rq->flush.list);
         rq->cmd_flags |= REQ_FLUSH_SEQ;
+        rq->flush.saved_end_io = rq->end_io; /* Usually NULL */
         rq->end_io = flush_data_end_io;
 
         blk_flush_complete_seq(rq, REQ_FSEQ_ACTIONS & ~policy, 0);
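
To make the intent of the flush changes above easier to follow, here is a kernel-context sketch of how a request is now classified and why an empty flush from a stacking driver no longer trips a BUG_ON. It is an illustration built from the hunks above, not a drop-in function from this commit:

    /* Illustrative sketch only; mirrors the reworked flush classification. */
    static void classify_flush_request_sketch(unsigned int fflags, struct request *rq)
    {
            unsigned int policy = 0;

            if (blk_rq_sectors(rq))                 /* request carries data */
                    policy |= REQ_FSEQ_DATA;
            if (fflags & REQ_FLUSH) {               /* queue has a volatile write cache */
                    if (rq->cmd_flags & REQ_FLUSH)
                            policy |= REQ_FSEQ_PREFLUSH;
                    if (!(fflags & REQ_FUA) && (rq->cmd_flags & REQ_FUA))
                            policy |= REQ_FSEQ_POSTFLUSH;
            }

            if (!policy) {
                    /*
                     * Empty flush aimed at a queue without a write-back cache,
                     * e.g. handed down by a stacking driver whose lower device
                     * advertises different flush flags: nothing to do, so the
                     * request is completed immediately instead of hitting the
                     * old BUG_ON(!rq->bio || rq->bio != rq->biotail).
                     */
                    __blk_end_bidi_request(rq, 0, 0, 0);
                    return;
            }

            /* Otherwise the request enters the normal flush sequence. */
    }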

block/blk-softirq.c

Lines changed: 8 additions & 0 deletions
@@ -124,6 +124,14 @@ void __blk_complete_request(struct request *req)
         } else
                 ccpu = cpu;
 
+        /*
+         * If current CPU and requested CPU are in the same group, running
+         * softirq in current CPU. One might concern this is just like
+         * QUEUE_FLAG_SAME_FORCE, but actually not. blk_complete_request() is
+         * running in interrupt handler, and currently I/O controller doesn't
+         * support multiple interrupts, so current CPU is unique actually. This
+         * avoids IPI sending from current CPU to the first CPU of a group.
+         */
         if (ccpu == cpu || ccpu == group_cpu) {
                 struct list_head *list;
 do_local:

block/blk-throttle.c

Lines changed: 2 additions & 2 deletions
@@ -746,7 +746,7 @@ static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg,
 static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
 {
         bool rw = bio_data_dir(bio);
-        bool sync = bio->bi_rw & REQ_SYNC;
+        bool sync = rw_is_sync(bio->bi_rw);
 
         /* Charge the bio to the group */
         tg->bytes_disp[rw] += bio->bi_size;
@@ -1150,7 +1150,7 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
 
         if (tg_no_rule_group(tg, rw)) {
                 blkiocg_update_dispatch_stats(&tg->blkg, bio->bi_size,
-                        rw, bio->bi_rw & REQ_SYNC);
+                        rw, rw_is_sync(bio->bi_rw));
                 rcu_read_unlock();
                 return 0;
         }
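
The blk-throttle change above is easier to read next to the semantics it switches to. The sketch below restates the intended behaviour, assuming the usual definition of rw_is_sync() in this kernel generation; treat it as an illustration, not a verbatim copy of the kernel helper:

    /*
     * Illustrative sketch: a bio is accounted as synchronous if it is a read,
     * or a write explicitly marked REQ_SYNC. Testing "bi_rw & REQ_SYNC" alone,
     * as the old blk-throttle code did, misclassifies every plain read as async.
     */
    static inline bool bio_is_sync_sketch(unsigned int rw_flags)
    {
            return !(rw_flags & REQ_WRITE) || (rw_flags & REQ_SYNC);
    }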

block/blk.h

Lines changed: 2 additions & 0 deletions
@@ -17,6 +17,8 @@ int blk_rq_append_bio(struct request_queue *q, struct request *rq,
                       struct bio *bio);
 void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
+bool __blk_end_bidi_request(struct request *rq, int error,
+                            unsigned int nr_bytes, unsigned int bidi_bytes);
 
 void blk_rq_timed_out_timer(unsigned long data);
 void blk_delete_timer(struct request *);
