Commit 126e76f
Merge branch 'for-4.14/block-postmerge' of git://git.kernel.dk/linux-block
Pull followup block layer updates from Jens Axboe:
 "I ended up splitting the main pull request for this series into two,
  mainly because of clashes between NVMe fixes that went into 4.13
  after the for-4.14 branches were split off. This pull request is
  mostly NVMe, but not exclusively. In detail, it contains:

   - Two pull requests for NVMe changes from Christoph. Nothing new on
     the feature front, basically just fixes all over the map for the
     core bits, transport, rdma, etc.

   - Series from Bart, cleaning up various bits in the BFQ scheduler.

   - Series of bcache fixes, which has been lingering for a release or
     two. Coly sent this in, but it contains patches from various
     people in this area.

   - Set of patches for BFQ from Paolo himself, updating both
     documentation and fixing some corner cases in performance.

   - Series from Omar, attempting to now get the 4k loop support
     correct. Our confidence level is higher this time.

   - Series from Shaohua for loop as well, improving O_DIRECT
     performance and fixing a use-after-free"

* 'for-4.14/block-postmerge' of git://git.kernel.dk/linux-block: (74 commits)
  bcache: initialize dirty stripes in flash_dev_run()
  loop: set physical block size to logical block size
  bcache: fix bch_hprint crash and improve output
  bcache: Update continue_at() documentation
  bcache: silence static checker warning
  bcache: fix for gc and write-back race
  bcache: increase the number of open buckets
  bcache: Correct return value for sysfs attach errors
  bcache: correct cache_dirty_target in __update_writeback_rate()
  bcache: gc does not work when triggering by manual command
  bcache: Don't reinvent the wheel but use existing llist API
  bcache: do not subtract sectors_to_gc for bypassed IO
  bcache: fix sequential large write IO bypass
  bcache: Fix leak of bdev reference
  block/loop: remove unused field
  block/loop: fix use after free
  bfq: Use icq_to_bic() consistently
  bfq: Suppress compiler warnings about comparisons
  bfq: Check kstrtoul() return value
  bfq: Declare local functions static
  ...
2 parents fbd0141 + 175206c

35 files changed: 1143 additions, 817 deletions

Documentation/block/bfq-iosched.txt

Lines changed: 64 additions & 80 deletions
@@ -16,14 +16,16 @@ throughput. So, when needed for achieving a lower latency, BFQ builds
 schedules that may lead to a lower throughput. If your main or only
 goal, for a given device, is to achieve the maximum-possible
 throughput at all times, then do switch off all low-latency heuristics
-for that device, by setting low_latency to 0. Full details in Section 3.
+for that device, by setting low_latency to 0. See Section 3 for
+details on how to configure BFQ for the desired tradeoff between
+latency and throughput, or on how to maximize throughput.
 
 On average CPUs, the current version of BFQ can handle devices
 performing at most ~30K IOPS; at most ~50 KIOPS on faster CPUs. As a
 reference, 30-50 KIOPS correspond to very high bandwidths with
 sequential I/O (e.g., 8-12 GB/s if I/O requests are 256 KB large), and
-to 120-200 MB/s with 4KB random I/O. BFQ has not yet been tested on
-multi-queue devices.
+to 120-200 MB/s with 4KB random I/O. BFQ is currently being tested on
+multi-queue devices too.
 
 The table of contents follow. Impatients can just jump to Section 3.
 
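The low_latency switch mentioned in the hunk above is exposed per device through sysfs. A minimal configuration sketch, assuming a device named sda (the device name is illustrative) and a kernel where BFQ is available as a scheduler:

```shell
# Select BFQ for the device, then disable its low-latency heuristics
# to favor raw throughput ("sda" is an illustrative device name).
echo bfq > /sys/block/sda/queue/scheduler
echo 0 > /sys/block/sda/queue/iosched/low_latency
```

Writing 1 back to low_latency restores the default, latency-oriented behavior.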
@@ -33,7 +35,7 @@ CONTENTS
 1-1 Personal systems
 1-2 Server systems
 2. How does BFQ work?
-3. What are BFQ's tunable?
+3. What are BFQ's tunables and how to properly configure BFQ?
 4. BFQ group scheduling
 4-1 Service guarantees provided
 4-2 Interface
@@ -145,19 +147,28 @@ plus a lot of code, are borrowed from CFQ.
 contrast, BFQ may idle the device for a short time interval,
 giving the process the chance to go on being served if it issues
 a new request in time. Device idling typically boosts the
-throughput on rotational devices, if processes do synchronous
-and sequential I/O. In addition, under BFQ, device idling is
-also instrumental in guaranteeing the desired throughput
-fraction to processes issuing sync requests (see the description
-of the slice_idle tunable in this document, or [1, 2], for more
-details).
+throughput on rotational devices and on non-queueing flash-based
+devices, if processes do synchronous and sequential I/O. In
+addition, under BFQ, device idling is also instrumental in
+guaranteeing the desired throughput fraction to processes
+issuing sync requests (see the description of the slice_idle
+tunable in this document, or [1, 2], for more details).
 
 - With respect to idling for service guarantees, if several
 processes are competing for the device at the same time, but
-all processes (and groups, after the following commit) have
-the same weight, then BFQ guarantees the expected throughput
-distribution without ever idling the device. Throughput is
-thus as high as possible in this common scenario.
+all processes and groups have the same weight, then BFQ
+guarantees the expected throughput distribution without ever
+idling the device. Throughput is thus as high as possible in
+this common scenario.
+
+- On flash-based storage with internal queueing of commands
+(typically NCQ), device idling happens to be always detrimental
+for throughput. So, with these devices, BFQ performs idling
+only when strictly needed for service guarantees, i.e., for
+guaranteeing low latency or fairness. In these cases, overall
+throughput may be sub-optimal. No solution currently exists to
+provide both strong service guarantees and optimal throughput
+on devices with internal queueing.
 
 - If low-latency mode is enabled (default configuration), BFQ
 executes some special heuristics to detect interactive and soft
@@ -191,10 +202,7 @@ plus a lot of code, are borrowed from CFQ.
 - Queues are scheduled according to a variant of WF2Q+, named
 B-WF2Q+, and implemented using an augmented rb-tree to preserve an
 O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
-also ready for hierarchical scheduling. However, for a cleaner
-logical breakdown, the code that enables and completes
-hierarchical support is provided in the next commit, which focuses
-exactly on this feature.
+also ready for hierarchical scheduling, details in Section 4.
 
 - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
 perfectly fair, and smooth service. In particular, B-WF2Q+
@@ -249,13 +257,24 @@ plus a lot of code, are borrowed from CFQ.
 the Idle class, to prevent it from starving.
 
 
-3. What are BFQ's tunable?
-==========================
+3. What are BFQ's tunables and how to properly configure BFQ?
+=============================================================
+
+Most BFQ tunables affect service guarantees (basically latency and
+fairness) and throughput. For full details on how to choose the
+desired tradeoff between service guarantees and throughput, see the
+parameters slice_idle, strict_guarantees and low_latency. For details
+on how to maximise throughput, see slice_idle, timeout_sync and
+max_budget. The other performance-related parameters have been
+inherited from, and have been preserved mostly for compatibility with
+CFQ. So far, no performance improvement has been reported after
+changing the latter parameters in BFQ.
 
-The tunables back_seek-max, back_seek_penalty, fifo_expire_async and
-fifo_expire_sync below are the same as in CFQ. Their description is
-just copied from that for CFQ. Some considerations in the description
-of slice_idle are copied from CFQ too.
+In particular, the tunables back_seek-max, back_seek_penalty,
+fifo_expire_async and fifo_expire_sync below are the same as in
+CFQ. Their description is just copied from that for CFQ. Some
+considerations in the description of slice_idle are copied from CFQ
+too.
 
 per-process ioprio and weight
 -----------------------------
@@ -285,15 +304,17 @@ number of seeks and see improved throughput.
 
 Setting slice_idle to 0 will remove all the idling on queues and one
 should see an overall improved throughput on faster storage devices
-like multiple SATA/SAS disks in hardware RAID configuration.
+like multiple SATA/SAS disks in hardware RAID configuration, as well
+as flash-based storage with internal command queueing (and
+parallelism).
 
 So depending on storage and workload, it might be useful to set
 slice_idle=0. In general for SATA/SAS disks and software RAID of
 SATA/SAS disks keeping slice_idle enabled should be useful. For any
 configurations where there are multiple spindles behind single LUN
-(Host based hardware RAID controller or for storage arrays), setting
-slice_idle=0 might end up in better throughput and acceptable
-latencies.
+(Host based hardware RAID controller or for storage arrays), or with
+flash-based fast storage, setting slice_idle=0 might end up in better
+throughput and acceptable latencies.
 
 Idling is however necessary to have service guarantees enforced in
 case of differentiated weights or differentiated I/O-request lengths.
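The slice_idle=0 advice in the hunk above maps to a one-line sysfs write. A hedged sketch, again assuming a device named sda that already uses BFQ (both the device name and the restored value are illustrative):

```shell
# Disable idling on a fast device with internal command queueing;
# expect higher throughput but weaker short-term service guarantees.
echo 0 > /sys/block/sda/queue/iosched/slice_idle

# Re-enable idling (a few milliseconds is the usual default) when
# service guarantees matter more than peak throughput.
echo 8 > /sys/block/sda/queue/iosched/slice_idle
```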
@@ -312,13 +333,14 @@ There is an important flipside for idling: apart from the above cases
 where it is beneficial also for throughput, idling can severely impact
 throughput. One important case is random workload. Because of this
 issue, BFQ tends to avoid idling as much as possible, when it is not
-beneficial also for throughput. As a consequence of this behavior, and
-of further issues described for the strict_guarantees tunable,
-short-term service guarantees may be occasionally violated. And, in
-some cases, these guarantees may be more important than guaranteeing
-maximum throughput. For example, in video playing/streaming, a very
-low drop rate may be more important than maximum throughput. In these
-cases, consider setting the strict_guarantees parameter.
+beneficial also for throughput (as detailed in Section 2). As a
+consequence of this behavior, and of further issues described for the
+strict_guarantees tunable, short-term service guarantees may be
+occasionally violated. And, in some cases, these guarantees may be
+more important than guaranteeing maximum throughput. For example, in
+video playing/streaming, a very low drop rate may be more important
+than maximum throughput. In these cases, consider setting the
+strict_guarantees parameter.
 
 strict_guarantees
 -----------------
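For the latency-sensitive cases described above (such as video playback), the corresponding knob can be flipped with a single sysfs write; the device name sda is an illustrative assumption:

```shell
# Favor tight service guarantees (e.g. a low frame-drop rate in
# video playback) over throughput ("sda" is illustrative).
echo 1 > /sys/block/sda/queue/iosched/strict_guarantees
```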
@@ -420,58 +442,20 @@ The default value is 0, which enables auto-tuning: BFQ sets max_budget
 to the maximum number of sectors that can be served during
 timeout_sync, according to the estimated peak rate.
 
+For specific devices, some users have occasionally reported to have
+reached a higher throughput by setting max_budget explicitly, i.e., by
+setting max_budget to a higher value than 0. In particular, they have
+set max_budget to higher values than those to which BFQ would have set
+it with auto-tuning. An alternative way to achieve this goal is to
+just increase the value of timeout_sync, leaving max_budget equal to 0.
+
 weights
 -------
 
 Read-only parameter, used to show the weights of the currently active
 BFQ queues.
 
 
-wr_ tunables
-------------
-
-BFQ exports a few parameters to control/tune the behavior of
-low-latency heuristics.
-
-wr_coeff
-
-Factor by which the weight of a weight-raised queue is multiplied. If
-the queue is deemed soft real-time, then the weight is further
-multiplied by an additional, constant factor.
-
-wr_max_time
-
-Maximum duration of a weight-raising period for an interactive task
-(ms). If set to zero (default value), then this value is computed
-automatically, as a function of the peak rate of the device. In any
-case, when the value of this parameter is read, it always reports the
-current duration, regardless of whether it has been set manually or
-computed automatically.
-
-wr_max_softrt_rate
-
-Maximum service rate below which a queue is deemed to be associated
-with a soft real-time application, and is then weight-raised
-accordingly (sectors/sec).
-
-wr_min_idle_time
-
-Minimum idle period after which interactive weight-raising may be
-reactivated for a queue (in ms).
-
-wr_rt_max_time
-
-Maximum weight-raising duration for soft real-time queues (in ms). The
-start time from which this duration is considered is automatically
-moved forward if the queue is detected to be still soft real-time
-before the current soft real-time weight-raising period finishes.
-
-wr_min_inter_arr_async
-
-Minimum period between I/O request arrivals after which weight-raising
-may be reactivated for an already busy async queue (in ms).
-
 4. Group scheduling with BFQ
 ============================
 
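The max_budget/timeout_sync tuning described in the last hunk can be sketched as the following sysfs writes; the numeric values and the device name sda are illustrative assumptions, not recommendations:

```shell
# Option 1: pin the budget explicitly (in sectors), overriding
# auto-tuning.
echo 16384 > /sys/block/sda/queue/iosched/max_budget

# Option 2: keep auto-tuning (max_budget = 0) and enlarge the
# auto-computed budget indirectly by raising timeout_sync (in ms).
echo 0 > /sys/block/sda/queue/iosched/max_budget
echo 250 > /sys/block/sda/queue/iosched/timeout_sync
```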

block/bfq-cgroup.c

Lines changed: 9 additions & 9 deletions
@@ -206,7 +206,7 @@ static void bfqg_get(struct bfq_group *bfqg)
 	bfqg->ref++;
 }
 
-void bfqg_put(struct bfq_group *bfqg)
+static void bfqg_put(struct bfq_group *bfqg)
 {
 	bfqg->ref--;
 
@@ -385,7 +385,7 @@ static struct bfq_group_data *blkcg_to_bfqgd(struct blkcg *blkcg)
 	return cpd_to_bfqgd(blkcg_to_cpd(blkcg, &blkcg_policy_bfq));
 }
 
-struct blkcg_policy_data *bfq_cpd_alloc(gfp_t gfp)
+static struct blkcg_policy_data *bfq_cpd_alloc(gfp_t gfp)
 {
 	struct bfq_group_data *bgd;
 
@@ -395,20 +395,20 @@ struct blkcg_policy_data *bfq_cpd_alloc(gfp_t gfp)
 	return &bgd->pd;
 }
 
-void bfq_cpd_init(struct blkcg_policy_data *cpd)
+static void bfq_cpd_init(struct blkcg_policy_data *cpd)
 {
 	struct bfq_group_data *d = cpd_to_bfqgd(cpd);
 
 	d->weight = cgroup_subsys_on_dfl(io_cgrp_subsys) ?
 		CGROUP_WEIGHT_DFL : BFQ_WEIGHT_LEGACY_DFL;
 }
 
-void bfq_cpd_free(struct blkcg_policy_data *cpd)
+static void bfq_cpd_free(struct blkcg_policy_data *cpd)
 {
 	kfree(cpd_to_bfqgd(cpd));
 }
 
-struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, int node)
+static struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, int node)
 {
 	struct bfq_group *bfqg;
 
@@ -426,7 +426,7 @@ struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, int node)
 	return &bfqg->pd;
 }
 
-void bfq_pd_init(struct blkg_policy_data *pd)
+static void bfq_pd_init(struct blkg_policy_data *pd)
 {
 	struct blkcg_gq *blkg = pd_to_blkg(pd);
 	struct bfq_group *bfqg = blkg_to_bfqg(blkg);
@@ -445,15 +445,15 @@ void bfq_pd_init(struct blkg_policy_data *pd)
 	bfqg->rq_pos_tree = RB_ROOT;
 }
 
-void bfq_pd_free(struct blkg_policy_data *pd)
+static void bfq_pd_free(struct blkg_policy_data *pd)
 {
 	struct bfq_group *bfqg = pd_to_bfqg(pd);
 
 	bfqg_stats_exit(&bfqg->stats);
 	bfqg_put(bfqg);
 }
 
-void bfq_pd_reset_stats(struct blkg_policy_data *pd)
+static void bfq_pd_reset_stats(struct blkg_policy_data *pd)
 {
 	struct bfq_group *bfqg = pd_to_bfqg(pd);
 
@@ -740,7 +740,7 @@ static void bfq_reparent_active_entities(struct bfq_data *bfqd,
  * blkio already grabs the queue_lock for us, so no need to use
  * RCU-based magic
  */
-void bfq_pd_offline(struct blkg_policy_data *pd)
+static void bfq_pd_offline(struct blkg_policy_data *pd)
 {
 	struct bfq_service_tree *st;
 	struct bfq_group *bfqg = pd_to_bfqg(pd);
