
Commit 0be600a

Merge tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:

 - DM core fixes to ensure that bio submission follows a depth-first
   tree walk; this is critical to allow forward progress without the
   need to use the bioset's BIOSET_NEED_RESCUER.

 - Remove DM core's BIOSET_NEED_RESCUER based dm_offload infrastructure.

 - DM core cleanups and improvements to make bio-based DM more efficient
   (e.g. reduced memory footprint as well as leveraging per-bio-data
   more).

 - Introduce a new bio-based mode (DM_TYPE_NVME_BIO_BASED) that
   leverages the more direct IO submission path in the block layer; this
   mode is used by DM multipath and also optimizes targets like DM
   thin-pool that stack directly on an NVMe data device.

 - DM multipath improvements to factor out legacy SCSI-only
   (e.g. scsi_dh) code paths to allow for more optimized support for
   NVMe multipath.

 - A fix for DM multipath path selectors (service-time and queue-length)
   to select paths in a more balanced way; largely academic but doesn't
   hurt.

 - Numerous DM raid target fixes and improvements.

 - Add a new DM "unstriped" target that enables Intel to work around
   firmware limitations in some NVMe drives that are striped internally
   (this target also works when stacked above the DM "striped" target).

 - Various Documentation fixes and improvements.

 - Misc cleanups and fixes across various DM infrastructure and targets
   (e.g. bufio, flakey, log-writes, snapshot).
* tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (69 commits)
  dm cache: Documentation: update default migration_throttling value
  dm mpath selector: more evenly distribute ties
  dm unstripe: fix target length versus number of stripes size check
  dm thin: fix trailing semicolon in __remap_and_issue_shared_cell
  dm table: fix NVMe bio-based dm_table_determine_type() validation
  dm: various cleanups to md->queue initialization code
  dm mpath: delay the retry of a request if the target responded as busy
  dm mpath: return DM_MAPIO_DELAY_REQUEUE if QUEUE_IO or PG_INIT_REQUIRED
  dm mpath: return DM_MAPIO_REQUEUE on blk-mq rq allocation failure
  dm log writes: fix max length used for kstrndup
  dm: backfill missing calls to mutex_destroy()
  dm snapshot: use mutex instead of rw_semaphore
  dm flakey: check for null arg_name in parse_features()
  dm thin: extend thinpool status format string with omitted fields
  dm thin: fixes in thin-provisioning.txt
  dm thin: document representation of <highest mapped sector> when there is none
  dm thin: fix documentation relative to low water mark threshold
  dm cache: be consistent in specifying sectors and SI units in cache.txt
  dm cache: delete obsoleted paragraph in cache.txt
  dm cache: fix grammar in cache-policies.txt
  ...
2 parents 040639b + 9614e2b commit 0be600a

31 files changed (+1409, -671 lines)

Documentation/device-mapper/cache-policies.txt

Lines changed: 2 additions & 2 deletions
@@ -60,7 +60,7 @@ Memory usage:
 The mq policy used a lot of memory; 88 bytes per cache block on a 64
 bit machine.
 
-smq uses 28bit indexes to implement it's data structures rather than
+smq uses 28bit indexes to implement its data structures rather than
 pointers.  It avoids storing an explicit hit count for each block.  It
 has a 'hotspot' queue, rather than a pre-cache, which uses a quarter of
 the entries (each hotspot block covers a larger area than a single
@@ -84,7 +84,7 @@ resulting in better promotion/demotion decisions.
 
 Adaptability:
 The mq policy maintained a hit count for each cache block.  For a
-different block to get promoted to the cache it's hit count has to
+different block to get promoted to the cache its hit count has to
 exceed the lowest currently in the cache.  This meant it could take a
 long time for the cache to adapt between varying IO patterns.
Documentation/device-mapper/cache.txt

Lines changed: 2 additions & 7 deletions
@@ -59,7 +59,7 @@ Fixed block size
 The origin is divided up into blocks of a fixed size.  This block size
 is configurable when you first create the cache.  Typically we've been
 using block sizes of 256KB - 1024KB.  The block size must be between 64
-(32KB) and 2097152 (1GB) and a multiple of 64 (32KB).
+sectors (32KB) and 2097152 sectors (1GB) and a multiple of 64 sectors (32KB).
 
 Having a fixed block size simplifies the target a lot.  But it is
 something of a compromise.  For instance, a small part of a block may be
@@ -119,7 +119,7 @@ doing here to avoid migrating during those peak io moments.
 
 For the time being, a message "migration_threshold <#sectors>"
 can be used to set the maximum number of sectors being migrated,
-the default being 204800 sectors (or 100MB).
+the default being 2048 sectors (1MB).
 
 Updating on-disk metadata
 -------------------------
@@ -143,11 +143,6 @@ the policy how big this chunk is, but it should be kept small.  Like the
 dirty flags this data is lost if there's a crash so a safe fallback
 value should always be possible.
 
-For instance, the 'mq' policy, which is currently the default policy,
-uses this facility to store the hit count of the cache blocks.  If
-there's a crash this information will be lost, which means the cache
-may be less efficient until those hit counts are regenerated.
-
 Policy hints affect performance, not correctness.
 
 Policy messaging

Documentation/device-mapper/dm-raid.txt

Lines changed: 4 additions & 1 deletion
@@ -343,5 +343,8 @@ Version History
 1.11.0  Fix table line argument order
 	(wrong raid10_copies/raid10_format sequence)
 1.11.1  Add raid4/5/6 journal write-back support via journal_mode option
-1.12.1  fix for MD deadlock between mddev_suspend() and md_write_start() available
+1.12.1  Fix for MD deadlock between mddev_suspend() and md_write_start() available
 1.13.0  Fix dev_health status at end of "recover" (was 'a', now 'A')
+1.13.1  Fix deadlock caused by early md_stop_writes().  Also fix size an
+	state races.
+1.13.2  Fix raid redundancy validation and avoid keeping raid set frozen

Documentation/device-mapper/snapshot.txt

Lines changed: 4 additions & 0 deletions
@@ -49,6 +49,10 @@ The difference between persistent and transient is with transient
 snapshots less metadata must be saved on disk - they can be kept in
 memory by the kernel.
 
+When loading or unloading the snapshot target, the corresponding
+snapshot-origin or snapshot-merge target must be suspended. A failure to
+suspend the origin target could result in data corruption.
+
 
 * snapshot-merge <origin> <COW device> <persistent> <chunksize>
 
Documentation/device-mapper/thin-provisioning.txt

Lines changed: 10 additions & 4 deletions
@@ -112,9 +112,11 @@ $low_water_mark is expressed in blocks of size $data_block_size.  If
 free space on the data device drops below this level then a dm event
 will be triggered which a userspace daemon should catch allowing it to
 extend the pool device.  Only one such event will be sent.
-Resuming a device with a new table itself triggers an event so the
-userspace daemon can use this to detect a situation where a new table
-already exceeds the threshold.
+
+No special event is triggered if a just resumed device's free space is below
+the low water mark. However, resuming a device always triggers an
+event; a userspace daemon should verify that free space exceeds the low
+water mark when handling this event.
 
 A low water mark for the metadata device is maintained in the kernel and
 will trigger a dm event if free space on the metadata device drops below
@@ -274,7 +276,8 @@ ii) Status
 
     <transaction id> <used metadata blocks>/<total metadata blocks>
     <used data blocks>/<total data blocks> <held metadata root>
-    [no_]discard_passdown ro|rw
+    ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space
+    needs_check|-
 
 transaction id:
     A 64-bit number used by userspace to help synchronise with metadata
@@ -394,3 +397,6 @@ ii) Status
     If the pool has encountered device errors and failed, the status
     will just contain the string 'Fail'.  The userspace recovery
     tools should then be used.
+
+In the case where <nr mapped sectors> is 0, there is no highest
+mapped sector and the value of <highest mapped sector> is unspecified.
Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
+Introduction
+============
+
+The device-mapper "unstriped" target provides a transparent mechanism to
+unstripe a device-mapper "striped" target to access the underlying disks
+without having to touch the true backing block-device.  It can also be
+used to unstripe a hardware RAID-0 to access backing disks.
+
+Parameters:
+<number of stripes> <chunk size> <stripe #> <dev_path> <offset>
+
+<number of stripes>
+        The number of stripes in the RAID 0.
+
+<chunk size>
+        The amount of 512B sectors in the chunk striping.
+
+<dev_path>
+        The block device you wish to unstripe.
+
+<stripe #>
+        The stripe number within the device that corresponds to physical
+        drive you wish to unstripe.  This must be 0 indexed.
+
+
+Why use this module?
+====================
+
+An example of undoing an existing dm-stripe
+-------------------------------------------
+
+This small bash script will setup 4 loop devices and use the existing
+striped target to combine the 4 devices into one.  It then will use
+the unstriped target ontop of the striped device to access the
+individual backing loop devices.  We write data to the newly exposed
+unstriped devices and verify the data written matches the correct
+underlying device on the striped array.
+
+#!/bin/bash
+
+MEMBER_SIZE=$((128 * 1024 * 1024))
+NUM=4
+SEQ_END=$((${NUM}-1))
+CHUNK=256
+BS=4096
+
+RAID_SIZE=$((${MEMBER_SIZE}*${NUM}/512))
+DM_PARMS="0 ${RAID_SIZE} striped ${NUM} ${CHUNK}"
+COUNT=$((${MEMBER_SIZE} / ${BS}))
+
+for i in $(seq 0 ${SEQ_END}); do
+  dd if=/dev/zero of=member-${i} bs=${MEMBER_SIZE} count=1 oflag=direct
+  losetup /dev/loop${i} member-${i}
+  DM_PARMS+=" /dev/loop${i} 0"
+done
+
+echo $DM_PARMS | dmsetup create raid0
+for i in $(seq 0 ${SEQ_END}); do
+  echo "0 1 unstriped ${NUM} ${CHUNK} ${i} /dev/mapper/raid0 0" | dmsetup create set-${i}
+done;
+
+for i in $(seq 0 ${SEQ_END}); do
+  dd if=/dev/urandom of=/dev/mapper/set-${i} bs=${BS} count=${COUNT} oflag=direct
+  diff /dev/mapper/set-${i} member-${i}
+done;
+
+for i in $(seq 0 ${SEQ_END}); do
+  dmsetup remove set-${i}
+done
+
+dmsetup remove raid0
+
+for i in $(seq 0 ${SEQ_END}); do
+  losetup -d /dev/loop${i}
+  rm -f member-${i}
+done
+
+Another example
+---------------
+
+Intel NVMe drives contain two cores on the physical device.
+Each core of the drive has segregated access to its LBA range.
+The current LBA model has a RAID 0 128k chunk on each core, resulting
+in a 256k stripe across the two cores:
+
+        Core 0:        Core 1:
+       __________     __________
+       | LBA 512|     | LBA 768|
+       | LBA 0  |     | LBA 256|
+       ----------     ----------
+
+The purpose of this unstriping is to provide better QoS in noisy
+neighbor environments.  When two partitions are created on the
+aggregate drive without this unstriping, reads on one partition
+can affect writes on another partition.  This is because the partitions
+are striped across the two cores.  When we unstripe this hardware RAID 0
+and make partitions on each new exposed device the two partitions are now
+physically separated.
+
+With the dm-unstriped target we're able to segregate an fio script that
+has read and write jobs that are independent of each other.  Compared to
+when we run the test on a combined drive with partitions, we were able
+to get a 92% reduction in read latency using this device mapper target.
+
+
+Example dmsetup usage
+=====================
+
+unstriped ontop of Intel NVMe device that has 2 cores
+-----------------------------------------------------
+dmsetup create nvmset0 --table '0 512 unstriped 2 256 0 /dev/nvme0n1 0'
+dmsetup create nvmset1 --table '0 512 unstriped 2 256 1 /dev/nvme0n1 0'
+
+There will now be two devices that expose Intel NVMe core 0 and 1
+respectively:
+  /dev/mapper/nvmset0
+  /dev/mapper/nvmset1
+
+unstriped ontop of striped with 4 drives using 128K chunk size
+--------------------------------------------------------------
+dmsetup create raid_disk0 --table '0 512 unstriped 4 256 0 /dev/mapper/striped 0'
+dmsetup create raid_disk1 --table '0 512 unstriped 4 256 1 /dev/mapper/striped 0'
+dmsetup create raid_disk2 --table '0 512 unstriped 4 256 2 /dev/mapper/striped 0'
+dmsetup create raid_disk3 --table '0 512 unstriped 4 256 3 /dev/mapper/striped 0'

drivers/md/Kconfig

Lines changed: 7 additions & 0 deletions
@@ -269,6 +269,13 @@ config DM_BIO_PRISON
 
 source "drivers/md/persistent-data/Kconfig"
 
+config DM_UNSTRIPED
+	tristate "Unstriped target"
+	depends on BLK_DEV_DM
+	---help---
+	  Unstripes I/O so it is issued solely on a single drive in a HW
+	  RAID0 or dm-striped target.
+
 config DM_CRYPT
 	tristate "Crypt target support"
 	depends on BLK_DEV_DM

drivers/md/Makefile

Lines changed: 1 addition & 0 deletions
@@ -43,6 +43,7 @@ obj-$(CONFIG_BCACHE)		+= bcache/
 obj-$(CONFIG_BLK_DEV_MD)	+= md-mod.o
 obj-$(CONFIG_BLK_DEV_DM)	+= dm-mod.o
 obj-$(CONFIG_BLK_DEV_DM_BUILTIN) += dm-builtin.o
+obj-$(CONFIG_DM_UNSTRIPED)	+= dm-unstripe.o
 obj-$(CONFIG_DM_BUFIO)		+= dm-bufio.o
 obj-$(CONFIG_DM_BIO_PRISON)	+= dm-bio-prison.o
 obj-$(CONFIG_DM_CRYPT)		+= dm-crypt.o

drivers/md/dm-bufio.c

Lines changed: 20 additions & 17 deletions
@@ -662,7 +662,7 @@ static void submit_io(struct dm_buffer *b, int rw, bio_end_io_t *end_io)
 
 	sector = (b->block << b->c->sectors_per_block_bits) + b->c->start;
 
-	if (rw != WRITE) {
+	if (rw != REQ_OP_WRITE) {
 		n_sectors = 1 << b->c->sectors_per_block_bits;
 		offset = 0;
 	} else {
@@ -740,7 +740,7 @@ static void __write_dirty_buffer(struct dm_buffer *b,
 	b->write_end = b->dirty_end;
 
 	if (!write_list)
-		submit_io(b, WRITE, write_endio);
+		submit_io(b, REQ_OP_WRITE, write_endio);
 	else
 		list_add_tail(&b->write_list, write_list);
 }
@@ -753,7 +753,7 @@ static void __flush_write_list(struct list_head *write_list)
 		struct dm_buffer *b =
 			list_entry(write_list->next, struct dm_buffer, write_list);
 		list_del(&b->write_list);
-		submit_io(b, WRITE, write_endio);
+		submit_io(b, REQ_OP_WRITE, write_endio);
 		cond_resched();
 	}
 	blk_finish_plug(&plug);
@@ -1123,7 +1123,7 @@ static void *new_read(struct dm_bufio_client *c, sector_t block,
 		return NULL;
 
 	if (need_submit)
-		submit_io(b, READ, read_endio);
+		submit_io(b, REQ_OP_READ, read_endio);
 
 	wait_on_bit_io(&b->state, B_READING, TASK_UNINTERRUPTIBLE);
 
@@ -1193,7 +1193,7 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
 			dm_bufio_unlock(c);
 
 			if (need_submit)
-				submit_io(b, READ, read_endio);
+				submit_io(b, REQ_OP_READ, read_endio);
 			dm_bufio_release(b);
 
 			cond_resched();
@@ -1454,7 +1454,7 @@ void dm_bufio_release_move(struct dm_buffer *b, sector_t new_block)
 		old_block = b->block;
 		__unlink_buffer(b);
 		__link_buffer(b, new_block, b->list_mode);
-		submit_io(b, WRITE, write_endio);
+		submit_io(b, REQ_OP_WRITE, write_endio);
 		wait_on_bit_io(&b->state, B_WRITING,
 			       TASK_UNINTERRUPTIBLE);
 		__unlink_buffer(b);
@@ -1716,7 +1716,7 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
 		if (!DM_BUFIO_CACHE_NAME(c)) {
 			r = -ENOMEM;
 			mutex_unlock(&dm_bufio_clients_lock);
-			goto bad_cache;
+			goto bad;
 		}
 	}
 
@@ -1727,7 +1727,7 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
 		if (!DM_BUFIO_CACHE(c)) {
 			r = -ENOMEM;
 			mutex_unlock(&dm_bufio_clients_lock);
-			goto bad_cache;
+			goto bad;
 		}
 	}
 
@@ -1738,27 +1738,28 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
 
 		if (!b) {
 			r = -ENOMEM;
-			goto bad_buffer;
+			goto bad;
 		}
 		__free_buffer_wake(b);
 	}
 
+	c->shrinker.count_objects = dm_bufio_shrink_count;
+	c->shrinker.scan_objects = dm_bufio_shrink_scan;
+	c->shrinker.seeks = 1;
+	c->shrinker.batch = 0;
+	r = register_shrinker(&c->shrinker);
+	if (r)
+		goto bad;
+
 	mutex_lock(&dm_bufio_clients_lock);
 	dm_bufio_client_count++;
 	list_add(&c->client_list, &dm_bufio_all_clients);
 	__cache_size_refresh();
 	mutex_unlock(&dm_bufio_clients_lock);
 
-	c->shrinker.count_objects = dm_bufio_shrink_count;
-	c->shrinker.scan_objects = dm_bufio_shrink_scan;
-	c->shrinker.seeks = 1;
-	c->shrinker.batch = 0;
-	register_shrinker(&c->shrinker);
-
 	return c;
 
-bad_buffer:
-bad_cache:
+bad:
 	while (!list_empty(&c->reserved_buffers)) {
 		struct dm_buffer *b = list_entry(c->reserved_buffers.next,
 						 struct dm_buffer, lru_list);
@@ -1767,6 +1768,7 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
 	}
 	dm_io_client_destroy(c->dm_io);
 bad_dm_io:
+	mutex_destroy(&c->lock);
 	kfree(c);
 bad_client:
 	return ERR_PTR(r);
@@ -1811,6 +1813,7 @@ void dm_bufio_client_destroy(struct dm_bufio_client *c)
 		BUG_ON(c->n_buffers[i]);
 
 	dm_io_client_destroy(c->dm_io);
+	mutex_destroy(&c->lock);
 	kfree(c);
 }
 EXPORT_SYMBOL_GPL(dm_bufio_client_destroy);

drivers/md/dm-core.h

Lines changed: 1 addition & 4 deletions
@@ -91,8 +91,7 @@ struct mapped_device {
 	/*
 	 * io objects are allocated from here.
 	 */
-	mempool_t *io_pool;
-
+	struct bio_set *io_bs;
 	struct bio_set *bs;
 
 	/*
@@ -130,8 +129,6 @@ struct mapped_device {
 	struct srcu_struct io_barrier;
 };
 
-void dm_init_md_queue(struct mapped_device *md);
-void dm_init_normal_md_queue(struct mapped_device *md);
 int md_in_flight(struct mapped_device *md);
 void disable_write_same(struct mapped_device *md);
 void disable_write_zeroes(struct mapped_device *md);
