Commit d895ec7

Merge tag 'block-6.0-2022-09-02' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:

 - NVMe pull request via Christoph:
      - error handling fix for the new auth code (Hannes Reinecke)
      - fix unhandled tcp states in nvmet_tcp_state_change (Maurizio
        Lombardi)
      - add NVME_QUIRK_BOGUS_NID for Lexar NM610 (Shyamin Ayesh)

 - Add documentation for the ublk driver merged in this merge window
   (Ming)

* tag 'block-6.0-2022-09-02' of git://git.kernel.dk/linux-block:
  Documentation: document ublk
  nvmet-tcp: fix unhandled tcp states in nvmet_tcp_state_change()
  nvmet-auth: add missing goto in nvmet_setup_auth()
  nvme-pci: add NVME_QUIRK_BOGUS_NID for Lexar NM610

2 parents cec53f4 + 7a3d222 commit d895ec7

6 files changed, 261 additions and 0 deletions

Documentation/block/index.rst

Lines changed: 1 addition & 0 deletions
@@ -23,3 +23,4 @@ Block
    stat
    switching-sched
    writeback_cache_control
+   ublk

Documentation/block/ublk.rst

Lines changed: 253 additions & 0 deletions
@@ -0,0 +1,253 @@
.. SPDX-License-Identifier: GPL-2.0

===========================================
Userspace block device driver (ublk driver)
===========================================

Overview
========

ublk is a generic framework for implementing block device logic from userspace.
The motivation is that moving virtual block drivers such as loop, nbd and
similar into userspace can be very helpful. It also makes it easier to
implement new virtual block devices such as ublk-qcow2 (there have been
several attempts to implement a qcow2 driver in the kernel).

Userspace block devices are attractive because:

- They can be written in many programming languages.
- They can use libraries that are not available in the kernel.
- They can be debugged with tools familiar to application developers.
- Crashes do not kernel panic the machine.
- Bugs are likely to have a lower security impact than bugs in kernel
  code.
- They can be installed and updated independently of the kernel.
- They can be used to easily simulate a block device with user-specified
  parameters/settings for test/debug purposes.

A ublk block device (``/dev/ublkb*``) is added by the ublk driver. Any IO
request on the device is forwarded to the ublk userspace program. For
convenience, in this document ``ublk server`` refers to a generic ublk
userspace program. ``ublksrv`` [#userspace]_ is one such implementation. It
provides the ``libublksrv`` [#userspace_lib]_ library for conveniently
developing specific user block devices, and it also includes generic block
device types such as loop and null. Richard W.M. Jones wrote the userspace nbd
device ``nbdublk`` [#userspace_nbdublk]_ based on ``libublksrv``
[#userspace_lib]_.

After the IO is handled by userspace, the result is committed back to the
driver, thus completing the request cycle. This way, any specific IO handling
logic is done entirely in userspace, such as loop's IO handling, NBD's IO
communication, or qcow2's IO mapping.

``/dev/ublkb*`` is driven by a blk-mq request-based driver. Each request is
assigned a queue-wide unique tag. The ublk server also assigns a unique tag to
each IO, which maps 1:1 to an IO of ``/dev/ublkb*``.

Both IO request forwarding and committing of IO handling results are done via
``io_uring`` passthrough commands; that is why ublk is also an io_uring based
block driver. It has been observed that using io_uring passthrough commands can
give better IOPS than block IO, which is why ublk is a high performance
implementation of a userspace block device: not only is IO request
communication done by io_uring, but the preferred IO handling approach in the
ublk server is io_uring based too.

ublk provides a control interface to set/get ublk block device parameters.
The interface is extendable and kabi compatible: basically any ublk request
queue parameter or ublk generic feature parameter can be set/get via the
interface. Thus, ublk is a generic userspace block device framework.
For example, it is easy to set up a ublk device with specified block
parameters from userspace.

Using ublk
==========

ublk requires a userspace ublk server to handle the real block device logic.

Below is an example of using ``ublksrv`` to provide a ublk-based loop device.

- add a device::

     ublk add -t loop -f ublk-loop.img

- format with xfs, then use it::

     mkfs.xfs /dev/ublkb0
     mount /dev/ublkb0 /mnt
     # do anything. all IOs are handled by io_uring
     ...
     umount /mnt

- list the devices with their info::

     ublk list

- delete the device::

     ublk del -a
     ublk del -n $ublk_dev_id

See usage details in README of ``ublksrv`` [#userspace_readme]_.

Design
======

Control plane
-------------

The ublk driver provides a global misc device node (``/dev/ublk-control``) for
managing and controlling ublk devices with the help of several control commands:

- ``UBLK_CMD_ADD_DEV``

  Add a ublk char device (``/dev/ublkc*``), which the ublk server talks to
  for IO command communication. Basic device info is sent together with this
  command. It sets the UAPI structure ``ublksrv_ctrl_dev_info``, including
  fields such as ``nr_hw_queues``, ``queue_depth``, and the max IO request
  buffer size; this info is negotiated with the driver and sent back to the
  server. When this command is completed, the basic device info is immutable.

- ``UBLK_CMD_SET_PARAMS`` / ``UBLK_CMD_GET_PARAMS``

  Set or get parameters of the device, which can be either generic feature
  related or request queue limit related, but can't be IO logic specific,
  because the driver does not handle any IO logic. This command has to be
  sent before sending ``UBLK_CMD_START_DEV``.

- ``UBLK_CMD_START_DEV``

  After the server prepares userspace resources (such as creating the per-queue
  pthread & io_uring for handling ublk IO), this command is sent to the
  driver for allocating & exposing ``/dev/ublkb*``. Parameters set via
  ``UBLK_CMD_SET_PARAMS`` are applied when creating the device.

- ``UBLK_CMD_STOP_DEV``

  Halt IO on ``/dev/ublkb*`` and remove the device. When this command returns,
  the ublk server releases its resources (such as destroying the per-queue
  pthread & io_uring).

- ``UBLK_CMD_DEL_DEV``

  Remove ``/dev/ublkc*``. When this command returns, the allocated ublk device
  number can be reused.

- ``UBLK_CMD_GET_QUEUE_AFFINITY``

  When ``/dev/ublkc*`` is added, the driver creates the block layer tagset, so
  that each queue's affinity info is available. The server sends
  ``UBLK_CMD_GET_QUEUE_AFFINITY`` to retrieve queue affinity info. With it, the
  server can set up the per-queue context efficiently, such as binding affine
  CPUs to the IO pthread and trying to allocate buffers in IO thread context.

- ``UBLK_CMD_GET_DEV_INFO``

  For retrieving device info via ``ublksrv_ctrl_dev_info``. It is the server's
  responsibility to save IO target specific info in userspace.

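The ordering constraints among the control commands above can be sketched as a
small state machine. This is only an illustration under stated assumptions:
the ``sketch_*`` names and the three lifecycle states are hypothetical and are
not part of the ublk UAPI, no real command values are used, and the real
driver may be more permissive than this strict ordering (for example in what
it accepts after ``UBLK_CMD_STOP_DEV``).

```c
#include <stdbool.h>

/* Hypothetical command names local to this sketch; they mirror the
 * UBLK_CMD_* names in the text but carry no real UAPI values. */
enum sketch_cmd {
	SKETCH_ADD_DEV,
	SKETCH_SET_PARAMS,
	SKETCH_START_DEV,
	SKETCH_STOP_DEV,
	SKETCH_DEL_DEV,
};

/* Hypothetical lifecycle states derived from the command descriptions. */
enum sketch_state {
	SKETCH_ABSENT,	/* no /dev/ublkc* exists */
	SKETCH_ADDED,	/* /dev/ublkc* exists, /dev/ublkb* is not exposed */
	SKETCH_LIVE,	/* /dev/ublkb* is exposed and serving IO */
};

/*
 * Apply one control command to the lifecycle state.
 * Returns true and updates *state when the command fits the ordering
 * described in the text; returns false and leaves *state alone otherwise.
 */
bool sketch_apply(enum sketch_state *state, enum sketch_cmd cmd)
{
	switch (cmd) {
	case SKETCH_ADD_DEV:
		if (*state != SKETCH_ABSENT)
			return false;
		*state = SKETCH_ADDED;
		return true;
	case SKETCH_SET_PARAMS:
		/* Parameters have to be set before the device is started. */
		return *state == SKETCH_ADDED;
	case SKETCH_START_DEV:
		if (*state != SKETCH_ADDED)
			return false;
		*state = SKETCH_LIVE;
		return true;
	case SKETCH_STOP_DEV:
		/* Assumption: only a live device is stopped in this sketch. */
		if (*state != SKETCH_LIVE)
			return false;
		*state = SKETCH_ADDED;
		return true;
	case SKETCH_DEL_DEV:
		/* Assumption: the device is stopped before it is deleted. */
		if (*state != SKETCH_ADDED)
			return false;
		*state = SKETCH_ABSENT;
		return true;
	}
	return false;
}
```

A caller would start from ``SKETCH_ABSENT`` and feed commands in the order the
server issues them; a ``false`` return marks a command sent out of order.
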
Data plane
----------

The ublk server needs to create a per-queue IO pthread & io_uring for handling
IO commands via io_uring passthrough. The per-queue IO pthread
focuses on IO handling and shouldn't handle any control & management
tasks.

The server's IO is assigned a unique tag, which maps 1:1 to an IO
request of ``/dev/ublkb*``.

The UAPI structure ``ublksrv_io_desc`` is defined for describing each IO from
the driver. A fixed mmapped area (array) on ``/dev/ublkc*`` is provided for
exporting IO info to the server, such as IO offset, length, OP/flags and
buffer address. Each ``ublksrv_io_desc`` instance can be indexed directly via
queue id and IO tag.

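Indexing a descriptor by queue id and IO tag can be sketched as plain array
arithmetic. This is a simplified illustration, not the driver's layout:
``struct sketch_io_desc`` is a stand-in whose fields merely follow the list in
the text, ``SKETCH_QUEUE_DEPTH`` is a hypothetical per-queue slot count, and
the sketch treats the area as one flat array, whereas a real server would
obtain the area by calling mmap() on ``/dev/ublkc*``.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the UAPI descriptor; the field set is an assumption that
 * follows the text (OP/flags, length, offset, buffer address). */
struct sketch_io_desc {
	uint32_t op_flags;	/* operation and flags */
	uint32_t nr_sectors;	/* IO length */
	uint64_t start_sector;	/* IO offset */
	uint64_t addr;		/* server buffer address */
};

/* Hypothetical number of descriptor slots reserved per queue. */
#define SKETCH_QUEUE_DEPTH 128

/*
 * Return the descriptor for (q_id, tag) in a flat array that holds
 * SKETCH_QUEUE_DEPTH slots per queue, or NULL if the tag is out of range.
 * The caller is responsible for passing a q_id that the array covers.
 */
struct sketch_io_desc *sketch_get_iod(struct sketch_io_desc *base,
				      unsigned int q_id, unsigned int tag)
{
	if (tag >= SKETCH_QUEUE_DEPTH)
		return NULL;
	return &base[(size_t)q_id * SKETCH_QUEUE_DEPTH + tag];
}
```

Because the slot position depends only on the queue id and the tag, the server
needs no lookup table: the tag delivered with a completed command is enough to
find the matching descriptor.
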
The following IO commands are communicated via io_uring passthrough commands,
and each command is only for forwarding the IO and committing the result
with the specified IO tag in the command data:

- ``UBLK_IO_FETCH_REQ``

  Sent from the server IO pthread for fetching future incoming IO requests
  destined to ``/dev/ublkb*``. This command is sent only once from the server
  IO pthread so that the ublk driver can set up the IO forwarding environment.

- ``UBLK_IO_COMMIT_AND_FETCH_REQ``

  When an IO request is destined to ``/dev/ublkb*``, the driver stores
  the IO's ``ublksrv_io_desc`` to the specified mapped area; then the
  previously received IO command of this IO tag (either ``UBLK_IO_FETCH_REQ``
  or ``UBLK_IO_COMMIT_AND_FETCH_REQ``) is completed, so the server gets
  the IO notification via io_uring.

  After the server handles the IO, its result is committed back to the
  driver by sending ``UBLK_IO_COMMIT_AND_FETCH_REQ`` back. Once ublkdrv
  receives this command, it parses the result and completes the request to
  ``/dev/ublkb*``. At the same time, it sets up the environment for fetching
  future requests with the same IO tag. That is, ``UBLK_IO_COMMIT_AND_FETCH_REQ``
  is reused for both fetching requests and committing back IO results.

- ``UBLK_IO_NEED_GET_DATA``

  With ``UBLK_F_NEED_GET_DATA`` enabled, a WRITE request is first
  issued to the ublk server without data copy. Then, the IO backend of the ublk
  server receives the request and can allocate a data buffer and embed its
  address inside this new io command. After the kernel driver gets the command,
  the data is copied from the request pages to this backend's buffer. Finally,
  the backend receives the request again with the data to be written and can
  truly handle the request.

  ``UBLK_IO_NEED_GET_DATA`` adds one additional round-trip and one
  io_uring_enter() syscall. Any user who thinks that it may lower performance
  should not enable ``UBLK_F_NEED_GET_DATA``. The ublk server pre-allocates an
  IO buffer for each IO by default. Any new project should try to use this
  buffer to communicate with the ublk driver. However, existing projects may
  break or be unable to consume the new buffer interface; that's why this
  command is added for backwards compatibility so that existing projects
  can still consume existing buffers.

- data copy between ublk server IO buffer and ublk block IO request

  The driver needs to copy the block IO request pages into the server buffer
  (pages) first for WRITE before notifying the server of the coming IO, so
  that the server can handle the WRITE request.

  When the server handles a READ request and sends
  ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the driver, ublkdrv needs to copy
  the server buffer (pages) read to the IO request pages.

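The per-tag command cycle and the copy direction described above can be
modelled without io_uring as a small simulation. This is a sketch under stated
assumptions, not driver code: every ``sketch_*`` and ``SKETCH_*`` name is
hypothetical, a fixed 16-byte buffer stands in for the request pages and the
server IO buffer, and ``UBLK_IO_NEED_GET_DATA`` is left out.

```c
#include <stdbool.h>
#include <string.h>

#define SKETCH_BUF_SIZE 16	/* hypothetical IO size for the simulation */

enum sketch_op { SKETCH_READ, SKETCH_WRITE };

enum sketch_slot_state {
	SKETCH_SLOT_IDLE,		/* no command issued for this tag yet */
	SKETCH_SLOT_DRIVER_WAITING,	/* driver holds a fetch command */
	SKETCH_SLOT_SERVER_HANDLING,	/* request forwarded to the server */
};

struct sketch_slot {
	enum sketch_slot_state state;
	enum sketch_op op;
	char server_buf[SKETCH_BUF_SIZE];	/* server IO buffer */
	char *req_pages;			/* stand-in for request pages */
	int result;				/* last committed result */
};

/* Models UBLK_IO_FETCH_REQ: sent once per tag to arm the slot. */
bool sketch_fetch(struct sketch_slot *s)
{
	if (s->state != SKETCH_SLOT_IDLE)
		return false;
	s->state = SKETCH_SLOT_DRIVER_WAITING;
	return true;
}

/* Models a request arriving on /dev/ublkb* and being forwarded. */
bool sketch_deliver(struct sketch_slot *s, enum sketch_op op, char *req_pages)
{
	if (s->state != SKETCH_SLOT_DRIVER_WAITING)
		return false;
	s->op = op;
	s->req_pages = req_pages;
	/* WRITE: request pages are copied to the server buffer first. */
	if (op == SKETCH_WRITE)
		memcpy(s->server_buf, req_pages, SKETCH_BUF_SIZE);
	s->state = SKETCH_SLOT_SERVER_HANDLING;
	return true;
}

/* Models UBLK_IO_COMMIT_AND_FETCH_REQ: commit, then re-arm the same tag. */
bool sketch_commit_and_fetch(struct sketch_slot *s, int result)
{
	if (s->state != SKETCH_SLOT_SERVER_HANDLING)
		return false;
	/* READ: the server buffer is copied back to the request pages. */
	if (s->op == SKETCH_READ)
		memcpy(s->req_pages, s->server_buf, SKETCH_BUF_SIZE);
	s->result = result;
	s->state = SKETCH_SLOT_DRIVER_WAITING;
	return true;
}
```

After the first ``sketch_fetch()``, the slot only alternates between the
driver-waiting and server-handling states, which reflects why a single
``UBLK_IO_COMMIT_AND_FETCH_REQ`` per IO is enough in steady state.
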
Future development
==================

Container-aware ublk device
---------------------------

The ublk driver doesn't handle any IO logic. Its function is well defined
for now, and only a very limited set of userspace interfaces is needed, which
is also well defined. It is possible to make ublk devices container-aware
block devices in the future, as Stefan Hajnoczi suggested [#stefan]_, by
removing ADMIN privilege.

Zero copy
---------

Zero copy is a generic requirement for nbd, fuse or similar drivers. A
problem [#xiaoguang]_ Xiaoguang mentioned is that pages mapped to userspace
can't be remapped any more in the kernel with existing mm interfaces. This can
occur when destining direct IO to ``/dev/ublkb*``. Also, he reported that
big requests (IO size >= 256 KB) may benefit a lot from zero copy.


References
==========

.. [#userspace] https://github.com/ming1/ubdsrv

.. [#userspace_lib] https://github.com/ming1/ubdsrv/tree/master/lib

.. [#userspace_nbdublk] https://gitlab.com/rwmjones/libnbd/-/tree/nbdublk

.. [#userspace_readme] https://github.com/ming1/ubdsrv/blob/master/README

.. [#stefan] https://lore.kernel.org/linux-block/[email protected]/

.. [#xiaoguang] https://lore.kernel.org/linux-block/[email protected]/

MAINTAINERS

Lines changed: 1 addition & 0 deletions
@@ -20764,6 +20764,7 @@ UBLK USERSPACE BLOCK DRIVER
 M: Ming Lei <[email protected]>
 
 S: Maintained
+F: Documentation/block/ublk.rst
 F: drivers/block/ublk_drv.c
 F: include/uapi/linux/ublk_cmd.h

drivers/nvme/host/pci.c

Lines changed: 2 additions & 0 deletions
@@ -3517,6 +3517,8 @@ static const struct pci_device_id nvme_id_table[] = {
 		.driver_data = NVME_QUIRK_NO_DEEPEST_PS, },
 	{ PCI_DEVICE(0xc0a9, 0x540a), /* Crucial P2 */
 		.driver_data = NVME_QUIRK_BOGUS_NID, },
+	{ PCI_DEVICE(0x1d97, 0x2263), /* Lexar NM610 */
+		.driver_data = NVME_QUIRK_BOGUS_NID, },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMAZON, 0x0061),
 		.driver_data = NVME_QUIRK_DMA_ADDRESS_BITS_48, },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMAZON, 0x0065),

drivers/nvme/target/auth.c

Lines changed: 1 addition & 0 deletions
@@ -196,6 +196,7 @@ int nvmet_setup_auth(struct nvmet_ctrl *ctrl)
 	if (IS_ERR(ctrl->ctrl_key)) {
 		ret = PTR_ERR(ctrl->ctrl_key);
 		ctrl->ctrl_key = NULL;
+		goto out_free_hash;
 	}
 	pr_debug("%s: using ctrl hash %s key %*ph\n", __func__,
 		 ctrl->ctrl_key->hash > 0 ?

drivers/nvme/target/tcp.c

Lines changed: 3 additions & 0 deletions
@@ -1506,6 +1506,9 @@ static void nvmet_tcp_state_change(struct sock *sk)
 		goto done;
 
 	switch (sk->sk_state) {
+	case TCP_FIN_WAIT2:
+	case TCP_LAST_ACK:
+		break;
 	case TCP_FIN_WAIT1:
 	case TCP_CLOSE_WAIT:
 	case TCP_CLOSE:
