|
| 1 | +.. SPDX-License-Identifier: GPL-2.0 |
| 2 | + |
| 3 | +=========================================== |
| 4 | +Userspace block device driver (ublk driver) |
| 5 | +=========================================== |
| 6 | + |
| 7 | +Overview |
| 8 | +======== |
| 9 | + |
| 10 | +ublk is a generic framework for implementing block device logic from userspace. |
| 11 | +The motivation is that moving virtual block drivers such as loop, nbd and |
| 12 | +similar ones into userspace can be very helpful. It also makes it easier to |
| 13 | +implement new virtual block devices such as ublk-qcow2 (there have been |
| 14 | +several attempts at implementing a qcow2 driver in the kernel). |
| 15 | + |
| 16 | +Userspace block devices are attractive because: |
| 17 | + |
| 18 | +- They can be written in many programming languages. |
| 19 | +- They can use libraries that are not available in the kernel. |
| 20 | +- They can be debugged with tools familiar to application developers. |
| 21 | +- Crashes do not kernel panic the machine. |
| 22 | +- Bugs are likely to have a lower security impact than bugs in kernel |
| 23 | + code. |
| 24 | +- They can be installed and updated independently of the kernel. |
| 25 | +- They can be used to easily simulate a block device with user-specified |
| 26 | + parameters/settings for test/debug purposes. |
| 27 | + |
| 28 | +A ublk block device (``/dev/ublkb*``) is added by the ublk driver. Any IO request |
| 29 | +on the device is forwarded to the ublk userspace program. For convenience, |
| 30 | +in this document, ``ublk server`` refers to a generic ublk userspace |
| 31 | +program. ``ublksrv`` [#userspace]_ is one such implementation. It |
| 32 | +provides the ``libublksrv`` [#userspace_lib]_ library for conveniently developing |
| 33 | +specific user block devices, and it also includes generic block device types |
| 34 | +such as loop and null. Richard W.M. Jones wrote a userspace nbd device, |
| 35 | +``nbdublk`` [#userspace_nbdublk]_, based on ``libublksrv`` [#userspace_lib]_. |
| 36 | + |
| 37 | +After the IO is handled by userspace, the result is committed back to the |
| 38 | +driver, thus completing the request cycle. This way, any specific IO handling |
| 39 | +logic is totally done by userspace, such as loop's IO handling, NBD's IO |
| 40 | +communication, or qcow2's IO mapping. |
| 41 | + |
| 42 | +``/dev/ublkb*`` is driven by a blk-mq request-based driver. Each request is |
| 43 | +assigned a queue-wide unique tag. The ublk server also assigns a unique tag to |
| 44 | +each IO, which maps 1:1 to the IO of ``/dev/ublkb*``. |
| 45 | + |
| 46 | +Both IO request forwarding and IO result committing are done via |
| 47 | +``io_uring`` passthrough commands; that is why ublk is also an io_uring based |
| 48 | +block driver. It has been observed that using io_uring passthrough commands can |
| 49 | +give better IOPS than block IO, which is why ublk is a high performance |
| 50 | +implementation of a userspace block device: not only is IO request communication |
| 51 | +done by io_uring, but the preferred IO handling approach in the ublk server is |
| 52 | +also io_uring based. |
| 53 | + |
| 54 | +ublk provides a control interface to set/get ublk block device parameters. |
| 55 | +The interface is extendable and kABI compatible: basically any ublk request |
| 56 | +queue parameter or ublk generic feature parameter can be set/get via the |
| 57 | +interface. Thus, ublk is a generic userspace block device framework. |
| 58 | +For example, it is easy to set up a ublk device with specified block |
| 59 | +parameters from userspace. |
| 60 | + |
| 61 | +Using ublk |
| 62 | +========== |
| 63 | + |
| 64 | +ublk requires a userspace ublk server to handle the real block device logic. |
| 65 | + |
| 66 | +Below is an example of using ``ublksrv`` to provide a ublk-based loop device. |
| 67 | + |
| 68 | +- add a device:: |
| 69 | + |
| 70 | + ublk add -t loop -f ublk-loop.img |
| 71 | + |
| 72 | +- format with xfs, then use it:: |
| 73 | + |
| 74 | + mkfs.xfs /dev/ublkb0 |
| 75 | + mount /dev/ublkb0 /mnt |
| 76 | + # do anything. all IOs are handled by io_uring |
| 77 | + ... |
| 78 | + umount /mnt |
| 79 | + |
| 80 | +- list the devices with their info:: |
| 81 | + |
| 82 | + ublk list |
| 83 | + |
| 84 | +- delete the device:: |
| 85 | + |
| 86 | + ublk del -a |
| 87 | + ublk del -n $ublk_dev_id |
| 88 | + |
| 89 | +See usage details in the README of ``ublksrv`` [#userspace_readme]_. |
| 90 | + |
| 91 | +Design |
| 92 | +====== |
| 93 | + |
| 94 | +Control plane |
| 95 | +------------- |
| 96 | + |
| 97 | +The ublk driver provides a global misc device node (``/dev/ublk-control``) for |
| 98 | +managing and controlling ublk devices with the help of several control commands: |
| 99 | + |
| 100 | +- ``UBLK_CMD_ADD_DEV`` |
| 101 | + |
| 102 | + Add a ublk char device (``/dev/ublkc*``) which the ublk server talks to |
| 103 | + for IO command communication. Basic device info is sent together with this |
| 104 | + command. It fills in the UAPI structure ``ublksrv_ctrl_dev_info``, |
| 105 | + such as ``nr_hw_queues``, ``queue_depth``, and max IO request buffer size; |
| 106 | + this info is negotiated with the driver and sent back to the server. |
| 107 | + When this command is completed, the basic device info is immutable. |
| 108 | + |
| 109 | +- ``UBLK_CMD_SET_PARAMS`` / ``UBLK_CMD_GET_PARAMS`` |
| 110 | + |
| 111 | + Set or get parameters of the device, which can be either generic feature |
| 112 | + related, or request queue limit related, but can't be IO logic specific, |
| 113 | + because the driver does not handle any IO logic. This command has to be |
| 114 | + sent before sending ``UBLK_CMD_START_DEV``. |
| 115 | + |
| 116 | +- ``UBLK_CMD_START_DEV`` |
| 117 | + |
| 118 | + After the server prepares userspace resources (such as creating per-queue |
| 119 | + pthread & io_uring for handling ublk IO), this command is sent to the |
| 120 | + driver for allocating & exposing ``/dev/ublkb*``. Parameters set via |
| 121 | + ``UBLK_CMD_SET_PARAMS`` are applied for creating the device. |
| 122 | + |
| 123 | +- ``UBLK_CMD_STOP_DEV`` |
| 124 | + |
| 125 | + Halt IO on ``/dev/ublkb*`` and remove the device. When this command returns, |
| 126 | + ublk server will release resources (such as destroying per-queue pthread & |
| 127 | + io_uring). |
| 128 | + |
| 129 | +- ``UBLK_CMD_DEL_DEV`` |
| 130 | + |
| 131 | + Remove ``/dev/ublkc*``. When this command returns, the allocated ublk device |
| 132 | + number can be reused. |
| 133 | + |
| 134 | +- ``UBLK_CMD_GET_QUEUE_AFFINITY`` |
| 135 | + |
| 136 | + When ``/dev/ublkc*`` is added, the driver creates the block layer tagset, so |
| 137 | + that each queue's affinity info is available. The server sends |
| 138 | + ``UBLK_CMD_GET_QUEUE_AFFINITY`` to retrieve queue affinity info. It can then |
| 139 | + set up the per-queue context efficiently, for example by binding the IO |
| 140 | + pthread to the affine CPUs and allocating buffers in IO thread context. |
| 141 | + |
| 142 | +- ``UBLK_CMD_GET_DEV_INFO`` |
| 143 | + |
| 144 | + For retrieving device info via ``ublksrv_ctrl_dev_info``. It is the server's |
| 145 | + responsibility to save IO target specific info in userspace. |
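
The ordering constraints between the control commands above can be sketched as a small state machine. This is only an illustrative model under stated assumptions: the ``CMD_*`` and ``DEV_*`` names and ``ctrl_apply()`` are hypothetical and are not the UAPI command values, and the real driver may accept or reject some transitions differently (for example stopping a device that was never started).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names modelling the command ordering described above;
 * these are NOT the UAPI command values from the ublk header. */
enum ctrl_cmd {
	CMD_ADD_DEV,
	CMD_SET_PARAMS,
	CMD_START_DEV,
	CMD_STOP_DEV,
	CMD_DEL_DEV,
};

enum dev_state {
	DEV_NONE,   /* no /dev/ublkc* exists */
	DEV_ADDED,  /* /dev/ublkc* exists, /dev/ublkb* not exposed */
	DEV_LIVE,   /* /dev/ublkb* is exposed */
};

/* Apply cmd to *st; return false if the command is out of order
 * in this simplified model. */
static bool ctrl_apply(enum dev_state *st, enum ctrl_cmd cmd)
{
	assert(st != NULL);

	switch (cmd) {
	case CMD_ADD_DEV:
		if (*st != DEV_NONE)
			return false;
		*st = DEV_ADDED;
		return true;
	case CMD_SET_PARAMS:
		/* Parameters have to be set before START_DEV. */
		return *st == DEV_ADDED;
	case CMD_START_DEV:
		if (*st != DEV_ADDED)
			return false;
		*st = DEV_LIVE;
		return true;
	case CMD_STOP_DEV:
		if (*st != DEV_LIVE)
			return false;
		*st = DEV_ADDED;
		return true;
	case CMD_DEL_DEV:
		/* Assumption of this model: delete only after stop. */
		if (*st != DEV_ADDED)
			return false;
		*st = DEV_NONE;
		return true;
	}
	return false;
}
```

In this model a server would walk ``CMD_ADD_DEV``, ``CMD_SET_PARAMS``, ``CMD_START_DEV``, and later ``CMD_STOP_DEV`` and ``CMD_DEL_DEV``, with any other order rejected.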
| 146 | + |
| 147 | +Data plane |
| 148 | +---------- |
| 149 | + |
| 150 | +ublk server needs to create per-queue IO pthread & io_uring for handling IO |
| 151 | +commands via io_uring passthrough. The per-queue IO pthread |
| 152 | +focuses on IO handling and shouldn't handle any control & management |
| 153 | +tasks. |
| 154 | + |
| 155 | +Each of the server's IOs is assigned a unique tag, which maps 1:1 to an IO |
| 156 | +request of ``/dev/ublkb*``. |
| 157 | + |
| 158 | +The UAPI structure ``ublksrv_io_desc`` is defined for describing each IO from |
| 159 | +the driver. A fixed mmapped area (array) on ``/dev/ublkc*`` is provided for |
| 160 | +exporting IO info to the server, such as IO offset, length, OP/flags and |
| 161 | +buffer address. Each ``ublksrv_io_desc`` instance can be indexed via queue id |
| 162 | +and IO tag directly. |
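
To illustrate what "indexed via queue id and IO tag" can look like, here is a minimal sketch. It assumes a single flat array with ``queue_depth`` descriptors per queue; ``struct io_desc``, ``io_desc_index()`` and ``io_desc_lookup()`` are hypothetical stand-ins, and the real mmap offsets, per-queue stride and field layout are defined by the ublk UAPI header and may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for the UAPI ``ublksrv_io_desc``. The fields
 * mirror the info listed above (OP/flags, length, offset, buffer
 * address); the names and layout here are assumptions. */
struct io_desc {
	uint32_t op_flags;
	uint32_t nr_sectors;
	uint64_t start_sector;
	uint64_t addr;
};

/* Hypothetical layout: one flat array, queue_depth entries per queue. */
static size_t io_desc_index(unsigned int q_id, unsigned int tag,
			    unsigned int queue_depth)
{
	assert(tag < queue_depth);
	return (size_t)q_id * queue_depth + tag;
}

/* Return the descriptor for (q_id, tag) inside the mapped array. */
static const struct io_desc *io_desc_lookup(const struct io_desc *base,
					    unsigned int q_id,
					    unsigned int tag,
					    unsigned int queue_depth)
{
	return &base[io_desc_index(q_id, tag, queue_depth)];
}
```

Because the lookup is plain arithmetic, the server needs no search or extra syscall to find the descriptor once it learns which tag completed.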
| 163 | + |
| 164 | +The following IO commands are communicated via io_uring passthrough command, |
| 165 | +and each command is only for forwarding the IO and committing the result |
| 166 | +with specified IO tag in the command data: |
| 167 | + |
| 168 | +- ``UBLK_IO_FETCH_REQ`` |
| 169 | + |
| 170 | + Sent from the server IO pthread for fetching future incoming IO requests |
| 171 | + destined to ``/dev/ublkb*``. This command is sent only once from the server |
| 172 | + IO pthread so that the ublk driver can set up the IO forwarding environment. |
| 173 | + |
| 174 | +- ``UBLK_IO_COMMIT_AND_FETCH_REQ`` |
| 175 | + |
| 176 | + When an IO request is destined to ``/dev/ublkb*``, the driver stores |
| 177 | + the IO's ``ublksrv_io_desc`` to the specified mapped area; then the |
| 178 | + previously received IO command of this IO tag (either ``UBLK_IO_FETCH_REQ`` |
| 179 | + or ``UBLK_IO_COMMIT_AND_FETCH_REQ``) is completed, so the server gets |
| 180 | + the IO notification via io_uring. |
| 181 | + |
| 182 | + After the server handles the IO, its result is committed back to the |
| 183 | + driver by sending ``UBLK_IO_COMMIT_AND_FETCH_REQ`` back. Once ublkdrv |
| 184 | + receives this command, it parses the result and completes the request to |
| 185 | + ``/dev/ublkb*``. At the same time, it sets up the environment for fetching |
| 186 | + future requests with the same IO tag. That is, ``UBLK_IO_COMMIT_AND_FETCH_REQ`` |
| 187 | + is reused for both fetching a request and committing back the IO result. |
| 188 | + |
| 189 | +- ``UBLK_IO_NEED_GET_DATA`` |
| 190 | + |
| 191 | + With ``UBLK_F_NEED_GET_DATA`` enabled, a WRITE request is first |
| 192 | + issued to the ublk server without data copy. The IO backend of the ublk server |
| 193 | + then receives the request, allocates a data buffer and embeds its address |
| 194 | + inside this new IO command. After the kernel driver gets the command, |
| 195 | + data is copied from the request pages to the backend's buffer. Finally, |
| 196 | + the backend receives the request again with the data to be written and can |
| 197 | + actually handle the request. |
| 198 | + |
| 199 | + ``UBLK_IO_NEED_GET_DATA`` adds one additional round-trip and one |
| 200 | + io_uring_enter() syscall. Users who think this may lower performance |
| 201 | + should not enable ``UBLK_F_NEED_GET_DATA``. The ublk server pre-allocates an IO |
| 202 | + buffer for each IO by default. Any new project should try to use this |
| 203 | + buffer to communicate with the ublk driver. However, existing projects may |
| 204 | + break or be unable to consume the new buffer interface; that's why this |
| 205 | + command is added for backwards compatibility so that existing projects |
| 206 | + can still consume existing buffers. |
| 207 | + |
| 208 | +- data copy between ublk server IO buffer and ublk block IO request |
| 209 | + |
| 210 | + The driver needs to copy the block IO request pages into the server buffer |
| 211 | + (pages) first for WRITE before notifying the server of the incoming IO, so |
| 212 | + that the server can handle the WRITE request. |
| 213 | + |
| 214 | + When the server handles a READ request and sends |
| 215 | + ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the driver, ublkdrv needs to copy |
| 216 | + the data read into the server buffer (pages) to the IO request pages. |
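
The per-tag command cycle and the copy timing described in the list above can be modelled roughly as follows. This is a sketch for reasoning about the protocol, not driver code: ``tag_state``, ``tag_event``, ``tag_step()`` and ``copy_point_for()`` are hypothetical names, and the model covers only the default path without ``UBLK_F_NEED_GET_DATA``.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one IO tag; names are illustrative, not UAPI. */
enum tag_state {
	TAG_IDLE,         /* no command issued for this tag yet */
	TAG_WAIT_REQUEST, /* driver holds a fetch command, no request yet */
	TAG_SERVER_OWNED, /* command completed, server is handling the IO */
};

enum tag_event {
	EV_FETCH_REQ,            /* server sends UBLK_IO_FETCH_REQ */
	EV_REQUEST_ARRIVES,      /* a request reaches /dev/ublkb* */
	EV_COMMIT_AND_FETCH_REQ, /* server sends UBLK_IO_COMMIT_AND_FETCH_REQ */
};

/* Advance the tag state; return false for an out-of-order event. */
static bool tag_step(enum tag_state *st, enum tag_event ev)
{
	assert(st != NULL);

	switch (ev) {
	case EV_FETCH_REQ:
		/* Sent only once, to arm the tag for the first request. */
		if (*st != TAG_IDLE)
			return false;
		*st = TAG_WAIT_REQUEST;
		return true;
	case EV_REQUEST_ARRIVES:
		/* Driver fills the descriptor and completes the command. */
		if (*st != TAG_WAIT_REQUEST)
			return false;
		*st = TAG_SERVER_OWNED;
		return true;
	case EV_COMMIT_AND_FETCH_REQ:
		/* Commits the result and re-arms the same tag. */
		if (*st != TAG_SERVER_OWNED)
			return false;
		*st = TAG_WAIT_REQUEST;
		return true;
	}
	return false;
}

enum copy_point {
	COPY_BEFORE_NOTIFY, /* WRITE: request pages -> server buffer */
	COPY_ON_COMMIT,     /* READ: server buffer -> request pages */
};

/* When the driver copies data for a request in the default mode. */
static enum copy_point copy_point_for(bool is_write)
{
	return is_write ? COPY_BEFORE_NOTIFY : COPY_ON_COMMIT;
}
```

After the initial ``EV_FETCH_REQ``, the tag only alternates between waiting for a request and being owned by the server, which is why a single command type can serve both purposes.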
| 217 | + |
| 218 | +Future development |
| 219 | +================== |
| 220 | + |
| 221 | +Container-aware ublk device |
| 222 | +---------------------------- |
| 223 | + |
| 224 | +The ublk driver doesn't handle any IO logic. Its function is well defined |
| 225 | +for now, and only a very limited set of userspace interfaces is needed, which |
| 226 | +is also well defined. It is possible to make ublk devices container-aware block |
| 227 | +devices in future, as Stefan Hajnoczi suggested [#stefan]_, by removing |
| 228 | +ADMIN privilege. |
| 229 | + |
| 230 | +Zero copy |
| 231 | +--------- |
| 232 | + |
| 233 | +Zero copy is a generic requirement for nbd, fuse or similar drivers. A |
| 234 | +problem [#xiaoguang]_ Xiaoguang mentioned is that pages mapped to userspace |
| 235 | +can't be remapped any more in the kernel with existing mm interfaces. This can |
| 236 | +occur when direct IO is issued to ``/dev/ublkb*``. Also, he reported that |
| 237 | +big requests (IO size >= 256 KB) may benefit a lot from zero copy. |
| 238 | + |
| 239 | + |
| 240 | +References |
| 241 | +========== |
| 242 | + |
| 243 | +.. [#userspace] https://github.com/ming1/ubdsrv |
| 244 | + |
| 245 | +.. [#userspace_lib] https://github.com/ming1/ubdsrv/tree/master/lib |
| 246 | + |
| 247 | +.. [#userspace_nbdublk] https://gitlab.com/rwmjones/libnbd/-/tree/nbdublk |
| 248 | + |
| 249 | +.. [#userspace_readme] https://github.com/ming1/ubdsrv/blob/master/README |
| 250 | + |
| 251 | +.. [#stefan] https://lore.kernel.org/linux-block/[email protected]/ |
| 252 | + |
| 253 | +.. [#xiaoguang] https://lore.kernel.org/linux-block/[email protected]/ |