
Commit f713ffa

Merge tag 'block-6.16-20250614' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:

 - Fix for a deadlock on queue freeze with zoned writes

 - Fix for zoned append emulation

 - Two bio folio fixes, for sparsemem and for very large folios

 - Fix for a performance regression introduced in 6.13 when plug
   insertion was changed

 - Fix for NVMe passthrough handling for polled IO

 - Document the ublk auto registration feature

 - loop lockdep warning fix

* tag 'block-6.16-20250614' of git://git.kernel.dk/linux:
  nvme: always punt polled uring_cmd end_io work to task_work
  Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists
  block: Fix bvec_set_folio() for very large folios
  bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP
  block: use plug request list tail for one-shot backmerge attempt
  block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work
  block: Clear BIO_EMULATES_ZONE_APPEND flag on BIO completion
  ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)
  loop: move lo_set_size() out of queue freeze
2 parents 6d13760 + 9ce6c98 commit f713ffa

7 files changed (+114, -38 lines)

Documentation/block/ublk.rst

Lines changed: 77 additions & 0 deletions
@@ -352,6 +352,83 @@ For reaching best IO performance, ublk server should align its segment
 parameter of `struct ublk_param_segment` with backend for avoiding
 unnecessary IO split, which usually hurts io_uring performance.
 
+Auto Buffer Registration
+------------------------
+
+The ``UBLK_F_AUTO_BUF_REG`` feature automatically handles buffer registration
+and unregistration for I/O requests, which simplifies buffer management and
+reduces overhead in the ublk server implementation.
+
+This is another feature flag for zero copy, and it is compatible with
+``UBLK_F_SUPPORT_ZERO_COPY``.
+
+Feature Overview
+~~~~~~~~~~~~~~~~
+
+This feature automatically registers request buffers with the io_uring
+context before delivering I/O commands to the ublk server and unregisters
+them when completing I/O commands. This eliminates the need for manual
+buffer registration/unregistration via the ``UBLK_IO_REGISTER_IO_BUF`` and
+``UBLK_IO_UNREGISTER_IO_BUF`` commands, so I/O handling in the ublk server
+no longer depends on those two uring_cmd operations.
+
+I/Os cannot be issued concurrently to io_uring if there is any dependency
+among them. Removing the dependency on the buffer registration and
+unregistration commands therefore not only simplifies the ublk server
+implementation, but also makes concurrent I/O handling possible.
+
+Usage Requirements
+~~~~~~~~~~~~~~~~~~
+
+1. The ublk server must create a sparse buffer table on the same
+   ``io_ring_ctx`` used for ``UBLK_IO_FETCH_REQ`` and
+   ``UBLK_IO_COMMIT_AND_FETCH_REQ``. If the uring_cmd is issued on a
+   different ``io_ring_ctx``, manual buffer unregistration is required.
+
+2. Buffer registration data must be passed via the uring_cmd's ``sqe->addr``
+   with the following structure::
+
+    struct ublk_auto_buf_reg {
+        __u16 index;     /* Buffer index for registration */
+        __u8 flags;      /* Registration flags */
+        __u8 reserved0;  /* Reserved for future use */
+        __u32 reserved1; /* Reserved for future use */
+    };
+
+   ublk_auto_buf_reg_to_sqe_addr() converts the above structure into
+   ``sqe->addr``.
+
+3. All reserved fields in ``ublk_auto_buf_reg`` must be zeroed.
+
+4. Optional flags can be passed via ``ublk_auto_buf_reg.flags``.
+
+Fallback Behavior
+~~~~~~~~~~~~~~~~~
+
+If auto buffer registration fails:
+
+1. When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
+
+   - The uring_cmd is completed
+   - ``UBLK_IO_F_NEED_REG_BUF`` is set in ``ublksrv_io_desc.op_flags``
+   - The ublk server must handle the failure itself, for example by
+     registering the buffer manually or by using the user copy feature to
+     retrieve the data for the ublk I/O (see the sketches after this diff)
+
+2. If fallback is not enabled:
+
+   - The ublk I/O request fails silently
+   - The uring_cmd won't be completed
+
+Limitations
+~~~~~~~~~~~
+
+- Requires the same ``io_ring_ctx`` for all operations
+- May require manual buffer management in fallback cases
+- The io_ring_ctx buffer table has a maximum size of 16K, which may not be
+  enough when many ublk devices are handled by a single io_ring_ctx and
+  each has a very large queue depth
+
 References
 ==========
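To make the new doc section concrete, here is a minimal sketch of how a ublk server might request auto buffer registration when queueing a fetch command. It assumes an IORING_SETUP_SQE128 ring; queue_fetch_with_auto_reg() and the tag-as-buffer-index policy are illustrative assumptions, while the structures, UBLK_U_IO_FETCH_REQ, UBLK_AUTO_BUF_REG_FALLBACK, and ublk_auto_buf_reg_to_sqe_addr() come from the io_uring and ublk UAPI headers:

    /* Sketch only, not ublksrv code: assumes an IORING_SETUP_SQE128 ring,
     * <linux/io_uring.h> and <linux/ublk_cmd.h>. */
    static void queue_fetch_with_auto_reg(struct io_uring_sqe *sqe,
                                          int ublk_char_fd, __u16 q_id, __u16 tag)
    {
        /* the ublk command payload lives in the SQE's 80-byte cmd area */
        struct ublksrv_io_cmd *cmd = (struct ublksrv_io_cmd *)sqe->cmd;
        struct ublk_auto_buf_reg reg = {
            .index = tag,                   /* assumption: one slot per tag */
            .flags = UBLK_AUTO_BUF_REG_FALLBACK,
            /* reserved0/reserved1 stay zeroed, as the doc requires */
        };

        sqe->opcode = IORING_OP_URING_CMD;
        sqe->fd     = ublk_char_fd;
        sqe->cmd_op = UBLK_U_IO_FETCH_REQ;
        cmd->q_id   = q_id;
        cmd->tag    = tag;
        /* the doc's "sqe->addr" is the addr field of the ublksrv_io_cmd
         * carried inside the SQE; pack index + flags via the UAPI helper */
        cmd->addr   = ublk_auto_buf_reg_to_sqe_addr(&reg);
    }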

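And for the fallback path, a sketch of the check a server might perform on the I/O descriptor before trusting the auto-registered buffer; ublksrv_io_desc.op_flags and UBLK_IO_F_NEED_REG_BUF are from the ublk UAPI, the helper name is made up:

    /* Sketch only: iod is the ublksrv_io_desc for the completed tag. */
    static bool auto_reg_succeeded(const struct ublksrv_io_desc *iod)
    {
        /* With UBLK_AUTO_BUF_REG_FALLBACK set, a failed registration still
         * completes the uring_cmd but flags the descriptor; the server must
         * then register the buffer itself or fall back to user copy. */
        return !(iod->op_flags & UBLK_IO_F_NEED_REG_BUF);
    }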
block/blk-merge.c

Lines changed: 13 additions & 13 deletions
@@ -998,20 +998,20 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 	if (!plug || rq_list_empty(&plug->mq_list))
 		return false;
 
-	rq_list_for_each(&plug->mq_list, rq) {
-		if (rq->q == q) {
-			if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
-			    BIO_MERGE_OK)
-				return true;
-			break;
-		}
+	rq = plug->mq_list.tail;
+	if (rq->q == q)
+		return blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
+			BIO_MERGE_OK;
+	else if (!plug->multiple_queues)
+		return false;
 
-		/*
-		 * Only keep iterating plug list for merges if we have multiple
-		 * queues
-		 */
-		if (!plug->multiple_queues)
-			break;
+	rq_list_for_each(&plug->mq_list, rq) {
+		if (rq->q != q)
+			continue;
+		if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
+		    BIO_MERGE_OK)
+			return true;
+		break;
 	}
 	return false;
 }
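Distilled from the hunk above, the new strategy in simplified form: try a one-shot backmerge against the most recently plugged request, and only walk the list when requests from multiple queues are plugged. Types and the try_merge() helper are simplified stand-ins for the kernel's, not kernel API:

    #include <stdbool.h>
    #include <stddef.h>

    /* Simplified stand-ins for the kernel types; illustration only. */
    struct queue;
    struct bio;
    struct request { struct request *next; struct queue *q; };
    struct plug_list { struct request *head, *tail; bool multiple_queues; };

    /* stand-in for blk_attempt_bio_merge() */
    extern bool try_merge(struct queue *q, struct request *rq, struct bio *bio);

    static bool plug_backmerge(struct plug_list *plug, struct queue *q,
                               struct bio *bio)
    {
        struct request *rq = plug->tail;  /* newest request: best candidate */

        if (rq->q == q)                   /* one-shot attempt, O(1) */
            return try_merge(q, rq, bio);
        if (!plug->multiple_queues)       /* all plugged requests share a queue */
            return false;
        /* mixed queues: scan for the first request on this queue */
        for (rq = plug->head; rq; rq = rq->next) {
            if (rq->q != q)
                continue;
            return try_merge(q, rq, bio);
        }
        return false;
    }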

block/blk-zoned.c

Lines changed: 6 additions & 2 deletions
@@ -1225,6 +1225,7 @@ void blk_zone_write_plug_bio_endio(struct bio *bio)
 	if (bio_flagged(bio, BIO_EMULATES_ZONE_APPEND)) {
 		bio->bi_opf &= ~REQ_OP_MASK;
 		bio->bi_opf |= REQ_OP_ZONE_APPEND;
+		bio_clear_flag(bio, BIO_EMULATES_ZONE_APPEND);
 	}
 
 	/*
@@ -1306,16 +1307,19 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
 	spin_unlock_irqrestore(&zwplug->lock, flags);
 
 	bdev = bio->bi_bdev;
-	submit_bio_noacct_nocheck(bio);
 
 	/*
 	 * blk-mq devices will reuse the extra reference on the request queue
 	 * usage counter we took when the BIO was plugged, but the submission
 	 * path for BIO-based devices will not do that. So drop this extra
 	 * reference here.
 	 */
-	if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO))
+	if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO)) {
+		bdev->bd_disk->fops->submit_bio(bio);
 		blk_queue_exit(bdev->bd_disk->queue);
+	} else {
+		blk_mq_submit_bio(bio);
+	}
 
 put_zwplug:
 	/* Drop the reference we took in disk_zone_wplug_schedule_bio_work(). */

drivers/block/loop.c

Lines changed: 5 additions & 6 deletions
@@ -1248,19 +1248,18 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 	lo->lo_flags &= ~LOOP_SET_STATUS_CLEARABLE_FLAGS;
 	lo->lo_flags |= (info->lo_flags & LOOP_SET_STATUS_SETTABLE_FLAGS);
 
-	if (size_changed) {
-		loff_t new_size = get_size(lo->lo_offset, lo->lo_sizelimit,
-					   lo->lo_backing_file);
-		loop_set_size(lo, new_size);
-	}
-
 	/* update the direct I/O flag if lo_offset changed */
 	loop_update_dio(lo);
 
 out_unfreeze:
 	blk_mq_unfreeze_queue(lo->lo_queue, memflags);
 	if (partscan)
 		clear_bit(GD_SUPPRESS_PART_SCAN, &lo->lo_disk->state);
+	if (!err && size_changed) {
+		loff_t new_size = get_size(lo->lo_offset, lo->lo_sizelimit,
+					   lo->lo_backing_file);
+		loop_set_size(lo, new_size);
+	}
 out_unlock:
 	mutex_unlock(&lo->lo_mutex);
 	if (partscan)

drivers/nvme/host/ioctl.c

Lines changed: 7 additions & 14 deletions
@@ -429,21 +429,14 @@ static enum rq_end_io_ret nvme_uring_cmd_end_io(struct request *req,
 	pdu->result = le64_to_cpu(nvme_req(req)->result.u64);
 
 	/*
-	 * For iopoll, complete it directly. Note that using the uring_cmd
-	 * helper for this is safe only because we check blk_rq_is_poll().
-	 * As that returns false if we're NOT on a polled queue, then it's
-	 * safe to use the polled completion helper.
-	 *
-	 * Otherwise, move the completion to task work.
+	 * IOPOLL could potentially complete this request directly, but
+	 * if multiple rings are polling on the same queue, then it's possible
+	 * for one ring to find completions for another ring. Punting the
+	 * completion via task_work will always direct it to the right
+	 * location, rather than potentially complete requests for ringA
+	 * under iopoll invocations from ringB.
 	 */
-	if (blk_rq_is_poll(req)) {
-		if (pdu->bio)
-			blk_rq_unmap_user(pdu->bio);
-		io_uring_cmd_iopoll_done(ioucmd, pdu->result, pdu->status);
-	} else {
-		io_uring_cmd_do_in_task_lazy(ioucmd, nvme_uring_task_cb);
-	}
-
+	io_uring_cmd_do_in_task_lazy(ioucmd, nvme_uring_task_cb);
 	return RQ_END_IO_FREE;
 }

include/linux/bio.h

Lines changed: 1 addition & 1 deletion
@@ -291,7 +291,7 @@ static inline void bio_first_folio(struct folio_iter *fi, struct bio *bio,
 
 	fi->folio = page_folio(bvec->bv_page);
 	fi->offset = bvec->bv_offset +
-			PAGE_SIZE * (bvec->bv_page - &fi->folio->page);
+			PAGE_SIZE * folio_page_idx(fi->folio, bvec->bv_page);
 	fi->_seg_count = bvec->bv_len;
 	fi->length = min(folio_size(fi->folio) - fi->offset, fi->_seg_count);
 	fi->_next = folio_next(fi->folio);
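Why folio_page_idx() is the right tool here: with SPARSEMEM and no VMEMMAP, struct pages are only contiguous within a memory section, so raw pointer arithmetic across a folio can give a wrong index. A paraphrased sketch of the distinction (based on the kernel's definition in mm.h, not part of this commit):

    /* With SPARSEMEM && !SPARSEMEM_VMEMMAP, struct page arrays are only
     * contiguous within a section, so the index must go through PFNs:
     *
     *     idx = page_to_pfn(page) - folio_pfn(folio);
     *
     * whereas with a virtually contiguous memmap the pointer difference
     *
     *     idx = page - &folio->page;
     *
     * happens to be equivalent. bio_first_folio() previously hardcoded the
     * second form, which breaks in the first configuration. */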

include/linux/bvec.h

Lines changed: 5 additions & 2 deletions
@@ -57,9 +57,12 @@ static inline void bvec_set_page(struct bio_vec *bv, struct page *page,
  * @offset: offset into the folio
  */
 static inline void bvec_set_folio(struct bio_vec *bv, struct folio *folio,
-		unsigned int len, unsigned int offset)
+		size_t len, size_t offset)
 {
-	bvec_set_page(bv, &folio->page, len, offset);
+	unsigned long nr = offset / PAGE_SIZE;
+
+	WARN_ON_ONCE(len > UINT_MAX);
+	bvec_set_page(bv, folio_page(folio, nr), len, offset % PAGE_SIZE);
 }
 
 /**
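To see why the signature change matters, a standalone sketch of the (page index, in-page offset) split the fixed bvec_set_folio() performs; the 5 GiB offset is a made-up value for a hypothetical very large folio, and PAGE_SIZE is assumed to be 4 KiB:

    /* Standalone illustration; compiles with any C compiler. */
    #include <stdio.h>

    #define PAGE_SIZE 4096UL

    int main(void)
    {
        size_t offset = 5UL << 30;              /* 5 GiB into the folio */
        unsigned long nr = offset / PAGE_SIZE;  /* page index within folio */
        unsigned long rem = offset % PAGE_SIZE; /* offset within that page */

        /* With the old 'unsigned int offset' parameter, 5 GiB would be
         * truncated before this split could ever happen. */
        printf("page %lu, offset %lu\n", nr, rem);
        return 0;
    }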
