
Commit ae72950

chucklever authored and amschuma-ntap committed
xprtrdma: Add data structure to manage RDMA Send arguments
Problem statement: Recently Sagi Grimberg <[email protected]> observed that kernel RDMA- enabled storage initiators don't handle delayed Send completion correctly. If Send completion is delayed beyond the end of a ULP transaction, the ULP may release resources that are still being used by the HCA to complete a long-running Send operation. This is a common design trait amongst our initiators. Most Send operations are faster than the ULP transaction they are part of. Waiting for a completion for these is typically unnecessary. Infrequently, a network partition or some other problem crops up where an ordering problem can occur. In NFS parlance, the RPC Reply arrives and completes the RPC, but the HCA is still retrying the Send WR that conveyed the RPC Call. In this case, the HCA can try to use memory that has been invalidated or DMA unmapped, and the connection is lost. If that memory has been re-used for something else (possibly not related to NFS), and the Send retransmission exposes that data on the wire. Thus we cannot assume that it is safe to release Send-related resources just because a ULP reply has arrived. After some analysis, we have determined that the completion housekeeping will not be difficult for xprtrdma: - Inline Send buffers are registered via the local DMA key, and are already left DMA mapped for the lifetime of a transport connection, thus no additional handling is necessary for those - Gathered Sends involving page cache pages _will_ need to DMA unmap those pages after the Send completes. But like inline send buffers, they are registered via the local DMA key, and thus will not need to be invalidated In addition, RPC completion will need to wait for Send completion in the latter case. However, nearly always, the Send that conveys the RPC Call will have completed long before the RPC Reply arrives, and thus no additional latency will be accrued. Design notes: In this patch, the rpcrdma_sendctx object is introduced, and a lock-free circular queue is added to manage a set of them per transport. The RPC client's send path already prevents sending more than one RPC Call at the same time. This allows us to treat the consumer side of the queue (rpcrdma_sendctx_get_locked) as if there is a single consumer thread. The producer side of the queue (rpcrdma_sendctx_put_locked) is invoked only from the Send completion handler, which is a single thread of execution (soft IRQ). The only care that needs to be taken is with the tail index, which is shared between the producer and consumer. Only the producer updates the tail index. The consumer compares the head with the tail to ensure that the a sendctx that is in use is never handed out again (or, expressed more conventionally, the queue is empty). When the sendctx queue empties completely, there are enough Sends outstanding that posting more Send operations can result in a Send Queue overflow. In this case, the ULP is told to wait and try again. This introduces strong Send Queue accounting to xprtrdma. As a final touch, Jason Gunthorpe <[email protected]> suggested a mechanism that does not require signaling every Send. We signal once every N Sends, and perform SGE unmapping of N Send operations during that one completion. Reported-by: Sagi Grimberg <[email protected]> Suggested-by: Jason Gunthorpe <[email protected]> Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
1 parent a062a2a commit ae72950
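
To make the design notes above concrete, here is a minimal, self-contained user-space sketch of the single-producer/single-consumer circular queue they describe: the consumer (the send path) advances only the head, the producer (the Send completion handler) advances only the tail, and the consumer refuses to hand out a context when the head would catch up to the tail. The names (sendctx_queue, ctxq_get, ctxq_put) and the C11-atomics model are illustrative assumptions only and do not reflect the actual rpcrdma_sendctx_get_locked()/rpcrdma_sendctx_put_locked() implementations in this patch.

/* Illustrative model only -- not the xprtrdma code. A single producer
 * (the completion handler) and a single consumer (the send path) share
 * a ring; only the consumer writes 'head', only the producer writes 'tail'.
 */
#include <stdatomic.h>
#include <stdio.h>

#define CTXQ_SIZE 8				/* power of two */

struct sendctx_queue {
	unsigned long head;			/* consumer-owned */
	atomic_ulong tail;			/* producer-owned */
	int slots[CTXQ_SIZE];			/* stand-ins for sendctx objects */
};

/* Consumer: hand out the next free context, or NULL if the queue is
 * "empty" (every context is still owned by the HCA). */
static int *ctxq_get(struct sendctx_queue *q)
{
	unsigned long next = (q->head + 1) & (CTXQ_SIZE - 1);

	if (next == atomic_load_explicit(&q->tail, memory_order_acquire))
		return NULL;			/* caller must wait and retry */
	q->head = next;
	return &q->slots[next];
}

/* Producer: one signaled Send completion releases every context up to
 * and including the signaled slot (the "signal once every N" batching). */
static void ctxq_put(struct sendctx_queue *q, unsigned long signaled)
{
	unsigned long tail = atomic_load_explicit(&q->tail, memory_order_relaxed);

	do {
		tail = (tail + 1) & (CTXQ_SIZE - 1);
		/* the real code would DMA-unmap the SGEs of slot 'tail' here */
		atomic_store_explicit(&q->tail, tail, memory_order_release);
	} while (tail != signaled);
}

int main(void)
{
	struct sendctx_queue q = { .head = 0, .tail = 0 };
	int handed_out = 0;

	while (ctxq_get(&q))			/* drain: capacity is CTXQ_SIZE - 1 */
		handed_out++;
	printf("handed out %d contexts before the queue went empty\n", handed_out);

	ctxq_put(&q, 4);			/* one completion covers slots 1..4 */
	printf("after completion, the next get %s\n",
	       ctxq_get(&q) ? "succeeds" : "still fails");
	return 0;
}

Because each index is written by exactly one side, no lock is needed; the acquire/release pair on the tail is the only synchronization, which mirrors the single-consumer/single-producer argument in the design notes.
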

4 files changed: +247 −32 lines changed

net/sunrpc/xprtrdma/rpc_rdma.c

Lines changed: 24 additions & 16 deletions
@@ -512,23 +512,26 @@ rpcrdma_encode_reply_chunk(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
 }
 
 /**
- * rpcrdma_unmap_sges - DMA-unmap Send buffers
- * @ia: interface adapter (device)
- * @req: req with possibly some SGEs to be DMA unmapped
+ * rpcrdma_unmap_sendctx - DMA-unmap Send buffers
+ * @sc: sendctx containing SGEs to unmap
  *
  */
 void
-rpcrdma_unmap_sges(struct rpcrdma_ia *ia, struct rpcrdma_req *req)
+rpcrdma_unmap_sendctx(struct rpcrdma_sendctx *sc)
 {
+	struct rpcrdma_ia *ia = &sc->sc_xprt->rx_ia;
 	struct ib_sge *sge;
 	unsigned int count;
 
+	dprintk("RPC: %s: unmapping %u sges for sc=%p\n",
+		__func__, sc->sc_unmap_count, sc);
+
 	/* The first two SGEs contain the transport header and
 	 * the inline buffer. These are always left mapped so
 	 * they can be cheaply re-used.
 	 */
-	sge = &req->rl_send_sge[2];
-	for (count = req->rl_mapped_sges; count--; sge++)
+	sge = &sc->sc_sges[2];
+	for (count = sc->sc_unmap_count; count; ++sge, --count)
 		ib_dma_unmap_page(ia->ri_device,
 				  sge->addr, sge->length, DMA_TO_DEVICE);
 }
@@ -539,8 +542,9 @@ static bool
 rpcrdma_prepare_hdr_sge(struct rpcrdma_ia *ia, struct rpcrdma_req *req,
 			u32 len)
 {
+	struct rpcrdma_sendctx *sc = req->rl_sendctx;
 	struct rpcrdma_regbuf *rb = req->rl_rdmabuf;
-	struct ib_sge *sge = &req->rl_send_sge[0];
+	struct ib_sge *sge = sc->sc_sges;
 
 	if (!rpcrdma_dma_map_regbuf(ia, rb))
 		goto out_regbuf;
@@ -550,7 +554,7 @@ rpcrdma_prepare_hdr_sge(struct rpcrdma_ia *ia, struct rpcrdma_req *req,
 
 	ib_dma_sync_single_for_device(rdmab_device(rb), sge->addr,
 				      sge->length, DMA_TO_DEVICE);
-	req->rl_send_wr.num_sge++;
+	sc->sc_wr.num_sge++;
 	return true;
 
 out_regbuf:
@@ -565,10 +569,11 @@ static bool
 rpcrdma_prepare_msg_sges(struct rpcrdma_ia *ia, struct rpcrdma_req *req,
 			 struct xdr_buf *xdr, enum rpcrdma_chunktype rtype)
 {
+	struct rpcrdma_sendctx *sc = req->rl_sendctx;
 	unsigned int sge_no, page_base, len, remaining;
 	struct rpcrdma_regbuf *rb = req->rl_sendbuf;
 	struct ib_device *device = ia->ri_device;
-	struct ib_sge *sge = req->rl_send_sge;
+	struct ib_sge *sge = sc->sc_sges;
 	u32 lkey = ia->ri_pd->local_dma_lkey;
 	struct page *page, **ppages;
 
@@ -631,7 +636,7 @@ rpcrdma_prepare_msg_sges(struct rpcrdma_ia *ia, struct rpcrdma_req *req,
 			sge[sge_no].length = len;
 			sge[sge_no].lkey = lkey;
 
-			req->rl_mapped_sges++;
+			sc->sc_unmap_count++;
 			ppages++;
 			remaining -= len;
 			page_base = 0;
@@ -657,24 +662,24 @@ rpcrdma_prepare_msg_sges(struct rpcrdma_ia *ia, struct rpcrdma_req *req,
 			goto out_mapping_err;
 		sge[sge_no].length = len;
 		sge[sge_no].lkey = lkey;
-		req->rl_mapped_sges++;
+		sc->sc_unmap_count++;
 	}
 
 out:
-	req->rl_send_wr.num_sge += sge_no;
+	sc->sc_wr.num_sge += sge_no;
 	return true;
 
 out_regbuf:
 	pr_err("rpcrdma: failed to DMA map a Send buffer\n");
 	return false;
 
 out_mapping_overflow:
-	rpcrdma_unmap_sges(ia, req);
+	rpcrdma_unmap_sendctx(sc);
 	pr_err("rpcrdma: too many Send SGEs (%u)\n", sge_no);
 	return false;
 
 out_mapping_err:
-	rpcrdma_unmap_sges(ia, req);
+	rpcrdma_unmap_sendctx(sc);
 	pr_err("rpcrdma: Send mapping error\n");
 	return false;
 }
@@ -694,8 +699,11 @@ rpcrdma_prepare_send_sges(struct rpcrdma_xprt *r_xprt,
 			  struct rpcrdma_req *req, u32 hdrlen,
 			  struct xdr_buf *xdr, enum rpcrdma_chunktype rtype)
 {
-	req->rl_send_wr.num_sge = 0;
-	req->rl_mapped_sges = 0;
+	req->rl_sendctx = rpcrdma_sendctx_get_locked(&r_xprt->rx_buf);
+	if (!req->rl_sendctx)
+		return -ENOBUFS;
+	req->rl_sendctx->sc_wr.num_sge = 0;
+	req->rl_sendctx->sc_unmap_count = 0;
 
 	if (!rpcrdma_prepare_hdr_sge(&r_xprt->rx_ia, req, hdrlen))
 		return -EIO;

net/sunrpc/xprtrdma/transport.c

Lines changed: 3 additions & 3 deletions
@@ -687,7 +687,6 @@ xprt_rdma_free(struct rpc_task *task)
 
 	if (!list_empty(&req->rl_registered))
 		ia->ri_ops->ro_unmap_sync(r_xprt, &req->rl_registered);
-	rpcrdma_unmap_sges(ia, req);
 	rpcrdma_buffer_put(req);
 }
 
@@ -790,11 +789,12 @@ void xprt_rdma_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
 		   r_xprt->rx_stats.failed_marshal_count,
 		   r_xprt->rx_stats.bad_reply_count,
 		   r_xprt->rx_stats.nomsg_call_count);
-	seq_printf(seq, "%lu %lu %lu %lu\n",
+	seq_printf(seq, "%lu %lu %lu %lu %lu\n",
 		   r_xprt->rx_stats.mrs_recovered,
 		   r_xprt->rx_stats.mrs_orphaned,
 		   r_xprt->rx_stats.mrs_allocated,
-		   r_xprt->rx_stats.local_inv_needed);
+		   r_xprt->rx_stats.local_inv_needed,
+		   r_xprt->rx_stats.empty_sendctx_q);
 }
 
 static int
