@@ -462,8 +462,92 @@ XDP_OPTIONS getsockopt
Gets options from an XDP socket. The only one supported so far is
XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.

+ Multi-Buffer Support
+ ====================
+
+ With multi-buffer support, programs using AF_XDP sockets can receive
+ and transmit packets consisting of multiple buffers both in copy and
+ zero-copy mode. For example, a packet can consist of two
+ frames/buffers, one with the header and the other one with the data,
+ or a 9K Ethernet jumbo frame can be constructed by chaining together
+ three 4K frames.
+
+ Some definitions:
+
+ * A packet consists of one or more frames.
+
+ * A descriptor in one of the AF_XDP rings always refers to a single
+   frame. If the packet consists of a single frame, the descriptor
+   refers to the whole packet.
+
+ To enable multi-buffer support for an AF_XDP socket, use the new bind
+ flag XDP_USE_SG. If this is not provided, all multi-buffer packets
+ will be dropped just as before. Note that the XDP program loaded also
+ needs to be in multi-buffer mode. This can be accomplished by using
+ "xdp.frags" as the section name of the XDP program used.
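
As a rough illustration of the bind step described above, the sketch below composes the flags value an application might place in `sxdp_flags` before calling bind(). The constants are local mirrors of the values in linux/if_xdp.h, redefined only so the snippet compiles on its own, and `make_bind_flags()` is a hypothetical helper rather than a libxdp or kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local mirrors of the bind flags in <linux/if_xdp.h>. A real
 * application should include that header instead (assumption: the
 * installed headers are new enough to define XDP_USE_SG). */
#ifndef XDP_COPY
#define XDP_COPY     (1 << 1)
#endif
#ifndef XDP_ZEROCOPY
#define XDP_ZEROCOPY (1 << 2)
#endif
#ifndef XDP_USE_SG
#define XDP_USE_SG   (1 << 4)
#endif

/* Hypothetical helper: build the sxdp_flags value for bind(). */
uint16_t make_bind_flags(bool zerocopy, bool multi_buffer)
{
    uint16_t flags = zerocopy ? XDP_ZEROCOPY : XDP_COPY;

    /* Without XDP_USE_SG, multi-buffer packets are dropped. */
    if (multi_buffer)
        flags |= XDP_USE_SG;

    /* XDP_COPY and XDP_ZEROCOPY are mutually exclusive. */
    assert(!((flags & XDP_COPY) && (flags & XDP_ZEROCOPY)));
    return flags;
}
```

The flag alone is not sufficient: as noted above, the attached XDP program must also be built with the "xdp.frags" section name.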
+
+ To represent a packet consisting of multiple frames, a new flag called
+ XDP_PKT_CONTD is introduced in the options field of the Rx and Tx
+ descriptors. If it is true (1) the packet continues with the next
+ descriptor and if it is false (0) it means this is the last descriptor
+ of the packet. Why the reverse logic of the end-of-packet (eop) flag
+ found in many NICs? To preserve compatibility with non-multi-buffer
+ applications: they see this bit set to false for all packets on Rx,
+ and they set the options field to zero for Tx, as anything else
+ will be treated as an invalid descriptor.
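
The flag semantics above can be sketched in a few lines. This is only an illustration: `struct desc` is a stand-in with the same field layout as `struct xdp_desc`, XDP_PKT_CONTD is mirrored locally from linux/if_xdp.h, and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#ifndef XDP_PKT_CONTD
#define XDP_PKT_CONTD (1 << 0) /* local mirror of <linux/if_xdp.h> */
#endif

/* Stand-in with the same layout as struct xdp_desc. */
struct desc {
    uint64_t addr;
    uint32_t len;
    uint32_t options;
};

/* True if this descriptor is the last one of its packet. */
bool desc_is_eop(const struct desc *d)
{
    return !(d->options & XDP_PKT_CONTD);
}

/* Count the packets that end within descs[0..n). */
size_t count_complete_packets(const struct desc *descs, size_t n)
{
    size_t pkts = 0;

    assert(descs || n == 0);
    for (size_t i = 0; i < n; i++)
        if (desc_is_eop(&descs[i]))
            pkts++;
    return pkts;
}
```

Because a legacy application always leaves `options` at zero, every one of its descriptors is an end-of-packet descriptor under this encoding, which is the compatibility argument made above.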
+
+ These are the semantics for producing packets consisting of multiple
+ frames onto the AF_XDP Tx ring:
+
+ * When an invalid descriptor is found, all the other
+   descriptors/frames of this packet are marked as invalid and not
+   completed. The next descriptor is treated as the start of a new
+   packet, even if this was not the intent (because we cannot guess
+   the intent). As before, if your program is producing invalid
+   descriptors you have a bug that must be fixed.
+
+ * Zero length descriptors are treated as invalid descriptors.
+
+ * For copy mode, the maximum supported number of frames in a packet is
+   equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all
+   descriptors accumulated so far are dropped and treated as
+   invalid. To produce an application that will work on any system
+   regardless of this config setting, limit the number of frags to 18,
+   as the minimum value of the config is 17.
+
+ * For zero-copy mode, the limit is up to what the NIC HW
+   supports. Usually at least five on the NICs we have checked. We
+   consciously chose to not enforce a rigid limit (such as
+   CONFIG_MAX_SKB_FRAGS + 1) for zero-copy mode, as it would have
+   resulted in copy actions under the hood to fit into whatever limit
+   the NIC supports, which defeats the purpose of zero-copy mode. How
+   to probe for this limit is explained in the "Probing for
+   Multi-Buffer Support" section.
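
The frame-count rules in the list above could be checked before any descriptor is produced. The sketch below is one possible pre-flight check and rests on two assumptions: every frame except the last is filled completely, and the caller supplies the applicable limit (18 for portable copy mode, or the probed value for zero-copy). The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Portable copy-mode cap: minimum CONFIG_MAX_SKB_FRAGS (17) + 1. */
#define COPY_MODE_MAX_FRAMES 18u

/* Frames needed for pkt_len bytes when each frame holds frame_size. */
uint32_t frames_needed(uint32_t pkt_len, uint32_t frame_size)
{
    assert(frame_size > 0);
    return pkt_len / frame_size + (pkt_len % frame_size != 0);
}

/* Hypothetical check: can this packet be produced onto the Tx ring? */
bool tx_packet_ok(uint32_t pkt_len, uint32_t frame_size,
                  uint32_t max_frames)
{
    uint32_t n = frames_needed(pkt_len, frame_size);

    /* n == 0 would require a zero-length descriptor, which is invalid. */
    return n >= 1 && n <= max_frames;
}
```

For the 9K jumbo frame mentioned earlier, `frames_needed(9000, 4096)` yields three frames, well within the copy-mode cap.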
+
+ On the Rx path in copy-mode, the xsk core copies the XDP data into
+ multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as
+ detailed before. Zero-copy mode works the same, though the data is not
+ copied. When the application gets a descriptor with the XDP_PKT_CONTD
+ flag set to one, it means that the packet consists of multiple buffers
+ and it continues with the next buffer in the following
+ descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it
+ means that this is the last buffer of the packet. AF_XDP guarantees
+ that only a complete packet (all frames in the packet) is sent to the
+ application. If there is not enough space in the AF_XDP Rx ring, all
+ frames of the packet will be dropped.
+
+ If an application reads a batch of descriptors, using for example the
+ libxdp interfaces, it is not guaranteed that the batch will end with a
+ full packet. It might end in the middle of a packet and the rest of
+ the buffers of that packet will arrive at the beginning of the next
+ batch, since the libxdp interface does not read the whole ring (unless
+ you have an enormous batch size or a very small ring size).
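
Since a batch can end mid-packet, the reassembly state has to outlive a single batch. The following sketch shows one way to carry it in an explicit struct; the descriptor type is a stand-in for `struct xdp_desc`, XDP_PKT_CONTD is mirrored locally, and `struct rx_state` and `rx_consume_batch()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#ifndef XDP_PKT_CONTD
#define XDP_PKT_CONTD (1 << 0) /* local mirror of <linux/if_xdp.h> */
#endif

/* Stand-in with the same layout as struct xdp_desc. */
struct desc {
    uint64_t addr;
    uint32_t len;
    uint32_t options;
};

/* Reassembly state that must survive between batches. */
struct rx_state {
    bool     in_packet; /* a packet is only partially received */
    uint32_t frags;     /* frames seen for the current packet */
    uint32_t bytes;     /* bytes seen for the current packet */
};

/* Feed one batch; returns how many packets were completed by it. */
size_t rx_consume_batch(struct rx_state *st, const struct desc *batch,
                        size_t n)
{
    size_t done = 0;

    assert(st);
    for (size_t i = 0; i < n; i++) {
        if (!st->in_packet) {
            /* First frame of a new packet: reset the counters. */
            st->frags = 0;
            st->bytes = 0;
        }
        st->frags++;
        st->bytes += batch[i].len;
        st->in_packet = (batch[i].options & XDP_PKT_CONTD) != 0;
        if (!st->in_packet)
            done++;
    }
    return done;
}
```

A batch that stops after the first two frames of a three-frame packet returns 0 and leaves `in_packet` set; the next batch then completes that packet with its first descriptor.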
+
+ An example program each for Rx and Tx multi-buffer support can be found
+ later in this document.
+
Usage
- =====
+ -----

In order to use AF_XDP sockets two parts are needed. The
user-space application and the XDP program. For a complete setup and
@@ -541,6 +625,131 @@ like this:
But please use the libbpf functions as they are optimized and ready to
use. Will make your life easier.

+ Usage Multi-Buffer Rx
+ ---------------------
+
+ Here is a simple Rx path pseudo-code example (using libxdp interfaces
+ for simplicity). Error paths have been excluded to keep it short:
+
+ .. code-block :: c
+
+     void rx_packets(struct xsk_socket_info *xsk)
+     {
+         /* Reassembly state is static because a packet may span batches. */
+         static bool new_packet = true;
+         u32 idx_rx = 0, idx_fq = 0;
+         static char *pkt;
+
+         int rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx);
+
+         xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
+
+         for (int i = 0; i < rcvd; i++) {
+             const struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++);
+             char *frag = xsk_umem__get_data(xsk->umem->buffer, desc->addr);
+             /* XDP_PKT_CONTD clear means this is the packet's last frame. */
+             bool eop = !(desc->options & XDP_PKT_CONTD);
+
+             if (new_packet)
+                 pkt = frag;
+             else
+                 add_frag_to_pkt(pkt, frag);
+
+             if (eop)
+                 process_pkt(pkt);
+
+             new_packet = eop;
+
+             /* Hand the frame back to the fill ring. */
+             *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr;
+         }
+
+         xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
+         xsk_ring_cons__release(&xsk->rx, rcvd);
+     }
+
+ Usage Multi-Buffer Tx
+ ---------------------
+
+ Here is a Tx path pseudo-code example (using libxdp interfaces for
+ simplicity). It ignores that the umem is finite in size and that we
+ will eventually run out of packets to send. It also assumes that
+ pkts.addr points to a valid location in the umem.
+
+ .. code-block :: c
+
+     void tx_packets(struct xsk_socket_info *xsk, struct pkt *pkts,
+                     int batch_size)
+     {
+         u32 idx, i, pkt_nb = 0;
+
+         xsk_ring_prod__reserve(&xsk->tx, batch_size, &idx);
+
+         for (i = 0; i < batch_size;) {
+             u64 addr = pkts[pkt_nb].addr;
+             u32 len = pkts[pkt_nb].size;
+
+             /* Emit one descriptor per frame of the current packet. */
+             do {
+                 struct xdp_desc *tx_desc;
+
+                 tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx + i++);
+                 tx_desc->addr = addr;
+
+                 if (len > xsk_frame_size) {
+                     /* More frames follow for this packet. */
+                     tx_desc->len = xsk_frame_size;
+                     tx_desc->options = XDP_PKT_CONTD;
+                 } else {
+                     /* Last frame of the packet. */
+                     tx_desc->len = len;
+                     tx_desc->options = 0;
+                     pkt_nb++;
+                 }
+                 len -= tx_desc->len;
+                 addr += xsk_frame_size;
+
+                 if (i == batch_size) {
+                     /* Remember len, addr, pkt_nb for next iteration.
+                      * Skipped for simplicity.
+                      */
+                     break;
+                 }
+             } while (len);
+         }
+
+         xsk_ring_prod__submit(&xsk->tx, i);
+     }
+
+ Probing for Multi-Buffer Support
+ --------------------------------
+
+ To discover if a driver supports multi-buffer AF_XDP in SKB or DRV
+ mode, use the XDP_FEATURES feature of netlink in linux/netdev.h to
+ query for NETDEV_XDP_ACT_RX_SG support. This is the same flag as for
+ querying for XDP multi-buffer support. If XDP supports multi-buffer in
+ a driver, then AF_XDP will also support that in SKB and DRV mode.
+
+ To discover if a driver supports multi-buffer AF_XDP in zero-copy
+ mode, use XDP_FEATURES and first check the NETDEV_XDP_ACT_XSK_ZEROCOPY
+ flag. If it is set, zero-copy is supported and you should then check
+ the netlink attribute NETDEV_A_DEV_XDP_ZC_MAX_SEGS in linux/netdev.h.
+ An unsigned integer value will be returned stating the max number of
+ frags that are supported by this device in zero-copy mode. These are
+ the possible return values:
+
+ 1: Multi-buffer for zero-copy is not supported by this device, since
+    a maximum of one fragment means that multi-buffer is not possible.
+
+ >=2: Multi-buffer is supported in zero-copy mode for this device. The
+      returned number signifies the max number of frags supported.
+
+ For an example of how these are used through libbpf, please take a
+ look at tools/testing/selftests/bpf/xskxceiver.c.
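
Once the feature bits and the max-segments value have been fetched, interpreting them is straightforward. The sketch below assumes the caller has already obtained both values over netlink. The two bit definitions are local mirrors of NETDEV_XDP_ACT_XSK_ZEROCOPY and NETDEV_XDP_ACT_RX_SG from linux/netdev.h, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local mirrors of enum netdev_xdp_act bits in <linux/netdev.h>. */
#define ACT_XSK_ZEROCOPY (1ull << 3) /* NETDEV_XDP_ACT_XSK_ZEROCOPY */
#define ACT_RX_SG        (1ull << 5) /* NETDEV_XDP_ACT_RX_SG */

static_assert(ACT_XSK_ZEROCOPY == 8 && ACT_RX_SG == 32,
              "bit values expected to match the uapi header");

/* Multi-buffer AF_XDP in SKB or DRV (copy) mode. */
bool copy_mode_multibuf(uint64_t xdp_features)
{
    return (xdp_features & ACT_RX_SG) != 0;
}

/* Max frags per packet in zero-copy mode, or 0 if multi-buffer
 * zero-copy is not available on this device. */
uint32_t zc_multibuf_max_frags(uint64_t xdp_features,
                               uint32_t zc_max_segs)
{
    if (!(xdp_features & ACT_XSK_ZEROCOPY))
        return 0; /* no zero-copy support at all */
    if (zc_max_segs < 2)
        return 0; /* zero-copy, but single-buffer only */
    return zc_max_segs;
}
```

An application could use the value returned by `zc_multibuf_max_frags()` as the frame limit when producing Tx descriptors in zero-copy mode.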
+
+ Multi-Buffer Support for Zero-Copy Drivers
+ ------------------------------------------
+
+ Zero-copy drivers usually use the batched APIs for Rx and Tx
+ processing. Note that the Tx batch API guarantees that it will provide
+ a batch of Tx descriptors that ends with a full packet. This is to
+ facilitate extending a zero-copy driver with multi-buffer support.
+
Sample application
==================
