Commit e93165d

Alexei Starovoitov says:

====================
pull-request: bpf-next 2023-07-19

We've added 45 non-merge commits during the last 3 day(s) which contain
a total of 71 files changed, 7808 insertions(+), 592 deletions(-).

The main changes are:

1) multi-buffer support in AF_XDP, from Maciej Fijalkowski,
   Magnus Karlsson, Tirthendu Sarkar.

2) BPF link support for tc BPF programs, from Daniel Borkmann.

3) Enable bpf_map_sum_elem_count kfunc for all program types,
   from Anton Protopopov.

4) Add 'owner' field to bpf_rb_node to fix races in shared ownership,
   Dave Marchevsky.

5) Prevent potential skb_header_pointer() misuse, from Alexei Starovoitov.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (45 commits)
  bpf, net: Introduce skb_pointer_if_linear().
  bpf: sync tools/ uapi header with
  selftests/bpf: Add mprog API tests for BPF tcx links
  selftests/bpf: Add mprog API tests for BPF tcx opts
  bpftool: Extend net dump with tcx progs
  libbpf: Add helper macro to clear opts structs
  libbpf: Add link-based API for tcx
  libbpf: Add opts-based attach/detach/query API for tcx
  bpf: Add fd-based tcx multi-prog infra with link support
  bpf: Add generic attach/detach/query API for multi-progs
  selftests/xsk: reset NIC settings to default after running test suite
  selftests/xsk: add test for too many frags
  selftests/xsk: add metadata copy test for multi-buff
  selftests/xsk: add invalid descriptor test for multi-buffer
  selftests/xsk: add unaligned mode test for multi-buffer
  selftests/xsk: add basic multi-buffer test
  selftests/xsk: transmit and receive multi-buffer packets
  xsk: add multi-buffer documentation
  i40e: xsk: add TX multi-buffer support
  ice: xsk: Tx multi-buffer support
  ...
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
2 parents 97083c2 + 6f5a630 commit e93165d

71 files changed, 7808 insertions(+), 592 deletions(-)

Documentation/netlink/specs/netdev.yaml

Lines changed: 6 additions & 0 deletions
@@ -62,6 +62,12 @@ attribute-sets:
         type: u64
         enum: xdp-act
         enum-as-flags: true
+      -
+        name: xdp_zc_max_segs
+        doc: max fragment count supported by ZC driver
+        type: u32
+        checks:
+          min: 1
 
 operations:
   list:

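Editorial sketch (not part of the commit): the `xdp_zc_max_segs` attribute above has a minimum of 1, and the af_xdp.rst text in this commit says that 1 means multi-buffer is not possible in zero-copy mode while 2 or more is the maximum number of frags. The C fragment below shows one way an application might act on a value it has already read. The helper names `zc_multibuf_supported`, `zc_frags_needed` and `zc_packet_fits` are hypothetical, not kernel or libbpf APIs, and the netlink query itself is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when the reported xdp_zc_max_segs value
 * allows a packet to span more than one buffer in zero-copy mode.
 */
static bool zc_multibuf_supported(uint32_t zc_max_segs)
{
	return zc_max_segs >= 2;
}

/* Hypothetical helper: number of frames needed to carry pkt_len bytes
 * when each frame holds at most frame_size bytes. The division rounds
 * up; 64-bit arithmetic avoids overflow in the addition.
 */
static uint32_t zc_frags_needed(uint32_t pkt_len, uint32_t frame_size)
{
	assert(frame_size > 0);
	return (uint32_t)(((uint64_t)pkt_len + frame_size - 1) / frame_size);
}

/* Hypothetical helper: true when a packet of pkt_len bytes can be sent
 * without exceeding the fragment limit reported by the device.
 * Zero-length packets are rejected, since the documentation treats
 * zero-length descriptors as invalid.
 */
static bool zc_packet_fits(uint32_t pkt_len, uint32_t frame_size,
			   uint32_t zc_max_segs)
{
	if (pkt_len == 0 || frame_size == 0)
		return false;
	return zc_frags_needed(pkt_len, frame_size) <= zc_max_segs;
}
```

With 4K frames, a 9000-byte jumbo frame needs three frags, so it would fit on a device reporting 3 or more and would not fit on one reporting 2.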
Documentation/networking/af_xdp.rst

Lines changed: 210 additions & 1 deletion
@@ -462,8 +462,92 @@ XDP_OPTIONS getsockopt
 Gets options from an XDP socket. The only one supported so far is
 XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.
 
+Multi-Buffer Support
+====================
+
+With multi-buffer support, programs using AF_XDP sockets can receive
+and transmit packets consisting of multiple buffers both in copy and
+zero-copy mode. For example, a packet can consist of two
+frames/buffers, one with the header and the other one with the data,
+or a 9K Ethernet jumbo frame can be constructed by chaining together
+three 4K frames.
+
+Some definitions:
+
+* A packet consists of one or more frames
+
+* A descriptor in one of the AF_XDP rings always refers to a single
+  frame. In the case the packet consists of a single frame, the
+  descriptor refers to the whole packet.
+
+To enable multi-buffer support for an AF_XDP socket, use the new bind
+flag XDP_USE_SG. If this is not provided, all multi-buffer packets
+will be dropped just as before. Note that the XDP program loaded also
+needs to be in multi-buffer mode. This can be accomplished by using
+"xdp.frags" as the section name of the XDP program used.
+
+To represent a packet consisting of multiple frames, a new flag called
+XDP_PKT_CONTD is introduced in the options field of the Rx and Tx
+descriptors. If it is true (1) the packet continues with the next
+descriptor and if it is false (0) it means this is the last descriptor
+of the packet. Why the reverse logic of end-of-packet (eop) flag found
+in many NICs? Just to preserve compatibility with non-multi-buffer
+applications that have this bit set to false for all packets on Rx,
+and the apps set the options field to zero for Tx, as anything else
+will be treated as an invalid descriptor.
+
+These are the semantics for producing packets onto AF_XDP Tx ring
+consisting of multiple frames:
+
+* When an invalid descriptor is found, all the other
+  descriptors/frames of this packet are marked as invalid and not
+  completed. The next descriptor is treated as the start of a new
+  packet, even if this was not the intent (because we cannot guess
+  the intent). As before, if your program is producing invalid
+  descriptors you have a bug that must be fixed.
+
+* Zero length descriptors are treated as invalid descriptors.
+
+* For copy mode, the maximum supported number of frames in a packet is
+  equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all
+  descriptors accumulated so far are dropped and treated as
+  invalid. To produce an application that will work on any system
+  regardless of this config setting, limit the number of frags to 18,
+  as the minimum value of the config is 17.
+
+* For zero-copy mode, the limit is up to what the NIC HW
+  supports. Usually at least five on the NICs we have checked. We
+  consciously chose to not enforce a rigid limit (such as
+  CONFIG_MAX_SKB_FRAGS + 1) for zero-copy mode, as it would have
+  resulted in copy actions under the hood to fit into what limit the
+  NIC supports. Kind of defeats the purpose of zero-copy mode. How to
+  probe for this limit is explained in the "probe for multi-buffer
+  support" section.
+
+On the Rx path in copy-mode, the xsk core copies the XDP data into
+multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as
+detailed before. Zero-copy mode works the same, though the data is not
+copied. When the application gets a descriptor with the XDP_PKT_CONTD
+flag set to one, it means that the packet consists of multiple buffers
+and it continues with the next buffer in the following
+descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it
+means that this is the last buffer of the packet. AF_XDP guarantees
+that only a complete packet (all frames in the packet) is sent to the
+application. If there is not enough space in the AF_XDP Rx ring, all
+frames of the packet will be dropped.
+
+If application reads a batch of descriptors, using for example the libxdp
+interfaces, it is not guaranteed that the batch will end with a full
+packet. It might end in the middle of a packet and the rest of the
+buffers of that packet will arrive at the beginning of the next batch,
+since the libxdp interface does not read the whole ring (unless you
+have an enormous batch size or a very small ring size).
+
+An example program each for Rx and Tx multi-buffer support can be found
+later in this document.
+
 Usage
-=====
+-----
 
 In order to use AF_XDP sockets two parts are needed. The
 user-space application and the XDP program. For a complete setup and
@@ -541,6 +625,131 @@ like this:
 But please use the libbpf functions as they are optimized and ready to
 use. Will make your life easier.
 
+Usage Multi-Buffer Rx
+---------------------
+
+Here is a simple Rx path pseudo-code example (using libxdp interfaces
+for simplicity). Error paths have been excluded to keep it short:
+
+.. code-block:: c
+
+    void rx_packets(struct xsk_socket_info *xsk)
+    {
+        static bool new_packet = true;
+        u32 idx_rx = 0, idx_fq = 0;
+        static char *pkt;
+
+        int rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx);
+
+        xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
+
+        for (int i = 0; i < rcvd; i++) {
+            struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++);
+            char *frag = xsk_umem__get_data(xsk->umem->buffer, desc->addr);
+            bool eop = !(desc->options & XDP_PKT_CONTD);
+
+            if (new_packet)
+                pkt = frag;
+            else
+                add_frag_to_pkt(pkt, frag);
+
+            if (eop)
+                process_pkt(pkt);
+
+            new_packet = eop;
+
+            *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr;
+        }
+
+        xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
+        xsk_ring_cons__release(&xsk->rx, rcvd);
+    }
+
+Usage Multi-Buffer Tx
+---------------------
+
+Here is an example Tx path pseudo-code (using libxdp interfaces for
+simplicity) ignoring that the umem is finite in size, and that we
+eventually will run out of packets to send. Also assumes pkts.addr
+points to a valid location in the umem.
+
+.. code-block:: c
+
+    void tx_packets(struct xsk_socket_info *xsk, struct pkt *pkts,
+                    int batch_size)
+    {
+        u32 idx, i, pkt_nb = 0;
+
+        xsk_ring_prod__reserve(&xsk->tx, batch_size, &idx);
+
+        for (i = 0; i < batch_size;) {
+            u64 addr = pkts[pkt_nb].addr;
+            u32 len = pkts[pkt_nb].size;
+
+            do {
+                struct xdp_desc *tx_desc;
+
+                tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx + i++);
+                tx_desc->addr = addr;
+
+                if (len > xsk_frame_size) {
+                    tx_desc->len = xsk_frame_size;
+                    tx_desc->options = XDP_PKT_CONTD;
+                } else {
+                    tx_desc->len = len;
+                    tx_desc->options = 0;
+                    pkt_nb++;
+                }
+                len -= tx_desc->len;
+                addr += xsk_frame_size;
+
+                if (i == batch_size) {
+                    /* Remember len, addr, pkt_nb for next iteration.
+                     * Skipped for simplicity.
+                     */
+                    break;
+                }
+            } while (len);
+        }
+
+        xsk_ring_prod__submit(&xsk->tx, i);
+    }
+
+Probing for Multi-Buffer Support
+--------------------------------
+
+To discover if a driver supports multi-buffer AF_XDP in SKB or DRV
+mode, use the XDP_FEATURES feature of netlink in linux/netdev.h to
+query for NETDEV_XDP_ACT_RX_SG support. This is the same flag as for
+querying for XDP multi-buffer support. If XDP supports multi-buffer in
+a driver, then AF_XDP will also support that in SKB and DRV mode.
+
+To discover if a driver supports multi-buffer AF_XDP in zero-copy
+mode, use XDP_FEATURES and first check the NETDEV_XDP_ACT_XSK_ZEROCOPY
+flag. If it is set, it means that at least zero-copy is supported and
+you should go and check the netlink attribute
+NETDEV_A_DEV_XDP_ZC_MAX_SEGS in linux/netdev.h. An unsigned integer
+value will be returned stating the max number of frags that are
+supported by this device in zero-copy mode. These are the possible
+return values:
+
+1: Multi-buffer for zero-copy is not supported by this device, as max
+   one fragment supported means that multi-buffer is not possible.
+
+>=2: Multi-buffer is supported in zero-copy mode for this device. The
+     returned number signifies the max number of frags supported.
+
+For an example on how these are used through libbpf, please take a
+look at tools/testing/selftests/bpf/xskxceiver.c.
+
+Multi-Buffer Support for Zero-Copy Drivers
+------------------------------------------
+
+Zero-copy drivers usually use the batched APIs for Rx and Tx
+processing. Note that the Tx batch API guarantees that it will provide
+a batch of Tx descriptors that ends with full packet at the end. This
+to facilitate extending a zero-copy driver with multi-buffer support.
+
 Sample application
 ==================
 

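Editorial sketch (not part of the commit): the XDP_PKT_CONTD rules documented above can be exercised without a socket. The C fragment below splits one packet into frame-sized descriptors the way the Tx pseudo-code does, then walks descriptors in batches the way the Rx pseudo-code does, carrying state across a batch that ends mid-packet. `struct desc` is a local stand-in for `struct xdp_desc`, the flag value is assumed to match include/uapi/linux/if_xdp.h, and `split_packet`, `walk_batch` and `struct pkt_state` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match the uapi header; defined locally so the sketch
 * also builds against headers that predate multi-buffer support.
 */
#ifndef XDP_PKT_CONTD
#define XDP_PKT_CONTD (1 << 0)
#endif

/* Local stand-in for struct xdp_desc (addr, len, options). */
struct desc {
	uint64_t addr;
	uint32_t len;
	uint32_t options;
};

/* Hypothetical Rx state that must survive from one batch to the next. */
struct pkt_state {
	bool in_packet;	/* previous descriptor had XDP_PKT_CONTD set */
	uint32_t bytes;	/* bytes gathered so far for the current packet */
};

/* Split a packet of len bytes at addr into frame-sized descriptors.
 * Every descriptor except the last gets XDP_PKT_CONTD. Returns the
 * number of descriptors written, or 0 if the packet is empty or does
 * not fit in max descriptors.
 */
static size_t split_packet(uint64_t addr, uint32_t len, uint32_t frame_size,
			   struct desc *out, size_t max)
{
	size_t n = 0;

	assert(frame_size > 0);
	while (len > 0) {
		bool more = len > frame_size;

		if (n == max)
			return 0;
		out[n].addr = addr;
		out[n].len = more ? frame_size : len;
		out[n].options = more ? XDP_PKT_CONTD : 0;
		len -= out[n].len;
		addr += frame_size;
		n++;
	}
	return n;
}

/* Walk one batch of descriptors. Returns how many packets were
 * completed in this batch; *last_len receives the total length of the
 * most recently completed packet.
 */
static size_t walk_batch(struct pkt_state *st, const struct desc *descs,
			 size_t n, uint32_t *last_len)
{
	size_t done = 0;

	for (size_t i = 0; i < n; i++) {
		if (!st->in_packet)
			st->bytes = 0;	/* first frame of a new packet */
		st->bytes += descs[i].len;
		st->in_packet = (descs[i].options & XDP_PKT_CONTD) != 0;
		if (!st->in_packet) {
			*last_len = st->bytes;
			done++;
		}
	}
	return done;
}
```

Keeping the reassembly state in a struct owned by the caller, rather than in function-local statics as the pseudo-code does, makes it possible to track several sockets independently.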
MAINTAINERS

Lines changed: 4 additions & 1 deletion
@@ -3684,6 +3684,7 @@ F: include/linux/filter.h
 F: include/linux/tnum.h
 F: kernel/bpf/core.c
 F: kernel/bpf/dispatcher.c
+F: kernel/bpf/mprog.c
 F: kernel/bpf/syscall.c
 F: kernel/bpf/tnum.c
 F: kernel/bpf/trampoline.c
@@ -3777,13 +3778,15 @@ L: [email protected]
 S: Maintained
 F: kernel/bpf/bpf_struct*
 
-BPF [NETWORKING] (tc BPF, sock_addr)
+BPF [NETWORKING] (tcx & tc BPF, sock_addr)
 M: Martin KaFai Lau <[email protected]>
 M: Daniel Borkmann <[email protected]>
 R: John Fastabend <[email protected]>
 
 
 S: Maintained
+F: include/net/tcx.h
+F: kernel/bpf/tcx.c
 F: net/core/filter.c
 F: net/sched/act_bpf.c
 F: net/sched/cls_bpf.c

arch/x86/net/bpf_jit_comp.c

Lines changed: 1 addition & 1 deletion
@@ -1925,7 +1925,7 @@ static int get_nr_used_regs(const struct btf_func_model *m)
 static void save_args(const struct btf_func_model *m, u8 **prog,
 		      int stack_size, bool for_call_origin)
 {
-	int arg_regs, first_off, nr_regs = 0, nr_stack_slots = 0;
+	int arg_regs, first_off = 0, nr_regs = 0, nr_stack_slots = 0;
 	int i, j;
 
 	/* Store function arguments to stack.

drivers/net/ethernet/intel/i40e/i40e_main.c

Lines changed: 1 addition & 5 deletions
@@ -3585,11 +3585,6 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	if (ring->xsk_pool) {
 		ring->rx_buf_len =
 		  xsk_pool_get_rx_frame_size(ring->xsk_pool);
-		/* For AF_XDP ZC, we disallow packets to span on
-		 * multiple buffers, thus letting us skip that
-		 * handling in the fast-path.
-		 */
-		chain_len = 1;
 		ret = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
 						 MEM_TYPE_XSK_BUFF_POOL,
 						 NULL);
@@ -13822,6 +13817,7 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
 				       NETDEV_XDP_ACT_REDIRECT |
 				       NETDEV_XDP_ACT_XSK_ZEROCOPY |
 				       NETDEV_XDP_ACT_RX_SG;
+		netdev->xdp_zc_max_segs = I40E_MAX_BUFFER_TXD;
 	} else {
 		/* Relate the VSI_VMDQ name to the VSI_MAIN name. Note that we
 		 * are still limited by IFNAMSIZ, but we're adding 'v%d\0' to

drivers/net/ethernet/intel/i40e/i40e_txrx.c

Lines changed: 2 additions & 2 deletions
@@ -2284,8 +2284,8 @@ static struct sk_buff *i40e_build_skb(struct i40e_ring *rx_ring,
  * If the buffer is an EOP buffer, this function exits returning false,
  * otherwise return true indicating that this is in fact a non-EOP buffer.
  */
-static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
-			    union i40e_rx_desc *rx_desc)
+bool i40e_is_non_eop(struct i40e_ring *rx_ring,
+		     union i40e_rx_desc *rx_desc)
 {
 	/* if we are the last buffer then there is nothing else to do */
 #define I40E_RXD_EOF BIT(I40E_RX_DESC_STATUS_EOF_SHIFT)

drivers/net/ethernet/intel/i40e/i40e_txrx.h

Lines changed: 2 additions & 0 deletions
@@ -473,6 +473,8 @@ int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size);
 bool __i40e_chk_linearize(struct sk_buff *skb);
 int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
 		  u32 flags);
+bool i40e_is_non_eop(struct i40e_ring *rx_ring,
+		     union i40e_rx_desc *rx_desc);
 
 /**
  * i40e_get_head - Retrieve head from head writeback
