| 1 | +.. SPDX-License-Identifier: GPL-2.0 |
| 2 | + |
| 3 | +====== |
| 4 | +AF_XDP |
| 5 | +====== |
| 6 | + |
| 7 | +Overview |
| 8 | +======== |
| 9 | + |
| 10 | +AF_XDP is an address family that is optimized for high performance |
| 11 | +packet processing. |
| 12 | + |
| 13 | +This document assumes that the reader is familiar with BPF and XDP. If |
| 14 | +not, the Cilium project has an excellent reference guide at |
| 15 | +http://cilium.readthedocs.io/en/doc-1.0/bpf/. |
| 16 | + |
| 17 | +Using the XDP_REDIRECT action from an XDP program, the program can |
| 18 | +redirect ingress frames to other XDP enabled netdevs, using the |
| 19 | +bpf_redirect_map() function. AF_XDP sockets enable the possibility for |
| 20 | +XDP programs to redirect frames to a memory buffer in a user-space |
| 21 | +application. |
| 22 | + |
| 23 | +An AF_XDP socket (XSK) is created with the normal socket() |
| 24 | +syscall. Associated with each XSK are two rings: the RX ring and the |
| 25 | +TX ring. A socket can receive packets on the RX ring and it can send |
| 26 | +packets on the TX ring. These rings are registered and sized with the |
| 27 | +setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is mandatory |
| 28 | +to have at least one of these rings for each socket. An RX or TX |
| 29 | +descriptor ring points to a data buffer in a memory area called a |
| 30 | +UMEM. RX and TX can share the same UMEM so that a packet does not have |
| 31 | +to be copied between RX and TX. Moreover, if a packet needs to be kept |
| 32 | +for a while due to a possible retransmit, the descriptor that points |
| 33 | +to that packet can be changed to point to another and reused right |
| 34 | +away. This again avoids copying data. |
| 35 | + |
| 36 | +The UMEM consists of a number of equally sized frames and each frame |
| 37 | +has a unique frame id. A descriptor in one of the rings references a |
| 38 | +frame by referencing its frame id. The user space allocates memory for |
| 39 | +this UMEM using whatever means it feels is most appropriate (malloc, |
| 40 | +mmap, huge pages, etc). This memory area is then registered with the |
| 41 | +kernel using the new setsockopt XDP_UMEM_REG. The UMEM also has two |
| 42 | +rings: the FILL ring and the COMPLETION ring. The fill ring is used by |
| 43 | +the application to send down frame ids for the kernel to fill in with |
| 44 | +RX packet data. References to these frames will then appear in the RX |
| 45 | +ring once each packet has been received. The completion ring, on the |
| 46 | +other hand, contains frame ids that the kernel has transmitted |
| 47 | +completely and can now be used again by user space, for either TX or |
| 48 | +RX. Thus, the frame ids appearing in the completion ring are ids that |
| 49 | +were previously transmitted using the TX ring. In summary, the RX and |
| 50 | +FILL rings are used for the RX path and the TX and COMPLETION rings |
| 51 | +are used for the TX path. |
| 52 | + |
| 53 | +The socket is then finally bound with a bind() call to a device and a |
| 54 | +specific queue id on that device, and it is not until bind is |
| 55 | +completed that traffic starts to flow. |
| 56 | + |
| 57 | +The UMEM can be shared between processes, if desired. If a process |
| 58 | +wants to do this, it simply skips the registration of the UMEM and its |
| 59 | +corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind |
| 60 | +call and submits the XSK of the process it would like to share UMEM |
| 61 | +with as well as its own newly created XSK socket. The new process will |
| 62 | +then receive frame id references in its own RX ring that point to this |
| 63 | +shared UMEM. Note that since the ring structures are single-consumer / |
| 64 | +single-producer (for performance reasons), the new process has to |
| 65 | +create its own socket with associated RX and TX rings, since it cannot |
| 66 | +share this with the other process. This is also the reason that there |
| 67 | +is only one set of FILL and COMPLETION rings per UMEM. It is the |
| 68 | +responsibility of a single process to handle the UMEM. |
| 69 | + |
| 70 | +How are packets then distributed from an XDP program to the XSKs? There |
| 71 | +is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The |
| 72 | +user-space application can place an XSK at an arbitrary place in this |
| 73 | +map. The XDP program can then redirect a packet to a specific index in |
| 74 | +this map and at this point XDP validates that the XSK in that map was |
| 75 | +indeed bound to that device and ring number. If not, the packet is |
| 76 | +dropped. If the map is empty at that index, the packet is also |
| 77 | +dropped. This also means that it is currently mandatory to have an XDP |
| 78 | +program loaded (and one XSK in the XSKMAP) to be able to get any |
| 79 | +traffic to user space through the XSK. |
| 80 | + |
| 81 | +AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the |
| 82 | +driver does not have support for XDP, or XDP_SKB is explicitly chosen |
| 83 | +when loading the XDP program, XDP_SKB mode is employed, which uses SKBs |
| 84 | +together with the generic XDP support and copies out the data to user |
| 85 | +space. This is a fallback mode that works for any network device. On the |
| 86 | +other hand, if the driver has support for XDP, it will be used by the |
| 87 | +AF_XDP code to provide better performance, but there is still a copy of |
| 88 | +the data into user space. |
| 89 | + |
| 90 | +Concepts |
| 91 | +======== |
| 92 | + |
| 93 | +In order to use an AF_XDP socket, a number of associated objects need |
| 94 | +to be set up. |
| 95 | + |
| 96 | +Jonathan Corbet has also written an excellent article on LWN, |
| 97 | +"Accelerating networking with AF_XDP". It can be found at |
| 98 | +https://lwn.net/Articles/750845/. |
| 99 | + |
| 100 | +UMEM |
| 101 | +---- |
| 102 | + |
| 103 | +UMEM is a region of virtually contiguous memory, divided into |
| 104 | +equal-sized frames. A UMEM is associated with a netdev and a specific |
| 105 | +queue id of that netdev. It is created and configured (frame size, |
| 106 | +frame headroom, start address and size) by using the XDP_UMEM_REG |
| 107 | +setsockopt system call. A UMEM is bound to a netdev and queue id via |
| 108 | +the bind() system call. |
| 109 | + |
| 110 | +An AF_XDP socket is linked to a single UMEM, but one UMEM can have |
| 111 | +multiple AF_XDP sockets. To share a UMEM created via one socket A, |
| 112 | +the next socket B can do this by setting the XDP_SHARED_UMEM flag in |
| 113 | +struct sockaddr_xdp member sxdp_flags, and passing the file descriptor |
| 114 | +of A to struct sockaddr_xdp member sxdp_shared_umem_fd. |
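The sharing step described above might be sketched as follows. This is a minimal illustration, not the sample application's code: the helper name shared_umem_addr() and its parameters are hypothetical, and it assumes the kernel UAPI header linux/if_xdp.h is installed. Only the struct sockaddr_xdp members and the XDP_SHARED_UMEM flag come from the text.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef AF_XDP
#define AF_XDP 44 /* Linux value; older libc headers may not define it */
#endif

/*
 * Hypothetical helper: build the bind() address for socket B so that it
 * reuses the UMEM registered through socket A (umem_owner_fd).
 */
static struct sockaddr_xdp shared_umem_addr(unsigned int ifindex,
                                            unsigned int queue_id,
                                            int umem_owner_fd)
{
    struct sockaddr_xdp sxdp;

    assert(umem_owner_fd >= 0); /* A must be an open AF_XDP socket */

    memset(&sxdp, 0, sizeof(sxdp));
    sxdp.sxdp_family = AF_XDP;
    sxdp.sxdp_ifindex = ifindex;
    sxdp.sxdp_queue_id = queue_id;
    sxdp.sxdp_flags = XDP_SHARED_UMEM;        /* skip own UMEM registration */
    sxdp.sxdp_shared_umem_fd = umem_owner_fd; /* file descriptor of socket A */
    return sxdp;
}
```

Socket B would then pass the returned structure to bind(), cast to struct sockaddr *, after creating its own RX and/or TX rings.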
| 115 | + |
| 116 | +The UMEM has two single-producer/single-consumer rings that are used |
| 117 | +to transfer ownership of UMEM frames between the kernel and the |
| 118 | +user-space application. |
| 119 | + |
| 120 | +Rings |
| 121 | +----- |
| 122 | + |
| 123 | +There are four different kinds of rings: Fill, Completion, RX and |
| 124 | +TX. All rings are single-producer/single-consumer, so the user-space |
| 125 | +application needs explicit synchronization if multiple |
| 126 | +processes/threads are reading/writing to them. |
| 127 | + |
| 128 | +The UMEM uses two rings: Fill and Completion. Each socket associated |
| 129 | +with the UMEM must have an RX queue, TX queue or both. Say that there |
| 130 | +is a setup with four sockets (all doing TX and RX). Then there will be |
| 131 | +one Fill ring, one Completion ring, four TX rings and four RX rings. |
| 132 | + |
| 133 | +The rings are head(producer)/tail(consumer) based rings. A producer |
| 134 | +writes the data ring at the index pointed out by the struct xdp_ring |
| 135 | +producer member, and then increases the producer index. A consumer reads |
| 136 | +the data ring at the index pointed out by the struct xdp_ring consumer |
| 137 | +member, and then increases the consumer index. |
| 138 | + |
| 139 | +The rings are configured and created via the _RING setsockopt system |
| 140 | +calls and mmapped to user-space using the appropriate offset to mmap() |
| 141 | +(XDP_PGOFF_RX_RING, XDP_PGOFF_TX_RING, XDP_UMEM_PGOFF_FILL_RING and |
| 142 | +XDP_UMEM_PGOFF_COMPLETION_RING). |
| 143 | + |
| 144 | +The size of each ring must be a power of two. |
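One practical consequence of the power-of-two requirement is that a free-running 32-bit index can be mapped to a ring slot with a single mask, as the naive routines in the Usage section do. The sketch below only illustrates that arithmetic; the function names is_power_of_two() and ring_slot() are hypothetical and not part of the AF_XDP API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if n has exactly one bit set, i.e. n is 1, 2, 4, 8, ... */
static bool is_power_of_two(uint32_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Map a free-running producer or consumer index to a slot in the ring.
 * The mask is only equivalent to "index % ring_size" when ring_size is
 * a power of two, which is why the size restriction matters here.
 */
static uint32_t ring_slot(uint32_t index, uint32_t ring_size)
{
    assert(is_power_of_two(ring_size));
    return index & (ring_size - 1);
}
```

Because the indices are unsigned, the difference between producer and consumer still gives the number of used entries after the indices wrap around.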
| 145 | + |
| 146 | +UMEM Fill Ring |
| 147 | +~~~~~~~~~~~~~~ |
| 148 | + |
| 149 | +The Fill ring is used to transfer ownership of UMEM frames from |
| 150 | +user-space to kernel-space. The UMEM indices are passed in the |
| 151 | +ring. As an example, if the UMEM is 64k and each frame is 4k, then the |
| 152 | +UMEM has 16 frames and can pass indices between 0 and 15. |
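The arithmetic in the example above can be written out as a small sketch. The helper names umem_frame_count() and umem_frame_addr() are hypothetical; they assume the UMEM is one contiguous user-space buffer divided into equal-sized frames, as described in the UMEM section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of frames in a UMEM of umem_size bytes. */
static uint32_t umem_frame_count(size_t umem_size, size_t frame_size)
{
    assert(frame_size != 0);
    return (uint32_t)(umem_size / frame_size);
}

/* User-space address of the frame with the given UMEM index. */
static void *umem_frame_addr(void *umem_base, size_t frame_size, uint32_t idx)
{
    return (char *)umem_base + (size_t)idx * frame_size;
}
```

With a 64k UMEM and 4k frames, umem_frame_count() yields 16, so the valid indices are 0 through 15 and the last frame starts 15 * 4096 bytes into the buffer.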
| 153 | + |
| 154 | +Frames passed to the kernel are used for the ingress path (RX rings). |
| 155 | + |
| 156 | +The user application produces UMEM indices to this ring. |
| 157 | + |
| 158 | +UMEM Completion Ring |
| 159 | +~~~~~~~~~~~~~~~~~~~~ |
| 160 | + |
| 161 | +The Completion Ring is used to transfer ownership of UMEM frames from |
| 162 | +kernel-space to user-space. Just like the Fill ring, UMEM indices are |
| 163 | +used. |
| 164 | + |
| 165 | +Frames passed from the kernel to user-space are frames that have been |
| 166 | +sent (TX ring) and can be used by user-space again. |
| 167 | + |
| 168 | +The user application consumes UMEM indices from this ring. |
| 169 | + |
| 170 | + |
| 171 | +RX Ring |
| 172 | +~~~~~~~ |
| 173 | + |
| 174 | +The RX ring is the receiving side of a socket. Each entry in the ring |
| 175 | +is a struct xdp_desc descriptor. The descriptor contains the UMEM index |
| 176 | +(idx), the length of the data (len), and the offset into the frame |
| 177 | +(offset). |
| 178 | + |
| 179 | +If no frames have been passed to the kernel via the Fill ring, no |
| 180 | +descriptors will (or can) appear on the RX ring. |
| 181 | + |
| 182 | +The user application consumes struct xdp_desc descriptors from this |
| 183 | +ring. |
| 184 | + |
| 185 | +TX Ring |
| 186 | +~~~~~~~ |
| 187 | + |
| 188 | +The TX ring is used to send frames. The struct xdp_desc descriptor is |
| 189 | +filled (index, length and offset) and passed into the ring. |
| 190 | + |
| 191 | +To start the transfer a sendmsg() system call is required. This might |
| 192 | +be relaxed in the future. |
| 193 | + |
| 194 | +The user application produces struct xdp_desc descriptors to this |
| 195 | +ring. |
| 196 | + |
| 197 | +XSKMAP / BPF_MAP_TYPE_XSKMAP |
| 198 | +---------------------------- |
| 199 | + |
| 200 | +On the XDP side there is a BPF map type BPF_MAP_TYPE_XSKMAP (XSKMAP) that |
| 201 | +is used in conjunction with bpf_redirect_map() to pass the ingress |
| 202 | +frame to a socket. |
| 203 | + |
| 204 | +The user application inserts the socket into the map, via the bpf() |
| 205 | +system call. |
| 206 | + |
| 207 | +Note that if an XDP program tries to redirect to a socket that does |
| 208 | +not match the queue configuration and netdev, the frame will be |
| 209 | +dropped. E.g. an AF_XDP socket is bound to netdev eth0 and |
| 210 | +queue 17. Only the XDP program executing for eth0 and queue 17 will |
| 211 | +successfully pass data to the socket. Please refer to the sample |
| 212 | +application (samples/bpf/) for an example. |
| 213 | + |
| 214 | +Usage |
| 215 | +===== |
| 216 | + |
| 217 | +In order to use AF_XDP sockets, two parts are needed: the |
| 218 | +user-space application and the XDP program. For a complete setup and |
| 219 | +usage example, please refer to the sample application. The user-space |
| 220 | +side is xdpsock_user.c and the XDP side xdpsock_kern.c. |
| 221 | + |
| 222 | +Naive ring dequeue and enqueue could look like this:: |
| 223 | + |
| 224 | +    // typedef struct xdp_rxtx_ring RING; |
| 225 | +    // typedef struct xdp_umem_ring RING; |
| 226 | + |
| 227 | +    // typedef struct xdp_desc RING_TYPE; |
| 228 | +    // typedef __u32 RING_TYPE; |
| 229 | + |
| 230 | +    int dequeue_one(RING *ring, RING_TYPE *item) |
| 231 | +    { |
| 232 | +        __u32 entries = ring->ptrs.producer - ring->ptrs.consumer; |
| 233 | + |
| 234 | +        if (entries == 0) |
| 235 | +            return -1; |
| 236 | + |
| 237 | +        // read-barrier! |
| 238 | + |
| 239 | +        *item = ring->desc[ring->ptrs.consumer & (RING_SIZE - 1)]; |
| 240 | +        ring->ptrs.consumer++; |
| 241 | +        return 0; |
| 242 | +    } |
| 243 | + |
| 244 | +    int enqueue_one(RING *ring, const RING_TYPE *item) |
| 245 | +    { |
| 246 | +        __u32 free_entries = RING_SIZE - (ring->ptrs.producer - ring->ptrs.consumer); |
| 247 | + |
| 248 | +        if (free_entries == 0) |
| 249 | +            return -1; |
| 250 | + |
| 251 | +        ring->desc[ring->ptrs.producer & (RING_SIZE - 1)] = *item; |
| 252 | + |
| 253 | +        // write-barrier! |
| 254 | + |
| 255 | +        ring->ptrs.producer++; |
| 256 | +        return 0; |
| 257 | +    } |
| 258 | + |
| 259 | + |
| 260 | +For a more optimized version, please refer to the sample application. |
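For readers who want something that compiles on its own, the naive routines above can be approximated with C11 atomics, where a release store and an acquire load stand in for the write and read barriers. This is only a sketch under stated assumptions: struct demo_ring, demo_enqueue() and demo_dequeue() are hypothetical names, the ring lives in ordinary process memory rather than in a kernel mapping, and the layout is not that of struct xdp_rxtx_ring or struct xdp_umem_ring.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8u /* illustrative size; must be a power of two */

static_assert((RING_SIZE & (RING_SIZE - 1)) == 0,
              "RING_SIZE must be a power of two");

/* Hypothetical stand-in for a ring of __u32 entries. */
struct demo_ring {
    _Atomic uint32_t producer; /* written only by the producer */
    _Atomic uint32_t consumer; /* written only by the consumer */
    uint32_t desc[RING_SIZE];
};

/* Producer side: returns 0 on success, -1 if the ring is full. */
static int demo_enqueue(struct demo_ring *r, uint32_t item)
{
    uint32_t prod = atomic_load_explicit(&r->producer, memory_order_relaxed);
    uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_acquire);

    if (prod - cons == RING_SIZE)
        return -1;

    r->desc[prod & (RING_SIZE - 1)] = item;

    /* Release: the entry is visible before the new producer index. */
    atomic_store_explicit(&r->producer, prod + 1, memory_order_release);
    return 0;
}

/* Consumer side: returns 0 on success, -1 if the ring is empty. */
static int demo_dequeue(struct demo_ring *r, uint32_t *item)
{
    uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_relaxed);
    /* Acquire: pairs with the release store in demo_enqueue(). */
    uint32_t prod = atomic_load_explicit(&r->producer, memory_order_acquire);

    if (prod == cons)
        return -1;

    *item = r->desc[cons & (RING_SIZE - 1)];

    /* Release: the slot is read before it is handed back. */
    atomic_store_explicit(&r->consumer, cons + 1, memory_order_release);
    return 0;
}
```

The sketch keeps the single-producer/single-consumer restriction: exactly one thread may call demo_enqueue() and exactly one may call demo_dequeue() on a given ring.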
| 261 | + |
| 262 | +Sample application |
| 263 | +================== |
| 264 | + |
| 265 | +There is an xdpsock benchmarking/test application included that |
| 266 | +demonstrates how to use AF_XDP sockets with both private and shared |
| 267 | +UMEMs. Say that you would like your UDP traffic from port 4242 to end |
| 268 | +up in queue 16, which we will enable AF_XDP on. Here, we use ethtool |
| 269 | +for this:: |
| 270 | + |
| 271 | + ethtool -N p3p2 rx-flow-hash udp4 fn |
| 272 | + ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \ |
| 273 | + action 16 |
| 274 | + |
| 275 | +Running the rxdrop benchmark in XDP_DRV mode can then be done |
| 276 | +using:: |
| 277 | + |
| 278 | + samples/bpf/xdpsock -i p3p2 -q 16 -r -N |
| 279 | + |
| 280 | +For XDP_SKB mode, use the switch "-S" instead of "-N" and all options |
| 281 | +can be displayed with "-h", as usual. |
| 282 | + |
| 283 | +Credits |
| 284 | +======= |
| 285 | + |
| 286 | +- Björn Töpel (AF_XDP core) |
| 287 | +- Magnus Karlsson (AF_XDP core) |
| 288 | +- Alexander Duyck |
| 289 | +- Alexei Starovoitov |
| 290 | +- Daniel Borkmann |
| 291 | +- Jesper Dangaard Brouer |
| 292 | +- John Fastabend |
| 293 | +- Jonathan Corbet (LWN coverage) |
| 294 | +- Michael S. Tsirkin |
| 295 | +- Qi Z Zhang |
| 296 | +- Willem de Bruijn |
| 297 | + |