Commit d1e9117

maryamtahhan authored and borkmann committed
bpf, docs: DEVMAPs and XDP_REDIRECT
Add documentation for BPF_MAP_TYPE_DEVMAP and BPF_MAP_TYPE_DEVMAP_HASH including kernel version introduced, usage and examples. Add documentation that describes XDP_REDIRECT.

Signed-off-by: Maryam Tahhan <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Toke Høiland-Jørgensen <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
1 parent f80e16b commit d1e9117

4 files changed: +310 -2 lines changed

Documentation/bpf/index.rst

Lines changed: 1 addition & 0 deletions
@@ -29,6 +29,7 @@ that goes into great technical depth about the BPF Architecture.
2929
clang-notes
3030
linux-notes
3131
other
32+
redirect
3233

3334
.. only:: subproject and html
3435

Documentation/bpf/map_devmap.rst

Lines changed: 222 additions & 0 deletions
@@ -0,0 +1,222 @@
1+
.. SPDX-License-Identifier: GPL-2.0-only
2+
.. Copyright (C) 2022 Red Hat, Inc.
3+
4+
=================================================
5+
BPF_MAP_TYPE_DEVMAP and BPF_MAP_TYPE_DEVMAP_HASH
6+
=================================================
7+
8+
.. note::
9+
- ``BPF_MAP_TYPE_DEVMAP`` was introduced in kernel version 4.14
10+
- ``BPF_MAP_TYPE_DEVMAP_HASH`` was introduced in kernel version 5.4
11+
12+
``BPF_MAP_TYPE_DEVMAP`` and ``BPF_MAP_TYPE_DEVMAP_HASH`` are BPF maps primarily
13+
used as backend maps for the XDP BPF helper call ``bpf_redirect_map()``.
14+
``BPF_MAP_TYPE_DEVMAP`` is backed by an array that uses the key as
15+
the index to look up a reference to a net device, while ``BPF_MAP_TYPE_DEVMAP_HASH``
16+
is backed by a hash table that uses a key to look up a reference to a net device.
17+
The user provides either <``key``/ ``ifindex``> or <``key``/ ``struct bpf_devmap_val``>
18+
pairs to update the maps with new net devices.
19+
20+
.. note::
21+
- The key to a hash map doesn't have to be an ``ifindex``.
22+
- While ``BPF_MAP_TYPE_DEVMAP_HASH`` allows for densely packing the net devices,
23+
it comes at the cost of hashing the key when performing a lookup.
24+
25+
The setup and packet enqueue/send code is shared between the two types of
26+
devmap; only the lookup and insertion are different.
27+
28+
Usage
29+
=====
30+
Kernel BPF
31+
----------
32+
.. c:function::
33+
long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
34+
35+
Redirect the packet to the endpoint referenced by ``map`` at index ``key``.
36+
For ``BPF_MAP_TYPE_DEVMAP`` and ``BPF_MAP_TYPE_DEVMAP_HASH`` this map contains
37+
references to net devices (for forwarding packets through other ports).
38+
39+
The lower two bits of ``flags`` are used as the return code if the map lookup
40+
fails. This is so that the return value can be one of the XDP program return
41+
codes up to ``XDP_TX``, as chosen by the caller. The higher bits of ``flags``
42+
can be set to ``BPF_F_BROADCAST`` or ``BPF_F_EXCLUDE_INGRESS`` as defined
43+
below.
44+
45+
With ``BPF_F_BROADCAST`` the packet will be broadcast to all the interfaces
46+
in the map. With ``BPF_F_EXCLUDE_INGRESS`` the ingress interface will be excluded
47+
from the broadcast.
48+
49+
.. note::
50+
- The key is ignored if ``BPF_F_BROADCAST`` is set.
51+
- The broadcast feature can also be used to implement multicast forwarding:
52+
simply create multiple DEVMAPs, each one corresponding to a single multicast group.
53+
54+
This helper will return ``XDP_REDIRECT`` on success, or the value of the two
55+
lower bits of the ``flags`` argument if the map lookup fails.
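
The return-value rule described above can be sketched as a small user-space model. This is only an illustration, not kernel code: the function name ``model_redirect_map_ret`` and the mask name ``FALLBACK_ACTION_MASK`` are hypothetical, the constants are local copies of values from the UAPI header ``<linux/bpf.h>``, and the model ignores the broadcast path and the kernel's validation of unknown flag bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the XDP return codes from <linux/bpf.h>, for illustration.
 * Real BPF programs should include the header instead of redefining these. */
enum { XDP_ABORTED = 0, XDP_DROP, XDP_PASS, XDP_TX, XDP_REDIRECT };

/* Flag bits accepted by bpf_redirect_map(), also copied from <linux/bpf.h>. */
#define BPF_F_BROADCAST       (1ULL << 3)
#define BPF_F_EXCLUDE_INGRESS (1ULL << 4)

/* Hypothetical name for "the lower two bits of flags". */
#define FALLBACK_ACTION_MASK  0x3ULL

/* The flag bits sit above the fallback action bits, so both can be combined. */
static_assert((BPF_F_BROADCAST & FALLBACK_ACTION_MASK) == 0,
	      "broadcast flag must not overlap the fallback action");
static_assert((BPF_F_EXCLUDE_INGRESS & FALLBACK_ACTION_MASK) == 0,
	      "exclude-ingress flag must not overlap the fallback action");

/* Simplified model: XDP_REDIRECT when the lookup succeeds, otherwise the
 * caller-chosen fallback action stored in the lower two bits of flags. */
static long model_redirect_map_ret(bool lookup_ok, uint64_t flags)
{
	if (lookup_ok)
		return XDP_REDIRECT;
	return (long)(flags & FALLBACK_ACTION_MASK);
}
```

Under this model, passing ``0`` as ``flags`` means a failed lookup yields ``XDP_ABORTED``, while passing ``XDP_PASS`` hands the packet to the regular network stack instead.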
56+
57+
More information about redirection can be found in :doc:`redirect`.
58+
59+
.. c:function::
60+
void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
61+
62+
Net device entries can be retrieved using the ``bpf_map_lookup_elem()``
63+
helper.
64+
65+
Userspace
66+
---------
67+
.. note::
68+
DEVMAP entries can only be updated/deleted from user space and not
69+
from an eBPF program. Trying to call these functions from a kernel eBPF
70+
program will result in the program failing to load and a verifier warning.
71+
72+
.. c:function::
73+
int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags);
74+
75+
Net device entries can be added or updated using the ``bpf_map_update_elem()``
76+
helper. This helper replaces existing elements atomically. The ``value`` parameter
77+
can be ``struct bpf_devmap_val`` or a simple ``int ifindex`` for backwards
78+
compatibility.
79+
80+
.. code-block:: c
81+
82+
struct bpf_devmap_val {
83+
__u32 ifindex; /* device index */
84+
union {
85+
int fd; /* prog fd on map write */
86+
__u32 id; /* prog id on map read */
87+
} bpf_prog;
88+
};
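
The two value layouts can be compared with a short user-space sketch. It assumes the legacy value is a 4-byte ``ifindex`` and the extended value is the 8-byte struct shown above, and that leaving ``bpf_prog.fd`` at zero means no program is attached. The struct ``devmap_val_sketch`` and the helper ``make_devmap_val`` are hypothetical names used only here; real code should use ``struct bpf_devmap_val`` from ``<linux/bpf.h>``.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct bpf_devmap_val, redefined only so this sketch
 * compiles without kernel headers. */
struct devmap_val_sketch {
	uint32_t ifindex;        /* device index */
	union {
		int      fd;     /* prog fd on map write */
		uint32_t id;     /* prog id on map read */
	} bpf_prog;
};

/* Extended layout: ifindex followed by the program fd/id (assumed 8 bytes). */
static_assert(sizeof(struct devmap_val_sketch) == 8,
	      "extended devmap value is expected to be 8 bytes");
/* The legacy layout is just the leading 4-byte ifindex. */
static_assert(offsetof(struct devmap_val_sketch, bpf_prog) == 4,
	      "program fd/id is expected to follow the 4-byte ifindex");

/* Hypothetical helper: build a value that redirects to out_ifindex and, if
 * prog_fd is positive, also names an xdp_devmap program for that entry. */
static struct devmap_val_sketch make_devmap_val(uint32_t out_ifindex, int prog_fd)
{
	/* Members not named in the initializer are zeroed, so fd starts at 0. */
	struct devmap_val_sketch val = { .ifindex = out_ifindex };

	if (prog_fd > 0)
		val.bpf_prog.fd = prog_fd;
	return val;
}
```

A value built this way would be passed as the ``value`` argument of ``bpf_map_update_elem()`` for a map declared with the struct as its value type.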
89+
90+
The ``flags`` argument can be one of the following:
91+
92+
- ``BPF_ANY``: Create a new element or update an existing element.
93+
- ``BPF_NOEXIST``: Create a new element only if it did not exist.
94+
- ``BPF_EXIST``: Update an existing element.
95+
96+
DEVMAPs can associate a program with a device entry by adding a ``bpf_prog.fd``
97+
to ``struct bpf_devmap_val``. Programs are run after ``XDP_REDIRECT`` and have
98+
access to both Rx device and Tx device. The program associated with the ``fd``
99+
must have type XDP with expected attach type ``xdp_devmap``.
100+
When a program is associated with a device index, the program is run on an
101+
``XDP_REDIRECT`` and before the buffer is added to the per-cpu queue. Examples
102+
of how to attach/use xdp_devmap progs can be found in the kernel selftests:
103+
104+
- ``tools/testing/selftests/bpf/prog_tests/xdp_devmap_attach.c``
105+
- ``tools/testing/selftests/bpf/progs/test_xdp_with_devmap_helpers.c``
106+
107+
.. c:function::
108+
int bpf_map_lookup_elem(int fd, const void *key, void *value);
109+
110+
Net device entries can be retrieved using the ``bpf_map_lookup_elem()``
111+
helper.
112+
113+
.. c:function::
114+
int bpf_map_delete_elem(int fd, const void *key);
115+
116+
Net device entries can be deleted using the ``bpf_map_delete_elem()``
117+
helper. This helper will return 0 on success, or negative error in case of
118+
failure.
119+
120+
Examples
121+
========
122+
123+
Kernel BPF
124+
----------
125+
126+
The following code snippet shows how to declare a ``BPF_MAP_TYPE_DEVMAP``
127+
called ``tx_port``.
128+
129+
.. code-block:: c
130+
131+
struct {
132+
__uint(type, BPF_MAP_TYPE_DEVMAP);
133+
__type(key, __u32);
134+
__type(value, __u32);
135+
__uint(max_entries, 256);
136+
} tx_port SEC(".maps");
137+
138+
The following code snippet shows how to declare a ``BPF_MAP_TYPE_DEVMAP_HASH``
139+
called ``forward_map``.
140+
141+
.. code-block:: c
142+
143+
struct {
144+
__uint(type, BPF_MAP_TYPE_DEVMAP_HASH);
145+
__type(key, __u32);
146+
__type(value, struct bpf_devmap_val);
147+
__uint(max_entries, 32);
148+
} forward_map SEC(".maps");
149+
150+
.. note::
151+
152+
The value type in the DEVMAP above is a ``struct bpf_devmap_val``
153+
154+
The following code snippet shows a simple xdp_redirect_map program. This program
155+
would work with a user space program that populates the devmap ``forward_map`` based
156+
on ingress ifindexes. The BPF program (below) is redirecting packets using the
157+
ingress ``ifindex`` as the ``key``.
158+
159+
.. code-block:: c
160+
161+
SEC("xdp")
162+
int xdp_redirect_map_func(struct xdp_md *ctx)
163+
{
164+
int index = ctx->ingress_ifindex;
165+
166+
return bpf_redirect_map(&forward_map, index, 0);
167+
}
168+
169+
The following code snippet shows a BPF program that is broadcasting packets to
170+
all the interfaces in the ``tx_port`` devmap.
171+
172+
.. code-block:: c
173+
174+
SEC("xdp")
175+
int xdp_redirect_map_func(struct xdp_md *ctx)
176+
{
177+
return bpf_redirect_map(&tx_port, 0, BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS);
178+
}
179+
180+
User space
181+
----------
182+
183+
The following code snippet shows how to update a devmap called ``tx_port``.
184+
185+
.. code-block:: c
186+
187+
int update_devmap(int ifindex, int redirect_ifindex)
188+
{
189+
int ret;
190+
191+
ret = bpf_map_update_elem(bpf_map__fd(tx_port), &ifindex, &redirect_ifindex, 0);
192+
if (ret < 0) {
193+
fprintf(stderr, "Failed to update devmap value: %s\n",
194+
strerror(errno));
195+
}
196+
197+
return ret;
198+
}
199+
200+
The following code snippet shows how to update a ``BPF_MAP_TYPE_DEVMAP_HASH`` called ``forward_map``.
201+
202+
.. code-block:: c
203+
204+
int update_devmap(int ifindex, int redirect_ifindex)
205+
{
206+
struct bpf_devmap_val devmap_val = { .ifindex = redirect_ifindex };
207+
int ret;
208+
209+
ret = bpf_map_update_elem(bpf_map__fd(forward_map), &ifindex, &devmap_val, 0);
210+
if (ret < 0) {
211+
fprintf(stderr, "Failed to update devmap value: %s\n",
212+
strerror(errno));
213+
}
214+
return ret;
215+
}
216+
217+
References
218+
===========
219+
220+
- https://lwn.net/Articles/728146/
221+
- https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=6f9d451ab1a33728adb72d7ff66a7b374d665176
222+
- https://elixir.bootlin.com/linux/latest/source/net/core/filter.c#L4106

Documentation/bpf/redirect.rst

Lines changed: 81 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,81 @@
1+
.. SPDX-License-Identifier: GPL-2.0-only
2+
.. Copyright (C) 2022 Red Hat, Inc.
3+
4+
========
5+
Redirect
6+
========
7+
XDP_REDIRECT
8+
############
9+
Supported maps
10+
--------------
11+
12+
XDP_REDIRECT works with the following map types:
13+
14+
- ``BPF_MAP_TYPE_DEVMAP``
15+
- ``BPF_MAP_TYPE_DEVMAP_HASH``
16+
- ``BPF_MAP_TYPE_CPUMAP``
17+
- ``BPF_MAP_TYPE_XSKMAP``
18+
19+
For more information on these maps, please see the specific map documentation.
20+
21+
Process
22+
-------
23+
24+
.. kernel-doc:: net/core/filter.c
25+
:doc: xdp redirect
26+
27+
.. note::
28+
Not all drivers support transmitting frames after a redirect, and for
29+
those that do, not all of them support non-linear frames. Non-linear xdp
30+
bufs/frames are bufs/frames that contain more than one fragment.
31+
32+
Debugging packet drops
33+
----------------------
34+
Silent packet drops for XDP_REDIRECT can be debugged using:
35+
36+
- bpftrace
37+
- perf record
38+
39+
bpftrace
40+
^^^^^^^^^
41+
The following bpftrace command can be used to capture and count all XDP tracepoints:
42+
43+
.. code-block:: none
44+
45+
sudo bpftrace -e 'tracepoint:xdp:* { @cnt[probe] = count(); }'
46+
Attaching 12 probes...
47+
^C
48+
49+
@cnt[tracepoint:xdp:mem_connect]: 18
50+
@cnt[tracepoint:xdp:mem_disconnect]: 18
51+
@cnt[tracepoint:xdp:xdp_exception]: 19605
52+
@cnt[tracepoint:xdp:xdp_devmap_xmit]: 1393604
53+
@cnt[tracepoint:xdp:xdp_redirect]: 22292200
54+
55+
.. note::
56+
The various xdp tracepoints can be found in ``include/trace/events/xdp.h``.
57+
58+
The following bpftrace command can be used to extract the ``ERRNO`` being returned as
59+
part of the ``err`` parameter:
60+
61+
.. code-block:: none
62+
63+
sudo bpftrace -e \
64+
'tracepoint:xdp:xdp_redirect*_err {@redir_errno[-args->err] = count();}
65+
tracepoint:xdp:xdp_devmap_xmit {@devmap_errno[-args->err] = count();}'
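
The maps printed by the command above are keyed by a numeric errno. As a convenience, a small user-space helper can turn each key into readable text. This is a minimal sketch: the function name ``describe_errno_count`` and the output format are hypothetical, and the message text comes from the C library's ``strerror()``, so it varies between systems.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format one errno/count pair, as read from a bpftrace
 * map entry, with the C library's description of the errno spelled out.
 * Returns the snprintf() result (number of characters that would be written). */
static int describe_errno_count(char *buf, size_t len, int err, unsigned long count)
{
	assert(buf != NULL);
	return snprintf(buf, len, "errno %d (%s): %lu", err, strerror(err), count);
}
```

Calling it with the key and the count from one map entry produces a single line that can be logged next to the raw bpftrace output.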
66+
67+
perf record
68+
^^^^^^^^^^^
69+
The perf tool also supports recording tracepoints:
70+
71+
.. code-block:: none
72+
73+
perf record -a -e xdp:xdp_redirect_err \
74+
-e xdp:xdp_redirect_map_err \
75+
-e xdp:xdp_exception \
76+
-e xdp:xdp_devmap_xmit
77+
78+
References
79+
===========
80+
81+
- https://github.com/xdp-project/xdp-tutorial/tree/master/tracing02-xdp-monitor

net/core/filter.c

Lines changed: 6 additions & 2 deletions
@@ -4108,7 +4108,10 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
41084108
.arg2_type = ARG_ANYTHING,
41094109
};
41104110

4111-
/* XDP_REDIRECT works by a three-step process, implemented in the functions
4111+
/**
4112+
* DOC: xdp redirect
4113+
*
4114+
* XDP_REDIRECT works by a three-step process, implemented in the functions
41124115
* below:
41134116
*
41144117
* 1. The bpf_redirect() and bpf_redirect_map() helpers will lookup the target
@@ -4123,7 +4126,8 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
41234126
* 3. Before exiting its NAPI poll loop, the driver will call xdp_do_flush(),
41244127
* which will flush all the different bulk queues, thus completing the
41254128
* redirect.
4126-
*
4129+
*/
4130+
/*
41274131
* Pointers to the map entries will be kept around for this whole sequence of
41284132
* steps, protected by RCU. However, there is no top-level rcu_read_lock() in
41294133
* the core code; instead, the RCU protection relies on everything happening
