Commit 7828f20

Merge branch 'bpf-cgroup-bind-connect'
Andrey Ignatov says:

====================
v2->v3:
- rebase due to conflicts
- fix ipv6=m build

v1->v2:
- support expected_attach_type at prog load time so that the prog (incl. context accesses and calls to helpers) can be validated with regard to the specific attach point it is supposed to be attached to. Later, at attach time, the attach type is checked: it must be the same as at load time if it was provided
- reworked the hooks to rely on expected_attach_type, reducing the number of new prog types from 6 to just 1: BPF_PROG_TYPE_CGROUP_SOCK_ADDR
- reused BPF_PROG_TYPE_CGROUP_SOCK for the sys_bind post-hooks
- added selftests for the post-sys_bind hook

For our container management we've been using a complicated and fragile setup consisting of an LD_PRELOAD wrapper that intercepts bind and connect calls from all containerized applications. Unfortunately it doesn't work for apps that don't use glibc, and changing all applications that run in the datacenter is not possible due to 3rd party code and libraries (despite being open source) and the sheer amount of legacy code that would have to be rewritten (we're rewriting what we can in parallel).

These applications are written without containers in mind and have built-in assumptions about network services. For example, application X expects to connect to localhost:special_port and find service Y there. To move application X and service Y into two different containers, the LD_PRELOAD approach is used to help one service connect to the other without rewriting either of them.

Moving these two applications into different L2 (netns) or L3 (vrf) network isolation scopes doesn't solve the problem, since the applications need to see each other as if they were running on the host without containers. So if app X and app Y ran in different netns, something would need to punch a connectivity hole in those namespaces. That would be a real layering violation (with the corresponding network debugging pains), since a clean l2/l3 abstraction would suddenly support something that breaks through the layers.

Instead we used LD_PRELOAD (and now bpf programs) at bind/connect time to help applications discover and connect to each other. All applications run in init_netns and there are no vrfs. After bind/connect, the normal fib/neighbor core networking logic works as it always does, so the whole system stays clean from a network point of view and can be debugged with standard tools.

We also considered resurrecting Hannes's afnetns work, but hierarchical namespace abstractions don't work either, due to the same built-in networking assumptions inside the apps. To run an application inside a cgroup container that was not written with containers in mind, we have to create the illusion of a non-containerized environment.

In some cases we remember the port and container id in the post-bind hook in a bpf map, and when some other task in a different container tries to connect to a service we need to know where that service is running. It can be remote or local. Both client and service may or may not be written with containers in mind, and this sockaddr rewrite provides connectivity and load balancing.

BPF+cgroup looks to be the best solution for this problem. Hence we introduce 3 hooks:

- at entry into sys_bind and sys_connect, to let a bpf prog inspect and modify the 'struct sockaddr' provided by user space, and to fail bind/connect when appropriate
- post sys_bind, after the port is allocated

The approach works great: it has zero overhead for anyone who doesn't use it and very low overhead when deployed. A different use case for this feature is a low overhead firewall that doesn't need to inspect all packets and works at bind/connect time.
====================

Signed-off-by: Daniel Borkmann <[email protected]>
2 parents 807ae7d + 1d43688 commit 7828f20

33 files changed: +2314 −132 lines

include/linux/bpf-cgroup.h

Lines changed: 65 additions & 3 deletions
@@ -6,6 +6,7 @@
 #include <uapi/linux/bpf.h>
 
 struct sock;
+struct sockaddr;
 struct cgroup;
 struct sk_buff;
 struct bpf_sock_ops_kern;
@@ -63,6 +64,10 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
 int __cgroup_bpf_run_filter_sk(struct sock *sk,
			       enum bpf_attach_type type);
 
+int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
+				      struct sockaddr *uaddr,
+				      enum bpf_attach_type type);
+
 int __cgroup_bpf_run_filter_sock_ops(struct sock *sk,
				     struct bpf_sock_ops_kern *sock_ops,
				     enum bpf_attach_type type);
@@ -93,16 +98,64 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
	__ret;							       \
 })
 
-#define BPF_CGROUP_RUN_PROG_INET_SOCK(sk)			       \
+#define BPF_CGROUP_RUN_SK_PROG(sk, type)			       \
 ({								       \
	int __ret = 0;						       \
	if (cgroup_bpf_enabled) {				       \
-		__ret = __cgroup_bpf_run_filter_sk(sk,		       \
-						   BPF_CGROUP_INET_SOCK_CREATE); \
+		__ret = __cgroup_bpf_run_filter_sk(sk, type);	       \
+	}							       \
+	__ret;							       \
+})
+
+#define BPF_CGROUP_RUN_PROG_INET_SOCK(sk)			       \
+	BPF_CGROUP_RUN_SK_PROG(sk, BPF_CGROUP_INET_SOCK_CREATE)
+
+#define BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk)			       \
+	BPF_CGROUP_RUN_SK_PROG(sk, BPF_CGROUP_INET4_POST_BIND)
+
+#define BPF_CGROUP_RUN_PROG_INET6_POST_BIND(sk)			       \
+	BPF_CGROUP_RUN_SK_PROG(sk, BPF_CGROUP_INET6_POST_BIND)
+
+#define BPF_CGROUP_RUN_SA_PROG(sk, uaddr, type)			       \
+({								       \
+	int __ret = 0;						       \
+	if (cgroup_bpf_enabled)					       \
+		__ret = __cgroup_bpf_run_filter_sock_addr(sk, uaddr, type); \
+	__ret;							       \
+})
+
+#define BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, type)		       \
+({								       \
+	int __ret = 0;						       \
+	if (cgroup_bpf_enabled)	{				       \
+		lock_sock(sk);					       \
+		__ret = __cgroup_bpf_run_filter_sock_addr(sk, uaddr, type); \
+		release_sock(sk);				       \
	}							       \
	__ret;							       \
 })
 
+#define BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr)		       \
+	BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET4_BIND)
+
+#define BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr)		       \
+	BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET6_BIND)
+
+#define BPF_CGROUP_PRE_CONNECT_ENABLED(sk) (cgroup_bpf_enabled && \
+					    sk->sk_prot->pre_connect)
+
+#define BPF_CGROUP_RUN_PROG_INET4_CONNECT(sk, uaddr)		       \
+	BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET4_CONNECT)
+
+#define BPF_CGROUP_RUN_PROG_INET6_CONNECT(sk, uaddr)		       \
+	BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET6_CONNECT)
+
+#define BPF_CGROUP_RUN_PROG_INET4_CONNECT_LOCK(sk, uaddr)	       \
+	BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_INET4_CONNECT)
+
+#define BPF_CGROUP_RUN_PROG_INET6_CONNECT_LOCK(sk, uaddr)	       \
+	BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_INET6_CONNECT)
+
 #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops)			       \
 ({								       \
	int __ret = 0;						       \
@@ -132,9 +185,18 @@ struct cgroup_bpf {};
 static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
 static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; }
 
+#define BPF_CGROUP_PRE_CONNECT_ENABLED(sk) (0)
 #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET_SOCK(sk) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET6_POST_BIND(sk) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET4_CONNECT(sk, uaddr) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET4_CONNECT_LOCK(sk, uaddr) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET6_CONNECT(sk, uaddr) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET6_CONNECT_LOCK(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })

include/linux/bpf.h

Lines changed: 4 additions & 1 deletion
@@ -208,12 +208,15 @@ struct bpf_prog_ops {
 
 struct bpf_verifier_ops {
	/* return eBPF function prototype for verification */
-	const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
+	const struct bpf_func_proto *
+	(*get_func_proto)(enum bpf_func_id func_id,
+			  const struct bpf_prog *prog);
 
	/* return true if 'size' wide access at offset 'off' within bpf_context
	 * with 'type' (read or write) is allowed
	 */
	bool (*is_valid_access)(int off, int size, enum bpf_access_type type,
+				const struct bpf_prog *prog,
				struct bpf_insn_access_aux *info);
	int (*gen_prologue)(struct bpf_insn *insn, bool direct_write,
			    const struct bpf_prog *prog);

include/linux/bpf_types.h

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_SCHED_ACT, tc_cls_act)
 BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SKB, cg_skb)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK, cg_sock)
+BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK_ADDR, cg_sock_addr)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_inout)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_inout)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)

include/linux/filter.h

Lines changed: 11 additions & 0 deletions
@@ -469,6 +469,7 @@ struct bpf_prog {
				is_func:1,	/* program is a bpf function */
				kprobe_override:1; /* Do we override a kprobe? */
	enum bpf_prog_type	type;		/* Type of BPF program */
+	enum bpf_attach_type	expected_attach_type; /* For some prog types */
	u32			len;		/* Number of filter blocks */
	u32			jited_len;	/* Size of jited insns in bytes */
	u8			tag[BPF_TAG_SIZE];
@@ -1020,6 +1021,16 @@ static inline int bpf_tell_extensions(void)
	return SKF_AD_MAX;
 }
 
+struct bpf_sock_addr_kern {
+	struct sock *sk;
+	struct sockaddr *uaddr;
+	/* Temporary "register" to make indirect stores to nested structures
+	 * defined above. We need three registers to make such a store, but
+	 * only two (src and dst) are available at convert_ctx_access time
+	 */
+	u64 tmp_reg;
+};
+
 struct bpf_sock_ops_kern {
	struct sock	*sk;
	u32		op;

include/net/addrconf.h

Lines changed: 7 additions & 0 deletions
@@ -231,6 +231,13 @@ struct ipv6_stub {
 };
 extern const struct ipv6_stub *ipv6_stub __read_mostly;
 
+/* A stub used by bpf helpers. Similarly ugly as ipv6_stub */
+struct ipv6_bpf_stub {
+	int (*inet6_bind)(struct sock *sk, struct sockaddr *uaddr, int addr_len,
+			  bool force_bind_address_no_port, bool with_lock);
+};
+extern const struct ipv6_bpf_stub *ipv6_bpf_stub __read_mostly;
+
 /*
  * identify MLD packets for MLD filter exceptions
  */

include/net/inet_common.h

Lines changed: 2 additions & 0 deletions
@@ -32,6 +32,8 @@ int inet_shutdown(struct socket *sock, int how);
 int inet_listen(struct socket *sock, int backlog);
 void inet_sock_destruct(struct sock *sk);
 int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len);
+int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
+		bool force_bind_address_no_port, bool with_lock);
 int inet_getname(struct socket *sock, struct sockaddr *uaddr,
		 int peer);
 int inet_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg);

include/net/ipv6.h

Lines changed: 2 additions & 0 deletions
@@ -1066,6 +1066,8 @@ void ipv6_local_error(struct sock *sk, int err, struct flowi6 *fl6, u32 info);
 void ipv6_local_rxpmtu(struct sock *sk, struct flowi6 *fl6, u32 mtu);
 
 int inet6_release(struct socket *sock);
+int __inet6_bind(struct sock *sock, struct sockaddr *uaddr, int addr_len,
+		 bool force_bind_address_no_port, bool with_lock);
 int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len);
 int inet6_getname(struct socket *sock, struct sockaddr *uaddr,
		  int peer);

include/net/sock.h

Lines changed: 3 additions & 0 deletions
@@ -1026,6 +1026,9 @@ static inline void sk_prot_clear_nulls(struct sock *sk, int size)
 struct proto {
	void			(*close)(struct sock *sk,
					long timeout);
+	int			(*pre_connect)(struct sock *sk,
+					struct sockaddr *uaddr,
+					int addr_len);
	int			(*connect)(struct sock *sk,
					struct sockaddr *uaddr,
					int addr_len);

include/net/udp.h

Lines changed: 1 addition & 0 deletions
@@ -273,6 +273,7 @@ void udp4_hwcsum(struct sk_buff *skb, __be32 src, __be32 dst);
 int udp_rcv(struct sk_buff *skb);
 int udp_ioctl(struct sock *sk, int cmd, unsigned long arg);
 int udp_init_sock(struct sock *sk);
+int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
 int __udp_disconnect(struct sock *sk, int flags);
 int udp_disconnect(struct sock *sk, int flags);
 __poll_t udp_poll(struct file *file, struct socket *sock, poll_table *wait);

include/uapi/linux/bpf.h

Lines changed: 50 additions & 1 deletion
@@ -136,6 +136,7 @@ enum bpf_prog_type {
	BPF_PROG_TYPE_CGROUP_DEVICE,
	BPF_PROG_TYPE_SK_MSG,
	BPF_PROG_TYPE_RAW_TRACEPOINT,
+	BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
 };
 
 enum bpf_attach_type {
@@ -147,6 +148,12 @@ enum bpf_attach_type {
	BPF_SK_SKB_STREAM_VERDICT,
	BPF_CGROUP_DEVICE,
	BPF_SK_MSG_VERDICT,
+	BPF_CGROUP_INET4_BIND,
+	BPF_CGROUP_INET6_BIND,
+	BPF_CGROUP_INET4_CONNECT,
+	BPF_CGROUP_INET6_CONNECT,
+	BPF_CGROUP_INET4_POST_BIND,
+	BPF_CGROUP_INET6_POST_BIND,
	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -296,6 +303,11 @@ union bpf_attr {
		__u32		prog_flags;
		char		prog_name[BPF_OBJ_NAME_LEN];
		__u32		prog_ifindex;	/* ifindex of netdev to prep for */
+		/* For some prog types expected attach type must be known at
+		 * load time to verify attach type specific parts of prog
+		 * (context accesses, allowed helpers, etc).
+		 */
+		__u32		expected_attach_type;
	};
 
	struct { /* anonymous struct used by BPF_OBJ_* commands */
@@ -736,6 +748,13 @@ union bpf_attr {
 *	@flags: reserved for future use
 *	Return: SK_PASS
 *
+ * int bpf_bind(ctx, addr, addr_len)
+ *	Bind socket to address. Only binding to IP is supported, no port can be
+ *	set in addr.
+ *	@ctx: pointer to context of type bpf_sock_addr
+ *	@addr: pointer to struct sockaddr to bind socket to
+ *	@addr_len: length of sockaddr structure
+ *	Return: 0 on success or negative error code
 */
 #define __BPF_FUNC_MAPPER(FN)		\
	FN(unspec),			\
@@ -801,7 +820,8 @@ union bpf_attr {
	FN(msg_redirect_map),		\
	FN(msg_apply_bytes),		\
	FN(msg_cork_bytes),		\
-	FN(msg_pull_data),
+	FN(msg_pull_data),		\
+	FN(bind),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
 * function eBPF program intends to call
@@ -930,6 +950,15 @@ struct bpf_sock {
	__u32 protocol;
	__u32 mark;
	__u32 priority;
+	__u32 src_ip4;		/* Allows 1,2,4-byte read.
+				 * Stored in network byte order.
+				 */
+	__u32 src_ip6[4];	/* Allows 1,2,4-byte read.
+				 * Stored in network byte order.
+				 */
+	__u32 src_port;		/* Allows 4-byte read.
+				 * Stored in host byte order
+				 */
 };
 
 #define XDP_PACKET_HEADROOM 256
@@ -1005,6 +1034,26 @@ struct bpf_map_info {
	__u64 netns_ino;
 } __attribute__((aligned(8)));
 
+/* User bpf_sock_addr struct to access socket fields and sockaddr struct passed
+ * by user and intended to be used by socket (e.g. to bind to, depends on
+ * attach attach type).
+ */
+struct bpf_sock_addr {
+	__u32 user_family;	/* Allows 4-byte read, but no write. */
+	__u32 user_ip4;		/* Allows 1,2,4-byte read and 4-byte write.
+				 * Stored in network byte order.
+				 */
+	__u32 user_ip6[4];	/* Allows 1,2,4-byte read an 4-byte write.
+				 * Stored in network byte order.
+				 */
+	__u32 user_port;	/* Allows 4-byte read and write.
+				 * Stored in network byte order
+				 */
+	__u32 family;		/* Allows 4-byte read, but no write */
+	__u32 type;		/* Allows 4-byte read, but no write */
+	__u32 protocol;		/* Allows 4-byte read, but no write */
+};
+
 /* User bpf_sock_ops struct to access socket values and specify request ops
 * and their replies.
 * Some of this fields are in network (bigendian) byte order and may need

kernel/bpf/cgroup.c

Lines changed: 38 additions & 1 deletion
@@ -494,6 +494,42 @@ int __cgroup_bpf_run_filter_sk(struct sock *sk,
 }
 EXPORT_SYMBOL(__cgroup_bpf_run_filter_sk);
 
+/**
+ * __cgroup_bpf_run_filter_sock_addr() - Run a program on a sock and
+ *                                       provided by user sockaddr
+ * @sk: sock struct that will use sockaddr
+ * @uaddr: sockaddr struct provided by user
+ * @type: The type of program to be exectuted
+ *
+ * socket is expected to be of type INET or INET6.
+ *
+ * This function will return %-EPERM if an attached program is found and
+ * returned value != 1 during execution. In all other cases, 0 is returned.
+ */
+int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
+				      struct sockaddr *uaddr,
+				      enum bpf_attach_type type)
+{
+	struct bpf_sock_addr_kern ctx = {
+		.sk = sk,
+		.uaddr = uaddr,
+	};
+	struct cgroup *cgrp;
+	int ret;
+
+	/* Check socket family since not all sockets represent network
+	 * endpoint (e.g. AF_UNIX).
+	 */
+	if (sk->sk_family != AF_INET && sk->sk_family != AF_INET6)
+		return 0;
+
+	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
+	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], &ctx, BPF_PROG_RUN);
+
+	return ret == 1 ? 0 : -EPERM;
+}
+EXPORT_SYMBOL(__cgroup_bpf_run_filter_sock_addr);
+
 /**
 * __cgroup_bpf_run_filter_sock_ops() - Run a program on a sock
 * @sk: socket to get cgroup from
@@ -545,7 +581,7 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
 EXPORT_SYMBOL(__cgroup_bpf_check_dev_permission);
 
 static const struct bpf_func_proto *
-cgroup_dev_func_proto(enum bpf_func_id func_id)
+cgroup_dev_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
	switch (func_id) {
	case BPF_FUNC_map_lookup_elem:
@@ -566,6 +602,7 @@ cgroup_dev_func_proto(enum bpf_func_id func_id)
 
 static bool cgroup_dev_is_valid_access(int off, int size,
				       enum bpf_access_type type,
+				       const struct bpf_prog *prog,
				       struct bpf_insn_access_aux *info)
 {
	const int size_default = sizeof(__u32);
