Commit f318903

borkmann authored and Alexei Starovoitov committed
bpf: Add netns cookie and enable it for bpf cgroup hooks
In Cilium we're mainly using BPF cgroup hooks today in order to implement kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*), ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic between Cilium managed nodes. While this works in its current shape and avoids packet-level NAT for inter Cilium managed node traffic, there is one major limitation we're facing today, that is, lack of netns awareness.

In Kubernetes, the concept of Pods (which hold one or multiple containers) has been built around network namespaces, so while we can use the global scope of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing NodePort ports on loopback addresses), we also need to differentiate between the initial network namespace and non-initial ones. For example, ExternalIP services mandate that non-local service IPs are not to be translated from the host (initial) network namespace. Right now, we have an ugly work-around in place where non-local service IPs for ExternalIP services are not translated from the connect() and friends BPF hooks but instead via less efficient packet-level NAT on the veth tc ingress hook for Pod traffic.

On top of determining whether we're in the initial or a non-initial network namespace, we also need a socket-cookie like mechanism scoped to network namespaces. Socket cookies have the nice property that they can be combined as part of the key structure, e.g. for BPF LRU maps, without having to worry that the cookie could be recycled. We are planning to use this for our sessionAffinity implementation for services.

Therefore, add a new bpf_get_netns_cookie() helper which resolves both use cases at once: bpf_get_netns_cookie(NULL) provides the cookie for the initial network namespace, while passing the context instead of NULL provides the cookie from the application's network namespace. The cookie is stored in an existing hole, so there is no size increase; the assignment happens only once. This allows for a comparison against the initial namespace as well as regular cookie usage as we have today with socket cookies. We could later on enable this helper for other program types as the need arises.

(*) Both externalTrafficPolicy={Local|Cluster} types
[0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c

Signed-off-by: Daniel Borkmann <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net
1 parent fcf752e commit f318903
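The two use cases named in the commit message (detecting the initial namespace, and using the cookie as part of a map key) can be illustrated with a small userspace mock. This is only a sketch: `mock_net`, `mock_ctx`, `mock_get_netns_cookie()`, `in_init_netns()`, `affinity_key`, and the cookie values are all hypothetical stand-ins invented for the example, not kernel or Cilium code. The only property carried over from the commit is that passing NULL selects the initial namespace's cookie.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
struct mock_net { uint64_t cookie; };
struct mock_ctx { struct mock_net *net; uint32_t dst_ip; };

/* Assumed cookie value for the mock initial namespace. */
static struct mock_net mock_init_net = { .cookie = 1 };

/* Mimics the helper's contract: NULL selects the initial namespace. */
static uint64_t mock_get_netns_cookie(const struct mock_ctx *ctx)
{
	return ctx ? ctx->net->cookie : mock_init_net.cookie;
}

/* Use case 1: is the caller running in the initial (host) namespace? */
static int in_init_netns(const struct mock_ctx *ctx)
{
	return mock_get_netns_cookie(ctx) == mock_get_netns_cookie(NULL);
}

/* Use case 2: the cookie as part of a map key, e.g. for session affinity. */
struct affinity_key {
	uint64_t netns_cookie;
	uint32_t svc_ip;
	uint32_t pad;
};

static struct affinity_key make_affinity_key(const struct mock_ctx *ctx)
{
	struct affinity_key key = {
		.netns_cookie = mock_get_netns_cookie(ctx),
		.svc_ip = ctx->dst_ip,
	};
	return key;
}

/* Two clients hitting the same service IP from different namespaces. */
static void example_usage(void)
{
	struct mock_net pod_net = { .cookie = 2 };
	struct mock_ctx host = { .net = &mock_init_net, .dst_ip = 0x0a000001 };
	struct mock_ctx pod = { .net = &pod_net, .dst_ip = 0x0a000001 };

	assert(in_init_netns(&host));
	assert(!in_init_netns(&pod));
	assert(make_affinity_key(&host).netns_cookie !=
	       make_affinity_key(&pod).netns_cookie);
}
```

In a real BPF program the comparison would call the helper twice, once with the hook's context and once with NULL, rather than reading a mock struct.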

7 files changed: +103 -8 lines changed

include/linux/bpf.h

Lines changed: 1 addition & 0 deletions
@@ -233,6 +233,7 @@ enum bpf_arg_type {
 	ARG_CONST_SIZE_OR_ZERO,	/* number of bytes accessed from memory or 0 */

 	ARG_PTR_TO_CTX,		/* pointer to context */
+	ARG_PTR_TO_CTX_OR_NULL,	/* pointer to context or NULL */
 	ARG_ANYTHING,		/* any (initialized) argument is ok */
 	ARG_PTR_TO_SPIN_LOCK,	/* pointer to bpf_spin_lock */
 	ARG_PTR_TO_SOCK_COMMON,	/* pointer to sock_common */

include/net/net_namespace.h

Lines changed: 10 additions & 0 deletions
@@ -168,6 +168,9 @@ struct net {
 #ifdef CONFIG_XFRM
 	struct netns_xfrm	xfrm;
 #endif
+
+	atomic64_t		net_cookie; /* written once */
+
 #if IS_ENABLED(CONFIG_IP_VS)
 	struct netns_ipvs	*ipvs;
 #endif
@@ -273,6 +276,8 @@ static inline int check_net(const struct net *net)

 void net_drop_ns(void *);

+u64 net_gen_cookie(struct net *net);
+
 #else

 static inline struct net *get_net(struct net *net)
@@ -300,6 +305,11 @@ static inline int check_net(const struct net *net)
 	return 1;
 }

+static inline u64 net_gen_cookie(struct net *net)
+{
+	return 0;
+}
+
 #define net_drop_ns NULL
 #endif

include/uapi/linux/bpf.h

Lines changed: 15 additions & 1 deletion
@@ -2950,6 +2950,19 @@ union bpf_attr {
  *		restricted to raw_tracepoint bpf programs.
  *	Return
  *		0 on success, or a negative error in case of failure.
+ *
+ * u64 bpf_get_netns_cookie(void *ctx)
+ * 	Description
+ * 		Retrieve the cookie (generated by the kernel) of the network
+ * 		namespace the input *ctx* is associated with. The network
+ * 		namespace cookie remains stable for its lifetime and provides
+ * 		a global identifier that can be assumed unique. If *ctx* is
+ * 		NULL, then the helper returns the cookie for the initial
+ * 		network namespace. The cookie itself is very similar to that
+ * 		of bpf_get_socket_cookie() helper, but for network namespaces
+ * 		instead of sockets.
+ * 	Return
+ * 		A 8-byte long opaque number.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3073,7 +3086,8 @@ union bpf_attr {
 	FN(jiffies64),			\
 	FN(read_branch_records),	\
 	FN(get_ns_current_pid_tgid),	\
-	FN(xdp_output),
+	FN(xdp_output),			\
+	FN(get_netns_cookie),

 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
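The `__BPF_FUNC_MAPPER(FN)` list touched in this hunk is an X-macro: the same list is expanded with different `FN` definitions to generate, among other things, the `BPF_FUNC_*` enumerators. The sketch below shows the pattern on a shortened list. All `DEMO_*` names and the `demo_func_names` table are invented for illustration, and the resulting numeric values do not match the kernel's real helper IDs.

```c
#include <assert.h>
#include <string.h>

/* Shortened, illustrative mapper. The real list in the UAPI header is far
 * longer, so the numeric values here do NOT match the kernel's helper IDs. */
#define DEMO_FUNC_MAPPER(FN)	\
	FN(unspec),		\
	FN(xdp_output),		\
	FN(get_netns_cookie),

/* Expansion 1: turn each entry into an enumerator. */
#define DEMO_ENUM_FN(x) DEMO_FUNC_ ## x
enum demo_func_id {
	DEMO_FUNC_MAPPER(DEMO_ENUM_FN)
	DEMO_MAX_ID,
};
#undef DEMO_ENUM_FN

/* Expansion 2: turn the same entries into printable names. */
#define DEMO_NAME_FN(x) #x
static const char *const demo_func_names[] = {
	DEMO_FUNC_MAPPER(DEMO_NAME_FN)
};
#undef DEMO_NAME_FN

static void example_usage(void)
{
	/* A newly appended entry gets the next free ID. */
	assert(DEMO_FUNC_get_netns_cookie == DEMO_FUNC_xdp_output + 1);
	assert(strcmp(demo_func_names[DEMO_FUNC_get_netns_cookie],
		      "get_netns_cookie") == 0);
}
```

Because enumerator values follow list order, appending the new helper at the end, as the hunk does, leaves the IDs of all existing helpers unchanged.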

kernel/bpf/verifier.c

Lines changed: 10 additions & 6 deletions
@@ -3461,13 +3461,17 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 		expected_type = CONST_PTR_TO_MAP;
 		if (type != expected_type)
 			goto err_type;
-	} else if (arg_type == ARG_PTR_TO_CTX) {
+	} else if (arg_type == ARG_PTR_TO_CTX ||
+		   arg_type == ARG_PTR_TO_CTX_OR_NULL) {
 		expected_type = PTR_TO_CTX;
-		if (type != expected_type)
-			goto err_type;
-		err = check_ctx_reg(env, reg, regno);
-		if (err < 0)
-			return err;
+		if (!(register_is_null(reg) &&
+		      arg_type == ARG_PTR_TO_CTX_OR_NULL)) {
+			if (type != expected_type)
+				goto err_type;
+			err = check_ctx_reg(env, reg, regno);
+			if (err < 0)
+				return err;
+		}
 	} else if (arg_type == ARG_PTR_TO_SOCK_COMMON) {
 		expected_type = PTR_TO_SOCK_COMMON;
 		/* Any sk pointer can be ARG_PTR_TO_SOCK_COMMON */
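The decision this hunk adds to `check_func_arg()` can be modelled in isolation: a register holding a known NULL is accepted only when the helper declares `ARG_PTR_TO_CTX_OR_NULL`, and everything else must still be a context pointer. The following is a simplified userspace sketch. The `demo_*` types and functions are hypothetical, and the real verifier tracks far more register state and also runs `check_ctx_reg()` on the non-NULL path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-ins for verifier state. */
enum demo_arg_type { DEMO_ARG_PTR_TO_CTX, DEMO_ARG_PTR_TO_CTX_OR_NULL };
enum demo_reg_type { DEMO_SCALAR_VALUE, DEMO_PTR_TO_CTX };

struct demo_reg {
	enum demo_reg_type type;
	bool known_const;	/* value is known at verification time */
	long value;		/* only meaningful if known_const is true */
};

/* A register counts as NULL when it is a scalar known to be zero. */
static bool demo_register_is_null(const struct demo_reg *reg)
{
	return reg->type == DEMO_SCALAR_VALUE && reg->known_const &&
	       reg->value == 0;
}

/* Returns 0 if the argument is acceptable, -1 on a type mismatch. */
static int demo_check_ctx_arg(enum demo_arg_type arg_type,
			      const struct demo_reg *reg)
{
	if (demo_register_is_null(reg) &&
	    arg_type == DEMO_ARG_PTR_TO_CTX_OR_NULL)
		return 0;	/* NULL is allowed; skip the ctx checks */
	if (reg->type != DEMO_PTR_TO_CTX)
		return -1;
	/* The kernel additionally runs check_ctx_reg() at this point. */
	return 0;
}

static void example_usage(void)
{
	struct demo_reg null_reg = { DEMO_SCALAR_VALUE, true, 0 };
	struct demo_reg ctx_reg = { DEMO_PTR_TO_CTX, false, 0 };

	assert(demo_check_ctx_arg(DEMO_ARG_PTR_TO_CTX_OR_NULL, &null_reg) == 0);
	assert(demo_check_ctx_arg(DEMO_ARG_PTR_TO_CTX, &null_reg) == -1);
	assert(demo_check_ctx_arg(DEMO_ARG_PTR_TO_CTX, &ctx_reg) == 0);
}
```

Helpers that keep plain `ARG_PTR_TO_CTX` are unaffected in this model: a NULL argument to them is still rejected.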

net/core/filter.c

Lines changed: 37 additions & 0 deletions
@@ -4141,6 +4141,39 @@ static const struct bpf_func_proto bpf_get_socket_cookie_sock_ops_proto = {
 	.arg1_type	= ARG_PTR_TO_CTX,
 };

+static u64 __bpf_get_netns_cookie(struct sock *sk)
+{
+#ifdef CONFIG_NET_NS
+	return net_gen_cookie(sk ? sk->sk_net.net : &init_net);
+#else
+	return 0;
+#endif
+}
+
+BPF_CALL_1(bpf_get_netns_cookie_sock, struct sock *, ctx)
+{
+	return __bpf_get_netns_cookie(ctx);
+}
+
+static const struct bpf_func_proto bpf_get_netns_cookie_sock_proto = {
+	.func		= bpf_get_netns_cookie_sock,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX_OR_NULL,
+};
+
+BPF_CALL_1(bpf_get_netns_cookie_sock_addr, struct bpf_sock_addr_kern *, ctx)
+{
+	return __bpf_get_netns_cookie(ctx ? ctx->sk : NULL);
+}
+
+static const struct bpf_func_proto bpf_get_netns_cookie_sock_addr_proto = {
+	.func		= bpf_get_netns_cookie_sock_addr,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX_OR_NULL,
+};
+
 BPF_CALL_1(bpf_get_socket_uid, struct sk_buff *, skb)
 {
 	struct sock *sk = sk_to_full_sk(skb->sk);
@@ -5968,6 +6001,8 @@ sock_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_local_storage_proto;
 	case BPF_FUNC_get_socket_cookie:
 		return &bpf_get_socket_cookie_sock_proto;
+	case BPF_FUNC_get_netns_cookie:
+		return &bpf_get_netns_cookie_sock_proto;
 	case BPF_FUNC_perf_event_output:
 		return &bpf_event_output_data_proto;
 	default:
@@ -5994,6 +6029,8 @@ sock_addr_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		}
 	case BPF_FUNC_get_socket_cookie:
 		return &bpf_get_socket_cookie_sock_addr_proto;
+	case BPF_FUNC_get_netns_cookie:
+		return &bpf_get_netns_cookie_sock_addr_proto;
 	case BPF_FUNC_get_local_storage:
 		return &bpf_get_local_storage_proto;
 	case BPF_FUNC_perf_event_output:

net/core/net_namespace.c

Lines changed: 15 additions & 0 deletions
@@ -69,6 +69,20 @@ EXPORT_SYMBOL_GPL(pernet_ops_rwsem);

 static unsigned int max_gen_ptrs = INITIAL_NET_GEN_PTRS;

+static atomic64_t cookie_gen;
+
+u64 net_gen_cookie(struct net *net)
+{
+	while (1) {
+		u64 res = atomic64_read(&net->net_cookie);
+
+		if (res)
+			return res;
+		res = atomic64_inc_return(&cookie_gen);
+		atomic64_cmpxchg(&net->net_cookie, 0, res);
+	}
+}
+
 static struct net_generic *net_alloc_generic(void)
 {
 	struct net_generic *ng;
@@ -1087,6 +1101,7 @@ static int __init net_ns_init(void)
 		panic("Could not allocate generic netns");

 	rcu_assign_pointer(init_net.gen, ng);
+	net_gen_cookie(&init_net);

 	down_write(&pernet_ops_rwsem);
 	if (setup_net(&init_net, &init_user_ns))
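The lazy, write-once assignment in `net_gen_cookie()` can be reproduced in userspace with C11 atomics. This is a sketch of the same pattern, not the kernel code: `demo_net`, `demo_cookie_gen`, and `demo_net_gen_cookie()` are hypothetical names, and `<stdatomic.h>` operations stand in for the kernel's `atomic64_*` API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct demo_net {
	_Atomic uint64_t net_cookie;	/* 0 means "not assigned yet" */
};

/* Global counter; zero-initialized because it has static storage. */
static _Atomic uint64_t demo_cookie_gen;

static uint64_t demo_net_gen_cookie(struct demo_net *net)
{
	for (;;) {
		uint64_t res = atomic_load(&net->net_cookie);
		uint64_t expected = 0;

		if (res)
			return res;
		/* fetch_add returns the old value; add 1 to mirror inc_return. */
		res = atomic_fetch_add(&demo_cookie_gen, 1) + 1;
		/* Only the first writer wins; a loser re-reads on the next pass. */
		atomic_compare_exchange_strong(&net->net_cookie, &expected, res);
	}
}

static void example_usage(void)
{
	static struct demo_net a, b;	/* static storage: cookies start at 0 */
	uint64_t ca = demo_net_gen_cookie(&a);

	assert(ca != 0);			/* 0 is reserved for "unset" */
	assert(demo_net_gen_cookie(&a) == ca);	/* stable once assigned */
	assert(demo_net_gen_cookie(&b) != ca);	/* distinct per namespace */
}
```

If two threads race on the same namespace, the loser's counter value is simply discarded, which leaves a gap in the sequence but keeps every assigned cookie unique. Calling the function once at init time, as the second hunk does for `init_net`, fixes that namespace's cookie before any other caller can ask for it.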

tools/include/uapi/linux/bpf.h

Lines changed: 15 additions & 1 deletion
@@ -2950,6 +2950,19 @@ union bpf_attr {
  *		restricted to raw_tracepoint bpf programs.
  *	Return
  *		0 on success, or a negative error in case of failure.
+ *
+ * u64 bpf_get_netns_cookie(void *ctx)
+ * 	Description
+ * 		Retrieve the cookie (generated by the kernel) of the network
+ * 		namespace the input *ctx* is associated with. The network
+ * 		namespace cookie remains stable for its lifetime and provides
+ * 		a global identifier that can be assumed unique. If *ctx* is
+ * 		NULL, then the helper returns the cookie for the initial
+ * 		network namespace. The cookie itself is very similar to that
+ * 		of bpf_get_socket_cookie() helper, but for network namespaces
+ * 		instead of sockets.
+ * 	Return
+ * 		A 8-byte long opaque number.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3073,7 +3086,8 @@ union bpf_attr {
 	FN(jiffies64),			\
 	FN(read_branch_records),	\
 	FN(get_ns_current_pid_tgid),	\
-	FN(xdp_output),
+	FN(xdp_output),			\
+	FN(get_netns_cookie),

 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
