Commit 7e22561

Merge branch 'vrf-allow-simultaneous-service-instances-in-default-and-other-VRFs'
Mike Manning says:

====================
vrf: allow simultaneous service instances in default and other VRFs

Services currently have to be VRF-aware if they are using an unbound
socket. One cannot have multiple service instances running in the
default and other VRFs for services that are not VRF-aware and listen
on an unbound socket. This is because there is no easy way of isolating
packets received in the default VRF from those arriving in other VRFs.

This series provides this isolation for stream sockets subject to the
existing kernel parameter net.ipv4.tcp_l3mdev_accept not being set,
given that this is documented as allowing a single service instance to
work across all VRF domains. Similarly, net.ipv4.udp_l3mdev_accept is
checked for datagram sockets, and net.ipv4.raw_l3mdev_accept is
introduced for raw sockets. The functionality applies to UDP & TCP
services as well as those using raw sockets, and is for IPv4 and IPv6.

Example of running ssh instances in default and blue VRF:

$ /usr/sbin/sshd -D
$ ip vrf exec vrf-blue /usr/sbin/sshd
$ ss -ta | egrep 'State|ssh'
State   Recv-Q  Send-Q  Local Address:Port       Peer Address:Port
LISTEN  0       128     0.0.0.0%vrf-blue:ssh     0.0.0.0:*
LISTEN  0       128     0.0.0.0:ssh              0.0.0.0:*
ESTAB   0       0       192.168.122.220:ssh      192.168.122.1:50282
LISTEN  0       128     [::]%vrf-blue:ssh        [::]:*
LISTEN  0       128     [::]:ssh                 [::]:*
ESTAB   0       0       [3000::2]%vrf-blue:ssh   [3000::9]:45896
ESTAB   0       0       [2000::2]:ssh            [2000::9]:46398

v1:
   - Address Paolo Abeni's comments (patch 4/5)
   - Fix build when CONFIG_NET_L3_MASTER_DEV not defined (patch 1/5)

v2:
   - Address David Ahern's comments (patches 4/5 and 5/5)
   - Remove patches 3/5 and 5/5 from series for individual submissions
   - Include a sysctl for raw sockets as recommended by David Ahern
   - Expand series into 10 patches and provide improved descriptions

v3:
   - Update description for patch 1/10 and remove patch 6/10

v4:
   - Set default to enabled for raw socket sysctl as recommended by
     David Ahern

v5:
   - Address review comments from David Ahern in patches 2-5
====================

Signed-off-by: David S. Miller <[email protected]>
2 parents f601a85 + 7bd2db4 commit 7e22561

22 files changed: +243 -84 lines changed


Documentation/networking/ip-sysctl.txt

Lines changed: 12 additions & 0 deletions
@@ -370,6 +370,7 @@ tcp_l3mdev_accept - BOOLEAN
 	derived from the listen socket to be bound to the L3 domain in
 	which the packets originated. Only valid when the kernel was
 	compiled with CONFIG_NET_L3_MASTER_DEV.
+	Default: 0 (disabled)
 
 tcp_low_latency - BOOLEAN
 	This is a legacy option, it has no effect anymore.
@@ -773,6 +774,7 @@ udp_l3mdev_accept - BOOLEAN
 	being received regardless of the L3 domain in which they
 	originated. Only valid when the kernel was compiled with
 	CONFIG_NET_L3_MASTER_DEV.
+	Default: 0 (disabled)
 
 udp_mem - vector of 3 INTEGERs: min, pressure, max
 	Number of pages allowed for queueing by all UDP sockets.
@@ -799,6 +801,16 @@ udp_wmem_min - INTEGER
 	total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
 	Default: 4K
 
+RAW variables:
+
+raw_l3mdev_accept - BOOLEAN
+	Enabling this option allows a "global" bound socket to work
+	across L3 master domains (e.g., VRFs) with packets capable of
+	being received regardless of the L3 domain in which they
+	originated. Only valid when the kernel was compiled with
+	CONFIG_NET_L3_MASTER_DEV.
+	Default: 1 (enabled)
+
 CIPSOv4 Variables:
 
 cipso_cache_enable - BOOLEAN

Documentation/networking/vrf.txt

Lines changed: 18 additions & 4 deletions
@@ -103,19 +103,33 @@ VRF device:
 
    or to specify the output device using cmsg and IP_PKTINFO.
 
+   By default the scope of the port bindings for unbound sockets is
+   limited to the default VRF. That is, it will not be matched by packets
+   arriving on interfaces enslaved to an l3mdev and processes may bind to
+   the same port if they bind to an l3mdev.
+
    TCP & UDP services running in the default VRF context (ie., not bound
    to any VRF device) can work across all VRF domains by enabling the
    tcp_l3mdev_accept and udp_l3mdev_accept sysctl options:
+
        sysctl -w net.ipv4.tcp_l3mdev_accept=1
        sysctl -w net.ipv4.udp_l3mdev_accept=1
 
+   These options are disabled by default so that a socket in a VRF is only
+   selected for packets in that VRF. There is a similar option for RAW
+   sockets, which is enabled by default for reasons of backwards compatibility.
+   This is so as to specify the output device with cmsg and IP_PKTINFO, but
+   using a socket not bound to the corresponding VRF. This allows e.g. older ping
+   implementations to be run with specifying the device but without executing it
+   in the VRF. This option can be disabled so that packets received in a VRF
+   context are only handled by a raw socket bound to the VRF, and packets in the
+   default VRF are only handled by a socket not bound to any VRF:
+
+       sysctl -w net.ipv4.raw_l3mdev_accept=0
+
    netfilter rules on the VRF device can be used to limit access to services
    running in the default VRF context as well.
 
-   The default VRF does not have limited scope with respect to port bindings.
-   That is, if a process does a wildcard bind to a port in the default VRF it
-   owns the port across all VRF domains within the network namespace.
-
 ################################################################################
 
 Using iproute2 for VRFs
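
The three switches described in the vrf.txt hunk above can also be inspected from a program rather than with sysctl(8). The following is a minimal sketch, assuming the usual mapping of net.ipv4.<name> to /proc/sys/net/ipv4/<name>; the helper name read_l3mdev_accept is hypothetical and is not part of this commit.

```c
#include <stdio.h>

/* Hypothetical helper, not part of the patch series.
 * proto is "tcp", "udp" or "raw".
 * Returns 1 if the sysctl is non-zero, 0 if it is zero, and -1 if the
 * file cannot be opened or parsed (for example when it does not exist).
 */
static int read_l3mdev_accept(const char *proto)
{
	char path[64];
	FILE *f;
	int val;

	snprintf(path, sizeof(path),
		 "/proc/sys/net/ipv4/%s_l3mdev_accept", proto);

	f = fopen(path, "r");
	if (!f)
		return -1;

	if (fscanf(f, "%d", &val) != 1) {
		fclose(f);
		return -1;
	}

	fclose(f);
	return val ? 1 : 0;
}
```

A service that relies on per-VRF isolation could call read_l3mdev_accept("tcp") at startup and log a warning if the result is 1, since in that case an unbound listener in the default VRF also accepts connections arriving in other VRFs.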

drivers/net/vrf.c

Lines changed: 9 additions & 10 deletions
@@ -981,24 +981,23 @@ static struct sk_buff *vrf_ip6_rcv(struct net_device *vrf_dev,
 				   struct sk_buff *skb)
 {
 	int orig_iif = skb->skb_iif;
-	bool need_strict;
+	bool need_strict = rt6_need_strict(&ipv6_hdr(skb)->daddr);
+	bool is_ndisc = ipv6_ndisc_frame(skb);
 
-	/* loopback traffic; do not push through packet taps again.
-	 * Reset pkt_type for upper layers to process skb
+	/* loopback, multicast & non-ND link-local traffic; do not push through
+	 * packet taps again. Reset pkt_type for upper layers to process skb
 	 */
-	if (skb->pkt_type == PACKET_LOOPBACK) {
+	if (skb->pkt_type == PACKET_LOOPBACK || (need_strict && !is_ndisc)) {
 		skb->dev = vrf_dev;
 		skb->skb_iif = vrf_dev->ifindex;
 		IP6CB(skb)->flags |= IP6SKB_L3SLAVE;
-		skb->pkt_type = PACKET_HOST;
+		if (skb->pkt_type == PACKET_LOOPBACK)
+			skb->pkt_type = PACKET_HOST;
 		goto out;
 	}
 
-	/* if packet is NDISC or addressed to multicast or link-local
-	 * then keep the ingress interface
-	 */
-	need_strict = rt6_need_strict(&ipv6_hdr(skb)->daddr);
-	if (!ipv6_ndisc_frame(skb) && !need_strict) {
+	/* if packet is NDISC then keep the ingress interface */
+	if (!is_ndisc) {
 		vrf_rx_stats(vrf_dev, skb->len);
 		skb->dev = vrf_dev;
 		skb->skb_iif = vrf_dev->ifindex;
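
The new branch order in vrf_ip6_rcv() may be easier to follow as a pure decision function. The sketch below is a userspace model only: the enum, its values and the name toy_vrf_ip6_rcv are hypothetical, and the three booleans stand in for skb->pkt_type == PACKET_LOOPBACK, rt6_need_strict() and ipv6_ndisc_frame() from the hunk above.

```c
#include <stdbool.h>

/* Hypothetical labels for the three outcomes visible in the hunk. */
enum toy_rcv_action {
	TOY_TO_VRF_NO_TAPS,	/* first branch: rebind skb to the VRF device, goto out */
	TOY_TO_VRF_COUNTED,	/* second branch: vrf_rx_stats(), then rebind to the VRF device */
	TOY_KEEP_INGRESS,	/* neither branch: skb->dev stays the ingress interface */
};

/* Models the branch order after the patch; it does not touch any skb. */
static enum toy_rcv_action toy_vrf_ip6_rcv(bool is_loopback, bool need_strict,
					   bool is_ndisc)
{
	if (is_loopback || (need_strict && !is_ndisc))
		return TOY_TO_VRF_NO_TAPS;

	if (!is_ndisc)
		return TOY_TO_VRF_COUNTED;

	return TOY_KEEP_INGRESS;
}
```

Before the patch, a multicast or link-local packet that was not neighbour discovery (need_strict set, is_ndisc clear) kept its ingress interface; in this model it now returns TOY_TO_VRF_NO_TAPS, and only neighbour discovery frames return TOY_KEEP_INGRESS.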

include/net/inet6_hashtables.h

Lines changed: 2 additions & 3 deletions
@@ -115,9 +115,8 @@ int inet6_hash(struct sock *sk);
 	 ((__sk)->sk_family == AF_INET6) && \
 	 ipv6_addr_equal(&(__sk)->sk_v6_daddr, (__saddr)) && \
 	 ipv6_addr_equal(&(__sk)->sk_v6_rcv_saddr, (__daddr)) && \
-	 (!(__sk)->sk_bound_dev_if || \
-	  ((__sk)->sk_bound_dev_if == (__dif)) || \
-	  ((__sk)->sk_bound_dev_if == (__sdif))) && \
+	 (((__sk)->sk_bound_dev_if == (__dif)) || \
+	  ((__sk)->sk_bound_dev_if == (__sdif))) && \
 	 net_eq(sock_net(__sk), (__net)))
 
 #endif /* _INET6_HASHTABLES_H */

include/net/inet_hashtables.h

Lines changed: 17 additions & 7 deletions
@@ -79,6 +79,7 @@ struct inet_ehash_bucket {
 
 struct inet_bind_bucket {
 	possible_net_t ib_net;
+	int l3mdev;
 	unsigned short port;
 	signed char fastreuse;
 	signed char fastreuseport;
@@ -188,10 +189,21 @@ static inline void inet_ehash_locks_free(struct inet_hashinfo *hashinfo)
 	hashinfo->ehash_locks = NULL;
 }
 
+static inline bool inet_sk_bound_dev_eq(struct net *net, int bound_dev_if,
+					int dif, int sdif)
+{
+#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
+	return inet_bound_dev_eq(!!net->ipv4.sysctl_tcp_l3mdev_accept,
+				 bound_dev_if, dif, sdif);
+#else
+	return inet_bound_dev_eq(true, bound_dev_if, dif, sdif);
+#endif
+}
+
 struct inet_bind_bucket *
 inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net,
 			struct inet_bind_hashbucket *head,
-			const unsigned short snum);
+			const unsigned short snum, int l3mdev);
 void inet_bind_bucket_destroy(struct kmem_cache *cachep,
 			      struct inet_bind_bucket *tb);
 
@@ -282,9 +294,8 @@ static inline struct sock *inet_lookup_listener(struct net *net,
 #define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif, __sdif) \
 	(((__sk)->sk_portpair == (__ports)) && \
 	 ((__sk)->sk_addrpair == (__cookie)) && \
-	 (!(__sk)->sk_bound_dev_if || \
-	  ((__sk)->sk_bound_dev_if == (__dif)) || \
-	  ((__sk)->sk_bound_dev_if == (__sdif))) && \
+	 (((__sk)->sk_bound_dev_if == (__dif)) || \
+	  ((__sk)->sk_bound_dev_if == (__sdif))) && \
 	 net_eq(sock_net(__sk), (__net)))
 #else /* 32-bit arch */
 #define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
@@ -294,9 +305,8 @@ static inline struct sock *inet_lookup_listener(struct net *net,
 	(((__sk)->sk_portpair == (__ports)) && \
 	 ((__sk)->sk_daddr == (__saddr)) && \
 	 ((__sk)->sk_rcv_saddr == (__daddr)) && \
-	 (!(__sk)->sk_bound_dev_if || \
-	  ((__sk)->sk_bound_dev_if == (__dif)) || \
-	  ((__sk)->sk_bound_dev_if == (__sdif))) && \
+	 (((__sk)->sk_bound_dev_if == (__dif)) || \
+	  ((__sk)->sk_bound_dev_if == (__sdif))) && \
 	 net_eq(sock_net(__sk), (__net)))
 #endif /* 64-bit arch */
 

include/net/inet_sock.h

Lines changed: 21 additions & 0 deletions
@@ -130,6 +130,27 @@ static inline int inet_request_bound_dev_if(const struct sock *sk,
 	return sk->sk_bound_dev_if;
 }
 
+static inline int inet_sk_bound_l3mdev(const struct sock *sk)
+{
+#ifdef CONFIG_NET_L3_MASTER_DEV
+	struct net *net = sock_net(sk);
+
+	if (!net->ipv4.sysctl_tcp_l3mdev_accept)
+		return l3mdev_master_ifindex_by_index(net,
+						      sk->sk_bound_dev_if);
+#endif
+
+	return 0;
+}
+
+static inline bool inet_bound_dev_eq(bool l3mdev_accept, int bound_dev_if,
+				     int dif, int sdif)
+{
+	if (!bound_dev_if)
+		return !sdif || l3mdev_accept;
+	return bound_dev_if == dif || bound_dev_if == sdif;
+}
+
 struct inet_cork {
 	unsigned int flags;
 	__be32 addr;
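
The matching rule introduced by inet_bound_dev_eq() has no kernel dependencies, so it can be tried in userspace. The sketch below copies the helper body from the hunk above unchanged; the comments on dif and sdif reflect one reading of how the callers use them and are an assumption, as are the ifindex values in the usage note.

```c
#include <stdbool.h>

/* Userspace copy of the helper added to include/net/inet_sock.h.
 *
 * l3mdev_accept: value of the relevant *_l3mdev_accept sysctl
 * bound_dev_if:  ifindex the socket is bound to, 0 if unbound
 * dif, sdif:     the two ifindexes derived from the packet; sdif is
 *                assumed to be 0 when the packet did not arrive through
 *                an l3mdev (VRF)
 */
static inline bool inet_bound_dev_eq(bool l3mdev_accept, int bound_dev_if,
				     int dif, int sdif)
{
	/* Unbound socket: match default-VRF packets, or any packet when
	 * the sysctl allows one socket to span all L3 domains.
	 */
	if (!bound_dev_if)
		return !sdif || l3mdev_accept;

	/* Bound socket: match only packets carrying that ifindex. */
	return bound_dev_if == dif || bound_dev_if == sdif;
}
```

With hypothetical ifindexes 10 and 3 for a packet received through a VRF, an unbound socket matches only when l3mdev_accept is true, while a socket bound to ifindex 10 or 3 matches regardless of the sysctl. A packet with sdif equal to 0 is matched by the unbound socket in both cases.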

include/net/netns/ipv4.h

Lines changed: 3 additions & 0 deletions
@@ -103,6 +103,9 @@ struct netns_ipv4 {
 	/* Shall we try to damage output packets if routing dev changes? */
 	int sysctl_ip_dynaddr;
 	int sysctl_ip_early_demux;
+#ifdef CONFIG_NET_L3_MASTER_DEV
+	int sysctl_raw_l3mdev_accept;
+#endif
 	int sysctl_tcp_early_demux;
 	int sysctl_udp_early_demux;
 

include/net/raw.h

Lines changed: 13 additions & 1 deletion
@@ -17,7 +17,7 @@
 #ifndef _RAW_H
 #define _RAW_H
 
-
+#include <net/inet_sock.h>
 #include <net/protocol.h>
 #include <linux/icmp.h>
 
@@ -61,6 +61,7 @@ void raw_seq_stop(struct seq_file *seq, void *v);
 
 int raw_hash_sk(struct sock *sk);
 void raw_unhash_sk(struct sock *sk);
+void raw_init(void);
 
 struct raw_sock {
 	/* inet_sock has to be the first member */
@@ -74,4 +75,15 @@ static inline struct raw_sock *raw_sk(const struct sock *sk)
 	return (struct raw_sock *)sk;
 }
 
+static inline bool raw_sk_bound_dev_eq(struct net *net, int bound_dev_if,
+				       int dif, int sdif)
+{
+#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
+	return inet_bound_dev_eq(!!net->ipv4.sysctl_raw_l3mdev_accept,
+				 bound_dev_if, dif, sdif);
+#else
+	return inet_bound_dev_eq(true, bound_dev_if, dif, sdif);
+#endif
+}
+
 #endif /* _RAW_H */

include/net/udp.h

Lines changed: 11 additions & 0 deletions
@@ -252,6 +252,17 @@ static inline int udp_rqueue_get(struct sock *sk)
 	return sk_rmem_alloc_get(sk) - READ_ONCE(udp_sk(sk)->forward_deficit);
 }
 
+static inline bool udp_sk_bound_dev_eq(struct net *net, int bound_dev_if,
+				       int dif, int sdif)
+{
+#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
+	return inet_bound_dev_eq(!!net->ipv4.sysctl_udp_l3mdev_accept,
+				 bound_dev_if, dif, sdif);
+#else
+	return inet_bound_dev_eq(true, bound_dev_if, dif, sdif);
+#endif
+}
+
 /* net/ipv4/udp.c */
 void udp_destruct_sock(struct sock *sk);
 void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len);

net/core/sock.c

Lines changed: 2 additions & 0 deletions
@@ -567,6 +567,8 @@ static int sock_setbindtodevice(struct sock *sk, char __user *optval,
 
 	lock_sock(sk);
 	sk->sk_bound_dev_if = index;
+	if (sk->sk_prot->rehash)
+		sk->sk_prot->rehash(sk);
 	sk_dst_reset(sk);
 	release_sock(sk);
 

net/ipv4/af_inet.c

Lines changed: 2 additions & 0 deletions
@@ -1964,6 +1964,8 @@ static int __init inet_init(void)
 	/* Add UDP-Lite (RFC 3828) */
 	udplite4_register();
 
+	raw_init();
+
 	ping_init();
 
 	/*

net/ipv4/inet_connection_sock.c

Lines changed: 10 additions & 3 deletions
@@ -183,7 +183,9 @@ inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *
 	int i, low, high, attempt_half;
 	struct inet_bind_bucket *tb;
 	u32 remaining, offset;
+	int l3mdev;
 
+	l3mdev = inet_sk_bound_l3mdev(sk);
 	attempt_half = (sk->sk_reuse == SK_CAN_REUSE) ? 1 : 0;
 other_half_scan:
 	inet_get_local_port_range(net, &low, &high);
@@ -219,7 +221,8 @@ inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *
 						  hinfo->bhash_size)];
 		spin_lock_bh(&head->lock);
 		inet_bind_bucket_for_each(tb, &head->chain)
-			if (net_eq(ib_net(tb), net) && tb->port == port) {
+			if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
+			    tb->port == port) {
 				if (!inet_csk_bind_conflict(sk, tb, false, false))
 					goto success;
 				goto next_port;
@@ -293,6 +296,9 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 	struct net *net = sock_net(sk);
 	struct inet_bind_bucket *tb = NULL;
 	kuid_t uid = sock_i_uid(sk);
+	int l3mdev;
+
+	l3mdev = inet_sk_bound_l3mdev(sk);
 
 	if (!port) {
 		head = inet_csk_find_open_port(sk, &tb, &port);
@@ -306,11 +312,12 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 					       hinfo->bhash_size)];
 	spin_lock_bh(&head->lock);
 	inet_bind_bucket_for_each(tb, &head->chain)
-		if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev)
+		if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
+		    tb->port == port)
 			goto tb_found;
 tb_not_found:
 	tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep,
-				     net, head, port);
+				     net, head, port, l3mdev);
 	if (!tb)
 		goto fail_unlock;
 tb_found:
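
The effect of adding tb->l3mdev to the bucket lookup can be illustrated with a toy table keyed on the pair (l3mdev, port). This is a deliberately simplified userspace sketch: every toy_* name is hypothetical, the table is a flat array rather than the kernel's hash chains, and any existing bucket is treated as a conflict, whereas the kernel goes on to call inet_csk_bind_conflict() to apply its reuse rules.

```c
#include <stddef.h>

#define TOY_MAX_BUCKETS 16

/* Hypothetical stand-in for struct inet_bind_bucket. */
struct toy_bind_bucket {
	int l3mdev;		/* 0 for the default VRF, else a VRF ifindex */
	unsigned short port;
};

struct toy_bind_table {
	struct toy_bind_bucket buckets[TOY_MAX_BUCKETS];
	size_t count;
};

/* Returns 0 if the (l3mdev, port) pair was free and is now recorded,
 * -1 if the pair is already taken or the table is full.
 */
static int toy_bind(struct toy_bind_table *t, unsigned short port, int l3mdev)
{
	size_t i;

	/* Same comparison as the patched lookup: both fields must match. */
	for (i = 0; i < t->count; i++)
		if (t->buckets[i].l3mdev == l3mdev &&
		    t->buckets[i].port == port)
			return -1;

	if (t->count == TOY_MAX_BUCKETS)
		return -1;

	t->buckets[t->count].l3mdev = l3mdev;
	t->buckets[t->count].port = port;
	t->count++;
	return 0;
}
```

In this model, binding port 22 once with l3mdev 0 and once with a hypothetical VRF ifindex of 10 succeeds both times, which corresponds to the two sshd listeners in the commit message; a second bind of port 22 within the same l3mdev fails.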
