
Commit bd1060a

htejun authored and davem330 committed
sock, cgroup: add sock->sk_cgroup
In cgroup v1, dealing with cgroup membership was difficult because the number of membership associations was unbound. As a result, cgroup v1 grew several controllers whose primary purpose is either tagging membership or pulling in configuration knobs from other subsystems so that cgroup membership tests can be avoided.

net_cls and net_prio controllers are examples of the latter. They allow configuring network-specific attributes from the cgroup side so that the network subsystem can avoid testing cgroup membership; unfortunately, these are not only cumbersome but also problematic.

Both net_cls and net_prio aren't properly hierarchical. Both inherit configuration from the parent on creation but there's no interaction afterwards. An ancestor doesn't restrict the behavior in its subtree in any way and configuration changes aren't propagated downwards. Especially when combined with cgroup delegation, this is problematic because delegatees can mess up whatever network configuration is implemented at the system level: net_prio would allow the delegatees to set whatever priority value regardless of CAP_NET_ADMIN, and net_cls the same for classid.

While it is possible to solve these issues from the controller side by implementing hierarchical allowable ranges in both controllers, it would involve quite a bit of complexity in the controllers and further obfuscate network configuration, as it becomes even more difficult to tell what's actually being configured when looking from the network side. While not much can be done for v1 at this point, as membership handling is sane on cgroup v2, it'd be better to make cgroup matching behave like other network matches and classifiers than to introduce further complications.

In preparation, this patch updates sock->sk_cgrp_data handling so that it points to the v2 cgroup that the sock was created in until either net_prio or net_cls is used. Once either of the two is used, sock->sk_cgrp_data reverts to its previous role of carrying prioidx and classid. This avoids adding yet another cgroup-related field to struct sock.

As the mode switch can happen at most once per boot, the switching mechanism is aimed at lowering hot-path overhead. It may leak a finite, likely small, number of cgroup refs and report spurious prioidx or classid on switching; however, dynamic updates of prioidx and classid have always been racy and lossy: socks between creation and fd installation are never updated, config changes don't update existing sockets at all, and prioidx may index with dead and recycled cgroup IDs. Non-critical inaccuracies from small race windows won't make any noticeable difference.

This patch doesn't make use of the pointer yet. The following patch will implement the netfilter match for cgroup2 membership.

v2: Use sock_cgroup_data to avoid inflating struct sock w/ another cgroup-specific field.

v3: Add comments explaining why sock_data_prioidx() and sock_data_classid() use different fallback values.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: Daniel Wagner <[email protected]>
CC: Neil Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
1 parent 2a56a1f commit bd1060a

File tree

6 files changed, +191 -9 lines changed


include/linux/cgroup-defs.h

Lines changed: 82 additions & 6 deletions
@@ -544,31 +544,107 @@ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) {}
 
 #ifdef CONFIG_SOCK_CGROUP_DATA
 
+/*
+ * sock_cgroup_data is embedded at sock->sk_cgrp_data and contains
+ * per-socket cgroup information except for memcg association.
+ *
+ * On legacy hierarchies, net_prio and net_cls controllers directly set
+ * attributes on each sock which can then be tested by the network layer.
+ * On the default hierarchy, each sock is associated with the cgroup it was
+ * created in and the networking layer can match the cgroup directly.
+ *
+ * To avoid carrying all three cgroup related fields separately in sock,
+ * sock_cgroup_data overloads (prioidx, classid) and the cgroup pointer.
+ * On boot, sock_cgroup_data records the cgroup that the sock was created
+ * in so that cgroup2 matches can be made; however, once either net_prio or
+ * net_cls starts being used, the area is overridden to carry prioidx and/or
+ * classid.  The two modes are distinguished by whether the lowest bit is
+ * set.  A clear bit indicates a cgroup pointer while a set bit indicates
+ * prioidx and classid.
+ *
+ * While userland may start using net_prio or net_cls at any time, once
+ * either is used, cgroup2 matching no longer works.  There is no reason to
+ * mix the two and this is in line with how legacy and v2 compatibility is
+ * handled.  On mode switch, cgroup references which are already being
+ * pointed to by socks may be leaked.  While this can be remedied by adding
+ * synchronization around sock_cgroup_data, given that the number of leaked
+ * cgroups is bound and highly unlikely to be high, this seems to be the
+ * better trade-off.
+ */
 struct sock_cgroup_data {
-	u16	prioidx;
-	u32	classid;
+	union {
+#ifdef __LITTLE_ENDIAN
+		struct {
+			u8	is_data;
+			u8	padding;
+			u16	prioidx;
+			u32	classid;
+		} __packed;
+#else
+		struct {
+			u32	classid;
+			u16	prioidx;
+			u8	padding;
+			u8	is_data;
+		} __packed;
+#endif
+		u64		val;
+	};
 };
 
+/*
+ * There's a theoretical window where the following accessors race with
+ * updaters and return part of the previous pointer as the prioidx or
+ * classid.  Such races are short-lived and the result isn't critical.
+ */
 static inline u16 sock_cgroup_prioidx(struct sock_cgroup_data *skcd)
 {
-	return skcd->prioidx;
+	/* fallback to 1 which is always the ID of the root cgroup */
+	return (skcd->is_data & 1) ? skcd->prioidx : 1;
 }
 
 static inline u32 sock_cgroup_classid(struct sock_cgroup_data *skcd)
 {
-	return skcd->classid;
+	/* fallback to 0 which is the unconfigured default classid */
+	return (skcd->is_data & 1) ? skcd->classid : 0;
 }
 
+/*
+ * If invoked concurrently, the updaters may clobber each other.  The
+ * caller is responsible for synchronization.
+ */
 static inline void sock_cgroup_set_prioidx(struct sock_cgroup_data *skcd,
					   u16 prioidx)
 {
-	skcd->prioidx = prioidx;
+	struct sock_cgroup_data skcd_buf = { .val = READ_ONCE(skcd->val) };
+
+	if (sock_cgroup_prioidx(&skcd_buf) == prioidx)
+		return;
+
+	if (!(skcd_buf.is_data & 1)) {
+		skcd_buf.val = 0;
+		skcd_buf.is_data = 1;
+	}
+
+	skcd_buf.prioidx = prioidx;
+	WRITE_ONCE(skcd->val, skcd_buf.val);	/* see sock_cgroup_ptr() */
 }
 
 static inline void sock_cgroup_set_classid(struct sock_cgroup_data *skcd,
					   u32 classid)
 {
-	skcd->classid = classid;
+	struct sock_cgroup_data skcd_buf = { .val = READ_ONCE(skcd->val) };
+
+	if (sock_cgroup_classid(&skcd_buf) == classid)
+		return;
+
+	if (!(skcd_buf.is_data & 1)) {
+		skcd_buf.val = 0;
+		skcd_buf.is_data = 1;
+	}
+
+	skcd_buf.classid = classid;
+	WRITE_ONCE(skcd->val, skcd_buf.val);	/* see sock_cgroup_ptr() */
 }
 
 #else	/* CONFIG_SOCK_CGROUP_DATA */

include/linux/cgroup.h

Lines changed: 41 additions & 0 deletions
@@ -578,4 +578,45 @@ static inline int cgroup_init(void) { return 0; }
 
 #endif	/* !CONFIG_CGROUPS */
 
+/*
+ * sock->sk_cgrp_data handling.  For more info, see sock_cgroup_data
+ * definition in cgroup-defs.h.
+ */
+#ifdef CONFIG_SOCK_CGROUP_DATA
+
+#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
+extern spinlock_t cgroup_sk_update_lock;
+#endif
+
+void cgroup_sk_alloc_disable(void);
+void cgroup_sk_alloc(struct sock_cgroup_data *skcd);
+void cgroup_sk_free(struct sock_cgroup_data *skcd);
+
+static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
+{
+#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
+	unsigned long v;
+
+	/*
+	 * @skcd->val is 64bit but the following is safe on 32bit too as we
+	 * just need the lower ulong to be written and read atomically.
+	 */
+	v = READ_ONCE(skcd->val);
+
+	if (v & 1)
+		return &cgrp_dfl_root.cgrp;
+
+	return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;
+#else
+	return (struct cgroup *)(unsigned long)skcd->val;
+#endif
+}
+
+#else	/* CONFIG_SOCK_CGROUP_DATA */
+
+static inline void cgroup_sk_alloc(struct sock_cgroup_data *skcd) {}
+static inline void cgroup_sk_free(struct sock_cgroup_data *skcd) {}
+
+#endif	/* CONFIG_SOCK_CGROUP_DATA */
+
 #endif /* _LINUX_CGROUP_H */

kernel/cgroup.c

Lines changed: 54 additions & 1 deletion
@@ -57,8 +57,8 @@
 #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
 #include <linux/kthread.h>
 #include <linux/delay.h>
-
 #include <linux/atomic.h>
+#include <net/sock.h>
 
 /*
  * pidlists linger the following amount before being destroyed.  The goal
@@ -5782,6 +5782,59 @@ struct cgroup *cgroup_get_from_path(const char *path)
 }
 EXPORT_SYMBOL_GPL(cgroup_get_from_path);
 
+/*
+ * sock->sk_cgrp_data handling.  For more info, see sock_cgroup_data
+ * definition in cgroup-defs.h.
+ */
+#ifdef CONFIG_SOCK_CGROUP_DATA
+
+#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
+
+spinlock_t cgroup_sk_update_lock;
+static bool cgroup_sk_alloc_disabled __read_mostly;
+
+void cgroup_sk_alloc_disable(void)
+{
+	if (cgroup_sk_alloc_disabled)
+		return;
+	pr_info("cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation\n");
+	cgroup_sk_alloc_disabled = true;
+}
+
+#else
+
+#define cgroup_sk_alloc_disabled	false
+
+#endif
+
+void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
+{
+	if (cgroup_sk_alloc_disabled)
+		return;
+
+	rcu_read_lock();
+
+	while (true) {
+		struct css_set *cset;
+
+		cset = task_css_set(current);
+		if (likely(cgroup_tryget(cset->dfl_cgrp))) {
+			skcd->val = (unsigned long)cset->dfl_cgrp;
+			break;
+		}
+		cpu_relax();
+	}
+
+	rcu_read_unlock();
+}
+
+void cgroup_sk_free(struct sock_cgroup_data *skcd)
+{
+	cgroup_put(sock_cgroup_ptr(skcd));
+}
+
+#endif	/* CONFIG_SOCK_CGROUP_DATA */
+
 #ifdef CONFIG_CGROUP_DEBUG
 static struct cgroup_subsys_state *
 debug_css_alloc(struct cgroup_subsys_state *parent_css)

net/core/netclassid_cgroup.c

Lines changed: 6 additions & 1 deletion
@@ -61,9 +61,12 @@ static int update_classid_sock(const void *v, struct file *file, unsigned n)
	int err;
	struct socket *sock = sock_from_file(file, &err);
 
-	if (sock)
+	if (sock) {
+		spin_lock(&cgroup_sk_update_lock);
		sock_cgroup_set_classid(&sock->sk->sk_cgrp_data,
					(unsigned long)v);
+		spin_unlock(&cgroup_sk_update_lock);
+	}
	return 0;
 }
 
@@ -98,6 +101,8 @@ static int write_classid(struct cgroup_subsys_state *css, struct cftype *cft,
 {
	struct cgroup_cls_state *cs = css_cls_state(css);
 
+	cgroup_sk_alloc_disable();
+
	cs->classid = (u32)value;
 
	update_classid(css, (void *)(unsigned long)cs->classid);

net/core/netprio_cgroup.c

Lines changed: 6 additions & 1 deletion
@@ -209,6 +209,8 @@ static ssize_t write_priomap(struct kernfs_open_file *of,
	if (!dev)
		return -ENODEV;
 
+	cgroup_sk_alloc_disable();
+
	rtnl_lock();
 
	ret = netprio_set_prio(of_css(of), dev, prio);
@@ -222,9 +224,12 @@ static int update_netprio(const void *v, struct file *file, unsigned n)
 {
	int err;
	struct socket *sock = sock_from_file(file, &err);
-	if (sock)
+	if (sock) {
+		spin_lock(&cgroup_sk_update_lock);
		sock_cgroup_set_prioidx(&sock->sk->sk_cgrp_data,
					(unsigned long)v);
+		spin_unlock(&cgroup_sk_update_lock);
+	}
	return 0;
 }
 
230235

net/core/sock.c

Lines changed: 2 additions & 0 deletions
@@ -1363,6 +1363,7 @@ static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
		if (!try_module_get(prot->owner))
			goto out_free_sec;
		sk_tx_queue_clear(sk);
+		cgroup_sk_alloc(&sk->sk_cgrp_data);
	}
 
	return sk;
@@ -1385,6 +1386,7 @@ static void sk_prot_free(struct proto *prot, struct sock *sk)
	owner = prot->owner;
	slab = prot->slab;
 
+	cgroup_sk_free(&sk->sk_cgrp_data);
	security_sk_free(sk);
	if (slab != NULL)
		kmem_cache_free(slab, sk);
