Commit 28d82dc

jibaron authored and torvalds committed
epoll: limit paths
The current epoll code can be tickled to run basically indefinitely in both the loop detection path check (on ep_insert()), and in the wakeup paths. The programs that tickle this behavior set up deeply linked networks of epoll file descriptors that cause the epoll algorithms to traverse them indefinitely. A couple of these sample programs have been previously posted in this thread: https://lkml.org/lkml/2011/2/25/297.

To fix the loop detection path check algorithms, I simply keep track of the epoll nodes that have already been visited. Thus, the loop detection becomes proportional to the number of epoll file descriptors and links. This dramatically decreases the run-time of the loop check algorithm. In one diabolical case I tried, it reduced the run-time from 15 minutes (all in kernel time) to 0.3 seconds.

Fixing the wakeup paths could be done at wakeup time in a similar manner, by keeping track of nodes that have already been visited, but the complexity is harder, since there can be multiple wakeups on different cpus. Thus, I've opted to limit the number of possible wakeup paths when the paths are created.

This is accomplished by noting that the end file descriptors found during the loop detection pass (from the newly added link) are actually the sources for wakeup events. I keep a list of these file descriptors and limit the number and length of the paths that emanate from these 'source file descriptors'. In the current implementation I allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of length 4 and 10 of length 5. Note that it is sufficient to check the 'source file descriptors' reachable from the newly added link, since no other 'source file descriptors' will have newly added links. This allows us to check only the wakeup paths that may have gotten too long, and not re-check all possible wakeup paths on the system.

In terms of the path limit selection, it is first worth noting that the most common case for epoll is probably the model where you have one epoll file descriptor that is monitoring n 'source file descriptors'. In this case, each 'source file descriptor' has one path of length 1. Thus, I believe that the limits I'm proposing are quite reasonable, and in fact may be too generous. I'm hoping that the proposed limits will not cause any workloads that currently work to fail.

In terms of locking, I have extended the use of the 'epmutex' to all epoll_ctl add and remove operations. Currently it's only used in a subset of the add paths. I need to hold the epmutex so that we can correctly traverse a coherent graph when checking the number of paths. I believe this additional locking is probably ok, since it's in the setup/teardown paths and doesn't affect the running paths, but it certainly is going to add some extra overhead. Also worth noting is that the epmutex was recently added to the epoll_ctl add operations in the initial path loop detection code, using the argument that it was not on a critical path.

Another thing to note here is the length of epoll chains that is allowed. Currently, eventpoll.c defines:

/* Maximum number of nesting allowed inside epoll sets */
#define EP_MAX_NESTS 4

This basically means that we are limited to a graph depth of 5 (EP_MAX_NESTS + 1). However, this limit is currently only enforced during the loop check detection code, and only when the epoll file descriptors are added in a certain order. Thus, this limit is currently easily bypassed.
The newly added check for wakeup paths strictly limits the wakeup paths to a length of 5, regardless of the order in which eps are linked together. Thus, a side effect of the new code is a more consistent enforcement of the graph depth.

Thus far, I've tested this using the sample programs previously mentioned, which now either return quickly or return -EINVAL. I've also tested using the piptest.c epoll tester, which showed no difference in performance. I've also created a number of different epoll networks and tested that they behave as expected.

I believe this solves the original diabolical test cases, while still preserving sane epoll nesting.

Signed-off-by: Jason Baron <[email protected]>
Cc: Nelson Elhage <[email protected]>
Cc: Davide Libenzi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
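For illustration, here is a minimal userspace sketch (not part of this patch) of the common case the limits are tuned for: one epoll file descriptor monitoring n 'source file descriptors', where each source has exactly one path of length 1. It also shows a cycle attempt that the loop detection pass rejects with ELOOP. The file name and the n = 100 fan-out are made up for the example:

/* epoll_paths_demo.c - illustrative only */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/epoll.h>

int main(void)
{
	struct epoll_event ev = { .events = EPOLLIN };
	int epfd = epoll_create1(0);
	int inner = epoll_create1(0);
	int pipes[100][2];
	int i;

	if (epfd < 0 || inner < 0)
		return 1;

	/* Common case: one ep watching n sources; each source gets one
	 * wakeup path of length 1, far below the 1000-path limit. */
	for (i = 0; i < 100; i++) {
		if (pipe(pipes[i]) < 0)
			return 1;
		ev.data.fd = pipes[i][0];
		if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipes[i][0], &ev) < 0)
			return 1;
	}

	/* Legal nesting: epfd also watches another ep. */
	ev.data.fd = inner;
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, inner, &ev) < 0)
		return 1;

	/* Closing the loop (inner watching epfd) is caught by the loop
	 * detection pass and fails with ELOOP. */
	ev.data.fd = epfd;
	if (epoll_ctl(inner, EPOLL_CTL_ADD, epfd, &ev) < 0)
		printf("cycle rejected: %s\n", strerror(errno));

	return 0;
}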
1 parent 2ccd4f4 commit 28d82dc

3 files changed: +211 additions, -25 deletions


fs/eventpoll.c

Lines changed: 209 additions & 25 deletions

@@ -197,6 +197,12 @@ struct eventpoll {
 
 	/* The user that created the eventpoll descriptor */
 	struct user_struct *user;
+
+	struct file *file;
+
+	/* used to optimize loop detection check */
+	int visited;
+	struct list_head visited_list_link;
 };
 
 /* Wait structure used by the poll hooks */
@@ -255,6 +261,15 @@ static struct kmem_cache *epi_cache __read_mostly;
 /* Slab cache used to allocate "struct eppoll_entry" */
 static struct kmem_cache *pwq_cache __read_mostly;
 
+/* Visited nodes during ep_loop_check(), so we can unset them when we finish */
+static LIST_HEAD(visited_list);
+
+/*
+ * List of files with newly added links, where we may need to limit the number
+ * of emanating paths. Protected by the epmutex.
+ */
+static LIST_HEAD(tfile_check_list);
+
 #ifdef CONFIG_SYSCTL
 
 #include <linux/sysctl.h>
@@ -276,6 +291,12 @@ ctl_table epoll_table[] = {
 };
 #endif /* CONFIG_SYSCTL */
 
+static const struct file_operations eventpoll_fops;
+
+static inline int is_file_epoll(struct file *f)
+{
+	return f->f_op == &eventpoll_fops;
+}
 
 /* Setup the structure that is used as key for the RB tree */
 static inline void ep_set_ffd(struct epoll_filefd *ffd,
@@ -711,12 +732,6 @@ static const struct file_operations eventpoll_fops = {
 	.llseek = noop_llseek,
 };
 
-/* Fast test to see if the file is an eventpoll file */
-static inline int is_file_epoll(struct file *f)
-{
-	return f->f_op == &eventpoll_fops;
-}
-
 /*
  * This is called from eventpoll_release() to unlink files from the eventpoll
  * interface. We need to have this facility to cleanup correctly files that are
@@ -926,6 +941,99 @@ static void ep_rbtree_insert(struct eventpoll *ep, struct epitem *epi)
 	rb_insert_color(&epi->rbn, &ep->rbr);
 }
 
+
+
+#define PATH_ARR_SIZE 5
+/*
+ * These are the number paths of length 1 to 5, that we are allowing to emanate
+ * from a single file of interest. For example, we allow 1000 paths of length
+ * 1, to emanate from each file of interest. This essentially represents the
+ * potential wakeup paths, which need to be limited in order to avoid massive
+ * uncontrolled wakeup storms. The common use case should be a single ep which
+ * is connected to n file sources. In this case each file source has 1 path
+ * of length 1. Thus, the numbers below should be more than sufficient. These
+ * path limits are enforced during an EPOLL_CTL_ADD operation, since a modify
+ * and delete can't add additional paths. Protected by the epmutex.
+ */
+static const int path_limits[PATH_ARR_SIZE] = { 1000, 500, 100, 50, 10 };
+static int path_count[PATH_ARR_SIZE];
+
+static int path_count_inc(int nests)
+{
+	if (++path_count[nests] > path_limits[nests])
+		return -1;
+	return 0;
+}
+
+static void path_count_init(void)
+{
+	int i;
+
+	for (i = 0; i < PATH_ARR_SIZE; i++)
+		path_count[i] = 0;
+}
+
+static int reverse_path_check_proc(void *priv, void *cookie, int call_nests)
+{
+	int error = 0;
+	struct file *file = priv;
+	struct file *child_file;
+	struct epitem *epi;
+
+	list_for_each_entry(epi, &file->f_ep_links, fllink) {
+		child_file = epi->ep->file;
+		if (is_file_epoll(child_file)) {
+			if (list_empty(&child_file->f_ep_links)) {
+				if (path_count_inc(call_nests)) {
+					error = -1;
+					break;
+				}
+			} else {
+				error = ep_call_nested(&poll_loop_ncalls,
+						       EP_MAX_NESTS,
+						       reverse_path_check_proc,
+						       child_file, child_file,
+						       current);
+			}
+			if (error != 0)
+				break;
+		} else {
+			printk(KERN_ERR "reverse_path_check_proc: "
+			       "file is not an ep!\n");
+		}
+	}
+	return error;
+}
+
+/**
+ * reverse_path_check - The tfile_check_list is list of file *, which have
+ *                      links that are proposed to be newly added. We need to
+ *                      make sure that those added links don't add too many
+ *                      paths such that we will spend all our time waking up
+ *                      eventpoll objects.
+ *
+ * Returns: Returns zero if the proposed links don't create too many paths,
+ *          -1 otherwise.
+ */
+static int reverse_path_check(void)
+{
+	int length = 0;
+	int error = 0;
+	struct file *current_file;
+
+	/* let's call this for all tfiles */
+	list_for_each_entry(current_file, &tfile_check_list, f_tfile_llink) {
+		length++;
+		path_count_init();
+		error = ep_call_nested(&poll_loop_ncalls, EP_MAX_NESTS,
+				       reverse_path_check_proc, current_file,
+				       current_file, current);
+		if (error)
+			break;
+	}
+	return error;
+}
+
 /*
  * Must be called with "mtx" held.
  */
@@ -987,6 +1095,11 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
 	 */
 	ep_rbtree_insert(ep, epi);
 
+	/* now check if we've created too many backpaths */
+	error = -EINVAL;
+	if (reverse_path_check())
+		goto error_remove_epi;
+
 	/* We have to drop the new item inside our item list to keep track of it */
 	spin_lock_irqsave(&ep->lock, flags);
 
@@ -1011,6 +1124,14 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
 
 	return 0;
 
+error_remove_epi:
+	spin_lock(&tfile->f_lock);
+	if (ep_is_linked(&epi->fllink))
+		list_del_init(&epi->fllink);
+	spin_unlock(&tfile->f_lock);
+
+	rb_erase(&epi->rbn, &ep->rbr);
+
 error_unregister:
 	ep_unregister_pollwait(ep, epi);
 
@@ -1275,18 +1396,36 @@ static int ep_loop_check_proc(void *priv, void *cookie, int call_nests)
 	int error = 0;
 	struct file *file = priv;
 	struct eventpoll *ep = file->private_data;
+	struct eventpoll *ep_tovisit;
 	struct rb_node *rbp;
 	struct epitem *epi;
 
 	mutex_lock_nested(&ep->mtx, call_nests + 1);
+	ep->visited = 1;
+	list_add(&ep->visited_list_link, &visited_list);
 	for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp)) {
 		epi = rb_entry(rbp, struct epitem, rbn);
 		if (unlikely(is_file_epoll(epi->ffd.file))) {
+			ep_tovisit = epi->ffd.file->private_data;
+			if (ep_tovisit->visited)
+				continue;
 			error = ep_call_nested(&poll_loop_ncalls, EP_MAX_NESTS,
-					ep_loop_check_proc, epi->ffd.file,
-					epi->ffd.file->private_data, current);
+					ep_loop_check_proc, epi->ffd.file,
+					ep_tovisit, current);
 			if (error != 0)
 				break;
+		} else {
+			/*
+			 * If we've reached a file that is not associated with
+			 * an ep, then we need to check if the newly added
+			 * links are going to add too many wakeup paths. We do
+			 * this by adding it to the tfile_check_list, if it's
+			 * not already there, and calling reverse_path_check()
+			 * during ep_insert().
+			 */
+			if (list_empty(&epi->ffd.file->f_tfile_llink))
+				list_add(&epi->ffd.file->f_tfile_llink,
+					 &tfile_check_list);
 		}
 	}
 	mutex_unlock(&ep->mtx);
@@ -1307,17 +1446,41 @@ static int ep_loop_check_proc(void *priv, void *cookie, int call_nests)
  */
 static int ep_loop_check(struct eventpoll *ep, struct file *file)
 {
-	return ep_call_nested(&poll_loop_ncalls, EP_MAX_NESTS,
+	int ret;
+	struct eventpoll *ep_cur, *ep_next;
+
+	ret = ep_call_nested(&poll_loop_ncalls, EP_MAX_NESTS,
 			      ep_loop_check_proc, file, ep, current);
+	/* clear visited list */
+	list_for_each_entry_safe(ep_cur, ep_next, &visited_list,
+				 visited_list_link) {
+		ep_cur->visited = 0;
+		list_del(&ep_cur->visited_list_link);
+	}
+	return ret;
+}
+
+static void clear_tfile_check_list(void)
+{
+	struct file *file;
+
+	/* first clear the tfile_check_list */
+	while (!list_empty(&tfile_check_list)) {
+		file = list_first_entry(&tfile_check_list, struct file,
+					f_tfile_llink);
+		list_del_init(&file->f_tfile_llink);
+	}
+	INIT_LIST_HEAD(&tfile_check_list);
 }
 
 /*
  * Open an eventpoll file descriptor.
  */
 SYSCALL_DEFINE1(epoll_create1, int, flags)
 {
-	int error;
+	int error, fd;
 	struct eventpoll *ep = NULL;
+	struct file *file;
 
 	/* Check the EPOLL_* constant for consistency. */
 	BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);
@@ -1334,11 +1497,25 @@ SYSCALL_DEFINE1(epoll_create1, int, flags)
 	 * Creates all the items needed to setup an eventpoll file. That is,
 	 * a file structure and a free file descriptor.
 	 */
-	error = anon_inode_getfd("[eventpoll]", &eventpoll_fops, ep,
+	fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
+	if (fd < 0) {
+		error = fd;
+		goto out_free_ep;
+	}
+	file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
 				 O_RDWR | (flags & O_CLOEXEC));
-	if (error < 0)
-		ep_free(ep);
-
+	if (IS_ERR(file)) {
+		error = PTR_ERR(file);
+		goto out_free_fd;
+	}
+	fd_install(fd, file);
+	ep->file = file;
+	return fd;
+
+out_free_fd:
+	put_unused_fd(fd);
+out_free_ep:
+	ep_free(ep);
 	return error;
 }
 
@@ -1404,21 +1581,27 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	/*
 	 * When we insert an epoll file descriptor, inside another epoll file
 	 * descriptor, there is the change of creating closed loops, which are
-	 * better be handled here, than in more critical paths.
+	 * better be handled here, than in more critical paths. While we are
+	 * checking for loops we also determine the list of files reachable
+	 * and hang them on the tfile_check_list, so we can check that we
+	 * haven't created too many possible wakeup paths.
 	 *
-	 * We hold epmutex across the loop check and the insert in this case, in
-	 * order to prevent two separate inserts from racing and each doing the
-	 * insert "at the same time" such that ep_loop_check passes on both
-	 * before either one does the insert, thereby creating a cycle.
+	 * We need to hold the epmutex across both ep_insert and ep_remove
+	 * b/c we want to make sure we are looking at a coherent view of
+	 * epoll network.
	 */
-	if (unlikely(is_file_epoll(tfile) && op == EPOLL_CTL_ADD)) {
+	if (op == EPOLL_CTL_ADD || op == EPOLL_CTL_DEL) {
 		mutex_lock(&epmutex);
 		did_lock_epmutex = 1;
-		error = -ELOOP;
-		if (ep_loop_check(ep, tfile) != 0)
-			goto error_tgt_fput;
 	}
-
+	if (op == EPOLL_CTL_ADD) {
+		if (is_file_epoll(tfile)) {
+			error = -ELOOP;
+			if (ep_loop_check(ep, tfile) != 0)
+				goto error_tgt_fput;
+		} else
+			list_add(&tfile->f_tfile_llink, &tfile_check_list);
+	}
 
 	mutex_lock_nested(&ep->mtx, 0);
 
@@ -1437,6 +1620,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 			error = ep_insert(ep, &epds, tfile, fd);
 		} else
 			error = -EEXIST;
+		clear_tfile_check_list();
 		break;
 	case EPOLL_CTL_DEL:
 		if (epi)
@@ -1455,7 +1639,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	mutex_unlock(&ep->mtx);
 
 error_tgt_fput:
-	if (unlikely(did_lock_epmutex))
+	if (did_lock_epmutex)
 		mutex_unlock(&epmutex);
 
 	fput(tfile);

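To make the path accounting above concrete, here is a userspace re-implementation of the counting scheme (an illustrative sketch, not kernel code; the topology is hypothetical, with the limits copied from the patch). A source file watched directly by three eps contributes three paths of length 1; if one of those eps is itself watched by another ep, the source also has one path of length 2:

/* path_count_sketch.c - illustrative only */
#include <stdio.h>

#define PATH_ARR_SIZE 5
/* limits copied from the patch: paths of length 1..5 per source file */
static const int path_limits[PATH_ARR_SIZE] = { 1000, 500, 100, 50, 10 };
static int path_count[PATH_ARR_SIZE];

static int path_count_inc(int nests)
{
	/* one more wakeup path of length (nests + 1) from this source */
	if (++path_count[nests] > path_limits[nests])
		return -1; /* over the limit: EPOLL_CTL_ADD would fail */
	return 0;
}

int main(void)
{
	int i, err = 0;

	/* three eps watch the source directly: 3 paths of length 1 */
	for (i = 0; i < 3 && !err; i++)
		err = path_count_inc(0);
	/* one of those eps is watched by a fourth ep: 1 path of length 2 */
	if (!err)
		err = path_count_inc(1);

	printf("length-1 paths: %d/%d, length-2 paths: %d/%d, ok=%d\n",
	       path_count[0], path_limits[0],
	       path_count[1], path_limits[1], !err);
	return 0;
}

Note that reverse_path_check() calls path_count_init() for each file on the tfile_check_list, so the limits apply per 'source file descriptor', not globally.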
include/linux/eventpoll.h

Lines changed: 1 addition & 0 deletions

@@ -61,6 +61,7 @@ struct file;
 static inline void eventpoll_init_file(struct file *file)
 {
 	INIT_LIST_HEAD(&file->f_ep_links);
+	INIT_LIST_HEAD(&file->f_tfile_llink);
 }

include/linux/fs.h

Lines changed: 1 addition & 0 deletions

@@ -1001,6 +1001,7 @@ struct file {
 #ifdef CONFIG_EPOLL
 	/* Used by fs/eventpoll.c to link all the hooks to this file */
 	struct list_head	f_ep_links;
+	struct list_head	f_tfile_llink;
 #endif /* #ifdef CONFIG_EPOLL */
 	struct address_space	*f_mapping;
 #ifdef CONFIG_DEBUG_WRITECOUNT
