
Commit a4ebf1b

vaverin authored and torvalds committed
memcg: prohibit unconditional exceeding the limit of dying tasks
Memory cgroup charging allows killed or exiting tasks to exceed the hard limit. It is assumed that the amount of memory charged by those tasks is bounded and that most of the memory will be released while the task is exiting. This resembles a heuristic for the global OOM situation, where tasks get access to memory reserves. There is no global memory shortage at the memcg level, so the memcg heuristic is more relaxed.

The above assumption is overly optimistic, though. For example, vmalloc can scale to really large requests and the heuristic would allow that. We used to have an early break in the vmalloc allocator for killed tasks, but this has been reverted by commit b8c8a33 ("Revert "vmalloc: back off when the current task is killed""). There are likely other similar code paths which do not check for fatal signals in an allocation-and-charge loop. Also, there are some kernel objects charged to a memcg which are not bound to a process lifetime.

It has been observed that it is not really hard to trigger these bypasses and cause a global OOM situation.

One potential way to address these runaways would be to limit the amount of excess (similar to the global OOM with limited OOM reserves). This is certainly possible, but it is not really clear how much excess is acceptable while still protecting against global OOMs, as that would have to take the overall memcg configuration into account.

This patch addresses the problem by removing the heuristic altogether. Bypass is only allowed for requests which either cannot fail or where failure is undesirable while the excess is still expected to be limited (e.g. atomic requests). Implementation-wise, a killed or dying task fails to charge once it has passed the OOM killer stage. That gives all forms of reclaim a chance to restore the limit before the failure (ENOMEM), which tells the caller to back off.

In addition, this patch renames the should_force_charge() helper to task_is_dying(), because its use is no longer associated with forced charging.
This patch depends on pagefault_out_of_memory() not triggering out_of_memory(); otherwise a memcg charge failure could unwind to VM_FAULT_OOM and invoke the global OOM killer.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vasily Averin <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Uladzislau Rezki <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
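The charging behavior described above can be illustrated with a small user-space model. This is a hypothetical sketch, not kernel code: `struct model_memcg`, `model_memcg_oom()` and `model_try_charge()` are invented for illustration, reclaim is assumed to free nothing (so reclaim retries and stock draining are left out), and a `patched` flag switches between the pre- and post-patch rules. The modeled OOM stage reports success for a dying caller without killing anything, mirroring `ret = task_is_dying() || out_of_memory(&oc)`; the `passed_oom` check is what keeps such a caller from retrying forever.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model types for illustration; these are not kernel definitions. */
enum model_oom { MODEL_OOM_SUCCESS, MODEL_OOM_FAILED };
enum model_result { MODEL_CHARGED, MODEL_FORCED, MODEL_NOMEM };

struct model_memcg {
	long usage;        /* pages currently charged */
	long limit;        /* hard limit, in pages */
	long victim_pages; /* pages an OOM kill would free; 0 = nothing killable */
};

/*
 * Model of the memcg OOM stage. A dying caller gets "success" without any
 * kill, mirroring `ret = task_is_dying() || out_of_memory(&oc)`.
 */
static enum model_oom model_memcg_oom(struct model_memcg *cg, bool dying)
{
	if (dying)
		return MODEL_OOM_SUCCESS;
	if (cg->victim_pages == 0)
		return MODEL_OOM_FAILED;
	cg->usage -= cg->victim_pages; /* the victim's charges are released */
	cg->victim_pages = 0;          /* and it cannot be killed twice */
	return MODEL_OOM_SUCCESS;
}

/*
 * Model of the charge loop. Reclaim is assumed to free nothing, so only
 * the OOM stage can make room. `patched` selects the post-patch rules.
 */
static enum model_result model_try_charge(struct model_memcg *cg, long nr_pages,
					  bool dying, bool nofail, bool patched)
{
	bool passed_oom = false;

	assert(nr_pages > 0);

	/* Pre-patch heuristic: dying tasks bypass the limit right away. */
	if (!patched && dying)
		goto force;
retry:
	if (cg->usage + nr_pages <= cg->limit) {
		cg->usage += nr_pages;
		return MODEL_CHARGED;
	}
	/* Post-patch: a dying task that already passed the OOM stage backs off. */
	if (patched && passed_oom && dying)
		goto nomem;

	if (model_memcg_oom(cg, dying) == MODEL_OOM_SUCCESS) {
		passed_oom = true;
		goto retry;
	}
	/* Pre-patch, OOM_FAILED forced the charge; post-patch it goes to nomem. */
	if (!patched)
		goto force;
nomem:
	if (!nofail)
		return MODEL_NOMEM;
force:
	cg->usage += nr_pages; /* may push usage past the limit */
	return MODEL_FORCED;
}
```

In this model, for a full memcg (usage 100, limit 100, nothing killable) a dying task requesting 512 pages gets MODEL_FORCED under the old rules, with usage growing to 612, and MODEL_NOMEM under the new rules, with usage staying at 100. Only a `nofail` request still ends up forced past the limit.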
1 parent 60e2793 commit a4ebf1b


1 file changed (+8, -19 lines changed)


mm/memcontrol.c

Lines changed: 8 additions & 19 deletions
```diff
@@ -234,7 +234,7 @@ enum res_type {
 	     iter != NULL;				\
 	     iter = mem_cgroup_iter(NULL, iter, NULL))
 
-static inline bool should_force_charge(void)
+static inline bool task_is_dying(void)
 {
 	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
 		(current->flags & PF_EXITING);
@@ -1624,7 +1624,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * A few threads which were not waiting at mutex_lock_killable() can
 	 * fail to bail out. Therefore, check again after holding oom_lock.
 	 */
-	ret = should_force_charge() || out_of_memory(&oc);
+	ret = task_is_dying() || out_of_memory(&oc);
 
 unlock:
 	mutex_unlock(&oom_lock);
@@ -2579,6 +2579,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	struct page_counter *counter;
 	enum oom_status oom_status;
 	unsigned long nr_reclaimed;
+	bool passed_oom = false;
 	bool may_swap = true;
 	bool drained = false;
 	unsigned long pflags;
@@ -2613,15 +2614,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_ATOMIC)
 		goto force;
 
-	/*
-	 * Unlike in global OOM situations, memcg is not in a physical
-	 * memory shortage. Allow dying and OOM-killed tasks to
-	 * bypass the last charges so that they can exit quickly and
-	 * free their memory.
-	 */
-	if (unlikely(should_force_charge()))
-		goto force;
-
 	/*
 	 * Prevent unbounded recursion when reclaim operations need to
 	 * allocate memory. This might exceed the limits temporarily,
@@ -2679,8 +2671,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_RETRY_MAYFAIL)
 		goto nomem;
 
-	if (fatal_signal_pending(current))
-		goto force;
+	/* Avoid endless loop for tasks bypassed by the oom killer */
+	if (passed_oom && task_is_dying())
+		goto nomem;
 
 	/*
 	 * keep retrying as long as the memcg oom killer is able to make
@@ -2689,14 +2682,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
 		       get_order(nr_pages * PAGE_SIZE));
-	switch (oom_status) {
-	case OOM_SUCCESS:
+	if (oom_status == OOM_SUCCESS) {
+		passed_oom = true;
 		nr_retries = MAX_RECLAIM_RETRIES;
 		goto retry;
-	case OOM_FAILED:
-		goto force;
-	default:
-		goto nomem;
 	}
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
```
