Commit 63fd327

hnaz authored and akpm00 committed
mm: memcontrol: don't throttle dying tasks on memory.high
While investigating hosts with high cgroup memory pressures, Tejun found culprit zombie tasks that were holding on to a lot of memory, had SIGKILL pending, but were stuck in memory.high reclaim.

In the past, we used to always force-charge allocations from tasks that were exiting in order to accelerate them dying and freeing up their rss. This changed for memory.max in a4ebf1b ("memcg: prohibit unconditional exceeding the limit of dying tasks"); it noted that this can cause (userspace-inducible) containment failures, so it added a mandatory reclaim and OOM kill cycle before forcing charges. At the time, memory.high enforcement was handled in the userspace return path, which isn't reached by dying tasks, and so memory.high was still never enforced for dying tasks.

When c9afe31 ("memcg: synchronously enforce memory.high for large overcharges") added synchronous reclaim for memory.high, it added unconditional memory.high enforcement for dying tasks as well. The callstack shows that this is the path the zombie is stuck in.

We need to accelerate dying tasks getting past memory.high, but we cannot do it quite the same way as we do for memory.max: memory.max is enforced strictly, and tasks aren't allowed to move past it without FIRST reclaiming and OOM killing if necessary. This ensures very small levels of excess. With memory.high, though, enforcement happens lazily after the charge, and OOM killing is never triggered. A lot of concurrent threads could have pushed, or could actively be pushing, the cgroup into excess. The dying task will enter reclaim on every allocation attempt, with little hope of restoring balance.

To fix this, skip synchronous memory.high enforcement on dying tasks altogether again. Update memory.high path documentation while at it.

[[email protected]: also handle tasks are being killed during the reclaim]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: c9afe31 ("memcg: synchronously enforce memory.high for large overcharges")
Signed-off-by: Johannes Weiner <[email protected]>
Reported-by: Tejun Heo <[email protected]>
Reviewed-by: Yosry Ahmed <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Dan Schatzberg <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

1 parent c4608d1 commit 63fd327
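
The argument in the commit message can be illustrated with a small userspace toy model. This is only a sketch under stated assumptions, not kernel code: `struct toy_cgroup`, `toy_exit_reclaim_passes()` and all of their parameters are hypothetical names invented here, and the model assumes one reclaim pass per over-limit charge with a fixed number of pages freed per pass. It counts how many reclaim passes an exiting task sits through while sibling threads keep charging the group.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model, not kernel code: every name here is illustrative. */
struct toy_cgroup {
	long usage;        /* pages currently charged to the group */
	long high;         /* memory.high threshold, in pages */
	long sibling_rate; /* pages that other threads charge per step */
};

/*
 * Simulate a task making 'allocs' one-page charges on its way out.
 * Whenever the group is over 'high', the task runs one reclaim pass that
 * frees at most 'reclaim_batch' pages, unless it is dying and
 * 'skip_when_dying' is set (the behaviour this commit restores).
 * Returns the number of reclaim passes the task had to sit through.
 */
static long toy_exit_reclaim_passes(struct toy_cgroup cg, long allocs,
				    long reclaim_batch, bool dying,
				    bool skip_when_dying)
{
	long passes = 0;

	assert(allocs >= 0 && reclaim_batch >= 0);

	for (long i = 0; i < allocs; i++) {
		/* Our own charge plus whatever the siblings charged meanwhile. */
		cg.usage += 1 + cg.sibling_rate;

		if (cg.usage <= cg.high)
			continue;	/* under the limit, nothing to do */
		if (dying && skip_when_dying)
			continue;	/* let the task exit and free its memory */

		passes++;
		cg.usage -= reclaim_batch;
		if (cg.usage < 0)
			cg.usage = 0;
	}
	return passes;
}
```

With a group starting at 1000 pages against a limit of 500, siblings charging 64 pages per step and reclaim freeing 32 pages per pass, an exiting task making 100 charges enters reclaim 100 times without the bail-out and 0 times with it. Usage never drops below the limit in this setup, which is the "little hope of restoring balance" case described above.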

1 file changed: +25 -4 lines changed

mm/memcontrol.c

Lines changed: 25 additions & 4 deletions
@@ -2623,8 +2623,9 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
 }

 /*
- * Scheduled by try_charge() to be executed from the userland return path
- * and reclaims memory over the high limit.
+ * Reclaims memory over the high limit. Called directly from
+ * try_charge() (context permitting), as well as from the userland
+ * return path where reclaim is always able to block.
  */
 void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 {
@@ -2643,6 +2644,17 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	current->memcg_nr_pages_over_high = 0;

 retry_reclaim:
+	/*
+	 * Bail if the task is already exiting. Unlike memory.max,
+	 * memory.high enforcement isn't as strict, and there is no
+	 * OOM killer involved, which means the excess could already
+	 * be much bigger (and still growing) than it could for
+	 * memory.max; the dying task could get stuck in fruitless
+	 * reclaim for a long time, which isn't desirable.
+	 */
+	if (task_is_dying())
+		goto out;
+
 	/*
 	 * The allocating task should reclaim at least the batch size, but for
 	 * subsequent retries we only want to do what's necessary to prevent oom
@@ -2693,6 +2705,9 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	}

 	/*
+	 * Reclaim didn't manage to push usage below the limit, slow
+	 * this allocating task down.
+	 *
 	 * If we exit early, we're guaranteed to die (since
 	 * schedule_timeout_killable sets TASK_KILLABLE). This means we don't
 	 * need to account for any ill-begotten jiffies to pay them off later.
@@ -2887,11 +2902,17 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		}
 	} while ((memcg = parent_mem_cgroup(memcg)));

+	/*
+	 * Reclaim is set up above to be called from the userland
+	 * return path. But also attempt synchronous reclaim to avoid
+	 * excessive overrun while the task is still inside the
+	 * kernel. If this is successful, the return path will see it
+	 * when it rechecks the overage and simply bail out.
+	 */
 	if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
 	    !(current->flags & PF_MEMALLOC) &&
-	    gfpflags_allow_blocking(gfp_mask)) {
+	    gfpflags_allow_blocking(gfp_mask))
 		mem_cgroup_handle_over_high(gfp_mask);
-	}
 	return 0;
 }
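
The decision flow that results from the hunks above can be condensed into a standalone sketch. It is a simplified userspace model, not the kernel's implementation: `struct toy_task`, `enum toy_action`, `toy_high_action()` and the value of `TOY_CHARGE_BATCH` are assumptions made for illustration, and the kernel's `PF_MEMALLOC`, `gfpflags_allow_blocking()` and `task_is_dying()` checks are reduced to plain booleans. In the real code the dying check sits inside mem_cgroup_handle_over_high() and so also applies on retries and on the userland return path, which this model does not capture.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed stand-in for MEMCG_CHARGE_BATCH; the value is illustrative only. */
#define TOY_CHARGE_BATCH 64

/* Hypothetical model of the per-task state consulted after a charge. */
struct toy_task {
	unsigned int pages_over_high; /* models current->memcg_nr_pages_over_high */
	bool in_reclaim;              /* models the PF_MEMALLOC flag */
	bool dying;                   /* models task_is_dying() */
};

enum toy_action {
	TOY_DEFER,       /* leave enforcement to the userland return path */
	TOY_SKIP_DYING,  /* synchronous path entered, but bail immediately */
	TOY_RECLAIM_NOW, /* reclaim synchronously while still in the kernel */
};

/* Decide what happens to memory.high enforcement right after a charge. */
static enum toy_action toy_high_action(const struct toy_task *t, bool may_block)
{
	assert(t);

	/* Gate modelled on the condition in try_charge_memcg(). */
	if (t->pages_over_high <= TOY_CHARGE_BATCH || t->in_reclaim || !may_block)
		return TOY_DEFER;

	/* Check modelled on the one added at retry_reclaim. */
	if (t->dying)
		return TOY_SKIP_DYING;

	return TOY_RECLAIM_NOW;
}
```

Putting the dying check after the gate mirrors where the patch places it: the caller's condition is unchanged apart from the dropped braces, and the bail-out lives in the reclaim function itself.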