
Commit cd04ae1

Michal Hocko authored and torvalds committed
mm, oom: do not rely on TIF_MEMDIE for memory reserves access
For ages we have been relying on the TIF_MEMDIE thread flag to mark OOM victims and then, among other things, to give these threads full access to memory reserves. There are a few shortcomings of this implementation, though.

First of all, and the most serious one, the full access to memory reserves is quite dangerous because we leave no safety room for the system to operate and potentially do last emergency steps to move on.

Secondly, this flag is per task_struct while the OOM killer operates on mm_struct granularity, so all processes sharing the given mm are killed. Giving full access to all these task_structs could lead to a quick depletion of the memory reserves. We have tried to reduce this risk by giving TIF_MEMDIE only to the main thread and the currently allocating task, but that doesn't really solve the problem while it surely opens up room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the allocator without access to memory reserves because a particular thread was not the group leader.

Now that we have the oom reaper and that all oom victims are reapable after 1b51e65 ("oom, oom_reaper: allow to reap mm shared by the kthreads"), we can be more conservative and grant only partial access to memory reserves because there are reasonable chances of parallel memory freeing. We still want some access to reserves because we do not want other consumers to eat up the victim's freed memory. oom victims will still contend with __GFP_HIGH users, but those shouldn't be so aggressive as to starve oom victims completely.

Introduce an ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to half of the reserves. This makes the access to reserves independent of which task has passed through mark_oom_victim. Also drop any usage of TIF_MEMDIE from the page allocator proper and replace it by tsk_is_oom_victim as well, which will finally make page_alloc.c completely TIF_MEMDIE free.

CONFIG_MMU=n doesn't have the oom reaper, so let's stick to the original ALLOC_NO_WATERMARKS approach there.

There is a demand to make the oom killer memcg aware, which will imply many tasks killed at once. This change will allow such a usecase without worrying about complete memory reserves depletion.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
1 parent d30561c commit cd04ae1

File tree: 3 files changed, +73 −23 lines changed

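For reference, tsk_is_oom_victim(), which the hunks below use in place of the TIF_MEMDIE checks, is not touched by this commit; at the time it lived in include/linux/oom.h and looked roughly like the sketch below (reconstructed from memory of that kernel era, not taken from this diff):

static inline bool tsk_is_oom_victim(struct task_struct *tsk)
{
	/* oom_mm is set for the whole thread group by mark_oom_victim() */
	return tsk->signal->oom_mm;
}

Unlike the per-thread TIF_MEMDIE flag, this predicate is true for every thread attached to the victim's signal_struct, which is what lets the page allocator treat all of them uniformly.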

mm/internal.h

Lines changed: 11 additions & 0 deletions
@@ -480,6 +480,17 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 /* Mask to get the watermark bits */
 #define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)
 
+/*
+ * Only MMU archs have async oom victim reclaim - aka oom_reaper so we
+ * cannot assume a reduced access to memory reserves is sufficient for
+ * !MMU
+ */
+#ifdef CONFIG_MMU
+#define ALLOC_OOM		0x08
+#else
+#define ALLOC_OOM		ALLOC_NO_WATERMARKS
+#endif
+
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
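For orientation (the surrounding values are assumed from the mm/internal.h of that era, not all visible in this hunk): the low two bits select the watermark, ALLOC_NO_WATERMARKS is 0x04, so the new ALLOC_OOM takes the previously unused 0x08 bit just below ALLOC_HARDER (0x10), ALLOC_HIGH (0x20) and ALLOC_CPUSET (0x40). On !MMU configurations ALLOC_OOM simply aliases ALLOC_NO_WATERMARKS, preserving the old full-reserves behaviour where no oom_reaper exists.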

mm/oom_kill.c

Lines changed: 5 additions & 4 deletions
@@ -824,7 +824,8 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
+	 * its children or threads, just give it access to memory reserves
+	 * so it can die quickly
 	 */
 	task_lock(p);
 	if (task_will_free_mem(p)) {
@@ -889,9 +890,9 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	count_memcg_event_mm(mm, OOM_KILL);
 
 	/*
-	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
-	 * the OOM victim from depleting the memory reserves from the user
-	 * space under its control.
+	 * We should send SIGKILL before granting access to memory reserves
+	 * in order to prevent the OOM victim from depleting the memory
+	 * reserves from the user space under its control.
 	 */
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	mark_oom_victim(victim);

mm/page_alloc.c

Lines changed: 57 additions & 19 deletions
@@ -2951,7 +2951,7 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 {
 	long min = mark;
 	int o;
-	const bool alloc_harder = (alloc_flags & ALLOC_HARDER);
+	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
 
 	/* free_pages may go negative - that's OK */
 	free_pages -= (1 << order) - 1;
@@ -2964,10 +2964,21 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 	 * the high-atomic reserves. This will over-estimate the size of the
 	 * atomic reserve but it avoids a search.
 	 */
-	if (likely(!alloc_harder))
+	if (likely(!alloc_harder)) {
 		free_pages -= z->nr_reserved_highatomic;
-	else
-		min -= min / 4;
+	} else {
+		/*
+		 * OOM victims can try even harder than normal ALLOC_HARDER
+		 * users on the grounds that it's definitely going to be in
+		 * the exit path shortly and free memory. Any allocation it
+		 * makes during the free path will be small and short-lived.
+		 */
+		if (alloc_flags & ALLOC_OOM)
+			min -= min / 2;
+		else
+			min -= min / 4;
+	}
+
 
 #ifdef CONFIG_CMA
 	/* If allocation can't use CMA areas don't use free CMA pages */
@@ -3205,7 +3216,7 @@ static void warn_alloc_show_mem(gfp_t gfp_mask, nodemask_t *nodemask)
 	 * of allowed nodes.
 	 */
 	if (!(gfp_mask & __GFP_NOMEMALLOC))
-		if (test_thread_flag(TIF_MEMDIE) ||
+		if (tsk_is_oom_victim(current) ||
 		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
 			filter &= ~SHOW_MEM_FILTER_NODES;
 	if (in_interrupt() || !(gfp_mask & __GFP_DIRECT_RECLAIM))
@@ -3668,21 +3679,46 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	return alloc_flags;
 }
 
-bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
+static bool oom_reserves_allowed(struct task_struct *tsk)
 {
-	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
+	if (!tsk_is_oom_victim(tsk))
+		return false;
+
+	/*
+	 * !MMU doesn't have oom reaper so give access to memory reserves
+	 * only to the thread with TIF_MEMDIE set
+	 */
+	if (!IS_ENABLED(CONFIG_MMU) && !test_thread_flag(TIF_MEMDIE))
 		return false;
 
+	return true;
+}
+
+/*
+ * Distinguish requests which really need access to full memory
+ * reserves from oom victims which can live with a portion of it
+ */
+static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask)
+{
+	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
+		return 0;
 	if (gfp_mask & __GFP_MEMALLOC)
-		return true;
+		return ALLOC_NO_WATERMARKS;
 	if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
-		return true;
-	if (!in_interrupt() &&
-			((current->flags & PF_MEMALLOC) ||
-			 unlikely(test_thread_flag(TIF_MEMDIE))))
-		return true;
+		return ALLOC_NO_WATERMARKS;
+	if (!in_interrupt()) {
+		if (current->flags & PF_MEMALLOC)
+			return ALLOC_NO_WATERMARKS;
+		else if (oom_reserves_allowed(current))
+			return ALLOC_OOM;
+	}
 
-	return false;
+	return 0;
+}
+
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
+{
+	return !!__gfp_pfmemalloc_flags(gfp_mask);
 }
 
 /*
@@ -3835,6 +3871,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	unsigned long alloc_start = jiffies;
 	unsigned int stall_timeout = 10 * HZ;
 	unsigned int cpuset_mems_cookie;
+	int reserve_flags;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3940,15 +3977,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
 		wake_all_kswapds(order, ac);
 
-	if (gfp_pfmemalloc_allowed(gfp_mask))
-		alloc_flags = ALLOC_NO_WATERMARKS;
+	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
+	if (reserve_flags)
+		alloc_flags = reserve_flags;
 
 	/*
 	 * Reset the zonelist iterators if memory policies can be ignored.
 	 * These allocations are high priority and system rather than user
 	 * orientated.
 	 */
-	if (!(alloc_flags & ALLOC_CPUSET) || (alloc_flags & ALLOC_NO_WATERMARKS)) {
+	if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
 		ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
 		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
 					ac->high_zoneidx, ac->nodemask);
@@ -4025,8 +4063,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto got_pg;
 
 	/* Avoid allocations with no watermarks from looping endlessly */
-	if (test_thread_flag(TIF_MEMDIE) &&
-	    (alloc_flags == ALLOC_NO_WATERMARKS ||
+	if (tsk_is_oom_victim(current) &&
+	    (alloc_flags == ALLOC_OOM ||
 	     (gfp_mask & __GFP_NOMEMALLOC)))
 		goto nopage;
 
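As a rough illustration of the new watermark arithmetic in __zone_watermark_ok() (the numbers are hypothetical, chosen only for this example): for a zone whose min watermark is 1024 pages, a plain ALLOC_HARDER request must keep free_pages above 1024 - 1024/4 = 768 pages, an ALLOC_OOM request only needs to stay above 1024 - 1024/2 = 512 pages, and ALLOC_NO_WATERMARKS (still returned for __GFP_MEMALLOC, PF_MEMALLOC and the !MMU victim case) bypasses the watermark check entirely. That is the "half of the reserves" the changelog refers to.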
