
Commit e47483b

tehcaster authored and torvalds committed
mm, page_alloc: fix premature OOM when racing with cpuset mems update
Ganapatrao Kulkarni reported that the LTP test cpuset01 in stress mode triggers the OOM killer within a few seconds, despite lots of free memory. The test repeatedly faults in memory in one process in a cpuset, while another process keeps changing the cpuset's allowed nodes between 0 and 1.

The problem comes from insufficient protection against cpuset changes, which can cause get_page_from_freelist() to consider all zones as non-eligible due to the nodemask and/or current->mems_allowed. This was masked in the past by sufficient retries, but since commit 682a338 ("mm, page_alloc: inline the fast path of the zonelist iterator") we fix the preferred_zoneref once and no longer iterate over the whole zonelist in further attempts, so the only eligible zones might be placed in the zonelist before our starting point and we always miss them.

A previous patch fixed this problem for current->mems_allowed. However, cpuset changes also update the task's mempolicy nodemask. The fix has two parts: we have to repeat the preferred_zoneref search when we detect a cpuset update by way of the seqcount, and we have to check the seqcount before considering OOM.

[[email protected]: fix typo in comment]
Link: http://lkml.kernel.org/r/[email protected]
Fixes: c33d6c0 ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
Signed-off-by: Vlastimil Babka <[email protected]>
Reported-by: Ganapatrao Kulkarni <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
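As an aside for readers unfamiliar with the mechanism: cpuset updates bump a per-task mems_allowed sequence counter, which the allocator samples with read_mems_allowed_begin() and re-checks with read_mems_allowed_retry() before giving up. The userspace sketch below only illustrates that begin/retry idiom; every name in it (mems_seq, mems_begin, mems_retry, alloc_attempt, update_nodemask) is invented for the demo and none of it is the actual kernel implementation.

/* seqcount_retry_demo.c - illustrative analogue of the begin/retry idiom. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static _Atomic unsigned int mems_seq;	/* bumped twice per nodemask update */
static unsigned long mems_allowed = 0x1; /* stand-in for the task's allowed nodes */

static unsigned int mems_begin(void)
{
	unsigned int seq;

	do {
		seq = atomic_load_explicit(&mems_seq, memory_order_acquire);
	} while (seq & 1);	/* odd value: an update is in flight, wait it out */
	return seq;
}

static bool mems_retry(unsigned int seq)
{
	/* true if the mask may have changed since mems_begin() */
	return atomic_load_explicit(&mems_seq, memory_order_acquire) != seq;
}

static void update_nodemask(unsigned long new_mask)
{
	atomic_fetch_add_explicit(&mems_seq, 1, memory_order_release); /* mark in progress */
	mems_allowed = new_mask;
	atomic_fetch_add_explicit(&mems_seq, 1, memory_order_release); /* publish */
}

static bool alloc_attempt(unsigned long mask)
{
	return mask != 0;	/* "succeed" iff some node is allowed */
}

int main(void)
{
	unsigned int cookie;
	bool ok;

retry_cpuset:
	cookie = mems_begin();	/* snapshot before reading the mask */
	ok = alloc_attempt(mems_allowed);
	if (!ok && mems_retry(cookie))
		goto retry_cpuset;	/* mask changed under us: retry, don't declare OOM */

	update_nodemask(0x2);	/* a second thread would do this concurrently */
	printf("allocation %s\n", ok ? "succeeded" : "failed");
	return 0;
}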
1 parent 5ce9bfe, commit e47483b

File tree: 1 file changed (+24, -11 lines)


mm/page_alloc.c

Lines changed: 24 additions & 11 deletions
@@ -3555,6 +3555,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	no_progress_loops = 0;
 	compact_priority = DEF_COMPACT_PRIORITY;
 	cpuset_mems_cookie = read_mems_allowed_begin();
+	/*
+	 * We need to recalculate the starting point for the zonelist iterator
+	 * because we might have used different nodemask in the fast path, or
+	 * there was a cpuset modification and we are retrying - otherwise we
+	 * could end up iterating over non-eligible zones endlessly.
+	 */
+	ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
+					ac->high_zoneidx, ac->nodemask);
+	if (!ac->preferred_zoneref->zone)
+		goto nopage;
+
 
 	/*
 	 * The fast path uses conservative alloc_flags to succeed only until
@@ -3715,6 +3726,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 				&compaction_retries))
 		goto retry;
 
+	/*
+	 * It's possible we raced with cpuset update so the OOM would be
+	 * premature (see below the nopage: label for full explanation).
+	 */
+	if (read_mems_allowed_retry(cpuset_mems_cookie))
+		goto retry_cpuset;
+
 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
 	if (page)
@@ -3728,10 +3746,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
 nopage:
 	/*
-	 * When updating a task's mems_allowed, it is possible to race with
-	 * parallel threads in such a way that an allocation can fail while
-	 * the mask is being updated. If a page allocation is about to fail,
-	 * check if the cpuset changed during allocation and if so, retry.
+	 * When updating a task's mems_allowed or mempolicy nodemask, it is
+	 * possible to race with parallel threads in such a way that our
+	 * allocation can fail while the mask is being updated. If we are about
+	 * to fail, check if the cpuset changed during allocation and if so,
+	 * retry.
 	 */
 	if (read_mems_allowed_retry(cpuset_mems_cookie))
 		goto retry_cpuset;
@@ -3822,15 +3841,9 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	/*
 	 * Restore the original nodemask if it was potentially replaced with
 	 * &cpuset_current_mems_allowed to optimize the fast-path attempt.
-	 * Also recalculate the starting point for the zonelist iterator or
-	 * we could end up iterating over non-eligible zones endlessly.
 	 */
-	if (unlikely(ac.nodemask != nodemask)) {
+	if (unlikely(ac.nodemask != nodemask))
 		ac.nodemask = nodemask;
-		ac.preferred_zoneref = first_zones_zonelist(ac.zonelist,
-					ac.high_zoneidx, ac.nodemask);
-		/* If we have NULL preferred zone, slowpath wll handle that */
-	}
 
 	page = __alloc_pages_slowpath(alloc_mask, order, &ac);
 

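To see why the first hunk repeats the first_zones_zonelist() lookup on every retry_cpuset pass, the toy program below models the failure mode: with a starting point cached while node 1 was allowed, a scan restricted to zones at or after that point finds nothing once the cpuset switches to node 0, even though an eligible zone exists earlier in the list. The zones, masks and helper names here are invented for the illustration; only the "scan from a cached starting point" behaviour mirrors the allocator.

/* zoneref_restart_demo.c - toy model of the stale starting point problem. */
#include <stdio.h>

#define NR_ZONES 2

/* zone 0 belongs to node 0, zone 1 to node 1 */
static const int zone_node[NR_ZONES] = { 0, 1 };

/* Return the first zone index >= start whose node is in @allowed, or -1. */
static int first_eligible(int start, unsigned long allowed)
{
	for (int i = start; i < NR_ZONES; i++)
		if (allowed & (1UL << zone_node[i]))
			return i;
	return -1;
}

int main(void)
{
	/* The fast path computed its starting point while node 1 was allowed... */
	unsigned long allowed = 1UL << 1;
	int preferred = first_eligible(0, allowed);	/* -> zone 1 */

	/* ...then a cpuset update restricted the task to node 0 only. */
	allowed = 1UL << 0;

	/* Stale behaviour: keep the old starting point and scan from it. */
	int stale = first_eligible(preferred, allowed);
	/* Fixed behaviour: recompute the starting point from the new mask. */
	int fresh = first_eligible(0, allowed);

	printf("stale start finds zone %d (looks like OOM)\n", stale);	/* -1 */
	printf("recomputed start finds zone %d\n", fresh);		/* 0 */
	return 0;
}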
0 commit comments
