
Commit f77cf4e

gormanm authored and torvalds committed
mm, page_alloc: delete the zonelist_cache
The zonelist cache (zlc) was introduced to skip over zones that were recently known to be full. This avoided expensive operations such as the cpuset checks, watermark calculations and zone_reclaim. The situation today is different and the complexity of zlc is harder to justify.

1) The cpuset checks are no-ops unless a cpuset is active and in general are a lot cheaper.

2) zone_reclaim is now disabled by default and I suspect that was a large source of the cost that zlc wanted to avoid. When it is enabled, it's known to be a major source of stalling when nodes fill up and it's unwise to hit every other user with the overhead.

3) Watermark checks are expensive to calculate for high-order allocation requests. Later patches in this series will reduce the cost of the watermark checking.

4) The most important issue is that in the current implementation it is possible for a failed THP allocation to mark a zone full for order-0 allocations and cause a fallback to remote nodes.

The last issue could be addressed with additional complexity but as the benefit of zlc is questionable, it is better to remove it. If stalls due to zone_reclaim are ever reported then an alternative would be to introduce deferring logic based on a timeout inside zone_reclaim itself and leave the page allocator fast paths alone.
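For readers who have not seen the code being removed, here is a rough userspace model of the hint the zlc maintained: a per-zonelist bitmap of zones recently observed to be full, consulted on the first scan in get_page_from_freelist() and zeroed roughly once per second (the comment deleted from include/linux/mmzone.h below describes the real thing). This sketch is illustrative only, not the kernel implementation, and the struct and helper names are invented for the example.

/*
 * Illustrative userspace model of the zonelist_cache hint (NOT the kernel
 * code being removed). One bit per zone in a zonelist records that the
 * zone recently looked full; the bitmap is cleared ("zapped") roughly
 * once per second so zones are eventually reconsidered.
 */
#include <stdbool.h>
#include <time.h>

struct zlc_model {
        unsigned long fullzones;        /* bit i set: zone i recently full */
        time_t last_full_zap;           /* when fullzones was last cleared */
};

/* Forget the hints about once per second. */
static void zlc_model_maybe_zap(struct zlc_model *zlc)
{
        time_t now = time(NULL);

        if (now != zlc->last_full_zap) {
                zlc->fullzones = 0;
                zlc->last_full_zap = now;
        }
}

/* First-pass scan: skip a zone whose "full" hint is still set. */
static bool zlc_model_zone_worth_trying(struct zlc_model *zlc, int zone)
{
        zlc_model_maybe_zap(zlc);
        return !(zlc->fullzones & (1UL << zone));
}

/* Record that an allocation attempt found this zone full. */
static void zlc_model_mark_zone_full(struct zlc_model *zlc, int zone)
{
        zlc->fullzones |= 1UL << zone;
}

Issue 4 above falls naturally out of this shape: a zone whose hint bit was set after a failed high-order (THP) attempt keeps that bit for later order-0 requests until the next zap, which is what pushes them to remote nodes.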
The impact on page-allocator microbenchmarks is negligible as they don't hit the paths where the zlc comes into play. Most page-reclaim related workloads showed no noticeable difference as a result of the removal.

The impact was noticeable in a workload called "stutter". One part uses a lot of anonymous memory, a second measures mmap latency and a third copies a large file. In an ideal world the latency application would not notice the mmap latency. On a 2-node machine the results of this patch are:

stutter
                                    4.3.0-rc1             4.3.0-rc1
                                     baseline              nozlc-v4
Min          mmap        20.9243 (  0.00%)     20.7716 (  0.73%)
1st-qrtle    mmap        22.0612 (  0.00%)     22.0680 ( -0.03%)
2nd-qrtle    mmap        22.3291 (  0.00%)     22.3809 ( -0.23%)
3rd-qrtle    mmap        25.2244 (  0.00%)     25.2396 ( -0.06%)
Max-90%      mmap        48.0995 (  0.00%)     28.3713 ( 41.02%)
Max-93%      mmap        52.5557 (  0.00%)     36.0170 ( 31.47%)
Max-95%      mmap        55.8173 (  0.00%)     47.3163 ( 15.23%)
Max-99%      mmap        67.3781 (  0.00%)     70.1140 ( -4.06%)
Max          mmap     24447.6375 (  0.00%)  12915.1356 ( 47.17%)
Mean         mmap        33.7883 (  0.00%)     27.7944 ( 17.74%)
Best99%Mean  mmap        27.7825 (  0.00%)     25.2767 (  9.02%)
Best95%Mean  mmap        26.3912 (  0.00%)     23.7994 (  9.82%)
Best90%Mean  mmap        24.9886 (  0.00%)     23.2251 (  7.06%)
Best50%Mean  mmap        22.0157 (  0.00%)     22.0261 ( -0.05%)
Best10%Mean  mmap        21.6705 (  0.00%)     21.6083 (  0.29%)
Best5%Mean   mmap        21.5581 (  0.00%)     21.4611 (  0.45%)
Best1%Mean   mmap        21.3079 (  0.00%)     21.1631 (  0.68%)

Note that the maximum stall latency went from 24 seconds to 12, which is still bad but an improvement. The mileage varies considerably: on a 2-node machine an earlier test went from 494 seconds to 47 seconds, and a 4-node machine that tested an earlier version of this patch went from a worst-case stall time of 6 seconds to 67ms. The nature of the benchmark is inherently unpredictable as it is hammering the system and the mileage will vary between machines.

There is a secondary impact with potentially more direct reclaim because zones are now being considered instead of being skipped by the zlc. In this particular test run it did not occur, so it will not be described. However, in at least one test the following was observed:

1. Direct reclaim rates were higher. This was likely due to direct reclaim being entered instead of the zlc disabling a zone and busy looping. Busy looping may have the effect of allowing kswapd to make more progress and in some cases may be better overall. If this is found then the correct action is to put direct reclaimers to sleep on a waitqueue and allow kswapd to make forward progress. Busy looping on the zlc is even worse than when the allocator used to blindly call congestion_wait().

2. There was higher swap activity as direct reclaim was active.

3. Direct reclaim efficiency was lower. This is related to 1, as more scanning activity also encountered more pages that could not be immediately reclaimed.

In that case, the direct page scan and reclaim rates are noticeable but it is not considered a problem for a few reasons:

1. The test is primarily concerned with latency. The mmap attempts are also faulted, which means there are THP allocation requests. The ZLC could cause zones to be disabled, causing the process to busy loop instead of reclaiming. This looks like elevated direct reclaim activity but it's the correct action to take based on what the processes requested.

2. The test hammers reclaim and compaction heavily. The number of successful THP faults is highly variable but affects the reclaim stats. It's not a realistic or reasonable measure of page reclaim activity.

3. No other page-reclaim intensive workload that was tested showed a problem.

4. If a workload is identified that benefitted from the busy looping then it should be fixed by having direct reclaimers sleep on a wait queue until woken by kswapd instead of busy looping. We had this class of problem before, when congestion_wait() with a fixed timeout was a brain-damaged decision that happened to benefit some workloads.

If a workload is identified that relied on the zlc to busy loop then it should be fixed correctly and have a direct reclaimer sleep on a waitqueue until woken by kswapd.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Vitaly Wool <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
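The remedy suggested twice in the message above, putting direct reclaimers to sleep until kswapd has made progress rather than letting them busy loop, is in shape an ordinary sleep/wake pattern. The following userspace sketch illustrates only that shape; it is not the kernel's throttling code, and every name in it is invented for the example.

/*
 * Conceptual sketch of the suggested remedy: direct reclaimers sleep on a
 * waitqueue until a kswapd-like thread reports progress, instead of busy
 * looping. This is a userspace analogue, not the kernel implementation.
 */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t reclaim_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t reclaim_wq = PTHREAD_COND_INITIALIZER;
static bool kswapd_made_progress;

/* Called by a direct reclaimer instead of spinning on retries. */
static void wait_for_kswapd_progress(void)
{
        pthread_mutex_lock(&reclaim_lock);
        while (!kswapd_made_progress)
                pthread_cond_wait(&reclaim_wq, &reclaim_lock);
        kswapd_made_progress = false;
        pthread_mutex_unlock(&reclaim_lock);
}

/* Called by the kswapd-like thread after it has freed some memory. */
static void kswapd_report_progress(void)
{
        pthread_mutex_lock(&reclaim_lock);
        kswapd_made_progress = true;
        pthread_cond_broadcast(&reclaim_wq);
        pthread_mutex_unlock(&reclaim_lock);
}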
1 parent 71baba4 commit f77cf4e

2 files changed: 0 additions, 286 deletions

include/linux/mmzone.h

Lines changed: 0 additions & 74 deletions
@@ -589,75 +589,8 @@ static inline bool zone_is_empty(struct zone *zone)
  * [1] : No fallback (__GFP_THISNODE)
  */
 #define MAX_ZONELISTS 2
-
-
-/*
- * We cache key information from each zonelist for smaller cache
- * footprint when scanning for free pages in get_page_from_freelist().
- *
- * 1) The BITMAP fullzones tracks which zones in a zonelist have come
- *    up short of free memory since the last time (last_fullzone_zap)
- *    we zero'd fullzones.
- * 2) The array z_to_n[] maps each zone in the zonelist to its node
- *    id, so that we can efficiently evaluate whether that node is
- *    set in the current tasks mems_allowed.
- *
- * Both fullzones and z_to_n[] are one-to-one with the zonelist,
- * indexed by a zones offset in the zonelist zones[] array.
- *
- * The get_page_from_freelist() routine does two scans. During the
- * first scan, we skip zones whose corresponding bit in 'fullzones'
- * is set or whose corresponding node in current->mems_allowed (which
- * comes from cpusets) is not set. During the second scan, we bypass
- * this zonelist_cache, to ensure we look methodically at each zone.
- *
- * Once per second, we zero out (zap) fullzones, forcing us to
- * reconsider nodes that might have regained more free memory.
- * The field last_full_zap is the time we last zapped fullzones.
- *
- * This mechanism reduces the amount of time we waste repeatedly
- * reexaming zones for free memory when they just came up low on
- * memory momentarilly ago.
- *
- * The zonelist_cache struct members logically belong in struct
- * zonelist. However, the mempolicy zonelists constructed for
- * MPOL_BIND are intentionally variable length (and usually much
- * shorter). A general purpose mechanism for handling structs with
- * multiple variable length members is more mechanism than we want
- * here. We resort to some special case hackery instead.
- *
- * The MPOL_BIND zonelists don't need this zonelist_cache (in good
- * part because they are shorter), so we put the fixed length stuff
- * at the front of the zonelist struct, ending in a variable length
- * zones[], as is needed by MPOL_BIND.
- *
- * Then we put the optional zonelist cache on the end of the zonelist
- * struct. This optional stuff is found by a 'zlcache_ptr' pointer in
- * the fixed length portion at the front of the struct. This pointer
- * both enables us to find the zonelist cache, and in the case of
- * MPOL_BIND zonelists, (which will just set the zlcache_ptr to NULL)
- * to know that the zonelist cache is not there.
- *
- * The end result is that struct zonelists come in two flavors:
- *  1) The full, fixed length version, shown below, and
- *  2) The custom zonelists for MPOL_BIND.
- * The custom MPOL_BIND zonelists have a NULL zlcache_ptr and no zlcache.
- *
- * Even though there may be multiple CPU cores on a node modifying
- * fullzones or last_full_zap in the same zonelist_cache at the same
- * time, we don't lock it. This is just hint data - if it is wrong now
- * and then, the allocator will still function, perhaps a bit slower.
- */
-
-
-struct zonelist_cache {
-        unsigned short z_to_n[MAX_ZONES_PER_ZONELIST];          /* zone->nid */
-        DECLARE_BITMAP(fullzones, MAX_ZONES_PER_ZONELIST);      /* zone full? */
-        unsigned long last_full_zap;                            /* when last zap'd (jiffies) */
-};
 #else
 #define MAX_ZONELISTS 1
-struct zonelist_cache;
 #endif
 
 /*
@@ -675,9 +608,6 @@ struct zoneref {
  * allocation, the other zones are fallback zones, in decreasing
  * priority.
  *
- * If zlcache_ptr is not NULL, then it is just the address of zlcache,
- * as explained above. If zlcache_ptr is NULL, there is no zlcache.
- * *
  * To speed the reading of the zonelist, the zonerefs contain the zone index
  * of the entry being read. Helper functions to access information given
  * a struct zoneref are
@@ -687,11 +617,7 @@ struct zoneref {
  * zonelist_node_idx() - Return the index of the node for an entry
  */
 struct zonelist {
-        struct zonelist_cache *zlcache_ptr;                  // NULL or &zlcache
         struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
-#ifdef CONFIG_NUMA
-        struct zonelist_cache zlcache;                       // optional ...
-#endif
 };
 
 #ifndef CONFIG_DISCONTIGMEM
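For clarity, the shape of the change to struct zonelist can be restated outside the diff. The types below are simplified stand-ins (struct zoneref and the array bound are reduced to placeholders), not the kernel definitions; they only show that the zlcache_ptr indirection and the optional trailing cache described in the deleted comment disappear, leaving just the zoneref array.

/*
 * Simplified before/after shape of struct zonelist. Placeholder types,
 * not the kernel definitions.
 */
#define ZONELIST_MODEL_MAX_ZONES 8

struct zoneref_model {
        void *zone;     /* stand-in for struct zone * */
        int zone_idx;   /* cached zone index */
};

/* Before: fixed-length front plus an optional cache found via a pointer. */
struct zonelist_before_model {
        void *zlcache_ptr;      /* NULL (MPOL_BIND) or points at the cache */
        struct zoneref_model _zonerefs[ZONELIST_MODEL_MAX_ZONES + 1];
        /* the optional zonelist_cache lived here on NUMA builds */
};

/* After this patch: just the zoneref array remains. */
struct zonelist_after_model {
        struct zoneref_model _zonerefs[ZONELIST_MODEL_MAX_ZONES + 1];
};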
