You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Replace the memory controller's custom hierarchical stats code with the
generic rstat infrastructure provided by the cgroup core.
The current implementation does batched upward propagation from the
write side (i.e. as stats change). The per-cpu batches introduce an
error, which is multiplied by the number of subgroups in a tree. In
systems with many CPUs and sizable cgroup trees, the error can be large
enough to confuse users (e.g. 32 batch pages * 32 CPUs * 32 subgroups
results in an error of up to 128M per stat item). This can entirely
swallow allocation bursts inside a workload that the user is expecting
to see reflected in the statistics.
In the past, we've done read-side aggregation, where a memory.stat read
would have to walk the entire subtree and add up per-cpu counts. This
became problematic with lazily-freed cgroups: we could have large
subtrees where most cgroups were entirely idle. Hence the switch to
change-driven upward propagation. Unfortunately, it needed to trade
accuracy for speed due to the write side being so hot.
Rstat combines the best of both worlds: from the write side, it cheaply
maintains a queue of cgroups that have pending changes, so that the read
side can do selective tree aggregation. This way the reported stats
will always be precise and recent as can be, while the aggregation can
skip over potentially large numbers of idle cgroups.
The way rstat works is that it implements a tree for tracking cgroups
with pending local changes, as well as a flush function that walks the
tree upwards. The controller then drives this by 1) telling rstat when
a local cgroup stat changes (e.g. mod_memcg_state) and 2) when a flush
is required to get uptodate hierarchy stats for a given subtree (e.g.
when memory.stat is read). The controller also provides a flush
callback that is called during the rstat flush walk for each cgroup and
aggregates its local per-cpu counters and propagates them upwards.
This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT +
NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward
aggregation. It removes 3 words from the per-cpu data. It eliminates
memcg_exact_page_state(), since memcg_page_state() is now exact.
[[email protected]: merge fix]
[[email protected]: fix a sleep in atomic section problem]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Reviewed-by: Michal Koutný <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
0 commit comments