
Commit ef6a22b

gormanm authored and akpm00 committed
sched/numa: apply the scan delay to every new vma
Patch series "sched/numa: Enhance vma scanning", v3.

The patchset proposes one of the enhancements to NUMA vma scanning suggested by Mel. It is a continuation of [3]; reposting the rebased patchset to the akpm mm-unstable tree (March 1).

The existing scan-period mechanism derives the scan period from per-thread stats. Process Adaptive autoNUMA [1] proposed gathering NUMA fault stats at the per-process level to capture application behaviour better. During that discussion, Mel proposed several ideas to enhance current NUMA balancing. One of the suggestions was:

    Track what threads access a VMA. The suggestion was to use an
    unsigned long pid_mask and use the lower bits to tag approximately
    what threads access a VMA. Skip VMAs that did not trap a fault.
    This would be approximate because of PID collisions but would
    reduce scanning of areas the thread is not interested in.

The suggestion intends not to penalize threads that have no interest in the vma, and thus to reduce scanning overhead. V3 changes are mostly based on PeterZ's comments (details below in changes).

Summary of the patchset. The current patchset implements:

1. Delay the vma scanning logic for newly created VMAs so that the additional overhead of scanning is not incurred for short-lived tasks (implementation by Mel).
2. Store the information of tasks accessing a VMA in 2 windows, regularly cleared at a (4*sysctl_numa_balancing_scan_delay) interval. That interval was derived from experimentation (suggested by PeterZ) to balance frequent clearing against obsolete access data.
3. hash_32 is used to encode the index of the task accessing the VMA.
4. The VMA's access information is used to skip scanning for tasks that have not accessed the VMA.

Changes since V2:

Patch 1:
- Rename structure, change macro to function
- Add explanation of the heuristics
- Add more details from results (PeterZ)

Patch 2:
- Use test and set bit (PeterZ)
- Move storing of access-PID info to numa_migrate_prep()
- Add a note on fairness among tasks allowed to scan (PeterZ)

Patch 3:
- Maintain two windows of access-PID information (PeterZ supported the implementation and gave the idea to extend to N windows if needed)

Patch 4:
- Apply the hash_32 function to track PIDs accessing a VMA (PeterZ)

Changes since RFC V1:
- Include Mel's vma scan delay patch
- Change the accessing-PID store logic (thanks, Mel)
- Fence structure/code with NUMA_BALANCING (David, Mel)
- Add the clearing of access-PIDs logic (Mel)
- Descriptive change log (Mike Rapoport)

Things to ponder over:
======================
- Improvement to the clearing-accessing-PIDs logic (discussed in detail in patch 3 itself; addressed in this patchset by implementing a 2-window history)
- The current scan period is not changed in the patchset, so we do see frequent attempts to scan. Relaxing the scan period dynamically could improve results further.

[1] sched/numa: Process Adaptive autoNUMA
    Link: https://lore.kernel.org/lkml/[email protected]/T/
[2] RFC V1
    Link: https://lore.kernel.org/all/[email protected]/
[3] V2
    Link: https://lore.kernel.org/lkml/[email protected]/

Results:
========
Summary: Huge autonuma cost reduction seen in mmtests. The kernbench improvement is more than 5%, and there is a huge system-time (80%+) improvement in mmtests autonumabench.
(dbench had huge std deviation to post)

kernbench
=========
                                6.2.0-mmunstable-base  6.2.0-mmunstable-patched
Amean     user-256            22002.51 (   0.00%)     22649.95 *  -2.94%*
Amean     syst-256            10162.78 (   0.00%)      8214.13 *  19.17%*
Amean     elsp-256              160.74 (   0.00%)       156.92 *   2.38%*

Duration User       66017.43    67959.84
Duration System     30503.15    24657.03
Duration Elapsed      504.61      493.12

                                  6.2.0-mmunstable-base  6.2.0-mmunstable-patched
Ops NUMA alloc hit                        1738835089.00         1738780310.00
Ops NUMA alloc local                      1738834448.00         1738779711.00
Ops NUMA base-page range updates              477310.00              392566.00
Ops NUMA PTE updates                          477310.00              392566.00
Ops NUMA hint faults                           96817.00               87555.00
Ops NUMA hint local faults %                   10150.00                2192.00
Ops NUMA hint local percent                       10.48                    2.50
Ops NUMA pages migrated                        86660.00               85363.00
Ops AutoNUMA cost                                489.07                  442.14

autonumabench
=============
                                6.2.0-mmunstable-base  6.2.0-mmunstable-patched
Amean     syst-NUMA01                 399.50 (   0.00%)       52.05 *  86.97%*
Amean     syst-NUMA01_THREADLOCAL       0.21 (   0.00%)        0.22 *  -5.41%*
Amean     syst-NUMA02                   0.80 (   0.00%)        0.78 *   2.68%*
Amean     syst-NUMA02_SMT               0.65 (   0.00%)        0.68 *  -3.95%*
Amean     elsp-NUMA01                 313.26 (   0.00%)      313.11 *   0.05%*
Amean     elsp-NUMA01_THREADLOCAL       1.06 (   0.00%)        1.08 *  -1.76%*
Amean     elsp-NUMA02                   3.19 (   0.00%)        3.24 *  -1.52%*
Amean     elsp-NUMA02_SMT               3.72 (   0.00%)        3.61 *   2.92%*

Duration User      396433.47   324835.96
Duration System      2808.70      376.66
Duration Elapsed     2258.61     2258.12

                                  6.2.0-mmunstable-base  6.2.0-mmunstable-patched
Ops NUMA alloc hit                          59921806.00           49623489.00
Ops NUMA alloc miss                                0.00                  0.00
Ops NUMA interleave hit                            0.00                  0.00
Ops NUMA alloc local                        59920880.00           49622594.00
Ops NUMA base-page range updates           152259275.00              50075.00
Ops NUMA PTE updates                       152259275.00              50075.00
Ops NUMA PMD updates                               0.00                  0.00
Ops NUMA hint faults                       154660352.00              39014.00
Ops NUMA hint local faults %               138550501.00              23139.00
Ops NUMA hint local percent                       89.58                 59.31
Ops NUMA pages migrated                      8179067.00              14147.00
Ops AutoNUMA cost                             774522.98                195.69

This patch (of 4):

Currently, whenever a new task is created, we wait for sysctl_numa_balancing_scan_delay to avoid unnecessary scanning overhead. Extend the same logic to new or very short-lived VMAs.

[[email protected]: add initialization in vm_area_dup()]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/7a6fbba87c8b51e67efd3e74285bb4cb311a16ca.1677672277.git.raghavendra.kt@amd.com
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Disha Talreja <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
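For context on the rest of the series (patches 2-4 are not part of this commit), the access-PID tagging described in summary points 2-4 could look roughly like the sketch below. This is a minimal illustration of the two-window pid_mask idea only; the struct and helper names (vma_access_pids, vma_tag_accessing_task, vma_accessed_by_current, vma_age_access_pids) are placeholders, not the identifiers used in the actual patches.

/*
 * Rough sketch of the access-PID tagging from patches 2-4 of the
 * series (not this commit). All names here are illustrative.
 */
#include <linux/bitops.h>
#include <linux/hash.h>
#include <linux/log2.h>
#include <linux/sched.h>

struct vma_access_pids {
	/* Two windows of approximate per-task access bits. */
	unsigned long pids_active[2];
};

/* Record, at fault time, that the current task touched this VMA. */
static inline void vma_tag_accessing_task(struct vma_access_pids *ap)
{
	/* hash_32() maps the PID onto one of BITS_PER_LONG bits. */
	unsigned int bit = hash_32(current->pid, ilog2(BITS_PER_LONG));

	__set_bit(bit, &ap->pids_active[1]);
}

/*
 * Should the current task scan this VMA? Approximate: hash collisions
 * can yield false positives, never false negatives.
 */
static inline bool vma_accessed_by_current(struct vma_access_pids *ap)
{
	unsigned int bit = hash_32(current->pid, ilog2(BITS_PER_LONG));

	return test_bit(bit, &ap->pids_active[0]) ||
	       test_bit(bit, &ap->pids_active[1]);
}

/* Age the windows every 4 * sysctl_numa_balancing_scan_delay. */
static inline void vma_age_access_pids(struct vma_access_pids *ap)
{
	ap->pids_active[0] = ap->pids_active[1];
	ap->pids_active[1] = 0;
}

Because hash_32() folds all PIDs onto BITS_PER_LONG bits, two tasks can share a bit; as the quoted suggestion notes, that only causes extra scanning, never a missed scan.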
1 parent e06f47a

File tree: 4 files changed, +44 −0 lines changed

include/linux/mm.h

16 additions, 0 deletions

@@ -29,6 +29,7 @@
 #include <linux/pgtable.h>
 #include <linux/kasan.h>
 #include <linux/memremap.h>
+#include <linux/slab.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -627,6 +628,20 @@ struct vm_operations_struct {
 					unsigned long addr);
 };
 
+#ifdef CONFIG_NUMA_BALANCING
+static inline void vma_numab_state_init(struct vm_area_struct *vma)
+{
+	vma->numab_state = NULL;
+}
+static inline void vma_numab_state_free(struct vm_area_struct *vma)
+{
+	kfree(vma->numab_state);
+}
+#else
+static inline void vma_numab_state_init(struct vm_area_struct *vma) {}
+static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
+#endif /* CONFIG_NUMA_BALANCING */
+
 #ifdef CONFIG_PER_VMA_LOCK
 /*
  * Try to read-lock a vma. The function is allowed to occasionally yield false
@@ -747,6 +762,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_ops = &dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
 	vma_mark_detached(vma, false);
+	vma_numab_state_init(vma);
 }
 
 /* Use when VMA is not part of the VMA tree and needs no locking */

include/linux/mm_types.h

7 additions, 0 deletions

@@ -475,6 +475,10 @@ struct vma_lock {
 	struct rw_semaphore lock;
 };
 
+struct vma_numab_state {
+	unsigned long next_scan;
+};
+
 /*
  * This struct describes a virtual memory area. There is one of these
  * per VM-area/task. A VM area is any part of the process virtual memory
@@ -560,6 +564,9 @@ struct vm_area_struct {
 #endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+	struct vma_numab_state *numab_state;	/* NUMA Balancing state */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 } __randomize_layout;

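Note that numab_state is a lazily allocated pointer rather than a struct embedded in vm_area_struct: vma_numab_state_init() merely NULLs the pointer, and the state is actually allocated in task_numa_work() (see the kernel/sched/fair.c hunk below), so only VMAs the NUMA scanner reaches pay the extra memory. vma_numab_state_free() is safe on never-scanned VMAs because kfree(NULL) is a no-op.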
kernel/fork.c

2 additions, 0 deletions

@@ -516,13 +516,15 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 		return NULL;
 	}
 	INIT_LIST_HEAD(&new->anon_vma_chain);
+	vma_numab_state_init(new);
 	dup_anon_vma_name(orig, new);
 
 	return new;
 }
 
 void __vm_area_free(struct vm_area_struct *vma)
 {
+	vma_numab_state_free(vma);
 	free_anon_vma_name(vma);
 	vma_lock_free(vma);
 	kmem_cache_free(vm_area_cachep, vma);

kernel/sched/fair.c

19 additions, 0 deletions

@@ -3027,6 +3027,25 @@ static void task_numa_work(struct callback_head *work)
 		if (!vma_is_accessible(vma))
 			continue;
 
+		/* Initialise new per-VMA NUMAB state. */
+		if (!vma->numab_state) {
+			vma->numab_state = kzalloc(sizeof(struct vma_numab_state),
+				GFP_KERNEL);
+			if (!vma->numab_state)
+				continue;
+
+			vma->numab_state->next_scan = now +
+				msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+		}
+
+		/*
+		 * Scanning the VMAs of short-lived tasks adds more overhead,
+		 * so delay the scan for new VMAs.
+		 */
+		if (mm->numa_scan_seq && time_before(jiffies,
+						vma->numab_state->next_scan))
+			continue;
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);

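Two details of the hunk above are easy to miss: time_before() compares jiffies wraparound-safely, and the mm->numa_scan_seq check means the delay is not applied during the mm's very first scan pass, presumably because that pass was already deferred by the per-task scan delay. Below is a standalone userspace simulation of the check; jiffies, HZ, and the helper should_skip_scan() are stand-ins made up for illustration, not kernel code.

/*
 * Userspace simulation of the per-VMA scan-delay check above.
 * Everything here is a stand-in for illustration only.
 */
#include <stdbool.h>
#include <stdio.h>

#define HZ 250				/* assumed tick rate */
#define msecs_to_jiffies(ms)	((ms) * HZ / 1000)
/* Wraparound-safe "a is strictly before b", as in the kernel. */
#define time_before(a, b)	((long)((a) - (b)) < 0)

struct vma_numab_state { unsigned long next_scan; };

static bool should_skip_scan(unsigned long jiffies, int numa_scan_seq,
			     struct vma_numab_state *vma_state,
			     unsigned int scan_delay_ms)
{
	/* First encounter: arm the delay, as task_numa_work() does. */
	if (!vma_state->next_scan)
		vma_state->next_scan =
			jiffies + msecs_to_jiffies(scan_delay_ms);

	/* Skip only once the first full pass is behind us. */
	return numa_scan_seq && time_before(jiffies, vma_state->next_scan);
}

int main(void)
{
	struct vma_numab_state vma = { 0 };
	unsigned long now = 10000;

	/* 1000 ms delay at HZ=250 arms next_scan = now + 250 jiffies. */
	printf("at now:       skip=%d\n", should_skip_scan(now, 1, &vma, 1000));
	printf("at now + 300: skip=%d\n",
	       should_skip_scan(now + 300, 1, &vma, 1000));
	return 0;
}

With the default sysctl_numa_balancing_scan_delay of 1000 ms, a new VMA is therefore skipped for one scan-delay period after it is first seen, then scanned normally.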