Commit 39cf275

Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "The main changes in this cycle are:

   - (much) improved CONFIG_NUMA_BALANCING support from Mel Gorman, Rik
     van Riel, Peter Zijlstra et al.  Yay!

   - optimize preemption counter handling: merge the NEED_RESCHED flag
     into the preempt_count variable, by Peter Zijlstra.

   - wait.h fixes and code reorganization from Peter Zijlstra

   - cfs_bandwidth fixes from Ben Segall

   - SMP load-balancer cleanups from Peter Zijlstra

   - idle balancer improvements from Jason Low

   - other fixes and cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits)
  ftrace, sched: Add TRACE_FLAG_PREEMPT_RESCHED
  stop_machine: Fix race between stop_two_cpus() and stop_cpus()
  sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus
  sched: Fix asymmetric scheduling for POWER7
  sched: Move completion code from core.c to completion.c
  sched: Move wait code from core.c to wait.c
  sched: Move wait.c into kernel/sched/
  sched/wait: Fix __wait_event_interruptible_lock_irq_timeout()
  sched: Avoid throttle_cfs_rq() racing with period_timer stopping
  sched: Guarantee new group-entities always have weight
  sched: Fix hrtimer_cancel()/rq->lock deadlock
  sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining
  sched: Fix race on toggling cfs_bandwidth_used
  sched: Remove extra put_online_cpus() inside sched_setaffinity()
  sched/rt: Fix task_tick_rt() comment
  sched/wait: Fix build breakage
  sched/wait: Introduce prepare_to_wait_event()
  sched/wait: Add ___wait_cond_timeout() to wait_event*_timeout() too
  sched: Remove get_online_cpus() usage
  sched: Fix race in migrate_swap_stop()
  ...
2 parents ad5d698 + e5137b5 commit 39cf275

117 files changed

Lines changed: 3569 additions & 1594 deletions

Documentation/sysctl/kernel.txt

Lines changed: 76 additions & 0 deletions
@@ -355,6 +355,82 @@ utilize.
 
 ==============================================================
 
+numa_balancing
+
+Enables/disables automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes
+that access it often.
+
+Enables/disables automatic NUMA memory balancing. On NUMA machines, there
+is a performance penalty if remote memory is accessed by a CPU. When this
+feature is enabled the kernel samples what task thread is accessing memory
+by periodically unmapping pages and later trapping a page fault. At the
+time of the page fault, it is determined if the data being accessed should
+be migrated to a local memory node.
+
+The unmapping of pages and trapping faults incur additional overhead that
+ideally is offset by improved memory locality but there is no universal
+guarantee. If the target workload is already bound to NUMA nodes then this
+feature should be disabled. Otherwise, if the system overhead from the
+feature is too high then the rate the kernel samples for NUMA hinting
+faults may be controlled by the numa_balancing_scan_period_min_ms,
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
+numa_balancing_scan_size_mb, numa_balancing_settle_count sysctls and
+numa_balancing_migrate_deferred.
+
+==============================================================
+
+numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
+
+Automatic NUMA balancing scans tasks address space and unmaps pages to
+detect if pages are properly placed or if the data should be migrated to a
+memory node local to where the task is running. Every "scan delay" the task
+scans the next "scan size" number of pages in its address space. When the
+end of the address space is reached the scanner restarts from the beginning.
+
+In combination, the "scan delay" and "scan size" determine the scan rate.
+When "scan delay" decreases, the scan rate increases. The scan delay and
+hence the scan rate of every task is adaptive and depends on historical
+behaviour. If pages are properly placed then the scan delay increases,
+otherwise the scan delay decreases. The "scan size" is not adaptive but
+the higher the "scan size", the higher the scan rate.
+
+Higher scan rates incur higher system overhead as page faults must be
+trapped and potentially data must be migrated. However, the higher the scan
+rate, the more quickly a tasks memory is migrated to a local node if the
+workload pattern changes and minimises performance impact due to remote
+memory accesses. These sysctls control the thresholds for scan delays and
+the number of pages scanned.
+
+numa_balancing_scan_period_min_ms is the minimum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the maximum scanning
+rate for each task.
+
+numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
+when it initially forks.
+
+numa_balancing_scan_period_max_ms is the maximum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the minimum scanning
+rate for each task.
+
+numa_balancing_scan_size_mb is how many megabytes worth of pages are
+scanned for a given scan.
+
+numa_balancing_settle_count is how many scan periods must complete before
+the schedule balancer stops pushing the task towards a preferred node. This
+gives the scheduler a chance to place the task on an alternative node if the
+preferred node is overloaded.
+
+numa_balancing_migrate_deferred is how many page migrations get skipped
+unconditionally, after a page migration is skipped because a page is shared
+with other tasks. This reduces page migration overhead, and determines
+how much stronger the "move task near its memory" policy scheduler becomes,
+versus the "move memory near its task" memory management policy, for workloads
+with shared memory.
+
+==============================================================
+
 osrelease, ostype & version:
 
 # cat osrelease
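
To see which of the tunables documented in this hunk a running kernel exposes, the following is a minimal sketch. It assumes the sysctls appear as files under /proc/sys/kernel/ with the names listed in the hunk, on a kernel built with CONFIG_NUMA_BALANCING. The helper name read_numa_balancing_sysctls and its base_dir parameter are hypothetical and exist only for this illustration. A missing or unreadable file is reported as None, since a given kernel may not provide every one of these entries.

```python
from pathlib import Path

# Sysctl names copied from the documentation hunk above.
NUMA_SYSCTLS = (
    "numa_balancing",
    "numa_balancing_scan_delay_ms",
    "numa_balancing_scan_period_min_ms",
    "numa_balancing_scan_period_max_ms",
    "numa_balancing_scan_size_mb",
    "numa_balancing_settle_count",
    "numa_balancing_migrate_deferred",
)


def read_numa_balancing_sysctls(base_dir="/proc/sys/kernel"):
    """Return {name: int or None}; None means the file is absent or unreadable.

    base_dir is an assumption for illustration: the usual procfs location
    of kernel sysctls. Pass another directory to test without a NUMA kernel.
    """
    values = {}
    for name in NUMA_SYSCTLS:
        try:
            values[name] = int((Path(base_dir) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_numa_balancing_sysctls().items():
        print(f"{name} = {'unavailable' if value is None else value}")
```

Reading the values this way is side-effect free. Changing them requires root and writing to the same files, which this sketch deliberately does not do.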

Documentation/trace/ftrace.txt

Lines changed: 5 additions & 1 deletion
@@ -655,7 +655,11 @@ explains which is which.
 read the irq flags variable, an 'X' will always
 be printed here.
 
-need-resched: 'N' task need_resched is set, '.' otherwise.
+need-resched:
+'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set,
+'n' only TIF_NEED_RESCHED is set,
+'p' only PREEMPT_NEED_RESCHED is set,
+'.' otherwise.
 
 hardirq/softirq:
 'H' - hard irq occurred inside a softirq.
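
The four-way need-resched encoding added by this hunk can be written as a small lookup. The sketch below only restates the mapping in the documentation text. It is not the kernel's implementation, and the function and table names are hypothetical.

```python
def need_resched_char(tif_need_resched, preempt_need_resched):
    """Map the two flags to the character described in the ftrace.txt hunk."""
    if tif_need_resched and preempt_need_resched:
        return "N"  # both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED set
    if tif_need_resched:
        return "n"  # only TIF_NEED_RESCHED set
    if preempt_need_resched:
        return "p"  # only PREEMPT_NEED_RESCHED set
    return "."      # neither flag set


# Reverse table: trace character -> (TIF_NEED_RESCHED, PREEMPT_NEED_RESCHED).
NEED_RESCHED_FLAGS = {
    need_resched_char(tif, preempt): (tif, preempt)
    for tif in (False, True)
    for preempt in (False, True)
}
```

A script that parses trace output could use NEED_RESCHED_FLAGS to turn the column back into the two flag values, for example NEED_RESCHED_FLAGS["p"] for a line where only PREEMPT_NEED_RESCHED was set.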

MAINTAINERS

Lines changed: 2 additions & 0 deletions
@@ -7326,6 +7326,8 @@ S: Maintained
 F: kernel/sched/
 F: include/linux/sched.h
 F: include/uapi/linux/sched.h
+F: kernel/wait.c
+F: include/linux/wait.h
 
 SCORE ARCHITECTURE
 M: Chen Liqin <liqin.linux@gmail.com>

arch/alpha/include/asm/Kbuild

Lines changed: 1 addition & 0 deletions
@@ -3,3 +3,4 @@ generic-y += clkdev.h
 
 generic-y += exec.h
 generic-y += trace_clock.h
+generic-y += preempt.h

arch/arc/include/asm/Kbuild

Lines changed: 1 addition & 0 deletions
@@ -46,3 +46,4 @@ generic-y += ucontext.h
 generic-y += user.h
 generic-y += vga.h
 generic-y += xor.h
+generic-y += preempt.h

arch/arm/include/asm/Kbuild

Lines changed: 1 addition & 0 deletions
@@ -32,3 +32,4 @@ generic-y += termios.h
 generic-y += timex.h
 generic-y += trace_clock.h
 generic-y += unaligned.h
+generic-y += preempt.h

arch/arm64/include/asm/Kbuild

Lines changed: 1 addition & 0 deletions
@@ -50,3 +50,4 @@ generic-y += unaligned.h
 generic-y += user.h
 generic-y += vga.h
 generic-y += xor.h
+generic-y += preempt.h

arch/avr32/include/asm/Kbuild

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@ generic-y += div64.h
 generic-y += emergency-restart.h
 generic-y += exec.h
 generic-y += futex.h
+generic-y += preempt.h
 generic-y += irq_regs.h
 generic-y += param.h
 generic-y += local.h

arch/blackfin/include/asm/Kbuild

Lines changed: 1 addition & 0 deletions
@@ -44,3 +44,4 @@ generic-y += ucontext.h
 generic-y += unaligned.h
 generic-y += user.h
 generic-y += xor.h
+generic-y += preempt.h

arch/c6x/include/asm/Kbuild

Lines changed: 1 addition & 0 deletions
@@ -56,3 +56,4 @@ generic-y += ucontext.h
 generic-y += user.h
 generic-y += vga.h
 generic-y += xor.h
+generic-y += preempt.h
