Commit 657bd90

Merge tag 'sched-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "Core scheduler updates:

   - Add CONFIG_PREEMPT_DYNAMIC: this in its current form adds the
     preempt=none/voluntary/full boot options (default: full), to allow
     distros to build a PREEMPT kernel but fall back to close to
     PREEMPT_VOLUNTARY (or PREEMPT_NONE) runtime scheduling behavior via
     a boot time selection.

     There's also the /debug/sched_debug switch to do this at runtime.

     This feature is implemented via runtime patching (a new variant of
     static calls).

     The scope of the runtime patching can be best reviewed by looking
     at the sched_dynamic_update() function in kernel/sched/core.c.

     ( Note that the dynamic none/voluntary mode isn't 100% identical,
       for example preempt-RCU is available in all cases, plus the
       preempt count is maintained in all models, which has runtime
       overhead even with the code patching. )

     The PREEMPT_VOLUNTARY/PREEMPT_NONE models, used by the vast
     majority of distributions, are supposed to be unaffected.

   - Fix ignored rescheduling after rcu_eqs_enter(). This is a bug that
     was found via rcutorture triggering a hang. The bug is that
     rcu_idle_enter() may wake up a NOCB kthread, but this happens after
     the last generic need_resched() check. Some cpuidle drivers fix it
     by chance but many others don't.

     In true 2020 fashion the original bug fix has grown into a 5-patch
     scheduler/RCU fix series plus another 16 RCU patches to address
     the underlying issue of missed preemption events. These are the
     initial fixes that should fix current incarnations of the bug.

   - Clean up rbtree usage in the scheduler, by providing & using the
     following consistent set of rbtree APIs:

       partial-order; less() based:
         - rb_add(): add a new entry to the rbtree
         - rb_add_cached(): like rb_add(), but for a rb_root_cached

       total-order; cmp() based:
         - rb_find(): find an entry in an rbtree
         - rb_find_add(): find an entry, and add if not found

         - rb_find_first(): find the first (leftmost) matching entry
         - rb_next_match(): continue from rb_find_first()
         - rb_for_each(): iterate a sub-tree using the previous two

   - Improve the SMP/NUMA load-balancer: scan for an idle sibling in a
     single pass. This is a 4-commit series where each commit improves
     one aspect of the idle sibling scan logic.

   - Improve the cpufreq cooling driver by getting the effective CPU
     utilization metrics from the scheduler

   - Improve the fair scheduler's active load-balancing logic by
     reducing the number of active LB attempts & lengthen the
     load-balancing interval. This improves stress-ng mmapfork
     performance.

   - Fix CFS's estimated utilization (util_est) calculation bug that can
     result in too high utilization values

  Misc updates & fixes:

   - Fix the HRTICK reprogramming & optimization feature

   - Fix SCHED_SOFTIRQ raising race & warning in the CPU offlining code

   - Reduce dl_add_task_root_domain() overhead

   - Fix uprobes refcount bug

   - Process pending softirqs in flush_smp_call_function_from_idle()

   - Clean up task priority related defines, remove *USER_*PRIO and
     USER_PRIO()

   - Simplify the sched_init_numa() deduplication sort

   - Documentation updates

   - Fix EAS bug in update_misfit_status(), which degraded the quality
     of energy-balancing

   - Smaller cleanups"

* tag 'sched-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
  sched,x86: Allow !PREEMPT_DYNAMIC
  entry/kvm: Explicitly flush pending rcuog wakeup before last rescheduling point
  entry: Explicitly flush pending rcuog wakeup before last rescheduling point
  rcu/nocb: Trigger self-IPI on late deferred wake up before user resume
  rcu/nocb: Perform deferred wake up before last idle's need_resched() check
  rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers
  sched/features: Distinguish between NORMAL and DEADLINE hrtick
  sched/features: Fix hrtick reprogramming
  sched/deadline: Reduce rq lock contention in dl_add_task_root_domain()
  uprobes: (Re)add missing get_uprobe() in __find_uprobe()
  smp: Process pending softirqs in flush_smp_call_function_from_idle()
  sched: Harden PREEMPT_DYNAMIC
  static_call: Allow module use without exposing static_call_key
  sched: Add /debug/sched_preempt
  preempt/dynamic: Support dynamic preempt with preempt= boot option
  preempt/dynamic: Provide irqentry_exit_cond_resched() static call
  preempt/dynamic: Provide preempt_schedule[_notrace]() static calls
  preempt/dynamic: Provide cond_resched() and might_resched() static calls
  preempt: Introduce CONFIG_PREEMPT_DYNAMIC
  static_call: Provide DEFINE_STATIC_CALL_RET0()
  ...
2 parents 7b15c27 + c5e6fc0 commit 657bd90


48 files changed, 1898 insertions(+), 785 deletions(-)

Documentation/admin-guide/kernel-parameters.txt

Lines changed: 7 additions & 0 deletions
@@ -3903,6 +3903,13 @@
 			Format: {"off"}
 			Disable Hardware Transactional Memory
 
+	preempt=	[KNL]
+			Select preemption mode if you have CONFIG_PREEMPT_DYNAMIC
+			none - Limited to cond_resched() calls
+			voluntary - Limited to cond_resched() and might_sleep() calls
+			full - Any section that isn't explicitly preempt disabled
+			       can be preempted anytime.
+
 	print-fatal-signals=
 			[KNL] debug: print fatal signals

Lines changed: 169 additions & 0 deletions
@@ -0,0 +1,169 @@

NOTE: all this assumes a linear relation between frequency and work capacity,
we know this is flawed, but it is the best workable approximation.


PELT (Per Entity Load Tracking)
-------------------------------

With PELT we track some metrics across the various scheduler entities, from
individual tasks to task-group slices to CPU runqueues. As the basis for this
we use an Exponentially Weighted Moving Average (EWMA), each period (1024us)
is decayed such that y^32 = 0.5. That is, the most recent 32ms contribute
half, while the rest of history contribute the other half.

Specifically:

  ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ...

  ewma(u) = ewma_sum(u) / ewma_sum(1)

Since this is essentially a progression of an infinite geometric series, the
results are composable, that is ewma(A) + ewma(B) = ewma(A+B). This property
is key, since it gives the ability to recompose the averages when tasks move
around.

Note that blocked tasks still contribute to the aggregates (task-group slices
and CPU runqueues), which reflects their expected contribution when they
resume running.

Using this we track 2 key metrics: 'running' and 'runnable'. 'Running'
reflects the time an entity spends on the CPU, while 'runnable' reflects the
time an entity spends on the runqueue. When there is only a single task these
two metrics are the same, but once there is contention for the CPU 'running'
will decrease to reflect the fraction of time each task spends on the CPU
while 'runnable' will increase to reflect the amount of contention.

For more detail see: kernel/sched/pelt.c
Frequency- / CPU Invariance
---------------------------

Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
a big CPU, we allow architectures to scale the time delta with two ratios, one
Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio.

For simple DVFS architectures (where software is in full control) we trivially
compute the ratio as:

	    f_cur
  r_dvfs := -----
	    f_max

For more dynamic systems where the hardware is in control of DVFS we use
hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this ratio.
For Intel specifically, we use:

	   APERF
  f_cur := ----- * P0
	   MPERF

	     4C-turbo;	if available and turbo enabled
  f_max := { 1C-turbo;	if turbo enabled
	     P0;	otherwise

		    f_cur
  r_dvfs := min( 1, ----- )
		    f_max

We pick 4C turbo over 1C turbo to make it slightly more sustainable.

r_cpu is determined as the ratio of highest performance level of the current
CPU vs the highest performance level of any other CPU in the system.

  r_tot = r_dvfs * r_cpu

The result is that the above 'running' and 'runnable' metrics become invariant
of DVFS and CPU type. IOW. we can transfer and compare them between CPUs.

For more detail see:

 - kernel/sched/pelt.h:update_rq_clock_pelt()
 - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
 - Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"
UTIL_EST / UTIL_EST_FASTUP
--------------------------

Because periodic tasks have their averages decayed while they sleep, even
though when running their expected utilization will be the same, they suffer a
(DVFS) ramp-up after they are running again.

To alleviate this (a default enabled option) UTIL_EST drives an Infinite
Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is
highest. A further default enabled option UTIL_EST_FASTUP modifies the IIR
filter to instantly increase and only decay on decrease.

A further runqueue wide sum (of runnable tasks) is maintained of:

  util_est := \Sum_t max( t_running, t_util_est_ewma )

For more detail see: kernel/sched/fair.c:util_est_dequeue()
UCLAMP
------

It is possible to set effective u_min and u_max clamps on each CFS or RT task;
the runqueue keeps a max aggregate of these clamps for all running tasks.

For more detail see: include/uapi/linux/sched/types.h
Schedutil / DVFS
----------------

Every time the scheduler load tracking is updated (task wakeup, task
migration, time progression) we call out to schedutil to update the hardware
DVFS state.

The basis is the CPU runqueue's 'running' metric, which per the above is the
frequency invariant utilization estimate of the CPU. From this we compute a
desired frequency like:

	     max( running, util_est );	if UTIL_EST
  u_cfs := { running;			otherwise

	       clamp( u_cfs + u_rt , u_min, u_max );	if UCLAMP_TASK
  u_clamp := { u_cfs + u_rt;				otherwise

  u := u_clamp + u_irq + u_dl;		[approx. see source for more detail]

  f_des := min( f_max, 1.25 u * f_max )

XXX IO-wait: when the update is due to a task wakeup from IO-completion we
boost 'u' above.

This frequency is then used to select a P-state/OPP or directly munged into a
CPPC style request to the hardware.

XXX: deadline tasks (Sporadic Task Model) allow us to calculate a hard f_min
required to satisfy the workload.

Because these callbacks come directly from the scheduler, the DVFS hardware
interaction should be 'fast' and non-blocking. Schedutil supports
rate-limiting DVFS requests for when hardware interaction is slow and
expensive, though this reduces effectiveness.

For more information see: kernel/sched/cpufreq_schedutil.c
NOTES
-----

 - On low-load scenarios, where DVFS is most relevant, the 'running'
   numbers will closely reflect utilization.

 - In saturated scenarios task movement will cause some transient dips,
   suppose we have a CPU saturated with 4 tasks, then when we migrate a task
   to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
   new CPU will gain 0.25. This is inevitable and time progression will
   correct this. XXX do we still guarantee f_max due to no idle-time?

 - Much of the above is about avoiding DVFS dips, and independent DVFS
   domains having to re-learn / ramp-up when load shifts.

arch/Kconfig

Lines changed: 9 additions & 0 deletions
@@ -1058,6 +1058,15 @@ config HAVE_STATIC_CALL_INLINE
 	bool
 	depends on HAVE_STATIC_CALL
 
+config HAVE_PREEMPT_DYNAMIC
+	bool
+	depends on HAVE_STATIC_CALL
+	depends on GENERIC_ENTRY
+	help
+	   Select this if the architecture support boot time preempt setting
+	   on top of static calls. It is strongly advised to support inline
+	   static call to avoid any overhead.
+
 config ARCH_WANT_LD_ORPHAN_WARN
 	bool
 	help

arch/powerpc/platforms/cell/spufs/sched.c

Lines changed: 1 addition & 1 deletion
@@ -72,7 +72,7 @@ static struct timer_list spuloadavg_timer;
 #define DEF_SPU_TIMESLICE	(100 * HZ / (1000 * SPUSCHED_TICK))
 
 #define SCALE_PRIO(x, prio) \
-	max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_SPU_TIMESLICE)
+	max(x * (MAX_PRIO - prio) / (NICE_WIDTH / 2), MIN_SPU_TIMESLICE)
 
 /*
  * scale user-nice values [ -20 ... 0 ... 19 ] to time slice values:

arch/x86/Kconfig

Lines changed: 1 addition & 0 deletions
@@ -224,6 +224,7 @@ config X86
 	select HAVE_STACK_VALIDATION		if X86_64
 	select HAVE_STATIC_CALL
 	select HAVE_STATIC_CALL_INLINE		if HAVE_STACK_VALIDATION
+	select HAVE_PREEMPT_DYNAMIC
 	select HAVE_RSEQ
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UNSTABLE_SCHED_CLOCK

arch/x86/include/asm/preempt.h

Lines changed: 39 additions & 9 deletions
@@ -5,6 +5,7 @@
 #include <asm/rmwcc.h>
 #include <asm/percpu.h>
 #include <linux/thread_info.h>
+#include <linux/static_call_types.h>
 
 DECLARE_PER_CPU(int, __preempt_count);
 
@@ -103,16 +104,45 @@ static __always_inline bool should_resched(int preempt_offset)
 }
 
 #ifdef CONFIG_PREEMPTION
-extern asmlinkage void preempt_schedule_thunk(void);
-# define __preempt_schedule() \
-	asm volatile ("call preempt_schedule_thunk" : ASM_CALL_CONSTRAINT)
 
-extern asmlinkage void preempt_schedule(void);
-extern asmlinkage void preempt_schedule_notrace_thunk(void);
-# define __preempt_schedule_notrace() \
-	asm volatile ("call preempt_schedule_notrace_thunk" : ASM_CALL_CONSTRAINT)
+extern asmlinkage void preempt_schedule(void);
+extern asmlinkage void preempt_schedule_thunk(void);
 
-extern asmlinkage void preempt_schedule_notrace(void);
-#endif
+#define __preempt_schedule_func preempt_schedule_thunk
+
+extern asmlinkage void preempt_schedule_notrace(void);
+extern asmlinkage void preempt_schedule_notrace_thunk(void);
+
+#define __preempt_schedule_notrace_func preempt_schedule_notrace_thunk
+
+#ifdef CONFIG_PREEMPT_DYNAMIC
+
+DECLARE_STATIC_CALL(preempt_schedule, __preempt_schedule_func);
+
+#define __preempt_schedule() \
+do { \
+	__STATIC_CALL_MOD_ADDRESSABLE(preempt_schedule); \
+	asm volatile ("call " STATIC_CALL_TRAMP_STR(preempt_schedule) : ASM_CALL_CONSTRAINT); \
+} while (0)
+
+DECLARE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func);
+
+#define __preempt_schedule_notrace() \
+do { \
+	__STATIC_CALL_MOD_ADDRESSABLE(preempt_schedule_notrace); \
+	asm volatile ("call " STATIC_CALL_TRAMP_STR(preempt_schedule_notrace) : ASM_CALL_CONSTRAINT); \
+} while (0)
+
+#else /* PREEMPT_DYNAMIC */
+
+#define __preempt_schedule() \
+	asm volatile ("call preempt_schedule_thunk" : ASM_CALL_CONSTRAINT);
+
+#define __preempt_schedule_notrace() \
+	asm volatile ("call preempt_schedule_notrace_thunk" : ASM_CALL_CONSTRAINT);
+
+#endif /* PREEMPT_DYNAMIC */
+
+#endif /* PREEMPTION */
 
 #endif /* __ASM_PREEMPT_H */

arch/x86/include/asm/static_call.h

Lines changed: 7 additions & 0 deletions
@@ -37,4 +37,11 @@
 #define ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)			\
 	__ARCH_DEFINE_STATIC_CALL_TRAMP(name, "ret; nop; nop; nop; nop")
 
+
+#define ARCH_ADD_TRAMP_KEY(name)					\
+	asm(".pushsection .static_call_tramp_key, \"a\"		\n"	\
+	    ".long " STATIC_CALL_TRAMP_STR(name) " - .		\n"	\
+	    ".long " STATIC_CALL_KEY_STR(name) " - .		\n"	\
+	    ".popsection					\n")
+
 #endif /* _ASM_STATIC_CALL_H */

arch/x86/kernel/static_call.c

Lines changed: 15 additions & 2 deletions
@@ -11,14 +11,26 @@ enum insn_type {
 	RET = 3,  /* tramp / site cond-tail-call */
 };
 
+/*
+ * data16 data16 xorq %rax, %rax - a single 5 byte instruction that clears %rax
+ * The REX.W cancels the effect of any data16.
+ */
+static const u8 xor5rax[] = { 0x66, 0x66, 0x48, 0x31, 0xc0 };
+
 static void __ref __static_call_transform(void *insn, enum insn_type type, void *func)
 {
+	const void *emulate = NULL;
 	int size = CALL_INSN_SIZE;
 	const void *code;
 
 	switch (type) {
 	case CALL:
 		code = text_gen_insn(CALL_INSN_OPCODE, insn, func);
+		if (func == &__static_call_return0) {
+			emulate = code;
+			code = &xor5rax;
+		}
+
 		break;
 
 	case NOP:
@@ -41,7 +53,7 @@ static void __ref __static_call_transform(void *insn, enum insn_type type, void
 	if (unlikely(system_state == SYSTEM_BOOTING))
 		return text_poke_early(insn, code, size);
 
-	text_poke_bp(insn, code, size, NULL);
+	text_poke_bp(insn, code, size, emulate);
 }
 
 static void __static_call_validate(void *insn, bool tail)
@@ -54,7 +66,8 @@ static void __static_call_validate(void *insn, bool tail)
 		return;
 	} else {
 		if (opcode == CALL_INSN_OPCODE ||
-		    !memcmp(insn, ideal_nops[NOP_ATOMIC5], 5))
+		    !memcmp(insn, ideal_nops[NOP_ATOMIC5], 5) ||
+		    !memcmp(insn, xor5rax, 5))
 			return;
 	}

arch/x86/kvm/x86.c

Lines changed: 1 addition & 0 deletions
@@ -1782,6 +1782,7 @@ EXPORT_SYMBOL_GPL(kvm_emulate_wrmsr);
 
 bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu)
 {
+	xfer_to_guest_mode_prepare();
 	return vcpu->mode == EXITING_GUEST_MODE || kvm_request_pending(vcpu) ||
 		xfer_to_guest_mode_work_pending();
 }
