
Commit

Changes representative of linux-4.18.0-147.3.1.el8_1.tar.xz
da-x committed Nov 27, 2019
1 parent 68f1989 commit d5a3b74
Showing 49 changed files with 1,163 additions and 340 deletions.
3 changes: 3 additions & 0 deletions Documentation/admin-guide/kernel-parameters.txt
@@ -678,6 +678,9 @@
cpuidle.off=1 [CPU_IDLE]
disable the cpuidle sub-system

cpuidle.governor=
[CPU_IDLE] Name of the cpuidle governor to use.

cpufreq.off=1 [CPU_FREQ]
disable the cpufreq sub-system

74 changes: 60 additions & 14 deletions Documentation/scheduler/sched-bwc.txt
@@ -8,15 +8,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
specification of the maximum CPU bandwidth available to a group or hierarchy.

The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time. When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
-
-A group's unused runtime is globally tracked, being refreshed with quota units
-above at each period boundary. As threads consume this bandwidth it is
-transferred to cpu-local "silos" on a demand basis. The amount transferred
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time. That quota is assigned to per-cpu run queues in
+slices as threads in the cgroup become runnable. Once all quota has been
+assigned any additional requests for quota will result in those threads being
+throttled. Throttled threads will not be able to run again until the next
+period when the quota is replenished.
+
+A group's unassigned quota is globally tracked, being refreshed back to
+cfs_quota units at each period boundary. As threads consume this bandwidth it
+is transferred to cpu-local "silos" on a demand basis. The amount transferred
within each of these updates is tunable and described as the "slice".
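
To make this pool-and-silo flow concrete, here is a toy userspace model of the
slice transfer (illustrative only; this is not the kernel's accounting, and all
names and values are invented):

#include <stdio.h>

struct pool { long runtime_us; };

/* A cpu asks the global pool for one slice; it may get less when the
 * pool is nearly empty, and nothing once the quota is exhausted. */
static long take_slice(struct pool *global, long slice_us)
{
        long granted = global->runtime_us < slice_us ?
                       global->runtime_us : slice_us;

        global->runtime_us -= granted;
        return granted; /* 0 => the requesting threads are throttled */
}

int main(void)
{
        struct pool global = { .runtime_us = 20000 };   /* quota: 20ms */
        long slice_us = 5000;                           /* slice: 5ms */

        for (int cpu = 0; cpu < 5; cpu++)
                printf("cpu%d granted %ld us\n", cpu,
                       take_slice(&global, slice_us));
        /* cpus 0-3 each receive a 5ms slice; cpu4 receives 0 and its
         * threads would be throttled until the pool is refilled to the
         * full quota at the next period boundary. */
        return 0;
}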

Management
@@ -33,12 +34,12 @@ The default values are:

A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
bandwidth restriction in place; such a group is described as an unconstrained
bandwidth group. This represents the traditional work-conserving behavior for
CFS.

Writing any (valid) positive value(s) will enact the specified bandwidth limit.
The minimum value allowed for the quota or period is 1ms. There is also an
upper bound on the period length of 1s. Additional restrictions exist when
bandwidth limits are used in a hierarchical fashion; these are explained in
more detail below.
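
As a concrete illustration of this interface, the hedged sketch below grants a
group two CPUs' worth of bandwidth by writing quota = 2 * period (the cgroup-v1
mount point and the group name "mygroup" are assumptions):

#include <stdio.h>

static int write_val(const char *path, long val)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        fprintf(f, "%ld\n", val);
        return fclose(f);
}

int main(void)
{
        long period_us = 100000;        /* 100ms, within the 1ms..1s bounds */

        /* quota = 2 * period => the group may consume 2 CPUs' worth. */
        write_val("/sys/fs/cgroup/cpu/mygroup/cpu.cfs_period_us", period_us);
        write_val("/sys/fs/cgroup/cpu/mygroup/cpu.cfs_quota_us", 2 * period_us);
        return 0;
}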

@@ -51,8 +52,8 @@ unthrottled if it is in a constrained state.
System wide settings
--------------------
For efficiency, run-time is transferred between the global pool and CPU local
"silos" in a batch fashion. This greatly reduces global accounting pressure
on large systems. The amount transferred each time such an update is required
is described as the "slice".

This is tunable via procfs:
@@ -90,6 +91,51 @@ There are two ways in which a group may become throttled:
In case b) above, even though the child may have runtime remaining, it will
not be allowed to run until the parent's runtime is refreshed.

CFS Bandwidth Quota Caveats
---------------------------
Once a slice is assigned to a cpu it does not expire. However, all but 1ms of
the slice may be returned to the global pool if all threads on that cpu become
unrunnable. This is configured at compile time by the min_cfs_rq_runtime
variable. This is a performance tweak that helps prevent added contention on
the global lock.

The fact that cpu-local slices do not expire results in some interesting corner
cases that should be understood.

For cgroup applications that are cpu-bound this is a relatively moot point
because they will naturally consume the entirety of their quota as well as the
entirety of each cpu-local slice in each period. As a result, it is expected
that nr_periods will roughly equal nr_throttled, and that cpuacct.usage will
increase by roughly cfs_quota_us in each period.
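
Those counters can be inspected directly through the group's cpu.stat file. A
hedged sketch that dumps it (the cgroup-v1 mount point and the group name
"mygroup" are assumptions):

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/sys/fs/cgroup/cpu/mygroup/cpu.stat", "r");
        char line[128];

        if (!f) {
                perror("cpu.stat");
                return 1;
        }
        while (fgets(line, sizeof(line), f))
                fputs(line, stdout);    /* nr_periods, nr_throttled, ... */
        fclose(f);
        return 0;
}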

For highly-threaded, non-cpu-bound applications this non-expiration nuance
allows applications to briefly burst past their quota limits by the amount of
unused slice on each cpu that the task group is running on (typically at most
1ms per cpu or as defined by min_cfs_rq_runtime). This slight burst only
applies if quota had been assigned to a cpu and then not fully used or returned
in previous periods. This burst amount will not be transferred between cores.
As a result, this mechanism still strictly limits the task group to the quota
average usage, albeit over a longer time window than a single period. This
also limits the burst ability to no more than 1ms per cpu. This provides a
better, more predictable user experience for highly threaded applications with
small quota limits on high core count machines. It also eliminates the
propensity to throttle these applications while simultaneously using less than
quota amounts of cpu. Another way to say this is that by allowing the unused
portion of a slice to remain valid across periods we have decreased the
possibility of wastefully expiring quota on cpu-local silos that don't need a
full slice's amount of cpu time.
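
For concreteness, a toy calculation of the worst-case single-period usage
implied above (the cpu count and quota are made-up values):

#include <stdio.h>

int main(void)
{
        const long nr_cpus = 8;                  /* cpus the group has run on */
        const long min_cfs_rq_runtime_us = 1000; /* 1ms kept per cpu silo */
        const long quota_us = 10000;             /* cfs_quota_us = 10ms */

        /* Worst case, every cpu-local silo still holds 1ms of unused
         * runtime from a previous period, so one period can observe up
         * to quota + nr_cpus * 1ms of usage; later periods make it up. */
        long burst_us = nr_cpus * min_cfs_rq_runtime_us;

        printf("max one-period usage: %ld us (quota %ld + burst %ld)\n",
               quota_us + burst_us, quota_us, burst_us);
        return 0;
}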

The interaction between cpu-bound and non-cpu-bound, interactive applications
should also be considered, especially when single-core usage hits 100%. If you
gave each of these applications half of a cpu-core and they both got scheduled
on the same CPU, it is theoretically possible that the non-cpu-bound
application will use up to 1ms of additional quota in some periods, thereby
preventing the cpu-bound application from fully using its quota by that same
amount. In these instances it will be up to the CFS algorithm (see
sched-design-CFS.txt) to decide which application is chosen to run, as they
will both be runnable and have remaining quota. This runtime discrepancy will
be made up in the following periods when the interactive application idles.

Examples
--------
1. Limit a group to 1 CPU worth of runtime.
78 changes: 78 additions & 0 deletions Documentation/virtual/guest-halt-polling.txt
@@ -0,0 +1,78 @@
Guest halt polling
==================

The cpuidle_haltpoll driver, with the haltpoll governor, allows
the guest vcpus to poll for a specified amount of time before
halting. This provides the following benefits over host-side
polling:

1) The POLL flag is set while polling is performed, which allows
a remote vCPU to avoid sending an IPI (and the associated
cost of handling the IPI) when performing a wakeup.

2) The VM-exit cost can be avoided.

The downside of guest-side polling is that polling is performed
even when there are other runnable tasks in the host.

The basic logic is as follows: a global value, guest_halt_poll_ns,
is configured by the user, indicating the maximum amount of
time polling is allowed. This value is fixed.

Each vcpu has an adjustable guest_halt_poll_ns
("per-cpu guest_halt_poll_ns"), which is adjusted by the algorithm
in response to events (explained below).

Module Parameters
=================

The haltpoll governor has 5 tunable module parameters:

1) guest_halt_poll_ns:
Maximum amount of time, in nanoseconds, that polling is
performed before halting.

Default: 200000

2) guest_halt_poll_shrink:
Division factor used to shrink per-cpu guest_halt_poll_ns when a
wakeup event occurs after the global guest_halt_poll_ns.

Default: 2

3) guest_halt_poll_grow:
Multiplication factor used to grow per-cpu guest_halt_poll_ns
when an event occurs after per-cpu guest_halt_poll_ns
but before global guest_halt_poll_ns.

Default: 2

4) guest_halt_poll_grow_start:
The per-cpu guest_halt_poll_ns eventually reaches zero
in case of an idle system. This value sets the initial
per-cpu guest_halt_poll_ns when growing. This can
be increased from 10000 to avoid misses during the initial
growth stage:

10k, 20k, 40k, ... (example assumes guest_halt_poll_grow=2).

Default: 50000

5) guest_halt_poll_allow_shrink:

Bool parameter which allows shrinking. Set to N
to avoid it (per-cpu guest_halt_poll_ns will remain
high once it reaches the global guest_halt_poll_ns value).

Default: Y
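
Taken together, the parameters above describe a per-vcpu feedback loop; the
governor itself can be selected at boot with cpuidle.governor=haltpoll (see the
kernel-parameters change above). A minimal C sketch of the adjustment policy,
assuming the default values (illustrative only, not the kernel's actual code):

#include <stdio.h>

/* Tunables mirroring the module parameters above (defaults shown). */
static const unsigned long global_halt_poll_ns = 200000;
static const unsigned long grow = 2, shrink = 2, grow_start = 50000;
static const int allow_shrink = 1;

static unsigned long adjust_poll_ns(unsigned long per_cpu_ns,
                                    unsigned long block_ns)
{
        if (block_ns <= per_cpu_ns) {
                /* Wakeup arrived inside the poll window: no adjustment. */
        } else if (block_ns <= global_halt_poll_ns) {
                /* Wakeup after the per-cpu window but before the global
                 * limit: grow, starting from grow_start when at zero. */
                per_cpu_ns *= grow;
                if (per_cpu_ns < grow_start)
                        per_cpu_ns = grow_start;
                if (per_cpu_ns > global_halt_poll_ns)
                        per_cpu_ns = global_halt_poll_ns;
        } else if (allow_shrink) {
                /* Wakeup long after the global limit: the poll time was
                 * wasted, so shrink the window. */
                per_cpu_ns /= shrink;
        }
        return per_cpu_ns;
}

int main(void)
{
        unsigned long ns = 0;

        ns = adjust_poll_ns(ns, 100000);  /* woke within global: grow */
        printf("after short halt: %lu ns\n", ns);   /* 50000 */
        ns = adjust_poll_ns(ns, 1000000); /* woke far too late: shrink */
        printf("after long halt:  %lu ns\n", ns);   /* 25000 */
        return 0;
}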

The module parameters can be set from the sysfs files in:

/sys/module/haltpoll/parameters/

Further Notes
=============

- Care should be taken when setting the guest_halt_poll_ns parameter, as a
large value has the potential to drive cpu usage to 100% on a machine that
would be almost entirely idle otherwise.
9 changes: 9 additions & 0 deletions Documentation/virtual/kvm/msr.txt
@@ -273,3 +273,12 @@ MSR_KVM_EOI_EN: 0x4b564d04
guest must both read the least significant bit in the memory area and
clear it using a single CPU instruction, such as test and clear, or
compare and exchange.

MSR_KVM_POLL_CONTROL: 0x4b564d05
Control host-side polling.

data: Bit 0 enables (1) or disables (0) host-side HLT polling logic.

KVM guests can request the host not to poll on HLT, for example if
they are performing polling themselves.
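
As an illustration, a guest that polls on its own (e.g. with the haltpoll
governor above) might clear bit 0 as follows. This is a hedged sketch: the
wrmsr64 helper is invented here, and in-kernel guest code would use its own
MSR accessors such as wrmsrl():

#include <stdint.h>

#define MSR_KVM_POLL_CONTROL 0x4b564d05

/* wrmsr is a privileged x86 instruction; a guest kernel runs this at CPL0. */
static inline void wrmsr64(uint32_t msr, uint64_t val)
{
        uint32_t lo = (uint32_t)val, hi = (uint32_t)(val >> 32);

        __asm__ volatile("wrmsr" :: "c"(msr), "a"(lo), "d"(hi));
}

/* Ask the host to skip HLT polling because the guest polls itself. */
static void kvm_disable_host_poll(void)
{
        wrmsr64(MSR_KVM_POLL_CONTROL, 0);       /* bit 0 = 0: polling off */
}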

2 changes: 1 addition & 1 deletion Makefile.rhelver
@@ -12,7 +12,7 @@ RHEL_MINOR = 1
#
# Use this spot to avoid future merge conflicts.
# Do not trim this comment.
-RHEL_RELEASE = 147.0.3
+RHEL_RELEASE = 147.3.1

#
# Early y+1 numbering
32 changes: 15 additions & 17 deletions arch/arm64/kernel/process.c
@@ -285,22 +285,27 @@ void arch_release_task_struct(struct task_struct *tsk)
        fpsimd_release_task(tsk);
}

-/*
- * src and dst may temporarily have aliased sve_state after task_struct
- * is copied. We cannot fix this properly here, because src may have
- * live SVE state and dst's thread_info may not exist yet, so tweaking
- * either src's or dst's TIF_SVE is not safe.
- *
- * The unaliasing is done in copy_thread() instead. This works because
- * dst is not schedulable or traceable until both of these functions
- * have been called.
- */
int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
{
        if (current->mm)
                fpsimd_preserve_current_state();
        *dst = *src;
+
+       /* We rely on the above assignment to initialize dst's thread_flags: */
+       BUILD_BUG_ON(!IS_ENABLED(CONFIG_THREAD_INFO_IN_TASK));
+
+       /*
+        * Detach src's sve_state (if any) from dst so that it does not
+        * get erroneously used or freed prematurely. dst's sve_state
+        * will be allocated on demand later on if dst uses SVE.
+        * For consistency, also clear TIF_SVE here: this could be done
+        * later in copy_process(), but to avoid tripping up future
+        * maintainers it is best not to leave TIF_SVE and sve_state in
+        * an inconsistent state, even temporarily.
+        */
+       dst->thread.sve_state = NULL;
+       clear_tsk_thread_flag(dst, TIF_SVE);
+
        return 0;
}

@@ -313,13 +318,6 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,

        memset(&p->thread.cpu_context, 0, sizeof(struct cpu_context));

-       /*
-        * Unalias p->thread.sve_state (if any) from the parent task
-        * and disable discard SVE state for p:
-        */
-       clear_tsk_thread_flag(p, TIF_SVE);
-       p->thread.sve_state = NULL;
-
        /*
         * In case p was allocated the same task_struct pointer as some
         * other recently-exited task, make sure p is disassociated from
26 changes: 26 additions & 0 deletions arch/powerpc/include/asm/drmem.h
@@ -17,6 +17,9 @@ struct drmem_lmb {
        u32     drc_index;
        u32     aa_index;
        u32     flags;
+#ifdef CONFIG_MEMORY_HOTPLUG
+       int     nid;
+#endif
};

struct drmem_lmb_info {
@@ -99,4 +102,27 @@ void __init walk_drmem_lmbs_early(unsigned long node,
                        void (*func)(struct drmem_lmb *, const __be32 **));
#endif

+static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb)
+{
+       lmb->aa_index = 0xffffffff;
+}
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static inline void lmb_set_nid(struct drmem_lmb *lmb)
+{
+       lmb->nid = memory_add_physaddr_to_nid(lmb->base_addr);
+}
+static inline void lmb_clear_nid(struct drmem_lmb *lmb)
+{
+       lmb->nid = -1;
+}
+#else
+static inline void lmb_set_nid(struct drmem_lmb *lmb)
+{
+}
+static inline void lmb_clear_nid(struct drmem_lmb *lmb)
+{
+}
+#endif
+
#endif /* _ASM_POWERPC_LMB_H */
4 changes: 2 additions & 2 deletions arch/powerpc/kernel/misc_64.S
@@ -134,7 +134,7 @@ _GLOBAL_TOC(flush_dcache_range)
        subf    r8,r6,r4                /* compute length */
        add     r8,r8,r5                /* ensure we get enough */
        lwz     r9,DCACHEL1LOGBLOCKSIZE(r10)    /* Get log-2 of dcache block size */
-       srw.    r8,r8,r9                /* compute line count */
+       srd.    r8,r8,r9                /* compute line count */
        beqlr                           /* nothing to do? */
        mtctr   r8
0:      dcbst   0,r6
@@ -152,7 +152,7 @@ _GLOBAL(flush_inval_dcache_range)
        subf    r8,r6,r4                /* compute length */
        add     r8,r8,r5                /* ensure we get enough */
        lwz     r9,DCACHEL1LOGBLOCKSIZE(r10)    /* Get log-2 of dcache block size */
-       srw.    r8,r8,r9                /* compute line count */
+       srd.    r8,r8,r9                /* compute line count */
        beqlr                           /* nothing to do? */
        sync
        isync
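
The single-instruction change above is subtle: srw is a 32-bit shift, so a
flush length of 4GiB or more loses its upper bits when the cache-line count is
computed, while srd shifts the full 64-bit register. A rough userspace
illustration of the difference (not part of the commit; the 128-byte line size
is a made-up example):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t len = 5ULL << 30;      /* a 5 GiB flush range */
        unsigned int log_block = 7;     /* log2 of a 128-byte dcache line */

        /* srw-like: only the low 32 bits of the length are shifted. */
        uint64_t lines_srw = (uint64_t)(uint32_t)len >> log_block;
        /* srd-like: the full 64-bit length is shifted. */
        uint64_t lines_srd = len >> log_block;

        printf("srw-style count: %llu lines\n",
               (unsigned long long)lines_srw);
        printf("srd-style count: %llu lines\n",
               (unsigned long long)lines_srd);
        return 0;
}
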
2 changes: 2 additions & 0 deletions arch/powerpc/kernel/rtas.c
@@ -20,6 +20,7 @@
#include <linux/capability.h>
#include <linux/delay.h>
#include <linux/cpu.h>
+#include <linux/sched.h>
#include <linux/smp.h>
#include <linux/completion.h>
#include <linux/cpumask.h>
@@ -902,6 +903,7 @@ static int rtas_cpu_state_change_mask(enum rtas_cpu_state state,
                                cpumask_clear_cpu(cpu, cpus);
                        }
                }
+               cond_resched();
        }

        return ret;
6 changes: 5 additions & 1 deletion arch/powerpc/mm/drmem.c
@@ -366,8 +366,10 @@ static void __init init_drmem_v1_lmbs(const __be32 *prop)
        if (!drmem_info->lmbs)
                return;

-       for_each_drmem_lmb(lmb)
+       for_each_drmem_lmb(lmb) {
                read_drconf_v1_cell(lmb, &prop);
+               lmb_set_nid(lmb);
+       }
}

static void __init init_drmem_v2_lmbs(const __be32 *prop)
@@ -412,6 +414,8 @@

                        lmb->aa_index = dr_cell.aa_index;
                        lmb->flags = dr_cell.flags;
+
+                       lmb_set_nid(lmb);
                }
        }
}