sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies
Peter Portante reported that for large cgroup hierarchies (and/or on
large CPU counts) we get immense lock contention on rq->lock and stuff
stops working properly.

His workload was a ton of processes, each in their own cgroup,
everybody idling except for a sporadic wakeup once every so often.

It was found that:

  schedule()
    idle_balance()
      load_balance()
        local_irq_save()
        double_rq_lock()
        update_h_load()
          walk_tg_tree(tg_load_down)
            tg_load_down()

Results in an entire cgroup hierarchy walk under rq->lock for every
new-idle balance, and since new-idle balance isn't throttled, this
results in a lot of work while holding the rq->lock.
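For reference, the per-group callback that walk_tg_tree(tg_load_down)
runs looked roughly like this at the time (reconstructed from kernels
of this vintage, so treat it as a sketch rather than the exact
source): each step derives a group's hierarchical load from its
parent's, which is why the walk has to visit every task_group in the
hierarchy:

static int tg_load_down(struct task_group *tg, void *data)
{
	unsigned long load;
	long cpu = (long)data;

	if (!tg->parent) {
		/* root group: hierarchical load is the rq's own load */
		load = cpu_rq(cpu)->load.weight;
	} else {
		/* scale the parent's h_load by this group's share of it */
		load = tg->parent->cfs_rq[cpu]->h_load;
		load *= tg->se[cpu]->load.weight;
		load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
	}

	tg->cfs_rq[cpu]->h_load = load;

	return 0;
}

With thousands of groups, doing this walk on every new-idle balance
while holding rq->lock is what produced the contention.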

This patch does two things: first, it removes the work from under
rq->lock, based on the good principle of race-and-pray which is widely
employed in the load-balancer as a whole; and secondly, it throttles
the update_h_load() calculation to at most once per jiffy.
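The throttle is just a tick-stamp comparison, as the fair.c hunk below
shows. A minimal user-space analogue of the pattern, assuming a
millisecond tick in place of jiffies (throttle_stamp, now_ticks() and
expensive_update() are illustrative names, not kernel API):

#include <stdio.h>
#include <time.h>

static unsigned long throttle_stamp;	/* analogue of rq->h_load_throttle */

/* analogue of jiffies: a coarse monotonic tick, here 1 ms */
static unsigned long now_ticks(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000UL + ts.tv_nsec / 1000000UL;
}

static void expensive_update(void)
{
	unsigned long now = now_ticks();

	if (throttle_stamp == now)	/* already ran during this tick */
		return;
	throttle_stamp = now;

	puts("recomputing");		/* stands in for the hierarchy walk */
}

int main(void)
{
	/* hammer the update path; the body fires at most once per tick */
	for (int i = 0; i < 1000000; i++)
		expensive_update();
	return 0;
}

Note the same trade-off the patch makes: callers within one tick may
see data up to a tick stale, which the changelog accepts under
race-and-pray.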

I considered excluding update_h_load() for new-idle balance
altogether, but purely relying on regular balance passes to update
this data might not work out under some rare circumstances where the
new-idle busiest isn't the regular busiest for a while (unlikely, but
a nightmare to debug if someone hits it and suffers).

Cc: pjt@google.com
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Reported-by: Peter Portante <pportant@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-aaarrzfpnaam7pqrekofu8a6@git.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Peter Zijlstra authored and KAGA-KOKO committed Aug 13, 2012
1 parent b940313 commit a35b646
Showing 2 changed files with 14 additions and 3 deletions.
kernel/sched/fair.c: 9 additions & 2 deletions
@@ -3387,6 +3387,14 @@ static int tg_load_down(struct task_group *tg, void *data)
 
 static void update_h_load(long cpu)
 {
+	struct rq *rq = cpu_rq(cpu);
+	unsigned long now = jiffies;
+
+	if (rq->h_load_throttle == now)
+		return;
+
+	rq->h_load_throttle = now;
+
 	rcu_read_lock();
 	walk_tg_tree(tg_load_down, tg_nop, (void *)cpu);
 	rcu_read_unlock();
@@ -4293,11 +4301,10 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	env.src_rq = busiest;
 	env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
 
+	update_h_load(env.src_cpu);
 more_balance:
 	local_irq_save(flags);
 	double_rq_lock(this_rq, busiest);
-	if (!env.loop)
-		update_h_load(env.src_cpu);
 
 	/*
 	 * cur_ld_moved - load moved in current iteration
kernel/sched/sched.h: 5 additions & 1 deletion
@@ -374,7 +374,11 @@ struct rq {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this cpu: */
 	struct list_head leaf_cfs_rq_list;
-#endif
+#ifdef CONFIG_SMP
+	unsigned long h_load_throttle;
+#endif /* CONFIG_SMP */
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
 #ifdef CONFIG_RT_GROUP_SCHED
 	struct list_head leaf_rt_rq_list;
 #endif
