
[core] Hybrid scheduling policy. #14790

Merged
merged 22 commits into ray-project:master on Apr 1, 2021

Conversation


@wuisawesome wuisawesome commented Mar 18, 2021

Why are these changes needed?

This PR introduces a new scheduling policy that is a hybrid of a pack policy and a round-robin policy. Description from the docstring:

/// This scheduling policy was designed with the following assumptions in mind:
///   1. Scheduling a task on a new node incurs a cold start penalty (warming the worker
///   pool).
///   2. Past a certain utilization threshold, a big noisy neighbor problem occurs (caused
///   by object spilling).
///   3. Locality is helpful, but generally outweighed by (1) and (2).
///
/// In order to solve these problems, we use the following scheduling policy.
///   1. Generate a traversal.
///   2. Run a priority scheduler.
///
/// A node's priorities are determined by the following factors:
///   * Always skip infeasible nodes
///   * Always prefer available nodes over feasible nodes.
///   * Break ties in available/feasible by critical resource utilization.
///   * Critical resource utilization below a threshold should be truncated to 0.
///
/// The traversal order should:
///   * Prioritize the local node above all others.
///   * All other nodes should have a globally fixed priority across the cluster.
///
/// We call this a hybrid policy because below the threshold, the traversal and truncation
/// properties will lead to packing of nodes. Above the threshold, the policy will act
/// like a traditional weighted round robin.
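
To make the priority computation concrete, here is a minimal, self-contained C++ sketch of the idea. Everything in it (NodeScore, Rank, PickNode, spread_threshold) is illustrative only; the actual implementation added by this PR is structured differently.

// Illustrative sketch only; not the code added by this PR.
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

struct NodeScore {
  bool feasible = false;     // Could the node ever run the task?
  bool available = false;    // Can it run the task right now?
  float utilization = 0.0f;  // Critical resource utilization in [0, 1].
};

// Lower rank wins. Available nodes always outrank merely-feasible ones, and
// ties are broken by critical resource utilization, truncated to 0 below the
// spread threshold so lightly loaded nodes compare as equal (packing).
static float Rank(const NodeScore &s, float spread_threshold) {
  float util = s.utilization < spread_threshold ? 0.0f : s.utilization;
  return (s.available ? 0.0f : 2.0f) + util;
}

// Traverse the local node first, then the rest in a fixed order, keeping the
// best-ranked node. Returns -1 if the task is infeasible everywhere.
int64_t PickNode(const std::vector<NodeScore> &nodes, int64_t local_node,
                 float spread_threshold) {
  int64_t best = -1;
  float best_rank = std::numeric_limits<float>::infinity();
  std::vector<int64_t> order;
  order.push_back(local_node);
  for (int64_t i = 0; i < static_cast<int64_t>(nodes.size()); i++) {
    if (i != local_node) order.push_back(i);
  }
  for (int64_t id : order) {
    const NodeScore &s = nodes[static_cast<size_t>(id)];
    if (!s.feasible) continue;  // Always skip infeasible nodes.
    float r = Rank(s, spread_threshold);
    if (r < best_rank) {  // Strict '<' keeps earlier (packed) nodes on ties.
      best_rank = r;
      best = id;
    }
  }
  return best;
}

Below the threshold, all eligible nodes tie on the utilization term, so the traversal order wins and tasks pack onto the local/earlier nodes; above it, the utilization term spreads tasks toward less-loaded nodes.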

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ericl ericl added the @author-action-required label Mar 18, 2021
@rkooo567 rkooo567 self-assigned this Mar 18, 2021

ericl commented Mar 19, 2021 via email

@wuisawesome

Bin packing for GPU is always preferable since they are so expensive.

The GPU is in the cluster either way though right? Is this a heuristic to help with scaling down?


ericl commented Mar 19, 2021 via email

@ericl ericl added and removed the @author-action-required label Mar 19, 2021
@rkooo567

Is it still WIP btw?

@wuisawesome wuisawesome changed the title [WIP] Hybrid scheduling policy. [core] Hybrid scheduling policy. Mar 23, 2021
@wuisawesome wuisawesome changed the title [core] Hybrid scheduling policy. [WIP][core] Hybrid scheduling policy. Mar 23, 2021
@wuisawesome wuisawesome removed the @author-action-required label Mar 23, 2021
@rkooo567 rkooo567 requested a review from ericl March 23, 2021 05:08
@rkooo567 rkooo567 changed the title [WIP][core] Hybrid scheduling policy. [core] Hybrid scheduling policy. Mar 23, 2021
@rkooo567 rkooo567 left a comment

Can we add one Python test? I think we can make it something like this:

  1. A 3-node cluster (0-CPU head node).
  2. Each node has 100MB of object store memory and 2 CPUs.
  3. Make a task that returns a 55MB object.
  4. Create 3 tasks.
  5. Make sure all nodes have 55MB of object memory usage.

/// Whether to use the hybrid scheduling policy, or one of the legacy spillback
/// strategies. In the hybrid scheduling strategy, leases are packed until a threshold,
/// then spread via weighted (by critical resource usage).
RAY_CONFIG(bool, scheduler_hybrid_scheduling,
Contributor

What's happening if both this and scheduler_loadbalance_spillback are true?

Contributor

Looks like scheduler_loadbalance_spillback is ignored. Can you add a TODO comment to clean this up?

Contributor Author

This flag takes precedence over the other one.
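
For readers following along, here is a tiny self-contained illustration of that precedence (toy Config and Policy types, not the raylet code): when both flags are set, the hybrid flag wins and scheduler_loadbalance_spillback is effectively ignored.

// Toy illustration of flag precedence; not the actual raylet code.
#include <iostream>

struct Config {
  bool scheduler_hybrid_scheduling = true;
  bool scheduler_loadbalance_spillback = true;  // Ignored when hybrid is on.
};

enum class Policy { kHybrid, kLoadBalanceSpillback, kLegacyPack };

Policy SelectPolicy(const Config &cfg) {
  if (cfg.scheduler_hybrid_scheduling) {
    return Policy::kHybrid;  // Checked first, so it takes precedence.
  }
  if (cfg.scheduler_loadbalance_spillback) {
    return Policy::kLoadBalanceSpillback;
  }
  return Policy::kLegacyPack;
}

int main() {
  std::cout << (SelectPolicy(Config{}) == Policy::kHybrid) << "\n";  // Prints 1.
}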

///
/// \return -1 if the task is infeasible, otherwise the node id (key in `nodes`) to
/// schedule on.
int64_t HybridPolicy(const TaskRequest &task_request, const int64_t local_node_id,
Contributor

Can you add a TODO to move other policies to this file? (load balancing & simple bin packing)

Contributor Author

Oh I was actually thinking we would delete those, but sure

float highest = 0;
for (const auto &i : {CPU, MEM, OBJECT_STORE_MEM}) {
if (i >= this->predefined_resources.size()) {
continue;
Contributor

Does this ever happen? Why don't we just add a check here?

Contributor Author

Yeah, this actually happens. The check was inspired by real events :p (most of the unit tests actually trigger it).

Contributor

That's interesting... Why is that? I thought the first N entries are reserved for predefined resources..

Contributor Author

Yeah but the original data structure is actually a vector, and it's not dynamically resized at initialization time. This means that for some time, a task req of {"CPU": 1} will have predefined_resources.size() == 1 until some code comes along and resizes the vector.

Contributor

That's a bit of a funky data model... Given that predefined resources are static by definition, wouldn't it be a lot cleaner if predefined_resources was a static array whose elements may be unset? That would still allow for enum-based indexing (which I like), but would get rid of all of the dynamic resizing and size checks, and should make it far less brittle to hard-to-catch writer bugs around improper resizing. If that might make sense, this would obviously be a refactor that can wait for a future PR.
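
A rough sketch of what that data model could look like (type, field, and enum names here are illustrative; this is not part of the PR):

// Illustrative sketch of the suggested refactor: a fixed-size, enum-indexed
// array whose entries may be unset, so no dynamic resizing or size checks.
#include <array>
#include <optional>

enum PredefinedResources { CPU = 0, MEM, GPU, OBJECT_STORE_MEM, PredefinedResources_MAX };

struct ResourceCapacity {
  double total = 0;
  double available = 0;
};

struct NodeResourcesSketch {
  // An unset entry means the node never reported that resource.
  std::array<std::optional<ResourceCapacity>, PredefinedResources_MAX> predefined_resources;

  double Available(PredefinedResources r) const {
    const auto &entry = predefined_resources[r];
    return entry ? entry->available : 0.0;  // No size check needed.
  }
};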

highest = utilization;
}
}
return highest;
Contributor

Maybe just use highest = std::max(highest, utilization)?

@@ -164,6 +164,88 @@ NodeResources ResourceMapToNodeResources(
return node_resources;
}

float NodeResources::CalculateCriticalResourceUtilization() const {
@rkooo567 rkooo567 Mar 23, 2021

Can you write unit tests for these 3 new functions?
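
For reference, a unit test for the utilization calculation could look roughly like this (gtest style; CreateNodeResources is a hypothetical test helper, and the tests actually added to the PR may be structured differently):

// Hypothetical sketch, not the actual test in this PR.
#include "gtest/gtest.h"

TEST(NodeResourcesTest, CriticalResourceUtilization) {
  // 8 CPUs with 4 available (50% used) and 100 units of object store memory
  // with 25 available (75% used): the critical utilization is the max, 0.75.
  NodeResources node = CreateNodeResources(/*total_cpu=*/8, /*available_cpu=*/4,
                                           /*total_object_store=*/100,
                                           /*available_object_store=*/25);
  EXPECT_FLOAT_EQ(node.CalculateCriticalResourceUtilization(), 0.75);
}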

bool NodeResources::IsAvailable(const TaskRequest &task_req) const {
// First, check predefined resources.
for (size_t i = 0; i < PredefinedResources_MAX; i++) {
if (i >= this->predefined_resources.size()) {
Contributor

Same here. Is this condition ever invoked? Why don't we just add a check?

}

bool NodeResources::IsFeasible(const TaskRequest &task_req) const {
// First, check predefined resources.
Contributor

This looks almost identical to IsAvailable except that it uses total instead of available. Any good way to reduce code duplication?

Contributor

Could add a shared private IsWithinCapacity helper that also takes a lambda like [](const ResourceCapacity &capacity) { return capacity.available; }), but that would probably sacrifice a bit of readability. Definitely shouldn't block the PR IMO.

bool NodeResources::IsAvailable(const TaskRequest &task_req) const {
  return IsWithinCapacity(task_req, [](const ResourceCapacity &capacity) { return capacity.available; });
}

bool NodeResources::IsFeasible(const TaskRequest &task_req) const {
  return IsWithinCapacity(task_req, [](const ResourceCapacity &capacity) { return capacity.total; });
}

bool NodeResources::IsWithinCapacity(
    const TaskRequest &task_req,
    std::function<FixedPoint(const ResourceCapacity &)> get_capacity) const {
  // First, check predefined resources.
  for (size_t i = 0; i < PredefinedResources_MAX; i++) {
    if (i >= this->predefined_resources.size()) {
      if (task_req.predefined_resources[i].demand != 0) {
        return false;
      }
      continue;
    }

    const auto &resource = get_capacity(this->predefined_resources[i]);
    const auto &demand = task_req.predefined_resources[i].demand;
    bool is_soft = task_req.predefined_resources[i].soft;

    if (resource < demand && !is_soft) {
      return false;
    }
  }

  // Now check custom resources. Soft requests never make a node ineligible.
  for (const auto &task_req_custom_resource : task_req.custom_resources) {
    if (task_req_custom_resource.soft) {
      continue;
    }
    auto it = this->custom_resources.find(task_req_custom_resource.id);
    if (it == this->custom_resources.end() ||
        task_req_custom_resource.demand > get_capacity(it->second)) {
      return false;
    }
  }
  return true;
}

float CalculateCriticalResourceUtilization() const;
/// Returns true if the node has the available resources to run the task.
/// Note: This doesn't account for the binpacking of unit resources.
bool IsAvailable(const TaskRequest &task_req) const;
Contributor

Do we still need IsFeasible in the cluster_resource_scheduler after this? (except for the legacy logic)

Contributor Author

I think it's only used by the legacy scheduler.

Contributor

Maybe just adding TODO comments on those functions so that we can easily clean up later?

@rkooo567 rkooo567 left a comment

Btw I will still wait for Eric's approval, but it LGTM if you add a python test.


ericl commented Mar 23, 2021

Failing C++ tests

@ericl ericl added the @author-action-required label Mar 23, 2021

clarkzinzow commented Mar 25, 2021

@ericl That was my bad, my suggestion should have been

best_node = std::next(nodes_.begin(), idx)->first;

I constantly do that with iterators. 🤦 Maybe some day I'll get it through my thick head that iterators are just fancy constrained pointers.

Alex Wu and others added 2 commits March 25, 2021 16:37
@wuisawesome

@rkooo567 I'm going to remove the Python test; it's way too flaky, and exact scheduling behavior isn't guaranteed to users. We should just make sure the logic is correct (unit test) and the large-scale behavior is correct (release test).

@rkooo567

@wuisawesome Why is it flaky if the logic is correct? Can you tell me a scenario where this could be unavoidably flaky?

@wuisawesome

I think the source of flakiness here is caused by the worker pool startup time, but it could involve the resource report updates too.

@wuisawesome

Yeah, I found a way to reproduce the flakiness locally, and I believe what's happening is that a bunch of tasks are being scheduled on the driver/first node, but because the warm pool hasn't started yet, they don't allocate resources, so more tasks keep being scheduled locally.

@rkooo567

Hmm, I see. Doesn't that mean this will also not work in the real world due to the same issue? Btw, what does locally mean here? If you set a 0-CPU head node, is it still reproducible?

@wuisawesome

I didn't try with a 0-CPU head node, but there's nothing special about local scheduling anymore, so the same logic applies. The problem actually exists with the PACK scheduler too, but because of the packing semantics, you can't detect it from the placement outcome, only from the number of spillbacks.

@rkooo567

So, what's the conclusion here? Are we going to go with the solution you suggested in Slack?

@rkooo567

Seems like the Windows tests are timing out: test_lease_request_leak.


rkooo567 commented Apr 1, 2021

Can you merge the latest master + lint? I think we can merge after that.

@wuisawesome

Here are some pretty pictures of scheduling throughput. They look as we would predict.

[charts: hybrid_cdf, hybrid]

@wuisawesome

The remaining test failures seem unrelated. Lint fails due to an SSL error (and passes in Buildkite). The Serve failure looks unrelated (serializing a dependency that changed?). I'm going to merge this now.

@wuisawesome wuisawesome merged commit 4fba05a into ray-project:master Apr 1, 2021
amogkam pushed a commit that referenced this pull request Apr 16, 2021
scv119 added a commit that referenced this pull request Feb 1, 2023
…of nodes in the cluster (#31934)

Why are these changes needed?
This PR takes over #26373

Currently, the initial scheduling delay for a simple f.remote() loop is approximately worker startup time (~1s) * number of nodes. There are three reasons for this:

1. Drivers do not share physical worker processes, so each raylet must start new worker processes when a new driver starts. Each raylet starts the workers when the driver first sends a lease (resource) request to that raylet.
2. The hybrid scheduling policy (#14790) prefers to pack tasks onto fewer nodes up to 50% CPU utilization before spreading tasks for load balancing.
3. The maximum number of concurrent lease requests is 10, meaning that the driver must wait for workers to start on the first 10 nodes that it contacts before sending lease requests to the next set of nodes. Because of (2), the first 10 nodes contacted are usually not unique, especially when each node has many cores.

This PR changes (3), allowing us to dynamically adjust max_pending_lease_requests based on the number of nodes in the cluster.
Without this PR, the top-k scheduling algorithm is bottlenecked by the speed of sending lease requests across the cluster.
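
A simplified sketch of the idea (the scaling rule shown here is illustrative; the exact formula in the merged change may differ):

// Illustrative only: grow the in-flight lease-request cap with cluster size so
// lease requests are not serialized behind worker startup on the first few nodes.
#include <algorithm>
#include <cstddef>
#include <cstdint>

int64_t MaxPendingLeaseRequests(size_t num_nodes) {
  // Keep at least the old fixed limit of 10, and scale up with node count.
  return std::max<int64_t>(10, static_cast<int64_t>(num_nodes));
}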
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…of nodes in the cluster (ray-project#31934)
