[core] Hybrid scheduling policy. #14790
Conversation
We can also include memory, but I think we should exclude GPU. Bin packing
for GPU is always preferable since they are so expensive.
…On Thu, Mar 18, 2021, 5:03 PM, Alex Wu commented on this pull request, in src/ray/raylet/scheduling/cluster_resource_data.cc:
> @@ -164,6 +164,27 @@ NodeResources ResourceMapToNodeResources(
return node_resources;
}
+float NodeResources::CalculateCriticalResourceUtilization() const {
+ float highest = 0;
+
+ for (const auto &capacity : predefined_resources) {
+ float utilization = 1 - (capacity.available.Double() / capacity.total.Double());
+ if (utilization > highest) {
+ highest = utilization;
+ }
+ }
Hmmm good point... I can imagine similar situations for CPU and memory
though.
How about we do all predefined resources and no custom resources?
The GPU is in the cluster either way though right? Is this a heuristic to help with scaling down?
Yes. Also, GPUs don't really get oversubscribed the way CPU and memory do, so it doesn't make sense to try to spread them.
Is it still WIP btw?
Can we add one python test? I think we can make it something like this:
- 3-node cluster (0 CPU head node)
- Each node has 100MB object store memory & 2 CPUs
- Make a task that returns a 55MB object.
- Create 3 tasks.
- Make sure every node has 55MB of object store memory in use.
/// Whether to use the hybrid scheduling policy, or one of the legacy spillback
/// strategies. In the hybrid scheduling strategy, leases are packed until a threshold,
/// then spread, weighted by critical resource usage.
RAY_CONFIG(bool, scheduler_hybrid_scheduling,
What happens if both this and scheduler_loadbalance_spillback are true?
Looks like scheduler_loadbalance_spillback
is ignored. Can you add a TODO comment to clean this up?
This flag takes precedence over the other one.
///
/// \return -1 if the task is infeasible, otherwise the node id (key in `nodes`) to
/// schedule on.
int64_t HybridPolicy(const TaskRequest &task_request, const int64_t local_node_id,
Can you add a TODO to move other policies to this file? (load balancing & simple bin packing)
Oh I was actually thinking we would delete those, but sure
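For context, here is a minimal, self-contained sketch of the pack-then-spread behavior described by the config comment and the HybridPolicy doc comment above; the ExampleNode struct and function are illustrative assumptions, not Ray's actual types or implementation:

#include <cstdint>
#include <limits>
#include <vector>

// Illustrative stand-in for a node's scheduling state; not Ray's real type.
struct ExampleNode {
  int64_t id;
  float critical_utilization;  // e.g. max of CPU / memory / object store utilization
  bool available;              // can run the task right now
  bool feasible;               // could ever run the task
};

// Pack onto the first available node below the utilization threshold
// (nodes are assumed to be given in preference order, local node first);
// otherwise spread to the feasible node with the lowest critical resource
// utilization. Returns -1 if the task is infeasible on every node.
int64_t ExampleHybridPolicy(const std::vector<ExampleNode> &nodes,
                            float spread_threshold = 0.5f) {
  for (const auto &node : nodes) {
    if (node.available && node.critical_utilization < spread_threshold) {
      return node.id;  // pack phase
    }
  }
  int64_t best = -1;
  float best_utilization = std::numeric_limits<float>::infinity();
  for (const auto &node : nodes) {
    if (node.feasible && node.critical_utilization < best_utilization) {
      best_utilization = node.critical_utilization;
      best = node.id;  // spread phase: least-utilized feasible node
    }
  }
  return best;
}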
  float highest = 0;
  for (const auto &i : {CPU, MEM, OBJECT_STORE_MEM}) {
    if (i >= this->predefined_resources.size()) {
      continue;
Does this ever happen? Why don't we just add a check here?
Yeah, this actually happens. The check was inspired by real events :p (most of the unit tests actually trigger it)
That's interesting... Why is that? I thought the first N entries are reserved for predefined resources..
Yeah, but the original data structure is actually a vector, and it isn't resized to its full size at initialization time. This means that for some time, a task req of {"CPU": 1} will have predefined_resources.size() == 1, until some code comes along and resizes the vector.
That's a bit of a funky data model... Given that predefined resources are static by definition, wouldn't it be a lot cleaner if predefined_resources
was a static array whose elements may be unset? That would still allow for enum-based indexing (which I like), but would get rid of all of the dynamic resizing and size checks, and should make it far less brittle to hard-to-catch writer bugs around improper resizing. If that might make sense, this would obviously be a refactor that can wait for a future PR.
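For concreteness, a sketch of what that data model might look like; the type names here are illustrative assumptions, not existing Ray code:

#include <array>
#include <optional>

// Illustrative only: a fixed-size array indexed by the predefined-resource enum,
// where "unset" replaces the dynamic resizing of the current vector.
enum ExamplePredefinedResource { EX_CPU = 0, EX_MEM, EX_GPU, EX_OBJECT_STORE_MEM, EX_MAX };

struct ExampleResourceCapacity {
  double total = 0;
  double available = 0;
};

struct ExampleNodeResources {
  // Enum-based indexing is always in bounds; an entry stays std::nullopt until the
  // resource is set, so no size checks or resizing are needed.
  std::array<std::optional<ExampleResourceCapacity>, EX_MAX> predefined_resources;
};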
      highest = utilization;
    }
  }
  return highest;
Maybe just use std::max(highest, utilization)?
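i.e., assuming simplified capacity types for illustration (not the actual Ray code), the loop could be written as:

#include <algorithm>
#include <vector>

struct ExampleCapacity {
  double total = 0;
  double available = 0;
};

// Highest utilization across the given resources (e.g. CPU, memory,
// object store memory), in the range [0, 1].
float ExampleCriticalResourceUtilization(const std::vector<ExampleCapacity> &capacities) {
  float highest = 0;
  for (const auto &capacity : capacities) {
    if (capacity.total <= 0) {
      continue;  // skip unset resources to avoid dividing by zero
    }
    float utilization = 1 - static_cast<float>(capacity.available / capacity.total);
    highest = std::max(highest, utilization);
  }
  return highest;
}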
@@ -164,6 +164,88 @@ NodeResources ResourceMapToNodeResources(
  return node_resources;
}

float NodeResources::CalculateCriticalResourceUtilization() const { |
Can you write unit tests for these 3 new functions?
bool NodeResources::IsAvailable(const TaskRequest &task_req) const {
  // First, check predefined resources.
  for (size_t i = 0; i < PredefinedResources_MAX; i++) {
    if (i >= this->predefined_resources.size()) {
Same here. Is this condition ever invoked? Why don't we just add a check?
}

bool NodeResources::IsFeasible(const TaskRequest &task_req) const {
  // First, check predefined resources.
This looks almost identical to IsAvailable
except that it uses total instead of available. Any good way to reduce code duplication?
Could add a shared private IsWithinCapacity helper that also takes a lambda like [](const ResourceCapacity &capacity) { return capacity.available; }, but that would probably sacrifice a bit of readability. Definitely shouldn't block the PR IMO.
bool NodeResources::IsAvailable(const TaskRequest &task_req) const {
  return IsWithinCapacity(task_req, [](const ResourceCapacity &capacity) { return capacity.available; });
}

bool NodeResources::IsFeasible(const TaskRequest &task_req) const {
  return IsWithinCapacity(task_req, [](const ResourceCapacity &capacity) { return capacity.total; });
}

bool NodeResources::IsWithinCapacity(
    const TaskRequest &task_req,
    std::function<FixedPoint(const ResourceCapacity &)> get_capacity) const {
  // First, check predefined resources.
  for (size_t i = 0; i < PredefinedResources_MAX; i++) {
    if (i >= this->predefined_resources.size()) {
      if (task_req.predefined_resources[i].demand != 0) {
        return false;
      }
      continue;
    }
    const auto &resource = get_capacity(this->predefined_resources[i]);
    const auto &demand = task_req.predefined_resources[i].demand;
    bool is_soft = task_req.predefined_resources[i].soft;
    if (resource < demand && !is_soft) {
      return false;
    }
  }
  // Now check custom resources.
  for (const auto &task_req_custom_resource : task_req.custom_resources) {
    bool is_soft = task_req_custom_resource.soft;
    auto it = this->custom_resources.find(task_req_custom_resource.id);
    if (it == this->custom_resources.end()) {
      // Missing custom resource: only a hard requirement fails.
      if (!is_soft) {
        return false;
      }
    } else if (task_req_custom_resource.demand > get_capacity(it->second) && !is_soft) {
      return false;
    }
  }
  return true;
}
float CalculateCriticalResourceUtilization() const;
/// Returns true if the node has the available resources to run the task.
/// Note: This doesn't account for the binpacking of unit resources.
bool IsAvailable(const TaskRequest &task_req) const;
Do we still need IsFeasible
in the cluster_resource_scheduler
after this? (except for the legacy logic)
I think it's only used by the legacy scheduler.
Maybe just add TODO comments on those functions so that we can easily clean up later?
Btw I will still wait for Eric's approval, but it LGTM if you add a python test.
Failing C++ tests
@ericl That was my bad, my suggestion should have been best_node = std::next(nodes_.begin(), idx)->first; I constantly do that with iterators. 🤦 Maybe some day I'll get it through my thick head that iterators are just fancy constrained pointers.
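A small illustration of the difference, using a plain std::map rather than the scheduler's actual container:

#include <cstdint>
#include <iostream>
#include <iterator>
#include <map>
#include <string>

int main() {
  std::map<int64_t, std::string> nodes_ = {{1, "a"}, {2, "b"}, {3, "c"}};
  size_t idx = 1;
  // Map iterators are bidirectional, not random access, so nodes_.begin() + idx
  // does not compile; std::next advances the iterator idx times instead.
  int64_t best_node = std::next(nodes_.begin(), idx)->first;
  std::cout << best_node << std::endl;  // prints 2
  return 0;
}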
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
@rkooo567 I'm going to remove the python test; it's way too flaky, and exact scheduling behavior isn't guaranteed to users. We should just make sure the logic is correct (unit test) and the large-scale behavior is correct (release test).
@wuisawesome Why is it flaky if the logic is correct? Can you tell me a scenario where this could be unavoidably flaky?
I think the source of flakiness here is the worker pool startup time, but it could involve the resource report updates too.
Yeah, I found a way to reproduce the flakiness locally, and I believe what's happening is that a bunch of tasks are being scheduled on the driver/first node, but because the warm pool hasn't started yet, they don't allocate resources, so more tasks are scheduled locally.
Hmm, I see. Doesn't that mean this will also not work in the real world due to the same issue? Btw, what does
I didn't try with a 0 CPU head node, but there's nothing special about local scheduling anymore, so the same logic applies. The problem actually exists with the PACK scheduler too, but because of the packing semantics, you can't detect it from the outcome (just the number of spillbacks).
So, what's the conclusion here? Are we going to do the solution that you suggested in Slack?
Seems like the Windows tests are timing out.
Can you merge the latest master + lint? I think we can merge after that.
The remaining test failures seem unrelated. Lint fails due to an SSL error (and passes in Buildkite). The Serve failure looks unrelated (serializing a dependency that changed?). I'm going to merge this now.
…of nodes in the cluster (#31934)

Why are these changes needed?

This PR takes over #26373. Currently, the initial scheduling delay for a simple f.remote() loop is approximately worker startup time (~1s) * number of nodes. There are three reasons for this:
1. Drivers do not share physical worker processes, so each raylet must start new worker processes when a new driver starts. Each raylet starts the workers when the driver first sends a lease (resource) request to that raylet.
2. The hybrid scheduling policy (#14790) prefers to pack tasks on fewer nodes up to 50% CPU utilization before spreading tasks for load balancing.
3. The maximum number of concurrent lease requests is 10, meaning that the driver must wait for workers to start on the first 10 nodes that it contacts before sending lease requests to the next set of nodes. Because of (2), the first 10 nodes contacted are usually not unique, especially when each node has many cores.

This PR changes (3), allowing us to dynamically adjust max_pending_lease_requests based on the number of nodes in the cluster. Without this change, the top-k scheduling algorithm is bottlenecked by the speed of sending lease requests across the cluster.
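As a rough sketch of the idea (the function name and the scaling rule are assumptions for illustration, not the PR's actual code):

#include <algorithm>
#include <cstdint>

// Illustrative only: grow the cap on in-flight lease requests with cluster size,
// so scheduling is not serialized on worker startup for the first few nodes contacted.
int64_t ExampleMaxPendingLeaseRequests(int64_t num_nodes) {
  constexpr int64_t kLegacyCap = 10;  // the previous fixed limit
  return std::max(kLegacyCap, num_nodes);
}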
Why are these changes needed?
This PR introduces a new scheduling policy which is a hybrid of a pack and round robin policy. Description from the doc string:
Related issue number
Checks
- I've run scripts/format.sh to lint the changes in this PR.