[Core] Cancel lease requests before returning a PG bundle #45919
Conversation
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
src/ray/raylet/node_manager.cc
Outdated
```cpp
// Cancel lease requests related to unused bundles
cluster_task_manager_->CancelTasks(
    [&](const RayTask &task) {
      const auto bundle_id = task.GetTaskSpecification().PlacementGroupBundleId();
      return !bundle_id.first.IsNil() && 0 == in_use_bundles.count(bundle_id);
    },
    rpc::RequestWorkerLeaseReply::SCHEDULING_CANCELLED_INTENDED,
    "The task is cancelled because it uses placement group bundles that are not "
    "registered to GCS. It can happen upon GCS restart.");
```
Actual fix.
src/ray/raylet/node_manager.cc
Outdated
```cpp
// Cancel lease requests related to the placement group to be removed.
cluster_task_manager_->CancelTasks(
    [&](const RayTask &task) {
      const auto bundle_id = task.GetTaskSpecification().PlacementGroupBundleId();
      return bundle_id.first == bundle_spec.PlacementGroupId();
    },
    rpc::RequestWorkerLeaseReply::SCHEDULING_CANCELLED_PLACEMENT_GROUP_REMOVED,
    "");
```
Actual fix.
Side comment: we found multiple similar cases where, when we want to kill all workers matching a predicate (e.g. the job died, the root detached actor died, the PG died, ...), we have to do it in multiple places.
I wonder if we can have a unified mechanism to do the killing for all of them.
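A minimal sketch of what such a unified mechanism could look like (hypothetical, not an existing Ray API; `UnifiedReaper`, `KillAllUnder`, and the toy `Task`/`Worker` types are illustrative stand-ins for the real raylet structures):

```cpp
#include <functional>
#include <string>
#include <vector>

// Toy stand-ins for queued lease requests and leased workers.
struct Task { std::string owner; };
struct Worker { std::string owner; };

class UnifiedReaper {
 public:
  using Pred = std::function<bool(const std::string &owner)>;

  // Cancel queued lease requests whose owner matches the predicate.
  void CancelQueuedLeases(const Pred &pred, const std::string &reason) {
    std::erase_if(queued_, [&](const Task &t) { return pred(t.owner); });
    (void)reason;  // in a real system this would be reported back to the client
  }

  // Destroy already-leased workers whose owner matches the predicate.
  void KillLeasedWorkers(const Pred &pred) {
    std::erase_if(leased_, [&](const Worker &w) { return pred(w.owner); });
  }

  // The unified entry point: one call covers both cleanup sites, so
  // "kill everything belonging to X" lives in one place instead of several.
  void KillAllUnder(const Pred &pred, const std::string &reason) {
    CancelQueuedLeases(pred, reason);
    KillLeasedWorkers(pred);
  }

 private:
  std::vector<Task> queued_;
  std::vector<Worker> leased_;
};
```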
```cpp
@@ -233,6 +233,7 @@ void CoreWorkerProcessImpl::InitializeSystemConfig() {
  thread.join();

  RayConfig::instance().initialize(promise.get_future().get());
  ray::asio::testing::init();
```
This makes sure we can set the testing_asio_delay_us env var through _system_configs.
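A hedged illustration of why the call ordering matters (toy code, not the actual Ray implementation; the config store and hook names below are illustrative only): if the asio testing hook snapshots its delay when init() runs, calling it before the system config has been applied would capture the default value instead of the one passed through _system_configs.

```cpp
#include <cstdint>
#include <iostream>

// Toy config store standing in for RayConfig (illustrative names only).
namespace cfg {
int64_t testing_asio_delay_us = 0;
}

// Toy testing hook standing in for ray::asio::testing.
namespace asio_testing {
int64_t snapshot = 0;
void init() { snapshot = cfg::testing_asio_delay_us; }  // reads the config once
}

int main() {
  cfg::testing_asio_delay_us = 1000;  // applied from _system_configs
  asio_testing::init();               // must run after the config is applied
  std::cout << asio_testing::snapshot << "\n";  // prints 1000, not 0
}
```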
Why are these changes needed?
To successfully return a PG bundle (`CancelResourceReserve` and `ReleaseUnusedBundles`), the bundle's resources need to be completely free (i.e. total == available). To ensure that, the raylet first destroys leased workers that are currently using the PG bundle resources so that those resources can be freed. However, this alone cannot guarantee that all bundle resources will be freed, since a lease request that is popping a worker has also already acquired bundle resources, so we need to cancel such lease requests as well.

After this PR, we no longer need the retry in #42942.
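A self-contained sketch of the invariant described above (a toy model under assumed semantics, not Ray code): a bundle can only be returned once every holder of its resources, leased workers and pending lease requests alike, has released them.

```cpp
#include <cassert>

struct Bundle {
  double total = 1.0;
  double available = 1.0;

  void Acquire(double amount) { available -= amount; }
  void Release(double amount) { available += amount; }
  // The bundle can only be returned when nothing holds its resources.
  bool CanReturn() const { return total == available; }
};

int main() {
  Bundle b;
  b.Acquire(0.5);  // a leased worker holds part of the bundle
  b.Acquire(0.5);  // a pending lease request (popping a worker) holds the rest

  b.Release(0.5);  // destroying the leased worker alone is not enough...
  assert(!b.CanReturn());

  b.Release(0.5);  // ...the pending lease request must be cancelled too
  assert(b.CanReturn());
}
```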
Related issue number
Closes #45642
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.