-
Notifications
You must be signed in to change notification settings - Fork 7k
[core] Make ReleaseUnusedBundles Fault Tolerant #57786
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] Make ReleaseUnusedBundles Fault Tolerant #57786
Conversation
Signed-off-by: joshlee <joshlee@anyscale.com>
|
@Sparks0219 merge conflict, and could you please separate the worker interface refactoring into another PR? |
992e1bb to
240ef63
Compare
f896cd6 to
240ef63
Compare
Signed-off-by: joshlee <joshlee@anyscale.com>
|
|
||
| TEST_F(NodeManagerTest, TestHandleRequestWorkerLeaseInfeasibleIdempotent) { | ||
| auto lease_spec = BuildLeaseSpec({{"CPU", 1}}); | ||
| auto lease_spec = BuildLeaseSpec({{"CPU", 11}}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah I needed to expand the cpu resources for this test from 0 to some positive number (10) for the bundle test, but then the infeasible lease test failed since 1 cpu is no longer infeasible. I'll change this to a constexpr and not some magic number so it's more clear
| bundle_spec_map_; | ||
|
|
||
| friend bool IsBundleRegistered(const PlacementGroupResourceManager &manager, | ||
| const BundleID &bundle_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👀
no way around? + is it necessary to assert on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mmm I think we need some way of looking into pgmanager state since ReleaseUnusedBundles is calling pgmanager methods and I don't think there's anything I can use for this in the class currently :(
| *local_lease_manager_); | ||
|
|
||
| placement_group_resource_manager_ = | ||
| std::make_unique<NewPlacementGroupResourceManager>(*cluster_resource_scheduler_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can save for the refactor pr, but why is called new 😵💫
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👀 no clue
| monkeypatch.setenv( | ||
| "RAY_testing_rpc_failure", | ||
| "NodeManagerService.grpc_client.ReleaseUnusedBundles=1:100:0" | ||
| + ",NodeManagerService.grpc_client.CancelResourceReserve=100:100:0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just do -1 if you want to kill it
but I don't love this
So like it's possible to happen without injecting failures on CancelResourceReserve because the gcs could be die before all the Cancel's happen. You should leave a comment describing that otherwise it's p hard to figure out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea exactly, but its inherently flaky so we need to perma block CancelResourceReserve to make it deterministic. I'll leave a comment + put -1 instead to be mroe clear about my intentions
Signed-off-by: joshlee <joshlee@anyscale.com>
dayshah
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couple questions but lgtm
| def test_release_unused_bundles_idempotent( | ||
| inject_rpc_failures, ray_start_cluster_head_with_external_redis | ||
| ): | ||
| # NOTE: Not testing response failure because the leaked bundle is cleaned up anyway |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any harm in testing resp failure too just for completeness?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eh not really I can add it in, though only request will fail without this change and response passes.
| port_(port), | ||
| proc_(Process::CreateNewDummy()), | ||
| connection_([&io_context]() { | ||
| local_stream_socket socket(io_context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we creating a real socket in a fake?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup I am, seems a bit of a pain to rework it not to take in a real socket so just passed real socket + dummy lambdas as a compromise
| // bundle_in_use == true: a bundle is marked as in use in the placement group resource | ||
| // manager. ReleaseUnusedBundles is expected to not release the bundle. | ||
| // bundle_in_use == false: a bundle is not marked as in use in the placement group | ||
| // resource manager. ReleaseUnusedBundles is expected to release the bundle. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bundle_in_use is state right, so there's the case where on the 1st cancel, it was in use, and on the retry it wasn't in use. But ya the behavior is correct anyways to take state into account but just not perfectly "idempotent"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True good point, any state change in the middle between the retries will cause issues with idempotency. I'll rename the test to retries instead
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com> Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com> Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Description
Making ReleaseUnusedBundles fault tolerant and enabling retries on network failures. Added cpp test to verify idempotency and created a python integration test. Also created a fake worker class since I needed to noop the underlying connection which is used in DestroyWorker and didn't want to modify the mock class.
Related issues
Types of change
Checklist
Does this PR introduce breaking changes?
Testing:
Code Quality:
git commit -s)Documentation:
doc/source/(if applicable)Additional context