Contributor

@Sparks0219 Sparks0219 commented Oct 16, 2025

Description

Makes ReleaseUnusedBundles fault tolerant and enables retries on network failures. Adds a C++ test to verify idempotency and a Python integration test. Also adds a fake worker class, since I needed to no-op the underlying connection used in DestroyWorker and didn't want to modify the mock class.
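
A minimal sketch of the retry pattern this description refers to, written in Python rather than Ray's C++ client code. `call_with_retries` and `FlakyRelease` are hypothetical names used only for illustration and are not part of Ray; the sketch assumes a network failure surfaces as `ConnectionError`. The point is only that retrying is safe when the handler tolerates being invoked more than once.

```python
def call_with_retries(rpc, max_attempts=3):
    """Call rpc() until it succeeds or max_attempts is exhausted."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return rpc()
        except ConnectionError as err:
            # Assumption: a network failure is the only retryable error.
            last_error = err
    raise last_error


class FlakyRelease:
    """Hypothetical RPC stub: fails the first `failures` calls, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated network failure")
        return "released"


flaky = FlakyRelease(failures=2)
result = call_with_retries(flaky)
```

Here the first two attempts fail and the third succeeds, so the caller sees a normal reply even though the call was sent three times.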

Signed-off-by: joshlee <joshlee@anyscale.com>
@Sparks0219 Sparks0219 requested review from dayshah and edoakes October 16, 2025 06:14
@Sparks0219 Sparks0219 added the "go" label (add ONLY when ready to merge, run all tests) Oct 16, 2025
@Sparks0219 Sparks0219 marked this pull request as ready for review October 16, 2025 06:15
@Sparks0219 Sparks0219 requested a review from a team as a code owner October 16, 2025 06:15
Signed-off-by: joshlee <joshlee@anyscale.com>
@ray-gardener ray-gardener bot added the "core" label (Issues that should be addressed in Ray Core) Oct 16, 2025
Collaborator

edoakes commented Oct 16, 2025

@Sparks0219 merge conflict, and could you please separate the worker interface refactoring into another PR?

@Sparks0219 Sparks0219 force-pushed the joshlee/make-release-unused-bundles-fault-tolerant branch from 992e1bb to 240ef63 October 16, 2025 18:54
@Sparks0219 Sparks0219 force-pushed the joshlee/make-release-unused-bundles-fault-tolerant branch 2 times, most recently from f896cd6 to 240ef63 October 16, 2025 19:55
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>

TEST_F(NodeManagerTest, TestHandleRequestWorkerLeaseInfeasibleIdempotent) {
- auto lease_spec = BuildLeaseSpec({{"CPU", 1}});
+ auto lease_spec = BuildLeaseSpec({{"CPU", 11}});
Contributor

why this change?

Contributor Author

Ah, I needed to expand the CPU resources for this test from 0 to a positive number (10) for the bundle test, but then the infeasible lease test failed since 1 CPU is no longer infeasible. I'll change this to a constexpr rather than a magic number so it's clearer.

bundle_spec_map_;

friend bool IsBundleRegistered(const PlacementGroupResourceManager &manager,
const BundleID &bundle_id);
Contributor

👀
no way around? + is it necessary to assert on this

Contributor Author

@Sparks0219 Sparks0219 Oct 16, 2025

mmm I think we need some way of looking into pgmanager state since ReleaseUnusedBundles is calling pgmanager methods and I don't think there's anything I can use for this in the class currently :(

*local_lease_manager_);

placement_group_resource_manager_ =
std::make_unique<NewPlacementGroupResourceManager>(*cluster_resource_scheduler_);
Contributor

can save for the refactor PR, but why is it called new 😵‍💫

Contributor Author

👀 no clue

monkeypatch.setenv(
"RAY_testing_rpc_failure",
"NodeManagerService.grpc_client.ReleaseUnusedBundles=1:100:0"
+ ",NodeManagerService.grpc_client.CancelResourceReserve=100:100:0",
Contributor

just do -1 if you want to kill it

but I don't love this

So it's possible for this to happen without injecting failures on CancelResourceReserve, because the GCS could die before all the Cancels happen. You should leave a comment describing that, otherwise it's pretty hard to figure out.

Contributor Author

Yeah, exactly, but it's inherently flaky, so we need to permanently block CancelResourceReserve to make it deterministic. I'll leave a comment and use -1 instead to be more clear about my intentions.
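
A rough sketch of how the failure-injection string from the snippet above could be assembled with the suggested `-1`. The field layout `max_failures:request_failure_percent:response_failure_percent` and the reading of `-1` as "no limit" are inferred from the values in this thread, not checked against Ray's documentation, and `build_rpc_failure_spec` is a hypothetical helper, not a Ray API.

```python
def build_rpc_failure_spec(rules):
    """Join {method: (max_failures, req_pct, resp_pct)} into one comma-separated spec.

    Assumption: the three fields mean what their names say here; this is
    inferred from the PR snippet rather than from Ray's documentation.
    """
    return ",".join(
        f"{method}={max_failures}:{req_pct}:{resp_pct}"
        for method, (max_failures, req_pct, resp_pct) in rules.items()
    )


spec = build_rpc_failure_spec(
    {
        # Fail the first ReleaseUnusedBundles request so the retry path runs.
        "NodeManagerService.grpc_client.ReleaseUnusedBundles": (1, 100, 0),
        # Block CancelResourceReserve for the whole test (-1, per the review
        # suggestion). In a real cluster the GCS could die before all the
        # Cancels happen; blocking them makes that leaked-bundle case
        # deterministic instead of timing dependent.
        "NodeManagerService.grpc_client.CancelResourceReserve": (-1, 100, 0),
    }
)
```

The resulting `spec` string is what would be passed as the value of `RAY_testing_rpc_failure` in the `monkeypatch.setenv` call.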

Signed-off-by: joshlee <joshlee@anyscale.com>
@Sparks0219 Sparks0219 requested a review from dayshah October 17, 2025 23:23
Contributor

@dayshah dayshah left a comment

couple questions but lgtm

def test_release_unused_bundles_idempotent(
inject_rpc_failures, ray_start_cluster_head_with_external_redis
):
# NOTE: Not testing response failure because the leaked bundle is cleaned up anyway
Contributor

Is there any harm in testing resp failure too just for completeness?

Contributor Author

Eh, not really. I can add it in, though only the request-failure case fails without this change; the response-failure case passes.
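
A toy illustration of the request/response failure distinction discussed here. `ToyChannel` and `run_once` are hypothetical and do not model Ray's gRPC layer: a request failure means the handler never runs, while a response failure means the handler ran but the caller never saw the reply.

```python
class ToyChannel:
    """Hypothetical channel that can drop the request or the reply."""

    def __init__(self, handler, fail_request=False, fail_response=False):
        self.handler = handler
        self.fail_request = fail_request
        self.fail_response = fail_response

    def call(self, payload):
        if self.fail_request:
            raise ConnectionError("request lost before reaching the handler")
        reply = self.handler(payload)
        if self.fail_response:
            raise ConnectionError("reply lost after the handler ran")
        return reply


def run_once(**failure_flags):
    """Return (payloads seen by the handler, whether the caller saw an error)."""
    seen = []

    def handler(payload):
        seen.append(payload)
        return "ok"

    channel = ToyChannel(handler, **failure_flags)
    try:
        channel.call("release")
        caller_saw_error = False
    except ConnectionError:
        caller_saw_error = True
    return seen, caller_saw_error


request_failure = run_once(fail_request=True)
response_failure = run_once(fail_response=True)
```

In both cases the caller sees an error, but only in the response-failure case has the handler already done its work.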

port_(port),
proc_(Process::CreateNewDummy()),
connection_([&io_context]() {
local_stream_socket socket(io_context);
Contributor

are we creating a real socket in a fake?

Contributor Author

Yup, I am. It seems like a bit of a pain to rework it to not take a real socket, so I just passed a real socket + dummy lambdas as a compromise.

// bundle_in_use == true: a bundle is marked as in use in the placement group resource
// manager. ReleaseUnusedBundles is expected to not release the bundle.
// bundle_in_use == false: a bundle is not marked as in use in the placement group
// resource manager. ReleaseUnusedBundles is expected to release the bundle.
Contributor

bundle_in_use is state, right? So there's the case where it was in use on the first cancel and not in use on the retry. The behavior is still correct, since it takes state into account; it's just not perfectly "idempotent".

Contributor Author

True, good point: any state change between the retries will break idempotency. I'll rename the test to refer to retries instead.
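
A toy model of the point above, assuming the handler releases every bundle that is not currently in use. `ToyBundleManager` is hypothetical and is not the placement group resource manager from this PR. A retry under unchanged state is a no-op, but a retry after the in-use set changes releases more.

```python
class ToyBundleManager:
    """Hypothetical stand-in that tracks bundle IDs in a set."""

    def __init__(self, bundles):
        self.bundles = set(bundles)

    def release_unused(self, in_use):
        """Release every tracked bundle not in `in_use`; return what was released."""
        unused = self.bundles - set(in_use)
        self.bundles -= unused
        return sorted(unused)


manager = ToyBundleManager({"b1", "b2", "b3"})

# First attempt: b1 is in use, so b2 and b3 are released.
first = manager.release_unused(in_use={"b1"})

# Retry with the same state: nothing is left to release.
retry_same_state = manager.release_unused(in_use={"b1"})
remaining_after_retry = set(manager.bundles)

# Retry after b1 stopped being in use: the outcome differs from the first call.
retry_after_state_change = manager.release_unused(in_use=set())
```

The last call is still the correct behavior for the state at that moment, which is why "retries" describes the test more accurately than "idempotent".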

Signed-off-by: joshlee <joshlee@anyscale.com>
@dayshah dayshah enabled auto-merge (squash) October 20, 2025 20:11
@dayshah dayshah merged commit a9065a3 into ray-project:master Oct 20, 2025
7 checks passed
kamil-kaczmarek pushed a commit that referenced this pull request Oct 20, 2025
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Signed-off-by: joshlee <joshlee@anyscale.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>

Labels

core (Issues that should be addressed in Ray Core), go (add ONLY when ready to merge, run all tests)

4 participants