@dayshah dayshah commented Nov 10, 2025

Description

Our current RPC failure injection framework accounts for two types of failures:

  1. The callback is called with Status::Unavailable and the request is never sent.
  2. The callback is called with Status::Unavailable when the reply comes back.

It doesn't account for a third case:

  3. The request makes it to the server, and the callback is called with Status::Unavailable while the server is still processing the request. In this situation the client could retry, and the server could receive the retry while still processing the original request.

This PR adds an option to inject failures for that third case. In the more realistic iptables chaos tests, we found that this does happen.
This will help deterministically find issues like the one fixed in #58265.
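
To make the third case concrete, here is a minimal sketch in Python. It is a toy model, not Ray code: `ToyServer`, `call_with_in_flight_failure`, and the request id `drain-1` are all invented for illustration. It only shows why a retry can overlap with the original request when the client sees the failure before the server has finished.

```python
from enum import Enum


class Status(Enum):
    OK = "ok"
    UNAVAILABLE = "unavailable"


class ToyServer:
    """Hypothetical server that tracks which request ids are still being processed."""

    def __init__(self):
        self.processing = set()
        self.overlapping_retries = []

    def receive(self, request_id):
        # A retry that arrives while the original is still running is the
        # situation the third failure type is meant to expose.
        if request_id in self.processing:
            self.overlapping_retries.append(request_id)
        self.processing.add(request_id)

    def finish(self, request_id):
        self.processing.discard(request_id)


def call_with_in_flight_failure(server, request_id):
    """Deliver the request, then report failure before the server finishes."""
    server.receive(request_id)
    return Status.UNAVAILABLE  # server.finish() has not been called yet


server = ToyServer()
status = call_with_in_flight_failure(server, "drain-1")
if status is Status.UNAVAILABLE:
    # The client retries immediately; the first attempt is still in progress.
    server.receive("drain-1")
```

After this runs, `server.overlapping_retries` contains `drain-1`, which is the overlap a handler would need to tolerate.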

Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah added the go add ONLY when ready to merge, run all tests label Nov 10, 2025
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah marked this pull request as ready for review November 10, 2025 23:32
@dayshah dayshah requested a review from a team as a code owner November 10, 2025 23:32
@cursor cursor bot left a comment

Bug: Legacy Config Breaks Modern Framework.

The RPC failure configuration uses the old 3-parameter format 1:100:0 but the updated framework requires 4 parameters minimum. This will cause a RAY_CHECK_GE assertion failure when the configuration is parsed, crashing the test.

python/ray/tests/test_raylet_fault_tolerance.py#L63-L64

"RAY_testing_rpc_failure",
"NodeManagerService.grpc_client.DrainRaylet=1:100:0",

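
A quick way to spot such entries before running a test could look like the sketch below. It assumes the comma-separated `Method=a:b:c:d` layout shown in this thread; `find_legacy_entries` is a hypothetical helper, not part of Ray, and it only counts fields rather than reproducing the real parser.

```python
def find_legacy_entries(config, min_fields=4):
    """Return the method names whose spec has fewer than min_fields parts."""
    legacy = []
    for entry in config.split(","):
        # Split "Method=spec" once; the spec itself is colon-separated.
        method, _, spec = entry.partition("=")
        if len(spec.split(":")) < min_fields:
            legacy.append(method)
    return legacy


old_style = "NodeManagerService.grpc_client.DrainRaylet=1:100:0"
flagged = find_legacy_entries(old_style)
```

Here `flagged` holds the `DrainRaylet` entry, because `1:100:0` has only three fields.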


@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Nov 11, 2025
Signed-off-by: dayshah <dhyey2019@gmail.com>


- @pytest.mark.parametrize("deterministic_failure", ["request", "response"])
+ @pytest.mark.parametrize("deterministic_failure", ["request", "response", "in_flight"])
Contributor

nit: use RPC_FAILURE_TYPES

Collaborator

Seems this comment is not addressed

Contributor

@Sparks0219 Sparks0219 left a comment

Could you also update the comment for TODO(#58246) in data_release_tests.yaml?

Collaborator

@jjyao jjyao left a comment

nice!

"cmd": "ray start --head",
"env": {
- "RAY_testing_rpc_failure": "ray::rpc::InternalKVGcsService.grpc_client.InternalKVGet=2:50:50,CoreWorkerService.grpc_client.PushTask=3:50:50"
+ "RAY_testing_rpc_failure": "ray::rpc::InternalKVGcsService.grpc_client.InternalKVGet=3:33:33:33,CoreWorkerService.grpc_client.PushTask=3:33:33:33"
Collaborator

Gonna change it to a proper json

RPC_FAILURE_MAP = {
"request": "100:0:0",
"response": "0:100:0",
"in_flight": "0:0:100",
Collaborator

seems our codebase uses it as one word?

Contributor Author

Collaborator

@edoakes edoakes Nov 14, 2025

my boy cited merriam webster 💪

Collaborator

definitely two words: "the requests are in flight" and "the in-flight requests"
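
For reference, the `RPC_FAILURE_MAP` values discussed in this thread are the three probability fields; the max-failure count goes in front of them. A minimal sketch of how a test might assemble the full value, assuming the `Method=max:req:resp:in_flight` layout described elsewhere in this PR (`rpc_failure_env` is a hypothetical helper, not a function from the Ray test suite):

```python
RPC_FAILURE_MAP = {
    "request": "100:0:0",
    "response": "0:100:0",
    "in_flight": "0:0:100",
}


def rpc_failure_env(method, failure_type, max_failures=1):
    """Build one RAY_testing_rpc_failure entry for a deterministic failure type."""
    probabilities = RPC_FAILURE_MAP[failure_type]
    return f"{method}={max_failures}:{probabilities}"


value = rpc_failure_env("NodeManagerService.grpc_client.DrainRaylet", "in_flight")
```

With these inputs `value` is `NodeManagerService.grpc_client.DrainRaylet=1:0:0:100`, meaning one injected failure that is always of the in-flight kind.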

// it will apply to all methods.
RAY_CONFIG(std::string, testing_asio_delay_us, "")

/// To use this, simply do
Contributor

"simply do" lol

Contributor Author

took it out since it's not simple anymore 💀


// You can also provide 5th, 6th, and / or 7th optional parameters to specify that there
// should be at least a certain amount of request, response, and in flight failures.
// flight failures. By default these are set to 0, but by setting them to positive values
Contributor

typo: "flight failures" is written twice

Contributor Author

fixed
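
As a rough illustration of the optional parameters described in that comment, the sketch below parses a spec with four required fields and up to three optional minimum counts that default to 0. The field names and `parse_failure_spec` are assumptions made for this example; the real parsing lives in the C++ framework and may differ in details.

```python
from dataclasses import dataclass


@dataclass
class FailureSpec:
    max_failures: int
    req_prob: int
    resp_prob: int
    in_flight_prob: int
    # Optional 5th, 6th, and 7th fields: guaranteed minimum failures per type.
    min_req_failures: int = 0
    min_resp_failures: int = 0
    min_in_flight_failures: int = 0


def parse_failure_spec(spec):
    """Parse 'max:req:resp:in_flight[:min_req[:min_resp[:min_in_flight]]]'."""
    parts = [int(p) for p in spec.split(":")]
    if not 4 <= len(parts) <= 7:
        raise ValueError(f"expected 4 to 7 fields, got {len(parts)}: {spec!r}")
    return FailureSpec(*parts)


basic = parse_failure_spec("3:33:33:33")
with_minimums = parse_failure_spec("5:25:25:25:1:1:1")
```

`basic` leaves all three minimums at 0, while `with_minimums` asks for at least one failure of each type.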

// Key is the RPC call name and value is a four part colon separated structure. It
// contains the max number of failures to inject + probability of req failure +
- // probability of reply failure.
+ // probability of reply failure + probability of in flight failure.
Contributor

super nit: should be in-flight since it's an adjective

Contributor Author

changed, grammar 💪🏽

failable.num_remaining_failures--;
return RpcFailure::Response;
}
if (random_number <= failable.req_failure_prob + failable.resp_failure_prob +
Contributor

nit: could just be <= 100.

Contributor

or this could just be structured an if/else if/else

Contributor Author

restructured to if else if else, but why <=100? all 3 numbers added up could still be <100, like 10:10:10?
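
The point about `10:10:10` can be seen in a small Python sketch of the cumulative-threshold idea. This is one possible reading of the logic, not the actual C++ code: it assumes the roll is an integer from 1 to 100, and `classify_roll` and `pick_failure` are hypothetical names. The remaining-failure counter from the snippet above is left out.

```python
import random


def classify_roll(roll, req_prob, resp_prob, in_flight_prob):
    """Map a roll in [1, 100] to a failure type, or None for no failure."""
    if roll <= req_prob:
        return "request"
    elif roll <= req_prob + resp_prob:
        return "response"
    elif roll <= req_prob + resp_prob + in_flight_prob:
        return "in_flight"
    else:
        # When the three probabilities sum to less than 100, the rest of the
        # range means no failure is injected.
        return None


def pick_failure(req_prob, resp_prob, in_flight_prob):
    return classify_roll(random.randint(1, 100), req_prob, resp_prob, in_flight_prob)
```

With `10:10:10`, rolls 1 to 30 inject a failure and rolls 31 to 100 do not, so comparing the last branch against 100 would turn the remaining 70% into in-flight failures.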

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah requested review from Sparks0219 and jjyao November 16, 2025 03:32



Signed-off-by: dayshah <dhyey2019@gmail.com>
@github-actions github-actions bot disabled auto-merge November 17, 2025 06:08
@dayshah dayshah enabled auto-merge (squash) November 17, 2025 06:11
@dayshah dayshah merged commit fbf3c32 into ray-project:master Nov 17, 2025
7 checks passed
@dayshah dayshah deleted the in-flight-fail branch November 17, 2025 08:46
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…#58512)

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
…#58512)

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: YK <1811651+ykdojo@users.noreply.github.com>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests


4 participants