-
Notifications
You must be signed in to change notification settings - Fork 7k
[core] Adding option for in flight rpc failure injection #58512
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Legacy Config Breaks Modern Framework.
The RPC failure configuration uses the old 3-parameter format 1:100:0 but the updated framework requires 4 parameters minimum. This will cause a RAY_CHECK_GE assertion failure when the configuration is parsed, crashing the test.
python/ray/tests/test_raylet_fault_tolerance.py#L63-L64
ray/python/ray/tests/test_raylet_fault_tolerance.py
Lines 63 to 64 in 80ff33b
| "RAY_testing_rpc_failure", | |
| "NodeManagerService.grpc_client.DrainRaylet=1:100:0", |
|
|
||
|
|
||
| @pytest.mark.parametrize("deterministic_failure", ["request", "response"]) | ||
| @pytest.mark.parametrize("deterministic_failure", ["request", "response", "in_flight"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: use RPC_FAILURE_TYPES
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems this comment is not addressed
Sparks0219
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also update the comment for TODO(#58246) in data_release_tests.yaml
jjyao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
| "cmd": "ray start --head", | ||
| "env": { | ||
| "RAY_testing_rpc_failure": "ray::rpc::InternalKVGcsService.grpc_client.InternalKVGet=2:50:50,CoreWorkerService.grpc_client.PushTask=3:50:50" | ||
| "RAY_testing_rpc_failure": "ray::rpc::InternalKVGcsService.grpc_client.InternalKVGet=3:33:33:33,CoreWorkerService.grpc_client.PushTask=3:33:33:33" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gonna change it to a proper json
| RPC_FAILURE_MAP = { | ||
| "request": "100:0:0", | ||
| "response": "0:100:0", | ||
| "in_flight": "0:0:100", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems our codebase uses it as one word?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's supposed to be 2 lol https://www.merriam-webster.com/dictionary/in-flight
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my boy cited merriam webster 💪
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
definitely two words: "the requests are in flight" and "the in-flight requests"
src/ray/common/ray_config_def.h
Outdated
| // it will apply to all methods. | ||
| RAY_CONFIG(std::string, testing_asio_delay_us, "") | ||
|
|
||
| /// To use this, simply do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"simply do" lol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
took it out since it's not simple anymore 💀
src/ray/rpc/rpc_chaos.cc
Outdated
|
|
||
| // You can also provide 5th, 6th, and / or 7th optional parameters to specify that there | ||
| // should be at least a certain amount of request, response, and in flight failures. | ||
| // flight failures. By default these are set to 0, but by setting them to positive values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo "flight failures" is twice written
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/ray/rpc/rpc_chaos.cc
Outdated
| // Key is the RPC call name and value is a four part colon separated structure. It | ||
| // contains the max number of failures to inject + probability of req failure + | ||
| // probability of reply failure. | ||
| // probability of reply failure + probability of in flight failure. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super nit: should be in-flight since it's an adjective
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed, grammar 💪🏽
src/ray/rpc/rpc_chaos.cc
Outdated
| failable.num_remaining_failures--; | ||
| return RpcFailure::Response; | ||
| } | ||
| if (random_number <= failable.req_failure_prob + failable.resp_failure_prob + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could just be <= 100.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or this could just be structured an if/else if/else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
restructured to if else if else, but why <=100? all 3 numbers added up could still be <100, like 10:10:10?
Signed-off-by: dayshah <dhyey2019@gmail.com>
|
|
||
|
|
||
| @pytest.mark.parametrize("deterministic_failure", ["request", "response"]) | ||
| @pytest.mark.parametrize("deterministic_failure", ["request", "response", "in_flight"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems this comment is not addressed
…#58512) Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
…#58512) Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: YK <1811651+ykdojo@users.noreply.github.com>
…#58512) Signed-off-by: dayshah <dhyey2019@gmail.com>
Description
Our current RPC failure injection framework accounts for 2 types of failures.
It doesn't account for the case that
3. The request makes it to the server, and the callback is called with Status::Unavailable while the server is still processing the request. In these situations the client could retry and the server could get the retry while still processing the old request.
This adds an option to inject failures for that third case. In the more realistic iptable chaos tests, we found that this does happen.
This will help deterministically find issues like the one fixed here #58265