dancingactor commented Nov 21, 2025

Description

This PR refactors the testing_rpc_failure configuration to use JSON instead of a custom delimited string format.

As more options were being added to the RPC chaos testing framework, the old string format (e.g., method1=3:12:12:50) became difficult to read and maintain. By switching to JSON, the configuration is now self-describing, much more readable, and easier to extend in the future.

Related issues

Closes #58686

Additional information

Before:

"*=-1:25:50:10:2:3:1"

After:

{
  "*": {
    "num_failures": -1,
    "req_failure_prob": 25,
    "resp_failure_prob": 50,
    "in_flight_failure_prob": 10,
    "num_lower_bound_req_failures": 2,
    "num_lower_bound_resp_failures": 3,
    "num_lower_bound_in_flight_failures": 1
  }
}
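
To make the mapping between the two formats concrete, here is a minimal Python sketch that converts one legacy entry into the JSON form. The field order is inferred from the Before/After pair above; `legacy_to_json` and `FIELDS` are hypothetical names for illustration and are not part of this PR.

```python
import json

# Field order inferred from the Before/After example in this PR description.
FIELDS = [
    "num_failures",
    "req_failure_prob",
    "resp_failure_prob",
    "in_flight_failure_prob",
    "num_lower_bound_req_failures",
    "num_lower_bound_resp_failures",
    "num_lower_bound_in_flight_failures",
]


def legacy_to_json(entry: str) -> str:
    """Convert one legacy 'method=v1:v2:...' entry to JSON (hypothetical helper)."""
    method, _, values = entry.partition("=")
    if not method or not values:
        raise ValueError(f"expected 'method=v1:v2:...', got {entry!r}")
    parts = values.split(":")
    if len(parts) > len(FIELDS):
        raise ValueError(f"too many fields in {entry!r}")
    # zip stops at the shorter sequence, so trailing fields omitted from the
    # legacy string are simply left out of the JSON object.
    return json.dumps({method: dict(zip(FIELDS, map(int, parts)))}, indent=2)


print(legacy_to_json("*=-1:25:50:10:2:3:1"))
```

Shorter legacy strings such as method1=3:12:12:50 yield only the first four keys under this sketch.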

@dancingactor dancingactor requested a review from a team as a code owner November 21, 2025 08:56
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a great improvement, refactoring the testing_rpc_failure configuration from a custom string format to JSON. This significantly enhances readability, maintainability, and extensibility of the RPC chaos testing framework. The changes in both the Python test files and the C++ core logic are well-executed.

I have one suggestion regarding code duplication in the Python test files. A helper function, create_failure_json, has been introduced in multiple test files. Consolidating this into a shared test utility would further improve the codebase.

Comment on lines +19 to +30
def create_failure_json(method, num_failures, failure_str):
    parts = failure_str.split(":")
    return json.dumps(
        {
            method: {
                "num_failures": num_failures,
                "req_failure_prob": int(parts[0]),
                "resp_failure_prob": int(parts[1]),
                "in_flight_failure_prob": int(parts[2]),
            }
        }
    )
Contributor

medium

This helper function create_failure_json is very useful for creating the RPC failure configuration. I noticed it's also defined in test_object_manager_fault_tolerance.py and test_raylet_fault_tolerance.py. To avoid code duplication and improve maintainability, would you consider moving this function to a shared test utility module, such as ray/_private/test_utils.py? This would allow all tests to import and use a single implementation.
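
As a rough sketch of what a single shared implementation could look like, the version below keeps the same signature as the helper in the diff and adds a length check on the probability string. The method name `ExampleService.ExampleMethod` is made up for illustration, and the module location is only the reviewer's suggestion, not something confirmed in this thread.

```python
import json
import os


def create_failure_json(method: str, num_failures: int, failure_str: str) -> str:
    """Build a testing_rpc_failure JSON config from 'req:resp:in_flight' probabilities."""
    parts = failure_str.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected 'req:resp:in_flight', got {failure_str!r}")
    req, resp, in_flight = (int(p) for p in parts)
    return json.dumps(
        {
            method: {
                "num_failures": num_failures,
                "req_failure_prob": req,
                "resp_failure_prob": resp,
                "in_flight_failure_prob": in_flight,
            }
        }
    )


# Hypothetical usage: build an environment for a child process without
# mutating os.environ in the current test process.
config = create_failure_json("ExampleService.ExampleMethod", 3, "25:50:10")
env = dict(os.environ, RAY_testing_rpc_failure=config)
```

Each test file would then import the helper instead of redefining it.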

ray-gardener bot added the core (Issues that should be addressed in Ray Core) and community-contribution (Contributed by the community) labels Nov 21, 2025
@edoakes
Collaborator

edoakes commented Nov 21, 2025

@dayshah @Sparks0219

@dancingactor
Contributor Author

dancingactor commented Nov 21, 2025

@dayshah Sorry for the delay. I am currently building Ray from source to test this change. Would you mind waiting a moment?

@dancingactor dancingactor force-pushed the RPC branch 5 times, most recently from 9caf2d3 to 20fd244 Compare November 25, 2025 16:11
@dancingactor
Contributor Author

Hi @dayshah,
I have run bazel test //src/ray/rpc/tests:rpc_chaos_test and pytest on all relevant Python test files, and all tests passed. I identified the affected Python test files by searching for places where the test code explicitly sets the RAY_testing_rpc_failure environment variable. Please let me know if any further testing is required.

bazel test

  • bazel test //src/ray/rpc/tests:rpc_chaos_test
    //src/ray/rpc/tests:rpc_chaos_test                                       PASSED in 0.3s
    

pytest

  • test_raylet_fault_tolerance.py
    Results (2840.98s):
        11 passed
    
  • test_gcs_utils.py
    Results (468.07s):
        7 passed
        3 skipped
    
  • test_gcs_fault_tolerance.py::test_mark_job_finished_rpc_retry_and_idempotency
    Results (62.05s):
        1 passed
    
  • test_core_worker_fault_tolerance.py
    Results (1348.11s):
        19 passed
    
  • test_object_manager_fault_tolerance.py
    Results (205.23s):
        3 passed
    
  • test_actor_lineage_reconstruction.py
    Results (194.18s):
        3 passed
    
  • test_streaming_generator_4.py
    Results (131.06s):
        15 passed
    
  • test_job_manager.py
    Results (351.38s):
        67 passed
    

@jjyao
Collaborator

jjyao commented Dec 4, 2025


[2025-11-26T06:17:15Z] @@ -92,9 +92,9 @@ ray_cc_library(
--
[2025-11-26T06:17:15Z]      visibility = ["//visibility:public"],
[2025-11-26T06:17:15Z]      deps = [
[2025-11-26T06:17:15Z]          "//src/ray/common:ray_config",
[2025-11-26T06:17:15Z] -        "@nlohmann_json",
[2025-11-26T06:17:15Z]          "@com_google_absl//absl/container:flat_hash_map",
[2025-11-26T06:17:15Z]          "@com_google_absl//absl/synchronization",
[2025-11-26T06:17:15Z] +        "@nlohmann_json",
[2025-11-26T06:17:15Z]      ],
[2025-11-26T06:17:15Z]  )

lint failure

jjyao added the go label (add ONLY when ready to merge, run all tests) Dec 4, 2025
@dancingactor
Contributor Author

dancingactor commented Dec 5, 2025

Fixed lint failure, thanks!

Collaborator

@jjyao jjyao left a comment

LG. Thanks for the contribution!

Comment on lines 88 to 94
value.value("num_failures", 0L),
value.value("req_failure_prob", 0UL),
value.value("resp_failure_prob", 0UL),
value.value("in_flight_failure_prob", 0UL),
value.value("num_lower_bound_req_failures", 0UL),
value.value("num_lower_bound_resp_failures", 0UL),
value.value("num_lower_bound_in_flight_failures", 0UL),
Collaborator

Can we also LOG(FATAL) if the user specifies an unknown key, e.g. a typo?
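
The idea behind this request can be illustrated with a small Python sketch. The real check lives in the C++ parser and is not shown in this thread, so this is only an illustration: `parse_failure_config` and `DEFAULTS` are hypothetical names, the keys and zero defaults are taken from the value.value(...) calls quoted above, and a ValueError stands in for the fatal log.

```python
import json

# Keys and defaults mirror the value.value(...) calls quoted in the review.
DEFAULTS = {
    "num_failures": 0,
    "req_failure_prob": 0,
    "resp_failure_prob": 0,
    "in_flight_failure_prob": 0,
    "num_lower_bound_req_failures": 0,
    "num_lower_bound_resp_failures": 0,
    "num_lower_bound_in_flight_failures": 0,
}


def parse_failure_config(raw: str) -> dict:
    """Parse the JSON config, fill in defaults, and reject unknown keys."""
    parsed = json.loads(raw)
    result = {}
    for method, options in parsed.items():
        unknown = sorted(set(options) - set(DEFAULTS))
        if unknown:
            # Stand-in for the fatal log requested in the review.
            raise ValueError(f"unknown option(s) for {method!r}: {unknown}")
        # Options the user omitted fall back to their defaults.
        result[method] = {**DEFAULTS, **options}
    return result
```

Without such a check, a misspelled key like "req_failure_probs" would be silently ignored and the default of 0 would apply, which is hard to notice in a chaos test.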

Contributor Author

@dancingactor dancingactor Dec 6, 2025

Fixed in e555826. The test passes:

  • bazel test //src/ray/rpc/tests:rpc_chaos_test
INFO: Build completed successfully, 9 total actions
//src/ray/rpc/tests:rpc_chaos_test                                       PASSED in 0.3s

Executed 1 out of 1 test: 1 test passes.

@dancingactor dancingactor force-pushed the RPC branch 3 times, most recently from e555826 to 923994a Compare December 6, 2025 04:05
@dancingactor dancingactor requested a review from jjyao December 6, 2025 04:05
@jjyao
Collaborator

jjyao commented Dec 6, 2025

lint failures

@dancingactor dancingactor force-pushed the RPC branch 3 times, most recently from a7e1592 to 993b2a2 Compare December 6, 2025 07:24
Signed-off-by: dancingactor <s990346@gmail.com>
- modified variable names

Signed-off-by: dancingactor <s990346@gmail.com>
Signed-off-by: dancingactor <s990346@gmail.com>
Labels

community-contribution: Contributed by the community
core: Issues that should be addressed in Ray Core
go: add ONLY when ready to merge, run all tests

Development

Successfully merging this pull request may close these issues.

[core] Make RPC Chaos Configurations More Readable (JSON)

4 participants