Improve rpc_soak and channel_soak test to cover streaming #11687

zbilun · 2024-11-13T06:21:36Z

PTAL @apolcyn
The clientCompressedStreaming and serverCompressedStreaming cases are not included in this version of the PR but will be added if necessary.
In this design, the large_unary soak test case will continue to use the blocking API, while the streaming test cases will use the non-blocking API.
A new test type parameter has been introduced to the performSoakTest function.
Please review the POC code, and we can discuss the design further, thanks!

apolcyn · 2024-11-22T18:37:30Z

interop-testing/src/main/java/io/grpc/testing/integration/AbstractInteropTest.java

+    return new SoakIterationResult(TimeUnit.NANOSECONDS.toMillis(elapsedNs), status);
+  }
+
+  private SoakIterationResult performOneSoakIterationPingPong(


I'm not sure how much benefit there will really be to running the different RPC types (client streaming, server streaming, etc.) in a loop.

The code paths and behaviors exercised are going to be very similar to the unary based soak tests we already have.

Here is a straw man idea for what I think might be useful here:

Use a long-lived stream:

Start the stream once (per thread). On each soak iteration, send one message and receive one message.

If/when the stream fails, indicate it in the log, then restart the stream and continue on the new one.

This would provide us a new dimension of test coverage that we don't currently have much of (long lived RPCs).
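A minimal sketch of the long-lived-stream idea above, not code from this PR. To stay runnable without gRPC dependencies it uses a hypothetical `PingPongStream` interface in place of a real bidirectional stream; every class, method, and field name here is an assumption made for illustration.

```java
import java.util.function.Supplier;

/** Sketch of a long-lived-stream soak loop. All names are hypothetical. */
final class LongLivedStreamSoak {

  /** Stand-in for a bidirectional stream: send one message, wait for one reply. */
  interface PingPongStream {
    String pingPong(String request) throws Exception;
  }

  /** Per-thread counters collected over the whole run. */
  static final class Result {
    int iterations;
    int failures;
    int streamsStarted;
  }

  /** Runs soakIterations ping-pongs on one stream, restarting it after any failure. */
  static Result run(Supplier<PingPongStream> streamFactory, int soakIterations) {
    Result result = new Result();
    PingPongStream stream = streamFactory.get(); // started once, not per iteration
    result.streamsStarted++;
    for (int i = 0; i < soakIterations; i++) {
      result.iterations++;
      try {
        stream.pingPong("ping-" + i);
      } catch (Exception e) {
        // Record the failure in the log, then continue on a fresh stream.
        result.failures++;
        System.err.println("soak iteration " + i + " failed: " + e + "; restarting stream");
        stream = streamFactory.get();
        result.streamsStarted++;
      }
    }
    return result;
  }

  /** Test helper: each stream it creates fails on its n-th message. */
  static Supplier<PingPongStream> failingOnNth(int n) {
    return () -> new PingPongStream() {
      private int sent = 0;

      @Override
      public String pingPong(String request) throws Exception {
        if (++sent == n) {
          throw new Exception("simulated stream failure");
        }
        return "pong for " + request;
      }
    };
  }

  public static void main(String[] args) {
    Result r = run(failingOnNth(5), 12);
    System.out.println(r.iterations + " iterations, " + r.failures
        + " failures, " + r.streamsStarted + " streams started");
  }
}
```

In a real client the factory would presumably open a new streaming call on the existing channel; whether a restart should also recreate the channel is a design question this sketch does not settle.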

zbilun · Nov 26, 2024

Alex, thanks for the suggestion! I have discussed this with Feng. He said, “For completeness, we should cover all these variations of RPCs, but I don’t see how sending a message out and back can construct a long-lived stream. Usually, long-lived streams last for hours, and covering long-lived streams is not part of the plan.”

So, seems he definitely wants to ensure we cover all RPC types. My thought is we can handle the long-lived RPCs in a separate set of tests, not as part of the current soak tests. We can definitely consider this in more detail and plan it out for a future PR. Let me know your thoughts, and happy to discuss further!

apolcyn · Nov 26, 2024

“For completeness, we should cover all these variations of RPCs, but I don’t see how sending a message out and back can construct a long-lived stream. Usually, long-lived streams last for hours, and covering long-lived streams is not part of the plan.”

My idea here is not to send a message out and back once. Instead, it's to keep sending messages out and back on the same stream for as long as possible. By setting soak iterations and soak_min_time_ms_between_rpcs, you can make these clients do a fixed QPS for a fixed time (e.g. 10 QPS for 1 hour).
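As a rough illustration of the arithmetic in that example: for a single thread, 10 QPS for 1 hour works out to 36,000 iterations with at least 100 ms between RPC starts. The sketch below is an assumption about how such pacing could look, not the interop client's actual loop; `SoakPacing` and its methods are hypothetical names, and how iterations are shared across multiple threads is not covered.

```java
import java.util.concurrent.TimeUnit;

/** Sketch of fixed-rate pacing for one soak thread. All names are hypothetical. */
final class SoakPacing {

  /** Iterations needed to sustain the target QPS for the given duration. */
  static long iterationsFor(int qps, long durationSeconds) {
    return (long) qps * durationSeconds;
  }

  /** Minimum gap between RPC starts, in milliseconds, for the target QPS. */
  static long minTimeMsBetweenRpcs(int qps) {
    return 1000L / qps;
  }

  /**
   * Runs rpc up to `iterations` times, padding each iteration with sleep so
   * it lasts at least minGapMs. Returns the number of iterations completed;
   * stops early if the thread is interrupted.
   */
  static long runPaced(Runnable rpc, long iterations, long minGapMs) {
    long completed = 0;
    for (long i = 0; i < iterations; i++) {
      long startNs = System.nanoTime();
      rpc.run();
      completed++;
      long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNs);
      long remainingMs = minGapMs - elapsedMs;
      if (remainingMs > 0) {
        try {
          Thread.sleep(remainingMs);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // preserve the interrupt for callers
          break;
        }
      }
    }
    return completed;
  }

  public static void main(String[] args) {
    int qps = 10;
    long iterations = iterationsFor(qps, 3600); // 1 hour
    long gapMs = minTimeMsBetweenRpcs(qps);
    System.out.println("soak iterations: " + iterations + ", min ms between RPCs: " + gapMs);
    runPaced(() -> { }, 5, gapMs); // short demo run with a no-op "RPC"
  }
}
```

Note that the gap only caps the rate: if an RPC takes longer than the gap, the achieved QPS falls below the target.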

“So, seems he definitely wants to ensure we cover all RPC types.”

Our integration test matrix is already huge, and these tests are expensive to maintain generally speaking. I'm not excited about adding this for the sake of completeness, unless there's a very strong reason that I'm missing. It seems like these tests will overlap a lot with the existing unary based tests, so it doesn't seem like there would be much bang for the buck with these additional tests.

apolcyn · Dec 11, 2024

Circling back on this per offline chats.

If we just want to run various RPC types in multiple threads in a loop, then I think the StressTestClient.java is already geared towards that.

That can be configured to run any of these tests:

grpc-java/interop-testing/src/main/java/io/grpc/testing/integration/StressTestClient.java

Line 525 in f1109e4

private void runTestCase(Tester tester, TestCases testCase) throws Exception {

We actually have this stress test client already running in our integration test dashboards for Go and Java (see internal bug b/298484219 for context), but only for empty_unary RPCs. We could extend it to run other types of RPCs (some of those test cases, such as the ones involving cancellation, may actually be interesting).

Also note there are currently some shortcomings of the stress test compared to the interop soak test:

1) The stress test has no error tolerance (note how it will abort a thread upon a single RPC failure).

2) The stress test has no knob to control QPS, i.e. each thread performs RPCs in an uncontrolled closed loop.

3) Unlike the interop soak test, there is no way to gather statistics about latency, errors, etc. (the interop soak test logs all results in a parseable format that can be analyzed offline for these things).

I think 1) is the highest priority thing to fix.
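One possible shape for fixing 1), sketched under assumptions rather than taken from StressTestClient.java: the worker counts failures and keeps going, and the caller compares the total against a failure budget at the end. `TolerantStressWorker`, `Stats`, and the `maxFailures` threshold are hypothetical names.

```java
import java.util.concurrent.Callable;

/** Sketch of a failure-tolerant worker loop. All names are hypothetical. */
final class TolerantStressWorker {

  /** Counters for one worker thread. */
  static final class Stats {
    long attempts;
    long failures;

    /** True if the run stayed within the allowed number of failures. */
    boolean withinBudget(long maxFailures) {
      return failures <= maxFailures;
    }
  }

  /** Runs rpc for the given number of iterations, recording failures instead of aborting. */
  static Stats run(Callable<Void> rpc, long iterations) {
    Stats stats = new Stats();
    for (long i = 0; i < iterations; i++) {
      stats.attempts++;
      try {
        rpc.call();
      } catch (Exception e) {
        stats.failures++;
        System.err.println("iteration " + i + " failed: " + e);
      }
    }
    return stats;
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Simulated RPC that fails on every third call.
    Stats stats = run(() -> {
      if (++calls[0] % 3 == 0) {
        throw new IllegalStateException("simulated RPC failure");
      }
      return null;
    }, 9);
    System.out.println("attempts=" + stats.attempts + " failures=" + stats.failures
        + " withinBudget(3)=" + stats.withinBudget(3));
  }
}
```

Logging one line per iteration in a fixed format, as the soak test does, would be a natural next step toward 3), but that is outside this sketch.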

zbilun · Dec 13, 2024

I have talked with Feng and he has agreed with this approach. I will go ahead and work on it. Thanks!

zbilun added 9 commits October 30, 2024 09:08

Add concurrency condition to the soak test using exisiting blocking api

fedbd64

Modify the influenced files

885e109

Address code review comments from Alex: improve soak test logic by using soak_num_threads Flag

58e2ebd

Address code review comments from Alex: modify totalFailures handling, channel creation logic, and refactor thread body for performSoakTest

56883b5

Removed useless file

7a9f3a9

Modify the channel implementation for rpc_soak test.

84dac62

Refactor soak test to use function callback for channel management and simplify thread result aggregation

512ad84

Refactor performSoakTest and related functions to propagate InterruptedException. Update the ThreadResults data type.

3d8c555

Improve rpc_soak and channel_soak test to cover streaming

fbba2b8

DNVindhya requested review from apolcyn and DNVindhya November 21, 2024 18:47

zbilun added 4 commits November 21, 2024 12:13

Merge branch 'master' into streaming

5d10a34

Fix conflict resolution issues

f8ca6cd

Fix styles issues

a3ce4e8

Fix styles issues

a917099

apolcyn reviewed Nov 22, 2024


zbilun closed this Jan 2, 2025

github-actions bot locked as resolved and limited conversation to collaborators Apr 3, 2025


zbilun commented Nov 13, 2024 (edited)

apolcyn Nov 22, 2024 (edited)

zbilun Nov 26, 2024

apolcyn Nov 26, 2024

apolcyn Dec 11, 2024

zbilun Dec 13, 2024 (edited)
