Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: overhaul of server streaming retries #1653

Merged
merged 106 commits into from
Sep 24, 2024
Merged

fix: overhaul of server streaming retries #1653

merged 106 commits into from
Sep 24, 2024

Conversation

leahecole
Copy link
Contributor

@leahecole leahecole commented Sep 9, 2024

This is a huge PR, sorry. I had hoped to break it into smaller chunks, but it got to a point where I had to rip the bandaid off and just go for it. There will be a followup PR that adds some more tests I have to the test application regarding buffers and pipelines.

Code changes

  • Removes streamingRetryRequest and its unit tests which was a cherrypicked version of retry request

gax/src/gax.ts

  • Adds a check to convertRetryOptions to ensure that if a user doesn't pass maxRetries or a timeout when using retryRequestOptions we have a default value for timeout if neither timeout nor maxRetries are set

gax/src/streamingCalls/streaming.ts

  • Adds the same phrasing to the timeout/max retries check to surface the underlying error that fix: expose underlying error with timeouts or retries #1650 did
  • adds the edge case checks for timeout or maxRetries being equal to zero (and therefore falsy) to throwIfMaxRetriesOrTotalTimeoutExceeded
  • Overhauls retries - adding a new function called newStreamingRetryRequest (I'm not attached to the name)
  • Before, we handled the first try of a server streaming call one way, and all subsequent tries another way. That wasn't great. This handles all tries the same way
  • The actual request is made with an inner function called newMakeRequest - this allows us to mimic what retry-request did well, which was preserve the stream that the end user connects to
  • Separate the stream into two streams
    • retryStream - this is the stream that is ultimately returned to the client library - users can pipe the output wherever, pumpify it, put it in a pipeline, and know that it won't be destroyed upon a retry like it erroneously was before
    • requestStream - this is the stream that is the request to the server. requestStream will be destroyed and recreated upon retry
  • Unlike retry-request, Instead of piping all events between streams or trying to use pipeline internally, we have event handlers for the specific events we care about and logic to ensure they appear in the order that users expect from grpcjs calls
    • there is some boolean logic around the order of the 'status' and the 'end' events - this is explained in the comments, but if that comment doesn't help, let me know and I'll redo it to make sure it's clear

Removes

  • prevDeadline parameter from streamProxy. This was added in the original implementation and it's not used in the current implementation. It was never adopted by client libraries - they never would have been opted into using these retries.
  • Retries parameter from streamProxy. This was added in the original implementation and it's not used in the current implementation. It was never adopted by client libraries - they never would have been opted into using these retries.

Test application

  • renamed a client to be more helpful and removed a set of options that was a duplicate
  • Modifies throwsIfMaxRetriesorTotalTimeoutexceeded to have an "originalError" parameter in order to surface that to the user
  • Removes testServerStreamingRetrieswithRetryRequestOptionsErrorsOnBadResumptionStrategy - This could be controversial and I'm open to a different way of handling this. I got rid of this, because I think that checking for a "Bad resumption strategy" is actually asking typescript to do something it can't really do - the only way I can really think to do that is to have it validate that the output of the resumption function is of "RequestType", but that's not something TS/JS lets you do. Instead, I think it is fair to rely on the fact that we define the type of getResumptionRequestFn so that will catch someone at least trying to pass something in that's not a function, and if they want to pass in a resumption strategy that's a crappy function, well, we can't anticipate every kind of crappy function.
  • Goes back and adds Promises back to test application - pretty much reverting chore: release main #1649 - if these aren't here, the server can stop before tests have finished. My bad. If there's another more modern way to do this with async/await lmk, I just wanted to focus on stopping the issue for now and this worked.
  • Adds tests that use both pumpify AND pipeline - two different ways that client libraries may be piping our streams to other streams, to ensure that behavior is consistent regardless of what's being used. This does mean adding the pumpify dependency to the test application. If was hesitant to add this dependency, but I know at least one client library that uses it, so I would prefer to keep it in if we can

Unit tests

  • I did a lot of adding setImmediates throughout various tests just to make sure we were more accurately mocking expected behavior from grpc - some are listed here, others are not
  • did some updating of spies throughout to change the functions they spy on to be ones in this overhaul
  • added a test to make sure cancellation happens properly
  • emit response when stream received metadata event - added a setImmediate to the s.push(null) for more control over the order in which the events are sent
  • add a test for erroring if neither maxRetries nor totalTimeout are defined
  • Emit transient error on second or later error when new retries is enabled - Modified mocking of events to more accurately represent the order in which events are received - ensures a status event is always there, and that events are sent in the event loop when they should be
  • Emit error retry twice and succeed with shouldRetryFn I adjusted this test behavior too - in the ending case it wasn't emitting a status - grpc streams ALWAYS emit a status. Also, when I only checked for thing in "end" and didn't also handle the .on data, things didn't pass, so I handled the data and magically it passes? Bizarre.
  • emit error and retry once - this set immediate behavior was kind of weird before and I adjusted it to reflect a more "normal" run - (the s.push used to be outside of the switch case which would result in it being called both times - for whatever reason, it wasn't called twice in the old implementation)
  • added a test to make sure we are exceeding total timeout and surfacing the underlying error
  • updated max retries test to make sure we are surfacing underlying error

Things I'm not sure about

  • I do some typecasting that isn't ideal but isn't horrible - would love insights on better ways to handle it. retryStream needs to be a CancellableStream so I do that by making a PassThrough, casting it to be a CancellableStream, and making sure it has a cancel function.
  • Function names - Be blunt - let me know if I can rename things better to be more readable

assert.strictEqual(error.code, 4);
assert.strictEqual(
error.message,
'Total timeout of API exceeded 2 milliseconds retrying error Error: 6 ALREADY_EXISTS: 6 before any response was received.'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible I need to update this to be less strict - I have also seen it error on the second error in the sequence

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a try catch here because I think it can be either the first or the second error code

"google-gax": "file:./google-gax.tgz",
"google-gax": "file:google-gax.tgz",
"protobufjs": "^7.0.0",
"pumpify": "^2.0.1",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are your thoughts on using Readable.from rather than pumpify?

Example:

async function* concatStreams(streams: ReadableStream[]) {
  for (const stream of streams) {
    yield* stream;
  }
}

const readable = Readable.from(concatStreams());

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question- generally I prefer to use pipeline instead of pumpify, but I explicitly want pumpify as part of the test application because I know it's used in at least one of the client libraries and was related to the issue that this PR is fixing. I will add comments around that though, and if you want to see a test including Readable.from as an option, I can do that too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I have reorganized all of the pumpify and all of the pipeline ones (I know, the organization of test application is a lil chaotic at this point) and I added comments about why they're both there. Lmk if you want me to add Readable.from tests also

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we'd want to recommend whoever's using pumpify off of it, as itself is pretty flaky. That would likely be a semver-major issue though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danielbankhead let me know what the preferred action within gax is - I can open an issue in the affected client library to stop using pumpify

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created an issue to migrate the stakeholder off of pumpify, and I'm going to merge this for now. If you want to create an issue to have it removed later please feel free to!

"google-gax": "file:./google-gax.tgz",
"google-gax": "file:google-gax.tgz",
"protobufjs": "^7.0.0",
"pumpify": "^2.0.1",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we'd want to recommend whoever's using pumpify off of it, as itself is pretty flaky. That would likely be a semver-major issue though.

gax/src/streamingCalls/streaming.ts Outdated Show resolved Hide resolved
gax/src/streamingCalls/streaming.ts Outdated Show resolved Hide resolved
gax/src/streamingCalls/streaming.ts Outdated Show resolved Hide resolved
@leahecole
Copy link
Contributor Author

leahecole commented Sep 23, 2024

#1657 blocking tests

@leahecole leahecole merged commit 629f4c4 into main Sep 24, 2024
21 checks passed
@leahecole leahecole deleted the cbt2 branch September 24, 2024 20:57
@release-please release-please bot mentioned this pull request Sep 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size: xl Pull request size is extra large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants