Comprehensively retry event-poll failures #1063

Merged (29 commits into zulip:main), Nov 20, 2024
Conversation

gnprice (Member) commented Nov 16, 2024

This is a bit of a long branch, as I ended up going into a variety of related cleanups and other prep work.

Fixes: #563

Selected commit messages:


store: On rate-limit error in poll, explicitly back off and retry

This covers a particular case of #946.

This commit is actually NFC as far as the interaction with the server
goes; its only change in behavior is that we treat the error as
"expected", and therefore skip reporting it to the user unless it
follows a string of other errors.

But then this also sets us up so that these rate-limit errors will
continue to be handled with backoff-retry when we change the handling
of unexpected errors, coming next.
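
The approach this commit message describes can be sketched roughly as follows. This is illustrative only: the names `getEvents`, `handleEvents`, `backoffMachine`, and `ZulipApiException` stand in for the codebase's actual types and helpers, so treat it as Dart-shaped pseudocode rather than the real implementation.

```dart
// Sketch of the poll loop's rate-limit handling described above.
while (true) {
  try {
    final events = await getEvents(queueId: queueId, lastEventId: lastEventId);
    await handleEvents(events);
  } on ZulipApiException catch (e) {
    if (e.code == 'RATE_LIMIT_HIT') {
      // An "expected" error: skip reporting it to the user (unless it
      // follows a string of other errors); just back off and retry.
      await backoffMachine.wait();
      continue;
    }
    rethrow;
  }
}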


store: Handle unexpected event-poll errors by reloading store, with backoff

This fixes #563. We'll follow up with a few more commits that give
more-informative errors in some cases, or change the handling of others
from retrying getEvents to reloading the data from scratch.

The if-disposed and store.isLoading lines have no effect in the case
of a BAD_EVENT_QUEUE_ID error, because those can only come from the
getEvents request, and then the inner catch block will have already
taken the same steps.

Fixes: #563


store: Better error message on handleEvent failure


store: On non-transient request errors, reload rather than retry

This hopefully gives us a greater chance of getting past whatever
the underlying problem is, by resetting more of our state.

@gnprice gnprice added the maintainer review PR ready for review by Zulip maintainers label Nov 16, 2024
PIG208 (Member) commented Nov 18, 2024

Thanks for cleaning this up! Looks good to me. Left some comments after reading through the PR.

@@ -81,6 +81,7 @@ class UnexpectedEvent extends Event {
final Map<String, dynamic> json;

@override
@JsonKey(includeToJson: true)
PIG208 (Member):

Does this have any effect on non-JsonSerializable classes like this one? We don't use generated functions for unexpected events.

(It is wrong to use the generated functions for UnexpectedEvent:

Map<String, dynamic> _$UnexpectedEventToJson(UnexpectedEvent instance) =>
    <String, dynamic>{
      'id': instance.id,
      'json': instance.json,
    };

)

gnprice (Member, Author):

Ah indeed, thanks — I'll cut this line. Yeah, it has no effect.

@@ -732,7 +732,12 @@ void main() {
});
}

- test('retries on NetworkException', () {
+ test('retries on NetworkException from SocketException', () {
// We skip reporting errors on these; check we retry them all the same.
PIG208 (Member):

I see that later we add a test to verify that we indeed ignore such errors, maintaining a sort of invariant that each error we retry or reload on gets a corresponding 'report error' test. But there is a visual gap between each pair of them. Is it feasible to place them next to each other?

gnprice (Member, Author):

Hmm, this is meant to coincide with one of the error types for which, in main, we already test that we skip reporting:

      test('ignore boring errors', () => awaitFakeAsync((async) async {
        await prepare();

        for (int i = 0; i < UpdateMachine.transientFailureCountNotifyThreshold; i++) {
          connection.prepare(exception: const SocketException('failed'));
          pollAndFail(async);
          check(takeLastReportedError()).isNull();

PIG208 (Member):

Yeah, this is quite a nit. I was referring to the visual gap between this test and

      test('ignore NetworkException from SocketException', () {
        checkNotReported(prepareNetworkExceptionSocketException);
      });

or more in general, the gap between the `report error` group and the tests before it. We have 10 such pairs:

    test('reloads on unexpected error within loop', () {
      checkReload(prepareUnexpectedLoopError);
    });

	// ...

    group('report error', () {

	  // ...

      test('report unexpected error within loop', () {
        checkReported(prepareUnexpectedLoopError);
      });

I'm not sure if it would be better to group the tests for the same error close to each other, or keep it this way. It does seem that such tests tend to co-occur.

gnprice (Member, Author):

I see, yeah.

There's a fundamental trade-off to make: these test cases are really a matrix, of (which type of error) x (what to do vs. what to report). So we can group the different checks on a given error type together, but then the different error types get spread farther apart, or vice versa.

I went for this arrangement — different error types together for each kind of check — because I think when considering what we want to do (or report) for a given error type it's helpful to be able to compare across other error types. With this arrangement, all the reload/retry tests fit on the screen at once, or a couple of screenfuls on a smaller display.

PIG208 (Member):

That's a great way to put it! Thanks for the explanation.

if (isUnexpected) {
// We don't know the cause of the failure; it might well keep happening.
// Avoid creating a retry storm.
await _unexpectedErrorBackoffMachine.wait();
PIG208 (Member):

I see that you mentioned

The if-disposed and store.isLoading lines have no effect in the case
of a BAD_EVENT_QUEUE_ID error, because those can only come from the
getEvents request, and then the inner catch block will have already
taken the same steps.

and for errors other than BAD_EVENT_QUEUE_ID where isUnexpected is true, do we want to do an if (_disposed) check right after this?

gnprice (Member, Author):

Yeah, good catch, thanks. This is an await, so it should be followed by an if-disposed check.
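
Concretely, the fix under discussion amounts to extending the quoted lines along these lines (a sketch; the `_disposed` check and early return follow the pattern this thread describes, not a verbatim excerpt of the final code):

```dart
if (isUnexpected) {
  // We don't know the cause of the failure; it might well keep happening.
  // Avoid creating a retry storm.
  await _unexpectedErrorBackoffMachine.wait();
  if (_disposed) return; // the store may have been disposed during the await
}
```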


if (e is! ApiRequestException) {
// Some unexpected error, outside even making the HTTP request.
// Definitely a bug in our code.
PIG208 (Member):

Looks like this check also enables the exhaustive switch later. That's pretty neat!

gnprice (Member, Author):

Yep!

Ideally I'd have liked to express this as one more case in the switch, so that the rethrow cases were all next to each other. But the patterns minilanguage doesn't seem to have a way to express that. (For example if it had negations of patterns, that would do it.) So this works instead.
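
For illustration, the shape of that arrangement looks like this. This is a sketch only: the subtype names other than `ZulipApiException` and `NetworkException`, and the case bodies, are placeholders rather than the real hierarchy.

```dart
if (e is! ApiRequestException) {
  // Some unexpected error, outside even making the HTTP request.
  // Definitely a bug in our code.
  rethrow;
}
// `e` is now statically an ApiRequestException; because that type is
// (transitively) sealed, this switch can be exhaustive with no default.
switch (e) {
  case ZulipApiException():                handleApiError(e);
  case NetworkException():                 retryWithBackoff();
  case MalformedServerResponseException(): reloadStore();
}
```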

gnprice (Member, Author) commented Nov 19, 2024

Thanks for the review! Pushed a revision with those fixes. See also the question at #1063 (comment).

PIG208 (Member) left a review:

Thanks for the update! This looks good to me overall. Left a new comment on something I missed last time and a follow-up to #1063 (comment).

// (The actual HTTP status should be 429, but that seems undocumented.)
connection.prepare(httpStatus: 400, json: {
'result': 'error', 'code': 'RATE_LIMIT_HIT',
'msg': 'API usage exceeded rate limit,',
PIG208 (Member):

nit:

Suggested change:

- 'msg': 'API usage exceeded rate limit,',
+ 'msg': 'API usage exceeded rate limit',

From the API doc linked here.

{
    "code": "RATE_LIMIT_HIT",
    "msg": "API usage exceeded rate limit",
    "result": "error",
    "retry-after": 28.706807374954224
}

gnprice (Member, Author):

Indeed that's a typo, thanks 🙂

@gnprice gnprice force-pushed the pr-retry branch 2 times, most recently from c9ebc27 to 84f576c, on November 20, 2024 at 21:51
PIG208 (Member) commented Nov 20, 2024

The update looks good to me. Thanks!

This makes ApiRequestException transitively sealed, so that we can
switch on it exhaustively.

These `JsonKey(includeToJson: true)` annotations are needed so that
the properties appear in the output of the generated toJson
implementations, like they do in the real JSON from the server.
We had the annotations on many of this kind of getter, but not all.

These types appear in responses, not requests, so the actual app
never calls these toJson methods.  But having accurate toJson methods
is handy in tests.

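To illustrate the effect of that annotation (a hypothetical class, not from this codebase):

```dart
import 'package:json_annotation/json_annotation.dart';

part 'example.g.dart';

@JsonSerializable()
class ExampleEvent {
  ExampleEvent({required this.id});

  final int id;

  // json_serializable skips getters by default; this annotation makes
  // the generated toJson include `type`, matching the server's real JSON.
  @JsonKey(includeToJson: true)
  String get type => 'example';

  Map<String, dynamic> toJson() => _$ExampleEventToJson(this);
}
```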
The store itself carries the realm URL; see this field's doc.
No need to reach further to the ApiConnection.
The `realmUrl` value itself is Zulip's unique identifier for the realm.

Looks like this slipped through the cracks, oops.

It is the only such call site, though.  The only other matches
for the following search:

  $ git grep -C12 -P '(?<!assert.)debugLog\('

are the function's definition, and two call sites that are enclosed
in IIFEs inside asserts.

We'll use this for testing further error-handling logic.

This was called `logAndReportErrorToUserBriefly`; but it doesn't
log *and* report, it just logs instead of reporting.

The "expired queue" case also produces a ZulipApiException.

This will help keep these test cases readable when we add a
bunch more alongside them.

This lets us share this setup code between the tests of how the logic
responds (retry, reload, or otherwise) and how it reports errors to
the user (or doesn't), while keeping separate test cases so that test
failures of one kind don't obscure the story with the other.

This is the one case that we were exercising in the report-error
tests below but not in these tests.

These will help us test the reporting behavior on a wider variety
of exceptions.

Now we exercise all the same cases in these report-error tests
as we do in the retry/reload tests above.

This will make it easier to add more cases here without making the
logic complicated to follow.

This comment is about just this one line.

As is, this structure looks a bit silly.  But in the next few commits
we'll make use of it in order to recover from a wider range of errors
in the event-poll loop, by using the same remedy of reloading server
data and replacing the store as we use when the event queue expires.

This lets us simplify by cutting `async.flushMicrotasks` calls.

Saying `await Future.delayed` creates a timer, and then waits for that
timer.  Waiting for the timer starts by flushing microtasks, then runs
any other timers due before the scheduled time (which can't exist when
the delay is Duration.zero), then moves on to timers scheduled for
right at the given delay.  But for those timers with the exact same
scheduled time, it runs only the timers that were previously
scheduled.  Any timers created by those microtasks or intermediate
timers, and scheduled for the exact time, remain in the queue behind
the timer created by that `Future.delayed`.

That's why these `await Future.delayed` lines had to be preceded by
`async.flushMicrotasks` calls: the timers they were meant to wait for
hadn't yet been created, but there were pending microtasks that would
create them.

By saying `async.elapse`, we step outside the timers.  Because the
`elapse` call doesn't itself have a place in the timer queue, it can
keep waiting until every timer with the scheduled time has run -- no
matter how long a chain of microtasks and previous timers is involved
in creating that timer.
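
The difference can be seen in a small fake_async example (a sketch, assuming `package:fake_async`): a zero-delay timer created by a pending microtask is still run by `async.elapse(Duration.zero)`, for the reason described above.

```dart
import 'dart:async';
import 'package:fake_async/fake_async.dart';

void main() {
  fakeAsync((async) {
    var ran = false;
    // A pending microtask that will create a zero-delay timer.
    scheduleMicrotask(() {
      Timer(Duration.zero, () { ran = true; });
    });

    // `elapse` isn't itself a timer in the queue, so it keeps running
    // until every timer due by the elapsed time has fired -- including
    // this one, created by a microtask along the way.
    async.elapse(Duration.zero);
    assert(ran);
  });
}
```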
@gnprice gnprice merged commit 9e42f26 into zulip:main Nov 20, 2024
1 check passed
gnprice (Member, Author) commented Nov 20, 2024

Thanks for the reviews! Merged.

Labels: maintainer review (PR ready for review by Zulip maintainers)

Linked issue (closed by merging this PR): Comprehensively retry poll/register even on unforeseen errors

2 participants