Comprehensively retry event-poll failures #1063

Merged (29 commits into zulip:main), Nov 20, 2024
Conversation

gnprice (Member) commented Nov 16, 2024

This is a bit of a long branch, as I ended up going into a variety of related cleanups and other prep work.

Fixes: #563

Selected commit messages:


store: On rate-limit error in poll, explicitly back off and retry

This covers a particular case of #946.

This commit is actually NFC as far as the interaction with the server
goes; its only change in behavior is that we treat the error as
"expected", and therefore skip reporting it to the user unless it
follows a string of other errors.

But then this also sets us up so that these rate-limit errors will
continue to be handled with backoff-retry when we change the handling
of unexpected errors, coming next.
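
The approach this commit message describes can be sketched roughly as follows. This is illustrative only: the names `getEvents`, `handleEvents`, `backoffMachine`, and `ZulipApiException` stand in for the codebase's actual types and helpers, so treat it as Dart-shaped pseudocode rather than the real implementation.

```dart
// Sketch of the poll loop's rate-limit handling described above.
while (true) {
  try {
    final events = await getEvents(queueId: queueId, lastEventId: lastEventId);
    await handleEvents(events);
  } on ZulipApiException catch (e) {
    if (e.code == 'RATE_LIMIT_HIT') {
      // An "expected" error: skip reporting it to the user (unless it
      // follows a string of other errors); just back off and retry.
      await backoffMachine.wait();
      continue;
    }
    rethrow;
  }
}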


store: Handle unexpected event-poll errors by reloading store, with backoff

This fixes #563. We'll follow up with a few more commits that give
more-informative errors in some cases, or change the handling of others
from retrying getEvents to reloading the data from scratch.

The if-disposed and store.isLoading lines have no effect in the case
of a BAD_EVENT_QUEUE_ID error, because those can only come from the
getEvents request, and then the inner catch block will have already
taken the same steps.

Fixes: #563


store: Better error message on handleEvent failure


store: On non-transient request errors, reload rather than retry

This hopefully gives us a greater chance of getting past whatever
the underlying problem is, by resetting more of our state.

@gnprice gnprice added the maintainer review PR ready for review by Zulip maintainers label Nov 16, 2024
PIG208 (Member) commented Nov 18, 2024

Thanks for cleaning this up! Looks good to me. Left some comments after reading through the PR.

@@ -81,6 +81,7 @@ class UnexpectedEvent extends Event {
final Map<String, dynamic> json;

@override
@JsonKey(includeToJson: true)
PIG208 (Member):

Does this have any effect on non-JsonSerializable classes like this one? We don't use generated functions for unexpected events.

(It is wrong to use the generated functions for UnexpectedEvent:

Map<String, dynamic> _$UnexpectedEventToJson(UnexpectedEvent instance) =>
    <String, dynamic>{
      'id': instance.id,
      'json': instance.json,
    };

)

gnprice (Member, Author):

Ah indeed, thanks — I'll cut this line. Yeah, it has no effect.

@@ -732,7 +732,12 @@ void main() {
});
}

- test('retries on NetworkException', () {
+ test('retries on NetworkException from SocketException', () {
// We skip reporting errors on these; check we retry them all the same.
PIG208 (Member):

I see that later we add a test to verify that we indeed ignore such errors, maintaining a sort of invariant that each error we retry or reload on gets a corresponding 'report error' test. But there is a visual gap between each pair of them. Is it feasible to place them next to each other?

gnprice (Member, Author):

Hmm, this is meant to coincide with one of the error types for which, in main, we already test that we skip reporting:

      test('ignore boring errors', () => awaitFakeAsync((async) async {
        await prepare();

        for (int i = 0; i < UpdateMachine.transientFailureCountNotifyThreshold; i++) {
          connection.prepare(exception: const SocketException('failed'));
          pollAndFail(async);
          check(takeLastReportedError()).isNull();

PIG208 (Member):

Yeah, this is quite a nit. I was referring to the visual gap between this test and

      test('ignore NetworkException from SocketException', () {
        checkNotReported(prepareNetworkExceptionSocketException);
      });

or more in general, the gap between the `report error` group and the tests before it. We have 10 such pairs:

    test('reloads on unexpected error within loop', () {
      checkReload(prepareUnexpectedLoopError);
    });

	// ...

    group('report error', () {

	  // ...

      test('report unexpected error within loop', () {
        checkReported(prepareUnexpectedLoopError);
      });

I'm not sure if it would be better to group the tests for the same error close to each other, or keep it this way. It does seem that such tests tend to co-occur.

gnprice (Member, Author):

I see, yeah.

There's a fundamental trade-off to make: these test cases are really a matrix, of (which type of error) x (what to do vs. what to report). So we can group the different checks on a given error type together, but then the different error types get spread farther apart, or vice versa.

I went for this arrangement — different error types together for each kind of check — because I think when considering what we want to do (or report) for a given error type it's helpful to be able to compare across other error types. With this arrangement, all the reload/retry tests fit on the screen at once, or a couple of screenfuls on a smaller display.

PIG208 (Member):

That's a great way to put it! Thanks for the explanation.

if (isUnexpected) {
// We don't know the cause of the failure; it might well keep happening.
// Avoid creating a retry storm.
await _unexpectedErrorBackoffMachine.wait();
PIG208 (Member):

I see that you mentioned

The if-disposed and store.isLoading lines have no effect in the case
of a BAD_EVENT_QUEUE_ID error, because those can only come from the
getEvents request, and then the inner catch block will have already
taken the same steps.

and for errors other than BAD_EVENT_QUEUE_ID where isUnexpected is true, do we want to do an if (_disposed) check right after this?

gnprice (Member, Author):

Yeah, good catch, thanks. This is an await, so it should be followed by an if-disposed check.
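
Concretely, the fix under discussion amounts to extending the quoted lines along these lines (a sketch; the `_disposed` check and early return follow the pattern this thread describes, not a verbatim excerpt of the final code):

```dart
if (isUnexpected) {
  // We don't know the cause of the failure; it might well keep happening.
  // Avoid creating a retry storm.
  await _unexpectedErrorBackoffMachine.wait();
  if (_disposed) return; // the store may have been disposed during the await
}
```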


if (e is! ApiRequestException) {
// Some unexpected error, outside even making the HTTP request.
// Definitely a bug in our code.
PIG208 (Member):

Looks like this check also enables the exhaustive switch later. That's pretty neat!

gnprice (Member, Author):

Yep!

Ideally I'd have liked to express this as one more case in the switch, so that the rethrow cases were all next to each other. But the patterns minilanguage doesn't seem to have a way to express that. (For example if it had negations of patterns, that would do it.) So this works instead.
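
For illustration, the shape of that arrangement looks like this. This is a sketch only: the subtype names other than `ZulipApiException` and `NetworkException`, and the case bodies, are placeholders rather than the real hierarchy.

```dart
if (e is! ApiRequestException) {
  // Some unexpected error, outside even making the HTTP request.
  // Definitely a bug in our code.
  rethrow;
}
// `e` is now statically an ApiRequestException; because that type is
// (transitively) sealed, this switch can be exhaustive with no default.
switch (e) {
  case ZulipApiException():                handleApiError(e);
  case NetworkException():                 retryWithBackoff();
  case MalformedServerResponseException(): reloadStore();
}
```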

gnprice (Member, Author) commented Nov 19, 2024

Thanks for the review! Pushed a revision with those fixes. See also the question at #1063 (comment).

PIG208 (Member) left a review:

Thanks for the update! This looks good to me overall. Left a new comment on something I missed last time and a follow-up to #1063 (comment).

// (The actual HTTP status should be 429, but that seems undocumented.)
connection.prepare(httpStatus: 400, json: {
'result': 'error', 'code': 'RATE_LIMIT_HIT',
'msg': 'API usage exceeded rate limit,',
PIG208 (Member):

nit:

Suggested change:

- 'msg': 'API usage exceeded rate limit,',
+ 'msg': 'API usage exceeded rate limit',

From the API doc linked here.

{
    "code": "RATE_LIMIT_HIT",
    "msg": "API usage exceeded rate limit",
    "result": "error",
    "retry-after": 28.706807374954224
}

gnprice (Member, Author):

Indeed that's a typo, thanks 🙂

@gnprice gnprice force-pushed the pr-retry branch 2 times, most recently from c9ebc27 to 84f576c, on November 20, 2024 at 21:51
PIG208 (Member) commented Nov 20, 2024

The update looks good to me. Thanks!

This makes ApiRequestException transitively sealed, so that we can
switch on it exhaustively.

These `JsonKey(includeToJson: true)` annotations are needed so that
the properties appear in the output of the generated toJson
implementations, like they do in the real JSON from the server.
We had the annotations on many of this kind of getter, but not all.

These types appear in responses, not requests, so the actual app
never calls these toJson methods.  But having accurate toJson methods
is handy in tests.

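To illustrate the effect of that annotation (a hypothetical class, not from this codebase):

```dart
import 'package:json_annotation/json_annotation.dart';

part 'example.g.dart';

@JsonSerializable()
class ExampleEvent {
  ExampleEvent({required this.id});

  final int id;

  // json_serializable skips getters by default; this annotation makes
  // the generated toJson include `type`, matching the server's real JSON.
  @JsonKey(includeToJson: true)
  String get type => 'example';

  Map<String, dynamic> toJson() => _$ExampleEventToJson(this);
}
```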
The store itself carries the realm URL; see this field's doc.
No need to reach further to the ApiConnection.
The `realmUrl` value itself is Zulip's unique identifier for the realm.

Looks like this slipped through the cracks, oops.

It is the only such call site, though.  The only other matches
for the following search:

  $ git grep -C12 -P '(?<!assert.)debugLog\('

are the function's definition, and two call sites that are enclosed
in IIFEs inside asserts.

We'll use this for testing further error-handling logic.

This was called `logAndReportErrorToUserBriefly`; but it doesn't
log *and* report, it just logs instead of reporting.

The "expired queue" case also produces a ZulipApiException.

This will help keep these test cases readable when we add a
bunch more alongside them.

This lets us share this setup code between the tests of how the logic
responds (retry, reload, or otherwise) and how it reports errors to
the user (or doesn't), while keeping separate test cases so that test
failures of one kind don't obscure the story with the other.

This is the one case that we were exercising in the report-error
tests below but not in these tests.

These will help us test the reporting behavior on a wider variety
of exceptions.

Now we exercise all the same cases in these report-error tests
as we do in the retry/reload tests above.

This will make it easier to add more cases here without making the
logic complicated to follow.

This comment is about just this one line.

As is, this structure looks a bit silly.  But in the next few commits
we'll make use of it in order to recover from a wider range of errors
in the event-poll loop, by using the same remedy of reloading server
data and replacing the store as we use when the event queue expires.

This lets us simplify by cutting `async.flushMicrotasks` calls.

Saying `await Future.delayed` creates a timer, and then waits for that
timer.  Waiting for the timer starts by flushing microtasks, then runs
any other timers due before the scheduled time (which can't exist when
the delay is Duration.zero), then moves on to timers scheduled for
right at the given delay.  But for those timers with the exact same
scheduled time, it runs only the timers that were previously
scheduled.  Any timers created by those microtasks or intermediate
timers, and scheduled for the exact time, remain in the queue behind
the timer created by that `Future.delayed`.

That's why these `await Future.delayed` lines had to be preceded by
`async.flushMicrotasks` calls: the timers they were meant to wait for
hadn't yet been created, but there were pending microtasks that would
create them.

By saying `async.elapse`, we step outside the timers.  Because the
`elapse` call doesn't itself have a place in the timer queue, it can
keep waiting until every timer with the scheduled time has run -- no
matter how long a chain of microtasks and previous timers is involved
in creating that timer.
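
The difference can be seen in a small fake_async example (a sketch, assuming `package:fake_async`): a zero-delay timer created by a pending microtask is still run by `async.elapse(Duration.zero)`, for the reason described above.

```dart
import 'dart:async';
import 'package:fake_async/fake_async.dart';

void main() {
  fakeAsync((async) {
    var ran = false;
    // A pending microtask that will create a zero-delay timer.
    scheduleMicrotask(() {
      Timer(Duration.zero, () { ran = true; });
    });

    // `elapse` isn't itself a timer in the queue, so it keeps running
    // until every timer due by the elapsed time has fired -- including
    // this one, created by a microtask along the way.
    async.elapse(Duration.zero);
    assert(ran);
  });
}
```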
@gnprice gnprice merged commit 9e42f26 into zulip:main Nov 20, 2024
1 check passed
gnprice (Member, Author) commented Nov 20, 2024

Thanks for the reviews! Merged.

Labels: maintainer review (PR ready for review by Zulip maintainers)

Linked issue (closed by merging this PR): Comprehensively retry poll/register even on unforeseen errors

2 participants