-
Notifications
You must be signed in to change notification settings - Fork 237
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comprehensively retry event-poll failures #1063
Conversation
Thanks for cleaning this up! Looks good to me. Left some comments after reading through the PR. |
lib/api/model/events.dart
Outdated
@@ -81,6 +81,7 @@ class UnexpectedEvent extends Event { | |||
final Map<String, dynamic> json; | |||
|
|||
@override | |||
@JsonKey(includeToJson: true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this have any effect on non-JsonSerializable classes like this one? We don't use generated functions for unexpected events.
(It is wrong to use the generated functions for UnexpectedEvent
:
Map<String, dynamic> _$UnexpectedEventToJson(UnexpectedEvent instance) =>
<String, dynamic>{
'id': instance.id,
'json': instance.json,
};
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah indeed, thanks — I'll cut this line. Yeah, it has no effect.
@@ -732,7 +732,12 @@ void main() { | |||
}); | |||
} | |||
|
|||
test('retries on NetworkException', () { | |||
test('retries on NetworkException from SocketException', () { | |||
// We skip reporting errors on these; check we retry them all the same. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that later we add a test to verify that we indeed ignore such errors, maintaining some sort of "each error that we retries/reloads on gets a corresponding 'report error' test" relationship. But there is this visual gap between each pair of them. Is it feasible to place them next to each other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this is meant to coincide with one of the error types that in main we already test that we skip reporting:
test('ignore boring errors', () => awaitFakeAsync((async) async {
await prepare();
for (int i = 0; i < UpdateMachine.transientFailureCountNotifyThreshold; i++) {
connection.prepare(exception: const SocketException('failed'));
pollAndFail(async);
check(takeLastReportedError()).isNull();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is quite a nit. I was referring to the visual gap between this test and
test('ignore NetworkException from SocketException', () {
checkNotReported(prepareNetworkExceptionSocketException);
});
or more in general, the gap between the 'report error` group and the tests before it. We have 10 such pairs:
test('reloads on unexpected error within loop', () {
checkReload(prepareUnexpectedLoopError);
});
// ...
group('report error', () {
// ...
test('report unexpected error within loop', () {
checkReported(prepareUnexpectedLoopError);
});
I'm not sure if it would be better to group the tests for the same error close to each other, or keep it this way. It does seem that such tests tend to co-occur.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, yeah.
There's a fundamental trade-off to make: these test cases are really a matrix, of (which type of error) x (what to do vs. what to report). So we can group the different checks on a given error type together, but then the different error types get spread farther apart, or vice versa.
I went for this arrangement — different error types together for each kind of check — because I think when considering what we want to do (or report) for a given error type it's helpful to be able to compare across other error types. With this arrangement, all the reload/retry tests fit on the screen at once, or a couple of screenfuls on a smaller display.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great way to put it! Thanks for the explanation.
if (isUnexpected) { | ||
// We don't know the cause of the failure; it might well keep happening. | ||
// Avoid creating a retry storm. | ||
await _unexpectedErrorBackoffMachine.wait(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that you mentioned
The if-disposed and store.isLoading lines have no effect in the case
of a BAD_EVENT_QUEUE_ID error, because those can only come from the
getEvents request, and then the inner catch block will have already
taken the same steps.
and for errors other than BAD_EVENT_QUEUE_ID
where isUnexpected
is true
, do we want to do an if (_disposed)
check right after this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good catch, thanks. This is an await
, so it should be followed by an if-disposed check.
|
||
if (e is! ApiRequestException) { | ||
// Some unexpected error, outside even making the HTTP request. | ||
// Definitely a bug in our code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this check also enables the exhaustive switch later. That's pretty neat!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep!
Ideally I'd have liked to express this as one more case in the switch, so that the rethrow
cases were all next to each other. But the patterns minilanguage doesn't seem to have a way to express that. (For example if it had negations of patterns, that would do it.) So this works instead.
Thanks for the review! Pushed a revision with those fixes. See also the question at #1063 (comment) . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update! This looks good to me overall. Left a new comment on something I missed last time and a follow-up to #1063 (comment).
test/model/store_test.dart
Outdated
// (The actual HTTP status should be 429, but that seems undocumented.) | ||
connection.prepare(httpStatus: 400, json: { | ||
'result': 'error', 'code': 'RATE_LIMIT_HIT', | ||
'msg': 'API usage exceeded rate limit,', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
'msg': 'API usage exceeded rate limit,', | |
'msg': 'API usage exceeded rate limit', |
From the API doc linked here.
{
"code": "RATE_LIMIT_HIT",
"msg": "API usage exceeded rate limit",
"result": "error",
"retry-after": 28.706807374954224
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed that's a typo, thanks 🙂
c9ebc27
to
84f576c
Compare
The update looks good to me. Thanks! |
This makes ApiRequestException transitively sealed, so that we can switch on it exhaustively.
These `JsonKey(includeToJson: true)` annotations are needed so that the properties appear in the output of the generated toJson implementations, like they do in the real JSON from the server. We had the annotations on many of this kind of getter, but not all. These types appear in responses, not requests, so the actual app never calls these toJson methods. But having accurate toJson methods is handy in tests.
The store itself carries the realm URL; see this field's doc. No need to reach further to the ApiConnection.
The `realmUrl` value itself is Zulip's unique identifier for the realm.
Looks like this slipped through the cracks, oops. It is the only such call site, though. The only other matches for the following search: $ git grep -C12 -P '(?<!assert.)debugLog\(' are the function's definition, and two call sites that are enclosed in IIFEs inside asserts.
We'll use this for testing further error-handling logic.
This was called `logAndReportErrorToUserBriefly`; but it doesn't log *and* report, it just logs instead of reporting.
The "expired queue" case also produces a ZulipApiException.
This will help keep these test cases readable when we add a bunch more alongside them.
This lets us share this setup code between the tests of how the logic responds (retry, reload, or otherwise) and how it reports errors to the user (or doesn't), while keeping separate test cases so that test failures of one kind don't obscure the story with the other.
This is the one case that we were exercising in the report-error tests below but not in these tests.
These will help us test the reporting behavior on a wider variety of exceptions.
Now we exercise all the same cases in these report-error tests as we do in the retry/reload tests above.
This will make it easier to add more cases here without making the logic complicated to follow.
This covers a particular case of zulip#946. This commit is actually NFC as far as the interaction with the server goes; its only change in behavior is that we treat the error as "expected", and therefore skip reporting it to the user unless it follows a string of other errors. But then this also sets us up so that these rate-limit errors will continue to be handled with backoff-retry when we change the handling of unexpected errors, coming next.
This comment is about just this one line.
As is, this structure looks a bit silly. But in the next few commits we'll make use of it in order to recover from a wider range of errors in the event-poll loop, by using the same remedy of reloading server data and replacing the store as we use when the event queue expires.
This lets us simplify by cutting `async.flushMicrotasks` calls. Saying `await Future.delayed` creates a timer, and then waits for that timer. Waiting for the timer starts by flushing microtasks, then runs any other timers due before the scheduled time (which can't exist when the delay is Duration.zero), then moves on to timers scheduled for right at the given delay. But for those timers with the exact same scheduled time, it runs only the timers that were previously scheduled. Any timers created by those microtasks or intermediate timers, and scheduled for the exact time, remain in the queue behind the timer created by that `Future.delayed`. That's why these `await Future.delayed` lines had to be preceded by `async.flushMicrotasks` calls: the timers they were meant to wait for hadn't yet been created, but there were pending microtasks that would create them. By saying `async.elapse`, we step outside the timers. Because the `elapse` call doesn't itself have a place in the timer queue, it can keep waiting until every timer with the scheduled time has run -- no matter how long a chain of microtasks and previous timers is involved in creating that timer.
…ackoff This fixes zulip#563. We'll follow up with a few more commits that give more-informative errors in some cases, or change the handling of others from retrying getEvents to reloading the data from scratch. The if-disposed and store.isLoading lines have no effect in the case of a BAD_EVENT_QUEUE_ID error, because those can only come from the getEvents request, and then the inner catch block will have already taken the same steps. Fixes: zulip#563
This hopefully gives us a greater chance of getting past whatever the underlying problem is, by resetting more of our state.
Thanks for the reviews! Merged. |
This is a bit of a long branch, as I ended up going into a variety of related cleanups and other prep work.
Fixes: #563
Selected commit messages:
store: On rate-limit error in poll, explicitly back off and retry
This covers a particular case of #946.
This commit is actually NFC as far as the interaction with the server
goes; its only change in behavior is that we treat the error as
"expected", and therefore skip reporting it to the user unless it
follows a string of other errors.
But then this also sets us up so that these rate-limit errors will
continue to be handled with backoff-retry when we change the handling
of unexpected errors, coming next.
store: Handle unexpected event-poll errors by reloading store, with backoff
This fixes #563. We'll follow up with a few more commits that give
more-informative errors in some cases, or change the handling of others
from retrying getEvents to reloading the data from scratch.
The if-disposed and store.isLoading lines have no effect in the case
of a BAD_EVENT_QUEUE_ID error, because those can only come from the
getEvents request, and then the inner catch block will have already
taken the same steps.
Fixes: #563
store: Better error message on handleEvent failure
store: On non-transient request errors, reload rather than retry
This hopefully gives us a greater chance of getting past whatever
the underlying problem is, by resetting more of our state.