This repository has been archived by the owner on Apr 26, 2024. It is now read-only.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Catch-up after Federation Outage #8096
Catch-up after Federation Outage #8096
Changes from 10 commits
18d900e
07a415c
5bc321a
c60e259
20c896a
d232200
967d8c1
d910798
74a6f4f
c1b32ae
5de9313
759e027
6c52666
9ba56cb
af13948
558af38
44765e9
56aaa17
84dbc43
2c740a7
d77e444
33874d4
c1a2b68
16eec5c
92517e9
ef4680d
de5caf0
3e308f9
b0bdadd
843403f
ad7124d
b1fd67b
e6890c7
7cfecf3
bf51d2f
7589a03
8d9f4ba
b60ad35
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when would this happen? I'm struggling to imagine how we could end up here with an event with a lower stream ordering than the catch-up max.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a (potentially hypothetical?) race condition.
I'm not sure if ^ has an opportunity to race or not; but even if it doesn't now, what about if someone innocently comes along and plops an await in there (at the X)?
The uncertainty made me feel like I should try and make this robust and 'just deal with it'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
N.B. needs scrutiny about whether that needs to check we actually have
self._catch_up
🤔There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems to me that we shouldn't add new events to the queue while we are catching up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you note later, I set a high-water-mark. Anything after this watermark should be handled in the normal way.
(Part of this is also to reduce the use of
/get_missing_events
for when we think the remote is online.)If we are backing off for a long time, it will clear the queue and high-water-mark when it attempts a transaction on the next line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are those arguments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
D'ow, that'll teach me for not writing my full thoughts whilst I still remember what I was on about.
So, vaguely I may have had two wavelengths of thought here:
If I recall, I appear to have been under the delusion that those PDUs would be lost altogether.
However, now I realise that this will enable catch-up, and those PDUs will be eligible for catch-up if they are the most recent PDUs in their respective rooms (a
destination_rooms
entry will exist and we won't update ourlast_successful_stream_ordering
on a failure…) — so the remote will still get a chance to hear about them when they next come up. (If they aren't, then the more recent PDUs will be in the main queue so the remote will hear eventually!)So: all OK here.
Broadly I am tempted to think this is currently fine, but may need tweaking/limiting/aggressifying in the future in case that really does grow too large. I think Erik or Rich will have a more trustworthy sense here, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine for now, though once we have more confidence in the "catching up" behaviour, I think we may as well drop the whole queue immediately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are some typical situations where we could hit this? I could imagine a server is switching between available/not available behind its reverse proxy, such that we occasionally get
502
's. In that case we should probably continue to catch up, correct?Does Synapse purposely return any error codes other than 200 here? (I'm not sure). The spec doesn't seem to say so.
In that case, if it's only things in front of Synapse that would be returning other error codes, it seems sensible to continue trying to catch up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose I want to be more clear on which of these exceptions, we should later try to catch up again, or which ones we should treat as 'actually drop the PDUs for good this time'?
After looking with a fresh mind, I think yes, we should enable catch-up on:
RequestSendFailed
,HttpResponseException
Not sure about:
FederationDeniedError
,Exception
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I think we should enter "catchup" mode in most of these cases.
We should not do so for
FederationDeniedError
since that means the remote server is deliberately blocked in our whitelist.All the other cases should trigger a catchup afaict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: why do we need a high-water-mark at all? why not just keep going until
get_catch_up_room_event_ids
returns an empty list?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(doing so without care will introduce races though...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose we could do that if you are keen — as you say though, needs more thought.
The advantage of this approach is that it's easy to keep the logic in my head. Was keen not to give a chance for any nasty bugs to crawl in, because it'd be nice to have confidence in this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might want to note in the log line that this is unintended in some way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as noted elsewhere: I think the
order
is redundant: it might be easier to get rid of it (in a separate PR).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll put it on my list, then :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when does this happen, and why is breaking the loop the correct behaviour?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose this should be log an ERROR, since this shouldn't happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
break
isn't the correct behaviour really, since it disables catch-up when we supposedly should have some to do.But only
break
will let us make progress and go to the main loop again.So I will log an error either way, but is it best to break or return?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this being unnecessarily paranoid?
There will be a foreign key constraint to link our rows to events, so this should darn well be impossible.
Better to
assert events
orif not events: raise AssertionError(...)
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd kinda like to see the logic shuffled so that this can be a foreign key; however at the very least it needs a comment saying why it can't be one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, would it suffice to
INSERT IGNORE
a NULL row into the destinations table when upserting adestination_rooms
row?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably. It would be nice to avoid doing so on every PDU (for example by doing it when we first create a
PerDestinationQueue
for the destination), but that might be fiddly, and is something we can do later.