-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Conversation
…triggering about every 35 seconds and preventing idle.
…over sync and federation
Can confirm, same Problem here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite sure I'm grokking this code, but I left a few questions that will hopefully clarify for myself or a future reviewer.
The tests do look pretty reasonable though, so I can believe that this is fixing bugs!
presence == PresenceState.BUSY and self._busy_presence_enabled | ||
): | ||
if ( | ||
prev_state.state == PresenceState.OFFLINE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm failing to follow why this is specifically for offline->online; naively I would have expected if it is going from !online->online.
(The busy presence bit seems like it might have the same bug?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a reason, which is entirely escaping me right now(and appears to be missing from my notes). Let me test it out, because if this holds up we can remove the 'hacky' bit in handle_update()
. I think it was related to not using set_presence
on the /sync
query, as it defaults to 'online'. However, that behavior has recently changed(at least for Element, other clients may not have). Give me some time to look into it again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK! The various possible state transitions are rather fuzzy in my head, which is why I'm asking. 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went ahead and made those changes. No deal(at least as-is). It will need something else to handle the fact that some clients still set presence via the presence endpoint and not through /sync
.
https://github.com/matrix-org/synapse/actions/runs/5745219286
I left my desktop client open(so it should have been syncing in the background, at least until it went to sleep). But then I checked in real quick with my Android mobile, a few minutes later this nonsense started spamming federation.
# Need to remove the user_id from users_going_offline or the looping call will | ||
# tell the presence writer that the user went offline. | ||
self.users_going_offline.pop(user_id, None) | ||
self.send_user_sync(user_id, True, self.clock.time_msec()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite following why we remove the if-statement here. Can you explain a bit more what this is doing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mentioned that above in Number 2 under Three-fold fix.
Some context to keep in mind before I begin the long version:
- This is only applies on a worker model of synapse where there is either a sync worker or a presence writer(at the very least). This does not apply on a monolith deployment.
- syncs can happen sometimes in rapid or sometime in slow succession, depending on how many other users the homeserver is aware exist + any room updates coming along
- There is a looping call every 10 seconds, that will mark a user as 'offline' if they are in
users_going_offline
and at least 10 seconds has passed. These timers are not synchronized in any way and a theoretical 19 seconds could occur. The looping call uses the companion tosend_user_sync()
calledsend_stop_syncing()
which makes an explicit call to the presence writer to mark them 'offline'. - This is tied to the context manager that wraps a call to the sync handler here
Now
- At the
_end()
of the sync context manager, it used to adduser_id
as a key tousers_going_offline
with a value of the current time in milliseconds. - On the next sync,
user_id
ispop()
ped off ofusers_going_offline
and checked forNone
(or more specifically, would check that there wasnot
a value), and it would only thensend_user_sync()
. This only occurs when it is the very first sync of a recent-ish session otherwise it willnot
beNone
andsend_user_sync()
would not be called. - So, we are mid sync and the presence writer has not been informed we are still around. Finish the context manager, and repeat however many times it takes to get to about 10 seconds(specifically the next time the looping call comes around to call
send_stop_syncing()
and mark them offline). - 10 seconds has passed, the context manager is still doing it's thing and knows the user is still online(obviously, they are still mid-sync after all). Over on the presence writer however, no indication has happened in 10 seconds that the user is still online: no further
USER_SYNC
has come over TCP replication, and remember thatset_state()
no longer bumpslast_active_ts
on a regular basis anymore. The machinery kicks into gear to mark the user as 'offline', which then gets sent over replication back to the sync handler side. - At this point, the presence writer thinks the user is 'offline' and the sync handler(well, it's associated
WorkerPresenceHandler
, actually) thinks the user is 'online'. The call from the presence writer will overwrite what the context manager thinks at this point. Still no update to thelast_active_ts
, by the way. - On the next sync, the context manager saw they were 'offline' last and puts the user back to 'online', beginning the cycle again. You can see this in metrics under the spoiler arrow with the words "One without Two resulted in this bad dream(every 35 seconds)".
Now, you are probably thinking, "Well, let's just remove the not
in the conditional to check for an actual time". I considered that, but then the presence writer doesn't get it's first notification that the user is syncing and that felt like undefined behavior. Honestly, it might work out(and is a place I'm planning on revisiting in the future as a bit of refactor, there are more things to touch here). I will test this if you wish, my instinct says as long as the second sync of a session is within 10 seconds then everything will be fine.
And, to be fair, fully fixing everything possible wasn't the scope of this PR. I wanted 'unavailable' to work as advertised. But this was affecting the behavior in a negative way and wasn't going to be allowed to get away, a traditional 'bug' that was fully exposed by last_active_ts
not being updated on every single sync any longer.
if prev_state.state == PresenceState.UNAVAILABLE: | ||
# If we had idled out, but are still syncing, then new state will look | ||
# like it's online. Verify with last_active_ts, as that is a sign of a | ||
# pro-active event. Override that we are online when we probably aren't. | ||
if now - new_state.last_active_ts > IDLE_TIMER: | ||
new_state = new_state.copy_and_replace( | ||
state=PresenceState.UNAVAILABLE | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels a bit hacky -- can we avoid trying to update in this case at all instead backtracking from saying we're online to unavailable?
@realtyem FYI it sounds like there has been a regression in Element Web which might be a root cause for us seeing increasing problems around this area. I'm not sure how much that will affect this PR |
I'm aware, I'm still running some tests(they take a while to accumulate the data) to make sure this all holds true. I believe it does. It should be ok still, though I am currently side tracked by checking @clokep 's ask about the mentioned condition being not more encompassing. I should know for sure in about 24 hours(give or take) |
…st_active_ts is updated if we were anything but already ONLINE
… that last_active_ts is updated if we were anything but already ONLINE" This reverts commit 6b7e273.
# last_active_ts should not have changed as no pro-active event has occurred | ||
self.assertEqual(prev_state.last_active_ts, state.last_active_ts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@erikjohnston and I went through this and we think there's a misunderstanding here -- last_active_ts
should be updated for each sync. It seems like not updating this except for the initial offline -> online transition is fixing the bug, at the expense of never updating last_active_ts
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading just that summary that sounds like a recipe for disaster.
start: /sync
(online) => update last_active_ts => new data to send to syncer => goto start
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
start:
/sync
(online) => update last_active_ts => new data to send to syncer => goto start
My understanding is that updating the last_active_ts
should not cause new data to be sent down sync while you're actively syncing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes
https://spec.matrix.org/v1.7/client-server-api/#last-active-ago
To reduce the number of presence updates sent to clients the server may include a currently_active boolean field when the presence state is online. When true, the server will not send further updates to the last active time until an update is sent to the client with either a) currently_active set to false or b) a presence state other than online. During this period clients must consider the user to be currently active, irrespective of the last active time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear
There is more to do in this area, but will be on a different PR as it may be somewhat contentious.
I wasn't done in this area. This doesn't completely fix that last_active_ts
is being sent inappropriately, just that part that was causing it to be affecting idle mode(I'm trying to stay inside the scope of the point of the PR. The TCP replication was one of the things that was masking and keeping this from working as well, or it would also have been separate).
To fix last_active_ts
being updated inappropriately, only really means that if currently_active
is True we don't have to send anything if the only thing to send is an update to last_active_ts
. The only time last_active_ts
should be sent is when currently_active
is False and that update should be changing currently_active
to True on the same packet.
State changes are separate from this, and this PR is only supposed to be about making the state change work. I'm sorry this is making problems. I can close this and separate it all out if you would rather. (But I'm not updating the pictures, those were a pain to do the first time)
I'm just going to break this into pieces. Maybe that will be easier to deal with. |
Presence has an idle mode, called 'Unavailable' which currently doesn't work right and is generating some additional traffic. Make it behave.
This is what it looks like if you zoom in to a 1 second resolution.
And it's repetitive. This is my client sitting open and largely untouched.
But it doesn't ever actually idle. It goes back to online in nearly the same second. For testing this PR and for all the following pictures, I parked myself into an empty room so it would not be contaminated by read receipts, read markers, or typing (and of course didn't send myself any messages) as all of those will bump
last_active_ts
and not allow idling. I also ramped up the scraping interval and changed Grafana's display points to use$__rate_interval
so I could get 1 second resolution on the timing(which makes the scale weird). The point was that I could see how often it was happening.A client/user is considered idle after 5 minutes of no 'pro-active' events or activity. As you can see, it's not quite....right.
The fix is three-fold
last_active_ts
bug, which allows a user to not update theirlast_active_ts
just by syncing.USER_SYNC
in which one would say we are 'online' and another would say we are 'offline', creating a flicker(@nico-famedly). This appeared to only occur on a worker-mode dance between a sync worker and a presence stream_writer, but would still propagate over federation as it was a change of state. I want to thank @anoadragon453 for putting on an archeology hat and providing the fix for this. I don't think I would have looked here if it wasn't for that.PresenceState
was in 'unavailable'(which is what idle is called by the spec) and verifies that per thelast_active_ts
to see if the call to be 'online' is accurate. I want to thank @jevolk for letting me borrow his homeserver to see what was happening on the other side of federation.One without Two resulted in this bad dream(every 35 seconds):
Just the yellow one on the left, the orange one on the right we'll deal with in Three.
I called it after 30 minutes that this was enough of that.
One and Two without Three gave this nightmare(which didn't start for over 5 minutes, but then every 5 seconds):
Close up on these is up above.
And I threw in some tests to check this stuff. They pass now but on
develop
you get:(For those that were paying attention, you may have noticed that the original version of this PR had the conditional to fix
last_active_ts
check against!= PresenceState.ONLINE
. This was affecting using idle, so was changed to exclude it.)Hey look, it works now
Correct idling federation traffic:
Output provided by jevolk's server(does not match up with metrics above, but was used to verify unavailable state)
There is more to do in this area, but will be on a different PR as it may be somewhat contentious.
Should...
last_active_ago
is updated on sync when it shouldn't be #12424Pull Request Checklist
(run the linters)
Signed-off-by: Jason Little realtyem@gmail.com