Messages not being sent over federation #7099
Comments
Update: This continues; sometimes messaging in 1:1 rooms between two servers works, sometimes it fails. Addon: |
It sounds like the upgrade from 1.11.0 to 1.11.1 is irrelevant. Possibly it just seemed related because things coincidentally broke at the same time. I'm wondering if this is somehow related to #7065: there was a bug in synapse 1.10 which could cause some incorrect database updates. There is a fix for this (#7070) in the forthcoming 1.12.0 release. |
@richvdh thanks for the info. This all feels like some form of data corruption bug for 1:1 federation chats, which at least can be worked around by starting new 1:1 chats |
ok if you want to investigate further, please send a message in one of the affected rooms, and then share the logs from both the sending server and a server which didn't receive the message. |
By chance we got a broken room again, we'll try to filter the logs on both servers accordingly. There is a lot going on in there. I'll keep you posted |
Sorry for the delay! We have some logs now, also we've seen that:
Here are logs from the admins of server1 and server2. All systems should be on the latest versions (latest stable riot web and latest stable synapse). We've tried to pin down the correct point in time as well as we can and to sanitize the logs. Server1 sending:
|
looks like server2 isn't sending the events to server1 for some reason. can you enable DEBUG logging for it's also worth grepping the logs for ERROR and CRITICAL lines to see if anything looks odd. |
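For anyone following along, the grep step can be sketched in Python. This is a minimal sketch that assumes each log line carries its level name as a separate word; the `filter_log_lines` helper is illustrative and not part of Synapse.

```python
import re
from typing import Iterable, List

# Hypothetical helper (not part of Synapse): match lines that contain
# ERROR or CRITICAL as a whole word anywhere in the line.
LEVEL_RE = re.compile(r"\b(ERROR|CRITICAL)\b")


def filter_log_lines(lines: Iterable[str]) -> List[str]:
    """Return only the lines that mention ERROR or CRITICAL."""
    return [line.rstrip("\n") for line in lines if LEVEL_RE.search(line)]
```

Passing an open log file to `filter_log_lines` works too, since a file object iterates line by line. Note that the match is on the whole line, so a message body that happens to contain the word ERROR is also kept.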
this is odd. have you got a reverse-proxy which is converting HTTP/1.1 requests to HTTP/1.0? |
That is the nginx default for proxy connections (https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_http_version). The admin of server2 changed the configuration as per your recommendation. As we found out before, messages started flowing again after the restart, but we'll keep trying until it breaks again. |
Today it finally broke again. Here are the logs with debug enabled for synapse.federation.sender on server2. First I (user1) sent a direct message to server2, and then user2 on server2 instantly tried to respond, which did not come through (also no read or typing notifications). If we restarted server2 now, it would work again, and the old messages would be re-sent along with the next new message user2 writes in the direct channel. Server1 log (Can send, but does not receive)
Server2 log (Did receive all, but no messages are flowing back)
|
For some reason, server2 has decided that server1 is no longer in the room :/. Figuring out why is going to take some deep digging. Is the problem limited to a small number of rooms? Perhaps just creating new rooms is a viable workaround. Otherwise, set up a manhole and prepare for the long haul... |
The pattern we've seen so far is that only direct chats are impacted by this. Workarounds right now are restarting server2 (or server3, there is another one behaving exactly the same but with >50 people on it, so the logs are quite exhausting) or re-creating the room. Given that there is another server acting the same way, could the problem be with server1? If we can provide any insights with more debug output etc. we are happy to help =), sadly it always takes a while until a room breaks again. Side note that comes to mind right now: some direct chats with server2 and server3 broke in the same timeframe. However, most of the direct chats kept working. |
do you have a manhole configured on each of the affected servers? if not, set one up, so that we can do some digging next time it happens. Docs are in the |
We'll set one up at server1 and server2. Could you provide us with the commands and information we should get through the manhole as soon as a room breaks again? |
suggest once it happens you dm me at |
Sounds like a plan. We'll contact you. Thank you for all the effort =) |
After investigation: it appears that this is due to custom status events corrupting the |
Fix a bug where the `get_joined_users` cache could be corrupted by custom status events (or other state events with a state_key matching the user ID). Fixes: #7099.
(many thanks to @ErrorProne and team for their patience in tracking this one down) |
I'm labelling this as a release-blocker because it's becoming more of an issue now that custom statuses are a thing, and the fix is trivial (#7376) |
Great to hear, thank you again =) |
Fix a bug where the `get_joined_users` cache could be corrupted by custom status events (or other state events with a state_key matching the user ID). The bug was introduced by matrix-org#2229, but has largely gone unnoticed since then. Fixes matrix-org#7099, matrix-org#7373.
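To make the failure mode concrete, here is a simplified sketch of how a joined-users cache can be corrupted when an update step keys on `state_key` without checking the event type. It is an illustration only and does not reproduce Synapse's actual code: the `StateEvent` class, both `apply_delta_*` functions and the `com.example.custom_status` event type are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class StateEvent:
    """Hypothetical, stripped-down state event."""
    type: str             # e.g. "m.room.member"
    state_key: str        # for membership events, the user ID
    membership: str = ""  # only meaningful for m.room.member


def apply_delta_buggy(joined: Set[str], delta: Iterable[StateEvent]) -> Set[str]:
    """Update the cached set from changed state WITHOUT checking the type.

    Any state event whose state_key is a user ID is treated as a
    membership change, so a non-member event evicts the user.
    """
    result = set(joined)
    for ev in delta:
        if ev.membership == "join":
            result.add(ev.state_key)
        else:
            result.discard(ev.state_key)
    return result


def apply_delta_fixed(joined: Set[str], delta: Iterable[StateEvent]) -> Set[str]:
    """Same update, but only m.room.member events touch the cache."""
    result = set(joined)
    for ev in delta:
        if ev.type != "m.room.member":
            continue  # ignore custom status and other non-member state
        if ev.membership == "join":
            result.add(ev.state_key)
        else:
            result.discard(ev.state_key)
    return result
```

In the buggy variant, a status event with `state_key="@user1:server1"` removes that user from the cached set, which would match the symptom seen earlier in the thread where server2 decided that server1 was no longer in the room.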
I had exactly the same problem, however I fixed it via changing my config on my apache reverse proxy.
|
Description
I'll try to describe this as exactly as I can, in reverse.
After upgrading our matrix server running on ubuntu 18.04 to 1.11.1+bionic1 we did not receive any messages from servers running on older instances (only in 1:1 chats, groups have been working fine).
We are explicitly talking with another company running a matrix instance and a freelancer with his own setup.
Both problems were solved after upgrading their versions to the matrix synapse 1.11.1 version. But since at least one party was running 1.11.0+bionic1 before, it looks like the bug has been introduced in the patch release.
All messages and presence updates going from 1.11.1+bionic1 to the other sides went through without problems. But nothing the other way around.
One major pain point was that we only discovered this after becoming suspicious that some people did not respond in time, and then trying other communication channels. So the users were totally unaware of this.
A last note: As mentioned bringing all instances to 1.11.1 (or the latest docker image) solved it, no message went missing! As soon as the other parties wrote a new message all the missing ones got redelivered.
Edit: As far as I can tell it did not matter on which side the 1:1 chatroom has been initially created.
Steps to reproduce
(I did not re-test this, but this should result in the problematic setup)
You should now only be able to send 1:1 messages from 1.11.1+bionic1 to 1.11.0+bionic1 but not the other way around (At least using any riot the user won't get any feedback about this. I'll write an issue for that project later).
I could not find any suspicious log entries on our 1.11.1 server. But I've requested logs from the other side, maybe we can find something there.
Version information