Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Pintobyte.com regularly receives thundering herds of presence EDUs which badly overloads it. #2514

Closed
ara4n opened this issue Oct 9, 2017 · 6 comments

Comments

@ara4n
Copy link
Member

ara4n commented Oct 9, 2017

The box apparently spikes CPU when this happens, and the logs look like:

2017-10-09 09:46:54,078 - synapse.access.https.8448 - 91 - INFO - PUT-608361- 137.74.150.193 - 8448 - {maunium.net} Processed request: 15875ms (12ms, 0ms) (20ms/2) 11B 200 "PUT /_matrix/federation/v1/send/1507147021275/ HTTP/1.1" "Synapse/0.23.1"
2017-10-09 09:46:54,079 - synapse.access.https.8448 - 91 - INFO - PUT-608362- 5.196.33.160 - 8448 - {unixpimps.net} Processed request: 15873ms (8ms, 0ms) (31ms/2) 11B 200 "PUT /_matrix/federation/v1/send/1506285847368/ HTTP/1.1" "Synapse/0.22.1"
2017-10-09 09:46:54,080 - synapse.access.https.8448 - 91 - INFO - PUT-608363- 83.136.253.55 - 8448 - {status.im} Processed request: 15381ms (11ms, 0ms) (30ms/2) 11B 200 "PUT /_matrix/federation/v1/send/1507209450899/ HTTP/1.1" "Synapse/0.23.1 (b=develop,5de926d)"
2017-10-09 09:46:54,080 - synapse.access.https.8448 - 91 - INFO - PUT-608360- 74.15.30.197 - 8448 - {otonokizaka.is-a-linux-user.org} Processed request: 15881ms (0ms, 0ms) (40ms/2) 11B 200 "PUT /_matrix/federation/v1/send/1507329008642/ HTTP/1.1" "Synapse/0.22.1"
2017-10-09 09:46:54,087 - synapse.access.https.8448 - 91 - INFO - PUT-608364- 178.255.151.168 - 8448 - {err.no} Processed request: 15245ms (0ms, 0ms) (29ms/2) 11B 200 "PUT /_matrix/federation/v1/send/1507308264601/ HTTP/1.1" "Synapse/0.22.1"
2017-10-09 09:46:54,089 - synapse.access.https.8448 - 91 - INFO - PUT-608365- 62.210.203.126 - 8448 - {thekinrar.fr} Processed request: 15122ms (16ms, 0ms) (61ms/2) 11B 200 "PUT /_matrix/federation/v1/send/1502572424591/ HTTP/1.1" "Synapse/0.22.1"
2017-10-09 09:46:54,098 - synapse.access.https.8448 - 91 - INFO - PUT-608366- 5.9.31.99 - 8448 - {matrix.crypto.koeln} Processed request: 15115ms (12ms, 0ms) (60ms/2) 11B 200 "PUT /_matrix/federation/v1/send/1507500334310/ HTTP/1.1" "Synapse/0.22.1"
2017-10-09 09:46:54,101 - synapse.access.https.8448 - 91 - INFO - PUT-608370- 5.45.104.21 - 8448 - {msgs.tk} Processed request: 9335ms (0ms, 0ms) (56ms/2) 11B 200 "PUT /_matrix/federation/v1/send/1507217392729/ HTTP/1.1" "Synapse/0.23.1 (b=msgs.tk,d7e7bbe8)"
2017-10-09 09:46:54,103 - synapse.access.https.8448 - 91 - INFO - PUT-608371- 109.74.200.146 - 8448 - {sw1v.org} Processed request: 8830ms (15ms, 0ms) (40ms/2) 11B 200 "PUT /_matrix/federation/v1/send/1507307336083/ HTTP/1.1" "Synapse/0.23.1 (b=rav/federation_server_meltdown,37048b8)"
2017-10-09 09:46:54,112 - synapse.access.https.8448 - 91 - INFO - PUT-608372- 163.172.150.247 - 8448 - {thewordnerd.info} Processed request: 8205ms (8ms, 0ms) (40ms/2) 11B 200 "PUT /_matrix/federation/v1/send/1504875069277/ HTTP/1.1" "Synapse/0.22.1"
2017-10-09 09:46:54,120 - synapse.access.https.8448 - 91 - INFO - PUT-608373- 35.185.61.183 - 8448 - {pfe.io} Processed request: 8204ms (4ms, 0ms) (47ms/2) 11B 200 "PUT /_matrix/federation/v1/send/1506311318218/ HTTP/1.1" "Synapse/0.22.1"

However, in the preceding 15s, there are pauses of 2-3s in the logs. It looks like the reactor is getting wedged on something, but no idea what, nor does it seem to be covered by a log context. It's not getting stuck behind a linearizer lock seemingly.

I'm wondering if there's some queuing going on in processing the requests such that they suddenly all end up getting processed at the same point. I vaguely remember seeing this when investigating the Christmas meltdown - that Twisted has some kind of internal queue where requests start to stack up if the reactor is too busy?

@ara4n
Copy link
Member Author

ara4n commented Nov 8, 2017

This is still happening, and it's really quite bizarre. I think there are two problems: synapse failing to handle spikes of EDU traffic well... but also the fact we're seeing weird spikes in presence traffic too. Jeremy has shared logs of this happening at DEBUG, and looking at the pattern of RX and TX federation traffic, we see huge spikes in RXs during which CPU goes to 100% and the reactor melts:

zless homeserver\ \(6\).log.gz | egrep 'Received txn' | cut -d' ' -f1,2 > rx.txt
zless homeserver\ \(6\).log.gz | egrep 'Sending transaction' | cut -d' ' -f1,2 > tx.txt

gnuplot <<EOT
set xdata time
set format x "%H:%M"
bin(x,width)=width*floor(x/width)
binwidth=1
plot 'tx.txt' using (bin(strptime("%Y-%m-%d %H:%M:%S", strcol(1)." ".strcol(2)),binwidth)):(1.0) smooth freq with boxes, \
     'rx.txt' using (bin(strptime("%Y-%m-%d %H:%M:%S", strcol(1)." ".strcol(2)),binwidth)):(1.0) smooth freq with boxes
EOT

yields
screen shot 2017-11-08 at 13 35 44

Looking at the biggest RX spike, this is 27(!) fed /sends arriving at 15:54:15:

sierra:Downloads matthew$ zless homeserver\ \(6\).log.gz | grep 'synapse.federation.transport.server' | grep Decoded | grep "15:54:15"
2017-11-08 15:54:15,004 - synapse.federation.transport.server - 243 - DEBUG - PUT-7745- Decoded 1509985970883: {'origin': 'monkeystew.net', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@thrrgilag:monkeystew.net', 'last_active_ago': 25885448, 'presence': 'offline'}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174419133}
2017-11-08 15:54:15,021 - synapse.federation.transport.server - 243 - DEBUG - PUT-7746- Decoded 1507010868618: {'origin': 'vn-rtmessaging.rtcenter.org', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@C0261484814968.87200158807a78d51680.66233866:vn-rtmessaging.rtcenter.org', 'last_active_ago': 27877240, 'presence': 'offline'}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174418595}
2017-11-08 15:54:15,033 - synapse.federation.transport.server - 243 - DEBUG - PUT-7747- Decoded 1509440944224: {'origin': 'cn.matrix.gensokyo.link', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@jhon:cn.matrix.gensokyo.link', 'last_active_ago': 240700698, 'presence': 'offline'}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174419409}
2017-11-08 15:54:15,045 - synapse.federation.transport.server - 243 - DEBUG - PUT-7748- Decoded 1510159775961: {'origin': 'zenfolie.org', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@alain:zenfolie.org', 'presence': 'online', 'last_active_ago': 3408, 'currently_active': True}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174420919}
2017-11-08 15:54:15,062 - synapse.federation.transport.server - 243 - DEBUG - PUT-7749- Decoded 1510064434477: {'origin': 'hveem.no', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@Tor:hveem.no', 'presence': 'online', 'last_active_ago': 3908, 'currently_active': True}, {'user_id': '@ivarbot:hveem.no', 'presence': 'online', 'last_active_ago': 3124, 'currently_active': True}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174420212}
2017-11-08 15:54:15,076 - synapse.federation.transport.server - 243 - DEBUG - PUT-7750- Decoded 1506346115725: {'origin': 'esko.bar', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@test:esko.bar', 'last_active_ago': 3046417149, 'presence': 'offline'}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174417874}
2017-11-08 15:54:15,087 - synapse.federation.transport.server - 243 - DEBUG - PUT-7751- Decoded 1509748021517: {'origin': 'cadair.com', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@cadair:cadair.com', 'last_active_ago': 100355884, 'presence': 'offline'}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174724081}
2017-11-08 15:54:15,099 - synapse.federation.transport.server - 243 - DEBUG - PUT-7752- Decoded 1509069475360: {'origin': 'ravne.land', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@e:ravne.land', 'last_active_ago': 7366024394, 'presence': 'offline'}, {'user_id': '@esh:ravne.land', 'presence': 'online', 'last_active_ago': 3356, 'currently_active': True}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174419794}
2017-11-08 15:54:15,110 - synapse.federation.transport.server - 243 - DEBUG - PUT-7753- Decoded 1507289681847: {'origin': 'conference.open-xchange.com', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@bjoern.vonderlinde:conference.open-xchange.com', 'last_active_ago': 1093099723, 'presence': 'offline'}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174531612}
2017-11-08 15:54:15,121 - synapse.federation.transport.server - 243 - DEBUG - PUT-7754- Decoded 1509368427506: {'origin': 'krtdex.com', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@seequ:krtdex.com', 'presence': 'online', 'last_active_ago': 2560, 'currently_active': True}, {'user_id': '@mojo:krtdex.com', 'last_active_ago': 8750745135, 'presence': 'offline'}, {'user_id': '@wuppi:krtdex.com', 'last_active_ago': 15044496845, 'presence': 'offline'}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174419713}
2017-11-08 15:54:15,140 - synapse.federation.transport.server - 243 - DEBUG - PUT-7755- Decoded 1508510167206: {'origin': 'matrix.f5.htw-berlin.de', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@plesch:matrix.f5.htw-berlin.de', 'last_active_ago': 16934927, 'presence': 'offline'}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174418916}
2017-11-08 15:54:15,151 - synapse.federation.transport.server - 243 - DEBUG - PUT-7756- Decoded 1509066263029: {'origin': 'tang.ents.ca', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@uhoreg:tang.ents.ca', 'last_active_ago': 1318151786, 'presence': 'offline'}, {'user_id': '@travis:tang.ents.ca', 'last_active_ago': 685954574, 'presence': 'offline'}, {'user_id': '@ownway:tang.ents.ca', 'last_active_ago': 14743297, 'presence': 'offline'}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174412825}
2017-11-08 15:54:15,163 - synapse.federation.transport.server - 243 - DEBUG - PUT-7757- Decoded 1503936283541: {'origin': 'aius.u-strasbg.fr', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@root:aius.u-strasbg.fr', 'last_active_ago': 3375159631, 'presence': 'offline'}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174420904}
2017-11-08 15:54:15,180 - synapse.federation.transport.server - 243 - DEBUG - PUT-7758- Decoded 1509898739015: {'origin': 'croxis.net', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@croxis:croxis.net', 'last_active_ago': 148888981, 'presence': 'offline'}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174422227}
2017-11-08 15:54:15,191 - synapse.federation.transport.server - 243 - DEBUG - PUT-7759- Decoded 1510019382473: {'origin': 'vurpo.fi', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@vurpo:vurpo.fi', 'presence': 'online', 'last_active_ago': 2911, 'currently_active': True}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174416143}
2017-11-08 15:54:15,202 - synapse.federation.transport.server - 243 - DEBUG - PUT-7760- Decoded 1509233656078: {'origin': 'pvagner.tk', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@pvagner:pvagner.tk', 'last_active_ago': 3260859, 'presence': 'offline'}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174418945}
2017-11-08 15:54:15,212 - synapse.federation.transport.server - 243 - DEBUG - PUT-7761- Decoded 1510152209525: {'origin': 'filterfish.org', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@rgh:filterfish.org', 'last_active_ago': 85018665, 'presence': 'offline'}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174422211}
2017-11-08 15:54:15,223 - synapse.federation.transport.server - 243 - DEBUG - PUT-7762- Decoded 1509397493503: {'origin': 'riot.ovh', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@x:riot.ovh', 'presence': 'online', 'last_active_ago': 2136, 'currently_active': True}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174418988}
2017-11-08 15:54:15,235 - synapse.federation.transport.server - 243 - DEBUG - PUT-7763- Decoded 1508969276823: {'origin': 'tophattedcat.co.uk', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@root:tophattedcat.co.uk', 'presence': 'online', 'last_active_ago': 2186, 'currently_active': True}, {'user_id': '@tophattedcat:tophattedcat.co.uk', 'presence': 'offline'}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174419312}
2017-11-08 15:54:15,262 - synapse.federation.transport.server - 243 - DEBUG - PUT-7765- Decoded 1509752520004: {'origin': 'roeckx.be', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@kurt:roeckx.be', 'last_active_ago': 6646812, 'presence': 'offline'}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174419889}
2017-11-08 15:54:15,274 - synapse.federation.transport.server - 243 - DEBUG - PUT-7766- Decoded 1507151718890: {'origin': 'internet-portal.cz', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@vaclav.fiser:internet-portal.cz', 'presence': 'online', 'last_active_ago': 3722, 'currently_active': True}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174421224}
2017-11-08 15:54:15,285 - synapse.federation.transport.server - 243 - DEBUG - PUT-7767- Decoded 1508859265489: {'origin': 'exul.org', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@exul:exul.org', 'last_active_ago': 5515739, 'presence': 'offline'}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174420345}
2017-11-08 15:54:15,296 - synapse.federation.transport.server - 243 - DEBUG - PUT-7768- Decoded 1508860580589: {'origin': 'matrix.zottel.net', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@zottel:matrix.zottel.net', 'last_active_ago': 14613576, 'presence': 'offline'}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174419140}
2017-11-08 15:54:15,307 - synapse.federation.transport.server - 243 - DEBUG - PUT-7769- Decoded 1508605412807: {'origin': 'home.everets.org', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@keverets:home.everets.org', 'presence': 'offline'}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174418237}
2017-11-08 15:54:15,318 - synapse.federation.transport.server - 243 - DEBUG - PUT-7770- Decoded 1509567732286: {'origin': 'cervoi.se', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@erdnaxeli:cervoi.se', 'presence': 'online', 'last_active_ago': 4189, 'currently_active': True}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174477539}
2017-11-08 15:54:15,329 - synapse.federation.transport.server - 243 - DEBUG - PUT-7771- Decoded 1509524828367: {'origin': 'matrix.project47.xyz', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@ContainsLiquid:matrix.project47.xyz', 'presence': 'offline'}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174421629}
2017-11-08 15:54:15,347 - synapse.federation.transport.server - 243 - DEBUG - PUT-7772- Decoded 1509455980272: {'origin': 'kiwiki.eu', 'pdu_failures': [], 'pdus': [], 'edus': [{'content': {'push': [{'user_id': '@Dirk:kiwiki.eu', 'last_active_ago': 102347853, 'presence': 'offline'}]}, 'edu_type': 'm.presence'}], 'origin_server_ts': 1510174419139}

Most of these presence EDUs are nothing remotely interesting - users like cadair happening to be offline and last active 1.1 days ago.

I have no idea what is causing this thundering herd. @erikjohnston any idea? @leonerd any chance this rings a bell from your original implementation?

@ara4n ara4n changed the title Pintobyte.com sometimes takes 15s to process EDU fed sends for no obvious reason Pintobyte.com regularly receives thundering herds of presence EDUs which badly overloads it. Nov 8, 2017
@ara4n
Copy link
Member Author

ara4n commented Nov 9, 2017

having had a very cursory look at the code, the presence handler has

last_federate = new_state.last_federation_update_ts
if now - last_federate > FEDERATION_PING_INTERVAL:
# Been a while since we've poked remote servers
new_state = new_state.copy_and_replace(
last_federation_update_ts=now,
)
federation_ping = True
in it, where FEDERATION_PING_INTERVAL. So I'm wondering if we have a thinko somewhere here where the inactivity timers of the servers get in sync (after receiving an EDU), and then after 25 mins the trigger fires, causing a thundering herd of pings?

@richvdh
Copy link
Member

richvdh commented Mar 5, 2018

Didn't this get fixed?

@ara4n
Copy link
Member Author

ara4n commented Apr 12, 2018

not afaik.

@ara4n
Copy link
Member Author

ara4n commented Apr 12, 2018

it may also be a dup of #1324

@richvdh
Copy link
Member

richvdh commented May 24, 2019

assuming this is caused by #3962, it's fixed.

@richvdh richvdh closed this as completed May 24, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants