Synapse completely hangs every few minutes, when configured with an IPv6 DNS server but no IPv6 network #2395
Four homeserver logs, each covering start to complete lockup: from_launch_to_crash_05.txt. My random intuition tells me that the lockups could be related to hostnames returning NXDOMAIN. |
Those logs do make it look like it blows up when it starts sending out transactions, but otherwise there isn't anything particularly interesting in them :/ We've seen some "weird" behaviour on matrix.org relating to federation sending and memory usage, so I wouldn't be that surprised if this was something similar, though I've never seen it lock up before. My suggestion at this point would be to try running synapse with a split-out worker for federation sending; this should then tell us whether it is that or not. The generic docs for workers are: https://github.com/matrix-org/synapse/blob/master/docs/workers.rst

An example config for the federation sender:

worker_app: synapse.app.federation_sender
# The replication listener on the synapse to talk to.
worker_replication_url: http://127.0.0.1:9092/_synapse/replication
worker_replication_host: 127.0.0.1
worker_replication_port: 9111
worker_daemonize: True
worker_pid_file: /home/matrix/synapse/federation_sender.pid
worker_log_config: /home/matrix/synapse/config/federation_sender_log_config.yaml

With the following added to the main config:

# As we now have a federation_sender worker
send_federation: false
listeners:
# ... existing listeners go here...
- port: 9092
bind_address: ''
type: http
tls: false
x_forwarded: false
resources:
- names: [replication]
compress: false
- port: 9111
bind_address: 127.0.0.1
type: replication |
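To actually run it, I believe the worker can be started directly as a python module alongside the main process; the paths below are just examples matching the config above, not taken from this thread, so adjust them to your installation:

# Start the federation sender worker, pointing it at both the main config
# and the worker config (example paths).
python -m synapse.app.federation_sender \
    -c /home/matrix/synapse/homeserver.yaml \
    -c /home/matrix/synapse/config/federation_sender.yaml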
Oh, but you're using sqlite rather than postgres. I wonder if there is a perverse query that is taking out a lock on sqlite? (I'd be tempted to try migrating to postgres, personally.) The other thing you could try is setting the log level of synapse.storage.SQL and synapse.storage.txn to DEBUG. |
As for switching to postgres: not yet. I've set synapse.storage.SQL and synapse.storage.txn to DEBUG, and I now mostly get these twisted error messages in between the few lines of INFO. The line number in them somehow doesn't seem right to me, and isn't what I'd have guessed should change in the logging. Synapse locked up again after some time. These are the last lines in the log which were not a twisted traceback:
|
Tried to dump and rebuild the sqlite database. |
Yup, it really shouldn't be necessary (though postgres is quite small). The back trace looks like the config doesn't have the default filters:

context:
  (): synapse.util.logcontext.LoggingContextFilter
  request: ""

Or are you using a different way of setting up logging? |
I'm using the default log config. What should I put in the request field? |
Hmm, that looks correct (a blank request is the default). Basically I'm worried that it's swallowing all the useful SQL logging. Are you seeing any log lines from synapse.storage.SQL, and if so, what are the last lines you see? |
Nope, no SQL queries logged at all, just the twisted errors. This is the complete log config:

version: 1
formatters:
  precise:
    format: '%(asctime)s - %(name)s - %(lineno)d - %(levelname)s - %(request)s- %(message)s'
filters:
  context:
    (): synapse.util.logcontext.LoggingContextFilter
    request: ""
handlers:
  file:
    class: logging.handlers.RotatingFileHandler
    formatter: precise
    filename: /var/synapse/.synapse/homeserver.log
    maxBytes: 104857600
    backupCount: 10
    filters: [context]
    level: INFO
  console:
    class: logging.StreamHandler
    formatter: precise
loggers:
  synapse:
    level: INFO
  synapse.storage.SQL:
    level: DEBUG
  synapse.storage.txn:
    level: DEBUG
root:
  level: INFO
  handlers: [file, console]

(All indentation is spaces, which I verified.) Most of the lines I see are still the twisted error messages. |
Ah, in the console handler you'll need a filters: [context] line as well. If that still doesn't fix the logging, I'd suggest for now just removing the console handler entirely.
|
Ok, I added the filters: [context] to the console handler:

handlers:
  file:
    class: logging.handlers.RotatingFileHandler
    formatter: precise
    filename: /var/synapse/.synapse/homeserver.log
    maxBytes: 104857600
    backupCount: 10
    filters: [context]
    level: INFO
  console:
    class: logging.StreamHandler
    formatter: precise
    filters: [context]

That seems to have stopped all the twisted error messages. \o/ Here are the last lines before synapse locked up with the new logging settings. |
Woo! Also, the reason it's not logging the SQL is because you have the file handler capped at level: INFO, which drops the DEBUG lines. Can you run select count(*) from device_federation_outbox; against your database? |
% sqlite3 homeserver.db
SQLite version 3.20.1 2017-08-24 16:21:36
Enter ".help" for usage hints.
sqlite> select count(*) from device_federation_outbox;
4

I've now set the file handler's level to DEBUG as well. |
Could you also have a look at the following, please?

select room_id, count(*) c from event_forward_extremities group by room_id order by c desc limit 20;

If there are rooms with a large number of forward extremities then that is also known to cause issues. |
sqlite> select room_id, count(*) c from event_forward_extremities group by room_id order by c desc limit 20;
!******************:matrix.org|26
!******************:darkfasel.net|4
!******************:matrix.xwiki.com|4
!******************:matrix.org|3
!******************:chat.weho.st|2
!******************:maclemon.at|1
!******************:matrix.org|1
!******************:matrix.org|1
!******************:maclemon.at|1
!******************:maclemon.at|1
!******************:matrix.org|1
!******************:matrix.org|1
!******************:matrix.org|1
!******************:matrix.org|1
!******************:maclemon.at|1
!******************:maclemon.at|1
!******************:darkfasel.net|1
!******************:matrix.org|1
!******************:maclemon.at|1
!******************:maclemon.at|1 |
After my last update to 0.23.1 I have a similar problem with a PostgreSQL database; the last entry in the logfile after the crash is always something like this:
Because of this issue I'm not able to use my synapse server anymore. Before the update the server ran for 3 months without any problem. |
I'm still on py27-matrix-synapse-0.23.0, since the FreeBSD packages haven't been updated yet. (From experience, I guess they will be updated soon.) The crashing/locking issue has worsened for me with 0.23.0. Forward extremities are still only a few:
|
we're in the middle of a bunch of synapse bughunting currently, and i'd really like to get to the bottom of this. @hamber-dick is your synapse being OOM killed? @MacLemon can you share the logs of the current crashing/locking issue with me over Matrix again please? I assumed things were okay on your side atm :( |
Sure, I'll collect a few logs and bundle them up. (Situation hadn't changed, otherwise I'd have posted success, cheers and happiness here already.) :-) |
Yes, my synapse server is down at the moment. It crashes 5-15 minutes after I start it, so I then have to kill -9 it. I'm not able to restart and kill it every 5 minutes ;-) |
@hamber-dick - i'm investigating, although I suggest you change the access tokens for any ASes on that homeserver; it's safest to share HS logs directly. My best bet for why your server (and @MacLemon's) is wedging and needing a kill -9 is that it's running out of available FDs due to the amount of concurrent federation traffic going on (or just having a low maximum FD limit set). Normally this should be obvious in the logs with a bunch of EMFILE errors, but it's possible it's just making twisted wedge solid. What is your FD limit (ulimit -n)? There isn't anything obvious otherwise that i can see in the logs. |
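As an aside, a quick way to check the FD situation is sketched below; the pid-file path is just an example, not taken from this thread:

# Soft and hard file-descriptor limits of the shell that starts synapse
# (synapse's soft_file_limit setting can raise the soft limit at startup):
ulimit -Sn
ulimit -Hn

# Number of FDs the running synapse process currently holds
# (example pid-file path - adjust to your installation):
lsof -p "$(cat /home/matrix/synapse/homeserver.pid)" | wc -l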
@hamber-dick please can you also list your extremities (as per #2395 (comment)) |
Here are my extremities: !:matrix.org | 14 |
thanks. the 14 and 12 ones might be complicating things a bit; if you speak a bit in those rooms it may get better. I’d like to know your FD status though |
Those two rooms are Matrix HQ and Riot. The big rooms were always the reason for the crash of my HS. Because of that I have tried to leave them, but the HS still syncs the rooms, I think. Is there a possibility to delete them completely from my HS? |
Okay - i suspect @hamber-dick's problem can be mitigated by #2490 which just merged to develop. I'd still like to understand what everyone's FD counts are doing though when the app wedges.... |
I'm not sure if I understand it correctly, so here are my results. During the crash: lsof -p <synapse-pid> | wc -l : 600 to 636 |
I set the soft_file_limit in homeserver.yaml to 400 and have now had no crash for 2 hours. But I can't load the room directory from matrix.org (timeout), and the extremities are increasing for rooms with no member from my homeserver (matrix.org). |
@hamber-dick thanks for helping check. Can you upgrade to the develop branch so you can see if PR #2490 fixes the extremities problem? The inability to load the room directory from matrix.org is probably #3327 which is currently unsolved. |
...that said, setting the soft_file_limit in homeserver.yaml to 400 is a bad idea if you were seeing it requiring 600-700 FDs. I would strongly recommend keeping soft_file_limit at the default value to make things easier to debug (which should put it at 1024, based on your ulimit). |
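On the upgrade-to-develop point: for a pip-based install, the upgrade is roughly the following sketch; the virtualenv path is just an example, and FreeBSD package users would instead need an updated port or a switch to a pip install:

# Inside the virtualenv that synapse runs from (example path):
. /home/matrix/synapse/env/bin/activate
pip install --upgrade https://github.com/matrix-org/synapse/tarball/develop
synctl restart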
Yesterday I updated synapse to develop, and the extremities are not growing anymore. The maximum is 2. |
@MacLemon can you please try the same trick (i.e. upgrading to develop to see if that resolves things) - also if you can confirm the FD health when it wedges then we might actually be able to close this up! (Also, if either of you happen to be anywhere near Berlin atm, be aware that this time next week we're organising an IRL meetup in case we can provide in person support: https://matrix.org/blog/2017/10/12/announcing-matrix-meetup-in-berlin-october-19th.) |
Ok, update from my instance: FreeBSD 11.1-p6
The number of file descriptors on FreeBSD is best fetched via procstat(1) (as opposed to lsof(8)). Directly after starting synapse it is 55, rising to roughly 180 during runtime.
I've put fetching of all these values, as well as the forward extremities, into my autokiller script, which checks whether synapse is still responding properly by querying it. |
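For reference on FreeBSD, the FD count can be grabbed with something like the sketch below; the pid-file path is an example, not taken from this thread:

# Count the open file descriptors of the synapse process via procstat(1).
# procstat -f prints one line per descriptor plus a header line.
PID=$(cat /var/synapse/.synapse/homeserver.pid)
procstat -f "$PID" | wc -l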
Even with the latest synapse 0.29, I still have the hanging issue on a regular basis (quite a few times per day). Each time, synapse suddenly eats up 100% of the CPU. I'll try to provide logs. |
Here are some logs I captured during the hang: https://framadrop.org/r/AwCAgUHOSB#aZwziubx9AamEbUwU30dc+RYOrmdg7QZJ2i5jHf6TM0= |
@Ezwen Might it be possible that synapse faces memory issues? |
@krombel interesting, thanks for sharing! if that's the case, then I suppose there is not much I can do :(. Unfortunately I cannot easily add memory to the physical host. |
Because the maintainer of the FreeBSD port still hadn't updated the package, I thought about what else I could do. I had wanted to upgrade from SQLite to Postgres for a long time anyway, and took the time to do so following the migration guide, which seems to have worked fine. What a difference in performance. Not what I had expected… Snarky remarks aside… I've been DEBUG logging this thing for a long time now. I found out that at some point the python process hangs, getting stuck and not doing anything anymore. Then the kernel event pipe from the network starts building up until it reaches its soft limit of 50, then goes over that to the hard limit of about ≈76. This is usually when my cron job that checks it also kills it and kicks it up again. It seems to consume between 125MB and 195MB RAM (≈300-380MB virtual) at that time, hitting 100% on a single core. This is the point where it won't recover anymore and I must kill -9 it. From the logs it still seems to hang when trying to send out messages. |
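For anyone else making the same SQLite-to-Postgres move, the port is roughly the following sketch; file names and paths are examples, and the upgrade notes for your synapse version are the authoritative reference:

# Stop synapse first, then port a snapshot of the sqlite database into postgres.
cp /var/synapse/.synapse/homeserver.db /var/synapse/.synapse/homeserver.db.snapshot
synapse_port_db --sqlite-database /var/synapse/.synapse/homeserver.db.snapshot \
    --postgres-config /var/synapse/.synapse/homeserver-postgres.yaml
# Then point the database section of homeserver.yaml at postgres and start synapse again.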
Sooo, after a few hours knee-deep in gdb and the twisted source, I found something. Have a look :)
|
Great detective work!
This sounds depressingly familiar: #2800 (comment) |
@hawkowl might this be of relevance to your interests? |
(synapse trying to talk ipv6 even when there isn't ipv6 reminds me of #2850, although i think that was a thinko in synapse rather than twisted) |
It's actually twisted trying to resolve via the IPv6 DNS server, which then fails to bind to the non-existent IPv6 interface; it retries this until infinity instead of having any break condition. |
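A quick way to check whether a host is in this situation is sketched below; the nameserver address is a placeholder, not one from this thread:

# Is the resolver pointed at an IPv6 nameserver?
grep nameserver /etc/resolv.conf

# Does the host actually have a global IPv6 address and an IPv6 default route?
ifconfig | grep inet6
netstat -rn -f inet6 | grep -i default

# If the nameserver is IPv6-only and there is no IPv6 route, direct queries to
# it fail or time out - the condition twisted then keeps retrying forever.
drill matrix.org @2001:db8::53 AAAA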
I vaguely remember this getting fixed ages ago in Twisted - can anyone confirm? |
Summary:
Since 0.22 synapse hangs and crashes so often it has become mostly unusable.
Steps to Reproduce:
Run synapse 0.22 on FreeBSD 11.0p11. Wait a few minutes on a server with exactly two user accounts, federated with matrix.org. Synapse hangs and Riot clients show that it has lost connection to the server.
When I then try to restart synapse with synctl or service(8), it tells me it's waiting for the PID, and it waits forever for synapse to quit. I must kill -9 the process before I can restart the service again.
Expected Results:
synapse shouldn't hang/crash multiple times a day. At least it should cleanly restart when told to.
Actual Results:
synapse hangs, often after a few seconds/minutes. Trying to restart it doesn't work, since it waits forever for the process to quit. Manual kill -9 works. (Yes, -9 is necessary to successfully terminate the process.)
Fully restarting the FreeBSD jail works as well.
Regression:
Synapse was a lot more stable with 0.18, 0.19. With 0.21 it became crashy, requiring restarts most days. With 0.22 it crashes multiple times a day, sometimes only lasting 30s before it hangs.
Automatically restarting synapse with a cron job doesn't work either, since it never finishes terminating. Having a kill -9 in a cron job is unacceptably ugly and should not be necessary to keep a daemon running smoothly. (Especially with next to no load at all: TWO (in numbers, 2) user accounts on that server.)
Notes:
I suspect this to be related to low-memory conditions, but cannot tell for sure.
Are there any logs I should reproduce here to aid tracking down the problem? If so, which ones?