This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Synapse completely hangs every few minutes, when configured with an IPv6 DNS server but no IPv6 network #2395

Open
MacLemon opened this issue Jul 29, 2017 · 44 comments


@MacLemon

Summary:
Since 0.22 synapse hangs and crashes so often it has become mostly unusable.

Steps to Reproduce:
Run synapse 0.22 on FreeBSD 11.0p11. Wait a few minutes on a server with exactly two user accounts, federated with matrix.org. Synapse hangs and Riot clients show that it has lost connection to the server.

When trying to restart synapse with synctl or service(8), it tells me it's waiting for PID . It waits forever for synapse to quit; I must kill -9 the process before I can restart the service again.

Expected Results:
synapse shouldn't hang/crash multiple times a day. At least it should cleanly restart when told to.

Actual Results:
synapse hangs, often after a few seconds/minutes. Trying to restart it doesn't work, since it waits forever for the process to quit. Manual kill -9 works. (Yes, -9 is necessary to successfully terminate the process.)
Fully restarting the FreeBSD jail works as well.

Regression:
Synapse was a lot more stable with 0.18, 0.19. With 0.21 it became crashy, requiring restarts most days. With 0.22 it crashes multiple times a day, sometimes only lasting 30s before it hangs.
Automatically restarting synapse with a cron job doesn't work either, since it never finishes terminating. Having a kill -9 in a cron job is unacceptably ugly and must not be necessary to keep a daemon running smoothly. (Especially with next to no load at all. TWO (in numbers, 2) user accounts on that server.)

Notes:
I suspect this to be related to low-memory conditions, but cannot tell for sure.

Are there any logs I should reproduce here to aid in tracking down the problem? If so, which ones?

@MacLemon
Author

MacLemon commented Aug 31, 2017

Four homeserver logs from start to complete lockup each.

from_launch_to_crash_05.txt
from_launch_to_crash_44.txt
from_launch_to_crash_53.txt
from_launch_to_crash_56.txt

My random intuition tells me that the lockups could be related to hostnames returning NX.

@erikjohnston
Member

Those logs do make it look like it blows up when it starts sending out transactions, but otherwise there isn't anything particularly interesting in them :/

We've seen some "weird" behaviour on matrix.org relating to federation sending and memory usage, so I wouldn't be that surprised if this was something similar. Though I've never seen it lock up before.

My suggestion at this point would be to try running synapse with a split out worker for the federation sending, this should then tell us whether it is that or not. The generic docs for workers are: https://github.com/matrix-org/synapse/blob/master/docs/workers.rst

An example config for federation sender:

worker_app: synapse.app.federation_sender

# The replication listener on the synapse to talk to.
worker_replication_url: http://127.0.0.1:9092/_synapse/replication
worker_replication_host: 127.0.0.1
worker_replication_port: 9111

worker_daemonize: True
worker_pid_file: /home/matrix/synapse/federation_sender.pid
worker_log_config: /home/matrix/synapse/config/federation_sender_log_config.yaml

With the following added to the main config:

# As we now have a federation_sender worker
send_federation: false

listeners:
  # ... existing listeners go here...

  - port: 9092
    bind_address: ''
    type: http
    tls: false
    x_forwarded: false
    resources:
      - names: [replication]
        compress: false
  - port: 9111
    bind_address: 127.0.0.1
    type: replication

@erikjohnston
Member

Oh, but you're using sqlite rather than postgres. I wonder if there is a perverse query that is taking out a lock on sqlite? (I'd be tempted to try migrating to postgres personally.)

The other thing you could try doing is setting the log level of synapse.storage.txn and synapse.storage.SQL to DEBUG (which logs the SQL going back and forth), though it's incredibly spammy.

@MacLemon
Author

MacLemon commented Sep 2, 2017

As for switching to postgres, I'm all for it, though I'm hesitant to actually do it for the total of two (in words two) users on that server, since I'm not sure if the box can handle an additional daemon. If that actually is key to finding a fix for the problem, I'll give it a shot of course.

I've set synapse.storage.SQL: to level: DEBUG as well as added synapse.storage.txn: with level: DEBUG.

I now mostly get these twisted error messages in between the few lines of INFO. The number after - PUT- isn't always there. When it is, it seems to increase from time to time.

2017-09-02 18:01:09,885 - twisted - 131 - ERROR - PUT-17- KeyError: 'request'
2017-09-02 18:01:09,887 - twisted - 131 - ERROR - PUT-17- Logged from file _base.py, line 104
2017-09-02 18:01:09,890 - twisted - 131 - ERROR - PUT-17- Traceback (most recent call last):
2017-09-02 18:01:09,892 - twisted - 131 - ERROR - PUT-17-   File "/usr/local/lib/python2.7/logging/__init__.py", line 861, in emit
2017-09-02 18:01:09,894 - twisted - 131 - ERROR - PUT-17-     msg = self.format(record)
2017-09-02 18:01:09,895 - twisted - 131 - ERROR - PUT-17-   File "/usr/local/lib/python2.7/logging/__init__.py", line 734, in format
2017-09-02 18:01:09,897 - twisted - 131 - ERROR - PUT-17-     return fmt.format(record)
2017-09-02 18:01:09,900 - twisted - 131 - ERROR - PUT-17-   File "/usr/local/lib/python2.7/logging/__init__.py", line 469, in format
2017-09-02 18:01:09,902 - twisted - 131 - ERROR - PUT-17-     s = self._fmt % record.__dict__

which somehow doesn't seem right to me and, I guess, isn't what the logging change was supposed to produce.

Synapse locked up again after some time. The last lines in the log that were not a twisted ERROR block like the above were these:

2017-09-02 18:16:18,436 - twisted - 131 - INFO - - DNSDatagramProtocol starting on 56234
2017-09-02 18:16:18,455 - twisted - 131 - INFO - - Starting protocol <twisted.names.dns.DNSDatagramProtocol object at 0x80f1705d0>
2017-09-02 18:16:18,461 - synapse.http.endpoint - 248 - INFO - - Connecting to matrix.xwiki.com:8448
2017-09-02 18:16:18,490 - synapse.federation.transaction_queue - 562 - INFO - - TX [telekollektiv.org] {1504375506534} Sending transaction [1504375506534], (PDUs: 0, EDUs: 1, failures: 0)
2017-09-02 18:16:18,511 - synapse.http.outbound - 154 - INFO - - {PUT-O-10} [telekollektiv.org] Sending request: PUT matrix://telekollektiv.org/_matrix/federation/v1/send/1504375506534/
2017-09-02 18:16:18,523 - twisted - 131 - INFO - - DNSDatagramProtocol starting on 2259
2017-09-02 18:16:18,541 - twisted - 131 - INFO - - Starting protocol <twisted.names.dns.DNSDatagramProtocol object at 0x80f170f50>
2017-09-02 18:16:18,569 - synapse.federation.transaction_queue - 562 - INFO - - TX [matrix.freedombone.net] {1504375506535} Sending transaction [1504375506535], (PDUs: 0, EDUs: 1, failures: 0)
2017-09-02 18:16:18,590 - synapse.http.outbound - 154 - INFO - - {PUT-O-11} [matrix.freedombone.net] Sending request: PUT matrix://matrix.freedombone.net/_matrix/federation/v1/send/1504375506535/
2017-09-02 18:16:18,621 - twisted - 131 - INFO - - DNSDatagramProtocol starting on 30969
2017-09-02 18:16:18,641 - twisted - 131 - INFO - - Starting protocol <twisted.names.dns.DNSDatagramProtocol object at 0x80f1774d0>
2017-09-02 18:16:18,648 - twisted - 131 - INFO - - Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x80f149f80>
2017-09-02 18:16:18,672 - synapse.federation.transaction_queue - 562 - INFO - - TX [t2bot.io] {1504375506536} Sending transaction [1504375506536], (PDUs: 0, EDUs: 1, failures: 0)
2017-09-02 18:16:18,695 - synapse.http.outbound - 154 - INFO - - {PUT-O-12} [t2bot.io] Sending request: PUT matrix://t2bot.io/_matrix/federation/v1/send/1504375506536/
2017-09-02 18:16:18,706 - twisted - 131 - INFO - - DNSDatagramProtocol starting on 1431
2017-09-02 18:16:18,726 - twisted - 131 - INFO - - Starting protocol <twisted.names.dns.DNSDatagramProtocol object at 0x80f177b90>
2017-09-02 18:16:18,754 - twisted - 131 - INFO - - Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x80f151ef0>
2017-09-02 18:16:18,778 - twisted - 131 - INFO - - (UDP Port 46439 Closed)
2017-09-02 18:16:18,799 - twisted - 131 - INFO - - Stopping protocol <twisted.names.dns.DNSDatagramProtocol object at 0x80f1475d0>
2017-09-02 18:16:18,821 - twisted - 131 - INFO - - (UDP Port 18094 Closed)
2017-09-02 18:16:18,824 - twisted - 131 - INFO - - Stopping protocol <twisted.names.dns.DNSDatagramProtocol object at 0x80f147bd0>
2017-09-02 18:16:18,847 - twisted - 131 - INFO - - (UDP Port 64980 Closed)
2017-09-02 18:16:18,870 - twisted - 131 - INFO - - Stopping protocol <twisted.names.dns.DNSDatagramProtocol object at 0x80f15c4d0>
2017-09-02 18:16:18,893 - twisted - 131 - INFO - - (UDP Port 36136 Closed)
2017-09-02 18:16:18,915 - twisted - 131 - INFO - - Stopping protocol <twisted.names.dns.DNSDatagramProtocol object at 0x80f15c590>

@MacLemon
Author

MacLemon commented Sep 3, 2017

Tried to dump and rebuild the sqlite database as suggested in #2395. This also did not remedy the lockups.
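
For reference, the dump/rebuild boiled down to roughly this (simplified sketch, with synapse stopped first):

# Dump the live database to plain SQL, then rebuild a fresh file from it
sqlite3 homeserver.db .dump > homeserver.sql
sqlite3 homeserver.rebuilt.db < homeserver.sql

# Swap the rebuilt database into place, keeping the old one as a backup
mv homeserver.db homeserver.db.bak
mv homeserver.rebuilt.db homeserver.db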

@erikjohnston
Member

As for switching to postgres, I'm all for it, though I'm hesitant to actually do it for the total of two (in words two) users on that server, since I'm not sure if the box can handle an additional daemon.

Yup, it really shouldn't be necessary (though postgres is quite small).

The backtrace looks like the log config doesn't have the default request field set:

filters:
    context:
        (): synapse.util.logcontext.LoggingContextFilter
        request: ""

or are you using a different way of setting up logging?

@MacLemon
Author

MacLemon commented Sep 5, 2017

I'm using the <hostname.example.com>.log.config as it was autocreated by the setup process. I'm logging to homeserver.log inside ~synapse. Nothing fancy going on there. I just retried in a testing directory which creates the <hostname.example.com>.log.config part like so:

filters:
  context:
    (): synapse.util.logcontext.LoggingContextFilter
    request: ""

What should I put in the request: section?

@erikjohnston
Member

Hmm, that looks correct (a blank request key is fine). And there is filters: [context] in all your handlers? That context filter basically adds an empty string as the request key for all log lines that don't otherwise have a request set, so I'm very confused why it's complaining.

Basically I'm worried that it's swallowing all the useful SQL logging. Are you seeing any log lines like:

2017-09-05 14:46:29,988 - synapse.storage.SQL - 97 - DEBUG - GET-5 - [SQL] {get_users_in_room-1e9} SELECT m.user_id FROM room_memberships as m ....
2017-09-05 14:46:29,988 - synapse.storage.SQL - 104 - DEBUG - GET-5 - [SQL values] {get_users_in_room-1e9} ...
2017-09-05 14:46:29,992 - synapse.storage.SQL - 121 - DEBUG - GET-5 - [SQL time] {get_users_in_room-1e9} 3.506348

and if so what are the last lines you see?

@MacLemon
Author

MacLemon commented Sep 6, 2017

Nope, no SQL queries logged at all, just the INFO statements we already had, plus tons of twisted errors.

This is the complete maclemon.at.log.config

version: 1

formatters:
  precise:
   format: '%(asctime)s - %(name)s - %(lineno)d - %(levelname)s - %(request)s- %(message)s'

filters:
  context:
    (): synapse.util.logcontext.LoggingContextFilter
    request: ""

handlers:
  file:
    class: logging.handlers.RotatingFileHandler
    formatter: precise
    filename: /var/synapse/.synapse/homeserver.log
    maxBytes: 104857600
    backupCount: 10
    filters: [context]
    level: INFO
  console:
    class: logging.StreamHandler
    formatter: precise

loggers:
    synapse:
        level: INFO
    synapse.storage.SQL:
        level: DEBUG
    synapse.storage.txn:
        level: DEBUG

root:
    level: INFO
    handlers: [file, console]

(All indentations are spaces which I verified.)

Most twisted errors look like this:

2017-09-06 09:36:23,403 - twisted - 131 - ERROR - - Traceback (most recent call last):
2017-09-06 09:36:23,404 - twisted - 131 - ERROR - -   File "/usr/local/lib/python2.7/logging/__init__.py", line 861, in emit
2017-09-06 09:36:23,406 - twisted - 131 - ERROR - -     msg = self.format(record)
2017-09-06 09:36:23,407 - twisted - 131 - ERROR - -   File "/usr/local/lib/python2.7/logging/__init__.py", line 734, in format
2017-09-06 09:36:23,409 - twisted - 131 - ERROR - -     return fmt.format(record)
2017-09-06 09:36:23,411 - twisted - 131 - ERROR - -   File "/usr/local/lib/python2.7/logging/__init__.py", line 469, in format
2017-09-06 09:36:23,412 - twisted - 131 - ERROR - -     s = self._fmt % record.__dict__
2017-09-06 09:36:23,414 - twisted - 131 - ERROR - - KeyError: 'request'
2017-09-06 09:36:23,416 - twisted - 131 - ERROR - - Logged from file _base.py, line 235
2017-09-06 09:36:23,418 - twisted - 131 - ERROR - - Traceback (most recent call last):
2017-09-06 09:36:23,420 - twisted - 131 - ERROR - -   File "/usr/local/lib/python2.7/logging/__init__.py", line 861, in emit
2017-09-06 09:36:23,421 - twisted - 131 - ERROR - -     msg = self.format(record)
2017-09-06 09:36:23,423 - twisted - 131 - ERROR - -   File "/usr/local/lib/python2.7/logging/__init__.py", line 734, in format
2017-09-06 09:36:23,425 - twisted - 131 - ERROR - -     return fmt.format(record)
2017-09-06 09:36:23,426 - twisted - 131 - ERROR - -   File "/usr/local/lib/python2.7/logging/__init__.py", line 469, in format
2017-09-06 09:36:23,427 - twisted - 131 - ERROR - -     s = self._fmt % record.__dict__
2017-09-06 09:36:23,429 - twisted - 131 - ERROR - - KeyError: 'request'
2017-09-06 09:36:23,430 - twisted - 131 - ERROR - - Logged from file _base.py, line 292
2017-09-06 09:36:23,502 - synapse.handlers.typing - 78 - INFO - - Checking for typing timeouts
2017-09-06 09:36:28,420 - twisted - 131 - ERROR - - Traceback (most recent call last):
2017-09-06 09:36:28,422 - twisted - 131 - ERROR - -   File "/usr/local/lib/python2.7/logging/__init__.py", line 861, in emit
2017-09-06 09:36:28,424 - twisted - 131 - ERROR - -     msg = self.format(record)
2017-09-06 09:36:28,425 - twisted - 131 - ERROR - -   File "/usr/local/lib/python2.7/logging/__init__.py", line 734, in format
2017-09-06 09:36:28,427 - twisted - 131 - ERROR - -     return fmt.format(record)
2017-09-06 09:36:28,428 - twisted - 131 - ERROR - -   File "/usr/local/lib/python2.7/logging/__init__.py", line 469, in format
2017-09-06 09:36:28,430 - twisted - 131 - ERROR - -     s = self._fmt % record.__dict__
2017-09-06 09:36:28,432 - twisted - 131 - ERROR - - KeyError: 'request'

Most of the lines containing Logged from file _base.py, line reference line 292 or 235; less often they also reference 97, 104, or 121. Usually they come in blocks of 292+235, in that order and multiple times, with the occasional appearance of 97+104+121 followed by one or more of the former blocks.

@erikjohnston
Member

Ah, in the console handler you'll need a filters: [context] too if you use the precise formatter. Though I have no idea why that would cause issues in the file logger...

If that still doesn't fix the logging, I'd suggest for now just removing the %(request)s from the formatter, so just something like:

format: '%(asctime)s - %(name)s - %(lineno)d - %(levelname)s - - %(message)s'

@MacLemon
Author

MacLemon commented Sep 7, 2017

Ok, I added the filters [context] part and the corresponding section now looks like this:

handlers:
  file:
    class: logging.handlers.RotatingFileHandler
    formatter: precise
    filename: /var/synapse/.synapse/homeserver.log
    maxBytes: 104857600
    backupCount: 10
    filters: [context]
    level: INFO
  console:
    class: logging.StreamHandler
    formatter: precise
    filters: [context]

That seems to have stopped all the twisted error messages from being logged. \o/

Here are the last lines before synapse locked up, with the new logging settings.

synapse lockup 2017-09-08.txt

@erikjohnston
Member

Woo! Also, the reason it's not logging the SQL is that you have level: INFO set on the file handler. (Logging should not be this difficult...)

Can you run select count(*) from device_federation_outbox; on your database please? I'm wondering if it's getting stuck trying to send a huge number of messages, or something.

@MacLemon
Author

MacLemon commented Sep 9, 2017

% sqlite3 homeserver.db
SQLite version 3.20.1 2017-08-24 16:21:36
Enter ".help" for usage hints.
sqlite> select count(*) from device_federation_outbox;
4

I've now set level: DEBUG and will add more logs when it locks up again.

@erikjohnston
Member

Could you also have a look at:

select room_id, count(*) c from event_forward_extremities group by room_id order by c desc limit 20;

please? If there are rooms with a large number of forward extremities then that is also known to cause issues.

@MacLemon
Author

sqlite> select room_id, count(*) c from event_forward_extremities group by room_id order by c desc limit 20;
!******************:matrix.org|26
!******************:darkfasel.net|4
!******************:matrix.xwiki.com|4
!******************:matrix.org|3
!******************:chat.weho.st|2
!******************:maclemon.at|1
!******************:matrix.org|1
!******************:matrix.org|1
!******************:maclemon.at|1
!******************:maclemon.at|1
!******************:matrix.org|1
!******************:matrix.org|1
!******************:matrix.org|1
!******************:matrix.org|1
!******************:maclemon.at|1
!******************:maclemon.at|1
!******************:darkfasel.net|1
!******************:matrix.org|1
!******************:maclemon.at|1
!******************:maclemon.at|1

@hamber-dick
Contributor

After my last update to 23.1 I have a similar problem with a PostgreSQL database; the last entry in the logfile after the crash is always something like this:

2017-10-09 11:46:15,487 - twisted - 131 - INFO - - DNSDatagramProtocol starting on 19479
2017-10-09 11:46:15,487 - twisted - 131 - INFO - - Starting protocol <twisted.names.dns.DNSDatagramProtocol object at 0x7fe469163850>
2017-10-09 11:46:15,487 - twisted - 131 - INFO - - DNSDatagramProtocol starting on 32892
2017-10-09 11:46:15,488 - twisted - 131 - INFO - - Starting protocol <twisted.names.dns.DNSDatagramProtocol object at 0x7fe469163090>
2017-10-09 11:46:15,488 - twisted - 131 - INFO - - DNSDatagramProtocol starting on 3688
2017-10-09 11:46:15,488 - twisted - 131 - INFO - - Starting protocol <twisted.names.dns.DNSDatagramProtocol object at 0x7fe469163810>

Because of this issue I'm not able to use my synapse server anymore. Before the update the server ran for 3 months without any problem.

@MacLemon
Author

MacLemon commented Oct 9, 2017

I'm still on py27-matrix-synapse-0.23.0, since the FreeBSD packages haven't been updated yet. (From experience, I guess they will be updated soon.) The crashing/locking issue has worsened for me with 0.23.0.
Synapse now hardly runs for more than a few hours, mostly locking up after a few minutes already. I'll post an update once I've found time to dig through the logs to see if there's anything helpful in them this time.

Forward extremities are still only a few:

!******************:matrix.org|21
!******************:darkfasel.net|5
!******************:matrix.org|5
!******************:matrix.xwiki.com|4
!******************:matrix.org|2
!******************:chat.weho.st|2

@ara4n
Member

ara4n commented Oct 9, 2017

we're in the middle of a bunch of synapse bughunting currently, and i'd really like to get to the bottom of this. @hamber-dick is your synapse being OOM killed?

@MacLemon can you share the logs of the current crashing/locking issue with me over Matrix again please? I assumed things were okay on your side atm :(

@MacLemon
Author

MacLemon commented Oct 9, 2017

Sure, I'll collect a few logs and bundle them up. (Situation hadn't changed, otherwise I'd have posted success, cheers and happiness here already.) :-)

@hamber-dick
Contributor

Yes, my synapse server is killed at the moment. It crashes 5-15 minutes after I start it, so I have to kill -9 it then. I'm not able to restart and kill it every 5 minutes ;-)

@matrix-org matrix-org deleted a comment from hamber-dick Oct 9, 2017
@ara4n
Member

ara4n commented Oct 9, 2017

@hamber-dick - i'm investigating, although I suggest you change the access tokens for any ASes on that homeserver; it's safest to share HS logs directly.

My best bet for why your server (and @MacLemon's) is wedging and needing a kill -9 is that it's running out of available FDs due to the amount of concurrent federation traffic going on (or just having a low maximum FD limit set). Normally this should be obvious in the logs with a bunch of EMFILE errors, but it's possible it's just making twisted wedge solid. What is your ulimit -n, and can you do an lsof whilst it's wedged to see whether it's hitting the limit?
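
Something like this while it's wedged would tell us (sketch; substitute whatever PID file or path your setup uses):

# Soft limit on open files for the user synapse runs as
ulimit -n

# Number of FDs the wedged synapse process is actually holding
lsof -p "$(cat homeserver.pid)" | wc -l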

There isn't anything obvious otherwise that i can see in the logs.

@ara4n
Member

ara4n commented Oct 9, 2017

@hamber-dick please can you also list your extremities (as per #2395 (comment))

@hamber-dick
Contributor

Here are my extremities:

!******************:matrix.org | 14
!******************:matrix.org | 12
!******************:matrix.org | 3
!******************:matrix.org | 2
!******************:tchncs.de | 1
!******************:chat.alko-sbk.de | 1
!******************:matrix.org | 1
!******************:chat.alko-sbk.de | 1
!******************:chat.alko-sbk.de | 1
!******************:chat.alko-sbk.de | 1
!******************:chat.alko-sbk.de | 1
!******************:chat.alko-sbk.de | 1
!******************:chat.alko-sbk.de | 1
!******************:chat.alko-sbk.de | 1
!******************:chat.alko-sbk.de | 1
!******************:chat.alko-sbk.de | 1
!******************:matrix.org | 1
!******************:matrix.org | 1
!******************:chat.alko-sbk.de | 1
!******************:tchncs.de | 1

@ara4n
Member

ara4n commented Oct 9, 2017

thanks. the 14 and 12 ones might be complicating things a bit; if you speak a bit in those rooms it may get better. I’d like to know your FD status though

@hamber-dick
Contributor

Those two rooms are Matrix HQ and Riot. The big rooms were always the reason for the crash of my HS. Because of that I have tried to leave them, but the HS still syncs the rooms, I think. Is there a possibility to delete them completely from my HS?

@ara4n
Member

ara4n commented Oct 10, 2017

Okay - i suspect @hamber-dick's problem can be mitigated by #2490 which just merged to develop. I'd still like to understand what everyone's FD counts are doing though when the app wedges....

@hamber-dick
Contributor

hamber-dick commented Oct 10, 2017

I'm not sure if I understand it correctly, so here are my results:
ulimit -n : 1024

During crash:

lsof -p synapse-pid | wc -l : 600 to 636
lsof | wc -l : 27458
ls -l /proc/synapse-pid/fd | wc -l : 565
cat /proc/synapse-pid/limits : Max open files : 4096

@hamber-dick
Contributor

I set the soft_file_limit in homeserver.yaml to 400; since then I have had no crash for 2 hours. But I can't load the room directory from matrix.org (timeout), and the extremities are increasing for rooms with no member from my homeserver (matrix.org).

@ara4n
Member

ara4n commented Oct 11, 2017

@hamber-dick thanks for helping check. lsof -p $pid | wc -l is the right command, and if your max FDs was 1024 (or 4096), 636 won't have been the problem. So something else is wedging.

Can you upgrade to the develop branch so you can see if PR #2490 fixes the extremities problem? The inability to load the room directory from matrix.org is probably #3327 which is currently unsolved.

@ara4n
Member

ara4n commented Oct 11, 2017

...that said, setting the soft_file_limit in homeserver.yaml to 400 is a bad idea if you were seeing it requiring 600-700 FDs. I would strongly recommend keeping soft_file_limit at the default value to make things easier to debug (which should put it at 1024, based on your ulimit -n output)

@hamber-dick
Contributor

Yesterday I updated synapse to develop, and the extremities are not growing anymore; the maximum is 2.
Also I increased the soft_file_limit to 1024, and there has been no crash for 2 days. So for me all problems seem to be solved.

@ara4n
Member

ara4n commented Oct 12, 2017

@MacLemon can you please try the same trick (i.e. upgrading to develop to see if that resolves things) - also if you can confirm the FD health when it wedges then we might actually be able to close this up!
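
If you're on a pip install, something along these lines should get you onto develop (untested sketch; adjust for your virtualenv or user install):

# Install the current develop branch straight from GitHub
pip install --upgrade "git+https://github.com/matrix-org/synapse.git@develop"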

(Also, if either of you happen to be anywhere near Berlin atm, be aware that this time next week we're organising an IRL meetup in case we can provide in person support: https://matrix.org/blog/2017/10/12/announcing-matrix-meetup-in-berlin-october-19th.)

@MacLemon
Author

MacLemon commented Jan 5, 2018

Ok, update from my instance:

FreeBSD 11.1-p6
py27-matrix-synapse-0.25.0 (via ports)

% ulimit -n
116064

The number of file descriptors on FreeBSD is best fetched via procstat(1) (as opposed to lsof(8)):
~/.synapse % procstat -f `cat homeserver.pid` | wc -l

Directly after starting synapse: 55; it rises to roughly 180 during runtime.

% sysctl kern.maxfiles
kern.maxfiles: 128960

homeserver.yaml has the default setting of soft_file_limit: 1024.

I've put fetching of all these values, as well as the forward extremities, into my autokiller script, which checks whether synapsed is still responding properly by querying https://matrix.maclemon.at/_matrix/client/versions (like Riot.im does). This will insert them into homeserver.log every time it restarts synapsed. That way we should get these values with time context in the same log file.
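
The liveness check itself boils down to something like this (simplified sketch; the real script also logs the extremities and timestamps):

# Probe the client API the same way Riot.im does; a wedged synapse won't answer in time.
if ! curl --silent --fail --max-time 10 https://matrix.maclemon.at/_matrix/client/versions > /dev/null; then
    # Record the current FD count for later correlation, then hard-kill and restart.
    procstat -f "$(cat homeserver.pid)" | wc -l >> homeserver.log
    kill -9 "$(cat homeserver.pid)"
    # restart via service(8)/synctl goes here
fi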

@Ezwen

Ezwen commented May 17, 2018

Even with the latest synapse 0.29, I still have the hanging issue on a regular basis (quite a large number of times per day). Each time, synapse suddenly eats up 100% of the CPU. I'll try to provide logs.

@Ezwen

Ezwen commented May 17, 2018

Here are some logs I captured during the hang: https://framadrop.org/r/AwCAgUHOSB#aZwziubx9AamEbUwU30dc+RYOrmdg7QZJ2i5jHf6TM0=

@krombel
Contributor

krombel commented May 21, 2018

@Ezwen Might it be possible that synapse is facing memory issues?
I have played a bit with memory limits for synapse and ran into this issue as well.
When having a look at what it is doing, it is mainly GC, which is enforced by the system.
So the system halts synapse until it has freed some memory before it allows synapse to use "new ram".
Might that be the issue in this case? If yes, that issue can only be solved by making synapse use less memory...

@Ezwen

Ezwen commented May 21, 2018

@krombel interesting, thanks for sharing! if that's the case, then I suppose there is not much I can do :(. Unfortunately I cannot easily add memory to the physical host.

@MacLemon
Author

Because the maintainer of the net-im/matrix-synapse port gave up, it isn't getting updated anymore.
As a result I've now reinstalled synapse 0.30 via pip, sticking with my existing SQLite installation. It held up for over 10 minutes and I already thought the problems were fixed. Then it locked up again, heating a CPU core at 100%. Still, the result was positive in that it showed itself to be faster than before.

Then I remembered that I had wanted to upgrade from SQLite to Postgres for a long time, and took the time to do so following the migration guide, which seems to have worked fine.

What a difference in performance. synapse now crashes in under 20 seconds, unlike every few minutes before.

Not what I had expected… Snarky remarks aside…

I've been DEBUG logging this thing for a long time now. I found out that at some point the python process hangs, getting stuck and not doing anything anymore. Then the kernel event pipe from the network starts building up until it reaches its soft limit of 50, then goes over that to the hard limit of ≈76. This is usually when my cron job that checks it kills it and kicks it up again.

It seems to consume between 125MB and 195MB RAM (≈300-380MB virtual) at that time, hitting 100% on a single core. This is the point where it won't recover anymore and I must kill -9 the python 2.7 process, which is done by my script.

From the logs it still seems to hang when trying to send out messages.
I'll redact the usernames from the logs and post an update with the debug logs.

@luto

luto commented May 28, 2018

Sooo after a few hours knee-deep in gdb and the twisted source I found something. Have a look :)

  1. have a system with a secondary IPv6 nameserver in resolv.conf, but no v6
  2. have mostly good DNS queries, but a few failing ones
  3. twisted tries each DNS server: client.py#L309
  4. once it finds the v6 one and tries to bind on ::: client.py#L265-L266
  5. twisted tries to open a socket, but fails: client.py#L221-L228
  6. retries for ever and ever, because there is no v6 on the system m)
(gdb) py-bt
Traceback (most recent call first):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/udp.py", line 115, in __init__
    self.setLogStr()
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/posixbase.py", line 368, in listenUDP
    p = udp.Port(port, protocol, interface, maxPacketSize, self)
  File "/usr/local/lib/python2.7/site-packages/twisted/names/client.py", line 224, in _connectedProtocol
    interface=interface)
  File "/usr/local/lib/python2.7/site-packages/twisted/names/client.py", line 266, in _query
    protocol = self._connectedProtocol(interface='::')
  File "/usr/local/lib/python2.7/site-packages/twisted/names/client.py", line 334, in _reissue
    d = self._query(address, query, timeout[0], reason.value.id)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 567, in _startRunCallbacks
    self._runCallbacks()
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 500, in errback
    self._startRunCallbacks(fail)
  File "/usr/local/lib/python2.7/site-packages/twisted/names/dns.py", line 2826, in _clearFailed
    deferred.errback(failure.Failure(DNSQueryTimeoutError(id)))
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 878, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/var/synapse/.local/lib/python2.7/site-packages/synapse/metrics/__init__.py", line 201, in f
    ret = func(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1252, in mainLoop
    self.runUntilCurrent()
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1243, in run
    self.mainLoop()
  File "/var/synapse/.local/lib/python2.7/site-packages/synapse/app/_base.py", line 101, in run
    reactor.run()
  File "/usr/local/lib/python2.7/site-packages/daemonize.py", line 245, in start
    self.action(*privileged_action_result)
  File "/var/synapse/.local/lib/python2.7/site-packages/synapse/app/_base.py", line 112, in start_reactor
    daemon.start()
  File "/var/synapse/.local/lib/python2.7/site-packages/synapse/app/homeserver.py", line 517, in run
    logger,
  File "/var/synapse/.local/lib/python2.7/site-packages/synapse/app/homeserver.py", line 526, in main
    run(hs)
  File "/var/synapse/.local/lib/python2.7/site-packages/synapse/app/homeserver.py", line 530, in <module>
    main()
  File "/usr/local/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)

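A quick way to check whether a host is in that situation (sketch; FreeBSD syntax, adjust for your OS):

# Is an IPv6 nameserver configured?
grep '^nameserver' /etc/resolv.conf

# What IPv6 addresses does the machine actually have? (only link-local ones means effectively no v6)
ifconfig | grep inet6
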
@richvdh
Member

richvdh commented May 29, 2018

Great detective work!

twisted tries to open a socket, but fails: client.py#L221-L228

This sounds depressingly familiar: #2800 (comment)

@ara4n
Member

ara4n commented Jun 5, 2018

@hawkowl might this be of relevance to your interests?

@ara4n
Member

ara4n commented Jun 5, 2018

(synapse trying to talk ipv6 even when there isn't ipv6 reminds me of #2850, although i think that was a thinko in synapse rather than twisted)

@MacLemon
Author

MacLemon commented Jun 6, 2018

It's actually twisted trying to resolve over IPv6, which then fails to bind to the non-existent IPv6 interface, and it retries that until infinity instead of having any break condition.

@richvdh changed the title from "Synapse 0.22.1 completely hangs every few minutes" to "Synapse completely hangs every few minutes, when configured with an IPv6 DNS server but no IPv6 network" on Jul 23, 2018
@ara4n
Member

ara4n commented Oct 10, 2020

I vaguely remember this getting fixed ages ago in Twisted - can anyone confirm?
