Temporarily failing bridge healthcheck permanently leaves Jitsi without any operational bridges #1143

Open
pbirkants opened this issue Mar 8, 2024 · 6 comments

Comments

@pbirkants

Description

The bridge link between Jicofo and JVB is sometimes terminated when the host is under heavy load from other processes. It never recovers afterwards, so no Jitsi calls work until JVB is restarted manually.

Current behavior

If a health check takes too long, the JVB node is dropped and its health checks are never resumed, even though the bridge is running fine.

Here are the relevant Jicofo and JVB logs from that time period. Nothing else was recorded before or after this (until JVB was restarted manually).

Jicofo 2024-03-06 03:50:59.501 WARNING: [14] JvbDoctor$HealthCheckTask.doHealthCheck#189: Bridge[jid=jvbbrewery@internal.auth.**REDACTED**/**REDACTED**, version=2.3.67-gb2d4229f, relayId=null, region=null, stress=0.00] health-check timed out, but will give it another try after: 5000
Jicofo 2024-03-06 04:24:29.799 WARNING: [14] JvbDoctor$HealthCheckTask.doHealthCheck#240: Health check failed for: Bridge[jid=jvbbrewery@internal.auth.**REDACTED**/**REDACTED**, version=2.3.67-gb2d4229f, relayId=null, region=null, stress=0.00]: <error xmlns='jabber:client' type='cancel'><internal-server-error xmlns='urn:ietf:params:xml:ns:xmpp-stanzas'/><text xml:lang='en'>Performing a health check took too long: PT3.512705S</text></error>
Jicofo 2024-03-06 04:24:29.836 INFO: [39] JvbDoctor.bridgeRemoved#105: Stopping health-check task for: Bridge[jid=jvbbrewery@internal.auth.**REDACTED**/**REDACTED**, version=2.3.67-gb2d4229f, relayId=null, region=null, stress=0.00]
JVB 2024-03-06 04:24:24.985 SEVERE: [25] HealthChecker.run#181: Health check failed in PT3.512705S: Result(success=false, hardFailure=true, responseCode=null, sticky=false, message=Performing a health check took too long: PT3.512705S)
JVB 2024-03-06 04:24:29.633 WARNING: [6243] XmppConnection.measureDelay#244: Took 171 ms to handle IQ: <iq xmlns='jabber:client' to='jvb@auth.**REDACTED**/-BUTIsDF' from='jvbbrewery@internal.auth.**REDACTED**/focus' id='**REDACTED**' type='get'><healthcheck xmlns='http://jitsi.org/protocol/healthcheck'/></iq>
JVB 2024-03-06 04:25:21.295 INFO: [25] HealthChecker.run#179: Performed a successful health check in PT0.000029S. Sticky failure: false
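
To make the timing in the log lines above easier to see, here is a small sketch that computes the gaps between three of the logged events. It assumes both processes log against the same clock, which seems reasonable since everything runs on a single server; the variable names are illustrative only.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the log lines above.
jvb_check_failed = datetime.strptime("2024-03-06 04:24:24.985", FMT)
jicofo_stopped_checks = datetime.strptime("2024-03-06 04:24:29.836", FMT)
jvb_check_succeeded = datetime.strptime("2024-03-06 04:25:21.295", FMT)

# Time between the failed JVB self-check and the next successful one that was logged.
unhealthy_window = (jvb_check_succeeded - jvb_check_failed).total_seconds()

# Time between Jicofo stopping its health-check task and that successful self-check.
recovered_after_removal = (jvb_check_succeeded - jicofo_stopped_checks).total_seconds()

print(f"JVB failed -> next logged success: {unhealthy_window:.3f} s")
print(f"Jicofo stopped checking -> next logged success: {recovered_after_removal:.3f} s")
```

By this reading, JVB logged a successful health check under a minute after Jicofo had stopped its health-check task for the bridge.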

Expected Behavior

The bridge connection should be recovered automatically.
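
The difference between the current and the expected behavior can be illustrated with a toy model. This is only a sketch under stated assumptions: `Bridge`, `Doctor`, and `keep_probing_removed` are hypothetical names invented for the illustration, not Jicofo's actual classes or settings, and the real scheduling logic is certainly more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Bridge:
    """Hypothetical stand-in for a JVB; `healthy` is what a health check would return."""
    jid: str
    healthy: bool = True


@dataclass
class Doctor:
    """Toy model of a health-check scheduler (NOT Jicofo's actual JvbDoctor)."""
    keep_probing_removed: bool  # False models the behavior reported in this issue
    operational: set = field(default_factory=set)
    probed: dict = field(default_factory=dict)

    def add(self, bridge: Bridge) -> None:
        self.operational.add(bridge.jid)
        self.probed[bridge.jid] = bridge

    def tick(self) -> None:
        """Run one health-check round over every bridge that is still being probed."""
        for jid, bridge in list(self.probed.items()):
            if bridge.healthy:
                self.operational.add(jid)      # (re-)admit a healthy bridge
            else:
                self.operational.discard(jid)  # drop a failing bridge
                if not self.keep_probing_removed:
                    del self.probed[jid]       # checks stop and are never restarted


def simulate(keep_probing_removed: bool) -> bool:
    """Return True if the bridge is operational again after one transient failure."""
    bridge = Bridge("jvb1")
    doctor = Doctor(keep_probing_removed)
    doctor.add(bridge)
    bridge.healthy = False  # host under heavy load: one slow health check
    doctor.tick()
    bridge.healthy = True   # load subsides, the bridge is fine again
    doctor.tick()
    return "jvb1" in doctor.operational


print("current behavior recovers: ", simulate(keep_probing_removed=False))
print("expected behavior recovers:", simulate(keep_probing_removed=True))
```

In this model, a single transient failure is permanent unless something keeps checking the removed bridge, which is the recovery path this issue asks for.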

Steps to reproduce

I'm not sure how to reproduce this reliably. It has happened two or three times over several months, always at night, when Jitsi is completely idle but other processes on the host are causing significant system load.

Environment details

All Jitsi components are installed locally on a single server, with a single bridge in use.

APT package versions (but this has happened with earlier versions, too):

jitsi-meet            2.0.9220-1
jitsi-videobridge2    2.3-67-gb2d4229f-1

@damencho
Member

damencho commented Mar 8, 2024

Please use the community forum for questions or problems before opening new issues. Thank you.

@damencho
Member

damencho commented Mar 8, 2024

You can disable the health checks to avoid the bridge being removed if you do not have multiple bridges and autoscaling.

@damencho damencho closed this as completed Mar 8, 2024
@bgrozev bgrozev transferred this issue from jitsi/jitsi-videobridge Mar 11, 2024
@bgrozev bgrozev reopened this Mar 11, 2024
@bgrozev
Member

bgrozev commented Mar 11, 2024

Jicofo fails to resume JVB health checks once they fail. This is fine in most environments, where we use sticky-failures=true, which is why we haven't noticed it before.
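
As a rough illustration of why sticky failures would hide this problem, here is a minimal sketch of the two modes. It assumes that a sticky failure means the bridge keeps reporting itself unhealthy after its first failed check until the process is restarted; `HealthState` and its methods are hypothetical names, not JVB's actual implementation.

```python
class HealthState:
    """Toy model of sticky vs. non-sticky health reporting (illustrative only)."""

    def __init__(self, sticky_failures: bool) -> None:
        self.sticky_failures = sticky_failures
        self._failed_before = False

    def report(self, check_passed: bool) -> bool:
        """Record one check result and return the health status exposed to callers."""
        if not check_passed:
            self._failed_before = True
            return False
        # With sticky failures, a single failure latches until the process restarts.
        return not (self.sticky_failures and self._failed_before)


sticky = HealthState(sticky_failures=True)
non_sticky = HealthState(sticky_failures=False)

for state in (sticky, non_sticky):
    state.report(False)  # one transient failure, e.g. a slow check under load

print("sticky bridge healthy after recovery:    ", sticky.report(True))
print("non-sticky bridge healthy after recovery:", non_sticky.report(True))
```

Under that assumption, a sticky bridge stays unhealthy whether or not anyone keeps checking it, so stopped health checks go unnoticed. A non-sticky bridge becomes healthy again, and the missing re-check is what leaves it unused.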

@pbirkants
Author

Thank you for reopening this issue.

I'd like to add that I'm using the defaults for all related settings in both Jicofo and JVB, which I believe means sticky-failures=false.

Disabling health checks does not seem like a good solution, as that could make it difficult to detect when the bridge is actually down.

@0ki

0ki commented May 13, 2024

This bug affects me too. Is there currently a planned timeline for a fix?

@doerofthedo

It would be wrong to handle this bug by turning off health checks. I see that nobody is assigned to this issue. Is any progress on it expected in the near future?
