-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[1.x] videoroom: remote publisher doesn't release RTP/RTCP ports #3345
Comments
Hi @boddob thanks for reporting. |
@atoppi thanks for sharing this. I'll give it a try and get back! |
@atoppi I gave this a try and it works as expected now. I'm seeing that the sockets are released now. Thanks! Is there something more to test here to validate the change? Mar 19 10:46:48.477174 : [plugins/janus_videoroom.c:janus_videoroom_process_synchronous_request:7202:init] 0xffff6000f840 (1) |
I guess due to the nature of the fix a quick test is enough but I'll let @lminiero reply in case he feels more testing is needed. |
If you can test it a few more days it will help, just to be sure. But I agree this is 99.99999% likely the cause of the issue. |
@boddob the patch has been merged, thanks for testing. |
@atoppi sorry for not getting back in time. I've been running this under heavy loads, and I've not faced any issues. |
I've deployed 2 servers with this patch, one about a week ago, and seems running fine, metrics doesn't show up improvements or regressions compared with the rest of the servers so far. Yesterday I've deployed a second one, that unfortunately crashed about 10hs later. Core dump |
What version of Janus is this happening on?
1202 (commit: 3773de6)
Have you tested a more recent version of Janus too?
This is reproducible on the latest commit, we use 1200 (commit: e3a2aea) for production on which we first saw it.
Was this working before?
This issue has been there since we've started using it.
Is there a gdb or libasan trace of the issue?
Don't have a trace, but have logs with refcount debug enabled, which seems to point to a refcount issue. Shared the gist below.
https://gist.github.com/boddob/e5f82c19e664840083fc8f8f0f0979b7
The remote publisher refcount object to look at is "0xffff50001680". A cat of the gist content shows how the refcount seems to have some issue.
cat janus_refcount_gist.txt | grep 0xffff50001680
[plugins/janus_videoroom.c:janus_videoroom_process_synchronous_request:7202:init] 0xffff50001680 (1)
[plugins/janus_videoroom.c:janus_videoroom_process_synchronous_request:7221:increase] 0xffff50001680 (2)
[plugins/janus_videoroom.c:janus_videoroom_process_synchronous_request:7221:increase] 0xffff50001680 (3)
[plugins/janus_videoroom.c:janus_videoroom_process_synchronous_request:7313:increase] 0xffff50001680 (4)
[plugins/janus_videoroom.c:janus_videoroom_remote_publisher_thread:12835:increase] 0xffff50001680 (5)
[plugins/janus_videoroom.c:janus_videoroom_publisher_stream_destroy:2344:decrease] 0xffff50001680 (2117)
[plugins/janus_videoroom.c:janus_videoroom_publisher_stream_destroy:2344:decrease] 0xffff50001680 (2116)
[plugins/janus_videoroom.c:janus_videoroom_publisher_dereference:2387:decrease] 0xffff50001680 (2115)
[plugins/janus_videoroom.c:janus_videoroom_publisher_destroy:2429:decrease] 0xffff50001680 (2114)
[plugins/janus_videoroom.c:janus_videoroom_remote_publisher_thread:13131:decrease] 0xffff50001680 (2113)
Additional context
Background:
We use the remote publishing/SFU cascading feature to support video rooms on multiple Janus servers. We have a separate orchestrator service on each server, that does the job of adding and removing remote publishers. The design was inspired from the SFU cascading blog post. These are EC2 servers running on aarch64 CPU(s).
Problem:
Over time, the servers run out of free RTP/RTCP ports, and the request "add_remote_publisher" fails with the error code: 499, "Could not open UDP socket for RTP stream". This happens even though we ensure "remove_remote_publisher" is called for the same publisher once we're done.
Observations:
When testing this in an isolated environment, it was observed that the refcount for the remote publisher (initalized in line no. 7202 of the current src/plugin/janus_videoroom.c), doesn't behave as expected. At the point we try to remove the remote publisher, the refcount is a very high number, and the function "janus_videoroom_publisher_free" is never called. Interestingly, the refcount value seems to grow with time, i.e for how long the "remote publisher" was active and receiving data from another instance. I have shared the logs of this experiment, with refcounting enabled.
Our orchestrator that talks to Janus has multiple threads, and each one creates a different session, and each thread took up on requests/events. So, the session that adds the remote publisher could be different from the session that removes the remote publisher. Our initial suspicion was that, but the issue persists even if the orchestrator uses a single session/thread.
The text was updated successfully, but these errors were encountered: