
[1.x] segmentation faults on videoroom hangup #3154

Closed
spscream opened this issue Jan 24, 2023 · 44 comments
Labels
multistream Related to Janus 1.x pr-exists

Comments

@spscream
Contributor

What version of Janus is this happening on?
1.1.0, 1.1.1

Have you tested a more recent version of Janus too?
The problem first occurred on 1.1.0; we upgraded to 1.1.1 and it still appears.

Was this working before?
It occurs only on one of our customers' servers, and not on others running the same versions (1.1.0 and 1.1.1).

Is there a gdb or libasan trace of the issue?
https://gist.github.com/spscream/6ee8265dbc7d0c8f7fa24c2c9c0f8eef

Additional context
We mainly use the videoroom plugin, and this segmentation fault appears only for one customer.
This customer is also having some random issues with TURN/STUN servers and connectivity between Janus and WebRTC clients;
that may not be directly related to the crash, but maybe it is what leads to this segmentation fault.

Maybe this is a duplicate of #3124.

@spscream spscream added the multistream Related to Janus 1.x label Jan 24, 2023
@lminiero
Member

The current version of Janus master is 1.1.3, please try with that one too.

@spscream
Contributor Author

Okay, we will install it and I will return with logs from it.
That said, I didn't see any changes on master that could fix this issue.

@spscream
Contributor Author

The same happened on master.

@atoppi
Member

atoppi commented Jan 25, 2023

@spscream thanks for checking, would you please share an ASan stack trace from the current master?

@lminiero
Member

@spscream if you also have ways that we can use to consistently replicate it ourselves that would help too!

@spscream
Contributor Author

I don't know how to reproduce this; it happens only in one environment.

@spscream
Contributor Author

Maybe this is related to a buggy participant that calls unpublish twice in parallel; we are trying to figure it out.

@spscream
Contributor Author

We got some more crashes, in the same place as the ones I provided before.
In one case Janus crashed when a videoroom was destroyed while people were still in it and clients were sending hangup requests in parallel: at the end of the group meeting the organizer terminated it, and destroy was called as part of that.

I can't provide any additional logs at the moment. I'm trying to reproduce the crash myself with integration tests, but with no success so far.

@lminiero
Member

Since I suspect it's an issue with reference counting, where something is freed sooner than it should be, one thing that might help is uncommenting this line and recompiling Janus. It will make the logs explode in size, since it tracks all refs and unrefs, but should the problem occur again it would give us much more info on where and why the affected resource was freed before its time. You may want to do that only in your own local deployment rather than in production: your choice! But it would definitely help, especially if you're able to replicate it eventually.

@spscream
Contributor Author

Hi!
We got some crashes, and logs with refcount debugging, here:
https://drive.google.com/drive/folders/10ppbofxkUNvnFJ0kxytvi5dNgxIkvojQ?usp=sharing

janus.log contains the stdout of Janus; you can find the exact moments of the crashes by looking for the Janus version banner (Janus version: 1103 (1.1.3)).

@atoppi
Member

atoppi commented Feb 13, 2023

> Janus commit: 806ac371e9d829b880e0af62e7b6f24ba698218d

That is not a valid commit hash. I guess we should refer to the current master, since the version mentions 1.1.3?

@spscream
Contributor Author

https://github.com/spscream/janus-gateway it refers to my fork with refcount debugging enabled.

@atoppi
Member

atoppi commented Feb 13, 2023

@spscream are you running janus on a 32-bit architecture?

@spscream
Contributor Author

No, we are running on 64-bit.

@spscream
Contributor Author

Do you have any clue what is happening, or how I can help debug it?

@lminiero
Member

Still investigating.

@lminiero
Member

@spscream please test the PR above.

@spscream
Contributor Author

Thank you! I built an image with this PR and gave it to our customer.

@spscream
Contributor Author

It crashed again on two different sites, two or three times.
I can provide a log for only one crash now; waiting for crashes on the other sites for more logs.
https://drive.google.com/drive/folders/1CrIs9695SzIs21dtONOOUBHez_DIP3cb?usp=sharing

@atoppi
Member

atoppi commented Feb 22, 2023

@spscream thanks for reporting, the patch has been updated with another fix.

@atoppi
Member

atoppi commented Mar 1, 2023

@spscream have you had the chance to test the updated patch?

@zavndw

zavndw commented Mar 1, 2023

> @spscream have you had the chance to test the updated patch?

Yes. It crashed again; we are awaiting a new crash log from the customer.

@spscream
Contributor Author

spscream commented Mar 1, 2023

we got the crash and log: https://drive.google.com/drive/folders/1ATMh-cWhvcVNOJ22El_bFk8DudlsKLdI?usp=sharing

@atoppi
Member

atoppi commented Mar 2, 2023

@spscream thanks for the logs!
The patch has been updated again, please test and report.

@spscream
Contributor Author

spscream commented Mar 3, 2023

@atoppi
Member

atoppi commented Mar 3, 2023

The reference that made it crash is not present in the log file, probably because it was removed before the instant the file starts.
Do you have the whole file?

@spscream
Contributor Author

spscream commented Mar 3, 2023

No, their logs are limited by size and this is all we have. We will wait for another crash.

@spscream
Contributor Author

spscream commented Mar 3, 2023

@atoppi I added an archive with the full log to the same folder.

@atoppi
Member

atoppi commented Mar 3, 2023

@spscream patch updated according to latest crash data, please test and report.

@spscream
Contributor Author

spscream commented Mar 7, 2023

Two days running on the latest patch and no crashes; we will run it for a few more days.

@spscream
Contributor Author

Crashed again today: https://drive.google.com/drive/folders/1VXaoUhOt2eXnl9OtqsWyopt1oSz7WI23?usp=sharing

@spscream
Contributor Author

It crashed twice more yesterday; do you need any additional info?

@lminiero
Member

We're busy on other activities these days, sorry. We'll have a look as soon as possible.
That said, I'll probably merge the existing PR as it is in the next few days, since it already solves many instances of the crashes you had: the crash log you shared a few days ago seemed to be triggered at a different point, so it may be worth investigating in a new PR.

@atoppi
Member

atoppi commented Mar 16, 2023

@spscream please share the core dumps and logs

@spscream
Contributor Author

spscream commented Mar 16, 2023

> @spscream please share the core dumps and logs

Unfortunately, we didn't get crash logs for yesterday's crashes. If we get any new ones, I will share the logs.

@spscream
Contributor Author

@atoppi
Member

atoppi commented Mar 28, 2023

> I got crash from 17.03: https://drive.google.com/drive/folders/1qD3T4n77sz5e8--TGbSrL26WY2Aats-O?usp=sharing

This one happened in the core:

==1==ERROR: AddressSanitizer: heap-use-after-free on address 0x60400315b9e0 at pc 0x558859f37dc7 bp 0x7effbd418600 sp 0x7effbd4185f8
READ of size 4 at 0x60400315b9e0 thread T44963 (pool-janus)
    #0 0x558859f37dc6 in janus_plugin_session_is_alive /build/janus-gateway/src/ice.c:750
    #1 0x558859fa7b99 in janus_process_incoming_request /build/janus-gateway/src/janus.c:1686
    #2 0x558859fb34c6 in janus_transport_task /build/janus-gateway/src/janus.c:3478

The thread is trying to read the atomic plugin_session->stopped to check the session liveness, whereas the session had already been freed in the VideoRoom plugin:

freed by thread T45081 (hloop 497628586) here:
   #0 0x7f00fbafab6f in __interceptor_free ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:123
   #1 0x558859f1d605 in janus_ice_plugin_session_free /build/janus-gateway/src/ice.c:1591
   #2 0x7f00f3f78b9c in janus_videoroom_session_free plugins/janus_videoroom.c:2542
   #3 0x7f00f3fc878b in janus_videoroom_destroy_session plugins/janus_videoroom.c:4360
   #4 0x558859f7fda7 in janus_ice_outgoing_traffic_handle /build/janus-gateway/src/ice.c:4500

This is the trace of the involved reference:

2023-03-17T13:11:50.171327+03:00: [ice.c:janus_ice_handle_attach_plugin:1382:init] 0x60400315b9e8 (1)
2023-03-17T13:11:50.171391+03:00: [ice.c:janus_ice_handle_attach_plugin:1384:increase] 0x60400315b9e8 (2)
2023-03-17T13:11:50.182581+03:00: [janus.c:janus_plugin_push_event:3592:increase] 0x60400315b9e8 (3)
2023-03-17T13:11:50.182842+03:00: [janus.c:janus_plugin_push_event:3671:decrease] 0x60400315b9e8 (2)
2023-03-17T13:11:54.288829+03:00: [janus.c:janus_plugin_push_event:3592:increase] 0x60400315b9e8 (3)
2023-03-17T13:11:54.289075+03:00: [janus.c:janus_plugin_push_event:3671:decrease] 0x60400315b9e8 (2)
2023-03-17T13:11:55.888737+03:00: [ice.c:janus_plugin_session_dereference:763:decrease] 0x60400315b9e8 (1)
2023-03-17T13:11:55.888797+03:00: Detaching handle from JANUS VideoRoom plugin; 0x612001a9a1c0 0x60400315b9d0 0x612001a9a1c0 0x607003756ea0
2023-03-17T13:11:55.889041+03:00: [janus.plugin.videoroom-0x60400315b9d0] No WebRTC media anymore; 0x612001a9a1c0 0x607003756ea0
2023-03-17T13:11:55.891391+03:00: [plugins/janus_videoroom.c:janus_videoroom_session_free:2542:decrease] 0x60400315b9e8 (0)
2023-03-17T13:11:55.893659+03:00: ==1==ERROR: AddressSanitizer: heap-use-after-free on address 0x60400315b9e0 at pc 0x558859f37dc7 bp 0x7effbd418600 sp 0x7effbd4185f8

The session refcount drops to 0 while a Detach request is being handled, after the plugin is notified through its internal loop.
Concurrently, Janus is handling another message request in the core (on another thread) and crashes in the very check that is meant to validate the involved plugin session, just before submitting the message to the plugin.

janus-gateway/src/janus.c

Lines 1685 to 1693 in 6ddaaa9

	/* Make sure the app handle is still valid */
	if(handle->app == NULL || !janus_plugin_session_is_alive(handle->app_handle)) {
		ret = janus_process_error(request, session_id, transaction_text, JANUS_ERROR_PLUGIN_MESSAGE, "No plugin to handle this message");
		if(jsep_sdp_stripped != NULL)
			janus_flags_clear(&handle->webrtc_flags, JANUS_ICE_HANDLE_WEBRTC_PROCESSING_OFFER);
		g_free(jsep_type);
		g_free(jsep_sdp_stripped);
		goto jsondone;
	}

We are facing many different race conditions here, but each of them is related to VideoRoom session teardown.

@lminiero
Member

As anticipated, I merged #3167, so when we have a possible fix for these other issues we'll open a new PR.

@zavndw

zavndw commented May 11, 2023

@lminiero
Member

Please test the PR above.

@lminiero
Member

@spscream @zavndw any update? Can I assume the fix works and merge, since it's been two weeks?

@zavndw

zavndw commented May 22, 2023

> @spscream @zavndw any update? Can I assume the fix works and merge, since it's been two weeks?

Yes. We assembled the update right away and delivered it to the customer. No complaints for two weeks.

@lminiero
Member

I'll merge and close then.
