-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fatal error in GNU libmicrohttpd daemon.c #1154
Comments
If it's happening deep in the library, I don't think there's anything we can do. I'm definitely not an expert of the library internals. You may want to check with the MHD developers instead for some insight: in case it turns out we're doing something wrong, they may have info for us on where to look. |
I can add a bit more color to the issue with the hope that it will help you track it down. I'm still not sure that it isn't an underlying libmicrohttpd issue. The issue seems to be caused by post data overflowing its memory allocation. It only seems to be triggered on on stream close when closing more than one stream in quick succession. In looking to track down the bug I added a bit of additional debug output to the janus_http_handler function and had it dump the upload_data to the log as soon as it was received. Normally upload_data's size matches the size of upload_data_size and upload_data contains just the posted json. However when it is about to crash extra data is passed along with the actual json exceeding the upload_data_size. When Janus copies the raw upload_data it ignores the extra garbage on the end but this extra data seems to be causing, or is a symptom of, some sort of memory leak. A broken request looks like this:
Whereas a clean request looks like this:
I hope this helps. |
Maybe a buffer overflow somewhere, and the file descriptor the library thinks it's closing is actually random data not corresonding to a socket which is why it fails? Just thinking out loud. |
The
I'm wondering if the |
@austinmarkus can you let me know if the above commit fixes it for you? In theory, it should just print a warning (we now register our own "panic" handler to pring the message but NOT abort) and then go on. Of course, if the error was masking some other problem it will die anyway, but let's make one step at a time. |
Thank you! |
You seem to be on the right track as now I'm getting different errors, but unfortunately it is still crashing and now the backtrace is even less helpful. |
You probably need to compile the library in debug mode, as otherwise you'll miss the necessary symbols to debug. Compiling Janus with libasan support might also help. |
First off my apologies fro not following up on this sooner, but I haven't had much to report that can be of any help. I tried compiling with libasan and there were nothing new in the back traces and while I don't think you can build libmicrohttpd in debug mode (please correct me if I'm wrong) I did add MHD_USE_DEBUG to everywhere that it was initialized in janus_http and that too revealed little of note. I finally did come across something that seemed to fix or at least mitigate the issue a large extent. Running both the http and admin transports both with threads = 1 seemed to prevent the crash. In order to make this work I switched the clients from http to websockets and only use the http interface for daemon to daemon communication thus minimizing the number of simultaneous calls. While the bug persists, this workaround is good enough for now so if you want to close the issue feel free. |
That's not really a fix though... it just means single thread instead of a thread per connection or thread pool! 🙂 While I'd like to start using MHD with a single thread, we're not there yet (a lot of refactoring needed for the suspend/resume stuff), so as you pointed out it's not really usable that way. I'll keep this open just a little while longer as a memo, so that I can look into this later on. |
Can you let me know if #1173 fixes it for you? It uses a single thread for HTTP. |
Will do. Do I need to do anything from a config perspective or is the threads= line deprecated? |
Yes, |
It is definitely an improvement over the multi-threaded code but I was still able to get it to crash using my normal stress test of 4 simultaneous streams to 3 different browsers. Again it crashed with the not so helpful https://pastebin.com/7UXTZznu. But interestingly right before the crash I got this. I have been logging all of the requests that my server makes to Janus and the JSON that I sent is fine. My current theory is that there is a leak somewhere in the ice thread that corrupts the memory of the http thread. The ice thread it too sometimes crashes (particularly when libasan was compiled in to both Janus and libmicrohttpd), and I only get a crash on the http thread if there are lots of streams running for a long time. As the calls to http are only made at stream startup and destruction it doesn't make sense to me that the time a stream is running would have any impact on the http thread. It is just a guess and I don't have the c programming chops to track it down. I'm going to run a tshark session and make absolutely certain that corruption isn't happening at the network layer (unlikely as it is on the same physical server) and will report back my results. Thanks again for all of your help. |
In the HTTP plugin, the keep-alive is typically only originated by the plugin itself (janus.js only sends them for websockets), as a way to have long polls (which never get to the core) act as application level keepalives, so it may be something breaking there. Can you try changing https://github.com/meetecho/janus-gateway/blob/master/transports/janus_http.c#L1319 and let me know if this fixes it for you? It might be a buffer overflow thing when we generate a random string in a buffer that might be too tight. |
Will do. It takes me a while to get the crash to happen so it might be a while before I respond. |
Definitely looks like a buffer overflow somewhere, anyway, as the line you get:
comes from this line here:
which means the function thinks |
Probably a dumb question, but are you using the latest libmicrohttpd? |
Yes, 0.9.59. |
@austinmarkus you mentioned some crashes with libasan-enabled binaries.
If any heap/stack overflow or use-after-free is happening here, libasan should detect it. |
Give me a moment and I will re-compile/test. As before it may take a while before I can get a crash. |
With the compile flags @atoppi suggested in for both Janus and libmicrohttpd and the small change @lminiero suggested I can no longer maintain a video stream. They start up fine but after roughly 30 seconds they freeze. Nothing is in the level 5 logs and this ">> >> Can't retransmit packet 37094, we don't have it..." is repeated over and over in the > level 6 logs with the packet numbers changing. I'm going to walk back the changes until I can get it working again. |
|
I'm having difficulty reproducing the crash as I cannot keep streams going for long enough with libasan in, however I did just get one that I think is related. It is the ice thread crashing not httpd but if you look in the log you'll see a: |
Unfortunately the last dump does not add anything useful.
The raised error means that the icethread is trying to write on a pipe that has been closed, but it's unclear who closed it and why. Moreover, given that we suspect a memory leak, this might also be caused by a pointer corruption, leading to write/read on an invalid memory space in place of the previously allocated file descriptor. @austinmarkus I noticed you are using both WebSockets and HTTP transports for your clients. I have never tested this scenario too much, maybe trying only with HTTP is worth a try. |
The mix of websockets and http are a legacy from when I was forcing a single thread by setting threads=1 in the config. As it was stalling when I had multiple clients all trying to negotiate their streams at the same time I switched it to websockets on the client side but left it http on the server side as that was a more involved re-write. It is easy enough to switch back and I can test later today. I will keep working on trying to replicate the original error with libasan in and will report back as soon as I have anything worth sharing. |
I've finally been able to get a decent backtrace of the libmicrohttpd crash. Hopefully this helps https://pastebin.com/9sR4hsxC. In order to do this I needed to compile libmicrohttpd with sanitize=address and no-omit-frame-pointer but not Janus. When I compiled both with those flags Janus wouldn't crash, it would get into the state where it would normally crash with log statements like this Let me know if there is anything else I can get to help you track this down, or if you think it is a bug in libmicrohttpd and I should take it up with them. |
The dump you posted is not from libasan, it is a core dump. What about printing the entire memoy area around |
I think this is what you're after. It is from a different crash. |
Okay, that crash came with MHD compiled with libasan. The back trace that I get is this. #0 0x0000000000000000 in ?? () I will recompile MHD with libasan and try to get it to crash again. |
There is no need to compile MHD with libasan. To inspect the |
@austinmarkus I'd start considering reporting the issue to the MHD developers, just in case we're chasing tails here and the issue lies there instead. |
Just a small related question: Am I right there are no any switch in Janus |
You're right, you have to tweak the environment variables or the Makefile.am manually for that. |
@RSATom default CFLAGS for Janus are |
Thank you, understood. Just don't like how it jumps between lines on step by step debugging... |
Here is what I hope you're after. Full back trace with the area around the msg after it. I will start a bug report with MHD and will reference this thread in a couple of hours. |
Here is another, this time without libasan although it looks pretty similar to my untrained eye. One thing that I think bears mentioning is that I can get similar program instability without using the http interface. I rewrote my app to use websockets (both client and server) and ran the same overnight stress test I did for http where I set 5 steams each running to 5 clients and left them go overnight. When I came back in the next day they were still running but after stopping and restarting the streams the websocket became unresponsive. This is also usually where MHD will segfault. Janus was still running, you just couldn't connect to the socket. To me this points to a issue independent of the transport but again I'm not sure. |
@austinmarkus please check the version of libwebsockets you are using. We recommend a recent version (e.g. EDIT: you're testing the |
Anyway, if you are still testing a pre- |
I'm on vacation this week. But will try out the change first thing next week.
-Austin
…On April 9, 2018 1:45:28 AM PDT, Alessandro Toppi ***@***.***> wrote:
Anyway, if you are still testing a pre-`0.4.0` master with a
recommended lws version, then this could point to a transport
independent issue, arising in long lived connections, just like you
said.
--
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
#1154 (comment)
|
Ok @austinmarkus . |
First off my apologies for not getting back to you sooner, I was out of cell phone range. First off the libwebsockets I am running is from git and was pretty out of date (commit a663aefebd3ea8397cfa4ac409b9fb8e089fe35f from Feb 14) It was after the 2.4 branch but I will try the 2.4 stable release anyway to see if that makes a difference. I used this cmake line from your install instructions so I should be okay on that front. As for the steps to replicate the crash, I have removed token auth and all interactions with the admin interface and the crashes persist, so that can be ruled out. I should also note that it will often take far longer than 15 minutes for the stability issues to manifest, my new test is to let it run for 12+ hours overnight and then check back in in the morning. But the basic steps are the same.
|
@austinmarkus thanks for getting back with your scenario details. |
Hi @austinmarkus, we have hopefully fixed a race condition on the |
Kudos to @mtdxc for debugging the issue. Anyway, I thought your recent tests were on master? So all the data you've collected so far was for the single-thread branch? I would have loved to see the way we use MHD fixed too... |
All of my recent tests were on the single-thread branch. I will update and test later today. |
@austinmarkus got it: then you were not really using even the same codebase as master, since we recently merged the |
I think I may have a lead although to be honest I'm not sure of anything at this point. To be totally sure that the issue wasn't with MHD I built Janus with --disable-rest and set libjanus_http.so in the disable transports section of the config and used websockets as my transport mechanisim. I ran my standard test and while it no longer segfaults it does deadlock and my thinking is that the deadlock was causing the MHD crash and it is just manifesting itself differently with websockets. Here is a trace of the wss thread in the deadlocked state. As you can see the deadlock is in the audiobridge, I am going to try disabling my use of it of my code and see if I can duplicate the deadlock in the main streaming plugin. I apologize for not mentioning the audiobridge in my earlier comments I had forgotten I was even calling it. Edit: This test was run back on trunk |
Notice that if you're using the streaming plugin we fixed something in there yesterday, and I fixed something in the AudioBridge plugin too, so make sure you get the latest changes before trying again. |
In case this is still a problem after the very recent commit @lminiero suggested, you may want to enable the locking debug in the Janus Admin. |
Over the weekend I ran a test on the previous trunk (before the changes @lminiero mentioned were in) but without any calls to the audio bridge and for the first time I experienced no crashes or deadlocks. Today I am going to run the same test but with the updated code and using the HTTP transport. If that too yields success; I will go back to using audiobridge, will enable locking debug and will try to get to the bottom of things once and for all. Thank you all for bearing with me on this, I feel like we're getting closer. |
I have been able to isolate the crash to a specific series of actions, and as such have been able to work around it.
I have tested on trunk using both websockets and using MHD with 100 threads and both have survived crash and deadlock free by simply issuing a stop_rtp_forward sleeping for a few seconds and then destroying the audiobridge. Please let me know if you would like me to do any additional work to help track the issue down, or if you would like me to just close it, as it is something attributable to user error. Thanks again for all of your help and patience in figuring this out. Edit: I'm going to close my MHD bug report, as it doens't appear to be a MHD issue, unless there is any objection. |
You seem on the right way, go ahead closing the MHD issue! |
Not sure about the exact cause, but RTP forwarders use manually created sockets: it might be that some of them are incorrectly closed twice, and the second time they close a socket identifier that is now associated with a different connection (like an HTTP connection handled by MHD). This might explain the "close socket failed" error, as the plugin is closing the socket and MHD failes to close it later on. Just guessing but I'm wondering if that may be it. |
I am closing this issue as I am no longer able to reproduce it using http or websockets either closing the rtp forwarder first or not. I suspect it was a commit that you made about two weeks ago that fixed the issue and although I'm not sure of the exact fix, it does seem fixed. Thank you for all of your help. |
Thanks for your patience, and especially for collecting so many info! |
Hi,
While I think this is a duplicate of #212, that has been closed with the bug still remaining. Janus crashes with a "Fatal error in GNU libmicrohttpd daemon.c:3092: Close socket failed."
Steps to duplicate:
I can duplicate using both the thread pool (256 threads) and thread per connection, using both http and https on the admin and client side.
I am running libmicrohttpd 0.9.59 built from source.
Gdb trace: https://pastebin.com/vYapYPQh (thread per connection)
Log snippet (verbosity=7) leading up to the crash
https://pastebin.com/P2rYvvB7
and the backtrace after https://pastebin.com/EJgNrRa5 (This using a thread pool)
I'm happy to add any additional information required to help get to the bottom of this.
Thank you,
-Austin
The text was updated successfully, but these errors were encountered: