-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DTLS1_READ_BYTES:tlsv1 alert decrypt error #136
Comments
No idea what that is, honestly, especially considering it only seems to happen after a long time and not for every call (which probably excludes certificate or openssl errors, for instance). Unfortunately line 311 doesn't specify the handle ID this is happening on, so I'll have to modify the code to enrich the related debug. What's the CPU/mem status after those ~20 hours? Just wanted to exclude memory leaks or something like that as a potential cause of failure here. |
Can you try again with the latest push I made? It should print the handle-related info as well. |
Thanks a lot for your reply and adding the extra debug logging I see you added some info for LOG_HUGE, should I include that in the log too, (Is that debug-level 7?) The CPU was only around 150%-200% (8 cores) and the memory usage was around %15-%20 (8GB) (Linux also cached ~5GB but that is to be expected with the I/O activity) Just to clarify as you said "after a long time and not for every call" Thanks for you help |
HUGE is 6, not 7. You can try editing the code and change the DTLS related output to LOG_VERB, which will then appear when you use 5 as well. |
Thanks for clarifying, |
Hi again, I replicated the issue again by running the server for 19 hours with the latest commit, You can download the log which includes the first error up to when I restarted the gateway here; I got the error again about 12 hours later so I was able to get another log; Thanks for your time on debugging this, I appreciate it. |
Hi, |
Hey, I am using openssl 1.0.1-4ubuntu5.20 (Which is fairly newer than 1.0.1e as far as I can see) I hope my log will give @lminiero some more insight on what happens so we can fidn out the cause.
I am not too familiar with C but I wonder if it setting -1 to memory causes it to fail for new handshakes too?
Thanks |
No that code cannot cause the issue you're experiencing. The -1 you see is the return value of the SSL_read call itself, and is perfectly normal during handshakes. |
Ah yeah didn't read that correctly thanks, does the log give any clue what it could be though? why could it return -1 here, does that mean dtls->ssl is not set or would the SSL_read function fail for some other reason? Would it be worth trying to update OpenSSL to the latest version, would that affect this? Thanks |
The -1 is normal because there's no data returned by the SSL_read: all the bytes are handshake related, and so there's no payload to handle. Real openssl errors are printed on the console (like the one you pasted earler). Not sure that upgrading will change anything, especially as it only happens much later into using the tool. |
Is there anything I can do to further debug this? If it would simply be a handshake failing and not sending SSL once, or one user it would make perfect sense, but it being persistent after this happens once seems so odd. And the only 'fix' I can think of at the moment is killing the gateway when it happens Thanks |
No idea... have you already tried capturing some dumps with Wireshark to see the one of the DTLS handshake that fails? Just to check who's sending the alert and as part of what messages exchange. |
I have not, but as it only happens after 10-30hours, the server is not on my network and I cannot replicate it consistently without just waiting for it to show up, I wouldn't have a clue where to start looking for packets, logging every packet for 30 hours and then finding the handshake that cuased it seems like looking for a needle in a haystack. |
I only meant capturing some data after the issue has happened, of course... since you said that, as soon as it happens once, it happens all over again for all subsequent calls until you restart the gateway, and so getting that info for one of those should be easy. A dump on the client side should work as well, since we'd be able to see the DTLS the browser and Janus exchange, and check if there's anything funny there. I'm not an openssl/crypto expert, so I have no idea what the |
Ah ofcourse that would work, I was thinking of getting the initial packet that causes it. Would it help to see log Thanks for taking the time to look into this, I will let you know when I have a pcap file (Now I need it to show me the error it has been up for 37 hours without issues, hah) |
The code is already logging the SSL_read errors, that's where the |
But the SSL_Read is generating that error because the data feeding into it (dtls->ssl, or &data ?) has a problem right? Apologies if that doesn't make much sense, I'm just trying to think of ways to debug with my limited C knowledge thanks |
Ah ok I got what you mean: yes you can try capturing those bytes as well, even though they should in theory be exactly the same available in the Wireshark/tcpdump. If they're not, it means the server is getting something different than what the client sent, which might explain an error. |
Thanks for confirming that So should this do the trick?
Thanks again for all your comments and for your time on debugging this! |
Here's a pcap file with a few failing handshakes, hope this tells you something useful Thanks |
Thanks I'll check that later. |
Any news on this issue ? |
Hi, I've not found the cause or a solution to this problem yet. if(err == SSL_ERROR_SSL) {
/* Ops, something went wrong with the DTLS handshake */
char error[200];
ERR_error_string_n(ERR_get_error(), error, 200);
JANUS_LOG(LOG_ERR, "[%"SCNu64"] Handshake error: %s\n", handle->handle_id, error);
//return;
JANUS_LOG(LOG_ERR, "Shutting down server for automatic restart");
exit(0);
} To see how often this particular error made the server crash and to see if it is persistent without going through the logs I've set up a script which send me a push notification when "Handshake error:" is detected. I am currently having a few more persistent errors I'm looking into, if there's anything you came to know regarding this that's worth sharing it'd be appreciated Thanks |
No update here either, as we can't replicate this. We've been doing some stress tests lately and never encountered the issue. |
Is there anything else I can provide to help debug this issue? I am consistently getting this at least once every 24 hours, with around (5-20 publishers and 5-15 listeners constantly) so if there's any other debug info you'd like me to get please let me know. I have been able to replicate it with multiple browsers but can't find exactly what causes it, I do know that refreshing the page multiple times, sometimes even before it's loaded will increase the chance of it happening. |
Is this still happening with the latest version, after the several memory leaks we fixed recently? Just to rule out the possibility that it was related to something like that. |
it turned out that this was not the issue i was facing, my problem was I have never faced this DTLS issue.
|
Ok, thanks, waiting for feedback from @LetsVape then. |
Thanks for following up, I have not yet checked to see if this error still occurs after the latest memory leak fixes were done Thanks! |
I had the modular transport branch running for 30 days straight without encountering this error after compiling it against BoringSSL so that defnitely solved it, but then suddenly, without any changes on our end it started crashing hourly. I first suspected Google updating Chrome might've caused this as it was shortly after that got released, but looking at the DEPS file for v45 it seems like they use BoringSSL from June 30th there. Canary's BoringSSL version was updated however, but after recompiling both master and modular transport branches with the BoringSSL used in V46/V47 (12fe1b2/ac8302a) but I still can't get them stable yet I'll soon recompile with AddressSanitizer to get some useful logs for you Regards |
@LetsVape first of all thanks a lot for giving the modular transports branch such a prolonged try! I'd like to make that the default soon, so knowing it works as expected for others too is good news. Let us know if you find any weird behaviour in that respect. As to the crashes, keep us posted. How do you replicate them usually? |
Will do! Unfortunately I've not been able to reproduce these crashes consistently, we're using it for a videoroom with a fair amount of unique users throughout the day and we're not limiting users to use a specific browser or version so it's hard to pinpoint who or what is causing it, we weren't able to see any pattern in the analytics yet. I wasn't able to reproduce it on my test server yetr either which might be due to lack of activity so I'll get AddressSanitizer running soon and stress test it with a group of users |
As a side note, from time to time I play with BoringSSL to see if I can get it to add easily as an alternative in configure for Janus. Anyway, I noticed that the DTLS code fails to build entirely if you don't take care of some deprecated methods (OpenSSL_add_all_algorithms, SSL_CTX_set_read_ahead, BIO_read_filename). Are you guys fixing dtls.c as well or does it build fine for you instead? |
No I don't actually, I just recompiled 2 hours ago against the modular-transports branch I did a fresh git clone, and checkout to get to modular branch, then I modified Makefile.am to add the 2 lines @frk2 mentioned in his Google post (I don't include jansson and curls which he uses in his commits though, I saw janus already has jansson as dependency without this line and when I saw it show up in a coredump I removed it from the CFLAGS and that seemed to help) so the following is in my Makefile.am
and I compiled BoringSSL 12fe1b2 as described in the Google groups post And I confirmed that I then saw the DTLSv1_2 references using Even though I'm able to compile, are there any issues that could arise due to these deprecated methods? The instructions above are what I ran the gateway with for 30 days with up to 30 people at a time in a 24/7 chat so I certainly didn't notice anything I'm now 5 hours into the latest modular transports branch with the BoringSSL commit mentioned above |
No idea, I'm not an OpenSSL/BoringSSL expert. I just noticed those when trying to build a test application based on the same DTLS code we have in Janus against BoringSSL. They were reported as undefined references and looking around it looked like some things changed. |
Ok, So we compiled & linked (statically) against BoringSSL according Faraz's instructions and re-ran the test. It ran for about 14+ hours, handled 43000+ calls and had no errors. Which begs the following question - what will/should Janus do? Should the standard dependency/build switch to -a version of- BoringSSL and the instructions on installation be changed accordingly (possibly including a Hope to hear on this further. Cheers, |
I've been working on a configure flag to force BoringSSL, plus a simple script to install it. That said, it looks that one way or the other I always end up with the installed OpenSSL instead of BoringSSL, so I'm still struggling with that. Besides, there are differences in the library that cannot be ignored, as methods being deprecated and such, at least in the BoringSSL version I have. |
Please no. Adding the ability to build with BoringSSL would be indeed a welcome addition, but making it the only option is not a great idea, specially since Google doesn't maintain BoringSSL for others, the may break the API / ABI without notice. |
I agree, and in fact what I was working on was a configure flag, which would assume OpenSSL would still be the default. |
👍 |
Good to know that boringSSL helped! working fine for us as well! |
For me BoringSSL work more then month without issue. |
Just finished preparing an extended version of I can confirm that the That said, I'll prepare a pull request shortly with the changes: as soon as it is ready I'll notify it here, so when that happens I'd ask you guys (especially those of you who are already using BoringSSL) to test it and let me know if the dynamic flag does indeed make use of the right library for you, so that I can merge it for good. |
How about making the path configurable as:
|
This is something we can fix later on, I first want to make sure we have something workable. Besides, considering that apparently a specific revision is needed, it makes sense to have it installed somewhere specific for the moment. |
PR created #343, feedback welcome. |
Closing as the BoringSSL support fixes that for affected users. |
Sorry, intended to test the PR but didn't get to it, see it's in the main and modular transports branch now, thanks for your work on this @lminiero! much appreciated |
Just for the record. I tried with openSSL 1.0.2d (I hoped that bumping openSSL to 1.0.2 would have similar effects to using BoringSSL) and the problem is still there. |
Ping 😉 |
I think that we can close this. Looks like #394 fixed the problem. |
Sorry, I commented in the wrong issue. :) I meant closing #316 |
Well, this was where the issue was first discussed anyway so no harm done 😉 |
👍 Thanks |
Hello,
I've been running the janus-gateway for 28 hours yesterday and ran into the following error:
[31m[ERR]�[0m [dtls.c:janus_dtls_srtp_incoming_msg:311:] Handshake error: error:1410241B:SSL routines:DTLS1_READ_BYTES:tlsv1 alert decrypt error
Last time I ran into this issue it was after 21 hours so it doesn't seem to be a consistent interval
When someone joins and this error pops up once it will keep popping up (on the server) for every new person that joins, everyone that previously joined the videoroom will be able to continue to view the feeds but new people joining will only see publisher names, no video and the gateway will throw the above error for everyone that joins, I will have to restart the gateway to allow people to join again
I've included a debug-level=5 log which includes the first error and then 2 more, I've stopped there for pastebin limits but have a longer log if you prefer, I hope this helps, please let me know if you'd like any other logs/info
http://pastebin.com/qKaAXQcD
Thanks
The text was updated successfully, but these errors were encountered: