Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DTLS1_READ_BYTES:tlsv1 alert decrypt error #136

Closed
LetsVape opened this issue Jan 16, 2015 · 89 comments
Closed

DTLS1_READ_BYTES:tlsv1 alert decrypt error #136

LetsVape opened this issue Jan 16, 2015 · 89 comments

Comments

@LetsVape
Copy link

Hello,

I've been running the janus-gateway for 28 hours yesterday and ran into the following error:
[31m[ERR]�[0m [dtls.c:janus_dtls_srtp_incoming_msg:311:] Handshake error: error:1410241B:SSL routines:DTLS1_READ_BYTES:tlsv1 alert decrypt error

Last time I ran into this issue it was after 21 hours so it doesn't seem to be a consistent interval

When someone joins and this error pops up once it will keep popping up (on the server) for every new person that joins, everyone that previously joined the videoroom will be able to continue to view the feeds but new people joining will only see publisher names, no video and the gateway will throw the above error for everyone that joins, I will have to restart the gateway to allow people to join again

I've included a debug-level=5 log which includes the first error and then 2 more, I've stopped there for pastebin limits but have a longer log if you prefer, I hope this helps, please let me know if you'd like any other logs/info

http://pastebin.com/qKaAXQcD

Thanks

@lminiero
Copy link
Member

No idea what that is, honestly, especially considering it only seems to happen after a long time and not for every call (which probably excludes certificate or openssl errors, for instance). Unfortunately line 311 doesn't specify the handle ID this is happening on, so I'll have to modify the code to enrich the related debug.

What's the CPU/mem status after those ~20 hours? Just wanted to exclude memory leaks or something like that as a potential cause of failure here.

@lminiero
Copy link
Member

Can you try again with the latest push I made? It should print the handle-related info as well.

@LetsVape
Copy link
Author

Thanks a lot for your reply and adding the extra debug logging
I will rebuild from the latest commit and wait to get another log of the error

I see you added some info for LOG_HUGE, should I include that in the log too, (Is that debug-level 7?)
I went with 5 as the file-size was still manageable this way, 10 mins on 7 will give a 2GB file or so

The CPU was only around 150%-200% (8 cores) and the memory usage was around %15-%20 (8GB) (Linux also cached ~5GB but that is to be expected with the I/O activity)
No other processes were consistently using over %1 CPU or %MEM during this time so resources don't seem to be an isse

Just to clarify as you said "after a long time and not for every call"
It does happen for every call after this error showed up once, I don't see any successful DTLS after this.
But it's not like anyone is persistently unable to connect and causing this error, after a refresh everyone can join

Thanks for you help

@lminiero
Copy link
Member

HUGE is 6, not 7. You can try editing the code and change the DTLS related output to LOG_VERB, which will then appear when you use 5 as well.

@LetsVape
Copy link
Author

Thanks for clarifying,
I'll change it to VERB for now

@LetsVape
Copy link
Author

Hi again,

I replicated the issue again by running the server for 19 hours with the latest commit,

You can download the log which includes the first error up to when I restarted the gateway here;
http://we.tl/biylPxtmbs

I got the error again about 12 hours later so I was able to get another log;
http://we.tl/s29dcaSyMM

Thanks for your time on debugging this, I appreciate it.

@Will-I4M
Copy link

Hi,
are yoiu using openssl 1.0.1e ? If so, you should try with a newer version, and the latest janus commit.
Maybe something around sslv3/tls1.1...

@LetsVape
Copy link
Author

Hey,

I am using openssl 1.0.1-4ubuntu5.20 (Which is fairly newer than 1.0.1e as far as I can see)
I saw the latest Janus commit addressing issue 132 and 134 and will compile that and give it a try later.
I had never seen the error that was mentioned there though, and since my gateway actually handles 10-25+ people at a time during 20 hours or so before it starts failing I didn't think those issues were related to mine. (35 hour uptime at the moment actually :])

I hope my log will give @lminiero some more insight on what happens so we can fidn out the cause.
After it reads -1 from SSL once it keeps doing that for every handshake'

[2602858623]     ... and read -1 of them from SSL...
�[31m[ERR]�[0m [dtls.c:janus_dtls_srtp_incoming_msg:311:] [2602858623] Handshake error: error:1410241B:SSL routines:DTLS1_READ_BYTES:tlsv1 alert decrypt error

I am not too familiar with C but I wonder if it setting -1 to memory causes it to fail for new handshakes too?

char data[1500];    /* FIXME */
    memset(&data, 0, 1500);
    int read = SSL_read(dtls->ssl, &data, 1500);
    JANUS_LOG(LOG_HUGE, "[%"SCNu64"]     ... and read %d of them from SSL...\n", handle->handle_id, read);
if(read < 0) {

Thanks

@lminiero
Copy link
Member

No that code cannot cause the issue you're experiencing. The -1 you see is the return value of the SSL_read call itself, and is perfectly normal during handshakes.

@LetsVape
Copy link
Author

Ah yeah didn't read that correctly thanks, does the log give any clue what it could be though?
Is it normal for it to return -1? because this only seems to show up when this issue occurs so I thought about making it shut down the gateway if it's -1 so it doesn't need to be restarted manually at least

why could it return -1 here, does that mean dtls->ssl is not set or would the SSL_read function fail for some other reason?

Would it be worth trying to update OpenSSL to the latest version, would that affect this?

Thanks

@lminiero
Copy link
Member

The -1 is normal because there's no data returned by the SSL_read: all the bytes are handshake related, and so there's no payload to handle. Real openssl errors are printed on the console (like the one you pasted earler). Not sure that upgrading will change anything, especially as it only happens much later into using the tool.

@LetsVape
Copy link
Author

Is there anything I can do to further debug this? If it would simply be a handshake failing and not sending SSL once, or one user it would make perfect sense, but it being persistent after this happens once seems so odd. And the only 'fix' I can think of at the moment is killing the gateway when it happens

Thanks

@lminiero
Copy link
Member

No idea... have you already tried capturing some dumps with Wireshark to see the one of the DTLS handshake that fails? Just to check who's sending the alert and as part of what messages exchange.

@LetsVape
Copy link
Author

I have not, but as it only happens after 10-30hours, the server is not on my network and I cannot replicate it consistently without just waiting for it to show up, I wouldn't have a clue where to start looking for packets, logging every packet for 30 hours and then finding the handshake that cuased it seems like looking for a needle in a haystack.
I thought the handler info added in the previous commit would give some info on this?

@lminiero
Copy link
Member

I only meant capturing some data after the issue has happened, of course... since you said that, as soon as it happens once, it happens all over again for all subsequent calls until you restart the gateway, and so getting that info for one of those should be easy. A dump on the client side should work as well, since we'd be able to see the DTLS the browser and Janus exchange, and check if there's anything funny there.

I'm not an openssl/crypto expert, so I have no idea what the tlsv1 alert decrypt error you're getting means. I tried looking around and there's no clear reason for it, or at least not anything that's obvious to me. That's why I was interested in looking at the messages to see if we can figure out something that way.

@LetsVape
Copy link
Author

Ah ofcourse that would work, I was thinking of getting the initial packet that causes it.
I'll quickly try to record some packets before restarting when it happens next time I'm around

Would it help to see log dtls->ssl as I imagine there is something wrong with that causing int read = SSL_read(dtls->ssl, &data, 1500); to fail?

Thanks for taking the time to look into this, I will let you know when I have a pcap file (Now I need it to show me the error it has been up for 37 hours without issues, hah)

@lminiero
Copy link
Member

The code is already logging the SSL_read errors, that's where the tlsv1 alert decrypt error stuff comes from, https://github.com/meetecho/janus-gateway/blob/master/dtls.c#L312

@LetsVape
Copy link
Author

But the SSL_Read is generating that error because the data feeding into it (dtls->ssl, or &data ?) has a problem right?
So I thougth knowing what data causes that function to fail will give some useful information (to see what is different form the successful handshake data)

Apologies if that doesn't make much sense, I'm just trying to think of ways to debug with my limited C knowledge

thanks

@lminiero
Copy link
Member

Ah ok I got what you mean: yes you can try capturing those bytes as well, even though they should in theory be exactly the same available in the Wireshark/tcpdump. If they're not, it means the server is getting something different than what the client sent, which might explain an error.

@LetsVape
Copy link
Author

Thanks for confirming that
Would you please be able to show me how to log it properly (so I don't have to wait for the error to show up to realise I used the wrong syntax :] ) so i can include it next time I compile?
I'm not sure which conversion specifier I should use there, what is dtls->ssl (string?) and is &data just a pointer to a chunk of memory for it to put that data in?

So should this do the trick?

JANUS_LOG(LOG_VERB, "[%"SCNu64"]     SSL_Read dtls: %s \n", handle->handle_id, dtls->ssl); //is it a string (%s)?

Thanks again for all your comments and for your time on debugging this!

@LetsVape
Copy link
Author

Here's a pcap file with a few failing handshakes, hope this tells you something useful
http://we.tl/CykBks6YS2

Thanks

@lminiero
Copy link
Member

Thanks I'll check that later.

@jacobhub
Copy link

jacobhub commented Mar 1, 2015

Any news on this issue ?

@LetsVape
Copy link
Author

LetsVape commented Mar 1, 2015

Hi,

I've not found the cause or a solution to this problem yet.
I still encounter this issue every 10-20 hours or so and work around it by invoking the exit() function when this happens;

if(err == SSL_ERROR_SSL) {
/* Ops, something went wrong with the DTLS handshake */
char error[200];
ERR_error_string_n(ERR_get_error(), error, 200);
JANUS_LOG(LOG_ERR, "[%"SCNu64"] Handshake error: %s\n", handle->handle_id, error);
//return;

JANUS_LOG(LOG_ERR, "Shutting down server for automatic restart");
exit(0);

}

To see how often this particular error made the server crash and to see if it is persistent without going through the logs I've set up a script which send me a push notification when "Handshake error:" is detected.
In the few days I removed the automatic restart and added the push notifications I have only had 1 occurrence where I got notified there was a handshake error and it did not persist, the remaining 4 times this happened once it would continue to happen for each new connection (I have waited it out for 5-10 mins but it did not recover)

I am currently having a few more persistent errors I'm looking into, if there's anything you came to know regarding this that's worth sharing it'd be appreciated

Thanks

@lminiero
Copy link
Member

lminiero commented Mar 2, 2015

No update here either, as we can't replicate this. We've been doing some stress tests lately and never encountered the issue.

@LetsVape
Copy link
Author

LetsVape commented Mar 2, 2015

Is there anything else I can provide to help debug this issue?

I am consistently getting this at least once every 24 hours, with around (5-20 publishers and 5-15 listeners constantly) so if there's any other debug info you'd like me to get please let me know.

I have been able to replicate it with multiple browsers but can't find exactly what causes it, I do know that refreshing the page multiple times, sometimes even before it's loaded will increase the chance of it happening.

@lminiero
Copy link
Member

Is this still happening with the latest version, after the several memory leaks we fixed recently? Just to rule out the possibility that it was related to something like that.

@jacobhub
Copy link

it turned out that this was not the issue i was facing, my problem was
something else you already solved in previous versions.

I have never faced this DTLS issue.
On Mar 17, 2015 4:49 PM, "Lorenzo Miniero" notifications@github.com wrote:

Is this still happening with the latest version, after the several memory
leaks we fixed recently? Just to rule out the possibility that it was
related to something like that.


Reply to this email directly or view it on GitHub
#136 (comment)
.

@lminiero
Copy link
Member

Ok, thanks, waiting for feedback from @LetsVape then.

@LetsVape
Copy link
Author

Thanks for following up, I have not yet checked to see if this error still occurs after the latest memory leak fixes were done
I'll monitor the logs to see if it still occurs in the next few days and if it does I'll remove my exit(0); this weekend to confirm whether it still causes the server to lock up after this happens

Thanks!

@LetsVape
Copy link
Author

LetsVape commented Sep 8, 2015

I had the modular transport branch running for 30 days straight without encountering this error after compiling it against BoringSSL so that defnitely solved it, but then suddenly, without any changes on our end it started crashing hourly.

I first suspected Google updating Chrome might've caused this as it was shortly after that got released, but looking at the DEPS file for v45 it seems like they use BoringSSL from June 30th there.

Canary's BoringSSL version was updated however, but after recompiling both master and modular transport branches with the BoringSSL used in V46/V47 (12fe1b2/ac8302a) but I still can't get them stable yet

I'll soon recompile with AddressSanitizer to get some useful logs for you

Regards

@lminiero
Copy link
Member

lminiero commented Sep 9, 2015

@LetsVape first of all thanks a lot for giving the modular transports branch such a prolonged try! I'd like to make that the default soon, so knowing it works as expected for others too is good news. Let us know if you find any weird behaviour in that respect.

As to the crashes, keep us posted. How do you replicate them usually?

@LetsVape
Copy link
Author

LetsVape commented Sep 9, 2015

Will do!
It seems like the modular transports solved some connectivity issues we had with websockets on the master branch, and it's a quite noticeable improvement in speed to use websockets so thanks for all your work on that!

Unfortunately I've not been able to reproduce these crashes consistently, we're using it for a videoroom with a fair amount of unique users throughout the day and we're not limiting users to use a specific browser or version so it's hard to pinpoint who or what is causing it, we weren't able to see any pattern in the analytics yet.

I wasn't able to reproduce it on my test server yetr either which might be due to lack of activity so I'll get AddressSanitizer running soon and stress test it with a group of users

@lminiero
Copy link
Member

lminiero commented Sep 9, 2015

As a side note, from time to time I play with BoringSSL to see if I can get it to add easily as an alternative in configure for Janus. Anyway, I noticed that the DTLS code fails to build entirely if you don't take care of some deprecated methods (OpenSSL_add_all_algorithms, SSL_CTX_set_read_ahead, BIO_read_filename). Are you guys fixing dtls.c as well or does it build fine for you instead?

@LetsVape
Copy link
Author

LetsVape commented Sep 9, 2015

No I don't actually, I just recompiled 2 hours ago against the modular-transports branch

I did a fresh git clone, and checkout to get to modular branch, then I modified Makefile.am to add the 2 lines @frk2 mentioned in his Google post (I don't include jansson and curls which he uses in his commits though, I saw janus already has jansson as dependency without this line and when I saw it show up in a coredump I removed it from the CFLAGS and that seemed to help)

so the following is in my Makefile.am


janus_CFLAGS = \
        $(AM_CFLAGS) \
        $(JANUS_CFLAGS) \
        -DPLUGINDIR=\"$(plugindir)\" \
        -DTRANSPORTDIR=\"$(transportdir)\" \
        -DCONFDIR=\"$(confdir)\" \
        -I/opt/boringssl/include \
        $(NULL)

janus_LDADD = \
        -L/opt/boringssl \
        $(JANUS_LIBS) \
        $(JANUS_MANUAL_LIBS) \
        -lsrtp \
        $(NULL)

and I compiled BoringSSL 12fe1b2 as described in the Google groups post

And I confirmed that I then saw the DTLSv1_2 references using nm janus | grep DTLSv1_2

Even though I'm able to compile, are there any issues that could arise due to these deprecated methods? The instructions above are what I ran the gateway with for 30 days with up to 30 people at a time in a 24/7 chat so I certainly didn't notice anything

I'm now 5 hours into the latest modular transports branch with the BoringSSL commit mentioned above

@lminiero
Copy link
Member

Even though I'm able to compile, are there any issues that could arise due to these deprecated methods?

No idea, I'm not an OpenSSL/BoringSSL expert. I just noticed those when trying to build a test application based on the same DTLS code we have in Janus against BoringSSL. They were reported as undefined references and looking around it looked like some things changed.

@tuijldert
Copy link
Contributor

Ok,

So we compiled & linked (statically) against BoringSSL according Faraz's instructions and re-ran the test.

It ran for about 14+ hours, handled 43000+ calls and had no errors.
I think it's safe to say that using BorignsSSL will solve this problem and that the issue is indeed with the OpenSSL project (remember that gave errors and crashed Janus within 10 minutes).

Which begs the following question - what will/should Janus do?

Should the standard dependency/build switch to -a version of- BoringSSL and the instructions on installation be changed accordingly (possibly including a #define BORING_SSL_SHARED_LIBRARY) ?

Hope to hear on this further.

Cheers,
Tom.

@lminiero
Copy link
Member

Which begs the following question - what will/should Janus do?

Should the standard dependency/build switch to -a version of- BoringSSL and the instructions on installation be changed accordingly (possibly including a #define BORING_SSL_SHARED_LIBRARY) ?

I've been working on a configure flag to force BoringSSL, plus a simple script to install it. That said, it looks that one way or the other I always end up with the installed OpenSSL instead of BoringSSL, so I'm still struggling with that. Besides, there are differences in the library that cannot be ignored, as methods being deprecated and such, at least in the BoringSSL version I have.

@saghul
Copy link
Contributor

saghul commented Sep 15, 2015

Should the standard dependency/build switch to -a version of- BoringSSL and the instructions on installation be changed accordingly (possibly including a #define BORING_SSL_SHARED_LIBRARY) ?

Please no. Adding the ability to build with BoringSSL would be indeed a welcome addition, but making it the only option is not a great idea, specially since Google doesn't maintain BoringSSL for others, the may break the API / ABI without notice.

@lminiero
Copy link
Member

Please no. Adding the ability to build with BoringSSL would be indeed a welcome addition, but making it the only option is not a great idea, specially since Google doesn't maintain BoringSSL for others, the may break the API / ABI without notice.

I agree, and in fact what I was working on was a configure flag, which would assume OpenSSL would still be the default.

@saghul
Copy link
Contributor

saghul commented Sep 15, 2015

I agree, and in fact what I was working on was a configure flag, which would assume OpenSSL would still be the default.

👍

@frk2
Copy link

frk2 commented Sep 16, 2015

Good to know that boringSSL helped! working fine for us as well!

@amnonbb
Copy link
Contributor

amnonbb commented Sep 17, 2015

For me BoringSSL work more then month without issue.
But i got this #252 problem on some clients.
Only 1024 certs + fixed packet size in ffmpeg solve them problem.

@lminiero
Copy link
Member

Just finished preparing an extended version of configure.ac and Makefile.am that adds a new optional flag (--enable-boringssl), which if enabled (it's disabled by default) looks for BoringSSL in /opt/boringssl and dynamically modifies the janus_CFLAGS and janus_LDADD variables you guys edited manually. Compiled the BoringSSL revision @LetsVape mentioned and it does indeed work, no complaint about methods not being there anymore: working around the differences would be the right approach, rather than suggesting a specific version, as there might be other fixes to the library itself, but it would take more time, and we need a first step.

I can confirm that the nm gives the expected output, but I'm wondering if there may be any mixups with the stock OpenSSL libraries? In fact, ldd still reports libssl.so and libcrypto.so among the dependencies of the executable. Anyway, the code works and does its job, so I guess the static library overrides that.

That said, I'll prepare a pull request shortly with the changes: as soon as it is ready I'll notify it here, so when that happens I'd ask you guys (especially those of you who are already using BoringSSL) to test it and let me know if the dynamic flag does indeed make use of the right library for you, so that I can merge it for good.

@saghul
Copy link
Contributor

saghul commented Sep 24, 2015

adds a new optional flag (--enable-boringssl), which if enabled (it's disabled by default) looks for BoringSSL in /opt/boringssl

How about making the path configurable as:

--with-boringssl=no       // no broing ssl (default)
--with-boringssl=/opt/foo/bar/boringssl      // configurable!

@lminiero
Copy link
Member

This is something we can fix later on, I first want to make sure we have something workable. Besides, considering that apparently a specific revision is needed, it makes sense to have it installed somewhere specific for the moment.

@lminiero
Copy link
Member

PR created #343, feedback welcome.

@lminiero
Copy link
Member

Closing as the BoringSSL support fixes that for affected users.

@LetsVape
Copy link
Author

Sorry, intended to test the PR but didn't get to it, see it's in the main and modular transports branch now, thanks for your work on this @lminiero! much appreciated

@ancorgs
Copy link
Contributor

ancorgs commented Oct 19, 2015

Just for the record. I tried with openSSL 1.0.2d (I hoped that bumping openSSL to 1.0.2 would have similar effects to using BoringSSL) and the problem is still there.

@lminiero
Copy link
Member

Ping 😉

lminiero added a commit that referenced this issue Dec 2, 2015
Attempt to fix the infamous DTLS decrypt alert error (issues #136 and #316)
@ancorgs
Copy link
Contributor

ancorgs commented Dec 2, 2015

I think that we can close this. Looks like #394 fixed the problem.

@ancorgs
Copy link
Contributor

ancorgs commented Dec 2, 2015

Sorry, I commented in the wrong issue. :) I meant closing #316

@lminiero
Copy link
Member

lminiero commented Dec 2, 2015

Well, this was where the issue was first discussed anyway so no harm done 😉

@LetsVape
Copy link
Author

LetsVape commented Dec 2, 2015

👍 Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests