Fix intermittent musl integration test #781
#768 added an integration test which is valuable, but not stable right now. Specifically, in the musl integration test, the "tls" test case can pass, but not reliably enough to run in our pipeline on every build. I'm going to comment out this test case on master and leave this ticket open as a reminder to 1) make it stable and then 2) restore the test functionality.

Comments
The test case was commented out on master with this commit:

> A test in the musl integration suite has been disabled. The issue is a failure to disconnect from a TLS session using tcpserver for the remote connection. The same test using socat passes regularly.
We've discovered a few things with this test. This is a simple TLS test in musl.

The next issue that arises is occasionally missing console data from the TLS session: we are not getting all events out on close.

In summary, we will push updates to branch bug/781_tls_musl and continue SSL research. Neither the prospect of a delay on exit nor a wholesale socket option change seems like the right approach at this time. Handing the SSL specifics to @jrcheli (off to a couple of Go items).
So I've tried to reproduce this with the above instructions on an ec2 machine directly (without using the musl container) and couldn't reproduce the problem, meaning that I saw every console event every time while using TLS, and with SCOPE_EVENT_METRIC=true. So I switched to doing the same test inside the musl integration test container, and couldn't reproduce it there either... But I did both of these tests with the latest appscope build from the master branch (f9acd14 is the version of master at this time). The testing iapaddler described above was with the release/1.1 branch. When I switched to the release/1.1 branch, I could reproduce the problem. Well, I could see the problem in the musl integration test container, not outside of it.

So, to fold this new info into what was observed above, missing console data only occurs when:
- running a build from the release/1.1 branch (not master), and
- running inside the musl integration test container (not directly on the host).
So I tried adding a TCP_NODELAY socket option to disable the Nagle algorithm, and that made the problem go away.
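The exact change isn't preserved in this thread; a minimal sketch of what disabling Nagle looks like at the socket level (the function name here is hypothetical):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Hypothetical illustration only -- the actual diff tried above is not
// preserved in this thread. TCP_NODELAY disables the Nagle algorithm so
// small writes go out immediately instead of being coalesced.
static void
disableNagle(int sockfd)
{
    int one = 1;
    setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}
```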
OK, but disabling the Nagle algorithm via setsockopt(TCP_NODELAY) or BIO_set_tcp_ndelay() is not a great idea. The Nagle algorithm exists to keep a bunch of small packets (Nagle called them "tinygrams") from consuming network bandwidth, admittedly at the cost of a little latency. So what to do?

Based on this link, http://www.stuartcheshire.org/papers/NagleDelayedAck/, I started to wonder if the problem I was observing was due to the interaction between Nagle and Delayed ACK. So I ran another test, this time adding a change to src/transport.c:establishTlsSession() on the release/1.1 branch that enables TCP_QUICKACK. Here are two somewhat recent discussions, https://news.ycombinator.com/item?id=24785405 and https://news.ycombinator.com/item?id=9048947, where John Nagle himself (as user Animats) weighs in and recommends the use of QUICKACK.
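The patch itself isn't reproduced in this thread; a minimal sketch of the idea, assuming the transport socket's file descriptor is available inside establishTlsSession():

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Hypothetical sketch of the change described above, not the actual patch
// from the release/1.1 branch. TCP_QUICKACK is Linux-specific, and the
// kernel may silently revert to delayed ACKs, so some code re-applies it
// after socket reads.
static void
enableQuickAck(int sockfd)
{
    int one = 1;
    if (setsockopt(sockfd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one)) == -1) {
        // Non-fatal: delayed ACKs simply remain enabled.
    }
}
```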
So with the QUICKACK fix alone I almost always get the console message successfully, but occasionally (maybe 1 out of 20 times or so) I don't. I see a "connection reset by peer" message from socat quite a bit, and I always see it when socat didn't receive the console message.
So I added another fix on top of the QUICKACK fix described above: a scope_shutdown() call immediately preceding the scope_close() call in transport.c:shutdownTlsSession() (sketched below). I did this because this link made me think it might be worth a try: https://stackoverflow.com/questions/1434451/what-does-connection-reset-by-peer-mean

With these two changes together, things are incrementally better. Socat almost always receives the data. Now the only time I see "connection reset by peer" is when socat doesn't receive the console data. (Previously I would sometimes see the connection reset message and still receive the console data.) I'm choosing to keep the scope_shutdown() because I've observed that it makes the "connection reset by peer" less frequent.

After more monkeying around with things in this final state, I think the remaining times where socat spits out the "connection reset by peer" message are when the kernel network buffers are being overwhelmed by data. If I add BIO_set_tcp_ndelay(), I can completely get rid of this problem, but I'm not sure whether it could have unintended side effects. I'm pretty confident that the QUICKACK and scope_shutdown() changes do not have side effects, so I'm going to go with those for now.

Oh, and as a final note, I've added the musl integration test back in, as this ticket was originally written to do. =)
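For reference, a minimal sketch of the shutdown-before-close ordering described above. In AppScope these calls go through internal wrappers (scope_shutdown/scope_close in src/transport.c); plain shutdown()/close() are shown here for clarity:

```c
#include <sys/socket.h>
#include <unistd.h>

// Simplified sketch, not the actual transport.c code. An orderly shutdown
// sends a FIN and gives queued data a chance to drain before the
// descriptor is destroyed, which makes an RST ("connection reset by peer")
// on the peer side less likely.
static void
shutdownThenClose(int sockfd)
{
    shutdown(sockfd, SHUT_RDWR);
    close(sockfd);
}
```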
During the review process, I ended up writing this script, which can be run inside the musl container...

Since I found it helpful, I thought I'd capture it here. By running this script inside the musl container with different combinations of the changes I made above, I've confirmed that:
I'm marking this as done, since we've merged it into release/1.1.
@abetones Changelog for this might appear as something like: