
Fix intermittent musl integration test #781

Closed
jrcheli opened this issue Feb 11, 2022 · 9 comments · Fixed by #920

@jrcheli
Contributor

jrcheli commented Feb 11, 2022

#768 added an integration test which is valuable, but not stable right now.
Specifically, in the musl integration test, the "tls" test case can pass, but it does not pass reliably enough to run in our pipeline on every build.

I'm going to comment out this test case on master, and leave this ticket open as a reminder to 1) make it stable and then 2) restore the test functionality.

@jrcheli
Contributor Author

jrcheli commented Feb 11, 2022

The test case was commented out on master with this commit:
eb56998
This was done just before v1.0.0 was built.

@ledbit ledbit added the tests label Mar 16, 2022
@ghost ghost added this to the Next Minor (1.1.0) milestone Mar 23, 2022
@jrcheli jrcheli assigned iapaddler and unassigned jrcheli Apr 28, 2022
@iapaddler
Contributor

A test in the musl integration suite has been disabled. The failure is an inability to disconnect cleanly from a TLS session when tcpserver provides the remote connection. The same test using socat passes consistently.

@iapaddler
Contributor

iapaddler commented Apr 29, 2022

We've discovered a few things with this test. This is a simple TLS test in musl:

SCOPE_EVENT_DEST=tcp://127.0.0.1:12345 \
SCOPE_EVENT_TLS_ENABLE=true \
SCOPE_EVENT_TLS_VALIDATE_SERVER=false \
SCOPE_EVENT_TLS_CA_CERT_PATH=/usr/local/scope/cert.pem \
ldscope ps -ef
  1. The initial issue causing test failures appears to be with tcpserver.
  2. tcpserver in plain TCP mode seems to be solid; as a TLS server there appears to be an issue that we have not tracked down.
  3. Replacing tcpserver with socat allows the test to work:
cat /usr/local/scope/key.pem /usr/local/scope/cert.pem >> /tmp/appscope.pem
socat OPENSSL-LISTEN:12345,reuseaddr,cert=/tmp/appscope.pem,cafile=/tmp/appscope.pem,fork,verify=0 $EVT_FILE &

The next issue that arises is occasionally missing console data from the ps -ef command. Several things to note about this behavior:

  • It only occurs if TLS is enabled for the event connection.
  • It only occurs when raw events are enabled (SCOPE_EVENT_METRIC=true).
  • In this test, a lot of file system events are seen, followed by a console event.

We are not getting all events out on close. There appear to be a couple of possibilities:

  • TCP packets are delivered to the kernel and placed in the process-specific TCP send list. These are not sent immediately, and when the process exits, the packet(s) are never sent. There is a potential explanation: the Nagle algorithm, used by default in the TCP/IP stack, is an optimization intended to reduce the number of small packets sent, and its use implies a delay of roughly 40ms. As a test, adding a 50ms delay in the handleExit() function, after all data has been sent but before sockets are shut down and closed, resolves the missing console events.
  • Given that missing console events are only evident when TLS is enabled and there is a large amount of traffic at exit time (as when raw events are enabled), this lends validity to the Nagle theory.
  • There are a couple of ways to essentially disable Nagle, if delayed sends are the cause: a socket send with the MSG_MORE flag, or the TCP_CORK socket option, are both expected to override the Nagle algorithm (see the sketch after this list). These have not yet been tested. However, even if a socket-level approach works, there is concern about the effect on overall performance and behavior.
  • It is possible that there is an SSL-specific issue causing an exit to occur before all packets have been sent. There is a reference to Nagle in the context of an SSL read of early data; however, this does not appear to be an early-read issue. There may be SSL-specific behavior intended to ensure that all packets are transmitted before a disconnect completes. We need to do more research.
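As a rough illustration of the MSG_MORE/TCP_CORK idea, a minimal sketch follows. This is untested here, TCP_CORK is Linux-specific, and the function and parameter names are hypothetical:

#include <stddef.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Cork the socket so the kernel holds partial frames, then uncork;
 * uncorking pushes any queued partial frame onto the wire immediately,
 * sidestepping the Nagle delay for a final small send. */
static void send_then_flush(int fd, const void *buf, size_t len) {
    int on = 1, off = 0;
    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
    send(fd, buf, len, 0);
    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
}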

In summary, we will push updates to branch bug/781_tls_musl and continue SSL research. Neither the prospect of a delay on exit nor a wholesale socket option change seems like the right approach at this time.

Handing the SSL specifics to @jrcheli (off to a couple of Go items).

@jrcheli
Contributor Author

jrcheli commented May 2, 2022

So I've tried to reproduce this with the above instructions on an EC2 machine directly (without using the musl container) and couldn't reproduce the problem. (Meaning that I saw every console event every time, while using TLS and with SCOPE_EVENT_METRIC=true.)

So I switched to doing the same test inside the musl integration test container, and couldn't reproduce it again... But I did both of these tests with the latest appscope build from the master branch. The testing iapaddler described above was with the release/1.1 branch.

When I switched to the release/1.1 branch I could reproduce the problem. Well, I could see the problem in the musl integration test container, not outside of it.

f9acd14 is the version of master at this time
f4d76b1 is the 1.1 version at this time

So to include this new info with what was observed above... missing console data only occurs when:

  • TLS is enabled for the event connection
  • raw events are enabled (SCOPE_EVENT_METRIC=true)
  • running on the release/1.1 branch
  • running in the musl container

So I tried adding BIO_set_tcp_ndelay(trans->net.sock, TRUE); to src/transport.c:establishTlsSession() on the release/1.1 branch... And that seemed to resolve this issue. 👀
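For context, BIO_set_tcp_ndelay() is the OpenSSL helper that sets TCP_NODELAY on the underlying socket; in plain-socket terms it amounts to roughly the following (a sketch only, error handling omitted, not the actual transport.c change):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable the Nagle algorithm: small writes go out immediately
 * instead of being coalesced while waiting for an ACK. */
static int disable_nagle(int sock) {
    int on = 1;
    return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}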

@jrcheli
Contributor Author

jrcheli commented May 3, 2022

Ok, but disabling the Nagle algorithm via a setsockopt(TCP_NODELAY) or BIO_set_tcp_ndelay() is not a great idea. The Nagle algorithm exists to keep a bunch of small packets (Nagle called them "tinygrams") from consuming network bandwidth, admittedly at the cost of a little latency. So what to do?

Based on this link: http://www.stuartcheshire.org/papers/NagleDelayedAck/ I started to wonder if the problem I was observing was due to the interaction between Nagle and delayed ACKs.

So I ran another test, this time adding the code below to src/transport.c:establishTlsSession() on the release/1.1 branch. What it does is enable TCP_QUICKACK. Here are two somewhat recent discussions, https://news.ycombinator.com/item?id=24785405 and https://news.ycombinator.com/item?id=9048947, where John Nagle himself (as user Animats) weighs in and recommends the use of QUICKACK.

#if defined(TCP_QUICKACK) && (defined(IPPROTO_TCP) || defined(SOL_TCP))
    int on = TRUE;

    /* Prefer SOL_TCP where it exists (Linux); fall back to IPPROTO_TCP. */
#ifdef SOL_TCP
    int opt = SOL_TCP;
#else
    int opt = IPPROTO_TCP;
#endif

    /* Send ACKs immediately instead of delaying them. Note that per
     * tcp(7), TCP_QUICKACK is not permanent; the kernel can revert to
     * delayed ACKs later. */
    scope_setsockopt(s, opt, TCP_QUICKACK, &on, sizeof(on));
#endif

@jrcheli
Contributor Author

jrcheli commented May 3, 2022

So with the QUICKACK fix alone I almost always get the console message successfully, but occasionally (maybe 1 out of 20 times or so) I don't. I see the following message from socat fairly often, and I always see it when socat didn't receive the console message:

2022/05/03 18:02:28 socat[218] E SSL_read(): Connection reset by peer

So, I added another fix on top of the QUICKACK fix described above... I added a scope_shutdown() call immediately preceding the scope_close() call in transport.c:shutdownTlsSession(). I did this because this link made me think it might be worth a try... https://stackoverflow.com/questions/1434451/what-does-connection-reset-by-peer-mean
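In plain POSIX terms the ordering looks roughly like this (a sketch only; scope_shutdown()/scope_close() are AppScope's internal wrappers, and the SHUT_RDWR choice here is an assumption):

#include <sys/socket.h>
#include <unistd.h>

/* Initiate an orderly FIN before closing. Signaling EOF to the peer
 * can avoid the abrupt RST that shows up on the far end as
 * "connection reset by peer". */
static void orderly_teardown(int fd) {
    shutdown(fd, SHUT_RDWR);
    close(fd);
}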

With these two changes together, things are incrementally better. Socat almost always receives the data. Now the only time I see the connection reset by peer is when socat doesn't receive the console data. (Previously I would sometimes see the connection reset message and still receive the console data.) I'm choosing to keep the scope_shutdown(), because I've observed that it makes the "connection reset by peer" less frequent.

After more monkeying around with things in this final state, I think the remaining times where I see socat spit out the "connection reset by peer" message are when the kernel network buffers are being overwhelmed by data. If I add the BIO_set_tcp_ndelay(), I can completely get rid of this problem, but I'm not sure if this could have unintended side effects. I'm pretty confident that the QUICKACK and scope_shutdown() do not have any side effects, so I think I'm going to go with that for now.

Oh, and as a final note, I've added the musl integration test back in, as this ticket was originally written to do. =)

@jrcheli
Contributor Author

jrcheli commented May 4, 2022

During the review process, I ended up writing this script which can be run inside of the musl container...

#!/bin/bash

COUNT=0
while [ $COUNT -lt 20 ]; do
    SCOPE_EVENT_DEST=tcp://127.0.0.1:9109 \
    SCOPE_EVENT_TLS_ENABLE=true \
    SCOPE_EVENT_METRIC=true \
    SCOPE_EVENT_TLS_CA_CERT_PATH=/tmp/appscope.pem \
    ldscope ps -ef
    echo "****************** $COUNT ********************"
    ((COUNT++))
done
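To run it, a matching TLS listener can be started first, along the lines of the socat command from the earlier comment (adjusted here to port 9109; $EVT_FILE is whichever file should capture the events):

cat /usr/local/scope/key.pem /usr/local/scope/cert.pem >> /tmp/appscope.pem
socat OPENSSL-LISTEN:9109,reuseaddr,cert=/tmp/appscope.pem,cafile=/tmp/appscope.pem,fork,verify=0 $EVT_FILE &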

Since I found it helpful, I thought I'd capture it here. By running this script inside the musl container and running with different combinations of the changes I made above, I've confirmed that:

  1. Adding the BIO_set_tcp_ndelay() is an honest fix if we choose to go that way in the future.
  2. The QUICKACK doesn't hurt anything, at least. It's not a magic bullet either.
  3. With the shutdown before close, the number of "connection reset by peer" messages from socat plus the number of successful console messages has always equaled 20 (the number of iterations above). (I think this supports the idea that the kernel is getting overwhelmed with data, which causes socat to output the "connection reset by peer" messages.)
  4. Without the shutdown before close, the number of "connection reset by peer" messages plus the number of successful console messages can be larger than 20. (Which suggests that the shutdown before close is helpful.)

jrcheli added a commit that referenced this issue May 4, 2022
… exit (#920)

* (#781) Fix intermittent musl integration test (tls at proc exit).

* (#781) Fix issues noticed during the review process.
@jrcheli
Contributor Author

jrcheli commented May 4, 2022

I'm marking this as done, since we've merged it into release/1.1

@jrcheli
Contributor Author

jrcheli commented Jun 10, 2022

@abetones Changelog for this might appear as something like:
Improve delivery of AppScope data as processes exit.
