Skip to content

Conversation

@shinrich
Copy link
Member

Thought I fixed this with 507368d, but after wider deployment of that fix, we got another crash on a freed client_vc. Another look at the ProxyClientSession::handle_api_return, shows that I missed the very obvious call to vc->do_io_close() before ProxyClientSession::free() which will trigger the exactly same freed client vc with stale reference. So the original fix is good but not sufficient.

Here is our most recent stack trace

#0  0x0000000000000000 in ?? ()
#1  0x00000000006c5c43 in Http2ClientSession::free (this=0x2af4403bdc60) at ../../../../trafficserver/proxy/http2/Http2ClientSession.cc:94
#2  0x00000000005bd5c9 in ProxyClientSession::handle_api_return (this=0x2af4403bdc60, event=0) at ../../../trafficserver/proxy/ProxyClientSession.cc:212
#3  0x00000000005bd3bc in ProxyClientSession::state_api_callout (this=0x2af4403bdc60, event=0, data=0x0) at ../../../trafficserver/proxy/ProxyClientSession.cc:156
#4  0x00000000005bd4a6 in ProxyClientSession::do_api_callout (this=0x2af4403bdc60, id=TS_HTTP_SSN_CLOSE_HOOK) at ../../../trafficserver/proxy/ProxyClientSession.cc:182
#5  0x00000000006c5bc8 in Http2ClientSession::destroy (this=0x2af4403bdc60) at ../../../../trafficserver/proxy/http2/Http2ClientSession.cc:82
#6  0x00000000006ce838 in Http2ConnectionState::release_stream (this=0x2af4403bdf00, stream=0x2af30c262660) at ../../../../trafficserver/proxy/http2/Http2ConnectionState.cc:1246
#7  0x00000000006bf399 in Http2Stream::destroy (this=0x2af30c262660) at ../../../../trafficserver/proxy/http2/Http2Stream.cc:716
#8  0x00000000006bd6fd in Http2Stream::terminate_if_possible (this=0x2af30c262660) at ../../../../trafficserver/proxy/http2/Http2Stream.cc:377
#9  0x00000000006bd63b in Http2Stream::transaction_done (this=0x2af30c262660) at ../../../../trafficserver/proxy/http2/Http2Stream.cc:366
#10 0x000000000066a16b in HttpSM::kill_this (this=0x2ae8e8113900) at ../../../../trafficserver/proxy/http/HttpSM.cc:7078
#11 0x000000000065b2af in HttpSM::main_handler (this=0x2ae8e8113900, event=2301, data=0x2ae8e8114c80) at ../../../../trafficserver/proxy/http/HttpSM.cc:2826
#12 0x0000000000831494 in Continuation::dispatchEvent (this=0x2ae8e8113900, event=2301, data=0x2ae8e8114c80) at ../../../../trafficserver/iocore/eventsystem/Continuation.cc:46
#13 0x00000000006b5f32 in HttpTunnel::main_handler (this=0x2ae8e8114c80, event=104, data=0x2aea980a5bb0) at ../../../../trafficserver/proxy/http/HttpTunnel.cc:1650
#14 0x00000000008313ad in Continuation::handleEvent (this=0x2ae8e8114c80, event=104, data=0x2aea980a5bb0) at ../../../../trafficserver/iocore/eventsystem/Continuation.cc:33
#15 0x000000000080e35a in read_signal_and_update (event=104, vc=0x2aea980a5a60) at ../../../../trafficserver/iocore/net/UnixNetVConnection.cc:144
#16 0x000000000080e6d5 in read_signal_done (event=104, nh=0x2ae87541dcf0, vc=0x2aea980a5a60) at ../../../../trafficserver/iocore/net/UnixNetVConnection.cc:205
#17 0x00000000008113d2 in UnixNetVConnection::readSignalDone (this=0x2aea980a5a60, event=104, nh=0x2ae87541dcf0) at ../../../../trafficserver/iocore/net/UnixNetVConnection.cc:1096
#18 0x00000000007e91e4 in SSLNetVConnection::net_read_io (this=0x2aea980a5a60, nh=0x2ae87541dcf0, lthread=0x2ae87541a010) at ../../../../trafficserver/iocore/net/SSLNetVConnection.cc:649
#19 0x0000000000806443 in NetHandler::waitForActivity (this=0x2ae87541dcf0, timeout=60000000) at ../../../../trafficserver/iocore/net/UnixNet.cc:497
#20 0x0000000000833b3a in EThread::execute_regular (this=0x2ae87541a010) at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:273
#21 0x0000000000833d01 in EThread::execute (this=0x2ae87541a010) at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:326
#22 0x0000000000832936 in spawn_thread_internal (a=0x2092df0) at ../../../../trafficserver/iocore/eventsystem/Thread.cc:85
#23 0x00002ae87314fdc5 in start_thread () from /lib64/libpthread.so.0
#24 0x00002ae873e7e76d in clone () from /lib64/libc.so.6

Looking at frame 1, client_vc is non-null, but it has been freed (its vtable pointer is bogus). Looking back at ProxyClientSession::handle_api_return, were we removed the call to net_vc, we see there is still a problem.

     NetVConnection *vc = this->get_netvc();
     if (vc) {
       vc->do_io_close();
     }
     free(); // You can now clean things up
     break;

We free the vc the call to do_io_close because we came in on the other vc, so this vc does not have its recursion count bumped. The Http2ClientSession still has a stale reference to the vc via its client_vc member, so then bad things can happen if that vc has already been freed.

Http2CientSession and Http1ClientSession both call do_io_close and clear the client VC in their ::do_io_close methods. But for their ::free methods only Http2ClientSession calls do_io_close on the client vc. Http1ClientSession does not.

Either both should or neither should. We cannot safely do both in handle_api_return. After calling ::free the session is gone, so we cannot reach in and clean up the VC afterwards. If we continue with the early vc->do_io_close, we would have to make another call to set the VC to null before calling into ::free.

At this point, it would be cleanest just to defer the client vc clean up to the Http*ClientSession::free logic, which is what I propose in this PR.

@shinrich shinrich added this to the 9.0.0 milestone Aug 30, 2018
@shinrich shinrich self-assigned this Aug 30, 2018
@bryancall
Copy link
Contributor

bryancall commented Aug 30, 2018

I am concerned with pulling these locking and cleanup PRs (there have been a bunch now) into the 8.0.0 release. @zwoop and I have been running 8.0.0 is production without any issues and it looks like these PRs would make the release less stable. I am also concerned that many of the PRs are reviewed by only one person.

@shinrich
Copy link
Member Author

It is a pretty narrow crash window. It took 1 week running in 1.5 colos for it to appear. Since the core pointed out the problem, I figured we should go ahead and make the fix. Of course you can decide whether to risk putting it into the current release trains or not.

Copy link
Contributor

@masaori335 masaori335 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable

@masaori335
Copy link
Contributor

@zwoop As for 7.1.x branch, we need this because we backported 507368d.

@shinrich shinrich merged commit b15b63e into apache:master Sep 5, 2018
@masaori335
Copy link
Contributor

This looks also need to be backported to 8.0.x, because we faced #4147 crash with 8.0.3 (rc).

@zwoop
Copy link
Contributor

zwoop commented Feb 13, 2019

Cherry-picked to 8.0.x

@zwoop zwoop modified the milestones: 9.0.0, 8.0.3 Feb 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants