Skip to content

Conversation

@shinrich
Copy link
Member

@shinrich shinrich commented Mar 2, 2020

We saw a number of crashes in one of our 9.0 deployments last week.

Most of the crashes look like the NetVC associated with the ProxySession has already been deleted and we are trying to delete it again or access it from the HttpSM::kill_this() path.

Here are a couple stacks.

The client_vc in Http1ClientSession seems to be already freed.

#0  0x0000000000519d03 in Http1ClientSession::do_io_write (this=<optimized out>, c=0x0, nbytes=0, buf=0x0, owner=false)
    at ../../../../../../_vcs/trafficserver9/proxy/http/Http1ClientSession.cc:220
#1  0x0000000000519f82 in Http1ClientSession::free (this=0x2ab43ccc8d00) at ../../../../../../_vcs/trafficserver9/proxy/http/Http1ClientSession.cc:115
#2  0x00000000006b5181 in ProxySession::handle_api_return (this=<optimized out>, event=<optimized out>)
    at ../../../../../_vcs/trafficserver9/proxy/ProxySession.cc:171
#3  0x00000000006b56ed in ProxySession::do_api_callout (this=this@entry=0x2ab43ccc8d00, id=id@entry=TS_HTTP_SSN_CLOSE_HOOK)
    at ../../../../../_vcs/trafficserver9/proxy/ProxySession.cc:148
#4  0x000000000051befd in destroy (this=0x2ab43ccc8d00) at ../../../../../../_vcs/trafficserver9/proxy/http/Http1ClientSession.cc:79
#5  Http1ClientSession::destroy (this=0x2ab43ccc8d00) at ../../../../../../_vcs/trafficserver9/proxy/http/Http1ClientSession.cc:68
#6  0x0000000000544331 in HttpSM::kill_this() () at ../../../../../../_vcs/trafficserver9/proxy/http/HttpSM.cc:6904
#7  0x0000000000544790 in main_handler (data=0x2ab55257df08, event=2301, this=0x2ab55257d000) at ../../../../../../_vcs/trafficserver9/proxy/http/HttpSM.cc:2624
#8  HttpSM::main_handler (this=0x2ab55257d000, event=2301, data=0x2ab55257df08) at ../../../../../../_vcs/trafficserver9/proxy/http/HttpSM.cc:2581
#9  0x0000000000589cc9 in handleEvent (data=0x2ab55257df08, event=2301, this=0x2ab55257d000)
    at /sd/workspace/src/git.ouroath.com/Edge/build/_build/build_release_posix-x86_64_gcc_8/trafficserver9/build/../../../../_vcs/trafficserver9/iocore/eventsystem/I_Continuation.h:167
#10 handleEvent (data=0x2ab55257df08, event=2301, this=0x2ab55257d000)
    at /sd/workspace/src/git.ouroath.com/Edge/build/_build/build_release_posix-x86_64_gcc_8/trafficserver9/build/../../../../_vcs/trafficserver9/iocore/eventsystem/I_Continuation.h:163
#11 HttpTunnel::main_handler (this=0x2ab55257df08, event=<optimized out>, data=<optimized out>)
    at ../../../../../../_vcs/trafficserver9/proxy/http/HttpTunnel.cc:1625
#12 0x00000000006743e2 in handleEvent (data=0x2ab26a64c660, event=103, this=0x2ab55257df08)
    at /sd/workspace/src/git.ouroath.com/Edge/build/_build/build_release_posix-x86_64_gcc_8/trafficserver9/build/../../../../_vcs/trafficserver9/iocore/eventsystem/I_Continuation.h:167
#13 handleEvent (data=0x2ab26a64c660, event=103, this=0x2ab55257df08)
    at /sd/workspace/src/git.ouroath.com/Edge/build/_build/build_release_posix-x86_64_gcc_8/trafficserver9/build/../../../../_vcs/trafficserver9/iocore/eventsystem/I_Continuation.h:163
#14 calluser (event=103, this=0x2ab26a64c400) at ../../../../../../_vcs/trafficserver9/iocore/cache/P_CacheInternal.h:624
#15 CacheVC::openWriteMain (this=0x2ab26a64c400) at ../../../../../../_vcs/trafficserver9/iocore/cache/CacheWrite.cc:1395
#16 0x000000000073abbb in handleEvent (data=0x2ab282774b00, event=1, this=0x2ab26a64c400)
    at ../../../../../../_vcs/trafficserver9/iocore/eventsystem/I_Continuation.h:167
#17 handleEvent (data=0x2ab282774b00, event=1, this=0x2ab26a64c400) at ../../../../../../_vcs/trafficserver9/iocore/eventsystem/I_Continuation.h:163
#18 EThread::process_event(Event*, int) () at ../../../../../../_vcs/trafficserver9/iocore/eventsystem/UnixEThread.cc:143
#19 0x000000000073b2de in EThread::process_queue (this=this@entry=0x2aaf2aadd180, NegativeQueue=NegativeQueue@entry=0x2aaf2e806600, 
    ev_count=ev_count@entry=0x2aaf2e8065fc, nq_count=nq_count@entry=0x2aaf2e8065f8) at ../../../../../../_vcs/trafficserver9/iocore/eventsystem/UnixEThread.cc:182
#20 0x000000000073b78c in EThread::execute_regular (this=this@entry=0x2aaf2aadd180) at ../../../../../../_vcs/trafficserver9/iocore/eventsystem/UnixEThread.cc:242
#21 0x000000000073bec6 in execute (this=0x2aaf2aadd180) at ../../../../../../_vcs/trafficserver9/iocore/eventsystem/UnixEThread.cc:351
#22 EThread::execute (this=0x2aaf2aadd180) at ../../../../../../_vcs/trafficserver9/iocore/eventsystem/UnixEThread.cc:329
#23 0x000000000073a0e9 in spawn_thread_internal (a=0x2aaf26e828c0) at ../../../../../../_vcs/trafficserver9/iocore/eventsystem/Thread.cc:92
#24 0x00002aaf2525bdd5 in start_thread () from /lib64/libpthread.so.0
#25 0x00002aaf25f90ead in clone () from /lib64/libc.so.6

Here the ssl_vc in frame 7 is associated with a different frame's net handler so the mutex assert triggers. I'm assuming that ssl_vc has been freed and reallocated.

#0  0x00002b9362e04207 in raise () from /lib64/libc.so.6
#1  0x00002b9362e058f8 in abort () from /lib64/libc.so.6
#2  0x00002b936059b98b in ink_abort (message_format=message_format@entry=0x2b9360606507 "%s:%d: failed assertion `%s`")
    at ../../../../../../_vcs/trafficserver9/src/tscore/ink_error.cc:99
#3  0x00002b9360599145 in _ink_assert (expression=expression@entry=0x758b44 "t == m->getThreadAffinity()", 
    file=file@entry=0x758798 "/home/shinrich/build2/_build/build_release_posix-x86_64/trafficserver9/build/../../../../_vcs/trafficserver9/iocore/eventsystem/I_Lock.h", line=line@entry=266) at ../../../../../../_vcs/trafficserver9/src/tscore/ink_assert.cc:37
#4  0x00000000004efc65 in Mutex_trylock (t=0x2b93680db900, m=<optimized out>)
    at /home/shinrich/build2/_build/build_release_posix-x86_64/trafficserver9/build/../../../../_vcs/trafficserver9/iocore/eventsystem/I_ThreadAffinity.h:43
#5  Mutex_trylock (t=0x2b93680db900, m=...)
    at /home/shinrich/build2/_build/build_release_posix-x86_64/trafficserver9/build/../../../../_vcs/trafficserver9/iocore/eventsystem/I_Lock.h:309
#6  MutexTryLock (t=0x2b93680db900, am=..., this=0x2b936e2fcfd0)
    at /home/shinrich/build2/_build/build_release_posix-x86_64/trafficserver9/build/../../../../_vcs/trafficserver9/iocore/eventsystem/I_Lock.h:549
#7  TSVConnReenableEx () at ../../../../../_vcs/trafficserver9/src/traffic_server/InkAPI.cc:9554
#8  0x00000000004efe9a in TSVConnReenable (vconn=vconn@entry=0x2b982d705700) at ../../../../../_vcs/trafficserver9/src/traffic_server/InkAPI.cc:9541
#9  0x00002b939312f478 in client_hello_ja3_handler (contp=<optimized out>, event=<optimized out>, edata=0x2b982d705700)
    at ../../../../../_vcs/trafficserver9/plugins/experimental/ja3_fingerprint/ja3_fingerprint.cc:346
#10 0x00000000004db770 in INKContInternal::handle_event (this=0x2b9367f22ee0, event=108, edata=0x2b982d705700)
    at ../../../../../_vcs/trafficserver9/src/traffic_server/InkAPI.cc:1101
#11 0x00000000004ed2bb in handleEvent (data=0x2b982d705700, event=108, this=0x2b9367f22ee0)
    at /home/shinrich/build2/_build/build_release_posix-x86_64/trafficserver9/build/../../../../_vcs/trafficserver9/iocore/eventsystem/I_Continuation.h:167
#12 handleEvent (data=0x2b982d705700, event=108, this=0x2b9367f22ee0)
    at /home/shinrich/build2/_build/build_release_posix-x86_64/trafficserver9/build/../../../../_vcs/trafficserver9/iocore/eventsystem/I_Continuation.h:163
#13 APIHook::invoke(int, void*) const () at ../../../../../_vcs/trafficserver9/src/traffic_server/InkAPI.cc:1338
#14 0x00000000006ea70f in SSLNetVConnection::callHooks(TSEvent) () at ../../../../../../_vcs/trafficserver9/iocore/net/SSLNetVConnection.cc:1807
#15 0x00000000006ea9a4 in SSLNetVConnection::do_io_close (this=0x2b982d705700, lerrno=-1)
    at ../../../../../../_vcs/trafficserver9/iocore/net/SSLNetVConnection.cc:880
#16 0x000000000051fad5 in Http1ClientSession::free (this=0x2ba8d3c2bb00) at ../../../../../../_vcs/trafficserver9/proxy/http/Http1ClientSession.cc:121
#17 0x00000000006cce31 in ProxySession::handle_api_return (this=<optimized out>, event=<optimized out>)
    at ../../../../../_vcs/trafficserver9/proxy/ProxySession.cc:171
#18 0x00000000006cd39d in ProxySession::do_api_callout (this=this@entry=0x2ba8d3c2bb00, id=id@entry=TS_HTTP_SSN_CLOSE_HOOK)
    at ../../../../../_vcs/trafficserver9/proxy/ProxySession.cc:148
#19 0x0000000000521a2d in destroy (this=0x2ba8d3c2bb00) at ../../../../../../_vcs/trafficserver9/proxy/http/Http1ClientSession.cc:79
#20 Http1ClientSession::destroy (this=0x2ba8d3c2bb00) at ../../../../../../_vcs/trafficserver9/proxy/http/Http1ClientSession.cc:68
#21 0x000000000054b001 in HttpSM::kill_this() () at ../../../../../../_vcs/trafficserver9/proxy/http/HttpSM.cc:6904
#22 0x000000000054b460 in main_handler (data=0x2ba78fe45f40, event=2301, this=0x2ba78fe45000) at ../../../../../../_vcs/trafficserver9/proxy/http/HttpSM.cc:2624
#23 HttpSM::main_handler (this=0x2ba78fe45000, event=2301, data=0x2ba78fe45f40) at ../../../../../../_vcs/trafficserver9/proxy/http/HttpSM.cc:2581
#24 0x00000000005901b9 in handleEvent (data=0x2ba78fe45f40, event=2301, this=0x2ba78fe45000)
    at /home/shinrich/build2/_build/build_release_posix-x86_64/trafficserver9/build/../../../../_vcs/trafficserver9/iocore/eventsystem/I_Continuation.h:167
#25 handleEvent (data=0x2ba78fe45f40, event=2301, this=0x2ba78fe45000)
    at /home/shinrich/build2/_build/build_release_posix-x86_64/trafficserver9/build/../../../../_vcs/trafficserver9/iocore/eventsystem/I_Continuation.h:163
#26 HttpTunnel::main_handler (this=0x2ba78fe45f40, event=<optimized out>, data=<optimized out>)
    at ../../../../../../_vcs/trafficserver9/proxy/http/HttpTunnel.cc:1625
#27 0x000000000068c122 in handleEvent (data=0x2ba6fa8d5260, event=103, this=0x2ba78fe45f40)
    at /home/shinrich/build2/_build/build_release_posix-x86_64/trafficserver9/build/../../../../_vcs/trafficserver9/iocore/eventsystem/I_Continuation.h:167
#28 handleEvent (data=0x2ba6fa8d5260, event=103, this=0x2ba78fe45f40)
    at /home/shinrich/build2/_build/build_release_posix-x86_64/trafficserver9/build/../../../../_vcs/trafficserver9/iocore/eventsystem/I_Continuation.h:163
#29 calluser (event=103, this=0x2ba6fa8d5000) at ../../../../../../_vcs/trafficserver9/iocore/cache/P_CacheInternal.h:624
#30 CacheVC::openWriteMain (this=0x2ba6fa8d5000) at ../../../../../../_vcs/trafficserver9/iocore/cache/CacheWrite.cc:1395
#31 0x000000000075281b in handleEvent (data=0x2ba78cf59de0, event=1, this=0x2ba6fa8d5000)
    at ../../../../../../_vcs/trafficserver9/iocore/eventsystem/I_Continuation.h:167
#32 handleEvent (data=0x2ba78cf59de0, event=1, this=0x2ba6fa8d5000) at ../../../../../../_vcs/trafficserver9/iocore/eventsystem/I_Continuation.h:163
#33 EThread::process_event(Event*, int) () at ../../../../../../_vcs/trafficserver9/iocore/eventsystem/UnixEThread.cc:143
#34 0x0000000000752f3e in EThread::process_queue (this=this@entry=0x2b93680db900, NegativeQueue=NegativeQueue@entry=0x2b936e2fd600, 
    ev_count=ev_count@entry=0x2b936e2fd5fc, nq_count=nq_count@entry=0x2b936e2fd5f8) at ../../../../../../_vcs/trafficserver9/iocore/eventsystem/UnixEThread.cc:182
#35 0x00000000007533ec in EThread::execute_regular (this=this@entry=0x2b93680db900) at ../../../../../../_vcs/trafficserver9/iocore/eventsystem/UnixEThread.cc:242
#36 0x0000000000753b26 in execute (this=0x2b93680db900) at ../../../../../../_vcs/trafficserver9/iocore/eventsystem/UnixEThread.cc:351
#37 EThread::execute (this=0x2b93680db900) at ../../../../../../_vcs/trafficserver9/iocore/eventsystem/UnixEThread.cc:329
#38 0x0000000000751d49 in spawn_thread_internal (a=0x2b9363dbbd00) at ../../../../../../_vcs/trafficserver9/iocore/eventsystem/Thread.cc:92
#39 0x00002b9362196dd5 in start_thread () from /lib64/libpthread.so.0
#40 0x00002b9362ecbead in clone () from /lib64/libc.so.6

Looking more closely into the history of the HttpSM of the first stack. The HttpSM sends a 304 back to the client but then sets SM_ACTION_INTERNAL_CACHE_WRITE to write the body of the 200 response it got from the origin to cache.

The issue is that the code does call ProxyTransaction::release after sending the 304 which causes another Session specific handler to be set to handle future network IO on the client_vc. If a EOS or timeout occurs, the client_vc object will cleaned up. But ProxySession objects will still refer to the client_vc stale pointer. This PR clears the write_vio and clears the client_vc when the ProxySession object receives a _EOS or session ending timeout.

Testing this patch on a prod box.

@shinrich shinrich added this to the 10.0.0 milestone Mar 2, 2020
@shinrich shinrich self-assigned this Mar 2, 2020
SolidWallOfCode
SolidWallOfCode previously approved these changes Mar 3, 2020
Copy link
Member

@SolidWallOfCode SolidWallOfCode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discussed this directly with @shinrich and I think it's a reasonable change. As usual, the question wasn't
why this was needed, but why we haven't seen this problem before.

@shinrich shinrich force-pushed the deal-with-early-client-finish branch 2 times, most recently from b852dc6 to ceb6582 Compare March 3, 2020 18:49
@shinrich shinrich merged commit ba00281 into apache:master Mar 4, 2020
@zwoop zwoop added the OnDocs This is for PR currently running, or will run, on the Docs ATS server label Mar 5, 2020
@zwoop
Copy link
Contributor

zwoop commented Mar 6, 2020

Tested on docs.
Cherry-picked to v9.0.x branch.

@zwoop zwoop removed the OnDocs This is for PR currently running, or will run, on the Docs ATS server label Mar 6, 2020
@zwoop zwoop modified the milestones: 10.0.0, 9.0.0 Mar 6, 2020
This was referenced Apr 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants