Skip to content

Conversation

@sudheerv
Copy link
Contributor

@sudheerv sudheerv commented Aug 26, 2019

Replacing the fix with a more defensive fix to protect against double delete when Http2ConnectionState::destroy() is called more than once (similar to the technique used in Http2ClientSession::destroy() via a in_destroy state). (as it looks like preventing ua_session->destroy() from being called multiple times will result in leaking the io buffers). Also resetting the shutdown_cont_event

@sudheerv sudheerv added this to the 10.0.0 milestone Aug 26, 2019
@shinrich
Copy link
Member

Interesting. Do you see that as a crash case?

@bryancall bryancall assigned sudheerv and unassigned maskit and masaori335 Aug 26, 2019
@bryancall bryancall requested review from masaori335 and maskit August 26, 2019 17:49
@sudheerv
Copy link
Contributor Author

sudheerv commented Aug 26, 2019

Interesting. Do you see that as a crash case?

Yeah, it does cause a crash. Here's the traceback. I'm running the fix in prod to ensure there are no leaks or side-effects. It's been okay so far (about 8 hrs), will let it run for more than 24hours to be absolutely sure.

(gdb) bt
#0  0x00002b394e3e0de0 in ?? ()
#1  0x0000000000607811 in destroy (this=0x2b390d20a230) at Http2ConnectionState.h:161
#2  Http2ClientSession::free (this=<optimized out>) at Http2ClientSession.cc:137
#3  0x00000000004f16b1 in handle_api_return (event=<optimized out>, this=<optimized out>) at ProxyClientSession.cc:201
#4  ProxyClientSession::state_api_callout (this=0x2b390d209fb0, event=<optimized out>, data=<optimized out>) at ProxyClientSession.cc:153
#5  0x000000000060a6f6 in Http2ConnectionState::release_stream (this=this@entry=0x2b390d20a230, stream=stream@entry=0x2b3933c8d6c0) at Http2ConnectionState.cc:1102
#6  0x00000000005fa3be in Http2Stream::destroy (this=0x2b3933c8d6c0) at Http2Stream.cc:673
#7  0x00000000005facb5 in Http2Stream::terminate_if_possible (this=this@entry=0x2b3933c8d6c0) at Http2Stream.cc:368
#8  0x00000000005fadf2 in Http2Stream::transaction_done (this=0x2b3933c8d6c0) at Http2Stream.cc:357
#9  0x00000000005b5a5e in HttpSM::kill_this (this=this@entry=0x2b393e34ce40) at HttpSM.cc:7046
#10 0x00000000005b686d in HttpSM::state_api_callback (this=this@entry=0x2b393e34ce40, event=event@entry=60000, data=data@entry=0x0) at HttpSM.cc:1519
#11 0x00000000004d7d19 in TSHttpTxnReenable (txnp=0x2b393e34ce40, event=TS_EVENT_HTTP_CONTINUE) at InkAPI.cc:5711
#12 0x00002b38d36c3483 in MemberAssignmentRequestContext::process (this=0x2b38b4c278c0, url="stickyrouting/stickyrouting/memberBucketAssignments?action=fetch", 
    result=AtsPluginUtils::FetchResultProcessor::SUCCESS, status=TS_HTTP_STATUS_OK, 
    body=0x2b3a1f531539 "{\"value\":{\"lastIssued\":1566377666,\"expires\":1566457171,\"userGeneration\":437,\"coloChangedTime\":1565513471,\"currentBucket\":\"OB41\",\"bucketGeneration\":1835}}", 
    bodyLen=153) at MemberAssignmentRequestContext.cc:85
#13 0x00002b38c427809b in handleFetchEvents (cont=0x2b392a3f3cc0, event=10000, edata=0x2b38b4016200) at Utils.cc:172
#14 0x00000000004c2dd3 in INKContInternal::handle_event (this=0x2b392a3f3cc0, event=10000, edata=0x2b38b4016200) at InkAPI.cc:1012
#15 0x00000000004aeadf in handleEvent (data=0x2b38b4016200, event=10000, this=<optimized out>) at ../iocore/eventsystem/I_Continuation.h:153
#16 FetchSM::InvokePlugin (this=0x2b38b4016200, event=10000, data=0x2b38b4016200) at FetchSM.cc:104
#17 0x00000000004b036d in FetchSM::process_fetch_read (this=this@entry=0x2b38b4016200, event=event@entry=104) at FetchSM.cc:479
#18 0x00000000004b072b in FetchSM::fetch_handler (this=0x2b38b4016200, event=104, edata=0x2b38b4c34b48) at FetchSM.cc:525
#19 0x00000000004ebc1f in handleEvent (data=0x2b38b4c34b48, event=104, this=<optimized out>) at ../iocore/eventsystem/I_Continuation.h:153
#20 PluginVC::process_read_side (this=this@entry=0x2b38b4c34a38, other_side_call=other_side_call@entry=false) at PluginVC.cc:627
#21 0x00000000004ee01a in PluginVC::main_handler (this=0x2b38b4c34a38, event=<optimized out>, data=0x2b39221f4c80) at PluginVC.cc:221
#22 0x0000000000774de0 in handleEvent (data=0x2b39221f4c80, event=1, this=<optimized out>) at I_Continuation.h:153
#23 EThread::process_event (this=0x2b38a7e08000, e=0x2b39221f4c80, calling_code=1) at UnixEThread.cc:148
#24 0x0000000000775a9b in EThread::execute (this=0x2b38a7e08000) at UnixEThread.cc:202
#25 0x000000000077488a in spawn_thread_internal (a=0x2b38a5b0f100) at Thread.cc:86
#26 0x00002b38a3852dc5 in start_thread () from /lib64/libpthread.so.0
#27 0x00002b38a48ec76d in clone () from /lib64/libc.so.6

@sudheerv
Copy link
Contributor Author

@shinrich Also, I think the recent rate limiting code that's been added may have made this problem worse as now there are more code paths that can call ua_session->destroy (outside of the normal TXN cleanup)

@maskit
Copy link
Member

maskit commented Aug 27, 2019

I may not understand the condition correctly, but it sounds odd. Because if there is still a transaction alive, total_client_streams_count should not be 0 and this block should not be ran.

Is there any way to reproduce this crash?

@sudheerv
Copy link
Contributor Author

I may not understand the condition correctly, but it sounds odd. Because if there is still a transaction alive, total_client_streams_count should not be 0 and this block should not be ran.

Is there any way to reproduce this crash?

The problem is caused when handling a FIN. When handling a resultant HTTP2_SESSION_EVENT_FINI event, cleanup_streams() and release_stream() are called in that order (the first trying to delete all streams (in the process calling stream::initiating_close() for each stream which basically schedules a VC event to close the TXN) and the second then calls ua_session->destroy()). However, the VC events for the TXN may not actually be processed immediately because a plugin (independently) is blocking the TXN (i.e has not called TxnReenable, for e.g to perform a sideways i/o). When the plugin eventually calls TxnReenable, the pending VC events try to get processed ending up calling ua_session->destroy() (via the terminateIfposibble() etc - see the above traceback).

@sudheerv sudheerv changed the title Make sure ua_session->destroy is called only once on an active session Prevent Http2ConnectionState::destroy() being called more than once per session. Aug 27, 2019
@apache apache deleted a comment from bryancall Aug 27, 2019
@maskit
Copy link
Member

maskit commented Aug 27, 2019

The current change makes more sense than the first one. But it's still odd that destroy() is called twice.

Can you print _history?

@sudheerv
Copy link
Contributor Author

The current change makes more sense than the first one. But it's still odd that destroy() is called twice.

Can you print _history?

We are running an older version unfortunately, and don't have _history merged yet. We are working on upgrading to ATS 9 soon though.

@maskit
Copy link
Member

maskit commented Aug 28, 2019

I’m ok with adding this guard, but I still don’t understand how it is called twice.

Because destroy() should not be called twice, I don’t want to let it return silently without any reports. How about adding ink_assert(!in_destroy) ?

It would be really helpful if you could share an example plugin that causes the second call.

@sudheerv
Copy link
Contributor Author

I’m ok with adding this guard, but I still don’t understand how it is called twice.

Because destroy() should not be called twice, I don’t want to let it return silently without any reports. How about adding ink_assert(!in_destroy) ?

It would be really helpful if you could share an example plugin that causes the second call.

Yeah, I can add an ink_assert or may be a schedule_zombie_event().

The example plugin is basically doing a sideways call using the FetchSM construct (i.e another internal/dummy Txn) without calling TSHttpTxnReenable() at Pre_Remap hook (thus stalling the Txn). The plugin finally calls TSHttpTxnReenable() after the completion of the sideways call.

maskit
maskit previously requested changes Aug 29, 2019
Copy link
Member

@maskit maskit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a change request, but I want to reproduce the second call and dig into it before merging this.
Also I'm not familiar with zombie events, so need to understand how it works.

@maskit
Copy link
Member

maskit commented Aug 30, 2019

Hmm, I tried a few things to reproduce the second call on master and 8.0.x, but I couldn't.

  • Reenable after inactive timeout
  • Reenable after triggering rate limit

The plugin I wrote is like this:

TSHttpTxn transactions[128] = {0};                                               
int x = 0;                                                                       
                                                                                 
static int                                                                       
event_handler(TSCont contp, TSEvent event ATS_UNUSED, void *edata)               
{                                                                                
  TSHttpSsn ssnp;                                                                
  TSHttpTxn txnp;                                                                
                                                                                 
  switch (event) {                                                               
  case TS_EVENT_HTTP_SSN_START:                                                  
    ssnp = (TSHttpSsn) edata;                                                    
    TSHttpSsnHookAdd(ssnp, TS_HTTP_TXN_START_HOOK, contp);                       
    TSHttpSsnReenable(ssnp, TS_EVENT_HTTP_CONTINUE);                             
    break;                                                                       
  case TS_EVENT_HTTP_TXN_START:                                                  
    txnp = (TSHttpTxn) edata;                                                    
    TSHttpTxnHookAdd(txnp, TS_HTTP_PRE_REMAP_HOOK, contp);                       
    TSHttpTxnReenable(txnp, TS_EVENT_HTTP_CONTINUE);                             
    break;                                                                       
  case TS_EVENT_HTTP_PRE_REMAP:                                                  
    txnp = (TSHttpTxn) edata;                                                    
    TSDebug("maskit", "-------------------------------------------------------- PRE_REMAP");
    transactions[x] = txnp;                                                      
    ++x;                                                                         
    /* TSContScheduleOnPool(contp, 150000, TS_THREAD_POOL_TASK); */              
    TSContSchedule(contp, 150000, TS_THREAD_POOL_TASK);                          
    break;                                                                       
  default:                                                                       
    txnp = (TSHttpTxn) edata;                                                    
    TSDebug("maskit", "-------------------------------------------------------- DEFAULT");
    for (int i = 0; i < 128; ++i) {                                              
      if (transactions[i] != 0) {                                                
        TSHttpTxnReenable(transactions[i], TS_EVENT_HTTP_CONTINUE);              
        transactions[i] = 0;                                                     
        break;                                                                   
      }                                                                          
    }                                                                            
    break;                                                                       
  }                                                                              
  return 0;                                                                      
}

@maskit maskit dismissed their stale review September 2, 2019 00:51

Couldn't reproduce the issue.

@maskit
Copy link
Member

maskit commented Sep 2, 2019

I'm -0 on this. The change seems harmless, but it would hide the real issue (the cause of double destroy).

Also, zombie_event seems to be used for a different purpose. It was introduced to detect a kind of leak, but not double free, IIUC. It would confuse us because we would have no idea which one caused the event. Any comments? @shinrich

@shinrich
Copy link
Member

shinrich commented Sep 3, 2019

I am also lukewarm on the change. The zombie event, should track that the object we aren't destroying does get cleaned up within the zombie deadline, so I think the use here is ok.

Since the error was seen on an older version can cannot be reproduced on master or 8, I'd rather hold off on this change in case it is unnecessary due to some of the other numerous changes in the HTTP/2 area.

@sudheerv
Copy link
Contributor Author

sudheerv commented Sep 26, 2019

@maskit @shinrich

This happened again even with the additional guard. Found 2 different kinds of trace backs (on two different boxes) both crashing in Http2ConnectionState::destroy()

{code}
(gdb) bt
#0  0x00002ae162a280a0 in ?? ()
#1  0x0000000000616064 in destroy (this=0x2ae124cd5200) at Http2ConnectionState.h:169
#2  Http2ClientSession::free (this=<optimized out>) at Http2ClientSession.cc:142
#3  0x0000000000613c2e in Http2ClientSession::main_event_handler (this=0x2ae124cd4f80, event=<optimized out>, edata=0x2ae1a0c72470) at Http2ClientSession.cc:382
#4  0x00000000007698a0 in handleEvent (data=0x2ae1a0c72470, event=105, this=0x2ae124cd4f80) at ../../iocore/eventsystem/I_Continuation.h:153
#5  read_signal_and_update (vc=0x2ae1a0c72350, event=105) at UnixNetVConnection.cc:148
#6  UnixNetVConnection::mainEvent (this=0x2ae1a0c72350, event=<optimized out>, e=<optimized out>) at UnixNetVConnection.cc:1203
#7  0x0000000000760b33 in handleEvent (data=0x2ae0d8fbdc20, event=1, this=0x2ae1a0c72350) at ../../iocore/eventsystem/I_Continuation.h:153
#8  InactivityCop::check_inactivity (this=0x2ae0d830db80, event=<optimized out>, e=0x2ae0d8fbdc20) at UnixNet.cc:102
#9  0x000000000078d5ad in handleEvent (data=0x2ae0d8fbdc20, event=2, this=<optimized out>) at I_Continuation.h:153
#10 EThread::process_event (this=0x2ae0daa08000, e=0x2ae0d8fbdc20, calling_code=2) at UnixEThread.cc:152
#11 0x000000000078e281 in EThread::execute (this=0x2ae0daa08000) at UnixEThread.cc:234
#12 0x000000000078d04a in spawn_thread_internal (a=0x2ae0d830f280) at Thread.cc:86
#13 0x00002ae0d5dc5dc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00002ae0d6e5f76d in clone () from /lib64/libc.so.6
{code}
{code}
(gdb) bt
#0  0x00002b6525eda6a0 in ?? ()
#1  0x0000000000616064 in destroy (this=0x2b64d3753f00) at Http2ConnectionState.h:169
#2  Http2ClientSession::free (this=<optimized out>) at Http2ClientSession.cc:142
#3  0x00000000004f3f11 in handle_api_return (event=<optimized out>, this=<optimized out>) at ProxyClientSession.cc:201
#4  ProxyClientSession::state_api_callout (this=0x2b64d3753c80, event=<optimized out>, data=<optimized out>) at ProxyClientSession.cc:153
#5  0x000000000061ae72 in Http2ConnectionState::release_stream (this=this@entry=0x2b64d3753f00, stream=stream@entry=0x2b651a9d4f90) at Http2ConnectionState.cc:1152
#6  0x00000000006075f9 in Http2Stream::destroy (this=0x2b651a9d4f90) at Http2Stream.cc:677
#7  0x0000000000607f35 in Http2Stream::terminate_if_possible (this=this@entry=0x2b651a9d4f90) at Http2Stream.cc:372
#8  0x0000000000608072 in Http2Stream::transaction_done (this=0x2b651a9d4f90) at Http2Stream.cc:361
#9  0x00000000005bb4be in HttpSM::kill_this (this=this@entry=0x2b64de8b06c0) at HttpSM.cc:7057
#10 0x00000000005bc3bd in HttpSM::state_api_callback (this=this@entry=0x2b64de8b06c0, event=event@entry=60000, data=data@entry=0x0) at HttpSM.cc:1519
#11 0x00000000004d9379 in TSHttpTxnReenable (txnp=0x2b64de8b06c0, event=TS_EVENT_HTTP_CONTINUE) at InkAPI.cc:5711
#12 0x00002b64a86d6943 in MemberAssignmentRequestContext::process (this=0x2b64da0bf500, url="stickyrouting/stickyrouting/memberBucketAssignments?action=fetch", 
    result=AtsPluginUtils::FetchResultProcessor::SUCCESS, status=TS_HTTP_STATUS_OK, 
    body=0x2b64908a0fb9 "{\"value\":{\"lastIssued\":1569455728,\"expires\":1569529305,\"userGeneration\":105,\"coloChangedTime\":1569442905,\"currentBucket\":\"TB83\",\"bucketGeneration\":2000}}", bodyLen=153) at MemberAssignmentRequestContext.cc:89
#13 0x00002b649927809b in handleFetchEvents (cont=0x2b65f70b09a0, event=10000, edata=0x2b6488420800) at Utils.cc:172
#14 0x00000000004c4323 in INKContInternal::handle_event (this=0x2b65f70b09a0, event=10000, edata=0x2b6488420800) at InkAPI.cc:1012
#15 0x00000000004af96f in handleEvent (data=0x2b6488420800, event=10000, this=<optimized out>) at ../iocore/eventsystem/I_Continuation.h:153
#16 FetchSM::InvokePlugin (this=0x2b6488420800, event=10000, data=0x2b6488420800) at FetchSM.cc:104
#17 0x00000000004b1445 in FetchSM::process_fetch_read (this=this@entry=0x2b6488420800, event=event@entry=104) at FetchSM.cc:479
#18 0x00000000004b1743 in FetchSM::fetch_handler (this=0x2b6488420800, event=104, edata=0x2b6490862548) at FetchSM.cc:525
#19 0x00000000004f0737 in handleEvent (data=0x2b6490862548, event=104, this=<optimized out>) at ../iocore/eventsystem/I_Continuation.h:153
#20 PluginVC::process_read_side (this=this@entry=0x2b6490862438, other_side_call=other_side_call@entry=false) at PluginVC.cc:627
#21 0x00000000004f109a in PluginVC::main_handler (this=0x2b6490862438, event=<optimized out>, data=0x2b6520dd6ba0) at PluginVC.cc:221
#22 0x000000000078d5ad in handleEvent (data=0x2b6520dd6ba0, event=1, this=<optimized out>) at I_Continuation.h:153
#23 EThread::process_event (this=0x2b647d406000, e=0x2b6520dd6ba0, calling_code=1) at UnixEThread.cc:152
#24 0x000000000078e26b in EThread::execute (this=0x2b647d406000) at UnixEThread.cc:206
#25 0x000000000078d04a in spawn_thread_internal (a=0x2b647ab0f220) at Thread.cc:86
#26 0x00002b6478838dc5 in start_thread () from /lib64/libpthread.so.0
#27 0x00002b64798d276d in clone () from /lib64/libc.so.6
{code}

It seems like there's two code paths that can call Http2ClientSession::free() and they could cause a use-after-free symptom.

The crash itself seems to happen because Http2ConnectionState::shutdown_cont_event is garbage. I'm going to try and set shutdown_cont_event = null after calling cancel() inside Http2ConnectionState::destroy() to make sure it's not holding onto garbage reference.

Let me know what you think.

Some additional data. shutdown_cont_event seems to show a different thread to that of the thread in stack and it's mutex is either null or doesn't seem to be from the current thread in stack.

 (gdb) p this->shutdown_cont_event
$6 = (Event *) 0x2ae1d0e8fa40
(gdb) p *this->shutdown_cont_event
$7 = {<Action> = {_vptr.Action = 0x2ae128f1fd60, continuation = 0x2ae206cb30c0, mutex = {m_ptr = 0x0}, cancelled = 1}, ethread = 0x2ae0d8d07000, in_the_prot_queue = 1, 
  in_the_priority_queue = 0, immediate = 1, globally_allocated = 1, in_heap = 0, callback_event = 100, timeout_at = 0, period = 0, cookie = 0x2ae206cb33f0, link = {<SLink<Event>> = {
      next = 0x0}, prev = 0x0}}
(gdb) p *(ProxyMutex*)0x2ae1693e7fa0
$8 = {<RefCountObj> = {<ForceVFPTToTop> = {_vptr.ForceVFPTToTop = 0x791430 <vtable for ProxyMutex+16>}, m_refcount = 8}, the_mutex = {__data = {__lock = 1, __count = 0, __owner = 25056, 
      __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
    __size = "\001\000\000\000\000\000\000\000\340a\000\000\001", '\000' <repeats 26 times>, __align = 1}, thread_holding = 0x2ae0daa08000, nthread_holding = 3}
(gdb) p this->shutdown_cont_event
$6 = (Event *) 0x2b65a05c9c60
(gdb) p *this->shutdown_cont_event
$7 = {<Action> = {_vptr.Action = 0x7e8310 <vtable for Event+16>, continuation = 0x2b64e313c300, mutex = {m_ptr = 0x2b6563fe61c0}, cancelled = 0}, ethread = 0x2b647e608000, 
  in_the_prot_queue = 0, in_the_priority_queue = 1, immediate = 0, globally_allocated = 1, in_heap = 9, callback_event = 2256, timeout_at = 1569455730555762263, period = 0, cookie = 0x0, 
  link = {<SLink<Event>> = {next = 0x2b65a05c9060}, prev = 0x2b65102fef80}}
(gdb) p *(ProxyMutex*)0x2b64f05a3730
$8 = {<RefCountObj> = {<ForceVFPTToTop> = {_vptr.ForceVFPTToTop = 0x791430 <vtable for ProxyMutex+16>}, m_refcount = 16}, the_mutex = {__data = {__lock = 1, __count = 0, __owner = 9870, 
      __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
    __size = "\001\000\000\000\000\000\000\000\216&\000\000\001", '\000' <repeats 26 times>, __align = 1}, thread_holding = 0x2b647d406000, nthread_holding = 6}
(gdb) p *(ProxyMutex*)0x2b6563fe61c0
$9 = {<RefCountObj> = {<ForceVFPTToTop> = {_vptr.ForceVFPTToTop = 0x791430 <vtable for ProxyMutex+16>}, m_refcount = 18}, the_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, 
      __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, thread_holding = 0x0, 
  nthread_holding = 0}

I'm wondering if it's because HTTP2_SESSION_EVENT_SHUTDOWN_INIT is going through the event loop and therefore can cause a race condition in handling shutdown_cont_event

@masaori335
Copy link
Contributor

The crash itself seems to happen because Http2ConnectionState::shutdown_cont_event is garbage. I'm going to try and set shutdown_cont_event = null after calling cancel() inside Http2ConnectionState::destroy() to make sure it's not holding onto garbage reference.

+1. it should be set nullptr after canceled.

@maskit
Copy link
Member

maskit commented Sep 27, 2019

I'm wondering if it's because HTTP2_SESSION_EVENT_SHUTDOWN_INIT is going through the event loop and therefore can cause a race condition in handling shutdown_cont_event

HTTP2_SESSION_EVENT_SHUTDOWN_INIT is not going through the event loop. It's processed directly via send_connection_event.

On the first stack trace, if destroy is already called and it's the second call, triggering VC_EVENT_INACTIVITY_TIMEOUT (105) against the session doesn't make sense. I think we need to make sure that a session don't get INACTIVITY_TIMEOUT after destroying it.

@sudheerv
Copy link
Contributor Author

I'm wondering if it's because HTTP2_SESSION_EVENT_SHUTDOWN_INIT is going through the event loop and therefore can cause a race condition in handling shutdown_cont_event

HTTP2_SESSION_EVENT_SHUTDOWN_INIT is not going through the event loop. It's processed directly via send_connection_event.

Er, I meant the handling that follows HTTP2_SESSION_EVENT_SHUTDOWN_INIT (not that event itself).

See below.

https://github.com/apache/trafficserver/blob/master/proxy/http2/Http2ConnectionState.cc#L1056

On the first stack trace, if destroy is already called and it's the second call, triggering VC_EVENT_INACTIVITY_TIMEOUT (105) against the session doesn't make sense. I think we need to make sure that a session don't get INACTIVITY_TIMEOUT after destroying it.

I suspect the problem is because of the scheduling of a timer to process the graceful shutdown (see above). I'm going to try and run an ASAN build to see if we can confirm this.

double delete (similar to Http2ClientSession)

Add zombie_event when destroy() is called more than once.
As it seems like there could be a race condition because the handling of
HTTP2_SESSION_EVENT_SHUTDOWN_INIT is going through the event loop again
and therefore may come in after the Session and its
associated contexts are already cleaned up thus causing an use-after-free possibility
@sudheerv
Copy link
Contributor Author

sudheerv commented Oct 2, 2019

I'm wondering if it's because HTTP2_SESSION_EVENT_SHUTDOWN_INIT is going through the event loop and therefore can cause a race condition in handling shutdown_cont_event

HTTP2_SESSION_EVENT_SHUTDOWN_INIT is not going through the event loop. It's processed directly via send_connection_event.

Er, I meant the handling that follows HTTP2_SESSION_EVENT_SHUTDOWN_INIT (not that event itself).

See below.

https://github.com/apache/trafficserver/blob/master/proxy/http2/Http2ConnectionState.cc#L1056

On the first stack trace, if destroy is already called and it's the second call, triggering VC_EVENT_INACTIVITY_TIMEOUT (105) against the session doesn't make sense. I think we need to make sure that a session don't get INACTIVITY_TIMEOUT after destroying it.

I suspect the problem is because of the scheduling of a timer to process the graceful shutdown (see above). I'm going to try and run an ASAN build to see if we can confirm this.

Following up on this -

I ran an ASAN build on one of our prod boxes for hours and couldn't catch this issue. But, based on the core dump analysis and the code that schedules a delay back to the event loop when handling a shutdown event, I'm fairly positive that this can cause a race condition with potential use-after-free depending on what plugins you may have.

IMO, the fix to reset shutdown_cont_event to nullptr seems reasonable to prevent the use-after-free and as a protection.

@maskit
Copy link
Member

maskit commented Oct 3, 2019

I'm fine with setting nullptr to shutdown_cont_event. It wouldn't break anything. However, I still think the root cause is destroy is called multiple times somehow. Your instance may work fine, but we'll lose a hint of the misbehavior (double destroy).

@sudheerv
Copy link
Contributor Author

sudheerv commented Oct 3, 2019

I'm fine with setting nullptr to shutdown_cont_event. It wouldn't break anything. However, I still think the root cause is destroy is called multiple times somehow. Your instance may work fine, but we'll lose a hint of the misbehavior (double destroy).

Like I just explained and showed the stack traces, destroy() is indeed being called twice (one via inactivity timeout and the second via a unblocked stalled transaction) for the same session. It seems to be because of the logic to schedule an event with a delay when trying to gracefully close the stream. I don’t think you can prevent the two code paths from triggering in parallel because of the delay window, but may be shutdown_initiated state can be used to block the second cleanup sequence. Either way, it’s a defensive fix.

@zwoop
Copy link
Contributor

zwoop commented Oct 17, 2019

@maskit @sudheerv What do we do with this ?

@maskit
Copy link
Member

maskit commented Oct 17, 2019

The PR subject should be rephrased at least because destroy() still can be called multiple times. Then I'm ok with landing this.

@sudheerv
Copy link
Contributor Author

@zwoop We are running this patch in our prod and haven't seen the same issues since. I can change the title of the PR to say "Protect against destroy() being called more than once per Http2Session".

@sudheerv sudheerv changed the title Prevent Http2ConnectionState::destroy() being called more than once per session. Add protection against use after free due to Http2ConnectionState::destroy() being called more than once per session. Nov 12, 2019
@calavera calavera requested a review from maskit December 10, 2019 20:18
@zwoop
Copy link
Contributor

zwoop commented Jan 6, 2020

Yeh, sounds like we should merge this.

@zwoop
Copy link
Contributor

zwoop commented Jan 6, 2020

Also, is this an 8.0.x candidate ?

@sudheerv sudheerv merged commit b0685db into apache:master Jan 6, 2020
@zwoop
Copy link
Contributor

zwoop commented Jan 6, 2020

Cherry-picked to v9.0.x branch.

@zwoop zwoop modified the milestones: 10.0.0, 9.0.0 Jan 6, 2020
@zwoop zwoop modified the milestones: 9.0.0, 8.1.0 Mar 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants