Add protection against use after free due to Http2ConnectionState::destroy() being called more than once per session. #5878

sudheerv · 2019-08-26T17:13:42Z

Replacing the fix with a more defensive fix to protect against double delete when Http2ConnectionState::destroy() is called more than once (similar to the technique used in Http2ClientSession::destroy() via a in_destroy state). (as it looks like preventing ua_session->destroy() from being called multiple times will result in leaking the io buffers). Also resetting the shutdown_cont_event

shinrich · 2019-08-26T17:24:59Z

Interesting. Do you see that as a crash case?

sudheerv · 2019-08-26T23:15:18Z

Interesting. Do you see that as a crash case?

Yeah, it does cause a crash. Here's the traceback. I'm running the fix in prod to ensure there are no leaks or side-effects. It's been okay so far (about 8 hrs), will let it run for more than 24hours to be absolutely sure.

(gdb) bt
#0  0x00002b394e3e0de0 in ?? ()
#1  0x0000000000607811 in destroy (this=0x2b390d20a230) at Http2ConnectionState.h:161
#2  Http2ClientSession::free (this=<optimized out>) at Http2ClientSession.cc:137
#3  0x00000000004f16b1 in handle_api_return (event=<optimized out>, this=<optimized out>) at ProxyClientSession.cc:201
#4  ProxyClientSession::state_api_callout (this=0x2b390d209fb0, event=<optimized out>, data=<optimized out>) at ProxyClientSession.cc:153
#5  0x000000000060a6f6 in Http2ConnectionState::release_stream (this=this@entry=0x2b390d20a230, stream=stream@entry=0x2b3933c8d6c0) at Http2ConnectionState.cc:1102
#6  0x00000000005fa3be in Http2Stream::destroy (this=0x2b3933c8d6c0) at Http2Stream.cc:673
#7  0x00000000005facb5 in Http2Stream::terminate_if_possible (this=this@entry=0x2b3933c8d6c0) at Http2Stream.cc:368
#8  0x00000000005fadf2 in Http2Stream::transaction_done (this=0x2b3933c8d6c0) at Http2Stream.cc:357
#9  0x00000000005b5a5e in HttpSM::kill_this (this=this@entry=0x2b393e34ce40) at HttpSM.cc:7046
#10 0x00000000005b686d in HttpSM::state_api_callback (this=this@entry=0x2b393e34ce40, event=event@entry=60000, data=data@entry=0x0) at HttpSM.cc:1519
#11 0x00000000004d7d19 in TSHttpTxnReenable (txnp=0x2b393e34ce40, event=TS_EVENT_HTTP_CONTINUE) at InkAPI.cc:5711
#12 0x00002b38d36c3483 in MemberAssignmentRequestContext::process (this=0x2b38b4c278c0, url="stickyrouting/stickyrouting/memberBucketAssignments?action=fetch", 
    result=AtsPluginUtils::FetchResultProcessor::SUCCESS, status=TS_HTTP_STATUS_OK, 
    body=0x2b3a1f531539 "{\"value\":{\"lastIssued\":1566377666,\"expires\":1566457171,\"userGeneration\":437,\"coloChangedTime\":1565513471,\"currentBucket\":\"OB41\",\"bucketGeneration\":1835}}", 
    bodyLen=153) at MemberAssignmentRequestContext.cc:85
#13 0x00002b38c427809b in handleFetchEvents (cont=0x2b392a3f3cc0, event=10000, edata=0x2b38b4016200) at Utils.cc:172
#14 0x00000000004c2dd3 in INKContInternal::handle_event (this=0x2b392a3f3cc0, event=10000, edata=0x2b38b4016200) at InkAPI.cc:1012
#15 0x00000000004aeadf in handleEvent (data=0x2b38b4016200, event=10000, this=<optimized out>) at ../iocore/eventsystem/I_Continuation.h:153
#16 FetchSM::InvokePlugin (this=0x2b38b4016200, event=10000, data=0x2b38b4016200) at FetchSM.cc:104
#17 0x00000000004b036d in FetchSM::process_fetch_read (this=this@entry=0x2b38b4016200, event=event@entry=104) at FetchSM.cc:479
#18 0x00000000004b072b in FetchSM::fetch_handler (this=0x2b38b4016200, event=104, edata=0x2b38b4c34b48) at FetchSM.cc:525
#19 0x00000000004ebc1f in handleEvent (data=0x2b38b4c34b48, event=104, this=<optimized out>) at ../iocore/eventsystem/I_Continuation.h:153
#20 PluginVC::process_read_side (this=this@entry=0x2b38b4c34a38, other_side_call=other_side_call@entry=false) at PluginVC.cc:627
#21 0x00000000004ee01a in PluginVC::main_handler (this=0x2b38b4c34a38, event=<optimized out>, data=0x2b39221f4c80) at PluginVC.cc:221
#22 0x0000000000774de0 in handleEvent (data=0x2b39221f4c80, event=1, this=<optimized out>) at I_Continuation.h:153
#23 EThread::process_event (this=0x2b38a7e08000, e=0x2b39221f4c80, calling_code=1) at UnixEThread.cc:148
#24 0x0000000000775a9b in EThread::execute (this=0x2b38a7e08000) at UnixEThread.cc:202
#25 0x000000000077488a in spawn_thread_internal (a=0x2b38a5b0f100) at Thread.cc:86
#26 0x00002b38a3852dc5 in start_thread () from /lib64/libpthread.so.0
#27 0x00002b38a48ec76d in clone () from /lib64/libc.so.6

sudheerv · 2019-08-26T23:17:07Z

@shinrich Also, I think the recent rate limiting code that's been added may have made this problem worse as now there are more code paths that can call ua_session->destroy (outside of the normal TXN cleanup)

maskit · 2019-08-27T00:36:12Z

I may not understand the condition correctly, but it sounds odd. Because if there is still a transaction alive, total_client_streams_count should not be 0 and this block should not be ran.

Is there any way to reproduce this crash?

sudheerv · 2019-08-27T00:56:05Z

I may not understand the condition correctly, but it sounds odd. Because if there is still a transaction alive, total_client_streams_count should not be 0 and this block should not be ran.

Is there any way to reproduce this crash?

The problem is caused when handling a FIN. When handling a resultant HTTP2_SESSION_EVENT_FINI event, cleanup_streams() and release_stream() are called in that order (the first trying to delete all streams (in the process calling stream::initiating_close() for each stream which basically schedules a VC event to close the TXN) and the second then calls ua_session->destroy()). However, the VC events for the TXN may not actually be processed immediately because a plugin (independently) is blocking the TXN (i.e has not called TxnReenable, for e.g to perform a sideways i/o). When the plugin eventually calls TxnReenable, the pending VC events try to get processed ending up calling ua_session->destroy() (via the terminateIfposibble() etc - see the above traceback).

maskit · 2019-08-27T05:13:26Z

The current change makes more sense than the first one. But it's still odd that destroy() is called twice.

Can you print _history?

sudheerv · 2019-08-27T13:46:17Z

The current change makes more sense than the first one. But it's still odd that destroy() is called twice.

Can you print _history?

We are running an older version unfortunately, and don't have _history merged yet. We are working on upgrading to ATS 9 soon though.

maskit · 2019-08-28T15:15:43Z

I’m ok with adding this guard, but I still don’t understand how it is called twice.

Because destroy() should not be called twice, I don’t want to let it return silently without any reports. How about adding ink_assert(!in_destroy) ?

It would be really helpful if you could share an example plugin that causes the second call.

sudheerv · 2019-08-28T16:11:41Z

I’m ok with adding this guard, but I still don’t understand how it is called twice.

Because destroy() should not be called twice, I don’t want to let it return silently without any reports. How about adding ink_assert(!in_destroy) ?

It would be really helpful if you could share an example plugin that causes the second call.

Yeah, I can add an ink_assert or may be a schedule_zombie_event().

The example plugin is basically doing a sideways call using the FetchSM construct (i.e another internal/dummy Txn) without calling TSHttpTxnReenable() at Pre_Remap hook (thus stalling the Txn). The plugin finally calls TSHttpTxnReenable() after the completion of the sideways call.

maskit

Not a change request, but I want to reproduce the second call and dig into it before merging this.
Also I'm not familiar with zombie events, so need to understand how it works.

maskit · 2019-08-30T07:18:09Z

Hmm, I tried a few things to reproduce the second call on master and 8.0.x, but I couldn't.

Reenable after inactive timeout
Reenable after triggering rate limit

The plugin I wrote is like this:

TSHttpTxn transactions[128] = {0};                                               
int x = 0;                                                                       
                                                                                 
static int                                                                       
event_handler(TSCont contp, TSEvent event ATS_UNUSED, void *edata)               
{                                                                                
  TSHttpSsn ssnp;                                                                
  TSHttpTxn txnp;                                                                
                                                                                 
  switch (event) {                                                               
  case TS_EVENT_HTTP_SSN_START:                                                  
    ssnp = (TSHttpSsn) edata;                                                    
    TSHttpSsnHookAdd(ssnp, TS_HTTP_TXN_START_HOOK, contp);                       
    TSHttpSsnReenable(ssnp, TS_EVENT_HTTP_CONTINUE);                             
    break;                                                                       
  case TS_EVENT_HTTP_TXN_START:                                                  
    txnp = (TSHttpTxn) edata;                                                    
    TSHttpTxnHookAdd(txnp, TS_HTTP_PRE_REMAP_HOOK, contp);                       
    TSHttpTxnReenable(txnp, TS_EVENT_HTTP_CONTINUE);                             
    break;                                                                       
  case TS_EVENT_HTTP_PRE_REMAP:                                                  
    txnp = (TSHttpTxn) edata;                                                    
    TSDebug("maskit", "-------------------------------------------------------- PRE_REMAP");
    transactions[x] = txnp;                                                      
    ++x;                                                                         
    /* TSContScheduleOnPool(contp, 150000, TS_THREAD_POOL_TASK); */              
    TSContSchedule(contp, 150000, TS_THREAD_POOL_TASK);                          
    break;                                                                       
  default:                                                                       
    txnp = (TSHttpTxn) edata;                                                    
    TSDebug("maskit", "-------------------------------------------------------- DEFAULT");
    for (int i = 0; i < 128; ++i) {                                              
      if (transactions[i] != 0) {                                                
        TSHttpTxnReenable(transactions[i], TS_EVENT_HTTP_CONTINUE);              
        transactions[i] = 0;                                                     
        break;                                                                   
      }                                                                          
    }                                                                            
    break;                                                                       
  }                                                                              
  return 0;                                                                      
}

Couldn't reproduce the issue.

maskit · 2019-09-02T01:10:43Z

I'm -0 on this. The change seems harmless, but it would hide the real issue (the cause of double destroy).

Also, zombie_event seems to be used for a different purpose. It was introduced to detect a kind of leak, but not double free, IIUC. It would confuse us because we would have no idea which one caused the event. Any comments? @shinrich

shinrich · 2019-09-03T14:14:34Z

I am also lukewarm on the change. The zombie event, should track that the object we aren't destroying does get cleaned up within the zombie deadline, so I think the use here is ok.

Since the error was seen on an older version can cannot be reproduced on master or 8, I'd rather hold off on this change in case it is unnecessary due to some of the other numerous changes in the HTTP/2 area.

sudheerv · 2019-09-26T17:00:49Z

@maskit @shinrich

This happened again even with the additional guard. Found 2 different kinds of trace backs (on two different boxes) both crashing in Http2ConnectionState::destroy()

{code}
(gdb) bt
#0  0x00002ae162a280a0 in ?? ()
#1  0x0000000000616064 in destroy (this=0x2ae124cd5200) at Http2ConnectionState.h:169
#2  Http2ClientSession::free (this=<optimized out>) at Http2ClientSession.cc:142
#3  0x0000000000613c2e in Http2ClientSession::main_event_handler (this=0x2ae124cd4f80, event=<optimized out>, edata=0x2ae1a0c72470) at Http2ClientSession.cc:382
#4  0x00000000007698a0 in handleEvent (data=0x2ae1a0c72470, event=105, this=0x2ae124cd4f80) at ../../iocore/eventsystem/I_Continuation.h:153
#5  read_signal_and_update (vc=0x2ae1a0c72350, event=105) at UnixNetVConnection.cc:148
#6  UnixNetVConnection::mainEvent (this=0x2ae1a0c72350, event=<optimized out>, e=<optimized out>) at UnixNetVConnection.cc:1203
#7  0x0000000000760b33 in handleEvent (data=0x2ae0d8fbdc20, event=1, this=0x2ae1a0c72350) at ../../iocore/eventsystem/I_Continuation.h:153
#8  InactivityCop::check_inactivity (this=0x2ae0d830db80, event=<optimized out>, e=0x2ae0d8fbdc20) at UnixNet.cc:102
#9  0x000000000078d5ad in handleEvent (data=0x2ae0d8fbdc20, event=2, this=<optimized out>) at I_Continuation.h:153
#10 EThread::process_event (this=0x2ae0daa08000, e=0x2ae0d8fbdc20, calling_code=2) at UnixEThread.cc:152
#11 0x000000000078e281 in EThread::execute (this=0x2ae0daa08000) at UnixEThread.cc:234
#12 0x000000000078d04a in spawn_thread_internal (a=0x2ae0d830f280) at Thread.cc:86
#13 0x00002ae0d5dc5dc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00002ae0d6e5f76d in clone () from /lib64/libc.so.6
{code}

{code}
(gdb) bt
#0  0x00002b6525eda6a0 in ?? ()
#1  0x0000000000616064 in destroy (this=0x2b64d3753f00) at Http2ConnectionState.h:169
#2  Http2ClientSession::free (this=<optimized out>) at Http2ClientSession.cc:142
#3  0x00000000004f3f11 in handle_api_return (event=<optimized out>, this=<optimized out>) at ProxyClientSession.cc:201
#4  ProxyClientSession::state_api_callout (this=0x2b64d3753c80, event=<optimized out>, data=<optimized out>) at ProxyClientSession.cc:153
#5  0x000000000061ae72 in Http2ConnectionState::release_stream (this=this@entry=0x2b64d3753f00, stream=stream@entry=0x2b651a9d4f90) at Http2ConnectionState.cc:1152
#6  0x00000000006075f9 in Http2Stream::destroy (this=0x2b651a9d4f90) at Http2Stream.cc:677
#7  0x0000000000607f35 in Http2Stream::terminate_if_possible (this=this@entry=0x2b651a9d4f90) at Http2Stream.cc:372
#8  0x0000000000608072 in Http2Stream::transaction_done (this=0x2b651a9d4f90) at Http2Stream.cc:361
#9  0x00000000005bb4be in HttpSM::kill_this (this=this@entry=0x2b64de8b06c0) at HttpSM.cc:7057
#10 0x00000000005bc3bd in HttpSM::state_api_callback (this=this@entry=0x2b64de8b06c0, event=event@entry=60000, data=data@entry=0x0) at HttpSM.cc:1519
#11 0x00000000004d9379 in TSHttpTxnReenable (txnp=0x2b64de8b06c0, event=TS_EVENT_HTTP_CONTINUE) at InkAPI.cc:5711
#12 0x00002b64a86d6943 in MemberAssignmentRequestContext::process (this=0x2b64da0bf500, url="stickyrouting/stickyrouting/memberBucketAssignments?action=fetch", 
    result=AtsPluginUtils::FetchResultProcessor::SUCCESS, status=TS_HTTP_STATUS_OK, 
    body=0x2b64908a0fb9 "{\"value\":{\"lastIssued\":1569455728,\"expires\":1569529305,\"userGeneration\":105,\"coloChangedTime\":1569442905,\"currentBucket\":\"TB83\",\"bucketGeneration\":2000}}", bodyLen=153) at MemberAssignmentRequestContext.cc:89
#13 0x00002b649927809b in handleFetchEvents (cont=0x2b65f70b09a0, event=10000, edata=0x2b6488420800) at Utils.cc:172
#14 0x00000000004c4323 in INKContInternal::handle_event (this=0x2b65f70b09a0, event=10000, edata=0x2b6488420800) at InkAPI.cc:1012
#15 0x00000000004af96f in handleEvent (data=0x2b6488420800, event=10000, this=<optimized out>) at ../iocore/eventsystem/I_Continuation.h:153
#16 FetchSM::InvokePlugin (this=0x2b6488420800, event=10000, data=0x2b6488420800) at FetchSM.cc:104
#17 0x00000000004b1445 in FetchSM::process_fetch_read (this=this@entry=0x2b6488420800, event=event@entry=104) at FetchSM.cc:479
#18 0x00000000004b1743 in FetchSM::fetch_handler (this=0x2b6488420800, event=104, edata=0x2b6490862548) at FetchSM.cc:525
#19 0x00000000004f0737 in handleEvent (data=0x2b6490862548, event=104, this=<optimized out>) at ../iocore/eventsystem/I_Continuation.h:153
#20 PluginVC::process_read_side (this=this@entry=0x2b6490862438, other_side_call=other_side_call@entry=false) at PluginVC.cc:627
#21 0x00000000004f109a in PluginVC::main_handler (this=0x2b6490862438, event=<optimized out>, data=0x2b6520dd6ba0) at PluginVC.cc:221
#22 0x000000000078d5ad in handleEvent (data=0x2b6520dd6ba0, event=1, this=<optimized out>) at I_Continuation.h:153
#23 EThread::process_event (this=0x2b647d406000, e=0x2b6520dd6ba0, calling_code=1) at UnixEThread.cc:152
#24 0x000000000078e26b in EThread::execute (this=0x2b647d406000) at UnixEThread.cc:206
#25 0x000000000078d04a in spawn_thread_internal (a=0x2b647ab0f220) at Thread.cc:86
#26 0x00002b6478838dc5 in start_thread () from /lib64/libpthread.so.0
#27 0x00002b64798d276d in clone () from /lib64/libc.so.6
{code}

It seems like there's two code paths that can call Http2ClientSession::free() and they could cause a use-after-free symptom.

The crash itself seems to happen because Http2ConnectionState::shutdown_cont_event is garbage. I'm going to try and set shutdown_cont_event = null after calling cancel() inside Http2ConnectionState::destroy() to make sure it's not holding onto garbage reference.

Let me know what you think.

Some additional data. shutdown_cont_event seems to show a different thread to that of the thread in stack and it's mutex is either null or doesn't seem to be from the current thread in stack.

 (gdb) p this->shutdown_cont_event
$6 = (Event *) 0x2ae1d0e8fa40
(gdb) p *this->shutdown_cont_event
$7 = {<Action> = {_vptr.Action = 0x2ae128f1fd60, continuation = 0x2ae206cb30c0, mutex = {m_ptr = 0x0}, cancelled = 1}, ethread = 0x2ae0d8d07000, in_the_prot_queue = 1, 
  in_the_priority_queue = 0, immediate = 1, globally_allocated = 1, in_heap = 0, callback_event = 100, timeout_at = 0, period = 0, cookie = 0x2ae206cb33f0, link = {<SLink<Event>> = {
      next = 0x0}, prev = 0x0}}
(gdb) p *(ProxyMutex*)0x2ae1693e7fa0
$8 = {<RefCountObj> = {<ForceVFPTToTop> = {_vptr.ForceVFPTToTop = 0x791430 <vtable for ProxyMutex+16>}, m_refcount = 8}, the_mutex = {__data = {__lock = 1, __count = 0, __owner = 25056, 
      __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
    __size = "\001\000\000\000\000\000\000\000\340a\000\000\001", '\000' <repeats 26 times>, __align = 1}, thread_holding = 0x2ae0daa08000, nthread_holding = 3}

(gdb) p this->shutdown_cont_event
$6 = (Event *) 0x2b65a05c9c60
(gdb) p *this->shutdown_cont_event
$7 = {<Action> = {_vptr.Action = 0x7e8310 <vtable for Event+16>, continuation = 0x2b64e313c300, mutex = {m_ptr = 0x2b6563fe61c0}, cancelled = 0}, ethread = 0x2b647e608000, 
  in_the_prot_queue = 0, in_the_priority_queue = 1, immediate = 0, globally_allocated = 1, in_heap = 9, callback_event = 2256, timeout_at = 1569455730555762263, period = 0, cookie = 0x0, 
  link = {<SLink<Event>> = {next = 0x2b65a05c9060}, prev = 0x2b65102fef80}}
(gdb) p *(ProxyMutex*)0x2b64f05a3730
$8 = {<RefCountObj> = {<ForceVFPTToTop> = {_vptr.ForceVFPTToTop = 0x791430 <vtable for ProxyMutex+16>}, m_refcount = 16}, the_mutex = {__data = {__lock = 1, __count = 0, __owner = 9870, 
      __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
    __size = "\001\000\000\000\000\000\000\000\216&\000\000\001", '\000' <repeats 26 times>, __align = 1}, thread_holding = 0x2b647d406000, nthread_holding = 6}
(gdb) p *(ProxyMutex*)0x2b6563fe61c0
$9 = {<RefCountObj> = {<ForceVFPTToTop> = {_vptr.ForceVFPTToTop = 0x791430 <vtable for ProxyMutex+16>}, m_refcount = 18}, the_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, 
      __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, thread_holding = 0x0, 
  nthread_holding = 0}

I'm wondering if it's because HTTP2_SESSION_EVENT_SHUTDOWN_INIT is going through the event loop and therefore can cause a race condition in handling shutdown_cont_event

masaori335 · 2019-09-26T23:17:15Z

The crash itself seems to happen because Http2ConnectionState::shutdown_cont_event is garbage. I'm going to try and set shutdown_cont_event = null after calling cancel() inside Http2ConnectionState::destroy() to make sure it's not holding onto garbage reference.

+1. it should be set nullptr after canceled.

maskit · 2019-09-27T00:57:39Z

I'm wondering if it's because HTTP2_SESSION_EVENT_SHUTDOWN_INIT is going through the event loop and therefore can cause a race condition in handling shutdown_cont_event

HTTP2_SESSION_EVENT_SHUTDOWN_INIT is not going through the event loop. It's processed directly via send_connection_event.

On the first stack trace, if destroy is already called and it's the second call, triggering VC_EVENT_INACTIVITY_TIMEOUT (105) against the session doesn't make sense. I think we need to make sure that a session don't get INACTIVITY_TIMEOUT after destroying it.

sudheerv · 2019-09-27T17:10:07Z

I'm wondering if it's because HTTP2_SESSION_EVENT_SHUTDOWN_INIT is going through the event loop and therefore can cause a race condition in handling shutdown_cont_event

HTTP2_SESSION_EVENT_SHUTDOWN_INIT is not going through the event loop. It's processed directly via send_connection_event.

Er, I meant the handling that follows HTTP2_SESSION_EVENT_SHUTDOWN_INIT (not that event itself).

See below.

https://github.com/apache/trafficserver/blob/master/proxy/http2/Http2ConnectionState.cc#L1056

On the first stack trace, if destroy is already called and it's the second call, triggering VC_EVENT_INACTIVITY_TIMEOUT (105) against the session doesn't make sense. I think we need to make sure that a session don't get INACTIVITY_TIMEOUT after destroying it.

I suspect the problem is because of the scheduling of a timer to process the graceful shutdown (see above). I'm going to try and run an ASAN build to see if we can confirm this.

double delete (similar to Http2ClientSession) Add zombie_event when destroy() is called more than once.

As it seems like there could be a race condition because the handling of HTTP2_SESSION_EVENT_SHUTDOWN_INIT is going through the event loop again and therefore may come in after the Session and its associated contexts are already cleaned up thus causing an use-after-free possibility

sudheerv · 2019-10-02T16:41:33Z

I'm wondering if it's because HTTP2_SESSION_EVENT_SHUTDOWN_INIT is going through the event loop and therefore can cause a race condition in handling shutdown_cont_event

HTTP2_SESSION_EVENT_SHUTDOWN_INIT is not going through the event loop. It's processed directly via send_connection_event.

Er, I meant the handling that follows HTTP2_SESSION_EVENT_SHUTDOWN_INIT (not that event itself).

See below.

https://github.com/apache/trafficserver/blob/master/proxy/http2/Http2ConnectionState.cc#L1056

On the first stack trace, if destroy is already called and it's the second call, triggering VC_EVENT_INACTIVITY_TIMEOUT (105) against the session doesn't make sense. I think we need to make sure that a session don't get INACTIVITY_TIMEOUT after destroying it.

I suspect the problem is because of the scheduling of a timer to process the graceful shutdown (see above). I'm going to try and run an ASAN build to see if we can confirm this.

Following up on this -

I ran an ASAN build on one of our prod boxes for hours and couldn't catch this issue. But, based on the core dump analysis and the code that schedules a delay back to the event loop when handling a shutdown event, I'm fairly positive that this can cause a race condition with potential use-after-free depending on what plugins you may have.

IMO, the fix to reset shutdown_cont_event to nullptr seems reasonable to prevent the use-after-free and as a protection.

maskit · 2019-10-03T01:10:33Z

I'm fine with setting nullptr to shutdown_cont_event. It wouldn't break anything. However, I still think the root cause is destroy is called multiple times somehow. Your instance may work fine, but we'll lose a hint of the misbehavior (double destroy).

sudheerv · 2019-10-03T09:18:09Z

I'm fine with setting nullptr to shutdown_cont_event. It wouldn't break anything. However, I still think the root cause is destroy is called multiple times somehow. Your instance may work fine, but we'll lose a hint of the misbehavior (double destroy).

Like I just explained and showed the stack traces, destroy() is indeed being called twice (one via inactivity timeout and the second via a unblocked stalled transaction) for the same session. It seems to be because of the logic to schedule an event with a delay when trying to gracefully close the stream. I don’t think you can prevent the two code paths from triggering in parallel because of the delay window, but may be shutdown_initiated state can be used to block the second cleanup sequence. Either way, it’s a defensive fix.

zwoop · 2019-10-17T18:19:08Z

@maskit @sudheerv What do we do with this ?

maskit · 2019-10-17T18:43:02Z

The PR subject should be rephrased at least because destroy() still can be called multiple times. Then I'm ok with landing this.

sudheerv · 2019-11-12T00:33:25Z

@zwoop We are running this patch in our prod and haven't seen the same issues since. I can change the title of the PR to say "Protect against destroy() being called more than once per Http2Session".

zwoop · 2020-01-06T21:41:41Z

Yeh, sounds like we should merge this.

zwoop · 2020-01-06T21:42:09Z

Also, is this an 8.0.x candidate ?

zwoop · 2020-01-06T22:35:16Z

Cherry-picked to v9.0.x branch.

sudheerv assigned maskit and masaori335 Aug 26, 2019

sudheerv added the HTTP/2 label Aug 26, 2019

sudheerv added this to the 10.0.0 milestone Aug 26, 2019

bryancall assigned sudheerv and unassigned maskit and masaori335 Aug 26, 2019

bryancall requested review from masaori335 and maskit August 26, 2019 17:49

sudheerv force-pushed the h2_use_after_free branch from d59fbf6 to a8b2269 Compare August 27, 2019 03:23

sudheerv changed the title ~~Make sure ua_session->destroy is called only once on an active session~~ Prevent Http2ConnectionState::destroy() being called more than once per session. Aug 27, 2019

apache deleted a comment from bryancall Aug 27, 2019

maskit previously requested changes Aug 29, 2019

View reviewed changes

sudheerv added 2 commits September 27, 2019 10:15

Add in_destroy to Http2ConnectionState to prevent

842c5ce

double delete (similar to Http2ClientSession) Add zombie_event when destroy() is called more than once.

sudheerv force-pushed the h2_use_after_free branch from 5b0e7cf to 8884fa8 Compare September 27, 2019 17:18

sudheerv changed the title ~~Prevent Http2ConnectionState::destroy() being called more than once per session.~~ Add protection against use after free due to Http2ConnectionState::destroy() being called more than once per session. Nov 12, 2019

calavera requested a review from maskit December 10, 2019 20:18

maskit approved these changes Dec 11, 2019

View reviewed changes

sudheerv merged commit b0685db into apache:master Jan 6, 2020

zwoop modified the milestones: 10.0.0, 9.0.0 Jan 6, 2020

masaori335 mentioned this pull request Mar 27, 2020

HTTP/2 backports for 8.1.x (part 1) #6581

Merged

zwoop modified the milestones: 9.0.0, 8.1.0 Mar 30, 2020

Add protection against use after free due to Http2ConnectionState::destroy() being called more than once per session. #5878

Add protection against use after free due to Http2ConnectionState::destroy() being called more than once per session. #5878

Uh oh!

Conversation

sudheerv commented Aug 26, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shinrich commented Aug 26, 2019

Uh oh!

sudheerv commented Aug 26, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sudheerv commented Aug 26, 2019

Uh oh!

maskit commented Aug 27, 2019

Uh oh!

sudheerv commented Aug 27, 2019

Uh oh!

maskit commented Aug 27, 2019

Uh oh!

sudheerv commented Aug 27, 2019

Uh oh!

maskit commented Aug 28, 2019

Uh oh!

sudheerv commented Aug 28, 2019

Uh oh!

maskit left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

maskit commented Aug 30, 2019

Uh oh!

maskit commented Sep 2, 2019

Uh oh!

shinrich commented Sep 3, 2019

Uh oh!

sudheerv commented Sep 26, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

masaori335 commented Sep 26, 2019

Uh oh!

maskit commented Sep 27, 2019

Uh oh!

sudheerv commented Sep 27, 2019

Uh oh!

sudheerv commented Oct 2, 2019

Uh oh!

maskit commented Oct 3, 2019

Uh oh!

sudheerv commented Oct 3, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zwoop commented Oct 17, 2019

Uh oh!

maskit commented Oct 17, 2019

Uh oh!

sudheerv commented Nov 12, 2019

Uh oh!

zwoop commented Jan 6, 2020

Uh oh!

zwoop commented Jan 6, 2020

Uh oh!

zwoop commented Jan 6, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

sudheerv commented Aug 26, 2019 •

edited

Loading

sudheerv commented Aug 26, 2019 •

edited

Loading

maskit left a comment •

edited

Loading

sudheerv commented Sep 26, 2019 •

edited

Loading

sudheerv commented Oct 3, 2019 •

edited

Loading