-
Notifications
You must be signed in to change notification settings - Fork 844
Add protection against use after free due to Http2ConnectionState::destroy() being called more than once per session. #5878
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Interesting. Do you see that as a crash case? |
Yeah, it does cause a crash. Here's the traceback. I'm running the fix in prod to ensure there are no leaks or side-effects. It's been okay so far (about 8 hrs), will let it run for more than 24hours to be absolutely sure. |
|
@shinrich Also, I think the recent rate limiting code that's been added may have made this problem worse as now there are more code paths that can call ua_session->destroy (outside of the normal TXN cleanup) |
|
I may not understand the condition correctly, but it sounds odd. Because if there is still a transaction alive, Is there any way to reproduce this crash? |
The problem is caused when handling a FIN. When handling a resultant |
d59fbf6 to
a8b2269
Compare
|
The current change makes more sense than the first one. But it's still odd that destroy() is called twice. Can you print |
We are running an older version unfortunately, and don't have |
|
I’m ok with adding this guard, but I still don’t understand how it is called twice. Because destroy() should not be called twice, I don’t want to let it return silently without any reports. How about adding ink_assert(!in_destroy) ? It would be really helpful if you could share an example plugin that causes the second call. |
Yeah, I can add an ink_assert or may be a schedule_zombie_event(). The example plugin is basically doing a sideways call using the FetchSM construct (i.e another internal/dummy Txn) without calling TSHttpTxnReenable() at Pre_Remap hook (thus stalling the Txn). The plugin finally calls TSHttpTxnReenable() after the completion of the sideways call. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a change request, but I want to reproduce the second call and dig into it before merging this.
Also I'm not familiar with zombie events, so need to understand how it works.
|
Hmm, I tried a few things to reproduce the second call on master and 8.0.x, but I couldn't.
The plugin I wrote is like this: TSHttpTxn transactions[128] = {0};
int x = 0;
static int
event_handler(TSCont contp, TSEvent event ATS_UNUSED, void *edata)
{
TSHttpSsn ssnp;
TSHttpTxn txnp;
switch (event) {
case TS_EVENT_HTTP_SSN_START:
ssnp = (TSHttpSsn) edata;
TSHttpSsnHookAdd(ssnp, TS_HTTP_TXN_START_HOOK, contp);
TSHttpSsnReenable(ssnp, TS_EVENT_HTTP_CONTINUE);
break;
case TS_EVENT_HTTP_TXN_START:
txnp = (TSHttpTxn) edata;
TSHttpTxnHookAdd(txnp, TS_HTTP_PRE_REMAP_HOOK, contp);
TSHttpTxnReenable(txnp, TS_EVENT_HTTP_CONTINUE);
break;
case TS_EVENT_HTTP_PRE_REMAP:
txnp = (TSHttpTxn) edata;
TSDebug("maskit", "-------------------------------------------------------- PRE_REMAP");
transactions[x] = txnp;
++x;
/* TSContScheduleOnPool(contp, 150000, TS_THREAD_POOL_TASK); */
TSContSchedule(contp, 150000, TS_THREAD_POOL_TASK);
break;
default:
txnp = (TSHttpTxn) edata;
TSDebug("maskit", "-------------------------------------------------------- DEFAULT");
for (int i = 0; i < 128; ++i) {
if (transactions[i] != 0) {
TSHttpTxnReenable(transactions[i], TS_EVENT_HTTP_CONTINUE);
transactions[i] = 0;
break;
}
}
break;
}
return 0;
} |
|
I'm -0 on this. The change seems harmless, but it would hide the real issue (the cause of double destroy). Also, zombie_event seems to be used for a different purpose. It was introduced to detect a kind of leak, but not double free, IIUC. It would confuse us because we would have no idea which one caused the event. Any comments? @shinrich |
|
I am also lukewarm on the change. The zombie event, should track that the object we aren't destroying does get cleaned up within the zombie deadline, so I think the use here is ok. Since the error was seen on an older version can cannot be reproduced on master or 8, I'd rather hold off on this change in case it is unnecessary due to some of the other numerous changes in the HTTP/2 area. |
|
This happened again even with the additional guard. Found 2 different kinds of trace backs (on two different boxes) both crashing in It seems like there's two code paths that can call The crash itself seems to happen because Let me know what you think. Some additional data. I'm wondering if it's because |
+1. it should be set nullptr after canceled. |
HTTP2_SESSION_EVENT_SHUTDOWN_INIT is not going through the event loop. It's processed directly via On the first stack trace, if |
Er, I meant the handling that follows HTTP2_SESSION_EVENT_SHUTDOWN_INIT (not that event itself). See below.
I suspect the problem is because of the scheduling of a timer to process the graceful shutdown (see above). I'm going to try and run an ASAN build to see if we can confirm this. |
double delete (similar to Http2ClientSession) Add zombie_event when destroy() is called more than once.
As it seems like there could be a race condition because the handling of HTTP2_SESSION_EVENT_SHUTDOWN_INIT is going through the event loop again and therefore may come in after the Session and its associated contexts are already cleaned up thus causing an use-after-free possibility
5b0e7cf to
8884fa8
Compare
Following up on this - I ran an ASAN build on one of our prod boxes for hours and couldn't catch this issue. But, based on the core dump analysis and the code that schedules a delay back to the event loop when handling a shutdown event, I'm fairly positive that this can cause a race condition with potential use-after-free depending on what plugins you may have. IMO, the fix to reset |
|
I'm fine with setting nullptr to |
Like I just explained and showed the stack traces, destroy() is indeed being called twice (one via inactivity timeout and the second via a unblocked stalled transaction) for the same session. It seems to be because of the logic to schedule an event with a delay when trying to gracefully close the stream. I don’t think you can prevent the two code paths from triggering in parallel because of the delay window, but may be shutdown_initiated state can be used to block the second cleanup sequence. Either way, it’s a defensive fix. |
|
The PR subject should be rephrased at least because destroy() still can be called multiple times. Then I'm ok with landing this. |
|
@zwoop We are running this patch in our prod and haven't seen the same issues since. I can change the title of the PR to say "Protect against destroy() being called more than once per Http2Session". |
|
Yeh, sounds like we should merge this. |
|
Also, is this an 8.0.x candidate ? |
|
Cherry-picked to v9.0.x branch. |
Replacing the fix with a more defensive fix to protect against double delete when Http2ConnectionState::destroy() is called more than once (similar to the technique used in Http2ClientSession::destroy() via a
in_destroystate). (as it looks like preventingua_session->destroy()from being called multiple times will result in leaking the io buffers). Also resetting the shutdown_cont_event