Ensure that continuation lock is held before calling handler. #4019
67d656a to
ed5c403
Compare
|
This looks like a pretty scary and risky change. I understand that continuations's |
|
Here is an example that is controlling the behavior (MUTEX_TRY_LOCK vs SCOPED_MUTEX_LOCK). trafficserver/proxy/http2/Http2Stream.cc Lines 104 to 127 in 52738fb
|
|
I agree that it is a fundamental change, but calling handlers on unlocked HttpSMs is also scary. In the example you gave, the mutex is already held, so the change in Continuation::handleEvent will not affect that particular case. I'll review the code for other cases, but if the assertion is that handleEvent can only be called directly when the mutex is held, then the changes in Continuation::handleEvent should be no-ops. |
ed5c403 to
e1f8bbd
Compare
|
Thought about this some more. The case where I had a problem was when the event processing loop was calling handleEvent. There were cases where the continuation was being called without the lock. I took @maskit's suggestion and made a new method for that case, lockAndHandleEvent. So this change should be less invasive, only addressing the event processing handleEvent case. |
|
It seems much better than the original change, but I'm still not so sure. My point is event processing order. lockAndHandleEvent sounds like handleEvent with built-in lock acquisition, but actually it can just schedule the event and return. It will avoid crashes, but it will also allow ATS to keep running strangely because of unexpected event processing order. This is why I suggested the name. If we just want to ensure that a caller holds a lock for the continuation, we should probably use SCOPED_MUTEX_LOCK instead in lockAndHandleEvent. If we also want a function that TRIES to acquire a lock, that should be named something like handleOrScheduleEvent, I think. |
|
If in fact other threads are doing something on the continuation, doing a blocking lock seems even more dangerous. The Try and reschedule seems like a safer approach, particularly if we are concentrating on the event loop processing case. |
|
Thinking about that some more and chatting with Alan, I think my original logic was correct. In the event processing logic the event mutex is locked before calling the event handler, and the event lock and the continuation lock are supposed to be the same. I think the issue I was seeing was a handleEvent called directly from the netvc (readSignalAndUpdate). The serverVC and HttpSM in the case of session reuse will not have the same mutex. I am in the midst of something right now, but I will put back my original version. |
e1f8bbd to
f71ea1e
Compare
|
I pushed the original back. If the caller is holding the continuation lock, then there is no change in functionality. If the caller is not holding the continuation lock (and the continuation has a mutex), then handleEvent may reschedule and return immediately. This is a change in behavior, true. But in the original behavior the event handler is being called on the continuation without a lock while some other thread may be holding the lock. This original behavior seems even worse than a potential race condition. |
True.
On the case you were seeing, that is probably true. But again, handleEvent has been ensuring that an event is completely processed within the function call, even if it was accidentally called without a lock. It will not be postponed. So lines following handleEvent can assume that the event handler has already processed the event. Places that require this behavior must do a blocking lock before calling handleEvent; similarly, places that don't require this behavior must do a try lock before calling handleEvent. This is the rule, right? Why don't you put a try-lock outside the handleEvent that causes the issue, then? Guaranteeing that a lock is held by acquiring it inside handleEvent sounds like a good idea, but if we do this, I think we need to provide both ways (block or postpone) so that we can choose the appropriate one based on what we need at each place. |
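To make the ordering concern concrete, here is a minimal single-threaded sketch of the "try lock, else reschedule" pattern. Everything in it is hypothetical: `ToyMutex`, `ToyContinuation`, `ToyEventQueue`, and the explicit integer thread ids are stand-ins invented for illustration, not the ATS `ProxyMutex`, `Continuation`, or `EThread` types. It only shows that a caller can no longer assume the handler has run by the time the call returns.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Toy recursive lock keyed by an explicit thread id (hypothetical, not ATS's ProxyMutex).
struct ToyMutex {
  int owner = -1; // -1 means unlocked
  int depth = 0;  // recursion count for the owning thread

  bool try_lock(int tid) {
    if (owner != -1 && owner != tid) {
      return false; // held by a different thread
    }
    owner = tid;
    ++depth;
    return true;
  }

  void unlock(int tid) {
    assert(owner == tid && depth > 0);
    if (--depth == 0) {
      owner = -1;
    }
  }
};

// Toy continuation that just counts how many events its handler processed.
struct ToyContinuation {
  ToyMutex mutex;
  int handled = 0;
  void handleEvent(int /*event*/) { ++handled; }
};

// Stand-in for the event loop: a queue of deferred calls.
struct ToyEventQueue {
  std::deque<std::function<void()>> pending;
  void schedule_imm(std::function<void()> fn) { pending.push_back(std::move(fn)); }
};

// Returns true if the handler ran synchronously, false if the event was deferred.
bool handleOrScheduleEvent(ToyContinuation &c, ToyEventQueue &q, int tid, int event) {
  if (!c.mutex.try_lock(tid)) {
    q.schedule_imm([&c, &q, tid, event] { handleOrScheduleEvent(c, q, tid, event); });
    return false;
  }
  c.handleEvent(event);
  c.mutex.unlock(tid);
  return true;
}

// "Thread 2" holds the lock, so the call from "thread 1" is deferred.
int handled_after_contended_call() {
  ToyContinuation c;
  ToyEventQueue q;
  c.mutex.try_lock(2);
  handleOrScheduleEvent(c, q, 1, 42);
  return c.handled; // code after the call sees the event as not yet processed
}

// Once the lock is released and the queue is drained, the event is processed.
int handled_after_release_and_drain() {
  ToyContinuation c;
  ToyEventQueue q;
  c.mutex.try_lock(2);
  handleOrScheduleEvent(c, q, 1, 42);
  c.mutex.unlock(2);
  while (!q.pending.empty()) {
    auto fn = std::move(q.pending.front());
    q.pending.pop_front();
    fn();
  }
  return c.handled;
}
```

In this toy model the contended call returns with `handled` still at 0, which is exactly the situation in which code written against the old synchronous expectation would misbehave.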
|
If you have a mutex associated with the continuation, there is no case where you should be calling handleEvent without the lock. I can hunt down the cases that aren't grabbing the lock and do a try lock there, but in that case I would want to put an ink_release_assert in handleEvent. It is very risky and vulnerable to race conditions to be calling the event handler on a continuation without holding its lock. |
|
It seems to me we have agreement that calling @shinrich's patch proposes that @maskit points out that this patch will change Hopefully I have understood the discussion correctly :) Given that we agree the current code is broken, then, writing from the perspective of one who has not exhaustively searched the impacted code, I am inclined to agree with @maskit here that changing methods from sync to async seems to break a pretty fundamental and straightforward expectation the callers have had up to now on If we are going to change
Alternatives so far:
|
|
@d2r Thank you for clarifying my opinion. That’s exactly what I wanted to say.
Hmm, yeah, it’s acceptable if all cases can handle the new async behavior. But if we do so, places that require sync behavior will try to acquire a lock twice (outside handleEvent and inside handleEvent). It’s a bit redundant. I’m on vacation. Don’t expect replies from me for the next 2 weeks. |
|
I think I'd go with option 3 with the name "dispatchEvent", which does the try lock and either calls the handler directly with the lock held or dispatches an event if not. I think putting in the |
|
Also, acquiring a lock twice is very common in the core - that's precisely why the locks are recursive locks. The second acquisition is very fast because the code checks if the lock is already held by the thread and if so just bumps a counter. |
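As a rough illustration of why the double acquisition is harmless, here is a small sketch using the standard `std::recursive_mutex` rather than the ATS mutex type. The `Counter` struct and its method names are made up for this example; the point is only that a thread already holding a recursive lock can take it again without deadlocking.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical object guarded by a recursive lock.
struct Counter {
  std::recursive_mutex mutex;
  int value = 0;

  // Acquires the lock itself, like a handler that locks internally.
  void handle() {
    std::lock_guard<std::recursive_mutex> lock(mutex);
    ++value;
  }

  // Caller that already holds the lock when it invokes handle().
  // The inner acquisition succeeds because the same thread owns the lock.
  void lock_then_handle() {
    std::lock_guard<std::recursive_mutex> lock(mutex);
    handle();
  }
};

int run_nested_locking() {
  Counter c;
  c.lock_then_handle(); // lock taken twice by the same thread
  c.handle();           // lock taken once
  return c.value;
}
```

With a non-recursive `std::mutex` the nested call in `lock_then_handle` would be undefined behavior, which is why the redundancy is only acceptable when the lock is recursive.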
f71ea1e to
5545d59
Compare
|
I am good with the consensus. I've updated the PR again to create a dispatchEvent which attempts to get the lock and reschedules if it cannot. And it adds an ink_release_assert if there is a lock and it is not held by the current thread on entry to handleEvent. |
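The entry check described above can be sketched roughly as follows. This is a toy model under stated assumptions: `ToyLock`, `ToyContinuation`, and the explicit integer thread ids are hypothetical, and the sketch throws `std::logic_error` where the real change uses `ink_release_assert` (which aborts), purely so the behavior can be observed without terminating the process.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical lock that only records which "thread" currently owns it.
struct ToyLock {
  int owner = -1; // -1 means unlocked
  bool held_by(int tid) const { return owner == tid; }
};

// Hypothetical continuation; mutex may be null, like a continuation without a lock.
struct ToyContinuation {
  ToyLock *mutex = nullptr;
  int handled = 0;

  int handleEvent(int tid, int event) {
    // Stand-in for the release assert: reject callers that do not hold the lock.
    if (mutex != nullptr && !mutex->held_by(tid)) {
      throw std::logic_error("handleEvent called without the continuation lock");
    }
    ++handled;
    return event;
  }
};

// A call from thread 1 while nobody holds the lock is rejected.
bool unlocked_call_is_rejected() {
  ToyLock lock;
  ToyContinuation c;
  c.mutex = &lock;
  try {
    c.handleEvent(1, 0);
  } catch (const std::logic_error &) {
    return true;
  }
  return false;
}

// A call from thread 1 while thread 1 holds the lock runs the handler.
bool locked_call_runs() {
  ToyLock lock;
  lock.owner = 1;
  ToyContinuation c;
  c.mutex = &lock;
  c.handleEvent(1, 0);
  return c.handled == 1;
}

// A continuation with no mutex is never rejected.
bool lockless_continuation_runs() {
  ToyContinuation c;
  c.handleEvent(1, 0);
  return c.handled == 1;
}
```

The design trade-off is that a missed call site now fails loudly at the point of the unlocked call instead of racing silently, at the cost of turning latent bugs into hard failures.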
23596d9 to
d043020
Compare
d043020 to
0e5d006
Compare
|
Updated calls from handleEvent to dispatchEvent in many places in HttpSM that I think were vulnerable to the original issue plus cases of the tests that tweaked the ink_release_assert |
    EThread *t = this_ethread();
    MUTEX_TRY_LOCK(lock, this->mutex, t);
    if (!lock.is_locked()) {
      t->schedule_imm(this, event, data);
An event is created by schedule_imm. Is there any mechanism to guarantee that the Continuation will not be destroyed before the event callback? @shinrich
|
I'm going to remove the 7.1.5 Project on this, since going forward, we will need to make a second PR against the 7.1.x for proposed back ports. |
|
This breaks release builds on macOS, fixed in #4107. |
|
Cherry picked to 8.1.0 |
Noticed this while debugging a plugin using the ASYNC job support. My earlier assumption was that all the VC and HttpSM continuations would have same mutex as the NH handler. So there would be no need to grab the mutex before calling the handler.
For the client vc and the HttpSM, the mutexes are the same and correspond to the nh lock. However, the server vc if it is reused may have a different mutex than the HttpSM. So if the event is being processed from the server vc, the HttpSM will not be locked. So there are cases when HttpSM is being called but not locked. Most activity on the HttpSM will be from the same thread, so this shouldn't be too bad, but there are cases when other worker threads, etc work on the HttpSM.
During my engine development there were some very odd timings which exposed some crashes due to unlocked HttpSM.