Skip to content

Conversation

@shinrich
Copy link
Member

Noticed this while debugging a plugin using the ASYNC job support. My earlier assumption was that all the VC and HttpSM continuations would have same mutex as the NH handler. So there would be no need to grab the mutex before calling the handler.

For the client vc and the HttpSM, the mutexes are the same and correspond to the nh lock. However, the server vc if it is reused may have a different mutex than the HttpSM. So if the event is being processed from the server vc, the HttpSM will not be locked. So there are cases when HttpSM is being called but not locked. Most activity on the HttpSM will be from the same thread, so this shouldn't be too bad, but there are cases when other worker threads, etc work on the HttpSM.

During my engine development there were some very odd timings which exposed some crashes due to unlocked HttpSM.

@shinrich shinrich added this to the 9.0.0 milestone Jul 26, 2018
@shinrich shinrich self-assigned this Jul 26, 2018
@shinrich shinrich force-pushed the ensure-lock-on-handle-event branch from 67d656a to ed5c403 Compare July 26, 2018 22:28
@maskit
Copy link
Member

maskit commented Jul 27, 2018

This looks pretty scary and risky change. handleEvent have been ensuring that the event is processed after returning from handleEvent. I'm not sure whether it is by design but this changes the behavior.

I understand that continuations's handleEvent shouldn't be called without the lock and this change ensures it, however I'd suggest that adding a new function (e.g. handleOrScheduleEvent) and use it instead of handleEvent unless the event really need to be processed on the function call.

@maskit
Copy link
Member

maskit commented Jul 27, 2018

Here is an example that is controlling the behavior (MUTEX_TRY_LOCK vs SCOPED_MUTEX_LOCK).

case VC_EVENT_READ_READY:
inactive_timeout_at = Thread::get_hrtime() + inactive_timeout;
if (e->cookie == &read_vio) {
if (read_vio.mutex) {
MUTEX_TRY_LOCK(lock, read_vio.mutex, this_ethread());
if (lock.is_locked() && read_vio.cont && this->current_reader) {
read_vio.cont->handleEvent(event, &read_vio);
} else {
this_ethread()->schedule_imm(read_vio.cont, event, &read_vio);
}
}
} else {
this->update_read_request(INT64_MAX, true);
}
break;
case VC_EVENT_EOS:
if (e->cookie == &read_vio) {
SCOPED_MUTEX_LOCK(lock, read_vio.mutex, this_ethread());
read_vio.cont->handleEvent(VC_EVENT_EOS, &read_vio);
} else if (e->cookie == &write_vio) {
SCOPED_MUTEX_LOCK(lock, write_vio.mutex, this_ethread());
write_vio.cont->handleEvent(VC_EVENT_EOS, &write_vio);
}
break;

@shinrich
Copy link
Member Author

I agree that it is a fundamental change, but calling handlers on unlocked HttpSM's is also scarey. In the example you gave, the mutex is already held, so the change in Continuation::handleEvent will not affect that particular case.

I'll review the code for other cases, but if the assertion is that handleEvent can only be called directly if the mutex is held, then the changes in the Continuation::handlleEvent should be no-ops

@shinrich shinrich force-pushed the ensure-lock-on-handle-event branch from ed5c403 to e1f8bbd Compare July 27, 2018 15:25
@shinrich
Copy link
Member Author

Thought on this some more. The case, I had a problem was when the event processing loop was calling handleEvent. There were cases were the continuation was being called without lock. I took @maskit's suggestion and make a new method for that case lockAndHandleEvent.

So this change should be less invasive, only addressing the event processing handleEvent case.

@maskit
Copy link
Member

maskit commented Jul 27, 2018

It seems much better than the original change but I'm still not so sure.

My point is event processing order. lockAndHandleEvent sounds like handleEvent with a built-in lock acquiring but actually it can just schedule the event and return. It will avoid crashes but it will also allow ATS to keep running strangely because of unexpected event processing order. This is why I suggested the name.

If we just want to ensure that a caller holds a lock for the continuation, we should probably use SCOPED_MUTEX_LOCK instead in lockAndHandleEvent. If we also want a function that TRY to acquire a lock, that should be named like handleOrScheduleEvent I think.

@shinrich
Copy link
Member Author

If in fact other threads are doing something on the continuation, doing a blocking lock seems even more dangerous. The Try and reschedule seems like a safer approach, particularly if we are concentrating on the event loop processing case.

@shinrich
Copy link
Member Author

Thinking on that some more and chatting with Alan, I think my origin logic was correct. In the event process logic the event mutex is locked before calling the event handler and the event lock and the continuation lock are supposed to be the same.

I think the issue I was seeing was a handleEvent called directly from the netvc (readSignalAndUpdate). The serverVC and HttpSM in the case of session reuse will not have the same mutex. I am in the midst of something right now, but I will put back my original version.

@shinrich shinrich force-pushed the ensure-lock-on-handle-event branch from e1f8bbd to f71ea1e Compare July 27, 2018 21:20
@shinrich
Copy link
Member Author

I pushed the origin back. If the caller is holding the continuation lock then there is no change in functionality.

If the caller is not holding the continuation lock (and the continuation has a mutex), then the handleEvent may reschedule and return immediately. This is a change in behavior , true. But in the original behavior the event handler is being called on the continuation without a lock and with some other thread holding the lock. This original behavior seems even worse than a potential race condition.

@maskit
Copy link
Member

maskit commented Jul 28, 2018

If the caller is holding the continuation lock then there is no change in functionality.

True.

But in the original behavior the event handler is being called on the continuation without a lock and with some other thread holding the lock. This original behavior seems even worse than a potential race condition.

On the case you were seeing, that is probably true.

But again, handleEvent have been ensuring that an event is completely processed on the function call even if it was accidentally called without a lock. It will not be postponed.

So lines following handleEvent can assume that an event handler have already processed the event. Places that require this behavior must do a blocking lock before calling handleEvent, however, similarly, places that don't require this behavior must do a try lock before calling handleEvent. This is the rule, right?

Why don't you put a try-lock outside the handleEvent that causes the issue then?

Guaranteeing holding a lock by doing it inside handleEvent sounds like a good idea, but if we do this, I think we need to provide both the two ways (block or postpone) so that we can choose appropriate one based on what we need at each places.

@shinrich
Copy link
Member Author

If you have a mutex associated with the continuation, there is no case where you should be calling the handleEvent without the lock. I can hunt down the cases that aren't grabbing the lock and do a try lock there, but in that case I would want to put a ink_release cert in handleEvent. If it very risky and vulnerable to race conditions to be calling the event handler on a continuation without holding its lock.

@d2r
Copy link
Contributor

d2r commented Jul 30, 2018

It seems to be we have agreement that calling handleEvent without locking the continuation's mutex is always incorrect, but there is disagreement in how we should eliminate the bug.

@shinrich's patch proposes that handleEvent itself should ensure we've locked the correct mutex before calling the event handler since, as you all noted, it is never correct to call the handler otherwise. Therefore this prevents the bug in all cases. If the lock cannot be immediately obtained, handleEvent reschedules the event and returns to the caller. This is because doing a blocking lock is more dangerous than not patching the code, and so this is worth changing the behavior this way.

@maskit points out that this patch will change handleEvent from synchronous method to an asynchronous method, and so this is very dangerous since there likely exists code that does not expect handleEvent to return asynchronously. @maskit suggests creating a second method rather than changing handleEvent in this way, or else making it the caller's responsibility to lock.

Hopefully I have understood the discussion correctly :)

Given that we agree the current code is broken, then—writing from the perspective of one who has not exhaustively search the impacted code—I am inclined to agree with @maskit here that changing methods from sync to async seems to break a pretty fundamental and straightforward expectation the callers have had up to now on handleEvent.

If we are going to change handleEvent to be async, then this patch must also exhaustively cover changes to any caller to handleEvent such that it handles the new async behavior. It is not obvious to me without doing this work completely that we would be safe from this new risk.

  • If we can show that apart from such additions to this patch, there exist no scenarios in which there is a such a risk, would that be acceptable, @maskit ? This would make everything consistent in how it calls handleEvent and what it expects, which I think is the central concern with the patch.

Alternatives so far:

  1. Make handleEvent async, and have it try to get the lock itself.
  2. Keep handleEvent synchronous, and add an ink_release_assert if the lock is not held
  3. Keep handleEvent synchronous, add an ink_release_assert if the lock is not held, and add something like handleEventAsync for the new logic.

@maskit
Copy link
Member

maskit commented Jul 31, 2018

@d2r Thank you for clarifying my opinion. That’s exactly what I wanted to say.

If we can show that apart from such additions to this patch, there exist no scenarios in which there is a such a risk, would that be acceptable, @maskit ?

Hmm, yeah, it’s acceptable, if all cases can handle the new async behavior. But if we do so, places that require sync behavior try to acquire a lock twice (outside handleEvent and inside handleEvent). It’s a bit redundant.

I’m on vacation. Don’t expect replies from me next 2 weeks.

@SolidWallOfCode
Copy link
Member

I think I'd go with option 3 with the name "dispatchEvent" which does the try lock either calls directly with the lock or dispatches an event if not. I think putting in the ink_release_assert to verify the lock in handleEvent is critical as well - we want to detect that breaking of the contract as early as possible.

@SolidWallOfCode
Copy link
Member

Also, acquiring a lock twice is very common in the core - that's precisely why the locks are recursive locks. The second acquisition is very fast because the code checks if the lock is already held by the thread and if so just bumps a counter.

@shinrich shinrich force-pushed the ensure-lock-on-handle-event branch from f71ea1e to 5545d59 Compare July 31, 2018 20:30
@shinrich
Copy link
Member Author

I am good with the consensus. I've updated the PR again to create a dispatchEvent which attempts to get the lock and reschedules if it cannot. And it adds an ink_release_assert if there is a lock and it is not held by the current thread on entry to handleEvent.

@shinrich shinrich force-pushed the ensure-lock-on-handle-event branch 2 times, most recently from 23596d9 to d043020 Compare July 31, 2018 21:47
@shinrich shinrich force-pushed the ensure-lock-on-handle-event branch from d043020 to 0e5d006 Compare August 1, 2018 14:59
@shinrich
Copy link
Member Author

shinrich commented Aug 1, 2018

Updated calls from handleEvent to dispatchEvent in many places in HttpSM that I think were vulnerable to the original issue plus cases of the tests that tweaked the ink_release_assert

@shinrich shinrich merged commit 15fe1ae into apache:master Aug 1, 2018
EThread *t = this_ethread();
MUTEX_TRY_LOCK(lock, this->mutex, t);
if (!lock.is_locked()) {
t->schedule_imm(this, event, data);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An event is created by schedule_imm, is there any mechanism to guarantee the Continuation will not be destroyed before the event call back. @shinrich

@zwoop
Copy link
Contributor

zwoop commented Aug 29, 2018

I'm going to remove the 7.1.5 Project on this, since going forward, we will need to make a second PR against the 7.1.x for proposed back ports.

@zwoop
Copy link
Contributor

zwoop commented Aug 30, 2018

This breaks release builds on macOS, fixed in #4107.

@bryancall bryancall modified the milestones: 9.0.0, 8.1.0 Mar 29, 2019
@bryancall
Copy link
Contributor

Cherry picked to 8.1.0

@zwoop zwoop modified the milestones: 8.1.0, 8.1.0-nogo Mar 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants