Fix assert due to unheld nh->mutex #5950
Closed
In our initial 9.0.x testing, we got a number of cores with the following stack.
The immediate failure is that the MUTEX_TRY_LOCK on nh->mutex failed in UnixNetVConnection::add_to_active_queue. Earlier we had issues with other threads trying to add to the active/keep-alive queues, which eventually corrupted the queues.
In the two cores I looked at, the UnixNetVConnection mutex and the EThread mutex are the same. At the time of the core the mutex is not held, but presumably it was moments before. I assume another thread was making a very transient grab for the nh->mutex.
In this path, the event HTTP2_SESSION_EVENT_REENABLE is sent every 128 frames, presumably to keep clients sending very large amounts of data from dominating the thread. This is new logic compared to our 7.1.x build.
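To illustrate the kind of batching I mean, here is a minimal sketch, not the actual Http2 session code. `Session`, `process_turn`, and the `reenable_events` counter are hypothetical names; the counter only stands in for scheduling HTTP2_SESSION_EVENT_REENABLE, and I am assuming the event is sent only when frames are still pending after a batch of 128.

```cpp
#include <cassert>
#include <cstddef>

// Batch size taken from the description above.
constexpr std::size_t kFramesPerTurn = 128;

// Hypothetical stand-in for a session with queued frames.
struct Session {
  std::size_t pending_frames  = 0; // frames still waiting to be handled
  std::size_t reenable_events = 0; // how many times we asked to be called back
};

// Handle at most kFramesPerTurn frames, then yield the thread.
// Returns the number of frames handled in this turn.
std::size_t
process_turn(Session &s)
{
  std::size_t handled = 0;
  while (s.pending_frames > 0 && handled < kFramesPerTurn) {
    --s.pending_frames;
    ++handled;
  }
  if (s.pending_frames > 0) {
    // Assumption: this is where the re-enable event would be scheduled
    // so that the remaining frames are handled on a later turn.
    ++s.reenable_events;
  }
  return handled;
}
```

The point is that the continuation arrives as an event the session sends to itself rather than as a network event, which is why it does not go through the usual locking path.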
This thread-to-thread signaling does not grab the nh->mutex the way a more standard network-event-driven path does.
This PR checks that the request is being made from the correct thread and, in that case, performs a blocking lock instead of a try lock.
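Roughly, the idea looks like the following. This is a simplified sketch using the standard library, not the code in this PR: `Handler`, its fields, and this free-function `add_to_active_queue` are hypothetical stand-ins, and the try-lock fallback for a foreign thread is my assumption about what a caller on the wrong thread should get.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a per-thread net handler.
struct Handler {
  std::mutex mutex;         // plays the role of nh->mutex
  std::thread::id owner;    // the thread that services this handler
  int active_queue_len = 0; // stands in for the active queue
};

// Returns true if the entry was queued, false if the lock was not available.
bool
add_to_active_queue(Handler &nh)
{
  if (std::this_thread::get_id() == nh.owner) {
    // Correct thread: any contention should be transient, so block
    // rather than fail on a momentary grab by another thread.
    std::lock_guard<std::mutex> lock(nh.mutex);
    ++nh.active_queue_len;
    return true;
  }
  // Foreign thread (assumption): only try, and let the caller retry later.
  std::unique_lock<std::mutex> lock(nh.mutex, std::try_to_lock);
  if (!lock.owns_lock()) {
    return false;
  }
  ++nh.active_queue_len;
  return true;
}
```

Blocking is only safe here because the owning thread is not expected to already hold the lock on this path; a thread that re-locks a `std::mutex` it already holds would deadlock.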
Diving back into my bug archives, I see that one of our users was complaining about a similar stack from Http1 land on our 7.x build, so placing the fix in ProxySession should address both issues.