
Conversation

@shinrich
Member

In our initial 9.0.x testing, we got a number of cores with the following stack.

```
(gdb) bt
#0  0x00002b957717c207 in raise () from /lib64/libc.so.6
#1  0x00002b957717d8f8 in abort () from /lib64/libc.so.6
#2  0x00002b95746e662b in ink_abort (message_format=message_format@entry=0x2b957474fe47 "%s:%d: failed assertion `%s`") at ink_error.cc:99
#3  0x00002b95746e3de5 in _ink_assert (
    expression=expression@entry=0x795e88 "!\"BUG: It must have acquired the NetHandler's lock before doing anything on active_queue.\"", 
    file=file@entry=0x795be0 "UnixNetVConnection.cc", line=line@entry=1471) at ink_assert.cc:37
#4  0x00000000006f4c49 in UnixNetVConnection::add_to_active_queue() () at UnixNetVConnection.cc:1471
#5  0x00000000005bda7c in Http2ConnectionState::create_stream (this=this@entry=0x2b99977a9f48, new_id=new_id@entry=251, error=...) at Http2ClientSession.h:191
#6  0x00000000005bea41 in rcv_headers_frame(Http2ConnectionState&, Http2Frame const&) () at Http2ConnectionState.cc:235
#7  0x00000000005c0cd9 in Http2ConnectionState::main_event_handler(int, void*) () at Http2ClientSession.h:95
#8  0x00000000005b5e81 in handleEvent (data=0x2b957fb033e0, event=2253, this=0x2b99977a9f48)
    at /sd/workspace/src/git.ouroath.com/Edge/build/_build/build_release_posix-x86_64_gcc_8/trafficserver9/iocore/eventsystem/I_Continuation.h:190
#9  handleEvent (data=0x2b957fb033e0, event=2253, this=0x2b99977a9f48)
    at /sd/workspace/src/git.ouroath.com/Edge/build/_build/build_release_posix-x86_64_gcc_8/trafficserver9/iocore/eventsystem/I_Continuation.h:186
#10 send_connection_event(Continuation*, int, void*) () at Http2ClientSession.cc:65
#11 0x00000000005b6a75 in Http2ClientSession::do_complete_frame_read() () at Http2ClientSession.cc:556
#12 0x00000000005b6edd in Http2ClientSession::state_process_frame_read(int, VIO*, bool) () at Http2ClientSession.cc:610
#13 0x00000000005b7671 in Http2ClientSession::state_start_frame_read (this=0x2b99977a9c00, event=100, edata=0x2b987ea886d0) at Http2ClientSession.cc:477
#14 0x00000000005b5fe7 in Http2ClientSession::main_event_handler (this=0x2b99977a9c00, event=2257, edata=0x2b9582311b40) at Http2ClientSession.cc:363
#15 0x0000000000736bfb in handleEvent (data=0x2b9582311b40, event=2257, this=0x2b99977a9c00) at I_Continuation.h:190
#16 handleEvent (data=0x2b9582311b40, event=2257, this=0x2b99977a9c00) at I_Continuation.h:186
#17 EThread::process_event(Event*, int) () at UnixEThread.cc:136
#18 0x0000000000737abc in EThread::execute_regular (this=this@entry=0x2b957b11a900) at UnixEThread.cc:249
#19 0x0000000000737eaa in execute (this=0x2b957b11a900) at UnixEThread.cc:338
#20 EThread::execute (this=0x2b957b11a900) at UnixEThread.cc:316
#21 0x0000000000736309 in spawn_thread_internal (a=0x2b9579193f80) at Thread.cc:92
#22 0x00002b957650edd5 in start_thread () from /lib64/libpthread.so.0
#23 0x00002b9577243ead in clone () from /lib64/libc.so.6
```

The immediate failure is that the MUTEX_TRY_LOCK on nh->mutex failed in UnixNetVConnection::add_to_active_queue. Earlier we had issues with other threads adding to the active/keep-alive queues and eventually corrupting them.
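
For reference, the failing pattern looks roughly like this (a sketch based on the assertion text in frame #4 of the backtrace, not quoted verbatim from the tree):

```cpp
void
UnixNetVConnection::add_to_active_queue()
{
  // Try-lock the owning NetHandler's mutex from whatever thread we are on.
  MUTEX_TRY_LOCK(lock, nh->mutex, this_ethread());
  if (!lock.is_locked()) {
    // A failed try-lock aborts the process instead of waiting or retrying;
    // this is the release assert that fired in the backtrace above.
    ink_release_assert(!"BUG: It must have acquired the NetHandler's lock before doing anything on active_queue.");
  }
  // ... manipulate nh->active_queue under the lock ...
}
```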

In both of the cores I looked at, the UnixNetVConnection mutex and the EThread mutex are the same. At the time of the core, the mutex is not held, but presumably it was moments before; I assume another thread was making a very transient grab for the nh->mutex.

In this path, the event HTTP2_SESSION_EVENT_REENABLE is sent every 128 frames, presumably to keep clients sending very large amounts of data from dominating the thread. This is new logic compared to our 7.1.x build.

This thread-to-thread signaling does not grab the nh->mutex the way a more standard network-event-driven path would.
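
Sketched, the batching looks something like the following (the constant name, counter, and scheduling call are assumptions for illustration, not the actual Http2 code):

```cpp
// Assumed names for illustration: a per-session frame counter and the
// 128-frame batch size described above.
constexpr int kReenableFrameBatch  = 128;
int frames_since_reenable          = 0; // would live on the session object

// Inside the frame-read loop (sketch):
if (++frames_since_reenable >= kReenableFrameBatch) {
  frames_since_reenable = 0;
  // Post HTTP2_SESSION_EVENT_REENABLE back to this continuation instead of
  // reading the next frame inline; the handler for this event runs without
  // ever taking nh->mutex, unlike a normal network read event.
  this_ethread()->schedule_imm(this, HTTP2_SESSION_EVENT_REENABLE);
}
```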

This PR checks whether the request is being made from the correct thread and performs a blocking lock in that case.

Diving back into my bug archives, I see that one of our users was complaining about a similar stack from HTTP/1 land on our 7.x build. So placing the fix in ProxySession should address both issues.
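
Roughly, the shape of the fix (member names such as nh->thread and the enqueue call are illustrative assumptions, not the actual diff):

```cpp
EThread *t = this_ethread();
if (t == nh->thread) {
  // Already on the NetHandler's home thread: blocking on its mutex here
  // cannot deadlock against the net event loop, so a full lock is safe.
  SCOPED_MUTEX_LOCK(lock, nh->mutex, t);
  nh->active_queue.enqueue(this); // hypothetical queue operation
} else {
  // Request came from some other thread: fail loudly rather than touching
  // the queues cross-thread.
  ink_release_assert(!"add_to_active_queue called from the wrong thread");
}
```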

@shinrich added this to the 10.0.0 milestone Sep 20, 2019
@shinrich self-assigned this Sep 20, 2019
@bryancall
Contributor

Fixes issue #5943

@bryancall
Contributor

bryancall commented Sep 20, 2019

Couldn't we just change the MUTEX_TRY_LOCK to a SCOPED_MUTEX_LOCK in UnixNetVConnection::add_to_active_queue()? I don't like that we are now double locking and using dynamic casting.
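
For contrast, the suggested simplification would look something like this (sketch):

```cpp
void
UnixNetVConnection::add_to_active_queue()
{
  // Block until nh->mutex is acquired instead of asserting when a
  // transient holder wins the race.
  SCOPED_MUTEX_LOCK(lock, nh->mutex, this_ethread());
  // ... active_queue manipulation proceeds under the lock ...
}
```

The trade-off is that a blocking acquire from another thread stalls that thread until the NetHandler releases its mutex.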

@bryancall
Contributor

bryancall commented Sep 20, 2019

Thinking about it more while eating a snack, why are we updating the NetHandler from two different threads? That seems like a bigger issue.

@shinrich
Member Author

Yes, updating from a different thread is concerning; hence the addition of the ink_release_assert in the first place. Presumably another continuation/action picked up the nh->mutex and got scheduled on another thread. In this particular crash, the lock had been dropped by the time the core was dumped, so I don't know what was behaving badly.

Contributor

@bryancall left a comment

I am -1 on this change. It doesn't fix the issue, and it has similar behavior to the existing try lock with more complexity.

@bryancall
Contributor

Talked to @shinrich about this PR; #5954 is a better way to track down which thread is trying to get a lock on the NetHandler.

@bryancall closed this Sep 24, 2019
@bryancall removed this from the 10.0.0 milestone Sep 24, 2019