
Conversation

@shinrich
Member

In our initial 9.0.x testing, we got a number of cores with the following stack.

```
(gdb) bt
#0  0x00002b957717c207 in raise () from /lib64/libc.so.6
#1  0x00002b957717d8f8 in abort () from /lib64/libc.so.6
#2  0x00002b95746e662b in ink_abort (message_format=message_format@entry=0x2b957474fe47 "%s:%d: failed assertion `%s`") at ink_error.cc:99
#3  0x00002b95746e3de5 in _ink_assert (
    expression=expression@entry=0x795e88 "!\"BUG: It must have acquired the NetHandler's lock before doing anything on active_queue.\"", 
    file=file@entry=0x795be0 "UnixNetVConnection.cc", line=line@entry=1471) at ink_assert.cc:37
#4  0x00000000006f4c49 in UnixNetVConnection::add_to_active_queue() () at UnixNetVConnection.cc:1471
#5  0x00000000005bda7c in Http2ConnectionState::create_stream (this=this@entry=0x2b99977a9f48, new_id=new_id@entry=251, error=...) at Http2ClientSession.h:191
#6  0x00000000005bea41 in rcv_headers_frame(Http2ConnectionState&, Http2Frame const&) () at Http2ConnectionState.cc:235
#7  0x00000000005c0cd9 in Http2ConnectionState::main_event_handler(int, void*) () at Http2ClientSession.h:95
#8  0x00000000005b5e81 in handleEvent (data=0x2b957fb033e0, event=2253, this=0x2b99977a9f48)
    at /sd/workspace/src/git.ouroath.com/Edge/build/_build/build_release_posix-x86_64_gcc_8/trafficserver9/iocore/eventsystem/I_Continuation.h:190
#9  handleEvent (data=0x2b957fb033e0, event=2253, this=0x2b99977a9f48)
    at /sd/workspace/src/git.ouroath.com/Edge/build/_build/build_release_posix-x86_64_gcc_8/trafficserver9/iocore/eventsystem/I_Continuation.h:186
#10 send_connection_event(Continuation*, int, void*) () at Http2ClientSession.cc:65
#11 0x00000000005b6a75 in Http2ClientSession::do_complete_frame_read() () at Http2ClientSession.cc:556
#12 0x00000000005b6edd in Http2ClientSession::state_process_frame_read(int, VIO*, bool) () at Http2ClientSession.cc:610
#13 0x00000000005b7671 in Http2ClientSession::state_start_frame_read (this=0x2b99977a9c00, event=100, edata=0x2b987ea886d0) at Http2ClientSession.cc:477
#14 0x00000000005b5fe7 in Http2ClientSession::main_event_handler (this=0x2b99977a9c00, event=2257, edata=0x2b9582311b40) at Http2ClientSession.cc:363
#15 0x0000000000736bfb in handleEvent (data=0x2b9582311b40, event=2257, this=0x2b99977a9c00) at I_Continuation.h:190
#16 handleEvent (data=0x2b9582311b40, event=2257, this=0x2b99977a9c00) at I_Continuation.h:186
#17 EThread::process_event(Event*, int) () at UnixEThread.cc:136
#18 0x0000000000737abc in EThread::execute_regular (this=this@entry=0x2b957b11a900) at UnixEThread.cc:249
#19 0x0000000000737eaa in execute (this=0x2b957b11a900) at UnixEThread.cc:338
#20 EThread::execute (this=0x2b957b11a900) at UnixEThread.cc:316
#21 0x0000000000736309 in spawn_thread_internal (a=0x2b9579193f80) at Thread.cc:92
#22 0x00002b957650edd5 in start_thread () from /lib64/libpthread.so.0
#23 0x00002b9577243ead in clone () from /lib64/libc.so.6
```

The immediate failure is that the MUTEX_TRY_LOCK on nh->mutex failed in UnixNetVConnection::add_to_active_queue. Earlier we had issues with other threads adding to the active/keep-alive queues and eventually corrupting them.
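
For reference, the failing pattern looks roughly like this (a sketch based on the assertion text in frame #4 of the backtrace, not quoted verbatim from the tree):

```cpp
void
UnixNetVConnection::add_to_active_queue()
{
  // Try-lock the owning NetHandler's mutex from whatever thread we are on.
  MUTEX_TRY_LOCK(lock, nh->mutex, this_ethread());
  if (!lock.is_locked()) {
    // A failed try-lock aborts the process instead of waiting or retrying;
    // this is the release assert that fired in the backtrace above.
    ink_release_assert(!"BUG: It must have acquired the NetHandler's lock before doing anything on active_queue.");
  }
  // ... manipulate nh->active_queue under the lock ...
}
```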

In both of the cores I looked at, the UnixNetVConnection mutex and the EThread mutex are the same. At the time of the core, the mutex is not held, but presumably it was moments before; I assume another thread was making a very transient grab for the nh->mutex.

In this path, the event HTTP2_SESSION_EVENT_REENABLE is sent every 128 frames, presumably to keep clients sending very large amounts of data from dominating the thread. This is new logic compared to our 7.1.x build.

This thread-to-thread signaling does not grab the nh->mutex the way a more standard network-event-driven path would.
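
Sketched, the batching looks something like the following (the constant name, counter, and scheduling call are assumptions for illustration, not the actual Http2 code):

```cpp
// Assumed names for illustration: a per-session frame counter and the
// 128-frame batch size described above.
constexpr int kReenableFrameBatch  = 128;
int frames_since_reenable          = 0; // would live on the session object

// Inside the frame-read loop (sketch):
if (++frames_since_reenable >= kReenableFrameBatch) {
  frames_since_reenable = 0;
  // Post HTTP2_SESSION_EVENT_REENABLE back to this continuation instead of
  // reading the next frame inline; the handler for this event runs without
  // ever taking nh->mutex, unlike a normal network read event.
  this_ethread()->schedule_imm(this, HTTP2_SESSION_EVENT_REENABLE);
}
```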

This PR checks whether the request is being made from the correct thread and performs a blocking lock in that case.

Diving back into my bug archives, I see that one of our users was complaining about a similar stack from HTTP/1 land on our 7.x build. So placing the fix in ProxySession should address both issues.
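
Roughly, the shape of the fix (member names such as nh->thread and the enqueue call are illustrative assumptions, not the actual diff):

```cpp
EThread *t = this_ethread();
if (t == nh->thread) {
  // Already on the NetHandler's home thread: blocking on its mutex here
  // cannot deadlock against the net event loop, so a full lock is safe.
  SCOPED_MUTEX_LOCK(lock, nh->mutex, t);
  nh->active_queue.enqueue(this); // hypothetical queue operation
} else {
  // Request came from some other thread: fail loudly rather than touching
  // the queues cross-thread.
  ink_release_assert(!"add_to_active_queue called from the wrong thread");
}
```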

@shinrich added this to the 10.0.0 milestone Sep 20, 2019
@shinrich self-assigned this Sep 20, 2019
@bryancall
Contributor

Fixes issue #5943

@bryancall
Contributor

bryancall commented Sep 20, 2019

Couldn't we just change the MUTEX_TRY_LOCK to a SCOPED_MUTEX_LOCK in UnixNetVConnection::add_to_active_queue()? I don't like that we are now double locking and using dynamic casting.
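
For contrast, the suggested simplification would look something like this (sketch):

```cpp
void
UnixNetVConnection::add_to_active_queue()
{
  // Block until nh->mutex is acquired instead of asserting when a
  // transient holder wins the race.
  SCOPED_MUTEX_LOCK(lock, nh->mutex, this_ethread());
  // ... active_queue manipulation proceeds under the lock ...
}
```

The trade-off is that a blocking acquire from another thread stalls that thread until the NetHandler releases its mutex.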

@bryancall
Contributor

bryancall commented Sep 20, 2019

Thinking about it more while eating a snack, why are we updating the NetHandler from two different threads? That seems like a bigger issue.

@shinrich
Member Author

Yes, updating from a different thread is concerning; hence the addition of the ink_release_assert in the first place. Presumably another continuation/action picked up the nh->mutex and got scheduled on another thread. In this particular crash, the lock had been dropped by the time the core was dumped, so I don't know what was behaving badly.

Contributor

@bryancall left a comment

I am -1 on this change. It doesn't fix the issue, and it has similar behavior to the existing try lock with more complexity.

@bryancall
Contributor

Talked to @shinrich about this PR; #5954 is a better way to track down which thread is trying to get a lock on the NetHandler.

@bryancall closed this Sep 24, 2019
@bryancall removed this from the 10.0.0 milestone Sep 24, 2019