Skip to content

Conversation

@shinrich
Copy link
Member

@shinrich shinrich commented Nov 7, 2017

Removing the code to remove the stream from the ConnectionState stream list in Http2ConnectionState::release_stream. Most of the time Http2ConnectionState::delete_stream is called before Http2ConnectionState::release_stream. But if it is called the other way around, delete_stream will exit right away and the logic to remove the stream from the dependence tree is not executed. That leaves references to deleted streams.

@shinrich shinrich added this to the 8.0.0 milestone Nov 7, 2017
@shinrich shinrich self-assigned this Nov 7, 2017
@shinrich shinrich requested a review from maskit November 7, 2017 00:14
@maskit
Copy link
Member

maskit commented Nov 7, 2017

This will resolve the issue, and removing the line from release_stream makes sense since delete_stream has to be called before release_stream. However, if I understand correctly, we should land both this fix and my fix(#2774), right? Because this fix still depends on the safety net (calling delete_stream in Http2Stream::destroy).

@shinrich
Copy link
Member Author

shinrich commented Nov 7, 2017

This fix made things a lot better, but not perfect. One crash overnight instead of a crash every 10 minutes. The core shows a "bad" node in the dependency try. So we probably need a combination of the two fixes. I just compiled a binary with this fix and your fixes. I simplified that a bit to just always call delete_stream and then release_stream in Http2Stream::destroy(). I don't see the value of calling delete_stream, release_stream, delete_stream. The second delete_stream will always be an immediate return.

@zwoop
Copy link
Contributor

zwoop commented Nov 7, 2017

We want this for 7.1.2 I assume? I cherry-picked #2774 to 7.1.x already.

@maskit
Copy link
Member

maskit commented Nov 7, 2017

@shinrich I’m fine with this but should we wait for the combinated and simplified update?

@shinrich shinrich force-pushed the h2-priority-crash-fix branch from 218f990 to 534338a Compare November 8, 2017 16:22
@shinrich
Copy link
Member Author

shinrich commented Nov 8, 2017

Pushed another set of changes. I spent yesterday iterating over a variety of changes, but each fix would continue to crash with what appeared to be stale Nodes in one of the Http2DependencyTree queues. On @SolidWallOfCode's advice, I added "in" methods to the PriorityQueue and Http2DependencyTree and ran last night with asserts that in was false at the end of Http2DependencyTree::remove. It failed two times last night, both on the same stack, see below. I added another lock within Http2ConnectionState::delete_stream. That code has been running on my prod box for a couple hours. Will let it keep cooking today as prime time east coast comes on. But so far so good.

The changes at this point include

  • Remove the remove from the Http2ConnectionState stream_list in release_stream. This fixed the every 10 minute crash.
  • Integration of @maskit's fix by pulling a call to delete_stream into the Http2Stream::destroy_method. This gives us our "safety" net. The delete_stream should be called before the release_stream to avoid race conditions with Http2ClientSession going into keep-alive mode too soon.
  • null out the Http2Stream reference to priority_node.
  • Addition of the test "in" methods. I have the asserts commented out because I have performance concerns about them build run all of the time. As I understand things, the argument to the ink_asserts is still evaluated even if the assert is turned off. But the methods might be useful for future debugging.

My test version also had event tracking on the XMIT and FINI events, but that didn't seem to help much, so I didn't push forward those changes.

#0  0x00002ac493258495 in raise () from /lib64/libc.so.6
#1  0x00002ac493259bfd in abort () from /lib64/libc.so.6
#2  0x00002ac490da00cc in ink_abort (message_format=0x2ac490db603c "%s:%d: failed assertion `%s`")
    at ../../../../trafficserver/lib/ts/ink_error.cc:99
#3  0x00002ac490d9d696 in _ink_assert (expression=0x807be9 "!this->in(nullptr, node)", 
    file=0x807bc5 "Http2DependencyTree.h", line=246) at ../../../../trafficserver/lib/ts/ink_assert.cc:37
#4  0x0000000000670355 in Http2DependencyTree<Http2Stream*>::remove (this=0x2ab08b366c90, node=0x2ac509b7b000)
    at Http2DependencyTree.h:246
#5  0x000000000066c2e5 in Http2ConnectionState::delete_stream (this=0x2ac545751a50, stream=0x2ab25f053d20)
    at Http2ConnectionState.cc:1137
#6  0x00000000006696fd in rcv_rst_stream_frame (cstate=..., frame=...) at Http2ConnectionState.cc:482
#7  0x000000000066b4fe in Http2ConnectionState::main_event_handler (this=0x2ac545751a50, event=2253, 
    edata=0x2ac5002a28f0) at Http2ConnectionState.cc:914
#8  0x00000000005246cc in Continuation::handleEvent (this=0x2ac545751a50, event=2253, data=0x2ac5002a28f0)
    at /home/shinrich/yats_build/trafficserver/iocore/eventsystem/I_Continuation.h:153
#9  0x0000000000663e1b in send_connection_event (cont=0x2ac545751a50, event=2253, edata=0x2ac5002a28f0)
    at Http2ClientSession.cc:58
#10 0x000000000066617b in Http2ClientSession::do_complete_frame_read (this=0x2ac545751800)
    at Http2ClientSession.cc:485
#11 0x0000000000666373 in Http2ClientSession::state_process_frame_read (this=0x2ac545751800, event=100, 
    vio=0x2aad1e865c80, inside_frame=false) at Http2ClientSession.cc:522
#12 0x000000000066592c in Http2ClientSession::state_start_frame_read (this=0x2ac545751800, event=100, 
    edata=0x2aad1e865c80) at Http2ClientSession.cc:415
#13 0x00000000006650db in Http2ClientSession::main_event_handler (this=0x2ac545751800, event=100, 
    edata=0x2aad1e865c80) at Http2ClientSession.cc:317
#14 0x00000000005246cc in Continuation::handleEvent (this=0x2ac545751800, event=100, data=0x2aad1e865c80)
    at /home/shinrich/yats_build/trafficserver/iocore/eventsystem/I_Continuation.h:153
#15 0x000000000079ec2c in read_signal_and_update (event=100, vc=0x2aad1e865b50) at UnixNetVConnection.cc:144
#16 0x00000000007a1bc6 in UnixNetVConnection::readSignalAndUpdate (this=0x2aad1e865b50, event=100)
    at UnixNetVConnection.cc:1092
#17 0x0000000000782c6c in SSLNetVConnection::net_read_io (this=0x2aad1e865b50, nh=0x2ac495711c00, lthread=
    0x2ac49570e010) at SSLNetVConnection.cc:596

@shinrich shinrich force-pushed the h2-priority-crash-fix branch 2 times, most recently from f990139 to e25837b Compare November 8, 2017 16:58
@shinrich shinrich force-pushed the h2-priority-crash-fix branch from e25837b to 850feec Compare November 8, 2017 17:28
@maskit maskit requested a review from masaori335 November 9, 2017 00:44
@maskit
Copy link
Member

maskit commented Nov 9, 2017

@masaori335 It's priority tree stuff :)

ink_assert(client_streams_out_count > 0);
--client_streams_out_count;
}
stream_list.remove(stream);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now release_stream() doesn't release stream ;) We need some cleanup here like a) move counter updates outside of this function, b) remove argument, and c) rename this function.
But fixing the crash is urgency, so I'm fine with this changes.

Copy link
Contributor

@masaori335 masaori335 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tree<T>::in() looks useful for debugging, and other changes looks reasonable. Let's land this if @shinrich's test on production goes well.

@shinrich
Copy link
Member Author

shinrich commented Nov 9, 2017

A couple crashes overnight after I removed my outstanding event tracking. Adding them back in my test build.

@shinrich
Copy link
Member Author

Fighting possibly unrelated issues on my test machine today, so I am not confident on my enhancements to this PR. I think what is here is definitely much better than what we started with. I think there are still crashes, but once or twice a day instead of once every 10 minutes. I will continue to work on this tomorrow. But we may want to take this and get it moved over to 7.1.x if we are getting ready to make drop there.

@zwoop
Copy link
Contributor

zwoop commented Nov 11, 2017

@maskit We ok with landing this as-is, and cherry-pick to 7.1.x ?

@maskit maskit merged commit 4e5f024 into apache:master Nov 11, 2017
@maskit
Copy link
Member

maskit commented Nov 11, 2017

@zwoop Yeh, this should be cherry-picked.

@zwoop zwoop modified the milestones: 8.0.0, 7.1.2 Nov 13, 2017
@zwoop
Copy link
Contributor

zwoop commented Nov 13, 2017

Cherry-picked to 7.1.2

@shinrich
Copy link
Member Author

Looks like PR #2339 (already cherry-picked to 7.1.2) solves the rest of the issue I was seeing. Without this fix, some priority nodes would get stuck in the grandparent of the tree, and not get cleaned up correctly. So with PR #2339 should all be good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants