Correctly track H2 post vio bytes #6401

shinrich · 2020-02-06T15:54:41Z

This should address issue #6313. Or at least it seems to be addressing the crash we were seeing after introducing the commit with the H2 post processing optimization. The key thing this commit addresses is updating the vio ndone on the read_vio in the Http2Stream. We would see WRITE_COMPLETE on the tunnel server side for the post body, but would not see the READ_COMPLETE for the user agent post side. In this post case, the client request is specifying a content-length. Eventually the inactivity timeout would trigger on the user agent and the read_vio on the server side would be left open.

This PR updates the ndone on the H2 vio and it does some other cleanup to ensure that sessions not in the tunnel are marked as such so they can be correctly cleaned up before the SM is deleted.

I ran a precursor version of this patch on a production box overnight without crash. This PR has most of the asserts I added removed. I'm running this patch now.

masaori335 · 2020-02-06T23:16:04Z

Great! Thanks for finding the root cause. I'll take a look carefully.
Can the slow_post test reproduce the situation?

shinrich · 2020-02-06T23:21:58Z

I don't think so. I fixed slow post because it failed due to 8080 already being in use (transient error).

proxy/http2/Http2ConnectionState.cc

masaori335 · 2020-02-06T23:59:05Z

proxy/http2/Http2ConnectionState.cc

  myreader->writer()->dealloc_reader(myreader);

-  if (frame.header().flags & HTTP2_FLAGS_DATA_END_STREAM) {
+  if (frame.header().flags & HTTP2_FLAGS_DATA_END_STREAM || stream->read_vio_done()) {


This looks what we need 👍 ( Checking read_vio.ntodo() and signal VC_EVENT_READ_COMPLETE event )

What does this VC_EVENT_READ_COMPLETE mean exactly? I'm a bit worried about trailing header, we don't support trailing header now though.

What we really need to do might be scheduling READ_COMPLETE on END_STREAM carried by second HEADERS frame. This may explain a case we somehow miss READ_COMPLETE.

https://tools.ietf.org/html/rfc7540#section-8.1.3

Trailing header fields are sent as a header block after both the request or response header block and all the DATA frames have been sent. The HEADERS frame starting the trailers header block has the END_STREAM flag set. The following example includes both a 100 (Continue) status code, which is sent in response to a request containing a "100-continue" token in the Expect header field, and trailing header fields: HTTP/1.1 100 Continue HEADERS Extension-Field: bar ==> - END_STREAM + END_HEADERS :status = 100 extension-field = bar HTTP/1.1 200 OK HEADERS Content-Type: image/jpeg ==> - END_STREAM Transfer-Encoding: chunked + END_HEADERS Trailer: Foo :status = 200 content-length = 123 123 content-type = image/jpeg {binary data} trailer = Foo 0 Foo: bar DATA - END_STREAM {binary data} HEADERS + END_STREAM + END_HEADERS foo = bar

Good point. ATS just set Http2Stream::recv_end_stream flag and ignore trailing header.

I could reproduce the crash (what we saw on the docs) with a trailing header & setting transaction_active_timeout_out.

Process 16485 stopped * thread #2, name = '[ET_NET 0]', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT) frame #0: 0x0000000100010829 traffic_server`Continuation::handleEvent(this=0x000000000ad7a3f0, event=106, data=0x000000000ad9be00) at I_Continuation.h:189:5 186 handleEvent(int event = CONTINUATION_EVENT_NONE, void *data = nullptr) 187 { 188 // If there is a lock, we must be holding it on entry -> 189 ink_release_assert(!mutex || mutex->thread_holding == this_ethread()); 190 return (this->*handler)(event, data); 191 } 192 (lldb) bt * thread #2, name = '[ET_NET 0]', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT) * frame #0: 0x0000000100010829 traffic_server`Continuation::handleEvent(this=0x000000000ad7a3f0, event=106, data=0x000000000ad9be00) at I_Continuation.h:189:5 frame #1: 0x00000001004e14af traffic_server`read_signal_and_update(event=106, vc=0x000000000ad9bc20) at UnixNetVConnection.cc:83:24 frame #2: 0x00000001004e236f traffic_server`UnixNetVConnection::mainEvent(this=0x000000000ad9bc20, event=106, e=0x00000000010263a0) at UnixNetVConnection.cc:1153:9 frame #3: 0x00000001000108d1 traffic_server`Continuation::handleEvent(this=0x000000000ad9bc20, event=106, data=0x00000000010263a0) at I_Continuation.h:190:12 frame #4: 0x000000010049684e traffic_server`UnixNetVConnection::callback(this=0x000000000ad9bc20, event=106, data=0x00000000010263a0) at P_UnixNetVConnection.h:244:18 frame #5: 0x00000001004d1c46 traffic_server`InactivityCop::check_inactivity(this=0x0000000006600060, event=2, e=0x00000000010263a0) at UnixNet.cc:85:13 frame #6: 0x00000001000108d1 traffic_server`Continuation::handleEvent(this=0x0000000006600060, event=2, data=0x00000000010263a0) at I_Continuation.h:190:12 frame #7: 0x0000000100519c8c traffic_server`EThread::process_event(this=0x0000000005000000, e=0x00000000010263a0, calling_code=2) at UnixEThread.cc:126:22 frame #8: 0x000000010051a5bd traffic_server`EThread::execute_regular(this=0x0000000005000000) at UnixEThread.cc:239:11 frame #9: 0x000000010051abc8 traffic_server`EThread::execute(this=0x0000000005000000) at UnixEThread.cc:330:11 frame #10: 0x0000000100518d62 traffic_server`spawn_thread_internal(a=0x0000000003f160c0) at Thread.cc:92:12 frame #11: 0x00007fff6d482e65 libsystem_pthread.dylib`_pthread_start + 148 frame #12: 0x00007fff6d47e83b libsystem_pthread.dylib`thread_start + 15 (lldb) p this (Continuation *) $0 = 0x000000000ad7a3f0 (lldb) p *this (Continuation) $1 = { handler = de ad be ef de ad be ef de ad be ef de ad be ef handler_name = 0xefbeaddeefbeadde "" mutex = { m_ptr = 0xefbeaddeefbeadde } link = { SLink<Continuation> = { next = 0xefbeaddeefbeadde } prev = 0xefbeaddeefbeadde } control_flags = (raw_flags = 4022250974) thread_affinity = 0xefbeaddeefbeadde }

The crash is gone when I revert c55001b.
@shinrich Could you check your case is bad handling of trailing header too?

maskit · 2020-02-07T00:41:16Z

proxy/http2/Http2ConnectionState.cc


-  if (frame.header().flags & HTTP2_FLAGS_DATA_END_STREAM) {
+  if (frame.header().flags & HTTP2_FLAGS_DATA_END_STREAM || stream->read_vio_done()) {
    // TODO: set total written size to read_vio.nbytes


Is this still a todo?

Eventually, yes, still to do. As it stands with the origin needing to be chunked if nbytes is MAXINT on the H2 producer, the HttpTunnel logic updates the consumer write vio nbytes to the appropriate chunked size and as long as the EOS is set on the last data frame it is all ok.

masaori335 · 2020-02-07T07:43:54Z

It looks like we have 3 issues. Separated PRs for each is preferred.
1). Http2ClientSession/ConnectionState doesn't signal VC_EVENT_READ_COMPLETE correctly.
2). HttpTunnel doesn't close a server session correctly when a producer got VC_EVENT_INACTIVITY_TIMEOUT in some cases.
3). Http2Stream::read_vio.ndone is not increased correctly.

shinrich · 2020-02-07T14:13:46Z

I'll check for the trailing header in our environment. That would make sense for why content-length would be met but no EOS on the data frame. We can pull apart the PRs, but 1 won't really work unless 3 is also performed. I think a PR for 1/3 and another for 2 makes more sense.

shinrich · 2020-02-07T14:20:06Z

So what happens with a trailing header an no content-length on the post body? I assume in that case the user agent must set a content-length. The trailing headers become an issue with GRPC, so it is something we will have to work through with the H2 origin work at least.

shinrich · 2020-02-07T22:05:05Z

I split this PR into #6406, #6407, and #6040
I also spent much of today adding warning messages and analyzing logs trying to identify our traffic that was causing the bad behavior. We are not seeing any H2 trail frames from our clients, so trail frames is not exercising the problem in our case. We see quite a few clients closing the data stream with an empty data frame with the EOS bit set, but that frame is always sent very quickly after the last data frame. We do see some clients that just never send any EOS bit. Ultimately ATS closes down the stream while the stream is in the HTTP2_STREAM_STATE_HALF_CLOSED_LOCAL state.
Unfortunately looking through the logs, I cannot identify one culprit. Not even mobile vs desktop or ios vs android. In any case, ATS needs to protect itself in the presence of an ill behaved client. Plus sending trailers is valid and correctly tracking the completion of the read_vio independent of the EOS bits is necessary in that case.

masaori335 · 2020-02-10T02:27:59Z

Thanks for identifying the traffic which triggering the crash.
It looks like there're 2 similar cases that uncovered HttpTunnel's issue.

a). Trailing Header
b). Bad clients never send EOS

shinrich · 2020-02-11T22:44:01Z

Closing this PR in deference to the split PR's

shinrich added HTTP/2 HTTP Crash labels Feb 6, 2020

shinrich added this to the 10.0.0 milestone Feb 6, 2020

shinrich requested review from masaori335 and zwoop February 6, 2020 15:54

shinrich self-assigned this Feb 6, 2020

shinrich force-pushed the fix-stale-sm-inactive-timeout branch 3 times, most recently from b65adb6 to c0fee68 Compare February 6, 2020 22:13

shinrich added 2 commits February 6, 2020 22:13

Correctly track H2 post vio bytes

c0fee68

Change slow_post test to auto-select port

218cd57

masaori335 reviewed Feb 6, 2020

View reviewed changes

proxy/http2/Http2ConnectionState.cc Show resolved Hide resolved

masaori335 reviewed Feb 6, 2020

View reviewed changes

maskit reviewed Feb 7, 2020

View reviewed changes

masaori335 mentioned this pull request Feb 7, 2020

Signal VC_EVENT_READ_COMPLETE when ATS received END_STREAM flag #6404

Merged

1 task

This was referenced Feb 7, 2020

Correctly track H2 post vio bytes #6407

Closed

Improve logic to cleanup server-session exactly once #6408

Closed

shinrich mentioned this pull request Feb 11, 2020

Fix crash when H2 client does not set End-of-data bit #6413

Merged

shinrich closed this Feb 11, 2020

zwoop removed this from the 10.0.0 milestone Feb 11, 2020

Correctly track H2 post vio bytes #6401

Correctly track H2 post vio bytes #6401

Uh oh!

Conversation

shinrich commented Feb 6, 2020

Uh oh!

masaori335 commented Feb 6, 2020

Uh oh!

shinrich commented Feb 6, 2020

Uh oh!

Uh oh!

masaori335 Feb 6, 2020

Choose a reason for hiding this comment

Uh oh!

maskit Feb 7, 2020

Choose a reason for hiding this comment

Uh oh!

masaori335 Feb 7, 2020

Choose a reason for hiding this comment

Uh oh!

masaori335 Feb 7, 2020

Choose a reason for hiding this comment

Uh oh!

masaori335 Feb 7, 2020

Choose a reason for hiding this comment

Uh oh!

maskit Feb 7, 2020

Choose a reason for hiding this comment

Uh oh!

shinrich Feb 7, 2020

Choose a reason for hiding this comment

Uh oh!

masaori335 commented Feb 7, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shinrich commented Feb 7, 2020

Uh oh!

shinrich commented Feb 7, 2020

Uh oh!

shinrich commented Feb 7, 2020

Uh oh!

masaori335 commented Feb 10, 2020

Uh oh!

shinrich commented Feb 11, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

masaori335 commented Feb 7, 2020 •

edited

Loading