Skip to content

Conversation

@shinrich
Copy link
Member

@shinrich shinrich commented Feb 6, 2020

This should address issue #6313. Or at least it seems to be addressing the crash we were seeing after introducing the commit with the H2 post processing optimization. The key thing this commit addresses is updating the vio ndone on the read_vio in the Http2Stream. We would see WRITE_COMPLETE on the tunnel server side for the post body, but would not see the READ_COMPLETE for the user agent post side. In this post case, the client request is specifying a content-length. Eventually the inactivity timeout would trigger on the user agent and the read_vio on the server side would be left open.

This PR updates the ndone on the H2 vio and it does some other cleanup to ensure that sessions not in the tunnel are marked as such so they can be correctly cleaned up before the SM is deleted.

I ran a precursor version of this patch on a production box overnight without crash. This PR has most of the asserts I added removed. I'm running this patch now.

@shinrich shinrich added this to the 10.0.0 milestone Feb 6, 2020
@shinrich shinrich requested review from masaori335 and zwoop February 6, 2020 15:54
@shinrich shinrich self-assigned this Feb 6, 2020
@shinrich shinrich force-pushed the fix-stale-sm-inactive-timeout branch 3 times, most recently from b65adb6 to c0fee68 Compare February 6, 2020 22:13
@masaori335
Copy link
Contributor

Great! Thanks for finding the root cause. I'll take a look carefully.
Can the slow_post test reproduce the situation?

@shinrich
Copy link
Member Author

shinrich commented Feb 6, 2020

I don't think so. I fixed slow post because it failed due to 8080 already being in use (transient error).

myreader->writer()->dealloc_reader(myreader);

if (frame.header().flags & HTTP2_FLAGS_DATA_END_STREAM) {
if (frame.header().flags & HTTP2_FLAGS_DATA_END_STREAM || stream->read_vio_done()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks what we need 👍 ( Checking read_vio.ntodo() and signal VC_EVENT_READ_COMPLETE event )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this VC_EVENT_READ_COMPLETE mean exactly? I'm a bit worried about trailing header, we don't support trailing header now though.

What we really need to do might be scheduling READ_COMPLETE on END_STREAM carried by second HEADERS frame. This may explain a case we somehow miss READ_COMPLETE.

https://tools.ietf.org/html/rfc7540#section-8.1.3

   Trailing header fields are sent as a header block after both the
   request or response header block and all the DATA frames have been
   sent.  The HEADERS frame starting the trailers header block has the
   END_STREAM flag set.

   The following example includes both a 100 (Continue) status code,
   which is sent in response to a request containing a "100-continue"
   token in the Expect header field, and trailing header fields:

     HTTP/1.1 100 Continue            HEADERS
     Extension-Field: bar       ==>     - END_STREAM
                                        + END_HEADERS
                                          :status = 100
                                          extension-field = bar

     HTTP/1.1 200 OK                  HEADERS
     Content-Type: image/jpeg   ==>     - END_STREAM
     Transfer-Encoding: chunked         + END_HEADERS
     Trailer: Foo                         :status = 200
                                          content-length = 123
     123                                  content-type = image/jpeg
     {binary data}                        trailer = Foo
     0
     Foo: bar                         DATA
                                        - END_STREAM
                                      {binary data}

                                      HEADERS
                                        + END_STREAM
                                        + END_HEADERS
                                          foo = bar

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. ATS just set Http2Stream::recv_end_stream flag and ignore trailing header.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could reproduce the crash (what we saw on the docs) with a trailing header & setting transaction_active_timeout_out.

Process 16485 stopped
* thread #2, name = '[ET_NET 0]', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x0000000100010829 traffic_server`Continuation::handleEvent(this=0x000000000ad7a3f0, event=106, data=0x000000000ad9be00) at I_Continuation.h:189:5
   186    handleEvent(int event = CONTINUATION_EVENT_NONE, void *data = nullptr)
   187    {
   188      // If there is a lock, we must be holding it on entry
-> 189      ink_release_assert(!mutex || mutex->thread_holding == this_ethread());
   190      return (this->*handler)(event, data);
   191    }
   192
(lldb) bt
* thread #2, name = '[ET_NET 0]', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
  * frame #0: 0x0000000100010829 traffic_server`Continuation::handleEvent(this=0x000000000ad7a3f0, event=106, data=0x000000000ad9be00) at I_Continuation.h:189:5
    frame #1: 0x00000001004e14af traffic_server`read_signal_and_update(event=106, vc=0x000000000ad9bc20) at UnixNetVConnection.cc:83:24
    frame #2: 0x00000001004e236f traffic_server`UnixNetVConnection::mainEvent(this=0x000000000ad9bc20, event=106, e=0x00000000010263a0) at UnixNetVConnection.cc:1153:9
    frame #3: 0x00000001000108d1 traffic_server`Continuation::handleEvent(this=0x000000000ad9bc20, event=106, data=0x00000000010263a0) at I_Continuation.h:190:12
    frame #4: 0x000000010049684e traffic_server`UnixNetVConnection::callback(this=0x000000000ad9bc20, event=106, data=0x00000000010263a0) at P_UnixNetVConnection.h:244:18
    frame #5: 0x00000001004d1c46 traffic_server`InactivityCop::check_inactivity(this=0x0000000006600060, event=2, e=0x00000000010263a0) at UnixNet.cc:85:13
    frame #6: 0x00000001000108d1 traffic_server`Continuation::handleEvent(this=0x0000000006600060, event=2, data=0x00000000010263a0) at I_Continuation.h:190:12
    frame #7: 0x0000000100519c8c traffic_server`EThread::process_event(this=0x0000000005000000, e=0x00000000010263a0, calling_code=2) at UnixEThread.cc:126:22
    frame #8: 0x000000010051a5bd traffic_server`EThread::execute_regular(this=0x0000000005000000) at UnixEThread.cc:239:11
    frame #9: 0x000000010051abc8 traffic_server`EThread::execute(this=0x0000000005000000) at UnixEThread.cc:330:11
    frame #10: 0x0000000100518d62 traffic_server`spawn_thread_internal(a=0x0000000003f160c0) at Thread.cc:92:12
    frame #11: 0x00007fff6d482e65 libsystem_pthread.dylib`_pthread_start + 148
    frame #12: 0x00007fff6d47e83b libsystem_pthread.dylib`thread_start + 15
(lldb) p this
(Continuation *) $0 = 0x000000000ad7a3f0
(lldb) p *this
(Continuation) $1 = {
  handler = de ad be ef de ad be ef de ad be ef de ad be ef
  handler_name = 0xefbeaddeefbeadde ""
  mutex = {
    m_ptr = 0xefbeaddeefbeadde
  }
  link = {
    SLink<Continuation> = {
      next = 0xefbeaddeefbeadde
    }
    prev = 0xefbeaddeefbeadde
  }
  control_flags = (raw_flags = 4022250974)
  thread_affinity = 0xefbeaddeefbeadde
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The crash is gone when I revert c55001b.
@shinrich Could you check your case is bad handling of trailing header too?


if (frame.header().flags & HTTP2_FLAGS_DATA_END_STREAM) {
if (frame.header().flags & HTTP2_FLAGS_DATA_END_STREAM || stream->read_vio_done()) {
// TODO: set total written size to read_vio.nbytes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still a todo?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually, yes, still to do. As it stands with the origin needing to be chunked if nbytes is MAXINT on the H2 producer, the HttpTunnel logic updates the consumer write vio nbytes to the appropriate chunked size and as long as the EOS is set on the last data frame it is all ok.

@masaori335
Copy link
Contributor

masaori335 commented Feb 7, 2020

It looks like we have 3 issues. Separated PRs for each is preferred.
1). Http2ClientSession/ConnectionState doesn't signal VC_EVENT_READ_COMPLETE correctly.
2). HttpTunnel doesn't close a server session correctly when a producer got VC_EVENT_INACTIVITY_TIMEOUT in some cases.
3). Http2Stream::read_vio.ndone is not increased correctly.

@shinrich
Copy link
Member Author

shinrich commented Feb 7, 2020

I'll check for the trailing header in our environment. That would make sense for why content-length would be met but no EOS on the data frame. We can pull apart the PRs, but 1 won't really work unless 3 is also performed. I think a PR for 1/3 and another for 2 makes more sense.

@shinrich
Copy link
Member Author

shinrich commented Feb 7, 2020

So what happens with a trailing header an no content-length on the post body? I assume in that case the user agent must set a content-length. The trailing headers become an issue with GRPC, so it is something we will have to work through with the H2 origin work at least.

@shinrich
Copy link
Member Author

shinrich commented Feb 7, 2020

I split this PR into #6406, #6407, and #6040
I also spent much of today adding warning messages and analyzing logs trying to identify our traffic that was causing the bad behavior. We are not seeing any H2 trail frames from our clients, so trail frames is not exercising the problem in our case. We see quite a few clients closing the data stream with an empty data frame with the EOS bit set, but that frame is always sent very quickly after the last data frame. We do see some clients that just never send any EOS bit. Ultimately ATS closes down the stream while the stream is in the HTTP2_STREAM_STATE_HALF_CLOSED_LOCAL state.
Unfortunately looking through the logs, I cannot identify one culprit. Not even mobile vs desktop or ios vs android. In any case, ATS needs to protect itself in the presence of an ill behaved client. Plus sending trailers is valid and correctly tracking the completion of the read_vio independent of the EOS bits is necessary in that case.

@masaori335
Copy link
Contributor

Thanks for identifying the traffic which triggering the crash.
It looks like there're 2 similar cases that uncovered HttpTunnel's issue.

a). Trailing Header
b). Bad clients never send EOS

@shinrich
Copy link
Member Author

Closing this PR in deference to the split PR's

@shinrich shinrich closed this Feb 11, 2020
@zwoop zwoop removed this from the 10.0.0 milestone Feb 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants