-
Notifications
You must be signed in to change notification settings - Fork 844
Correctly track H2 post vio bytes #6401
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
b65adb6 to
c0fee68
Compare
|
Great! Thanks for finding the root cause. I'll take a look carefully. |
|
I don't think so. I fixed slow post because it failed due to 8080 already being in use (transient error). |
| myreader->writer()->dealloc_reader(myreader); | ||
|
|
||
| if (frame.header().flags & HTTP2_FLAGS_DATA_END_STREAM) { | ||
| if (frame.header().flags & HTTP2_FLAGS_DATA_END_STREAM || stream->read_vio_done()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks what we need 👍 ( Checking read_vio.ntodo() and signal VC_EVENT_READ_COMPLETE event )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this VC_EVENT_READ_COMPLETE mean exactly? I'm a bit worried about trailing header, we don't support trailing header now though.
What we really need to do might be scheduling READ_COMPLETE on END_STREAM carried by second HEADERS frame. This may explain a case we somehow miss READ_COMPLETE.
https://tools.ietf.org/html/rfc7540#section-8.1.3
Trailing header fields are sent as a header block after both the
request or response header block and all the DATA frames have been
sent. The HEADERS frame starting the trailers header block has the
END_STREAM flag set.
The following example includes both a 100 (Continue) status code,
which is sent in response to a request containing a "100-continue"
token in the Expect header field, and trailing header fields:
HTTP/1.1 100 Continue HEADERS
Extension-Field: bar ==> - END_STREAM
+ END_HEADERS
:status = 100
extension-field = bar
HTTP/1.1 200 OK HEADERS
Content-Type: image/jpeg ==> - END_STREAM
Transfer-Encoding: chunked + END_HEADERS
Trailer: Foo :status = 200
content-length = 123
123 content-type = image/jpeg
{binary data} trailer = Foo
0
Foo: bar DATA
- END_STREAM
{binary data}
HEADERS
+ END_STREAM
+ END_HEADERS
foo = bar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. ATS just set Http2Stream::recv_end_stream flag and ignore trailing header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could reproduce the crash (what we saw on the docs) with a trailing header & setting transaction_active_timeout_out.
Process 16485 stopped
* thread #2, name = '[ET_NET 0]', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
frame #0: 0x0000000100010829 traffic_server`Continuation::handleEvent(this=0x000000000ad7a3f0, event=106, data=0x000000000ad9be00) at I_Continuation.h:189:5
186 handleEvent(int event = CONTINUATION_EVENT_NONE, void *data = nullptr)
187 {
188 // If there is a lock, we must be holding it on entry
-> 189 ink_release_assert(!mutex || mutex->thread_holding == this_ethread());
190 return (this->*handler)(event, data);
191 }
192
(lldb) bt
* thread #2, name = '[ET_NET 0]', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
* frame #0: 0x0000000100010829 traffic_server`Continuation::handleEvent(this=0x000000000ad7a3f0, event=106, data=0x000000000ad9be00) at I_Continuation.h:189:5
frame #1: 0x00000001004e14af traffic_server`read_signal_and_update(event=106, vc=0x000000000ad9bc20) at UnixNetVConnection.cc:83:24
frame #2: 0x00000001004e236f traffic_server`UnixNetVConnection::mainEvent(this=0x000000000ad9bc20, event=106, e=0x00000000010263a0) at UnixNetVConnection.cc:1153:9
frame #3: 0x00000001000108d1 traffic_server`Continuation::handleEvent(this=0x000000000ad9bc20, event=106, data=0x00000000010263a0) at I_Continuation.h:190:12
frame #4: 0x000000010049684e traffic_server`UnixNetVConnection::callback(this=0x000000000ad9bc20, event=106, data=0x00000000010263a0) at P_UnixNetVConnection.h:244:18
frame #5: 0x00000001004d1c46 traffic_server`InactivityCop::check_inactivity(this=0x0000000006600060, event=2, e=0x00000000010263a0) at UnixNet.cc:85:13
frame #6: 0x00000001000108d1 traffic_server`Continuation::handleEvent(this=0x0000000006600060, event=2, data=0x00000000010263a0) at I_Continuation.h:190:12
frame #7: 0x0000000100519c8c traffic_server`EThread::process_event(this=0x0000000005000000, e=0x00000000010263a0, calling_code=2) at UnixEThread.cc:126:22
frame #8: 0x000000010051a5bd traffic_server`EThread::execute_regular(this=0x0000000005000000) at UnixEThread.cc:239:11
frame #9: 0x000000010051abc8 traffic_server`EThread::execute(this=0x0000000005000000) at UnixEThread.cc:330:11
frame #10: 0x0000000100518d62 traffic_server`spawn_thread_internal(a=0x0000000003f160c0) at Thread.cc:92:12
frame #11: 0x00007fff6d482e65 libsystem_pthread.dylib`_pthread_start + 148
frame #12: 0x00007fff6d47e83b libsystem_pthread.dylib`thread_start + 15
(lldb) p this
(Continuation *) $0 = 0x000000000ad7a3f0
(lldb) p *this
(Continuation) $1 = {
handler = de ad be ef de ad be ef de ad be ef de ad be ef
handler_name = 0xefbeaddeefbeadde ""
mutex = {
m_ptr = 0xefbeaddeefbeadde
}
link = {
SLink<Continuation> = {
next = 0xefbeaddeefbeadde
}
prev = 0xefbeaddeefbeadde
}
control_flags = (raw_flags = 4022250974)
thread_affinity = 0xefbeaddeefbeadde
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
|
||
| if (frame.header().flags & HTTP2_FLAGS_DATA_END_STREAM) { | ||
| if (frame.header().flags & HTTP2_FLAGS_DATA_END_STREAM || stream->read_vio_done()) { | ||
| // TODO: set total written size to read_vio.nbytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still a todo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually, yes, still to do. As it stands with the origin needing to be chunked if nbytes is MAXINT on the H2 producer, the HttpTunnel logic updates the consumer write vio nbytes to the appropriate chunked size and as long as the EOS is set on the last data frame it is all ok.
|
It looks like we have 3 issues. Separated PRs for each is preferred. |
|
I'll check for the trailing header in our environment. That would make sense for why content-length would be met but no EOS on the data frame. We can pull apart the PRs, but 1 won't really work unless 3 is also performed. I think a PR for 1/3 and another for 2 makes more sense. |
|
So what happens with a trailing header an no content-length on the post body? I assume in that case the user agent must set a content-length. The trailing headers become an issue with GRPC, so it is something we will have to work through with the H2 origin work at least. |
|
I split this PR into #6406, #6407, and #6040 |
|
Thanks for identifying the traffic which triggering the crash. a). Trailing Header |
|
Closing this PR in deference to the split PR's |
This should address issue #6313. Or at least it seems to be addressing the crash we were seeing after introducing the commit with the H2 post processing optimization. The key thing this commit addresses is updating the vio ndone on the read_vio in the Http2Stream. We would see WRITE_COMPLETE on the tunnel server side for the post body, but would not see the READ_COMPLETE for the user agent post side. In this post case, the client request is specifying a content-length. Eventually the inactivity timeout would trigger on the user agent and the read_vio on the server side would be left open.
This PR updates the ndone on the H2 vio and it does some other cleanup to ensure that sessions not in the tunnel are marked as such so they can be correctly cleaned up before the SM is deleted.
I ran a precursor version of this patch on a production box overnight without crash. This PR has most of the asserts I added removed. I'm running this patch now.