TCP [RST] intermittently ignored #25314

@samoconnor

Most of the time, when a TCPSocket receives a [RST] packet, libuv calls uv_readcb() and a UVError (ECONNRESET) is thrown.
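
For anyone trying to reproduce the normal path, here is a minimal sketch of how a reset usually surfaces to a blocked reader. It is written in Python rather than Julia purely as a neutral illustration, and it does not touch libuv. The function name provoke_reset is hypothetical. It assumes a loopback connection and uses SO_LINGER with a zero timeout so that close() aborts the connection with RST instead of FIN.

```python
import errno
import socket
import struct


def provoke_reset():
    """Return the name of the error a reader sees when its peer aborts with RST."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()

    # l_onoff=1, l_linger=0: close() discards unsent data and sends RST.
    server_side.setsockopt(
        socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0)
    )
    server_side.close()

    client.settimeout(5.0)  # never block forever in this sketch
    try:
        data = client.recv(1024)
        return "EOF" if data == b"" else "DATA"
    except socket.timeout:
        return "TIMEOUT"
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        client.close()
        listener.close()
```

On a typical loopback setup I would expect provoke_reset() to report ECONNRESET, which is the behaviour described above. The failure in this issue is the case where that notification never reaches the reader.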

I have a test case where hundreds of pipelined HTTP PUT Requests are sent to AWS S3. Typically the Requests get ahead of the Responses (e.g. when Request No. 70 is being sent, we may only be up to reading Response No. 10).
At some point the S3 server hits an internal limit on the number of Requests per connection (about 100) and stops sending Response data (e.g. we might send Request No. 120 and then, while we're reading Response No. 30, data stops arriving). Sometimes the server sends [RST] right away and a UVError (ECONNRESET) is thrown as expected. Note: the S3 doc suggests not sending more than 90 requests per connection. I'm deliberately sending more than that to test corner-case behaviour in HTTP.jl.

However, monitoring with Wireshark shows that sometimes the [RST] is not sent for a few minutes. It seems that in this case libuv does not notice the [RST], and uv_readcb is not called. The result is that the eof() call that the reader is waiting on blocks forever. I have a separate task that periodically prints connection debug info. This shows that LibuvStream.state remains StatusActive.

I have tried putting lots of printfs in libuv. What I see is that the uv__stream_io function is not called at all in the case where the [RST] is missed. Maybe there is a race condition inside libuv where the [RST] is missed if kevent is not active when it arrives? Maybe for some reason libuv forgets to submit the socket to kevent, or does not register interest in the correct event type? (I'm not familiar with kqueue.)

I have tried modifying wait_readnb so that it wakes up and does uv_read_start again every so often while waiting. This makes no difference.

As a practical solution for HTTP.jl I've implemented a Retry Layer that uses a separate task to close stuck connections. Calling close results in the blocked eof() task waking up, discovering the connection is gone, and retrying the Request.
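
The shape of that workaround can be sketched as follows. This is not the HTTP.jl Retry Layer itself. It is a Python asyncio illustration of the same idea: a watchdog task closes a connection that has made no read progress, which wakes the blocked reader. All names (read_until_eof, watchdog, demo, the progress dict, the timeout values) are hypothetical, and the silent server is a stand-in for a peer that stops sending without closing.

```python
import asyncio
import time


async def read_until_eof(reader, progress):
    """Read until EOF, recording when the last chunk arrived."""
    chunks = []
    while True:
        chunk = await reader.read(4096)
        if not chunk:  # b"" means EOF, including after a local close
            return b"".join(chunks)
        progress["last"] = time.monotonic()
        chunks.append(chunk)


async def watchdog(writer, progress, stall_timeout, poll_interval):
    """Close the connection if no data arrived for stall_timeout seconds."""
    while not writer.is_closing():
        await asyncio.sleep(poll_interval)
        if time.monotonic() - progress["last"] > stall_timeout:
            writer.close()  # wakes the task blocked in reader.read()
            return True
    return False


async def demo():
    # Hypothetical peer: sends part of a response, then goes quiet.
    async def silent_server(reader, writer):
        writer.write(b"partial")
        await writer.drain()
        await asyncio.sleep(3600)

    server = await asyncio.start_server(silent_server, "127.0.0.1", 0)
    host, port = server.sockets[0].getsockname()[:2]

    reader, writer = await asyncio.open_connection(host, port)
    progress = {"last": time.monotonic()}
    dog = asyncio.ensure_future(watchdog(writer, progress, 0.3, 0.05))

    # Without the watchdog this read would wait on the silent peer.
    body = await asyncio.wait_for(read_until_eof(reader, progress), timeout=5)
    stalled = await dog
    server.close()
    return body, stalled
```

Running asyncio.run(demo()) returns the bytes received before the stall together with a flag saying the watchdog fired. A caller can use that flag to decide whether to retry the request on a fresh connection. The trade-off is that a stall timeout cannot distinguish a dead connection from a slow one, so the timeout has to be chosen with the expected response latency in mind.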

Version 0.7.0-DEV.3090 (2017-12-18 19:26 UTC)
Commit 5abe9b1382* (10 days old master)
x86_64-apple-darwin14.5.0
