"read: connection reset by peer" after ~6 minutes during large "zapi postpath" from S3 #2005
Labels
bug
Something isn't working
Comments
mattnibs added a commit that referenced this issue on Feb 10, 2021
If a "connection error" reset is encountered while reading an S3 object, attempt to restart the connection and resume reading at the current offset. This solves a bug found when trying to ingest several S3-hosted log files: several files stop ingesting with the error "connection reset by peer". There seems to be a curious behavior of the S3 service that happens when a single session maintains numerous long-running download connections to various objects in a bucket: the service appears to reset connections at random. See: aws/aws-sdk-go#1242. Closes #2005
mattnibs added a commit that referenced this issue on Feb 11, 2021
If a "connection error" reset is encountered while reading an S3 object, attempt to restart the connection and resume reading at the current offset. This solves a bug found when trying to ingest several S3-hosted log files: several files stop ingesting with the error "connection reset by peer". There seems to be a curious behavior of the S3 service that happens when a single session maintains numerous long-running download connections to various objects in a bucket: the service appears to reset connections at random. See: aws/aws-sdk-go#1242. Closes #2005
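The approach described in the commit message (reconnect on a reset and resume at the current offset) can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the names openFunc, resumingReader, flakyBody, and demo are hypothetical, and the reset check assumes the error wraps ECONNRESET, which may not match how the AWS SDK surfaces it.

```go
package main

import (
	"errors"
	"io"
	"strings"
	"syscall"
)

// openFunc opens the object for reading starting at byte offset off.
// Hypothetical signature; for S3 this could be a ranged GetObject request.
type openFunc func(off int64) (io.ReadCloser, error)

// resumingReader is an illustrative wrapper, not the code from the commit.
// It counts the bytes delivered so far and reopens the source at that
// offset when a read fails with a connection reset.
type resumingReader struct {
	open    openFunc
	body    io.ReadCloser
	offset  int64
	retries int // reconnect attempts still allowed
}

func newResumingReader(open openFunc, maxRetries int) *resumingReader {
	return &resumingReader{open: open, retries: maxRetries}
}

// isReset reports whether err wraps ECONNRESET (an assumption about how
// the reset reaches the caller).
func isReset(err error) bool {
	return errors.Is(err, syscall.ECONNRESET)
}

func (r *resumingReader) Read(p []byte) (int, error) {
	for {
		if r.body == nil {
			body, err := r.open(r.offset)
			if err != nil {
				return 0, err
			}
			r.body = body
		}
		n, err := r.body.Read(p)
		r.offset += int64(n)
		if err == nil || !isReset(err) || r.retries == 0 {
			return n, err
		}
		// Connection was reset: drop it so the next pass reopens at r.offset.
		r.retries--
		r.body.Close()
		r.body = nil
		if n > 0 {
			return n, nil // hand back the bytes read before the reset
		}
	}
}

func (r *resumingReader) Close() error {
	if r.body == nil {
		return nil
	}
	return r.body.Close()
}

// flakyBody simulates a connection that is reset after limit bytes.
type flakyBody struct {
	src   io.Reader
	limit int
}

func (f *flakyBody) Read(p []byte) (int, error) {
	if f.limit <= 0 {
		return 0, syscall.ECONNRESET
	}
	if len(p) > f.limit {
		p = p[:f.limit]
	}
	n, err := f.src.Read(p)
	f.limit -= n
	return n, err
}

func (f *flakyBody) Close() error { return nil }

// demo reads a 20-byte string through connections that each reset after
// 7 bytes, and returns the data read plus the number of opens needed.
func demo() (string, int, error) {
	const data = "0123456789abcdefghij"
	opens := 0
	open := func(off int64) (io.ReadCloser, error) {
		opens++
		return &flakyBody{src: strings.NewReader(data[off:]), limit: 7}, nil
	}
	r := newResumingReader(open, 5)
	defer r.Close()
	got, err := io.ReadAll(r)
	return string(got), opens, err
}
```

With a real S3 client, the open function would presumably issue a GetObject request whose Range is set to "bytes=<off>-" so the download continues where the previous connection stopped. Capping the number of retries keeps a persistently failing object from looping forever.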
Verified with Now I can complete 100% of the import successfully and it's never interrupted despite running for 13+ minutes.
Thanks @mattnibs!
Repro is with cluster at zq commit 4eebb1d and using zapi commit 4de0485d as the client.
I've repeated this several times and have found the "connection reset by peer" messages consistently show up just short of 6 minutes from when I started the zapi postpath.
Since the failure appears to abort the download of some log files from S3, this results in an overall incomplete import. The actual count of events in the data set is 96217189, but after the failure above the count shows:
I can repro this every time I attempt it. Many of the same log files fail each time, though they don't fail in the same spot every time. Here's the output from another repro:
I did some web searches hoping to find magic numbers that might point to timeouts we're exceeding, but didn't find much. It might be a red herring, but I did find aws/aws-sdk-go#1763, which happens to involve the AWS SDK for Go, "high concurrency downloads", and failures after "5~6 minutes", but alas, it was unresolved. Might be unrelated. 🤷‍♂️
I also couldn't seem to see any detailed logging during import. I deduced by watching top that the zqd with personality root seemed to be doing all the heavy lifting, but doing a kubectl logs -f on that pod showed just these two lines that correlated with the start/completion of the zapi postpath: