"read: connection reset by peer" after ~6 minutes during large "zapi postpath" from S3 #2005

Closed · philrz opened this issue Jan 23, 2021 · 1 comment · Fixed by #2106
Labels: bug (Something isn't working)


philrz commented Jan 23, 2021

Repro is with a cluster at zq commit 4eebb1d, using zapi commit 4de0485d as the client.

$ export ZQD_DATA_URI="s3://zq-ci-test/21-01-22.rug2"
$ export SPACE="year1"
$ zapi new -k archivestore -d $ZQD_DATA_URI $SPACE
$ zapi -s $SPACE postpath $(for file in $(aws s3 ls --recursive s3://brim-sampledata/wrccdc-year1/zeek-logs/ | grep "\.log" | awk '{ print $4 }'); do   echo s3://brim-sampledata/$file; done)
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/smtp.log: line 8934: read tcp 192.168.60.195:48976->52.219.100.128:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/ssh.log: line 151931: read tcp 192.168.60.195:37078->52.219.106.138:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/dhcp.log: line 9609: read tcp 192.168.60.195:48612->52.219.100.128:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/weird.log: line 1390916: read tcp 192.168.60.195:37184->52.219.106.138:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/http.log: line 807132: read tcp 192.168.60.195:36778->52.219.106.138:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/notice.log: line 35328: read tcp 192.168.60.195:48774->52.219.100.128:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/files.log: line 1438995: read tcp 192.168.60.195:36736->52.219.106.138:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/x509.log: line 471145: read tcp 192.168.60.195:49142->52.219.100.128:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/ssl.log: line 919998: read tcp 192.168.60.195:37098->52.219.106.138:443: read: connection reset by peer
 90.9% 11.68GB/12.85GB
posted 11.68GB in 11m57.900284245s

I've repeated this several times and found that the connection reset by peer messages consistently show up just short of 6 minutes after I start the zapi postpath.

Since the failure appears to abort the download of some log files from S3, this results in an overall incomplete import. The actual count of events in the data set is 96217189, but after the failure above the count shows:

$ time zapi -s $SPACE get -workers 1 -t 'count()'
#0:record[count:uint64]
0:[91771072;]

I can repro this every time I attempt it. Many of the same log files fail each time, though they don't fail in the same spot every time. Here's the output from another repro:

$ zapi -s $SPACE postpath $(for file in $(aws s3 ls --recursive s3://brim-sampledata/wrccdc-year1/zeek-logs/ | grep "\.log" | awk '{ print $4 }'); do   echo s3://brim-sampledata/$file; done)
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/snmp.log: line 15873: read tcp 192.168.60.195:50688->52.219.80.176:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/smtp.log: line 9420: read tcp 192.168.60.195:39930->52.219.84.64:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/ssh.log: line 152276: read tcp 192.168.60.195:39970->52.219.84.64:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/dhcp.log: line 12081: read tcp 192.168.60.195:39568->52.219.84.64:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/tunnel.log: line 10361: read tcp 192.168.60.195:50812->52.219.80.176:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/weird.log: line 1390916: read tcp 192.168.60.195:40094->52.219.84.64:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/http.log: line 807819: read tcp 192.168.60.195:39672->52.219.84.64:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/notice.log: line 34163: read tcp 192.168.60.195:39730->52.219.84.64:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/files.log: line 1438263: read tcp 192.168.60.195:50370->52.219.80.176:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/x509.log: line 470916: read tcp 192.168.60.195:40114->52.219.84.64:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/ssl.log: line 920216: read tcp 192.168.60.195:39992->52.219.84.64:443: read: connection reset by peer
 90.9% 11.68GB/12.85GB
posted 11.68GB in 12m3.467001001s

$ time zapi -s $SPACE get -workers 1 -t 'count()'
#0:record[count:uint64]
0:[91767866;]

real	2m5.037s
user	0m0.031s
sys	0m0.032s

I did some web searches hoping to find magic numbers that might point to timeouts we're exceeding, but didn't find much. It might be a red herring, but I did find aws/aws-sdk-go#1763, which happens to involve the AWS SDK for Go, "high concurrency downloads", and failures after "5~6 minutes", but alas, it was unresolved. Might be unrelated. 🤷‍♂️

I also couldn't find any detailed logging during import. Watching top, I deduced that the zqd with the root personality was doing all the heavy lifting, but a kubectl logs -f on that pod showed just these two lines, which correlated with the start and completion of the zapi postpath:

{"level":"info","ts":1611362757.1915123,"logger":"http.access","msg":"Request completed","request_id":"1nRpSajzZhaMu6jFNC0pBxQ96uZ","host":"localhost:9867","method":"GET","proto":"HTTP/1.1","remote_addr":"127.0.0.1:53124","request_content_length":0,"url":"/space","elapsed":0.000102775,"response_content_length":131,"status_code":200}
{"level":"info","ts":1611363475.1030416,"logger":"http.access","msg":"Request completed","request_id":"1nRpSaChR2rsEJ8j8z0oc13RWCq","host":"localhost:9867","method":"POST","proto":"HTTP/1.1","remote_addr":"127.0.0.1:53124","request_content_length":1733,"url":"/space/sp_1nRpMe6lzAJosPqRsa4BCNytq1F/log/paths","elapsed":717.833245198,"response_content_length":61244,"status_code":202}
philrz added the bug (Something isn't working) label Jan 23, 2021
philrz added this to the Brim v0.25.0 milestone Jan 28, 2021
mattnibs added a commit that referenced this issue Feb 10, 2021
If a "connection error" reset is encountered while reading a s3 object attempt
to restart the connection and resume read at the current offset.

This solves a bug found when trying to ingest several s3 hosted log files:
several files will stop ingesting with the error "connection reset by peer".
There seems to be a curious behavior of the s3 service that happens when
a single session maintains numerous long-running download connections to
various objects in a bucket- the service appears to reset connections at
random.

See: aws/aws-sdk-go#1242

Closes #2005
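
For illustration, here is a minimal Go sketch (not the actual zqd code) of the approach the commit message describes: wrap an S3 object read so that a "connection reset by peer" error triggers a new GetObject request with a Range header starting at the current byte offset. The bucket, key, and single-retry behavior are assumptions made for the example.

// Sketch of a resume-at-offset S3 reader using the AWS SDK for Go (v1).
// Not the zqd implementation; bucket/key below are illustrative.
package main

import (
	"fmt"
	"io"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// resumableReader reads an S3 object and reopens the connection at the
// current offset if the read is reset by the peer.
type resumableReader struct {
	client *s3.S3
	bucket string
	key    string
	offset int64
	body   io.ReadCloser
}

// open issues a GetObject with a Range header so reading resumes at r.offset.
func (r *resumableReader) open() error {
	out, err := r.client.GetObject(&s3.GetObjectInput{
		Bucket: aws.String(r.bucket),
		Key:    aws.String(r.key),
		Range:  aws.String(fmt.Sprintf("bytes=%d-", r.offset)),
	})
	if err != nil {
		return err
	}
	r.body = out.Body
	return nil
}

func (r *resumableReader) Read(p []byte) (int, error) {
	if r.body == nil {
		if err := r.open(); err != nil {
			return 0, err
		}
	}
	n, err := r.body.Read(p)
	r.offset += int64(n)
	if err != nil && strings.Contains(err.Error(), "connection reset by peer") {
		// Drop the broken connection; the next Read reopens at r.offset.
		r.body.Close()
		r.body = nil
		return n, nil
	}
	return n, err
}

func main() {
	sess := session.Must(session.NewSession())
	r := &resumableReader{
		client: s3.New(sess),
		bucket: "brim-sampledata",                 // bucket from the repro
		key:    "wrccdc-year1/zeek-logs/conn.log", // hypothetical key
	}
	n, err := io.Copy(io.Discard, r)
	fmt.Println(n, err)
}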

philrz commented Feb 11, 2021

Verified with zq commit bdbef8b2 running in the cluster and my local zapi at the same commit.

Now I can complete the import 100% successfully, and it's never interrupted despite running for 13+ minutes.

$ time zapi -s $SPACE postpath $(for file in $(aws s3 ls --recursive s3://brim-sampledata/wrccdc-year1/zeek-logs/ | grep "\.log" | awk '{ print $4 }'); do   echo s3://brim-sampledata/$file; done)
100.0% 12.85GB/12.85GB
posted 12.85GB in 13m3.136190624s

real	13m4.359s
user	0m0.742s
sys	0m0.358s

Thanks @mattnibs!
