"read: connection reset by peer" after ~6 minutes during large "zapi postpath" from S3 #2005

Closed · philrz opened this issue Jan 23, 2021 · 1 comment · Fixed by #2106
Labels: bug (Something isn't working)


philrz commented Jan 23, 2021

Repro is with a cluster at zq commit 4eebb1d, using zapi commit 4de0485d as the client.

$ export ZQD_DATA_URI="s3://zq-ci-test/21-01-22.rug2"
$ export SPACE="year1"
$ zapi new -k archivestore -d $ZQD_DATA_URI $SPACE
$ zapi -s $SPACE postpath $(for file in $(aws s3 ls --recursive s3://brim-sampledata/wrccdc-year1/zeek-logs/ | grep "\.log" | awk '{ print $4 }'); do   echo s3://brim-sampledata/$file; done)
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/smtp.log: line 8934: read tcp 192.168.60.195:48976->52.219.100.128:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/ssh.log: line 151931: read tcp 192.168.60.195:37078->52.219.106.138:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/dhcp.log: line 9609: read tcp 192.168.60.195:48612->52.219.100.128:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/weird.log: line 1390916: read tcp 192.168.60.195:37184->52.219.106.138:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/http.log: line 807132: read tcp 192.168.60.195:36778->52.219.106.138:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/notice.log: line 35328: read tcp 192.168.60.195:48774->52.219.100.128:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/files.log: line 1438995: read tcp 192.168.60.195:36736->52.219.106.138:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/x509.log: line 471145: read tcp 192.168.60.195:49142->52.219.100.128:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/ssl.log: line 919998: read tcp 192.168.60.195:37098->52.219.106.138:443: read: connection reset by peer
 90.9% 11.68GB/12.85GB
posted 11.68GB in 11m57.900284245s

I've repeated this several times and found that the connection reset by peer messages consistently show up just short of 6 minutes after I start the zapi postpath.

Since the failure appears to abort the download of some log files from S3, this results in an overall incomplete import. The actual count of events in the data set is 96217189, but after the failure above the count shows:

$ time zapi -s $SPACE get -workers 1 -t 'count()'
#0:record[count:uint64]
0:[91771072;]

I can repro this every time I attempt it. Many of the same log files fail each time, though they don't fail in the same spot every time. Here's the output from another repro:

$ zapi -s $SPACE postpath $(for file in $(aws s3 ls --recursive s3://brim-sampledata/wrccdc-year1/zeek-logs/ | grep "\.log" | awk '{ print $4 }'); do   echo s3://brim-sampledata/$file; done)
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/snmp.log: line 15873: read tcp 192.168.60.195:50688->52.219.80.176:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/smtp.log: line 9420: read tcp 192.168.60.195:39930->52.219.84.64:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/ssh.log: line 152276: read tcp 192.168.60.195:39970->52.219.84.64:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/dhcp.log: line 12081: read tcp 192.168.60.195:39568->52.219.84.64:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/tunnel.log: line 10361: read tcp 192.168.60.195:50812->52.219.80.176:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/weird.log: line 1390916: read tcp 192.168.60.195:40094->52.219.84.64:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/http.log: line 807819: read tcp 192.168.60.195:39672->52.219.84.64:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/notice.log: line 34163: read tcp 192.168.60.195:39730->52.219.84.64:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/files.log: line 1438263: read tcp 192.168.60.195:50370->52.219.80.176:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/x509.log: line 470916: read tcp 192.168.60.195:40114->52.219.84.64:443: read: connection reset by peer
warning: s3://brim-sampledata/wrccdc-year1/zeek-logs/ssl.log: line 920216: read tcp 192.168.60.195:39992->52.219.84.64:443: read: connection reset by peer
 90.9% 11.68GB/12.85GB
posted 11.68GB in 12m3.467001001s

$ time zapi -s $SPACE get -workers 1 -t 'count()'
#0:record[count:uint64]
0:[91767866;]

real	2m5.037s
user	0m0.031s
sys	0m0.032s

I did some web searches hoping to find magic numbers that might point to timeouts we're exceeding, but didn't find much. It might be a red herring, but I did find aws/aws-sdk-go#1763, which happens to involve the AWS SDK for Go, "high concurrency downloads", and failures after "5~6 minutes", but alas, it was unresolved. Might be unrelated. 🤷‍♂️

I also couldn't find any detailed logging during import. Watching top, I deduced that the zqd with the root personality was doing all the heavy lifting, but a kubectl logs -f on that pod showed just these two lines, which correlated with the start and completion of the zapi postpath:

{"level":"info","ts":1611362757.1915123,"logger":"http.access","msg":"Request completed","request_id":"1nRpSajzZhaMu6jFNC0pBxQ96uZ","host":"localhost:9867","method":"GET","proto":"HTTP/1.1","remote_addr":"127.0.0.1:53124","request_content_length":0,"url":"/space","elapsed":0.000102775,"response_content_length":131,"status_code":200}
{"level":"info","ts":1611363475.1030416,"logger":"http.access","msg":"Request completed","request_id":"1nRpSaChR2rsEJ8j8z0oc13RWCq","host":"localhost:9867","method":"POST","proto":"HTTP/1.1","remote_addr":"127.0.0.1:53124","request_content_length":1733,"url":"/space/sp_1nRpMe6lzAJosPqRsa4BCNytq1F/log/paths","elapsed":717.833245198,"response_content_length":61244,"status_code":202}
philrz added the bug (Something isn't working) label Jan 23, 2021
philrz added this to the Brim v0.25.0 milestone Jan 28, 2021
mattnibs added a commit that referenced this issue Feb 10, 2021
If a "connection error" reset is encountered while reading a s3 object attempt
to restart the connection and resume read at the current offset.

This solves a bug found when trying to ingest several s3 hosted log files:
several files will stop ingesting with the error "connection reset by peer".
There seems to be a curious behavior of the s3 service that happens when
a single session maintains numerous long-running download connections to
various objects in a bucket- the service appears to reset connections at
random.

See: aws/aws-sdk-go#1242

Closes #2005
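
For illustration, here is a minimal Go sketch (not the actual zqd code) of the approach the commit message describes: wrap an S3 object read so that a "connection reset by peer" error triggers a new GetObject request with a Range header starting at the current byte offset. The bucket, key, and single-retry behavior are assumptions made for the example.

// Sketch of a resume-at-offset S3 reader using the AWS SDK for Go (v1).
// Not the zqd implementation; bucket/key below are illustrative.
package main

import (
	"fmt"
	"io"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// resumableReader reads an S3 object and reopens the connection at the
// current offset if the read is reset by the peer.
type resumableReader struct {
	client *s3.S3
	bucket string
	key    string
	offset int64
	body   io.ReadCloser
}

// open issues a GetObject with a Range header so reading resumes at r.offset.
func (r *resumableReader) open() error {
	out, err := r.client.GetObject(&s3.GetObjectInput{
		Bucket: aws.String(r.bucket),
		Key:    aws.String(r.key),
		Range:  aws.String(fmt.Sprintf("bytes=%d-", r.offset)),
	})
	if err != nil {
		return err
	}
	r.body = out.Body
	return nil
}

func (r *resumableReader) Read(p []byte) (int, error) {
	if r.body == nil {
		if err := r.open(); err != nil {
			return 0, err
		}
	}
	n, err := r.body.Read(p)
	r.offset += int64(n)
	if err != nil && strings.Contains(err.Error(), "connection reset by peer") {
		// Drop the broken connection; the next Read reopens at r.offset.
		r.body.Close()
		r.body = nil
		return n, nil
	}
	return n, err
}

func main() {
	sess := session.Must(session.NewSession())
	r := &resumableReader{
		client: s3.New(sess),
		bucket: "brim-sampledata",                 // bucket from the repro
		key:    "wrccdc-year1/zeek-logs/conn.log", // hypothetical key
	}
	n, err := io.Copy(io.Discard, r)
	fmt.Println(n, err)
}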

philrz commented Feb 11, 2021

Verified with zq commit bdbef8b2 running in the cluster and my local zapi at the same commit.

Now I can complete the import 100% successfully, and it's never interrupted despite running for 13+ minutes.

$ time zapi -s $SPACE postpath $(for file in $(aws s3 ls --recursive s3://brim-sampledata/wrccdc-year1/zeek-logs/ | grep "\.log" | awk '{ print $4 }'); do   echo s3://brim-sampledata/$file; done)
100.0% 12.85GB/12.85GB
posted 12.85GB in 13m3.136190624s

real	13m4.359s
user	0m0.742s
sys	0m0.358s

Thanks @mattnibs!
