
[service/s3] high concurrency downloads causes some to fail after a long time #1763

Closed
yizha opened this issue Jan 29, 2018 · 16 comments
Labels
guidance: Question that needs advice or information.
service-api: This issue is due to a problem in a service API, not the SDK implementation.

Comments

yizha commented Jan 29, 2018

First, I'm not sure whether this is actually a bug, but at least on my box the SDK doesn't behave well when many concurrent downloads are running and no timeout is set. The slow downloads eventually fail with "connection reset by peer" after a long time (on my box the failures took 5~6 minutes, while the successful downloads finished in under a minute), so it looks like the SDK has trouble reading those responses fast enough in this case. I know the request can be timed out (cancelled) if it takes too long, either by configuring the SDK with a custom timeout-enabled HTTP client or by calling s3.GetObjectWithContext(...) with a timeout context. But 100 concurrent downloads really shouldn't cause such an issue (in fact I can reproduce it at lower concurrency, but 100 reproduces it reliably on my box). It also seems that the lower the network bandwidth, the easier the issue is to reproduce.
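
(For reference only, a minimal sketch of the timeout-enabled HTTP client mentioned above; the region, timeout value, bucket and key are placeholders, not values from this report.)

package main

import (
	"fmt"
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// The http.Client Timeout bounds the whole request, including reading the
	// response body, so a stalled download fails instead of hanging for minutes.
	sess := session.Must(session.NewSession(&aws.Config{
		Region:     aws.String("us-east-1"),
		HTTPClient: &http.Client{Timeout: 2 * time.Minute}, // placeholder value
	}))
	svc := s3.New(sess)

	out, err := svc.GetObject(&s3.GetObjectInput{
		Bucket: aws.String("my-bucket"), // placeholder
		Key:    aws.String("my-key"),    // placeholder
	})
	if err != nil {
		fmt.Println("get object:", err)
		return
	}
	defer out.Body.Close()
	fmt.Println("content length:", aws.Int64Value(out.ContentLength))
}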

Version of AWS SDK for Go?

[yizha@yizha-hp-z420 aws-sdk-go]$ echo $GOPATH
/home/yizha/work/gows:/home/yizha/work/agency/gows
[yizha@yizha-hp-z420 aws-sdk-go]$ pwd
/home/yizha/work/gows/src/github.com/aws/aws-sdk-go
[yizha@yizha-hp-z420 aws-sdk-go]$ git log HEAD
commit ee65a53 (HEAD -> master, origin/master, origin/HEAD)
Author: xibz impactbchang@gmail.com
Date: Fri Jan 19 13:54:42 2018 -0800

Only set header tag len if set (#1743)

Version of Go (go version)?

[yizha@yizha-hp-z420 aws-sdk-go]$ go version
go version go1.9.2 linux/amd64

What issue did you see?

[yizha@yizha-hp-z420 tmp]$ C_CNT=100 go run test.go
concurrency: 100

[1] start downloading ...
[8] start downloading ...
[5] start downloading ...
[99] start downloading ...
[9] start downloading ...
[7] start downloading ...
[0] start downloading ...
[10] start downloading ...
[3] start downloading ...
[4] start downloading ...
[11] start downloading ...
[56] start downloading ...
[14] start downloading ...
[57] start downloading ...
[34] start downloading ...
[16] start downloading ...
[35] start downloading ...
[17] start downloading ...
[36] start downloading ...
[18] start downloading ...
[37] start downloading ...
[58] start downloading ...
[38] start downloading ...
[19] start downloading ...
[59] start downloading ...
[20] start downloading ...
[39] start downloading ...
[60] start downloading ...
[40] start downloading ...
[21] start downloading ...
[61] start downloading ...
[41] start downloading ...
[22] start downloading ...
[62] start downloading ...
[42] start downloading ...
[23] start downloading ...
[63] start downloading ...
[43] start downloading ...
[24] start downloading ...
[64] start downloading ...
[44] start downloading ...
[25] start downloading ...
[65] start downloading ...
[26] start downloading ...
[66] start downloading ...
[45] start downloading ...
[27] start downloading ...
[67] start downloading ...
[68] start downloading ...
[28] start downloading ...
[46] start downloading ...
[30] start downloading ...
[71] start downloading ...
[29] start downloading ...
[47] start downloading ...
[69] start downloading ...
[70] start downloading ...
[49] start downloading ...
[50] start downloading ...
[51] start downloading ...
[31] start downloading ...
[72] start downloading ...
[32] start downloading ...
[12] start downloading ...
[48] start downloading ...
[73] start downloading ...
[53] start downloading ...
[87] start downloading ...
[52] start downloading ...
[54] start downloading ...
[90] start downloading ...
[74] start downloading ...
[88] start downloading ...
[75] start downloading ...
[55] start downloading ...
[77] start downloading ...
[78] start downloading ...
[76] start downloading ...
[79] start downloading ...
[33] start downloading ...
[15] start downloading ...
[80] start downloading ...
[81] start downloading ...
[82] start downloading ...
[85] start downloading ...
[84] start downloading ...
[96] start downloading ...
[97] start downloading ...
[86] start downloading ...
[98] start downloading ...
[94] start downloading ...
[2] start downloading ...
[91] start downloading ...
[92] start downloading ...
[83] start downloading ...
[6] start downloading ...
[13] start downloading ...
[95] start downloading ...
[93] start downloading ...
[89] start downloading ...
[74] finished download job (read 6198809 bytes) in 43.569521561s
[53] finished download job (read 6198809 bytes) in 43.95594173s
[4] finished download job (read 6198809 bytes) in 47.707810419s
[95] finished download job (read 6198809 bytes) in 48.647493668s
[54] finished download job (read 6198809 bytes) in 48.701726836s
[46] finished download job (read 6198809 bytes) in 49.470526305s
[56] finished download job (read 6198809 bytes) in 49.987636599s
[83] finished download job (read 6198809 bytes) in 50.050643396s
[13] finished download job (read 6198809 bytes) in 51.928045656s
[22] finished download job (read 6198809 bytes) in 53.173156348s
[20] finished download job (read 6198809 bytes) in 53.216101938s
[55] finished download job (read 6198809 bytes) in 53.600310385s
[61] finished download job (read 6198809 bytes) in 53.609563651s
[79] finished download job (read 6198809 bytes) in 53.717528561s
[94] finished download job (read 6198809 bytes) in 53.954241976s
[84] finished download job (read 6198809 bytes) in 54.158461838s
[51] finished download job (read 6198809 bytes) in 54.389245387s
[43] finished download job (read 6198809 bytes) in 54.611049834s
[21] finished download job (read 6198809 bytes) in 54.641270409s
[8] finished download job (read 6198809 bytes) in 55.025735593s
[39] finished download job (read 6198809 bytes) in 55.026327339s
[86] finished download job (read 6198809 bytes) in 55.161876688s
[90] finished download job (read 6198809 bytes) in 55.244310288s
[85] finished download job (read 6198809 bytes) in 55.396435164s
[18] finished download job (read 6198809 bytes) in 55.446654478s
[31] finished download job (read 6198809 bytes) in 55.665957191s
[82] finished download job (read 6198809 bytes) in 55.790000091s
[48] finished download job (read 6198809 bytes) in 55.834006768s
[57] finished download job (read 6198809 bytes) in 55.849872747s
[70] finished download job (read 6198809 bytes) in 56.149388773s
[26] finished download job (read 6198809 bytes) in 56.221490182s
[16] finished download job (read 6198809 bytes) in 56.272423036s
[67] finished download job (read 6198809 bytes) in 56.44158009s
[96] finished download job (read 6198809 bytes) in 56.456063295s
[60] finished download job (read 6198809 bytes) in 56.521619324s
[50] finished download job (read 6198809 bytes) in 56.771706047s
[11] finished download job (read 6198809 bytes) in 56.821399992s
[52] finished download job (read 6198809 bytes) in 56.865662218s
[66] finished download job (read 6198809 bytes) in 56.97720656s
[6] finished download job (read 6198809 bytes) in 56.993605511s
[78] finished download job (read 6198809 bytes) in 57.044606204s
[30] finished download job (read 6198809 bytes) in 57.167580858s
[81] finished download job (read 6198809 bytes) in 57.219458452s
[27] finished download job (read 6198809 bytes) in 57.234184741s
[80] finished download job (read 6198809 bytes) in 57.294372358s
[37] finished download job (read 6198809 bytes) in 57.434658629s
[23] finished download job (read 6198809 bytes) in 57.654305373s
[91] finished download job (read 6198809 bytes) in 57.924653005s
[3] finished download job (read 6198809 bytes) in 57.94704263s
[36] finished download job (read 6198809 bytes) in 57.978373205s
[45] finished download job (read 6198809 bytes) in 58.207883656s
[38] finished download job (read 6198809 bytes) in 58.244114709s
[69] finished download job (read 6198809 bytes) in 58.2754416s
[41] finished download job (read 6198809 bytes) in 58.293532975s
[28] finished download job (read 6198809 bytes) in 58.305003088s
[63] finished download job (read 6198809 bytes) in 58.315931935s
[32] finished download job (read 6198809 bytes) in 58.320526113s
[44] finished download job (read 6198809 bytes) in 58.463426612s
[7] finished download job (read 6198809 bytes) in 58.478705522s
[92] finished download job (read 6198809 bytes) in 58.620686694s
[10] finished download job (read 6198809 bytes) in 58.676208804s
[19] finished download job (read 6198809 bytes) in 58.67685684s
[49] finished download job (read 6198809 bytes) in 58.70566748s
[35] finished download job (read 6198809 bytes) in 58.847224198s
[99] finished download job (read 6198809 bytes) in 58.894416924s
[76] finished download job (read 6198809 bytes) in 58.968481343s
[98] finished download job (read 6198809 bytes) in 58.988596888s
[40] finished download job (read 6198809 bytes) in 59.035400909s
[58] finished download job (read 6198809 bytes) in 59.043526431s
[88] finished download job (read 6198809 bytes) in 59.122832185s
[87] finished download job (read 6198809 bytes) in 59.153337061s
[24] finished download job (read 6198809 bytes) in 59.182860012s
[5] finished download job (read 6198809 bytes) in 59.300469562s
[65] finished download job (read 6198809 bytes) in 59.318931113s
[68] finished download job (read 6198809 bytes) in 59.327315684s
[73] finished download job (read 6198809 bytes) in 59.417350732s
[75] finished download job (read 6198809 bytes) in 59.429167583s
[89] finished download job (read 6198809 bytes) in 59.441224719s
[33] finished download job (read 6198809 bytes) in 59.467230748s
[93] finished download job (read 6198809 bytes) in 59.47243084s
[62] finished download job (read 6198809 bytes) in 59.571973951s
[0] finished download job (read 6198809 bytes) in 59.624225804s
[2] finished download job (read 6198809 bytes) in 59.723292307s
[64] finished download job (read 6198809 bytes) in 59.759922201s
[47] finished download job (read 6198809 bytes) in 59.759636996s
[14] finished download job (read 6198809 bytes) in 59.994028361s
[59] finished download job (read 6198809 bytes) in 1m0.002918409s
[34] finished download job (read 6198809 bytes) in 1m0.070034962s
[97] finished download job (read 6198809 bytes) in 1m0.327620044s
[77] finished download job (read 6198809 bytes) in 1m0.366530372s
[15] finished download job (read 6198809 bytes) in 1m0.402741674s
[72] finished download job (read 6198809 bytes) in 1m0.412533226s
[29] finished download job (read 6198809 bytes) in 1m0.440142146s
[42] finished download job (read 6198809 bytes) in 1m0.510102497s
[1] finished download job (read 6198809 bytes) in 1m0.75496559s
[71] failed to read body, error: read tcp 10.90.7.237:33446->52.216.64.216:443: read: connection reset by peer
[71] finished download job (read 730633 bytes) in 5m45.648985404s
[9] failed to read body, error: read tcp 10.90.7.237:33386->52.216.64.216:443: read: connection reset by peer
[9] finished download job (read 1409545 bytes) in 5m55.520246483s
[12] failed to read body, error: read tcp 10.90.7.237:33566->52.216.64.216:443: read: connection reset by peer
[12] finished download job (read 1775113 bytes) in 6m1.201912432s
[17] failed to read body, error: read tcp 10.90.7.237:33420->52.216.64.216:443: read: connection reset by peer
[17] finished download job (read 748041 bytes) in 6m1.384548141s
[25] failed to read body, error: read tcp 10.90.7.237:33516->52.216.64.216:443: read: connection reset by peer
[25] finished download job (read 2140681 bytes) in 6m1.941659707s

Steps to reproduce

Run the attached example with high concurrency (you will need to modify the access key id and secret key, and change the bucket/key). On my box I can reproduce it with the concurrency set to 100 and an S3 object of about 6MB. You may need a different concurrency number to reproduce it; it probably depends on the network bandwidth and the size of the S3 object being downloaded.

If you have a runnable example, please include it.

test.go.txt

xibz (Contributor) commented Jan 29, 2018

Hello @yizha, thank you for reaching out to us. Can you try using the s3manager.Downloader? That will retry connection resets. Please let us know if you have any issues with that.
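
(For context, a minimal sketch of what using s3manager.Downloader might look like; the region, bucket, key, and local file name are placeholders.)

package main

import (
	"log"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))

	// The Downloader fetches the object as ranged GETs and retries failed parts.
	downloader := s3manager.NewDownloader(sess)

	f, err := os.Create("object.bin") // placeholder local file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	n, err := downloader.Download(f, &s3.GetObjectInput{
		Bucket: aws.String("my-bucket"), // placeholder
		Key:    aws.String("my-key"),    // placeholder
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("downloaded %d bytes", n)
}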

jasdel added the guidance label on Jan 29, 2018
yizha (Author) commented Jan 29, 2018

@xibz it is not a question about whether the request is retried or not; it is about the abnormally slow download speed when there are many concurrent downloaders. We have already switched to s3.GetObjectWithContext(...) with a timeout context to break off a slow download so it doesn't occupy a goroutine for too long, and I would assume switching to s3manager.Downloader won't help with the slow downloading in this case.
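
(For reference, a rough sketch of the timeout-context approach described above; the region, timeout, bucket and key are placeholders.)

package main

import (
	"context"
	"fmt"
	"io"
	"io/ioutil"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	svc := s3.New(sess)

	// The context deadline covers the whole call, including reading the body,
	// so a stalled download is cancelled instead of blocking its goroutine.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute) // placeholder timeout
	defer cancel()

	resp, err := svc.GetObjectWithContext(ctx, &s3.GetObjectInput{
		Bucket: aws.String("my-bucket"), // placeholder
		Key:    aws.String("my-key"),    // placeholder
	})
	if err != nil {
		fmt.Println("get object:", err)
		return
	}
	defer resp.Body.Close()

	n, err := io.Copy(ioutil.Discard, resp.Body) // body reads share the same deadline
	fmt.Printf("read %d bytes, err: %v\n", n, err)
}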

yizha (Author) commented Jan 29, 2018

By the way, I can also reproduce the issue with a concurrency of 70 or even 60; unlike 100, which reproduces it every time on my box, I have to run the test several times at 70 or 60. But I have never been able to reproduce it with a concurrency of 50 or lower, so it looks like a contention issue/bug in the SDK.

xibz (Contributor) commented Jan 29, 2018

@yizha - How is the memory on that machine while you are running this? My first thought is that since these requests all fire off together, you are loading everything into memory almost at once.

yizha (Author) commented Jan 30, 2018

@xibz - The S3 object I test with is about 6MB, so 100 concurrent downloads need 600MB+ to hold them all. My test box has 8GB of memory, so I don't think memory has anything to do with my test. Anyway, I ran the test again, monitored the go processes with top, and checked system memory with free; the results seem reasonable.

Test script is run with

$ C_CNT=100 go run test.go

Output from 'top -p' while the test is running

  PID USER      PR  NI    VIRT    RES  %CPU %MEM     TIME+ S COMMAND                                              
10455 yizha     20   0  670.4m  12.3m   0.0  0.2   0:00.15 S go                                                   
10479 yizha     20   0 1712.0m 1.091g   0.0 14.1   0:15.84 S  `- test

Output from 'free -m'

[yizha@yizha-hp-z420 ~]$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7916        2385        1108          12        4422        5223
Swap:          4095           0        4095

But to make sure the issue is not related to memory usage, I replaced ioutil.ReadAll(...) with a loop that calls Read(...) on an 8192-byte buffer, so each download now needs only 8KB of memory. I ran it again; this time it used only ~28MB of memory, but the result was the same: some downloads were extremely slow and ended with the "connection reset by peer" error.

Source code that reads the whole response body in a loop with an 8KB buffer:

package main

import (
	"fmt"
	"io"
	"os"
	"strconv"
	"sync"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {

	config := &aws.Config{
		MaxRetries: aws.Int(0),
		Region:     aws.String("us-east-1"),
	}

	sess := session.Must(session.NewSession(config))
	svc := s3.New(sess)

	concurrency, err := strconv.Atoi(os.Getenv("C_CNT"))
	if err != nil {
		panic(err.Error())
	}
	fmt.Printf("concurrency: %d\n", concurrency)

	fmt.Println()

	bucket := "tr-content-archive-us-qa"
	key := "0DE3/tag:reuters.com,2018:binary_RC1B9F653ED0-BASEIMAGE:191352371"
	wg := &sync.WaitGroup{}
	wg.Add(concurrency)
	for i := 0; i < concurrency; i++ {
		go download(i, svc, bucket, key, wg)
	}
	wg.Wait()
}

// readBody drains r with a fixed 8KB buffer and returns the total number of bytes read.
func readBody(r io.Reader) (int, error) {
	bytesRead := 0
	buf := make([]byte, 8192)
	for {
		n, err := r.Read(buf)
		if n > 0 {
			bytesRead += n
		}
		if err != nil {
			if err == io.EOF {
				return bytesRead, nil
			} else {
				return bytesRead, err
			}
		}
	}
}

// download issues a single GetObject call, reads the full response body, and
// reports how many bytes were read and how long the whole job took.
func download(id int, client *s3.S3, bucket, key string, wg *sync.WaitGroup) {
	fmt.Printf("[%d] start downloading ...\n", id)

	defer wg.Done()

	t1 := time.Now()

	resp, err := client.GetObject(&s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	bytesRead := 0
	if err != nil {
		fmt.Printf("[%d] failed to get object, error: %v\n", id, err)
	} else {
		defer resp.Body.Close()
		bytesRead, err = readBody(resp.Body)
		if err != nil {
			fmt.Printf("[%d] failed to read body, error: %v\n", id, err)
		}
	}
	fmt.Printf("[%d] finished download job (read %d bytes) in %v\n", id, bytesRead, time.Now().Sub(t1))
}

Output from 'top -p'

  PID USER      PR  NI    VIRT    RES  %CPU %MEM     TIME+ S COMMAND                                              
10626 yizha     20   0  598.2m  12.1m   0.0  0.2   0:00.18 S go                                                   
10651 yizha     20   0  550.0m  28.2m   0.0  0.4   0:14.46 S  `- test

Output from 'free -m'

[yizha@yizha-hp-z420 tmp]$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7916        1293        2522          12        4100        6315
Swap:          4095           0        4095

xibz (Contributor) commented Jan 30, 2018

@yizha - Thank you for the detailed information. I suspect you may be getting throttled by the service. Can you enable logging by setting LogLevel to aws.LogDebug and paste the logs here? In particular, the RequestId of a failed request would be very helpful.

xibz added the service-api label on Jan 30, 2018
yizha (Author) commented Jan 30, 2018

@xibz - I enabled debug logging with

config.WithLogLevel(aws.LogDebug)

then reran test.go and attached the full output as a file test.log.

This time the result is really bad: the longest download took 18 minutes and still failed. I don't think the requests are being throttled, because there are no retries (I set MaxRetries to 0), the responses were 200s, and even the failed downloads read some bytes from the response body, they just never finished. You can tell from the log, which prints how many bytes were read from the response body whether the download succeeded or not.

[71] failed to read body, error: read tcp 10.90.7.237:36854->52.216.98.35:443: read: connection reset by peer
[71] finished download job (read 1688073 bytes) in 18m0.034458929s

xibz (Contributor) commented Jan 30, 2018

@yizha - I've reached out to the service team, as I believe this is an issue on their end. The error you are seeing suggests that they may be closing the connection. I will let you know once I have more information.

xinst (Contributor) commented Apr 2, 2018

@yizha I think this is a bandwidth limit being reached. You can test by limiting your network speed with "Network Link Conditioner" on macOS (or a similar tool) and then trying again.

huangxc2013 commented:
Do we have a solution for this now? Been experiencing the same problem.

diehlaws (Contributor) commented Oct 9, 2018

I have tried reproducing this using an S3 bucket in us-east-1 from different platforms (local workstation, t2.micro in us-west-2, t2.micro in ap-southeast-2, third-party VPS in NY) with different versions of the AWS SDK for Go (v1.12.67 and v1.15.51), and have not been able to elicit a connection reset from S3 under any of these circumstances. This does not appear to be an issue with the AWS SDK for Go; you can confirm this by running your code on an EC2 instance, or by attempting the same actions with the AWS SDK for another language, and checking for the same connection resets from S3.

I suggest further troubleshooting this behavior from the service end with AWS Support by opening a new support case under the Simple Storage Service (S3) service if you haven't already, as they will be able to better assist with behaviors relating to the service rather than the SDK.

As an aside, in my experience issuing GET or PUT requests to S3 at a concurrency above 20 generally results in diminishing returns in terms of bandwidth, especially if you are performing single-part uploads/downloads as shown in your code rather than multi-part uploads/downloads (for example using s3manager.Downloader as @xibz suggested).

diehlaws added the closing-soon label on Oct 9, 2018
diehlaws removed the closing-soon label on Oct 29, 2018
suanziliu commented:
Hi, I also hit this issue in my program (each goroutine downloads a .gz file and then reads its content).

When we download 70 × 20MB .gz files from S3 (ap-northeast-1) using s3manager.Downloader's Download(file, &s3.GetObjectInput{}), it gets stuck for 5 minutes after dealing with 66 goroutines.

I captured the following stack trace with runtime/pprof while it was stuck; details are attached.
goroutine_profile.txt

I know that AWS EC2 throttles requests that exceed certain limits; does S3 have such limits too?

sdlarsen commented Jul 9, 2019

Why was this closed with no resolution? I'm seeing similarly slow download speeds using s3manager (with both v1 and v2 of the Go SDK).

diehlaws (Contributor) commented Jul 9, 2019

@sdlarsen this was closed due to inactivity because we were unable to reproduce the described behavior under various conditions.

This does not appear to be an issue with the AWS SDK for Go, you can confirm this by running your code on an EC2 instance, or attempting the same actions using the AWS SDK for another language, and checking for the same connection resets from S3.

If you're experiencing similar problems, please open a new issue with the information requested in our General Issue template so we can better assist you. In addition to this information, in your case testing download speeds from an EC2 instance would be helpful to rule out your local network connection as the cause of the behavior you're seeing.

@suanziliu S3 does throttle requests once you've reached a certain request rate (and the SDK will automatically retry requests to which S3 responds with an HTTP 503 Slow Down error), however the base request rate is quite high (at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix) and can be increased by following our best practices for optimizing S3 performance.

haooooooqi commented:
Seeing the same issue

PaulisMatrix commented Oct 29, 2024

Experiencing the same issue of delayed responses at high concurrency, @diehlaws, @xibz.

My use case:
I have around 1k records. For each record, fetch the relevant S3 file and do further processing.

In the initial setup I spawned a separate goroutine to fetch the object with the GetObject method for each record, so roughly 1k goroutines were calling S3 at the same time.
A few of the requests started failing with context deadline exceeded against an initial context timeout of 5 seconds; it turned out these requests were taking around 10-11 seconds.
This is well within the 5.5k GET reqs/sec limit set by S3, so I'm wondering what caused it.

Sample error log for your reference.

{"errorMsg":"unable to download file \"historical/2024/10/16/x.eml\" from bucket \"x.org.123\", operation error S3: GetObject, https response error StatusCode: 0, RequestID: , HostID: , canceled, context deadline exceeded","filePath":"historical/2024/10/16/x.eml","bucketName":"x.org.123","createdAt":"2024-10-28T13:16:03.911441494Z"}

Considering this, I changed the code to limit concurrency so that the 1k records are processed by only 50 goroutines (roughly the pattern sketched below); it then took around 1-2 seconds, as it should.

I took a look at the Go SDK's GetObject source and it's just making HTTP requests to S3 underneath, so I don't think this is an SDK issue, but it seems worth pointing out since many people in this thread are experiencing it.

Might be an S3 issue, I don't know. Please share any workarounds or possible explanations for this behaviour. I'm also checking whether the S3 download manager helps.
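
(A minimal sketch of the concurrency-limiting pattern mentioned above; the record type, the fetch callback, and the limit of 50 are placeholders, not part of the original report.)

package main

import (
	"log"
	"sync"
)

// fetchWithLimit runs fetch for every record while allowing at most `limit`
// fetches to be in flight at any one time.
func fetchWithLimit(records []string, limit int, fetch func(string) error) {
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup

	for _, rec := range records {
		wg.Add(1)
		go func(r string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			if err := fetch(r); err != nil {
				log.Printf("fetch %s: %v", r, err)
			}
		}(rec)
	}
	wg.Wait()
}

func main() {
	records := []string{"rec-1", "rec-2", "rec-3"} // placeholder records
	fetchWithLimit(records, 50, func(r string) error {
		// placeholder for the real GetObject call and downstream processing
		log.Println("processing", r)
		return nil
	})
}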
