[service/s3] high concurrency downloads causes some to fail after a long time #1763
Hello @yizha, thank you for reaching out to us. Can you try using the s3manager.Downloader instead?
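For context, a minimal sketch of what using s3manager.Downloader could look like; the bucket, key, part size, and concurrency values here are placeholders, not a recommendation:

```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))

	// The download manager fetches the object as ranged GETs and retries
	// individual parts instead of relying on one long-lived connection.
	downloader := s3manager.NewDownloader(sess, func(d *s3manager.Downloader) {
		d.PartSize = 5 * 1024 * 1024 // placeholder: 5MB parts
		d.Concurrency = 5            // placeholder: parallel parts per object
	})

	buf := aws.NewWriteAtBuffer([]byte{})
	_, err := downloader.Download(buf, &s3.GetObjectInput{
		Bucket: aws.String("example-bucket"), // placeholder
		Key:    aws.String("example-key"),    // placeholder
	})
	if err != nil {
		panic(err)
	}
}
```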
@xibz it is not a question about whether the request will be retried or not; it is about the abnormally slow download speed when there are a lot of concurrent downloaders. We have already switched to s3.GetObjectWithContext(...) with a timeout context to break off the slow downloads so they don't occupy a goroutine for too long, and I would assume switching to s3manager.Downloader won't help with the slow downloading in this case.
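For reference, a minimal sketch of the timeout-context approach described above; timedDownload and the two-minute timeout are illustrative, not the exact code we run:

```go
import (
	"context"
	"io"
	"io/ioutil"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/s3"
)

// timedDownload fetches one object with a deadline so a stalled connection
// is abandoned instead of holding its goroutine indefinitely.
func timedDownload(client *s3.S3, bucket, key string) (int64, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	resp, err := client.GetObjectWithContext(ctx, &s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	// The request context also bounds the body read, so the copy returns
	// with an error once the deadline passes.
	return io.Copy(ioutil.Discard, resp.Body)
}
```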
By the way, I can reproduce the issue with a concurrency of 70 or even 60, not just 100 (which is guaranteed to reproduce it on my box); I have to run it several times with 70 or 60 to hit it. But I have never been able to reproduce it with a concurrency of 50 or lower, and it looks like a contention issue/bug in the SDK.
@yizha - How is the memory on that machine when you are running this? My first thought is that since these all fire off at once, you are loading everything into memory almost simultaneously.
@xibz - The S3 object I test with is about 6MB, so for 100 concurrent downloads the memory for storing them is 600MB+. My test box has 8GB of memory, so I don't think memory has anything to do with my test. Anyway, I ran the test again, monitored the Go processes with top, and checked system memory with free, and the result seems reasonable.

The test script is run with:

```
$ C_CNT=100 go run test.go
```

Output from 'top -p' while the test is running:

```
  PID USER      PR  NI    VIRT    RES %CPU %MEM   TIME+ S COMMAND
10455 yizha     20   0  670.4m  12.3m  0.0  0.2 0:00.15 S go
10479 yizha     20   0 1712.0m 1.091g  0.0 14.1 0:15.84 S  `- test
```

Output from 'free -m':

```
[yizha@yizha-hp-z420 ~]$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7916        2385        1108          12        4422        5223
Swap:          4095           0        4095
```

But to make sure the issue is not related to memory usage, I replaced 'ioutil.ReadAll(...)' with a for-loop that calls Read(...) on an 8192-byte []byte buffer, so now each download only needs 8KB of memory. I ran it again; this time it only used ~28MB of memory, but the result was the same: some downloads were extremely slow and ended with the "connection reset by peer" error.

Source code which uses a for-loop to read the whole response body with an 8KB []byte buffer:

```go
package main
import (
	"fmt"
	"io"
	"os"
	"strconv"
	"sync"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	config := &aws.Config{
		MaxRetries: aws.Int(0),
		Region:     aws.String("us-east-1"),
	}
	sess := session.Must(session.NewSession(config))
	s3 := s3.New(sess)
	concurrency, err := strconv.Atoi(os.Getenv("C_CNT"))
	if err != nil {
		panic(err.Error())
	}
	fmt.Printf("concurrency: %d\n", concurrency)
	fmt.Println()
	bucket := "tr-content-archive-us-qa"
	key := "0DE3/tag:reuters.com,2018:binary_RC1B9F653ED0-BASEIMAGE:191352371"
	wg := &sync.WaitGroup{}
	wg.Add(concurrency)
	for i := 0; i < concurrency; i++ {
		go download(i, s3, bucket, key, wg)
	}
	wg.Wait()
}

func readBody(r io.Reader) (int, error) {
	bytesRead := 0
	buf := make([]byte, 8192)
	for {
		n, err := r.Read(buf)
		if n > 0 {
			bytesRead += n
		}
		if err != nil {
			if err == io.EOF {
				return bytesRead, nil
			} else {
				return bytesRead, err
			}
		}
	}
}
func download(id int, client *s3.S3, bucket, key string, wg *sync.WaitGroup) {
	fmt.Printf("[%d] start downloading ...\n", id)
	defer wg.Done()
	t1 := time.Now()
	resp, err := client.GetObject(&s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	bytesRead := 0
	if err != nil {
		fmt.Printf("[%d] failed to get object, error: %v\n", id, err)
	} else {
		defer resp.Body.Close()
		bytesRead, err = readBody(resp.Body)
		if err != nil {
			fmt.Printf("[%d] failed to read body, error: %v\n", id, err)
		}
	}
	fmt.Printf("[%d] finished download job (read %d bytes) in %v\n", id, bytesRead, time.Now().Sub(t1))
}
```

Output from 'top -p':

```
  PID USER      PR  NI    VIRT    RES %CPU %MEM   TIME+ S COMMAND
10626 yizha     20   0  598.2m  12.1m  0.0  0.2 0:00.18 S go
10651 yizha     20   0  550.0m  28.2m  0.0  0.4 0:14.46 S  `- test
```

Output from 'free -m':

```
[yizha@yizha-hp-z420 tmp]$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7916        1293        2522          12        4100        6315
Swap:          4095           0        4095
```
@yizha - Thank you for the detailed information. I suspect you may be getting throttled by the service. Can you enable logging by setting the log level to aws.LogDebug on your config?
@xibz - I enabled debug logging with config.WithLogLevel(aws.LogDebug), reran test.go, and attached the full output as a file, test.log. This time the result is really bad: the longest download took 18 minutes to finish and failed in the end. And I don't think it is throttling, because there are no retries (I set MaxRetries to 0) and the responses were all 200; even the failed ones read some bytes from the response but didn't finish reading all of it. You can tell from the log, which prints out how many bytes each job read from the response body whether the download succeeded or not.
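For anyone following along, a minimal sketch of that logging setup; the region and MaxRetries values simply mirror the test script above:

```go
import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
)

// newDebugSession builds a session with SDK debug logging enabled, matching
// the config used in the test script; aws.LogDebugWithHTTPBody would also
// dump request and response bodies.
func newDebugSession() *session.Session {
	config := aws.NewConfig().
		WithRegion("us-east-1").
		WithMaxRetries(0).
		WithLogLevel(aws.LogDebug)
	return session.Must(session.NewSession(config))
}
```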
@yizha - I've reached out to the service team, as I believe this is an issue on their end. The error you are seeing suggests that they may be closing the connection. I will let you know once I have more information.
I think @yizha this is hitting a bandwidth limit. You can test with "Network Link Conditioner" on macOS, or some other tool like that, to limit your network speed and then try again.
Do we have a solution for this now? Been experiencing the same problem.
I have tried reproducing this using an S3 bucket in us-east-1 on different platforms (local workstation, t2.micro in us-west-2, t2.micro in ap-southeast-2, third-party VPS in NY) with different versions of the AWS SDK for Go (v1.12.67 and v1.15.51), and have not been able to elicit a connection reset from S3 under any of these circumstances. This does not appear to be an issue with the AWS SDK for Go; you can confirm this by running your code on an EC2 instance, or by attempting the same actions using the AWS SDK for another language, and checking for the same connection resets from S3. I suggest further troubleshooting this behavior from the service end with AWS Support by opening a new support case under the

As an aside, in my experience issuing
Hi, I also ran into this issue in my program (each goroutine downloads a .gz file and then reads its content). When we download 70 * 20MB .gz files from S3 (ap-northeast-1) by using

I got the following stack trace by using runtime/pprof when it is stuck; details are attached. I know that AWS EC2 will throttle requests if they exceed certain limits; does S3 also have such limits?
Why was this closed with no resolution? I'm seeing similar slow download speeds using s3manager (with v1 and v2 of the Go SDK).
@sdlarsen this was closed due to inactivity because we were unable to reproduce the described behavior under various conditions.
If you're experiencing similar problems, please open a new issue with the information requested in our General Issue template so we can better assist you. In addition to this information, in your case, testing download speeds from an EC2 instance would be helpful to rule out your local network connection as the culprit for the behavior you're seeing.

@suanziliu S3 does throttle requests once you've reached a certain request rate (and the SDK will automatically retry requests to which S3 responds with an HTTP 503 Slow Down error); however, the base request rate is quite high (at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix) and can be increased by following our best practices for optimizing S3 performance.
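For completeness, a minimal sketch of a client that leaves retries enabled so throttled (503 Slow Down) responses are retried by the SDK; the retry count here is arbitrary:

```go
import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// newRetryingClient returns an S3 client that relies on the SDK's default
// retryer for throttling errors; 5 retries is an arbitrary choice.
func newRetryingClient() *s3.S3 {
	sess := session.Must(session.NewSession(&aws.Config{
		Region:     aws.String("us-east-1"),
		MaxRetries: aws.Int(5),
	}))
	return s3.New(sess)
}
```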
Seeing the same issue
Experiencing the same issue of delayed responses under high concurrency, @diehlaws, @xibz. My use case: in the initial setup, I was spawning a separate goroutine to fetch each object using

Sample error log for your reference.
Considering this, I made a change to limit concurrency, processing 1k records across only 50 goroutines. It took around 1-2 seconds, as it should. I took a look at the Go SDK

Might be an S3 issue, dunno. Please do share any workarounds, or any possible explanation for this behaviour. Checking the S3 download manager as well to see if that helps.
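For reference, a minimal sketch of the bounded-concurrency approach described above; fetchOne, the worker count, and the key list are assumptions for illustration:

```go
import (
	"sync"

	"github.com/aws/aws-sdk-go/service/s3"
)

// fetchAll drains a channel of keys with a fixed pool of 50 workers instead
// of spawning one goroutine per record; fetchOne is assumed to wrap the
// actual GetObject call and body read.
func fetchAll(client *s3.S3, bucket string, keys []string, fetchOne func(*s3.S3, string, string) error) {
	const workers = 50
	jobs := make(chan string)

	var wg sync.WaitGroup
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer wg.Done()
			for key := range jobs {
				if err := fetchOne(client, bucket, key); err != nil {
					// real code would collect or log the error
					_ = err
				}
			}
		}()
	}

	for _, key := range keys {
		jobs <- key
	}
	close(jobs)
	wg.Wait()
}
```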
First, I am not sure if this is actually an issue, but it seems, at least to me, that things don't run well when there are many concurrent downloads going on and no timeout is set. The "connection reset by peer" error is thrown in the end after a long time (on my box it took 5-6 minutes for those downloads to fail, while the successful ones took less than one minute), so it looks like the SDK has difficulty reading the response fast enough in this particular case. I know we can time out (cancel) the request if it takes too long, either by configuring the SDK with a custom timeout-enabled HTTP client or by calling s3.GetObjectWithContext(...) with a timeout Context. But I think 100 concurrent downloads really shouldn't cause such an issue (in fact I can reproduce it with lower concurrency, but 100 is guaranteed to reproduce it on my box). Also, it looks like the smaller the network bandwidth, the easier it is to reproduce the issue.
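For reference, a rough sketch of the custom timeout-enabled HTTP client workaround mentioned above; all timeout values are illustrative:

```go
import (
	"net"
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
)

// newTimeoutSession builds a session whose HTTP client caps the total
// request time (including reading the body) as well as the individual
// dial, TLS, and response-header phases.
func newTimeoutSession() *session.Session {
	httpClient := &http.Client{
		Timeout: 2 * time.Minute, // illustrative overall cap per request
		Transport: &http.Transport{
			DialContext: (&net.Dialer{
				Timeout:   10 * time.Second,
				KeepAlive: 30 * time.Second,
			}).DialContext,
			TLSHandshakeTimeout:   10 * time.Second,
			ResponseHeaderTimeout: 30 * time.Second,
		},
	}
	return session.Must(session.NewSession(&aws.Config{
		Region:     aws.String("us-east-1"),
		HTTPClient: httpClient,
	}))
}
```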
Version of AWS SDK for Go?

```
[yizha@yizha-hp-z420 aws-sdk-go]$ echo $GOPATH
/home/yizha/work/gows:/home/yizha/work/agency/gows
[yizha@yizha-hp-z420 aws-sdk-go]$ pwd
/home/yizha/work/gows/src/github.com/aws/aws-sdk-go
[yizha@yizha-hp-z420 aws-sdk-go]$ git log HEAD
commit ee65a53 (HEAD -> master, origin/master, origin/HEAD)
Author: xibz <impactbchang@gmail.com>
Date:   Fri Jan 19 13:54:42 2018 -0800
```
Version of Go (`go version`)?

```
[yizha@yizha-hp-z420 aws-sdk-go]$ go version
go version go1.9.2 linux/amd64
```
What issue did you see?
```
[yizha@yizha-hp-z420 tmp]$ C_CNT=100 go run test.go
concurrency: 100
[1] start downloading ...
[8] start downloading ...
[5] start downloading ...
[99] start downloading ...
[9] start downloading ...
[7] start downloading ...
[0] start downloading ...
[10] start downloading ...
[3] start downloading ...
[4] start downloading ...
[11] start downloading ...
[56] start downloading ...
[14] start downloading ...
[57] start downloading ...
[34] start downloading ...
[16] start downloading ...
[35] start downloading ...
[17] start downloading ...
[36] start downloading ...
[18] start downloading ...
[37] start downloading ...
[58] start downloading ...
[38] start downloading ...
[19] start downloading ...
[59] start downloading ...
[20] start downloading ...
[39] start downloading ...
[60] start downloading ...
[40] start downloading ...
[21] start downloading ...
[61] start downloading ...
[41] start downloading ...
[22] start downloading ...
[62] start downloading ...
[42] start downloading ...
[23] start downloading ...
[63] start downloading ...
[43] start downloading ...
[24] start downloading ...
[64] start downloading ...
[44] start downloading ...
[25] start downloading ...
[65] start downloading ...
[26] start downloading ...
[66] start downloading ...
[45] start downloading ...
[27] start downloading ...
[67] start downloading ...
[68] start downloading ...
[28] start downloading ...
[46] start downloading ...
[30] start downloading ...
[71] start downloading ...
[29] start downloading ...
[47] start downloading ...
[69] start downloading ...
[70] start downloading ...
[49] start downloading ...
[50] start downloading ...
[51] start downloading ...
[31] start downloading ...
[72] start downloading ...
[32] start downloading ...
[12] start downloading ...
[48] start downloading ...
[73] start downloading ...
[53] start downloading ...
[87] start downloading ...
[52] start downloading ...
[54] start downloading ...
[90] start downloading ...
[74] start downloading ...
[88] start downloading ...
[75] start downloading ...
[55] start downloading ...
[77] start downloading ...
[78] start downloading ...
[76] start downloading ...
[79] start downloading ...
[33] start downloading ...
[15] start downloading ...
[80] start downloading ...
[81] start downloading ...
[82] start downloading ...
[85] start downloading ...
[84] start downloading ...
[96] start downloading ...
[97] start downloading ...
[86] start downloading ...
[98] start downloading ...
[94] start downloading ...
[2] start downloading ...
[91] start downloading ...
[92] start downloading ...
[83] start downloading ...
[6] start downloading ...
[13] start downloading ...
[95] start downloading ...
[93] start downloading ...
[89] start downloading ...
[74] finished download job (read 6198809 bytes) in 43.569521561s
[53] finished download job (read 6198809 bytes) in 43.95594173s
[4] finished download job (read 6198809 bytes) in 47.707810419s
[95] finished download job (read 6198809 bytes) in 48.647493668s
[54] finished download job (read 6198809 bytes) in 48.701726836s
[46] finished download job (read 6198809 bytes) in 49.470526305s
[56] finished download job (read 6198809 bytes) in 49.987636599s
[83] finished download job (read 6198809 bytes) in 50.050643396s
[13] finished download job (read 6198809 bytes) in 51.928045656s
[22] finished download job (read 6198809 bytes) in 53.173156348s
[20] finished download job (read 6198809 bytes) in 53.216101938s
[55] finished download job (read 6198809 bytes) in 53.600310385s
[61] finished download job (read 6198809 bytes) in 53.609563651s
[79] finished download job (read 6198809 bytes) in 53.717528561s
[94] finished download job (read 6198809 bytes) in 53.954241976s
[84] finished download job (read 6198809 bytes) in 54.158461838s
[51] finished download job (read 6198809 bytes) in 54.389245387s
[43] finished download job (read 6198809 bytes) in 54.611049834s
[21] finished download job (read 6198809 bytes) in 54.641270409s
[8] finished download job (read 6198809 bytes) in 55.025735593s
[39] finished download job (read 6198809 bytes) in 55.026327339s
[86] finished download job (read 6198809 bytes) in 55.161876688s
[90] finished download job (read 6198809 bytes) in 55.244310288s
[85] finished download job (read 6198809 bytes) in 55.396435164s
[18] finished download job (read 6198809 bytes) in 55.446654478s
[31] finished download job (read 6198809 bytes) in 55.665957191s
[82] finished download job (read 6198809 bytes) in 55.790000091s
[48] finished download job (read 6198809 bytes) in 55.834006768s
[57] finished download job (read 6198809 bytes) in 55.849872747s
[70] finished download job (read 6198809 bytes) in 56.149388773s
[26] finished download job (read 6198809 bytes) in 56.221490182s
[16] finished download job (read 6198809 bytes) in 56.272423036s
[67] finished download job (read 6198809 bytes) in 56.44158009s
[96] finished download job (read 6198809 bytes) in 56.456063295s
[60] finished download job (read 6198809 bytes) in 56.521619324s
[50] finished download job (read 6198809 bytes) in 56.771706047s
[11] finished download job (read 6198809 bytes) in 56.821399992s
[52] finished download job (read 6198809 bytes) in 56.865662218s
[66] finished download job (read 6198809 bytes) in 56.97720656s
[6] finished download job (read 6198809 bytes) in 56.993605511s
[78] finished download job (read 6198809 bytes) in 57.044606204s
[30] finished download job (read 6198809 bytes) in 57.167580858s
[81] finished download job (read 6198809 bytes) in 57.219458452s
[27] finished download job (read 6198809 bytes) in 57.234184741s
[80] finished download job (read 6198809 bytes) in 57.294372358s
[37] finished download job (read 6198809 bytes) in 57.434658629s
[23] finished download job (read 6198809 bytes) in 57.654305373s
[91] finished download job (read 6198809 bytes) in 57.924653005s
[3] finished download job (read 6198809 bytes) in 57.94704263s
[36] finished download job (read 6198809 bytes) in 57.978373205s
[45] finished download job (read 6198809 bytes) in 58.207883656s
[38] finished download job (read 6198809 bytes) in 58.244114709s
[69] finished download job (read 6198809 bytes) in 58.2754416s
[41] finished download job (read 6198809 bytes) in 58.293532975s
[28] finished download job (read 6198809 bytes) in 58.305003088s
[63] finished download job (read 6198809 bytes) in 58.315931935s
[32] finished download job (read 6198809 bytes) in 58.320526113s
[44] finished download job (read 6198809 bytes) in 58.463426612s
[7] finished download job (read 6198809 bytes) in 58.478705522s
[92] finished download job (read 6198809 bytes) in 58.620686694s
[10] finished download job (read 6198809 bytes) in 58.676208804s
[19] finished download job (read 6198809 bytes) in 58.67685684s
[49] finished download job (read 6198809 bytes) in 58.70566748s
[35] finished download job (read 6198809 bytes) in 58.847224198s
[99] finished download job (read 6198809 bytes) in 58.894416924s
[76] finished download job (read 6198809 bytes) in 58.968481343s
[98] finished download job (read 6198809 bytes) in 58.988596888s
[40] finished download job (read 6198809 bytes) in 59.035400909s
[58] finished download job (read 6198809 bytes) in 59.043526431s
[88] finished download job (read 6198809 bytes) in 59.122832185s
[87] finished download job (read 6198809 bytes) in 59.153337061s
[24] finished download job (read 6198809 bytes) in 59.182860012s
[5] finished download job (read 6198809 bytes) in 59.300469562s
[65] finished download job (read 6198809 bytes) in 59.318931113s
[68] finished download job (read 6198809 bytes) in 59.327315684s
[73] finished download job (read 6198809 bytes) in 59.417350732s
[75] finished download job (read 6198809 bytes) in 59.429167583s
[89] finished download job (read 6198809 bytes) in 59.441224719s
[33] finished download job (read 6198809 bytes) in 59.467230748s
[93] finished download job (read 6198809 bytes) in 59.47243084s
[62] finished download job (read 6198809 bytes) in 59.571973951s
[0] finished download job (read 6198809 bytes) in 59.624225804s
[2] finished download job (read 6198809 bytes) in 59.723292307s
[64] finished download job (read 6198809 bytes) in 59.759922201s
[47] finished download job (read 6198809 bytes) in 59.759636996s
[14] finished download job (read 6198809 bytes) in 59.994028361s
[59] finished download job (read 6198809 bytes) in 1m0.002918409s
[34] finished download job (read 6198809 bytes) in 1m0.070034962s
[97] finished download job (read 6198809 bytes) in 1m0.327620044s
[77] finished download job (read 6198809 bytes) in 1m0.366530372s
[15] finished download job (read 6198809 bytes) in 1m0.402741674s
[72] finished download job (read 6198809 bytes) in 1m0.412533226s
[29] finished download job (read 6198809 bytes) in 1m0.440142146s
[42] finished download job (read 6198809 bytes) in 1m0.510102497s
[1] finished download job (read 6198809 bytes) in 1m0.75496559s
[71] failed to read body, error: read tcp 10.90.7.237:33446->52.216.64.216:443: read: connection reset by peer
[71] finished download job (read 730633 bytes) in 5m45.648985404s
[9] failed to read body, error: read tcp 10.90.7.237:33386->52.216.64.216:443: read: connection reset by peer
[9] finished download job (read 1409545 bytes) in 5m55.520246483s
[12] failed to read body, error: read tcp 10.90.7.237:33566->52.216.64.216:443: read: connection reset by peer
[12] finished download job (read 1775113 bytes) in 6m1.201912432s
[17] failed to read body, error: read tcp 10.90.7.237:33420->52.216.64.216:443: read: connection reset by peer
[17] finished download job (read 748041 bytes) in 6m1.384548141s
[25] failed to read body, error: read tcp 10.90.7.237:33516->52.216.64.216:443: read: connection reset by peer
[25] finished download job (read 2140681 bytes) in 6m1.941659707s
```
Steps to reproduce
Run the attached example with high concurrency (you will need to modify the access key ID and secret key, and also change the bucket/key). On my box I can reproduce it with the concurrency set to 100; the S3 object it downloads is about 6MB. You may be able to reproduce it with a different concurrency number; I guess it depends on the network bandwidth and the size of the S3 object it downloads.
If you have a runnable example, please include it.
test.go.txt