
Apply retries around the whole download operation #44

Merged: TomBoam merged 26 commits into master from tb/NSA-500 on Jan 27, 2020
Conversation

TomBoam (Contributor) commented Jan 22, 2020

Ticket: https://deliveroo.atlassian.net/browse/NSA-500

What
Change the retry logic in `paddle data get` so it covers additional transient failure cases that are worth retrying.

Why
NSA have seen failures in our pipelines because copying the data from S3 into a local file failed for a transient reason. Before this PR the CLI had retries around asking S3 for the object (`s3.GetObject(input *s3.GetObjectInput) (*s3.GetObjectOutput, error)`) but not around actually pulling the bytes across the network and copying them into the local file (`io.Copy(foo, s3.GetObjectOutput.Body)`).

This led to pipeline failures: errors while copying the object into the local file were never retried.

2020/01/07 12:03:13 [rider-planning-duke-backbone-master/paddle]: /data/input/slots_surge_and_order_volume.csv -> 1309073807 bytes
2020/01/07 12:08:22 [rider-planning-duke-backbone-master/paddle]: copying file /data/input/driver_hourly_schedule_in_app.csv: read tcp 100.117.27.117:48523->52.218.24.160:443: read: connection reset by peer
2020/01/07 12:08:25 [paddle] Container removed: paddle

This PR moves the retries so that they wrap both connecting to S3 and copying the data locally.
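
To make the failure mode concrete, here is a rough sketch of the pre-PR shape. It is not the actual CLI code: the `withRetries` helper, the names, the retry count, and the `package data` framing are assumptions for illustration.

```go
package data

import (
	"io"
	"os"

	"github.com/aws/aws-sdk-go/service/s3"
)

// withRetries stands in for the CLI's existing retry helper: try an
// operation a few times, returning the last error on exhaustion.
func withRetries(op func() error) error {
	var err error
	for attempt := 0; attempt < 3; attempt++ {
		if err = op(); err == nil {
			return nil
		}
	}
	return err
}

// fetchOld shows the pre-PR shape: GetObject is retried, but the io.Copy
// that streams the body over the network sits outside the retry loop, so
// a mid-stream "connection reset by peer" failed the whole download.
func fetchOld(svc *s3.S3, input *s3.GetObjectInput, file *os.File) error {
	var output *s3.GetObjectOutput
	err := withRetries(func() (e error) {
		output, e = svc.GetObject(input)
		return e
	})
	if err != nil {
		return err
	}
	defer output.Body.Close()
	_, err = io.Copy(file, output.Body) // any failure here was terminal
	return err
}
```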

How
Firstly I refactored how the class fetches files, including the special case of fetching the HEAD file. Previously ordinary files were fetched as you would expect, but the HEAD file's contents were streamed directly into a string. The retry lived inside that get-contents step, so moving the retry further up the stack for file fetches would have left the readHEAD method without retries. I therefore changed readHEAD to copy the object into a temporary file and then read the contents from it. This unified the two code paths and made the refactor easier.
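
A sketch of the unified readHEAD under these assumptions: the bucket/key parameters, retry values, and temp-file name are illustrative, and `S3Getter` and `copyS3ObjectToFile` are the ones sketched in full after the next paragraph (treat the three snippets as one `package data`).

```go
package data

import (
	"io/ioutil"
	"os"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/s3"
)

// readHEAD, as refactored here: download the HEAD object into a temporary
// file through the same retried path as ordinary files, then read the
// contents back as a string.
func readHEAD(getter S3Getter, bucket, key string) (string, error) {
	tmp, err := ioutil.TempFile("", "paddle-head")
	if err != nil {
		return "", err
	}
	defer os.Remove(tmp.Name())
	defer tmp.Close()

	input := &s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	}
	// Same retried download path as ordinary files (retry count and wait
	// period here are assumed values).
	if err := copyS3ObjectToFile(getter, input, tmp, 3, time.Second); err != nil {
		return "", err
	}
	contents, err := ioutil.ReadFile(tmp.Name())
	if err != nil {
		return "", err
	}
	return string(contents), nil
}
```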

This refactor produces a method called copyS3ObjectToFile, which is used by both code paths and is easier to test in isolation. I moved the retry logic into this method, so both connecting to S3 and copying the bytes into the file now sit inside the retry loop. We have to reset the file when a copy fails, because some bytes may already have been written to it.
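
A minimal sketch of that method. The signature, retry count, and waitPeriod parameter are assumptions (the real code may differ); the `S3Getter` interface is the structural type described under Testing below.

```go
package data

import (
	"io"
	"os"
	"time"

	"github.com/aws/aws-sdk-go/service/s3"
)

// S3Getter is the structural type described in the Testing section: just
// the one method we need from the full AWS client.
type S3Getter interface {
	GetObject(input *s3.GetObjectInput) (*s3.GetObjectOutput, error)
}

// copyS3ObjectToFile retries the whole download: both the GetObject call
// and the io.Copy run inside the loop.
func copyS3ObjectToFile(getter S3Getter, input *s3.GetObjectInput, file *os.File, retries int, waitPeriod time.Duration) error {
	var err error
	for attempt := 0; attempt <= retries; attempt++ {
		if attempt > 0 {
			time.Sleep(waitPeriod)
			// A failed copy may have written partial data, so reset the
			// file before trying again.
			if err = file.Truncate(0); err != nil {
				return err
			}
			if _, err = file.Seek(0, io.SeekStart); err != nil {
				return err
			}
		}
		var output *s3.GetObjectOutput
		output, err = getter.GetObject(input)
		if err != nil {
			continue // transient error talking to S3: retry
		}
		_, err = io.Copy(file, output.Body)
		output.Body.Close()
		if err == nil {
			return nil
		}
		// Copy failed mid-stream (e.g. connection reset): retry.
	}
	return err
}
```

The Truncate plus Seek before each retry is what "reset the file" means above: it discards any partial bytes so every attempt writes the file from a clean slate.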

Testing
Previously there were no tests for reading files from S3. I've now added tests for the happy case (a successful read on the first attempt), for failures at each step, and for success after a number of retries. I made the time between retries configurable so these tests don't take too long.

The tests work by pulling out a structural type S3Getter, which encapsulates the `GetObject(input *s3.GetObjectInput) (*s3.GetObjectOutput, error)` method we need from the full AWS client.
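
For example, a test double along these lines (the `flakyGetter` name and fields are illustrative, not the PR's actual test code) can stand in for the AWS client:

```go
package data

import (
	"bytes"
	"errors"
	"io/ioutil"

	"github.com/aws/aws-sdk-go/service/s3"
)

// flakyGetter fails a configurable number of times before returning a
// body, which is how "success after a number of retries" can be exercised
// without a real AWS client.
type flakyGetter struct {
	failuresBeforeSuccess int
	calls                 int
	body                  string
}

func (f *flakyGetter) GetObject(input *s3.GetObjectInput) (*s3.GetObjectOutput, error) {
	f.calls++
	if f.calls <= f.failuresBeforeSuccess {
		return nil, errors.New("simulated transient S3 error")
	}
	return &s3.GetObjectOutput{
		Body: ioutil.NopCloser(bytes.NewBufferString(f.body)),
	}, nil
}
```

A test can then pass something like `&flakyGetter{failuresBeforeSuccess: 2, body: "HEAD"}` together with a millisecond-scale wait period, which is why making the time between retries configurable keeps the suite fast.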

I have not yet tested this on a pipeline, but I will do so once I've had feedback on this PR.

Where to start
It only covers two classes. I'm not familiar with Go, so the individual commits are very small; I'm not sure whether it's useful to follow them one by one.

Open Questions
- This is my first time writing Go, so I may have made some dumb mistakes, missed some obvious library code that could have helped, or written code in a style that isn't idiomatic.
- This class also lists the keys with a given prefix. Should I add retries around that operation too? We haven't seen it fail yet, but it could.

Urgency
Not urgent

@aleon left a comment

LGTM, though I am still very much a n00b to Go so very much worth more 👀

(An inline review comment on cli/data/get.go was marked outdated and resolved.)
TomBoam (Contributor, Author) commented Jan 27, 2020

I have completed testing of the new get CLI on staging. The happy case works fine. It is difficult to simulate S3 failures in these tests, but I believe the unit tests cover those cases.

@me (Collaborator) left a comment

Nice!

@TomBoam TomBoam merged commit 4691b5a into master Jan 27, 2020
@TomBoam TomBoam deleted the tb/NSA-500 branch January 27, 2020 15:46