This repository has been archived by the owner on Sep 23, 2024. It is now read-only.

Streaming S3 upload #20

Open
aroder opened this issue Jul 10, 2020 · 2 comments

@aroder
Contributor

aroder commented Jul 10, 2020

The S3 upload should stream. Currently, even for large source streams, the code creates a temp file and attempts to upload it all at once.

The code at https://github.com/aroder/pipelinewise-target-s3-csv/tree/stream-s3-upload is functional for streaming. I cannot open a PR yet, because there are some features, like compression, that I still need to work out how to support.
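For context, the streaming approach boils down to re-chunking the incoming record stream into S3 multipart-sized parts instead of spooling everything to a temp file first. A minimal sketch of that re-chunking step (the function name is illustrative, not taken from the branch; each yielded part would feed one `upload_part` call, or the whole stream can be handed to boto3's `upload_fileobj`, which manages the multipart upload itself):

```python
MIN_PART = 5 * 1024 * 1024  # S3 requires every part except the last to be >= 5 MiB


def iter_parts(stream, part_size=MIN_PART):
    """Re-chunk an iterable of byte chunks into multipart-upload-sized parts."""
    buf = bytearray()
    for chunk in stream:
        buf.extend(chunk)
        while len(buf) >= part_size:
            yield bytes(buf[:part_size])
            del buf[:part_size]
    if buf:
        # the final part may be smaller than part_size
        yield bytes(buf)
```

With a small `part_size` for illustration, `iter_parts([b"abcd", b"ef"], part_size=3)` yields `b"abc"` then `b"def"`, so only one part ever lives in memory at a time.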

@koszti when you can make time to review this branch and provide feedback, it would be appreciated.

@koszti
Collaborator

koszti commented Jul 15, 2020

@aroder, thanks for this. I couldn't test it extensively with real files, but I think the logic is basically fine. And indeed, streaming to S3 is a much better option than creating a temp file on local disk.

Have you found anything to solve compression? Is that something that we can do with multipart uploads on the fly?

Here are my comments so far:

  • The singer.parse_message method is already called a few lines above.

  • You emit the state here right after receiving a new STATE message. I think we should keep the previous logic and emit the last received STATE only at the very end, once the entire file has been uploaded to S3. Receiving a STATE message in the middle of the stream doesn't guarantee that the current batch of 100k records has been successfully uploaded to S3.

  • What's the purpose of the IterStream class? I can't find where this class is used in the code.
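For reference, "IterStream" is a common name for a pattern that wraps an iterator of byte chunks as a readable file-like object, which is what APIs like boto3's `upload_fileobj` expect. Assuming that is the intended role here, a typical sketch looks like this (not the branch's actual implementation):

```python
import io


class IterStream(io.RawIOBase):
    """Expose an iterable of byte chunks as a readable raw binary stream."""

    def __init__(self, iterable):
        self.leftover = b""  # unread remainder of the last chunk
        self.iterator = iter(iterable)

    def readable(self):
        return True

    def readinto(self, b):
        try:
            chunk = self.leftover or next(self.iterator)
        except StopIteration:
            return 0  # no more chunks: signal EOF
        n = len(b)
        output, self.leftover = chunk[:n], chunk[n:]
        b[: len(output)] = output
        return len(output)
```

Wrapped in `io.BufferedReader`, this gives boto3 an ordinary `.read()`-able stream without ever materialising the full payload.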

@aroder
Contributor Author

aroder commented Jul 24, 2020 via email
