This repository has been archived by the owner on Sep 23, 2024. It is now read-only.

Streaming S3 upload #20

Open
aroder opened this issue Jul 10, 2020 · 2 comments

@aroder
Contributor

aroder commented Jul 10, 2020

The S3 upload should stream. Currently, even for large source streams, the code creates a temp file and attempts to upload it all at once.

The code at https://github.com/aroder/pipelinewise-target-s3-csv/tree/stream-s3-upload is functional for streaming. I cannot open a PR yet, because there are some features, like compression, that I still need to work out how to support.
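For context, the streaming approach boils down to re-chunking the incoming record stream into S3 multipart-sized parts instead of spooling everything to a temp file first. A minimal sketch of that re-chunking step (the function name is illustrative, not taken from the branch; each yielded part would feed one `upload_part` call, or the whole stream can be handed to boto3's `upload_fileobj`, which manages the multipart upload itself):

```python
MIN_PART = 5 * 1024 * 1024  # S3 requires every part except the last to be >= 5 MiB


def iter_parts(stream, part_size=MIN_PART):
    """Re-chunk an iterable of byte chunks into multipart-upload-sized parts."""
    buf = bytearray()
    for chunk in stream:
        buf.extend(chunk)
        while len(buf) >= part_size:
            yield bytes(buf[:part_size])
            del buf[:part_size]
    if buf:
        # the final part may be smaller than part_size
        yield bytes(buf)
```

With a small `part_size` for illustration, `iter_parts([b"abcd", b"ef"], part_size=3)` yields `b"abc"` then `b"def"`, so only one part ever lives in memory at a time.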

@koszti when you can make time to review this branch and provide feedback, it would be appreciated.

@koszti
Collaborator

koszti commented Jul 15, 2020

@aroder, thanks for this. I couldn't test it extensively with real files, but I think the logic is basically fine. And indeed, streaming to S3 is a much better option than creating a temp file on local disk.

Have you found anything to solve compression? Is that something that we can do with multipart uploads on the fly?

Here are my comments so far:

  • The singer.parse_message method is already called a few lines above.

  • You emit the state here right after receiving a new STATE message. I think we should keep the previous logic and emit the last received STATE only at the very end, once the entire file has been uploaded to S3. Receiving a STATE message in the middle of the stream doesn't guarantee that the current batch of 100k records has been successfully uploaded to S3.

  • What's the purpose of the IterStream class? I can't find where this class is used in the code.
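For reference, "IterStream" is a common name for a pattern that wraps an iterator of byte chunks as a readable file-like object, which is what APIs like boto3's `upload_fileobj` expect. Assuming that is the intended role here, a typical sketch looks like this (not the branch's actual implementation):

```python
import io


class IterStream(io.RawIOBase):
    """Expose an iterable of byte chunks as a readable raw binary stream."""

    def __init__(self, iterable):
        self.leftover = b""  # unread remainder of the last chunk
        self.iterator = iter(iterable)

    def readable(self):
        return True

    def readinto(self, b):
        try:
            chunk = self.leftover or next(self.iterator)
        except StopIteration:
            return 0  # no more chunks: signal EOF
        n = len(b)
        output, self.leftover = chunk[:n], chunk[n:]
        b[: len(output)] = output
        return len(output)
```

Wrapped in `io.BufferedReader`, this gives boto3 an ordinary `.read()`-able stream without ever materialising the full payload.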

@aroder
Contributor Author

aroder commented Jul 24, 2020 via email
