Streaming S3 upload #20
Comments
@aroder, thanks for this. I couldn't test it extensively with real files, but I think the logic is basically fine. And indeed, streaming to S3 is a much better option than creating a temp file on local disk. Have you found anything to solve compression? Is that something we can do with multipart uploads on the fly? Here are my comments so far:
|
Thanks, Peter. I don't have an answer yet for the compression on multipart uploads. Perhaps we keep the old method around if the user wants to do compression? I'll keep looking.
I'll address the other comments. I believe IterStream was an attempt I abandoned and should be removed.
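Since gzip is itself a streaming format, compression could in principle happen on the fly, between the record stream and the multipart upload. A minimal sketch, assuming only the standard library; the function name `gzip_chunks` is illustrative and not part of the project:

```python
import zlib

def gzip_chunks(lines):
    """Compress an iterable of text lines into gzip-framed byte chunks.

    The resulting chunks can be buffered into multipart-upload parts,
    so no temp file or whole-file compression pass is needed."""
    # wbits = MAX_WBITS | 16 selects the gzip container format
    compressor = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS | 16)
    for line in lines:
        chunk = compressor.compress(line.encode("utf-8"))
        if chunk:                  # compressobj may buffer and return nothing
            yield chunk
    tail = compressor.flush()      # emit the gzip trailer
    if tail:
        yield tail
```

The output is a valid `.gz` stream, so whatever lands in S3 can be decompressed with standard tools.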
…--
Adam Roderick
On Wed, Jul 15, 2020 at 10:37 AM Peter Kosztolanyi ***@***.***> wrote:
@aroder <https://github.com/aroder>, thanks for this. I couldn't test it
extensively with real files, but I think the logic is basically fine. And
indeed, streaming to S3 is a much better approach than creating a temp file
on local disk.
Have you found anything to solve compression? Is that something we can do
with multipart uploads on the fly?
Here are my comments so far:
-
The singer.parse_message
<https://github.com/aroder/pipelinewise-target-s3-csv/blob/e7ff1fe7c477748aba3919fda2b02ab24276542b/target_s3_csv/__init__.py#L169>
method is already called a few lines above.
-
You emit the state here
<https://github.com/aroder/pipelinewise-target-s3-csv/blob/e7ff1fe7c477748aba3919fda2b02ab24276542b/target_s3_csv/__init__.py#L200>
right after receiving a new STATE message. I think we should keep
the previous logic and emit the last received STATE only at the very
end, once the entire file has been uploaded to S3. If we receive a STATE
message in the middle of the stream, it doesn't guarantee that the
actual batch of 100k records has been successfully uploaded to S3.
-
What's the purpose of the IterStream class
<https://github.com/aroder/pipelinewise-target-s3-csv/blob/e7ff1fe7c477748aba3919fda2b02ab24276542b/target_s3_csv/IterStream.py>?
I can't find where this class is used in the code.
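The STATE suggestion above can be sketched as: remember the most recent STATE message while records stream through, and emit it only after the upload has finished. This is a simplified illustration, not the project's actual code; `persist_messages`, `upload_fn`, and the JSON-lines parsing are stand-ins:

```python
import json
import sys

def persist_messages(messages, upload_fn):
    """Sketch of deferred STATE emission: buffer the latest STATE and
    emit it only after the whole batch has been uploaded to S3.
    `upload_fn` stands in for the real streaming upload."""
    last_state = None
    records = []
    for raw in messages:
        msg = json.loads(raw)
        if msg["type"] == "STATE":
            last_state = msg["value"]       # remember, but do NOT emit yet
        elif msg["type"] == "RECORD":
            records.append(msg["record"])
    upload_fn(records)                      # the entire batch reaches S3 first
    if last_state is not None:
        json.dump(last_state, sys.stdout)   # emit only after a successful upload
        sys.stdout.write("\n")
    return last_state
```

If `upload_fn` raises, the STATE is never emitted, so the tap will resume from the previous checkpoint rather than skipping records that never made it to S3.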
|
The S3 upload should stream. Currently, even for large source streams, the code creates a temp file and attempts to upload it all at once.
The code at https://github.com/aroder/pipelinewise-target-s3-csv/tree/stream-s3-upload is functional for streaming. I cannot open a PR yet, because there are some features, like compression, that I still need to work out how to support.
@koszti, when you can make time to review this branch and provide feedback, it would be appreciated.
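For reference, the streaming approach can be sketched with boto3's multipart-upload API: buffer rows until S3's 5 MiB minimum part size, upload each full buffer as a part, and complete (or abort) the upload at the end. This is a hypothetical sketch under stated assumptions, not the branch's actual code; `stream_to_s3` and the injectable `s3` client are illustrative:

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # S3 requires >= 5 MiB for every part but the last

def stream_to_s3(lines, bucket, key, s3=None):
    """Illustrative sketch: stream an iterable of CSV lines to S3 via the
    multipart-upload API, never touching local disk. `s3` is injectable
    for testing; by default a boto3 client is created."""
    if s3 is None:
        import boto3
        s3 = boto3.client("s3")
    mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
    parts, buf, part_no = [], bytearray(), 1
    try:
        for line in lines:
            buf.extend(line.encode("utf-8"))
            if len(buf) >= MIN_PART_SIZE:
                resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_no,
                                      UploadId=mpu["UploadId"], Body=bytes(buf))
                parts.append({"ETag": resp["ETag"], "PartNumber": part_no})
                part_no += 1
                buf.clear()
        if buf:  # final, possibly short, part
            resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_no,
                                  UploadId=mpu["UploadId"], Body=bytes(buf))
            parts.append({"ETag": resp["ETag"], "PartNumber": part_no})
        s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
                                     MultipartUpload={"Parts": parts})
    except Exception:
        # abandon the upload so S3 doesn't keep billing for orphaned parts
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu["UploadId"])
        raise
```

The abort-on-failure branch matters: S3 retains uploaded parts (and charges for them) until the multipart upload is either completed or aborted.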