Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Experiment] Make S3Handler.s3_read return a stream rather than bytes #800

Open
ejguan opened this issue Sep 29, 2022 · 2 comments
Open

Comments

@ejguan
Copy link
Contributor

ejguan commented Sep 29, 2022

Originally I was expecting the returned stream from S3handler is non-seekable stream. But, it turns out that the whole archive/files will be dumped into memory based on the implementation (I might be wrong about it then I need someone to validate it)

In order to make it streaming, we need to have a way to pybind C++ stream IO to python, which is non-trivial. See a code example: https://github.com/CadQuery/OCP/blob/master/pystreambuf.h

Potentially this change would accelerate data preprocessing. But, it needs to be extensively benchmarked.

@ejguan
Copy link
Contributor Author

ejguan commented Sep 29, 2022

And, there is a use case that might affect the performance on S3FileLoader.
If I do tarfile.open(fileobj=s3_stream_returned_from_s3fileloader, mode=m, bufsize=20000000240), the speed with mode r: is way faster than the mode r|

@ejguan
Copy link
Contributor Author

ejguan commented Sep 29, 2022

cc: @ydaiming for confirmation about the files are dumped into memory rather than streaming from S3Handler.s3_read. And, do you want to see if there will be benefit to revamp it to an iostream?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant