[Experiment] Make S3Handler.s3_read return a stream rather than bytes #800
Comments
Also, there is a use case that might affect the performance of S3FileLoader.
cc: @ydaiming to confirm whether the files are dumped into memory rather than streaming from
Originally I expected the stream returned from `S3Handler` to be a non-seekable stream. But it turns out that the whole archive/file is dumped into memory by the current implementation (I might be wrong about this, so I need someone to validate it):

data/torchdata/csrc/pybind/pybind.cpp, line 28 in a435c7f
data/torchdata/datapipes/iter/load/s3io.py, line 135 in a435c7f
That is the reason the performance seems to be on par with or without `BytesIO` in this issue.

To make it truly streaming, we need a way to pybind a C++ stream IO to Python, which is non-trivial. See a code example: https://github.com/CadQuery/OCP/blob/master/pystreambuf.h
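To illustrate what "streaming" would mean on the Python side, here is a minimal sketch of a non-seekable, read-only file object that pulls data chunk by chunk instead of holding the whole object in memory. This is not the torchdata implementation: `ChunkedReader` and `fake_object_chunks` are hypothetical names, and the chunk source is a stand-in for whatever a pybind-exposed C++ stream would provide.

```python
import io
from typing import Iterator, List


class ChunkedReader(io.RawIOBase):
    """Hypothetical non-seekable, read-only stream that pulls chunks on demand."""

    def __init__(self, chunks: Iterator[bytes]) -> None:
        super().__init__()
        self._chunks = chunks
        self._pending = b""

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return False

    def readinto(self, buffer) -> int:
        # Ask the source for more data only when the previous chunk is used up.
        while not self._pending:
            try:
                self._pending = next(self._chunks)
            except StopIteration:
                return 0  # end of stream
        n = min(len(buffer), len(self._pending))
        buffer[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        return n


def fake_object_chunks(total: int, chunk_size: int, fetched: List[int]) -> Iterator[bytes]:
    """Stand-in for ranged reads against remote storage (assumption, not the S3 API)."""
    for offset in range(0, total, chunk_size):
        size = min(chunk_size, total - offset)
        fetched.append(size)  # record how much was actually pulled
        yield bytes(size)


fetched: List[int] = []
raw = ChunkedReader(fake_object_chunks(1000, 100, fetched))
stream = io.BufferedReader(raw, buffer_size=100)

head = stream.read(150)
print(len(head), "bytes read;", sum(fetched), "bytes fetched so far")
```

Wrapping the bytes in `BytesIO` gives a file-like interface too, but only after the full payload already sits in memory, which would be consistent with the parity in performance described above.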
This change could potentially accelerate data preprocessing, but it needs to be benchmarked extensively.
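As a starting point for such a benchmark, the sketch below compares wall-clock time and peak traced memory for a "materialize everything, then wrap in `BytesIO`" path against a chunk-by-chunk path. It uses only the standard library; `source`, `consume_buffered`, `consume_streaming`, and `measure` are hypothetical helpers, and the in-memory zero-filled chunks are an assumption standing in for real S3 reads, so the numbers say nothing about actual network behavior.

```python
import hashlib
import io
import time
import tracemalloc
from typing import Callable, Iterator, Tuple

CHUNK = 1 << 20   # 1 MiB per chunk (arbitrary choice)
N_CHUNKS = 16     # 16 MiB total payload (arbitrary choice)


def source() -> Iterator[bytes]:
    """Stand-in for data arriving from remote storage (assumption)."""
    for _ in range(N_CHUNKS):
        yield bytes(CHUNK)


def consume_buffered() -> str:
    # Materialize the full payload first, as a bytes-returning read would.
    data = io.BytesIO(b"".join(source()))
    return hashlib.sha256(data.read()).hexdigest()


def consume_streaming() -> str:
    # Process each chunk as it arrives and let it be freed afterwards.
    digest = hashlib.sha256()
    for chunk in source():
        digest.update(chunk)
    return digest.hexdigest()


def measure(fn: Callable[[], str]) -> Tuple[str, float, int]:
    """Return (result, elapsed seconds, peak traced bytes) for one call."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


for fn in (consume_buffered, consume_streaming):
    _, elapsed, peak = measure(fn)
    print(f"{fn.__name__}: {elapsed:.3f}s, peak {peak / 2**20:.1f} MiB")
```

A real benchmark would swap `source` for reads through the actual handler and repeat the measurement over several object sizes and archive formats.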