[Experiment] Make S3Handler.s3_read return a stream rather than bytes #800

ejguan · 2022-09-29T18:22:35Z

Originally, I expected the stream returned from S3Handler to be a non-seekable stream. However, based on the implementation, it turns out that the whole archive/file is dumped into memory (I might be wrong about this, so I need someone to validate it).

C++ side:

data/torchdata/csrc/pybind/pybind.cpp

Line 28 in a435c7f

return py::bytes(result);

Python side:

data/torchdata/datapipes/iter/load/s3io.py

Line 135 in a435c7f

yield url, StreamWrapper(BytesIO(self.handler.s3_read(url)))

That is why the performance appears to be on par with or without BytesIO in this issue.

To make it streaming, we need a way to bind a C++ I/O stream to Python through pybind, which is non-trivial. See a code example: https://github.com/CadQuery/OCP/blob/master/pystreambuf.h
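For illustration, here is a rough pure-Python sketch of what "return a stream rather than bytes" could look like from the caller's side. It is not the pybind approach linked above and not torchdata code: `ChunkedReader`, `fetch_range`, and `fake_fetch_range` are hypothetical names, and the "remote object" is just an in-memory byte string so the snippet runs without S3.

```python
import io


class ChunkedReader(io.RawIOBase):
    """Non-seekable, read-only stream that pulls data on demand.

    `fetch_range(offset, size)` is a hypothetical callable standing in for a
    ranged read against S3; it is not part of torchdata.
    """

    def __init__(self, fetch_range, total_size):
        super().__init__()
        self._fetch_range = fetch_range
        self._total_size = total_size
        self._pos = 0

    def readable(self):
        return True

    def readinto(self, buffer):
        # Ask only for as many bytes as the caller's buffer can hold.
        size = min(len(buffer), self._total_size - self._pos)
        if size <= 0:
            return 0  # end of stream
        chunk = self._fetch_range(self._pos, size)
        buffer[: len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


# Stand-in for a remote object (256 KiB), so the sketch runs without S3.
payload = bytes(range(256)) * 1024
requests = []  # records every (offset, size) that was actually fetched


def fake_fetch_range(offset, size):
    requests.append((offset, size))
    return payload[offset : offset + size]


stream = io.BufferedReader(
    ChunkedReader(fake_fetch_range, len(payload)), buffer_size=64 * 1024
)
header = stream.read(512)  # only part of the payload is fetched here
```

With this shape, reading a small header only triggers a fetch for a buffer-sized piece instead of the whole object, which is the behavior the current `BytesIO(self.handler.s3_read(url))` wrapping cannot give. Whether that actually helps end-to-end throughput would still need benchmarking.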

This change could potentially accelerate data preprocessing, but it needs to be benchmarked extensively.

ejguan · 2022-09-29T18:37:44Z

There is also a use case that might affect the performance of S3FileLoader.
If I call tarfile.open(fileobj=s3_stream_returned_from_s3fileloader, mode=m, bufsize=20000000240), mode r: is much faster than mode r|
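To make the two modes concrete, here is a small sketch that opens the same archive with `r:` and `r|`. It uses an in-memory tar built on the spot (the `build_tar` helper and the member names are made up for the example) and does not try to reproduce the timing difference; it only shows how the two calls differ.

```python
import io
import tarfile


def build_tar(members):
    """Return the bytes of an uncompressed tar archive built in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


archive = build_tar({"a.txt": b"alpha", "b.txt": b"bravo"})

# "r:" is random-access mode: tarfile is allowed to seek() on the file object.
with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
    names_random = tar.getnames()

# "r|" is stream mode: the file object is read forward-only in blocks of
# `bufsize` bytes, so it also works with non-seekable streams.
with tarfile.open(fileobj=io.BytesIO(archive), mode="r|", bufsize=10240) as tar:
    names_stream = [member.name for member in tar]
```

Both modes list the same members here. Since the object returned by S3FileLoader is currently a `BytesIO` over data already in memory, `r:` can seek freely, which may be part of why it is faster than `r|` in this setup; that is a guess, not something measured in this sketch.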

ejguan · 2022-09-29T18:38:45Z

cc: @ydaiming to confirm whether files are dumped into memory rather than streamed from S3Handler.s3_read. Also, do you want to see whether there would be a benefit to revamping it to an iostream?

ejguan mentioned this issue Sep 30, 2022

Does torchdata already work with GCP and Azure blob storage #794

Closed
