AWS S3 DataPipe doesn't terminate when used with .load_from_tar(mode="r|") #799

Closed
NivekT opened this issue Sep 28, 2022 · 2 comments
Labels
bug Something isn't working

Comments

NivekT commented Sep 28, 2022

🐛 Describe the bug

Note: this issue only occurs with mode="r|", not with mode="r:".

Sample code:

import time

from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper(["s3://BUCKET/images0.tar"])
# dp = dp.load_files_by_s3(region="us-east-1").load_from_tar(mode="r:")  # This is fast and works as expected
dp = dp.load_files_by_s3(region="us-east-1").load_from_tar(mode="r|")  # This is slow and doesn't terminate; it repeatedly yields certain files
start = time.time()
for x in dp:
    print(x)
print(time.time() - start)

I'm not sure whether mode="r|" is incompatible with the S3 DataPipe/SDK or whether something is off with our implementation.

Versions

main branch

cc: @ejguan

NivekT added the bug label Sep 28, 2022

ejguan commented Sep 29, 2022

Here is what I have found regarding S3Handler:

Originally, I expected the stream returned by S3Handler to be non-seekable. However, it turns out that the whole archive (or file) is loaded into memory, based on the implementation here:

To make it stream, we would need a way to bind the C++ stream IO to Python via pybind, which is non-trivial. See a code reference: https://github.com/CadQuery/OCP/blob/master/pystreambuf.h

However, that doesn't explain why it fails to work properly with mode r|. I tested both modes on a local tar archive, and both work properly.

So this is definitely a bug in S3Loader.

Edit: I tested S3FileLoader, and the performance is on par for both modes.
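
The local check of both modes mentioned above can be approximated with the standard library alone. The following is only a minimal sketch, not the test that was actually run: it builds a tiny archive in memory (the file names and contents are made up for illustration) and reads it back with tarfile in random-access mode ("r:") and in stream mode ("r|"). The helper names make_tar_bytes and list_members are hypothetical.

```python
import io
import tarfile


def make_tar_bytes(files):
    """Build an uncompressed tar archive in memory from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_members(archive_bytes, mode):
    """Read every member in order, using the given tarfile mode ("r:" or "r|")."""
    results = []
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode=mode) as tar:
        for member in tar:
            # In stream mode, each member must be read before moving to the next.
            results.append((member.name, tar.extractfile(member).read()))
    return results


# Made-up contents, used only to compare the two modes.
archive = make_tar_bytes({"a.txt": b"alpha", "b.txt": b"beta"})
random_access = list_members(archive, "r:")
streamed = list_members(archive, "r|")
print(random_access == streamed)
```

On a well-formed local archive, both modes should yield the same members once each and then stop, which is the behavior the S3 pipeline was expected to match.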

NivekT commented Sep 29, 2022

The main branch is fine. The root cause is here.

Sorry for the false alarm :/

One discovery is that the S3 DataPipe downloads the whole archive into memory instead of reading it in chunks. We would prefer it to stream. I'll look into fsspec shortly to examine its behavior.
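
To illustrate what chunked streaming could look like, here is a rough sketch under stated assumptions. It is not how the S3 DataPipe or fsspec works: ChunkedReader and iter_chunks are hypothetical names, and the chunk iterator merely stands in for a remote source that delivers bytes piece by piece. The point is that tarfile's stream mode ("r|") only needs a read() method, so the reader never has to hold the full archive at once.

```python
import io
import tarfile


class ChunkedReader:
    """Hypothetical non-seekable reader that pulls bytes from an iterator of chunks."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._buffer = b""
        self.max_buffered = 0  # high-water mark, tracked for illustration only

    def read(self, size=-1):
        # Pull chunks until the request can be satisfied or the source is exhausted.
        while size < 0 or len(self._buffer) < size:
            chunk = next(self._chunks, None)
            if chunk is None:
                break
            self._buffer += chunk
            self.max_buffered = max(self.max_buffered, len(self._buffer))
        if size < 0:
            size = len(self._buffer)
        data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data


def iter_chunks(data, chunk_size):
    """Stand-in for a remote source that delivers the archive piece by piece."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


# Build a small archive in memory so the example needs no network access.
payload = {f"img{i}.bin": bytes([i]) * 50_000 for i in range(3)}
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, data in payload.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
archive = buf.getvalue()

# Read the archive back through the chunked reader in stream mode.
reader = ChunkedReader(iter_chunks(archive, chunk_size=4096))
sizes = {}
with tarfile.open(fileobj=reader, mode="r|") as tar:
    for member in tar:
        sizes[member.name] = len(tar.extractfile(member).read())

print(sizes)
print(reader.max_buffered, "bytes buffered at most; archive is", len(archive), "bytes")
```

In this sketch the reader's buffer stays well below the archive size, whereas mode="r:" would not work with such a reader at all, since it requires a seekable file object.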

NivekT closed this as completed Sep 29, 2022