
Read COCO dataset from ZIP file? #947


Open
koenvandesande opened this issue May 23, 2019 · 3 comments

Comments

@koenvandesande
Contributor

For large datasets on e.g. university clusters, where your data storage is an NFS mount, reading individual files can be slow. It also doesn't support reading ahead. In the cloud, you typically have SSD storage, but unzipping the dataset still takes time.

Would you be open to receiving a pull request that reads the COCO dataset from its zipped version? It adds around 10 lines in the COCO Detection class, and adds another Python file for reading ZIP files in a fork-safe manner (so it works with distributed training).
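As a rough illustration of what "fork-safe" could mean here, the sketch below opens the archive lazily and re-opens it whenever the process ID changes, so a worker process never reuses a file handle inherited from its parent. The class name `ForkSafeZipReader` and its methods are hypothetical and are not taken from the proposed pull request; only the standard-library `zipfile` and `os` modules are used.

```python
import os
import zipfile


class ForkSafeZipReader:
    """Hypothetical helper: opens the archive lazily, once per process."""

    def __init__(self, path):
        self.path = path
        self._pid = None   # PID of the process that opened the handle
        self._zip = None   # the open ZipFile, created on first read

    def _handle(self):
        pid = os.getpid()
        if self._zip is None or self._pid != pid:
            # First use in this process, or first use after a fork:
            # open a fresh handle instead of reusing the parent's.
            self._zip = zipfile.ZipFile(self.path, "r")
            self._pid = pid
        return self._zip

    def read(self, name):
        """Return the raw bytes of one archive member."""
        return self._handle().read(name)

    def __getstate__(self):
        # Drop the open handle when the object is pickled,
        # e.g. when it is sent to spawn-based worker processes.
        return {"path": self.path, "_pid": None, "_zip": None}
```

A dataset class could hold one such reader and call `reader.read(member_name)` in its item lookup, then decode the returned bytes as an image. Whether the actual pull request is structured this way is an assumption.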

@fmassa
Member

fmassa commented May 23, 2019

You mean that all the images are in a zip file?
And how would the reading be structured? Does it unzip everything locally, or read from the zipped file without uncompressing it all?

In general, I don't see why this would be something specific to the COCO dataset. But finding a generic way of supporting this for all datasets is something that would be great to have.

@koenvandesande
Contributor Author

Yes, all the images are in a zip file and they are read without unzipping, with the constraint (added by me) that the ZIP file shouldn't use compression (which is the case for COCO).
Note that ZIP files are suited for this because they have an index (the central directory at the end of the archive). Tar files have no such index, so reading them this way isn't very efficient: you need to walk over the entire file first to build one.
I'll first create something just for COCO, and then we can look at which other datasets are stored as ZIP files.
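To make the two points above concrete, here is a minimal sketch using only Python's standard `zipfile` module. It checks from the archive's index that every member is stored without compression, and reads a single member by name without extracting anything. The function names `all_members_stored` and `read_member` are made up for illustration, and whether the planned change performs such a check is an assumption.

```python
import zipfile


def all_members_stored(zip_path):
    """Return True if every member is stored without compression."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        # infolist() is built from the central directory (the ZIP index),
        # so no member data is read here.
        return all(info.compress_type == zipfile.ZIP_STORED
                   for info in zf.infolist())


def read_member(zip_path, member_name):
    """Read one member's bytes by name, without extracting the archive."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        return zf.read(member_name)
```

The bytes returned by `read_member` can then be handed to an image decoder. Opening the archive on every call keeps the example short; a real loader would likely keep the handle open between reads.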

@koenvandesande
Contributor Author

This could easily apply to the following datasets as well (because they are stored as ZIP files):

  • celeba
  • omniglot
  • phototour (though not really, because it does postprocessing on the files after extraction)
