For large datasets on, e.g., university clusters where data storage is an NFS mount, reading individual files can be slow, and such a setup doesn't support reading ahead. In the cloud, you typically have SSD storage, but unzipping the dataset still takes time.
Would you be open to receiving a pull request that reads the COCO dataset from its zipped version? It adds around 10 lines in the COCO Detection class, and adds another Python file for reading ZIP files in a fork-safe manner (so it works with distributed training).
You mean that all the images are in a zip file?
And how would the reading be structured? Does it unzip everything locally, or read from the zipped file without extracting it all?
In general, I don't see why this would be something specific to the COCO dataset. But finding a generic way of supporting this for all datasets is something that would be great to have.
Yes, all the images are in a single ZIP file, and they are read without unzipping, with the constraint (added by me) that the ZIP file must not use compression (which is the case for COCO).
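To make the idea concrete, here is a minimal sketch of what such a reader could look like. It is not the code from the proposed pull request: the class name `ForkSafeZipReader` and its methods are hypothetical, and the approach shown (reopening the archive once per process, keyed on the process ID) is just one assumed way of staying fork-safe with worker processes.

```python
import os
import zipfile


class ForkSafeZipReader:
    """Hypothetical helper: reads members of an uncompressed ZIP archive.

    The archive is opened lazily and reopened whenever the process ID
    changes, so a forked worker never shares a file handle with its parent.
    """

    def __init__(self, path):
        self.path = path
        self._zip = None
        self._pid = None

    def _handle(self):
        pid = os.getpid()
        if self._zip is None or self._pid != pid:
            # First use in this process: open a private handle.
            self._zip = zipfile.ZipFile(self.path, "r")
            self._pid = pid
        return self._zip

    def read(self, name):
        zf = self._handle()
        info = zf.getinfo(name)  # looked up in the archive's index
        if info.compress_type != zipfile.ZIP_STORED:
            # Assumed constraint from the discussion: no compression.
            raise ValueError(f"{name!r} is compressed; expected ZIP_STORED")
        return zf.read(name)

    def close(self):
        if self._zip is not None and self._pid == os.getpid():
            self._zip.close()
        self._zip = None
        self._pid = None

    def __getstate__(self):
        # Drop the open handle when pickled (e.g. spawn-based workers).
        return {"path": self.path, "_zip": None, "_pid": None}
```

A dataset class could then call `reader.read("train2017/000000000009.jpg")` (an illustrative member name) and decode the returned bytes with its usual image loader.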
Note that ZIP files are suited for this because they have an index. For tar files, it isn't very efficient because you need to walk over the entire file first to build an index.
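As a rough illustration of the difference, the sketch below builds a tiny ZIP and a tiny tar archive in memory and collects member offsets from each. The helper names are made up for this example. With `zipfile`, the offsets come from the index that is read when the archive is opened; with `tarfile`, they are only known after iterating over the member headers one by one.

```python
import io
import tarfile
import zipfile


def build_zip(files):
    """Build an uncompressed in-memory ZIP from a {name: bytes} dict."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    buf.seek(0)
    return buf


def build_tar(files):
    """Build an uncompressed in-memory tar from a {name: bytes} dict."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def zip_offsets(buf):
    # The index is loaded on open; infolist() does not scan member data.
    with zipfile.ZipFile(buf) as zf:
        return {i.filename: i.header_offset for i in zf.infolist()}


def tar_offsets(buf):
    # No index: iterating reads each member header in sequence.
    with tarfile.open(fileobj=buf, mode="r:") as tf:
        return {m.name: m.offset_data for m in tf}


files = {"a.jpg": b"a" * 10, "b.jpg": b"b" * 20}
zip_index = zip_offsets(build_zip(files))
tar_index = tar_offsets(build_tar(files))
```

For a few files the two are indistinguishable, but on an archive with many members the tar walk has to touch every header before the last member can be located.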
I'll first create something just for COCO, and then we can look at which other datasets are stored as ZIP files.