Read COCO dataset from ZIP file #950

Closed
wants to merge 51 commits into from

Conversation

koenvandesande
Contributor

Probably needs more discussion about what other datasets to apply it to, but this is an initial take for CocoCaptions and CocoDetection
Fixes #947

Where you'd normally have e.g. "train2014" as a folder, if you place "train2014.zip" next to that, it will transparently switch to the zipped version.
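
To illustrate the transparent switch described above, here is a minimal sketch using only the standard library. The helper names `resolve_image_source` and `read_image_bytes` are hypothetical and do not come from this PR; the sketch also assumes the images sit at the top level of the archive, which may not match the real COCO ZIP layout.

```python
import os
import zipfile


def resolve_image_source(root):
    """Return ("zip", path) if "<root>.zip" exists next to the folder, else ("dir", root)."""
    zip_path = root.rstrip("/\\") + ".zip"
    if os.path.isfile(zip_path):
        return "zip", zip_path
    return "dir", root


def read_image_bytes(root, filename):
    """Read one image's raw bytes from the ZIP if present, otherwise from the folder."""
    kind, path = resolve_image_source(root)
    if kind == "zip":
        # Assumption: members are stored at the archive top level,
        # not under a "train2014/" prefix.
        with zipfile.ZipFile(path) as zf:
            return zf.read(filename)
    with open(os.path.join(path, filename), "rb") as f:
        return f.read()
```

Reopening the archive on every read keeps the sketch short; a real dataset would more likely keep one handle open per worker process.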
Member

@fmassa fmassa left a comment

This is a good start, thanks a lot!

I have a few comments, and I'd like us to think a bit more about the API and how to make it easier for other parts of the codebase, like ImageFolder, to support zipped files instead of folders.

Also, we would need tests for this functionality, because it adds some non-trivial code.

Resolved review threads (outdated): torchvision/datasets/forksafeziplookup.py (5), torchvision/datasets/coco.py (1)
@codecov-io

codecov-io commented May 24, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@2611f5c).
The diff coverage is 73.18%.

@@           Coverage Diff            @@
##             master    #950   +/-   ##
========================================
  Coverage          ?   65.6%           
========================================
  Files             ?      81           
  Lines             ?    6411           
  Branches          ?     983           
========================================
  Hits              ?    4206           
  Misses            ?    1902           
  Partials          ?     303
Impacted Files                          Coverage Δ
torchvision/datasets/coco.py            29.26% <0%> (ø)
torchvision/datasets/__init__.py        100% <100%> (ø)
torchvision/datasets/omniglot.py        86% <100%> (ø)
torchvision/datasets/celeba.py          71.6% <100%> (ø)
torchvision/datasets/zippedfolder.py    61.29% <61.29%> (ø)
torchvision/datasets/vision.py          53.94% <75%> (ø)
torchvision/datasets/utils.py           83.58% <88.23%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2611f5c...f28b324.

@koenvandesande
Contributor Author

koenvandesande commented May 31, 2019

With the addition of the ZippedImageFolder class, I'm finished in terms of features. Initially I tried to subclass DatasetFolder for ZippedImageFolder, but given the extent of the changes needed I made it into a separate class and .py file.
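
For readers without access to the diff, a rough sketch of what a ZIP-backed image folder could look like follows. It is not the PR's `ZippedImageFolder`: the class name `ZipImageFolderSketch` is hypothetical, it returns undecoded bytes rather than images, and it assumes the first path component inside the archive is the class name. The per-process reopening of the archive is one plausible reading of the file name `forksafeziplookup.py`, not a description of that file.

```python
import os
import zipfile


class ZipImageFolderSketch:
    """Hypothetical ImageFolder-like dataset backed by one ZIP archive."""

    def __init__(self, zip_path, extensions=(".jpg", ".jpeg", ".png")):
        self.zip_path = zip_path
        with zipfile.ZipFile(zip_path) as zf:
            # Keep only image files that live inside a class directory.
            names = [
                n for n in zf.namelist()
                if "/" in n and not n.endswith("/") and n.lower().endswith(extensions)
            ]
        # Assumption: "<class>/<file>" layout, so the first component is the class.
        self.classes = sorted({n.split("/", 1)[0] for n in names})
        self.class_to_idx = {c: i for i, c in enumerate(self.classes)}
        self.samples = [(n, self.class_to_idx[n.split("/", 1)[0]]) for n in sorted(names)]
        self._zf = None
        self._pid = None

    def _archive(self):
        # Open lazily and reopen after a fork, so that worker processes
        # do not share one file handle (and one file offset).
        if self._zf is None or self._pid != os.getpid():
            self._zf = zipfile.ZipFile(self.zip_path)
            self._pid = os.getpid()
        return self._zf

    def __getstate__(self):
        # Drop the open handle when the dataset is pickled for spawned workers.
        state = self.__dict__.copy()
        state["_zf"] = None
        return state

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        name, target = self.samples[index]
        return self._archive().read(name), target
```

Decoding the returned bytes (for example with PIL) and applying transforms are left out to keep the sketch focused on the archive handling.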

Member

@fmassa fmassa left a comment

This is looking pretty good, thanks!

I want to think this through a bit more, though, as there are some things that I think could be improved. I'll have another look on Monday / Tuesday next week.

If you don't mind, I might send some patches on top of your branch?

Resolved review thread (outdated): torchvision/datasets/celeba.py
@koenvandesande
Contributor Author

Sure, please do provide patches on top of this.

@koenvandesande
Contributor Author

Updated branch so that it merges cleanly with master again.

@fmassa
Member

fmassa commented Jul 15, 2019

@koenvandesande thanks for updating the PR!

I am still unsure about how to nicely place this with the rest of torchvision datasets. In particular, the discussion in #1080 is very relevant.

As such, I'm holding off on merging this PR for the time being, but this is a nice addition that would be good to have in torchvision at some point.

ezyang added 3 commits August 8, 2019 12:31
* Rewrite torchvision packaging (pytorch#1209)

Following a similar line of inquiry to pytorch/audio#217

* Packaging fixes (pytorch#1214)

Add uploading support, make CUDA builds actually work.

* 0.4.0 parameters

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Actually upload wheels (please port to master)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Put macos binaries in the right place

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Propagate more environment variables.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Change the version number

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Go time

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
@koenvandesande
Contributor Author

The discussion in #1080 seems to have quieted down (without consensus so far?).
This pull request could be split into two: (1) the ZippedImageFolder, an efficient alternative to ImageFolder (which needs to be used explicitly by the user), and (2) changes to COCO, CelebA and OmniGlot to be read efficiently from ZIP in a transparent way (no user changes).

Is there interest in merging only (1) or only (2) in the near future?

@fmassa
Member

fmassa commented Aug 28, 2019

Hi @koenvandesande

Sorry for the delay in replying, I just got back from holidays.

I need to get back to the discussions in #1080 and pytorch/pytorch#24915 more generally.

There is value in this feature, but I will need a bit more time to think it through. I'll be reviewing this again on Tuesday, Sep 3rd.

@ain-soph
Contributor

ain-soph commented Dec 14, 2020

Hi, I think it would be beneficial to support this zipped loading style. It doesn't seem to be limited to COCO; it would also be nice to have a zipped=True argument for the generic ImageFolder class, if it is shown to give a speedup on HDDs. When it is enabled but the ZIP file is not found (and the images exist), we would expect it to pack all images into a ZIP file without compression, to speed up future runs. But I strongly doubt it will give any speedup, because __getitem__ still reads one small file at a time; the only difference is that the file's location on disk comes from the ZIP header rather than from the filesystem.
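
The "pack all images into a ZIP file without compression" step could look roughly like the sketch below. The function name `pack_folder_uncompressed` is hypothetical; `zipfile.ZIP_STORED` is the standard-library constant for storing members uncompressed. The sketch assumes the output archive is written outside the folder being packed.

```python
import os
import zipfile


def pack_folder_uncompressed(folder, zip_path):
    """Write every file under `folder` into `zip_path` without compression.

    Returns the number of files written.
    """
    count = 0
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for dirpath, _dirnames, filenames in os.walk(folder):
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                # Store paths relative to `folder`, with "/" separators.
                arcname = os.path.relpath(full, folder).replace(os.sep, "/")
                zf.write(full, arcname)
                count += 1
    return count
```

Whether reading from such an archive is actually faster on an HDD is exactly the open question raised above; the sketch only shows the packing step and makes no performance claim.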

Btw, do we have to add a new class ZippedFolder to support this feature? I thought it could be part of ImageFolder.

Another feature could be loading all images from the ZIP or folder into memory at initialization, just as CIFAR10 does, so that no time is spent loading images one by one during iteration (__getitem__). Maybe another argument, in_memory=True, in ImageFolder? Currently I'm using my own custom Dataset class, but I think this is a generic need and it would be nice if torchvision supported it natively.
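
A minimal sketch of the in-memory idea, assuming the dataset fits in RAM. The class name `InMemoryFilesSketch` is hypothetical and is not an existing torchvision class; it caches the encoded file bytes, so decoding would still happen per item.

```python
class InMemoryFilesSketch:
    """Hypothetical wrapper that reads every file once, at construction time."""

    def __init__(self, paths):
        self.paths = list(paths)
        self._data = []
        for path in self.paths:
            # One sequential pass over the disk; later accesses hit RAM only.
            with open(path, "rb") as f:
                self._data.append(f.read())

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        # Returns the encoded bytes; decoding is left to the caller.
        return self._data[index]
```

Caching encoded bytes rather than decoded arrays keeps the memory footprint close to the on-disk size, at the cost of decoding on every access.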

I think it would be really useful and save academic researchers quite some time. On our servers, datasets are stored on HDDs with plenty of disk space. GPUs and memory are strong, but the disks are the bottleneck. Most datasets are small-scale and in ImageFolder format. I wish I could load small-scale datasets directly into memory and use a ZIP file for large-scale ones (though I'm not sure ZIP would give any speedup).

@yassineAlouini
Contributor

yassineAlouini commented May 20, 2022

Thanks @koenvandesande for the contribution and sorry for taking so long to get back to you.

There is a new dataset API being developed and old datasets are being ported as discussed here: #5336.

I am not 100% sure but I think that new features have been added to easily read zipped files/folders. @pmeier knows a lot more about this new API so I hope he will add details here.

Thus, I would propose waiting a bit until the new dataset API is finalized and merged and then seeing if the features you have contributed @koenvandesande are still useful. If they are, someone familiar with the new API design can help you add them as needed and you will get proper attribution of course.

Again, thanks for the contribution and sorry for the long wait.

@pmeier
Collaborator

pmeier commented May 23, 2022

@yassineAlouini

I am not 100% sure but I think that new features have been added to easily read zipped files/folders. @pmeier knows a lot more about this new API so I hope he will add details here.

Yup. The prototype datasets will read from archives by default:

images = HttpResource(
    f"{self._IMAGE_URL_BASE}/{self._split}{self._year}.zip",
    sha256=self._IMAGES_CHECKSUMS[(self._year, self._split)],
)

dp = FileOpener(IterableWrapper((str(path),)), mode="rb")
archive_loader = self._guess_archive_loader(path)
if archive_loader:
    dp = archive_loader(dp)
return dp

@koenvandesande It seems the main contribution here is the ZippedImageFolder, correct? The new API will no longer use the old ImageFolder, but instead uses primitives from torchdata to build the dataset:

root = pathlib.Path(root).expanduser().resolve()
categories = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
masks: Union[List[str], str] = [f"*.{ext}" for ext in valid_extensions] if valid_extensions is not None else ""
dp = FileLister(str(root), recursive=recursive, masks=masks)
dp: IterDataPipe = Filter(dp, functools.partial(_is_not_top_level_file, root=root))
dp = hint_sharding(dp)
dp = hint_shuffling(dp)
dp = FileOpener(dp, mode="rb")
return Mapper(dp, functools.partial(_prepare_sample, root=root, categories=categories)), categories

For now, we only support loading datasets in the image folder structure from extracted archives, but changing this to read from an archive shouldn't be too hard. My proposal to resolve this is to open a new issue tracking this feature and close this PR given that it is no longer compatible with the new API. Is that ok for you?

@pmeier pmeier closed this Nov 7, 2022
Successfully merging this pull request may close these issues.

Read COCO dataset from ZIP file?