Read COCO dataset from ZIP file #950

koenvandesande · 2019-05-23T10:22:58Z

Probably needs more discussion about what other datasets to apply it to, but this is an initial take for CocoCaptions and CocoDetection
Fixes #947

Where you'd normally have e.g. "train2014" as a folder, if you place "train2014.zip" next to that, it will transparently switch to the zipped version.
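The folder-or-ZIP switch described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `resolve_image_source` and `read_image_bytes` are hypothetical helper names, and the sketch assumes that archive members are stored as `<split>/<filename>`.

```python
import os
import zipfile


def resolve_image_source(root, split):
    """Return ("zip", path) if <root>/<split>.zip exists, else ("dir", path)."""
    zip_path = os.path.join(root, split + ".zip")
    if os.path.isfile(zip_path):
        return "zip", zip_path
    return "dir", os.path.join(root, split)


def read_image_bytes(root, split, filename):
    """Read one image's raw bytes from the ZIP if present, else from the folder."""
    kind, path = resolve_image_source(root, split)
    if kind == "zip":
        # Open the archive per call so that no file handle is shared
        # between DataLoader worker processes.
        with zipfile.ZipFile(path) as zf:
            # Assumption: members are stored as "<split>/<filename>".
            return zf.read(split + "/" + filename)
    with open(os.path.join(path, filename), "rb") as f:
        return f.read()
```

Reopening the archive on every read is the simplest way to stay safe across forked workers; a real implementation would likely cache the member index to avoid re-parsing the ZIP directory each time.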

fmassa

This is a good start, thanks a lot!

I have a few comments, and I'd like us to think a bit more about the API and how to make it easier for other parts of the codebase, like ImageFolder, to support zipped files instead of folders.

Also, we would need tests for this functionality, because it adds some non-trivial code.

codecov-io · 2019-05-24T08:54:07Z

Codecov Report

❗ No coverage uploaded for pull request base (master@2611f5c).
The diff coverage is 73.18%.

@@           Coverage Diff            @@
##             master    #950   +/-   ##
========================================
  Coverage          ?   65.6%           
========================================
  Files             ?      81           
  Lines             ?    6411           
  Branches          ?     983           
========================================
  Hits              ?    4206           
  Misses            ?    1902           
  Partials          ?     303

Impacted Files	Coverage Δ
torchvision/datasets/coco.py	`29.26% <0%> (ø)`
torchvision/datasets/__init__.py	`100% <100%> (ø)`
torchvision/datasets/omniglot.py	`86% <100%> (ø)`
torchvision/datasets/celeba.py	`71.6% <100%> (ø)`
torchvision/datasets/zippedfolder.py	`61.29% <61.29%> (ø)`
torchvision/datasets/vision.py	`53.94% <75%> (ø)`
torchvision/datasets/utils.py	`83.58% <88.23%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2611f5c...f28b324.

koenvandesande · 2019-05-31T10:43:25Z

With the addition of the ZippedImageFolder class, I'm finished in terms of features. Initially I tried to subclass DatasetFolder for ZippedImageFolder, but given the extent of the changes needed I made it into a separate class and .py file.
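To illustrate the idea of an ImageFolder-style dataset backed by a ZIP file, here is a minimal sketch. It is not the ZippedImageFolder implementation from this PR: the class name `ZippedFolderIndex` is hypothetical, it assumes the archive is laid out as `<class_name>/<image file>`, and it returns raw bytes instead of decoded images.

```python
import zipfile


class ZippedFolderIndex:
    """Hypothetical index over a ZIP laid out as <class_name>/<image file>."""

    def __init__(self, zip_path, extensions=(".jpg", ".jpeg", ".png")):
        self.zip_path = zip_path
        with zipfile.ZipFile(zip_path) as zf:
            # Keep only image members that live inside some directory.
            names = [
                n for n in zf.namelist()
                if "/" in n and not n.endswith("/") and n.lower().endswith(extensions)
            ]
        # The parent directory of each member serves as its class label.
        self.classes = sorted({n.rsplit("/", 1)[0] for n in names})
        self.class_to_idx = {c: i for i, c in enumerate(self.classes)}
        self.samples = [
            (n, self.class_to_idx[n.rsplit("/", 1)[0]]) for n in sorted(names)
        ]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        name, target = self.samples[index]
        # Reopen the archive per access; decoding the bytes is left to the caller.
        with zipfile.ZipFile(self.zip_path) as zf:
            return zf.read(name), target
```

Because the class list comes from the archive listing rather than from a directory scan, very little of ImageFolder's discovery logic carries over, which is consistent with the decision above to use a separate class.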

fmassa

This is looking pretty good, thanks!

I want to think a bit more through this though, as there are some things that I think could be improved. I'll have another look on Monday / Tuesday next week.

If you don't mind, I might send some patches on top of your branch?

koenvandesande · 2019-06-03T07:15:22Z

Sure, please do provide patches on top of this.

koenvandesande · 2019-07-13T17:18:10Z

Updated branch so that it merges cleanly with master again.

fmassa · 2019-07-15T12:47:19Z

@koenvandesande thanks for updating the PR!

I am still unsure about how to nicely place this with the rest of torchvision datasets. In particular, the discussion in #1080 is very relevant.

As such, I'm holding off on merging this PR for the time being, but this is a nice addition that would be good to have in torchvision at some point.

* Rewrite torchvision packaging (pytorch#1209) Following a similar line of inquiry to pytorch/audio#217 * Packaging fixes (pytorch#1214) Add uploading support, make CUDA builds actually work. * 0.4.0 parameters Signed-off-by: Edward Z. Yang <ezyang@fb.com> * Actually upload wheels (please port to master) Signed-off-by: Edward Z. Yang <ezyang@fb.com> * Put macos binaries in the right place Signed-off-by: Edward Z. Yang <ezyang@fb.com> * Propagate more environment variables. Signed-off-by: Edward Z. Yang <ezyang@fb.com> * Change the version number Signed-off-by: Edward Z. Yang <ezyang@fb.com> * Go time Signed-off-by: Edward Z. Yang <ezyang@fb.com>

koenvandesande · 2019-08-13T12:47:19Z

The discussion in #1080 seems to have quieted down (without consensus so far?).
This pull request could be split into two: (1) the ZippedImageFolder, an efficient alternative to ImageFolder (which the user has to opt into explicitly), and (2) changes to COCO, CelebA and Omniglot so that they are read efficiently from ZIP in a transparent way (no user changes).

Is there interest in merging either (1) or (2) on its own in the near future?

fmassa · 2019-08-28T09:12:47Z

Hi @koenvandesande

Sorry for the delay in replying, I just got back from holidays.

I need to get back to the discussions in #1080 and pytorch/pytorch#24915 more generally.

There is value in this feature, but I will need a bit more time to think it through. I'll be reviewing this again on Tuesday, Sep 3rd.

ain-soph · 2020-12-14T00:29:22Z

Hi, I think it would be beneficial to support this zipped loading style. It doesn't seem to be limited to COCO; it would also be nice to have a zipped=True argument for the generic ImageFolder class, if it is shown to speed up loading on HDDs. When it is enabled but the ZIP file is not found (while the images exist), I would expect it to pack all images into an uncompressed ZIP file to speed up future runs. However, I strongly doubt it will bring any speedup, because __getitem__ still reads one small file at a time; the only difference is that the file's location on disk is obtained from the ZIP header rather than from the filesystem.
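Packing an image folder into an uncompressed ZIP, as suggested above, can be done with the standard library alone. The sketch below is only an illustration: `pack_folder_uncompressed` is a hypothetical helper, not a torchvision function, and it assumes that `zip_path` lies outside `folder` so the archive does not try to include itself.

```python
import os
import zipfile


def pack_folder_uncompressed(folder, zip_path):
    """Write every file under `folder` into `zip_path` without compression."""
    # ZIP_STORED writes the raw bytes, so each member can be read back
    # without any decompression work.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for dirpath, _dirnames, filenames in os.walk(folder):
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                # Store paths relative to `folder`, with "/" separators
                # as used inside ZIP archives.
                arcname = os.path.relpath(full, folder).replace(os.sep, "/")
                zf.write(full, arcname)
    return zip_path
```

Whether reading from such an archive is actually faster than reading loose files depends on the disk and filesystem, which is exactly the doubt raised above; the sketch makes no claim either way.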

Btw, do we have to add a new class ZippedFolder to support this feature? I thought it could be part of ImageFolder.

Another feature might be loading all images from the ZIP or folder into memory at initialization, just like CIFAR10 does, so that no time is spent loading images one by one during iteration (__getitem__). Maybe another argument, in_memory=True, in ImageFolder? Currently I'm using my own custom Dataset class, but I think it's a generic need and it would be nice if torchvision could support it natively.
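The in-memory variant could look roughly like the sketch below. It is a hypothetical stand-in for the custom Dataset class mentioned above, not an existing torchvision option: the name `InMemoryFolder` is an assumption, the layout is assumed to be `root/<class>/<file>`, and raw bytes are kept instead of decoded images.

```python
import os


class InMemoryFolder:
    """Hypothetical dataset that reads every file under root/<class>/ into RAM once."""

    def __init__(self, root):
        # Each immediate subdirectory of `root` is treated as one class.
        self.classes = sorted(e.name for e in os.scandir(root) if e.is_dir())
        self.data = []  # list of (raw bytes, class index)
        for idx, cls in enumerate(self.classes):
            cls_dir = os.path.join(root, cls)
            for name in sorted(os.listdir(cls_dir)):
                path = os.path.join(cls_dir, name)
                if os.path.isfile(path):
                    with open(path, "rb") as f:
                        self.data.append((f.read(), idx))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # No disk access here: the bytes were loaded in __init__.
        return self.data[index]
```

This trades start-up time and memory for per-sample latency, so it only makes sense when the whole dataset fits comfortably in RAM.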

I think it would be really useful and save academic researchers quite some time. On our servers, datasets are stored on HDDs with plenty of disk space. GPUs and memory are strong, but the disks are the bottleneck. Most datasets are small-scale and in ImageFolder format. I wish I could load small datasets directly into memory and use a ZIP file for large ones (though I'm not sure the ZIP would bring any speedup).

yassineAlouini · 2022-05-20T14:40:14Z

Thanks @koenvandesande for the contribution and sorry for taking so long to get back to you.

There is a new dataset API being developed and old datasets are being ported as discussed here: #5336.

I am not 100% sure but I think that new features have been added to easily read zipped files/folders. @pmeier knows a lot more about this new API so I hope he will add details here.

Thus, I would propose waiting a bit until the new dataset API is finalized and merged and then seeing if the features you have contributed @koenvandesande are still useful. If they are, someone familiar with the new API design can help you add them as needed and you will get proper attribution of course.

Again, thanks for the contribution and sorry for the long wait.

pmeier · 2022-05-23T07:49:56Z

@yassineAlouini

"I am not 100% sure but I think that new features have been added to easily read zipped files/folders. @pmeier knows a lot more about this new API so I hope he will add details here."

Yup. The prototype datasets will read from archives by default:

vision/torchvision/prototype/datasets/_builtin/coco.py

Lines 95 to 98 in b969cca

    
    images = HttpResource(
        f"{self._IMAGE_URL_BASE}/{self._split}{self._year}.zip",
        sha256=self._IMAGES_CHECKSUMS[(self._year, self._split)],
    )

vision/torchvision/prototype/datasets/utils/_resource.py

Lines 64 to 70 in b969cca

    
    dp = FileOpener(IterableWrapper((str(path),)), mode="rb")
    archive_loader = self._guess_archive_loader(path)
    if archive_loader:
        dp = archive_loader(dp)
    return dp

@koenvandesande It seems the main contribution here is the ZippedImageFolder, correct? The new API will no longer use the old ImageFolder, but will instead use primitives from torchdata to build the dataset:

vision/torchvision/prototype/datasets/_folder.py

Lines 41 to 49 in b969cca

    
    root = pathlib.Path(root).expanduser().resolve()
    categories = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    masks: Union[List[str], str] = [f"*.{ext}" for ext in valid_extensions] if valid_extensions is not None else ""
    dp = FileLister(str(root), recursive=recursive, masks=masks)
    dp: IterDataPipe = Filter(dp, functools.partial(_is_not_top_level_file, root=root))
    dp = hint_sharding(dp)
    dp = hint_shuffling(dp)
    dp = FileOpener(dp, mode="rb")
    return Mapper(dp, functools.partial(_prepare_sample, root=root, categories=categories)), categories

For now, we only support loading datasets in the image folder structure from extracted archives, but changing this to read from an archive shouldn't be too hard. My proposal to resolve this is to open a new issue tracking this feature and close this PR given that it is no longer compatible with the new API. Is that ok for you?

koenvandesande added 2 commits May 23, 2019 12:17

Read COCO dataset images from its zipfile when it is there

e7f6f66

Where you'd normally have e.g. "train2014" as a folder, if you place "train2014.zip" next to that, it will transparently switch to the zipped version.

Also do it for CocoCaptions

5aebeae

fmassa requested changes May 23, 2019

Move code into utils.py, remove the magic constants and import them i…

7803588

…nstead, make it a new-style class

koenvandesande added 7 commits May 24, 2019 12:16

Add test for zip lookup class

3e5e03d

Fix for Python versions < 3.6

4e39618

Generalize to CelebA, move part of shared logic into VisionDataset

9a59666

Fix import

0bc30f2

flake8 fixes

a8b483a

Simplify implementation of ZipLookup by not keeping file descriptor open

26d51d0

Remove unused import

29d7df8

fmassa reviewed May 24, 2019

torchvision/datasets/utils.py (resolved)

koenvandesande and others added 17 commits May 24, 2019 14:18

Support reading images from ZIP for Omniglot dataset

46ceaf0

Add common get_path_or_fp function

3f10d56

Needs just one special case argument for Coco, removed the "base_folder" for CelebA which really should have been part of root all along

Forgot one spot

534d35d

Remove syntax unsupported by Python 2, replace argument with code tha…

26227de

…t tries both options instead

Delete _C.cp37-win_amd64.pyd

547b618

Oops, shouldn't have checked in that file

Fixes and extra unit tests

559a5cf

Fixes and extra unit tests

741e3bb

Merge branch 'read_zipped_data' of github.com:koenvandesande/vision i…

f037fe9

…nto read_zipped_data

Fix

c0d4dbf

Fix

481d45c

Need to rewrite Omniglot ZIP-file because it uses compression

32b2311

Fix flake8

255a6f9

Omniglot depends on pandas, and that is tested now in test_datasets

8adf9af

Fix

7753710

Add extra check

bfa7510

Refactor

afd2d04

Add test

2b7a044

Add ZippedImageFolder class which reads a zipped version of the data …

2ac26ce

…you'd normally pass to ImageFolder

fmassa reviewed May 31, 2019

torchvision/datasets/celeba.py (outdated, resolved)

koenvandesande added 5 commits July 12, 2019 16:41

Merge branch 'master' into read_zipped_data

629c851

Fix flake8

9396c35

Update test_zippedfolder.py

48894bf

Fix test

0fa8035

Fix omniglot

c05281a

ezyang added 3 commits August 8, 2019 12:31

Don't build nightlies on 0.4.0 branch.

66bc6f9

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Refactor version suffix so conda packages don't get suffixes. (pytorc…

a1ed206

…h#1218) (pytorch#1219) Signed-off-by: Edward Z. Yang <ezyang@fb.com>

koenvandesande and others added 8 commits August 30, 2019 16:29

Merge remote-tracking branch 'upstream/v0.4.0' into read_zipped_data

7a8b133

Merge remote-tracking branch 'upstream/master' into read_zipped_data

b8c2c5d

Merge branch 'master' into read_zipped_data

8df35fa

Merge branch 'master' into read_zipped_data

d68ce83

Merge branch 'master' into read_zipped_data

17de30d

Update config.yml

393cfd6

Remove EOL

f28b324

Merge branch 'master' into read_zipped_data

6247d96

ain-soph mentioned this pull request Dec 30, 2020

'make_dataset' as staticmethod of 'DatasetFolder' #3215

Merged

pmeier self-assigned this Apr 8, 2022

pmeier closed this Nov 7, 2022
