Implement ZipFolder #3510

Closed · wants to merge 8 commits into master

Conversation

@ain-soph (Contributor) commented Mar 5, 2021

Implement a ZipFolder class, which follows my previous PR #3215.
The idea is very similar to the TarDataset issue on pytorch/pytorch.
It archives the ImageFolder contents into a zip without any compression. The methods are almost the same as ImageFolder's.

Advantage: a single archive file is better for long-term use, and it makes loading and transferring faster and more convenient by avoiding small-file I/O (when memory=True), especially on HDDs.
When the memory argument is set to True, all bytes of the zip are read into memory at initialization. Otherwise, zipfile loads lazily by default, which gives the same behavior as ImageFolder.

Besides the basic utility, I also add a static method initialize_from_folder that converts a folder (following the ImageFolder layout) into the zip format.
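
A minimal sketch of the idea (illustrative only, not the exact code in this PR; it assumes the [root_folder_name]/[target_class]/[img_file] layout discussed below):

```python
# Illustrative sketch, not the code in this PR.
import io
import zipfile
from PIL import Image
from torch.utils.data import Dataset

IMG_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.bmp')  # shortened list for the sketch


class ZipFolder(Dataset):
    def __init__(self, zip_path: str, memory: bool = True, transform=None):
        self.transform = transform
        if memory:
            # Read the whole archive into RAM once, then serve members from the buffer.
            with open(zip_path, 'rb') as f:
                self.zip_file = zipfile.ZipFile(io.BytesIO(f.read()), 'r')
        else:
            # Lazy: members are read from disk on demand, like ImageFolder.
            self.zip_file = zipfile.ZipFile(zip_path, 'r')
        names = [n for n in self.zip_file.namelist()
                 if n.lower().endswith(IMG_EXTENSIONS)]
        # The second-to-last path component is the class directory.
        classes = sorted({n.split('/')[-2] for n in names})
        self.class_to_idx = {c: i for i, c in enumerate(classes)}
        self.samples = [(n, self.class_to_idx[n.split('/')[-2]]) for n in names]

    def __getitem__(self, index: int):
        name, target = self.samples[index]
        with self.zip_file.open(name) as f:
            img = Image.open(f).convert('RGB')
        if self.transform is not None:
            img = self.transform(img)
        return img, target

    def __len__(self) -> int:
        return len(self.samples)
```

Holding the whole archive in a single BytesIO buffer is what lets memory=True skip per-file disk seeks entirely.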

@ain-soph (Contributor, Author) commented Mar 5, 2021

Needs discussion:

  1. The method initialize_from_folder might need a better name (candidates: init_from_folder, folder_to_zip).
  2. It might not be appropriate to use io.BytesIO for the type annotation.
  3. Potential file structure of the zip file (zip filename == [root_folder_name]_store.zip):
    a. (current) [root_folder_name]/[target_class]/[img_file]
    b. [target_class]/[img_file]
  4. We need to verify that the compression type is ZIP_STORED (a rough sketch of this check follows below).

Unit tests and docs still need to be written if any reviewer thinks this PR is worth pursuing.
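
A rough sketch of the folder-to-zip conversion plus the ZIP_STORED check from point 4 (illustrative; the final name and signature are exactly what is under discussion above):

```python
# Illustrative sketch; the method name is one of the candidates listed above.
import os
import zipfile


def initialize_from_folder(root: str, zip_path: str = None) -> str:
    root = os.path.normpath(root)
    if zip_path is None:
        zip_path = root + '_store.zip'
    root_name = os.path.basename(root)
    # ZIP_STORED archives the already-compressed images without re-compression.
    with zipfile.ZipFile(zip_path, 'w', compression=zipfile.ZIP_STORED) as zf:
        for dirpath, _, filenames in os.walk(root):
            for filename in sorted(filenames):
                file_path = os.path.join(dirpath, filename)
                # Layout (a) above: [root_folder_name]/[target_class]/[img_file]
                arcname = os.path.join(root_name, os.path.relpath(file_path, root))
                zf.write(file_path, arcname=arcname)
    return zip_path


def check_zip_stored(zip_path: str) -> bool:
    # Point 4: verify every member was stored without compression.
    with zipfile.ZipFile(zip_path, 'r') as zf:
        return all(info.compress_type == zipfile.ZIP_STORED for info in zf.infolist())
```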

@codecov (bot) commented Mar 5, 2021

Codecov Report

Merging #3510 (c8f167a) into master (c991db8) will decrease coverage by 0.33%.
The diff coverage is 27.86%.

@@            Coverage Diff             @@
##           master    #3510      +/-   ##
==========================================
- Coverage   78.70%   78.37%   -0.34%     
==========================================
  Files         105      105              
  Lines        9735     9788      +53     
  Branches     1563     1575      +12     
==========================================
+ Hits         7662     7671       +9     
- Misses       1582     1626      +44     
  Partials      491      491              
Impacted Files                     Coverage Δ
torchvision/datasets/folder.py     59.71% <27.86%> (-26.34%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@fmassa (Member) commented Mar 8, 2021

Hi,

Thanks for the PR!

While I think this adds useful functionality, I think it's better to handle this in a more holistic way that abstracts away whether the file is zipped / tar / etc.

As you mentioned, there is an ongoing effort in pytorch/pytorch#49440 to restructure PyTorch datasets so that they can be more modular.
From this perspective, DatasetFolder could be implemented with something similar to:

DatasetFolder = GroupByPrefix(ListFiles())

and from the same perspective, having a ZipFolder would just mean doing

ZipFolder = GroupByPrefix(ZipListFiles())

ZipListFiles would then be the elementary building block, which would allow all the other datasets in torchvision to also be written to support zip / tar / etc. (if they are implemented in a way compatible with the new abstractions).

As such, I'd be tempted to wait until the new dataset abstractions are ready before we go on modifying the datasets in torchvision, as I foresee that we will be making a number of changes in the near future.

Thoughts?
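
Purely to illustrate the composition (GroupByPrefix, ListFiles and ZipListFiles are placeholder names from the comment above, not an existing API; the blocks are written here as plain generator functions rather than the eventual datapipe classes):

```python
# Placeholder building blocks, illustrative only.
import os
import zipfile
from typing import IO, Iterator, Tuple


def list_files(root: str) -> Iterator[Tuple[str, IO[bytes]]]:
    # Elementary block for plain folders: yield (path, binary stream) pairs.
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            yield path, open(path, 'rb')


def zip_list_files(zip_path: str) -> Iterator[Tuple[str, IO[bytes]]]:
    # Elementary block for zip archives, exposing the same (path, stream) interface.
    zf = zipfile.ZipFile(zip_path, 'r')
    for name in zf.namelist():
        if not name.endswith('/'):
            yield name, zf.open(name)


def group_by_prefix(stream: Iterator[Tuple[str, IO[bytes]]]) -> Iterator[Tuple[str, IO[bytes]]]:
    # Turn (path, stream) into (class_name, stream), using the parent directory
    # as the label, which is what DatasetFolder does.
    for path, fobj in stream:
        yield os.path.basename(os.path.dirname(path)), fobj


# DatasetFolder-like pipeline: group_by_prefix(list_files(root))
# ZipFolder-like pipeline:     group_by_prefix(zip_list_files(zip_path))
```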

@ain-soph (Contributor, Author) commented Mar 8, 2021

> As such, I'd be tempted to wait until the new dataset abstractions are ready before we go on modifying the datasets in torchvision …

Yes, it would be even better if we could unite them in one class. I hope to see it in 1.0.0.
I'll close this PR.
But I want to note that an in-memory option and an initialization method from ImageFolder would still be very nice to have.

@ain-soph closed this Mar 8, 2021
@fmassa (Member) commented Mar 9, 2021

Hi,

I agree that having the option to hold everything in-memory would be helpful. Again, this would probably be a different combination of building blocks for loading everything ahead of time instead of lazily.
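
For example, the eager variant could be just one more (hypothetical) block that buffers the lazy stream up front:

```python
# Illustrative only: an eager block composed with the lazy ones sketched earlier.
import io
from typing import IO, Iterable, List, Tuple


def load_in_memory(stream: Iterable[Tuple[str, IO[bytes]]]) -> List[Tuple[str, io.BytesIO]]:
    # Materialize every (label, file-like) sample into RAM once, so later
    # iteration never touches the disk; downstream blocks stay unchanged.
    return [(label, io.BytesIO(fobj.read())) for label, fobj in stream]
```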

@ain-soph (Contributor, Author) commented Mar 3, 2022

Hi, it seems the datapipes have been stable for quite a while, but I haven't seen any code that unites Zip and Tar with ImageFolder yet.

Is there any new progress on this? @fmassa
