Implement ZipFolder #3510

Closed · wants to merge 8 commits into master

Conversation

@ain-soph (Contributor) commented Mar 5, 2021

Implement a ZipFolder class, which follows my previous PR #3215.
The idea is very similar to the TarDataset issue on pytorch/pytorch.
It archives the ImageFolder contents into a zip without any compression. The methods are almost the same as ImageFolder's.

Advantage: a single archive file is better for long-term use, and it makes loading and transferring faster and more convenient by avoiding small-file I/O (when memory=True), especially on HDDs.
When the memory argument is set to True, all bytes of the zip are read into memory at initialization. Otherwise, zipfile loads lazily by default, which gives the same behavior as ImageFolder.

Besides the basic utility, I also add a static method initialize_from_folder that converts a folder (following the ImageFolder layout) into the zip format.
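
A minimal sketch of the idea (illustrative only, not the exact code in this PR; it assumes the [root_folder_name]/[target_class]/[img_file] layout discussed below):

```python
# Illustrative sketch, not the code in this PR.
import io
import zipfile
from PIL import Image
from torch.utils.data import Dataset

IMG_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.bmp')  # shortened list for the sketch


class ZipFolder(Dataset):
    def __init__(self, zip_path: str, memory: bool = True, transform=None):
        self.transform = transform
        if memory:
            # Read the whole archive into RAM once, then serve members from the buffer.
            with open(zip_path, 'rb') as f:
                self.zip_file = zipfile.ZipFile(io.BytesIO(f.read()), 'r')
        else:
            # Lazy: members are read from disk on demand, like ImageFolder.
            self.zip_file = zipfile.ZipFile(zip_path, 'r')
        names = [n for n in self.zip_file.namelist()
                 if n.lower().endswith(IMG_EXTENSIONS)]
        # The second-to-last path component is the class directory.
        classes = sorted({n.split('/')[-2] for n in names})
        self.class_to_idx = {c: i for i, c in enumerate(classes)}
        self.samples = [(n, self.class_to_idx[n.split('/')[-2]]) for n in names]

    def __getitem__(self, index: int):
        name, target = self.samples[index]
        with self.zip_file.open(name) as f:
            img = Image.open(f).convert('RGB')
        if self.transform is not None:
            img = self.transform(img)
        return img, target

    def __len__(self) -> int:
        return len(self.samples)
```

Holding the whole archive in a single BytesIO buffer is what lets memory=True skip per-file disk seeks entirely.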

@ain-soph (Contributor, Author) commented Mar 5, 2021

Needs discussion:

  1. The method initialize_from_folder might need a better name (candidates: init_from_folder, folder_to_zip).
  2. It might not be appropriate to use io.BytesIO for the type annotation.
  3. Potential file structure of the zip file (zip filename == [root_folder_name]_store.zip):
    a. (current) [root_folder_name]/[target_class]/[img_file]
    b. [target_class]/[img_file]
  4. We need to verify that the compression type is ZIP_STORED (a rough sketch of this check follows below).

Unit tests and docs still need to be written if any reviewer thinks this PR is worth pursuing.
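
A rough sketch of the folder-to-zip conversion plus the ZIP_STORED check from point 4 (illustrative; the final name and signature are exactly what is under discussion above):

```python
# Illustrative sketch; the method name is one of the candidates listed above.
import os
import zipfile


def initialize_from_folder(root: str, zip_path: str = None) -> str:
    root = os.path.normpath(root)
    if zip_path is None:
        zip_path = root + '_store.zip'
    root_name = os.path.basename(root)
    # ZIP_STORED archives the already-compressed images without re-compression.
    with zipfile.ZipFile(zip_path, 'w', compression=zipfile.ZIP_STORED) as zf:
        for dirpath, _, filenames in os.walk(root):
            for filename in sorted(filenames):
                file_path = os.path.join(dirpath, filename)
                # Layout (a) above: [root_folder_name]/[target_class]/[img_file]
                arcname = os.path.join(root_name, os.path.relpath(file_path, root))
                zf.write(file_path, arcname=arcname)
    return zip_path


def check_zip_stored(zip_path: str) -> bool:
    # Point 4: verify every member was stored without compression.
    with zipfile.ZipFile(zip_path, 'r') as zf:
        return all(info.compress_type == zipfile.ZIP_STORED for info in zf.infolist())
```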

@codecov (bot) commented Mar 5, 2021

Codecov Report

Merging #3510 (c8f167a) into master (c991db8) will decrease coverage by 0.33%.
The diff coverage is 27.86%.

@@            Coverage Diff             @@
##           master    #3510      +/-   ##
==========================================
- Coverage   78.70%   78.37%   -0.34%     
==========================================
  Files         105      105              
  Lines        9735     9788      +53     
  Branches     1563     1575      +12     
==========================================
+ Hits         7662     7671       +9     
- Misses       1582     1626      +44     
  Partials      491      491              
Impacted Files                     Coverage Δ
torchvision/datasets/folder.py     59.71% <27.86%> (-26.34%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@fmassa (Member) commented Mar 8, 2021

Hi,

Thanks for the PR!

While I think this adds useful functionality, I think it's better to handle this in a more holistic way that abstracts away whether the file is zipped / tar / etc.

As you mentioned, there is an ongoing effort in pytorch/pytorch#49440 to restructure PyTorch datasets so that they can be more modular.
From this perspective, DatasetFolder could be implemented with something similar to:

DatasetFolder = GroupByPrefix(ListFiles())

and from the same perspective, having a ZipFolder would just mean doing

ZipFolder = GroupByPrefix(ZipListFiles())

ZipListFiles would then be the elementary building block, which would allow all the other datasets in torchvision to also be written to support zip / tar / etc. (if they are implemented in a way compatible with the new abstractions).

As such, I'd be tempted to wait until the new dataset abstractions are ready before we go on modifying the datasets in torchvision, as I foresee that we will be making a number of changes in the near future.

Thoughts?
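
Purely to illustrate the composition (GroupByPrefix, ListFiles and ZipListFiles are placeholder names from the comment above, not an existing API; the blocks are written here as plain generator functions rather than the eventual datapipe classes):

```python
# Placeholder building blocks, illustrative only.
import os
import zipfile
from typing import IO, Iterator, Tuple


def list_files(root: str) -> Iterator[Tuple[str, IO[bytes]]]:
    # Elementary block for plain folders: yield (path, binary stream) pairs.
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            yield path, open(path, 'rb')


def zip_list_files(zip_path: str) -> Iterator[Tuple[str, IO[bytes]]]:
    # Elementary block for zip archives, exposing the same (path, stream) interface.
    zf = zipfile.ZipFile(zip_path, 'r')
    for name in zf.namelist():
        if not name.endswith('/'):
            yield name, zf.open(name)


def group_by_prefix(stream: Iterator[Tuple[str, IO[bytes]]]) -> Iterator[Tuple[str, IO[bytes]]]:
    # Turn (path, stream) into (class_name, stream), using the parent directory
    # as the label, which is what DatasetFolder does.
    for path, fobj in stream:
        yield os.path.basename(os.path.dirname(path)), fobj


# DatasetFolder-like pipeline: group_by_prefix(list_files(root))
# ZipFolder-like pipeline:     group_by_prefix(zip_list_files(zip_path))
```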

@ain-soph (Contributor, Author) commented Mar 8, 2021

> As such, I'd be tempted to wait until the new dataset abstractions are ready before we go on modifying the datasets in torchvision …

Yes, it would be even better if we could unite them in one class. I hope to see it in 1.0.0.
I'll close this PR.
But I want to note that an in-memory option and an initialization method from ImageFolder would still be very nice to have.

@ain-soph closed this Mar 8, 2021
@fmassa (Member) commented Mar 9, 2021

Hi,

I agree that having the option to hold everything in-memory would be helpful. Again, this would probably be a different combination of building blocks for loading everything ahead of time instead of lazily.
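
For example, the eager variant could be just one more (hypothetical) block that buffers the lazy stream up front:

```python
# Illustrative only: an eager block composed with the lazy ones sketched earlier.
import io
from typing import IO, Iterable, List, Tuple


def load_in_memory(stream: Iterable[Tuple[str, IO[bytes]]]) -> List[Tuple[str, io.BytesIO]]:
    # Materialize every (label, file-like) sample into RAM once, so later
    # iteration never touches the disk; downstream blocks stay unchanged.
    return [(label, io.BytesIO(fobj.read())) for label, fobj in stream]
```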

@ain-soph (Contributor, Author) commented Mar 3, 2022

Hi, it seems the datapipes have been stable for quite a while, but I haven't seen any code that unites Zip and Tar with ImageFolder yet.

Is there any new progress on this? @fmassa
