Implement ZipFolder #3510
Needs discussion:
Unit tests and docs still need to be done, if any reviewer thinks this PR is worth doing.
Codecov Report
@@            Coverage Diff             @@
##           master    #3510      +/-   ##
==========================================
- Coverage   78.70%   78.37%   -0.34%
==========================================
  Files         105      105
  Lines        9735     9788      +53
  Branches     1563     1575      +12
==========================================
+ Hits         7662     7671       +9
- Misses       1582     1626      +44
  Partials      491      491
Hi,
Thanks for the PR!
While I think this adds useful functionality, I think it's better to handle this in a more holistic way that abstracts away whether the file is zipped / tar / etc.
As you mentioned, there is an ongoing effort in pytorch/pytorch#49440 to restructure PyTorch datasets so that they can be more modular.
DatasetFolder = GroupByPrefix(ListFiles())
From this perspective, having a ZipFolder = GroupByPrefix(ZipListFiles()) and thus the
As such, I'd be tempted to wait until the new dataset abstractions are ready before we go on modifying the datasets in torchvision, as I foresee that we will be making a number of changes in the near future.
Thoughts?
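The composition idea in this comment, DatasetFolder = GroupByPrefix(ListFiles()), can be pictured with plain Python functions. This is only a rough sketch: `list_files`, `zip_list_files`, and `group_by_prefix` are hypothetical names made up for illustration, not the building blocks from pytorch/pytorch#49440. It assumes the ImageFolder-style layout where the first path component is the class name.

```python
import os
import zipfile
from collections import defaultdict


def list_files(root):
    """Yield file paths under root, relative to root, with '/' separators."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            yield rel.replace(os.sep, "/")


def zip_list_files(archive_path):
    """Yield member paths of a zip archive, skipping directory entries."""
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            if not info.is_dir():
                yield info.filename


def group_by_prefix(paths):
    """Group paths by their first path component (the class folder)."""
    groups = defaultdict(list)
    for path in paths:
        prefix, sep, _rest = path.partition("/")
        if sep:  # ignore files that sit directly at the top level
            groups[prefix].append(path)
    return {key: sorted(value) for key, value in sorted(groups.items())}
```

With this split, the grouping step does not care where the listing came from: `group_by_prefix(list_files(folder))` and `group_by_prefix(zip_list_files(archive))` should give the same mapping when the archive mirrors the folder.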
Yes. It'll be even better if we can unite them in one class. Hope to see it in 1.0.0.
Hi, I agree that having the option to hold everything in memory would be helpful. Again, this would probably be a different combination of building blocks, one that loads everything ahead of time instead of lazily.
Hi, it seems the pipes have been stable for quite a while. But I haven't seen any code to unite Zip and Tar with
Is there any new progress on this? @fmassa
Implement a ZipFolder class, which follows my previous PR #3215. The idea is very similar to the TarDataset issue on pytorch/pytorch.

It archives an ImageFolder as a zip without any compression. The functions are almost the same as in ImageFolder.

Advantage: a single archive file is better for long-term use, and it makes loading and transferring faster and more convenient by avoiding small-file I/O (when memory=True), especially on HDDs.

When the memory argument is set to True, all bytes of the zip are read into memory at the beginning. Otherwise, the default loading by zipfile is lazy, leading to the same mechanism as ImageFolder.

Besides the basic utility, I also added a staticmethod initialize_from_folder that converts a folder (one that follows the ImageFolder requirements) into the zip format.
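The behaviour described above can be sketched with the standard library alone. This is not the code from this PR: `folder_to_zip` and `ZipSamples` are hypothetical names standing in for initialize_from_folder and ZipFolder, and the sketch returns raw bytes instead of decoded images so that it needs no torch or PIL. It assumes a class_name/file layout inside the archive.

```python
import io
import os
import zipfile


def folder_to_zip(folder, zip_path):
    """Pack class_name/file entries into an uncompressed (stored) zip."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for dirpath, _dirnames, filenames in os.walk(folder):
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                arcname = os.path.relpath(full, folder).replace(os.sep, "/")
                zf.write(full, arcname)


class ZipSamples:
    """Index (member, class_index) pairs and return raw bytes per sample."""

    def __init__(self, zip_path, memory=False):
        if memory:
            # Read the whole archive once, then serve members from RAM.
            with open(zip_path, "rb") as f:
                source = io.BytesIO(f.read())
        else:
            # Lazy mode: zipfile seeks into the file on disk for each read.
            source = zip_path
        self.archive = zipfile.ZipFile(source)
        members = sorted(
            n for n in self.archive.namelist()
            if "/" in n and not n.endswith("/")
        )
        self.classes = sorted({n.split("/", 1)[0] for n in members})
        class_to_idx = {c: i for i, c in enumerate(self.classes)}
        self.samples = [(n, class_to_idx[n.split("/", 1)[0]]) for n in members]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        name, target = self.samples[index]
        return self.archive.read(name), target
```

In a real dataset the returned bytes would be wrapped in `io.BytesIO` and decoded by an image loader. Writing the archive with `ZIP_STORED` means each read is a plain copy with no decompression step, which is the point of archiving without compression.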