-
Notifications
You must be signed in to change notification settings - Fork 7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Read COCO dataset from ZIP file #950
Conversation
Where you'd normally have e.g. "train2014" as a folder, if you place "train2014.zip" next to that, it will transparently switch to the zipped version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good start, thanks a lot!
I have a few comments and I'd like us to think a bit more the API and how to make it easier to let other parts of the codebase, like ImageFolder
to support zipped files instead of folders.
Also, we would need tests for this functionality, because it adds some non-trivial code.
…nstead, make it a new-styel class
Codecov Report
@@ Coverage Diff @@
## master #950 +/- ##
========================================
Coverage ? 65.6%
========================================
Files ? 81
Lines ? 6411
Branches ? 983
========================================
Hits ? 4206
Misses ? 1902
Partials ? 303
Continue to review full report at Codecov.
|
Needs just one special case argument for Coco, removed the "base_folder" for CelebA which really should have been part of root all along
…t tries both options instead
Oops, shouldn't have checked in that file
…nto read_zipped_data
…you'd normally pass to ImageFolder
With the addition of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking pretty good, thanks!
I want to think a bit more through this though, as there are some things that I think could be improved. I'll have another look on Monday / Tuesday next week.
If you don't mind, I might send some patches on top of your branch?
Sure, please do provide patches on top of this. |
Updated branch so that it merges cleanly with master again. |
@koenvandesande thanks for updating the PR! I am still unsure about how to nicely place this with the rest of torchvision datasets. In particular, the discussion in #1080 is very relevant. As such, I'm holding on on merging this PR for the time being, but this is a nice addition that would be good to have in torchvision at some point. |
* Rewrite torchvision packaging (pytorch#1209) Following a similar line of inquiry to pytorch/audio#217 * Packaging fixes (pytorch#1214) Add uploading support, make CUDA builds actually work. * 0.4.0 parameters Signed-off-by: Edward Z. Yang <ezyang@fb.com> * Actually upload wheels (please port to master) Signed-off-by: Edward Z. Yang <ezyang@fb.com> * Put macos binaries in the right place Signed-off-by: Edward Z. Yang <ezyang@fb.com> * Propagate more environment variables. Signed-off-by: Edward Z. Yang <ezyang@fb.com> * Change the version number Signed-off-by: Edward Z. Yang <ezyang@fb.com> * Go time Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
…h#1218) (pytorch#1219) Signed-off-by: Edward Z. Yang <ezyang@fb.com>
The discussion in #1080 seems to have quieted down (without consensus so far?). Is there interest to merge either (1) or (2) only in the near future? |
Sorry for the delay in replying, I just got back from holidays. I need to get back to the discussions in #1080 and pytorch/pytorch#24915 more generally. There is value in this feature, but I will need a bit more time to think it through. I'll be reviewing this again on Tuesday, Sep 3rd |
Hi, I think it'll be beneficial to have this Zipped loading style supported. It doesn't seem to be limited to COCO, but will also be sweet to have an argument Btw, do we have to add a new class And another feature might be loading all images from the ZIP or folder to the memory at the initialization just like CIFAR10 does, so that we won't take time to load image one by one during traverse ( I guess it'll be really useful and save quite some time for academic researchers. Datasets are saved on HDD for the server, but the disk space is quite sufficient. GPUs and memories are strong but disks are the bottleneck. Most of the dataset formats are |
Thanks @koenvandesande for the contribution and sorry for taking that long to get back at you. There is a new dataset API being developed and old datasets are being ported as discussed here: #5336. I am not 100% sure but I think that new features have been added to easily read zipped files/folders. @pmeier knows a lot more about this new API so I hope he will add details here. Thus, I would propose waiting a bit until the new dataset API is finalized and merged and then seeing if the features you have contributed @koenvandesande are still useful. If they are, someone familiar with the new API design can help you add them as needed and you will get proper attribution of course. Again, thanks for the contribution and sorry for the long wait. |
Yup. The prototype datasets will read from archives by default: vision/torchvision/prototype/datasets/_builtin/coco.py Lines 95 to 98 in b969cca
vision/torchvision/prototype/datasets/utils/_resource.py Lines 64 to 70 in b969cca
@koenvandesande It seems the main contribution here is the vision/torchvision/prototype/datasets/_folder.py Lines 41 to 49 in b969cca
For now, we only support loading datasets in the image folder structure from extracted archives, but changing this to read from an archive shouldn't be too hard. My proposal to resolve this is to open a new issue tracking this feature and close this PR given that it is no longer compatible with the new API. Is that ok for you? |
Probably needs more discussion about what other datasets to apply it to, but this is an initial take for CocoCaptions and CocoDetection
Fixes #947