Refactor prototype datasets #5778

Merged: 38 commits from prototype-datasets-inheritance into main on Apr 7, 2022

Conversation

@pmeier (Contributor) commented on Apr 7, 2022

This merges a feature branch into main. All changes up to #5777 / cd36d06 were reviewed in individual PRs.


The Dataset class now inherits from IterDataPipe rather than being a custom class from which a datapipe is created. This has several upsides:

  • The object can have a regular constructor with standard input checking. Before, the inputs were checked automatically, but we needed hacky workarounds when not all combinations of parameters were valid. Compare

    class VOCDatasetInfo(DatasetInfo):
        def __init__(self, *args: Any, **kwargs: Any):
            super().__init__(*args, **kwargs)
            self._configs = tuple(
                config for config in self._configs if config.split != "test" or config.year == "2007"
            )

        def make_config(self, **options: Any) -> DatasetConfig:
            config = super().make_config(**options)
            if config.split == "test" and config.year != "2007":
                raise ValueError("`split='test'` is only available for `year='2007'`")
            return config

    with

    if split == "test" and year != "2007":
        raise ValueError("`split='test'` is only available for `year='2007'`")

  • Instead of passing a config object around, the dataset can now simply use instance attributes set in the constructor. Compare

    def _2011_filter_split(self, row: List[str], *, split: str) -> bool:
        _, split_id = row
        return {
            "0": "test",
            "1": "train",
        }[split_id] == split

    split_dp = Filter(split_dp, functools.partial(self._2011_filter_split, split=config.split))

    with

    def _2011_filter_split(self, row: List[str]) -> bool:
        _, split_id = row
        return {
            "0": "test",
            "1": "train",
        }[split_id] == self._split

    split_dp = Filter(split_dp, self._2011_filter_split)

  • Although not yet added, we can now document the dataset with a regular docstring. This was not possible before, because the old Dataset object would not have been exposed and in turn would not have shown up in the documentation.

  • We can use the native __len__ method to return the number of samples in the dataset. In the old implementation we would have needed to attach a custom datapipe that does nothing but add this annotation. A condensed sketch combining these points follows below.
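
To tie these points together, here is a condensed, hypothetical sketch of the new style. The class name, fake rows, and sample counts are made up for illustration and do not mirror torchvision's actual implementation; it assumes torchdata is installed for Filter and IterableWrapper.

    from typing import Any, Dict, Iterator, List

    from torchdata.datapipes.iter import Filter, IterableWrapper, IterDataPipe


    class ExampleDataset(IterDataPipe[Dict[str, Any]]):
        def __init__(self, *, split: str = "train", year: str = "2012") -> None:
            # Regular input checking in the constructor; no config machinery needed.
            if split == "test" and year != "2007":
                raise ValueError("`split='test'` is only available for `year='2007'`")
            self._split = split
            self._year = year

        def _filter_split(self, row: List[str]) -> bool:
            # Instance attributes replace the config object that used to be
            # threaded through via functools.partial.
            _, split_id = row
            return {"0": "test", "1": "train"}[split_id] == self._split

        def __iter__(self) -> Iterator[Dict[str, Any]]:
            # Stand-in for the real resource-loading and parsing datapipes.
            rows = IterableWrapper([["img_0", "1"], ["img_1", "0"], ["img_2", "1"]])
            split_dp = Filter(rows, self._filter_split)
            for image_id, _ in split_dp:
                yield {"image_id": image_id, "split": self._split}

        def __len__(self) -> int:
            # Native __len__; no extra annotation datapipe is required.
            return 2 if self._split == "train" else 1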

pmeier and others added 30 commits February 24, 2022 16:47
* refactor prototype datasets to inherit from IterDataPipe

* depend on new architecture

* fix missing file detection

* remove unrelated file

* reinstate decorator for mock registering

* options -> config

* remove passing of info to mock data functions

* refactor categories file generation
Conflicts:
	test/builtin_dataset_mocks.py
	test/test_prototype_builtin_datasets.py
	torchvision/prototype/datasets/_builtin/imagenet.py
	torchvision/prototype/datasets/utils/_dataset.py
Conflicts:
	test/test_prototype_builtin_datasets.py
Conflicts:
	test/builtin_dataset_mocks.py
* reenable serialization test

* cleanup

* fix dill test

* trigger CI

* patch DILL_AVAILABLE for pickle serialization

* revert CI changes

* remove dill test and traversable test

* add data loader test

* parametrize over only_datapipe

* draw one sample rather than exhaust data loader

* cleanup

* trigger CI
* migrate VOC prototype dataset

* cleanup

* revert unrelated mock data changes

* remove categories annotations

* move properties to constructor

* readd homepage
* migrate coco prototype

* revert unrelated change

* add kwargs to super constructor call

* remove unneeded changes

* fix docstring position

* make kwargs explicit

* add dependencies to docstring

* fix missing dependency message
* Port PCAM

* skip_integrity_check

* Update torchvision/prototype/datasets/_builtin/pcam.py

Co-authored-by: Philip Meier <github.pmeier@posteo.de>

* Address comments

Co-authored-by: Philip Meier <github.pmeier@posteo.de>
* Migrate DTD prototype dataset

* Docstring

* Apply suggestions from code review

Co-authored-by: Philip Meier <github.pmeier@posteo.de>

Co-authored-by: Philip Meier <github.pmeier@posteo.de>
* Migrate GTSRB prototype dataset

* ufmt

* Address comments

* Apparently mypy doesn't know that __len__ returns ints. How cute.

* why is the CI not triggered??

* Update torchvision/prototype/datasets/_builtin/gtsrb.py

Co-authored-by: Philip Meier <github.pmeier@posteo.de>

Co-authored-by: Philip Meier <github.pmeier@posteo.de>
* migrate CelebA prototype dataset

* inline split_id
* Migrate Food101 dataset

* Added length

* Update torchvision/prototype/datasets/_builtin/food101.py

Co-authored-by: Philip Meier <github.pmeier@posteo.de>

Co-authored-by: Philip Meier <github.pmeier@posteo.de>
* Migrate Fer2013 prototype dataset

* Update torchvision/prototype/datasets/_builtin/fer2013.py

Co-authored-by: Philip Meier <github.pmeier@posteo.de>

Co-authored-by: Philip Meier <github.pmeier@posteo.de>
* migrate caltech prototype datasets

* resolve third party dependencies
* Migrate Oxford Pets prototype dataset

* Update torchvision/prototype/datasets/_builtin/oxford_iiit_pet.py

Co-authored-by: Philip Meier <github.pmeier@posteo.de>

Co-authored-by: Philip Meier <github.pmeier@posteo.de>
* migrate MNIST prototype datasets

* Update torchvision/prototype/datasets/_builtin/mnist.py

Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>

Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>
* Migrate Stanford Cars prototype dataset

* Address comments
* fix category file generation

* revert unrelated change

* revert unrelated change
* migrate cub200 prototype dataset

* address comments

* fix category-file-generation
* migrate SBD prototype dataset

* reuse categories
NicolasHug and others added 8 commits April 6, 2022 16:02
* Remove Dataset2 class

* Move read_categories_file out of DatasetInfo

* Remove FrozenBunch and FrozenMapping

* Remove test_prototype_datasets_api.py and move missing dep test somewhere else

* ufmt

* Let read_categories_file accept names instead of paths

* Mypy

* flake8

* fix category file reading

Co-authored-by: Philip Meier <github.pmeier@posteo.de>
* update prototype dataset README

* fix header level

* Apply suggestions from code review

Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>

Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>
@pmeier (Contributor, Author) commented on Apr 7, 2022

There are a few open PRs that will need to change once this is merged.

A couple of them are from OSS contributors. Given that this rework only shuffles some information around, I propose to push some changes to their branches myself to minimize the impact.

In addition, after this is merged, I'll post an update in #5336 for everyone who has already started but hasn't sent a PR yet.

Finally, with this change, are we now confident enough to re-open the dataset ports to community contributions?

@NicolasHug (Member) left a comment

Thanks a lot @pmeier

The prototype job is green so it's safe to approve now, but to be extra careful let's wait until all jobs are green before merging.

The summary above is very useful, thank you. On top of what you already said, it's worth noting that, even though it isn't reflected in this PR, the number of lines of code in our testing and base data structures has dropped significantly, by ~500 (#5774 reflects this more accurately). The reason this PR shows more LoC is that we added more features (e.g. support for __len__()) and more docs. Thanks again for your help Philip

@NicolasHug (Member) commented on Apr 7, 2022

Also CCing @ejguan @NivekT @VitalyFedyunin; thanks to your work on fixing pytorch/data#237, we were able to greatly simplify our implementations, as described above. As you know, we still need a full-fledged fix that also handles environments where dill is installed, but for now we're able to move forward with our preferred design. Thank you!
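
For context, here is a minimal, hypothetical illustration of the serialization concern: datapipes built from bound methods (as the refactored datasets use) can be serialized with plain pickle and consumed through a DataLoader. This is only a sketch under the assumption that torchdata is installed; it is not the repository's actual test, and the TinySplitFilter class is made up.

    import pickle

    from torch.utils.data import DataLoader
    from torchdata.datapipes.iter import Filter, IterableWrapper


    class TinySplitFilter:
        """Made-up stand-in for a dataset that builds a datapipe from a bound method."""

        def __init__(self, split: str) -> None:
            self._split = split

        def _keep(self, row):
            _, split_id = row
            return {"0": "test", "1": "train"}[split_id] == self._split

        def datapipe(self):
            rows = IterableWrapper([["a", "1"], ["b", "0"], ["c", "1"]])
            # A bound method rather than a lambda, so plain pickle can serialize it.
            return Filter(rows, self._keep)


    dp = TinySplitFilter("train").datapipe()
    # Round-trip through pickle, mirroring what a DataLoader does for worker processes.
    dp = pickle.loads(pickle.dumps(dp))

    # Draw a single sample rather than exhausting the loader.
    loader = DataLoader(dp, batch_size=None)
    print(next(iter(loader)))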

@NicolasHug merged commit 1ac6e8b into main on Apr 7, 2022
@pmeier deleted the prototype-datasets-inheritance branch on Apr 7, 2022
facebook-github-bot pushed a commit that referenced this pull request May 5, 2022
Summary:
* refactor prototype datasets to inherit from IterDataPipe (#5448)

* refactor prototype datasets to inherit from IterDataPipe

* depend on new architecture

* fix missing file detection

* remove unrelated file

* reinstate decorator for mock registering

* options -> config

* remove passing of info to mock data functions

* refactor categories file generation

* fix imagenet

* fix prototype datasets data loading tests (#5711)

* reenable serialization test

* cleanup

* fix dill test

* trigger CI

* patch DILL_AVAILABLE for pickle serialization

* revert CI changes

* remove dill test and traversable test

* add data loader test

* parametrize over only_datapipe

* draw one sample rather than exhaust data loader

* cleanup

* trigger CI

* migrate VOC prototype dataset (#5743)

* migrate VOC prototype dataset

* cleanup

* revert unrelated mock data changes

* remove categories annotations

* move properties to constructor

* readd homepage

* migrate CIFAR prototype datasets (#5751)

* migrate country211 prototype dataset (#5753)

* migrate CLEVR prototype dataset (#5752)

* migrate coco prototype (#5473)

* migrate coco prototype

* revert unrelated change

* add kwargs to super constructor call

* remove unneeded changes

* fix docstring position

* make kwargs explicit

* add dependencies to docstring

* fix missing dependency message

* Migrate PCAM prototype dataset (#5745)

* Port PCAM

* skip_integrity_check

* Update torchvision/prototype/datasets/_builtin/pcam.py

* Address comments

* Migrate DTD prototype dataset (#5757)

* Migrate DTD prototype dataset

* Docstring

* Apply suggestions from code review

* Migrate GTSRB prototype dataset (#5746)

* Migrate GTSRB prototype dataset

* ufmt

* Address comments

* Apparently mypy doesn't know that __len__ returns ints. How cute.

* why is the CI not triggered??

* Update torchvision/prototype/datasets/_builtin/gtsrb.py

* migrate CelebA prototype dataset (#5750)

* migrate CelebA prototype dataset

* inline split_id

* Migrate Food101 prototype dataset (#5758)

* Migrate Food101 dataset

* Added length

* Update torchvision/prototype/datasets/_builtin/food101.py

* Migrate Fer2013 prototype dataset (#5759)

* Migrate Fer2013 prototype dataset

* Update torchvision/prototype/datasets/_builtin/fer2013.py

* Migrate EuroSAT prototype dataset (#5760)

* Migrate Semeion prototype dataset (#5761)

* migrate caltech prototype datasets (#5749)

* migrate caltech prototype datasets

* resolve third party dependencies

* Migrate Oxford Pets prototype dataset (#5764)

* Migrate Oxford Pets prototype dataset

* Update torchvision/prototype/datasets/_builtin/oxford_iiit_pet.py

* migrate mnist prototype datasets (#5480)

* migrate MNIST prototype datasets

* Update torchvision/prototype/datasets/_builtin/mnist.py

* Migrate Stanford Cars prototype dataset (#5767)

* Migrate Stanford Cars prototype dataset

* Address comments

* fix category file generation (#5770)

* fix category file generation

* revert unrelated change

* revert unrelated change

* migrate cub200 prototype dataset (#5765)

* migrate cub200 prototype dataset

* address comments

* fix category-file-generation

* Migrate USPS prototype dataset (#5771)

* migrate SBD prototype dataset (#5772)

* migrate SBD prototype dataset

* reuse categories

* Migrate SVHN prototype dataset (#5769)

* add test to enforce __len__ is working on prototype datasets (#5742)

* reactivate special dataset tests

* add missing annotation

* Cleanup prototype dataset implementation (#5774)

* Remove Dataset2 class

* Move read_categories_file out of DatasetInfo

* Remove FrozenBunch and FrozenMapping

* Remove test_prototype_datasets_api.py and move missing dep test somewhere else

* ufmt

* Let read_categories_file accept names instead of paths

* Mypy

* flake8

* fix category file reading

* update prototype dataset README (#5777)

* update prototype dataset README

* fix header level

* Apply suggestions from code review

(Note: this ignores all push blocking failures!)

Reviewed By: jdsgomes, NicolasHug

Differential Revision: D36095693

fbshipit-source-id: d57f2b4a89ef1c45f5e2783ea57bce08df5c561d

Co-authored-by: Philip Meier <github.pmeier@posteo.de>
Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>