implement imagenet prototype dataset as function #5565

pmeier · 2022-03-08T09:21:14Z

First dataset that implements the "function design" we discussed in our latest sync. I didn't find any blockers, but there are still two issues I currently don't have a solution for.

facebook-github-bot · 2022-03-08T09:21:21Z

💊 CI failures summary and remediations

As of commit 271fe99 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚


pmeier · 2022-03-08T09:23:51Z

test/builtin_dataset_mocks.py

+        # with unittest.mock.patch.object(datasets.utils.Dataset2, "__init__"):
+        #     required_file_names = {
+        #         resource.file_name for resource in datasets.load(self.name, root=root, **config)._resources()
+        #     }
+        # available_file_names = {path.name for path in root.glob("*")}
+        # missing_file_names = required_file_names - available_file_names
+        # if missing_file_names:
+        #     raise pytest.UsageError(
+        #         f"Dataset '{self.name}' requires the files {sequence_to_str(sorted(missing_file_names))} "
+        #         f"for {config}, but they were not created by the mock data function."
+        #     )


Since the resources are now fully internal inside the dataset function, I don't see a good way to check if the mock data is set up correctly. One thing we could try is to patch

vision/torchvision/prototype/datasets/utils/_resource.py, lines 117 to 119 in d8654bb:

    @abc.abstractmethod
    def _download(self, root: pathlib.Path) -> None:
        pass

on all subclasses to raise the error I commented out above.
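A minimal sketch of that patching idea, not taken from the PR. `OnlineResource`, `HttpResource`, `MockDataError`, `all_subclasses`, `fail_on_download`, and `downloads_disabled` are hypothetical stand-ins written for this example; `MockDataError` takes the place of `pytest.UsageError` so the snippet runs without pytest. It assumes a resource only calls `_download` when its file is missing from `root`.

```python
import abc
import contextlib
import pathlib
import tempfile
import unittest.mock


class MockDataError(Exception):
    """Stand-in for pytest.UsageError so the sketch runs without pytest."""


class OnlineResource(abc.ABC):
    """Hypothetical, stripped-down stand-in for the resource base class."""

    def __init__(self, file_name: str) -> None:
        self.file_name = file_name

    def load(self, root: pathlib.Path) -> pathlib.Path:
        path = root / self.file_name
        if not path.exists():
            # Assumption: downloading is only attempted for missing files.
            self._download(root)
        return path

    @abc.abstractmethod
    def _download(self, root: pathlib.Path) -> None:
        pass


class HttpResource(OnlineResource):
    def _download(self, root: pathlib.Path) -> None:
        raise RuntimeError("no network access in this sketch")


def all_subclasses(cls):
    # Recursively yield direct and indirect subclasses.
    for subclass in cls.__subclasses__():
        yield subclass
        yield from all_subclasses(subclass)


def fail_on_download(self, root: pathlib.Path) -> None:
    raise MockDataError(
        f"The file '{self.file_name}' is required, "
        f"but it was not created by the mock data function."
    )


@contextlib.contextmanager
def downloads_disabled(base=OnlineResource):
    # Patch `_download` on every subclass; the patches are undone on exit.
    with contextlib.ExitStack() as stack:
        for subclass in all_subclasses(base):
            stack.enter_context(
                unittest.mock.patch.object(subclass, "_download", fail_on_download)
            )
        yield


with tempfile.TemporaryDirectory() as tmp, downloads_disabled():
    try:
        HttpResource("ILSVRC2012_img_val.tar").load(pathlib.Path(tmp))
    except MockDataError as error:
        print(error)
```

With this approach the check happens lazily: the error only surfaces once the dataset function actually tries to load a file the mock data function did not create.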

Could load_images_dp be registered and mocked?

That would require registering functions like that just for testing purposes. Not sure if we should go that way.

Then how about having info include another dictionary of resources? load_images_dp could then load its resources from info.

That would be a possibility, yes. cc @NicolasHug
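One way the "resources in info" suggestion could look, sketched under heavy assumptions. `Resource`, `DatasetInfo`, its `resources` field, `IMAGENET_INFO`, and `missing_file_names` are all hypothetical names invented for this example; only the `ILSVRC2012_img_{name}.tar` file name pattern comes from the diff. The point is that a test could read the required file names from the info object without calling the dataset function.

```python
import dataclasses
import pathlib
import tempfile
from typing import Dict, List


@dataclasses.dataclass(frozen=True)
class Resource:
    # Hypothetical minimal resource: only the file name matters here.
    file_name: str


@dataclasses.dataclass(frozen=True)
class DatasetInfo:
    name: str
    # Hypothetical field: maps a split to the archives that split needs.
    resources: Dict[str, List[Resource]]


IMAGENET_INFO = DatasetInfo(
    name="imagenet",
    resources={
        split: [Resource(f"ILSVRC2012_img_{split}.tar")]
        for split in ("train", "val")
    },
)


def missing_file_names(info: DatasetInfo, root: pathlib.Path, *, split: str) -> List[str]:
    # Same set difference as in the commented-out test code, but driven by info.
    required = {resource.file_name for resource in info.resources[split]}
    available = {path.name for path in root.glob("*")}
    return sorted(required - available)


with tempfile.TemporaryDirectory() as tmp:
    print(missing_file_names(IMAGENET_INFO, pathlib.Path(tmp), split="val"))
```

The trade-off is that the dataset function and the info object would both need to agree on the resources, which the function design currently keeps internal.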

pmeier · 2022-03-08T09:24:35Z

test/test_prototype_builtin_datasets.py

@@ -10,6 +10,7 @@
 from torch.utils.data.graph import traverse
 from torchdata.datapipes.iter import Shuffler, ShardingFilter
 from torchvision.prototype import transforms, datasets
+from torchvision.prototype.datasets.utils._internal import TakerDataPipe


@NivekT If I recall correctly, we wanted to upstream this to torchdata. Any progress on that?

I will have a look at the implementation sometime today

We can upstream Taker to torchdata. Just a heads-up: we are aligning the API and functionality of this DataPipe with the internal team. The name may change to Limiter, with limit as the functional API, once the alignment is settled.

Please tag me in the PR so I can make the changes here after it is landed.
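For readers unfamiliar with the DataPipe under discussion, here is a plain-Python sketch of the behavior it is assumed to have. The thread does not spell out the semantics, so "yield at most a fixed number of samples" is an assumption, and `Taker` and `limit` below are hypothetical stand-ins that do not use the torchdata API.

```python
import itertools
from typing import Any, Iterable, Iterator


class Taker:
    """Hypothetical stand-in: yields at most `num_take` items from `source`."""

    def __init__(self, source: Iterable[Any], num_take: int) -> None:
        self.source = source
        self.num_take = num_take

    def __iter__(self) -> Iterator[Any]:
        # islice stops early without consuming the rest of the source.
        return itertools.islice(self.source, self.num_take)


def limit(source: Iterable[Any], num_take: int) -> Taker:
    # Functional spelling of the same operation, mirroring the proposed name.
    return Taker(source, num_take)


print(list(limit(range(10), 3)))
```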

pmeier · 2022-03-08T09:29:40Z

torchvision/prototype/datasets/_builtin/imagenet.py

-    def _generate_categories(self) -> List[Tuple[str, ...]]:
-        self._split = "val"
-        resources = self._resources()
+def generate_categories(root: Union[str, pathlib.Path], **kwargs: Any) -> List[Tuple[str, ...]]:


This is currently not working. Before, we checked for this method and called it through the already exposed dataset class. Now, we either need to expose this explicitly or find another way to call it.

In one of the very first iterations I simply added an if __name__ == "__main__": ... block to the module and performed everything there. This way we wouldn't have an option to generate multiple category files at once, but that is probably not a significant issue.
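A minimal sketch of the `if __name__ == "__main__":` variant, assuming the module is run directly as a script. The body of `generate_categories` is a placeholder that returns two illustrative rows, `format_categories` is a hypothetical helper, and the default root path is made up for the example.

```python
import csv
import io
import pathlib
from typing import Any, Iterable, List, Tuple, Union


def generate_categories(root: Union[str, pathlib.Path], **kwargs: Any) -> List[Tuple[str, ...]]:
    # Placeholder: the real function would derive these rows from the files in `root`.
    return [("tench", "n01440764"), ("goldfish", "n01443537")]


def format_categories(categories: Iterable[Tuple[str, ...]]) -> str:
    # Hypothetical helper: one CSV row per category.
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(categories)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical default root; a real script would likely take it as an argument.
    root = pathlib.Path.home() / "datasets" / "imagenet"
    print(format_categories(generate_categories(root)), end="")
```

Because the entry point lives in the dataset module itself, nothing has to be exposed through the public dataset API, at the cost of running one module per category file.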

NicolasHug · 2022-03-08T11:11:22Z

torchvision/prototype/datasets/_builtin/imagenet.py

-        images = ImageNetResource(
-            file_name=f"ILSVRC2012_img_{name}.tar",
-            sha256=self._IMAGES_CHECKSUMS[name],
+def imagenet(root: Union[str, pathlib.Path], *, split: str = "train", **kwargs: Any) -> TakerDataPipe:


I don't know yet whether the function should have the same name as the one available in load() (i.e. imagenet), or if we should keep the same names as our current dataset to minimize friction (i.e. ImageNet). We can decide that later.

I'd rather not. I'm ok with adding a thin deprecation layer like

    import warnings

    def ImageNet(*args, **kwargs):
        warnings.warn("This is deprecated. Use datasets.load('imagenet') instead.")
        return imagenet(*args, **kwargs)

The way we'll handle the deprecation / transition is related indeed. That's also something we'll need to decide on. But there is value in conserving the same name, so let's keep an open mind on this.

> But there is value in conserving the same name, so let's keep an open mind on this.

The new datasets are BC-breaking in so many ways, and the consensus from #5040 was that having a compatibility layer is not worth it. Thus, I don't think the name is the problem here, given that users need to do more work on their entire pipeline anyway.

I agree that we don't want a compatibility layer.
Still, it's always good to preserve what we can preserve if we can do it for cheap. Again, let's please keep an open mind here.

> Still, it's always good to preserve what we can preserve if we can do it for cheap.

I only partially agree. Although it is cheap, it brings two problems:

1. The old and the new API cannot coexist. Thus, we would have a hard BC break even if users are fine with sticking to the old API for now.

2. Having the same name with different functionality is IMO more confusing than having two separate names, one of which will be deprecated in favor of the other.

> Again, let's please keep an open mind here.

👍

ejguan · 2022-03-08T16:49:40Z

test/test_prototype_builtin_datasets.py

We can upstream Taker to torchdata. Just a heads-up: we are aligning the API and functionality of this DataPipe with the internal team. The name may change to Limiter, with limit as the functional API, once the alignment is settled.

ejguan · 2022-03-08T17:52:21Z

test/builtin_dataset_mocks.py

Could load_images_dp be registered and mocked?

pmeier · 2022-04-07T08:55:23Z

Superseded by #5778.

implement imagenet prototype dataset as function

be6c32d

pmeier added module: datasets prototype labels Mar 8, 2022

pmeier requested review from NicolasHug and ejguan March 8, 2022 09:21

pytorch-bot bot added the ciflow/default label Mar 8, 2022

facebook-github-bot added the cla signed label Mar 8, 2022

NicolasHug reviewed Mar 8, 2022

pmeier added 3 commits March 8, 2022 17:32

make root optional on dataset function

dbcde5c

fix import

2e92d26

fix annotation

617bfb3

ejguan reviewed Mar 8, 2022

NivekT reviewed Mar 8, 2022

fix val label generation

3539ebe

pmeier mentioned this pull request Mar 9, 2022

use upstream torchdata datapipes in prototype datasets #5570

Merged

fix test home patching

271fe99

pmeier closed this Apr 7, 2022
