Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement imagenet prototype dataset as function #5565

Conversation

pmeier
Copy link
Collaborator

@pmeier pmeier commented Mar 8, 2022

First dataset that implements the "function design" we discussed in our latest sync. I didn't find any blockers, but there are still two issues I currently don't have a solution for.

@facebook-github-bot
Copy link

facebook-github-bot commented Mar 8, 2022

💊 CI failures summary and remediations

As of commit 271fe99 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

Comment on lines +67 to +77
# with unittest.mock.patch.object(datasets.utils.Dataset2, "__init__"):
# required_file_names = {
# resource.file_name for resource in datasets.load(self.name, root=root, **config)._resources()
# }
# available_file_names = {path.name for path in root.glob("*")}
# missing_file_names = required_file_names - available_file_names
# if missing_file_names:
# raise pytest.UsageError(
# f"Dataset '{self.name}' requires the files {sequence_to_str(sorted(missing_file_names))} "
# f"for {config}, but they were not created by the mock data function."
# )
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the resources are now fully internal inside the dataset function, I don't see a good way to check if the mock data is set up correctly. One thing we could try is to patch

@abc.abstractmethod
def _download(self, root: pathlib.Path) -> None:
pass

on all subclasses to raise the error I commented out above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could load_images_dp be registered and mocked?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would require registering functions like that just for testing purposes. Not sure if we should go that way.

Copy link
Contributor

@ejguan ejguan Mar 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then, how about making info including another dictionary of Resources? And, load_images_dp could load resource from info.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be a possibility, yes. cc @NicolasHug

@@ -10,6 +10,7 @@
from torch.utils.data.graph import traverse
from torchdata.datapipes.iter import Shuffler, ShardingFilter
from torchvision.prototype import transforms, datasets
from torchvision.prototype.datasets.utils._internal import TakerDataPipe
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NivekT If I recall correctly, we wanted to upstream this to torchdata. Any progress on that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will have a look at the implementation sometime today

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can upstream Taker in torchdata. But, just a heads-up, we are aligning the API on the functionality of this DataPipe with the internal team. The name may be changed to Limiter with functional API as limit when alignment is settled down.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please tag me in the PR so I can make the changes here after it is landed.

def _generate_categories(self) -> List[Tuple[str, ...]]:
self._split = "val"
resources = self._resources()
def generate_categories(root: Union[str, pathlib.Path], **kwargs: Any) -> List[Tuple[str, ...]]:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently not working. Before, we checked for this method and called it through the already exposed dataset class. Now, we either need to expose this explicitly or find another way to call it.

In one of the very first iterations I simply added a if __name__ == "__main__": ... to the module and performed everything there. This way we wouldn't have an option to generate multiple category files at once, but that is probably not a significant issue.

torchvision/prototype/datasets/_builtin/__init__.py Outdated Show resolved Hide resolved
images = ImageNetResource(
file_name=f"ILSVRC2012_img_{name}.tar",
sha256=self._IMAGES_CHECKSUMS[name],
def imagenet(root: Union[str, pathlib.Path], *, split: str = "train", **kwargs: Any) -> TakerDataPipe:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know yet whether the function should have the same name as the one available in load() (i.e. imagenet), or if we should keep the same names as our current dataset to minimize friction (i.e. ImageNet). We can decide that later.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not. I'm ok with adding a thin deprecation layer like

def ImageNet(...):
    warnings.warn("This is deprecated. Use datasets.load("imagenet") instead.")
    return imagenet(...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way we'll handle the deprecation / transition is related indeed. That's also something we'll need to decide on. But there is value in conserving the same name, so let's keep an open mind on this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But there is value in conserving the same name, so let's keep an open mind on this.

The new datasets are BC breaking in so many ways and the consensus from #5040 was that having a compatibility layer is not worth it. Thus, I don't think that the name is the problem here given that they need to do some more work on the entire pipeline.

Copy link
Member

@NicolasHug NicolasHug Mar 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we don't want a compatibility layer.
Still, it's always good to preserve what we can preserve if we can do it for cheap. Again, let's please keep an open mind here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still, it's always good to preserve what we can preserve if we can do it for cheap.

I only partially agree. Although it is cheap it brings 2 problems:

  1. The old and the new API cannot coexist. Thus, we would have a hard BC break even if users are fine with sticking to the old API for now.
  2. Having the same name with different functionality is IMO more confusing than having two separate names one of which will be deprecated in favor of the other.

Again, let's please keep an open mind here.

👍

@@ -10,6 +10,7 @@
from torch.utils.data.graph import traverse
from torchdata.datapipes.iter import Shuffler, ShardingFilter
from torchvision.prototype import transforms, datasets
from torchvision.prototype.datasets.utils._internal import TakerDataPipe
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can upstream Taker in torchdata. But, just a heads-up, we are aligning the API on the functionality of this DataPipe with the internal team. The name may be changed to Limiter with functional API as limit when alignment is settled down.

Comment on lines +67 to +77
# with unittest.mock.patch.object(datasets.utils.Dataset2, "__init__"):
# required_file_names = {
# resource.file_name for resource in datasets.load(self.name, root=root, **config)._resources()
# }
# available_file_names = {path.name for path in root.glob("*")}
# missing_file_names = required_file_names - available_file_names
# if missing_file_names:
# raise pytest.UsageError(
# f"Dataset '{self.name}' requires the files {sequence_to_str(sorted(missing_file_names))} "
# f"for {config}, but they were not created by the mock data function."
# )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could load_images_dp be registered and mocked?

@pmeier
Copy link
Collaborator Author

pmeier commented Apr 7, 2022

Superseded by #5778.

@pmeier pmeier closed this Apr 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants