-
Notifications
You must be signed in to change notification settings - Fork 7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implement imagenet prototype dataset as function #5565
implement imagenet prototype dataset as function #5565
Conversation
💊 CI failures summary and remediationsAs of commit 271fe99 (more details on the Dr. CI page): 💚 💚 Looks good so far! There are no failures yet. 💚 💚 This comment was automatically generated by Dr. CI (expand for details).Please report bugs/suggestions to the (internal) Dr. CI Users group. |
# with unittest.mock.patch.object(datasets.utils.Dataset2, "__init__"): | ||
# required_file_names = { | ||
# resource.file_name for resource in datasets.load(self.name, root=root, **config)._resources() | ||
# } | ||
# available_file_names = {path.name for path in root.glob("*")} | ||
# missing_file_names = required_file_names - available_file_names | ||
# if missing_file_names: | ||
# raise pytest.UsageError( | ||
# f"Dataset '{self.name}' requires the files {sequence_to_str(sorted(missing_file_names))} " | ||
# f"for {config}, but they were not created by the mock data function." | ||
# ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the resources are now fully internal inside the dataset function, I don't see a good way to check if the mock data is set up correctly. One thing we could try is to patch
vision/torchvision/prototype/datasets/utils/_resource.py
Lines 117 to 119 in d8654bb
@abc.abstractmethod | |
def _download(self, root: pathlib.Path) -> None: | |
pass |
on all subclasses to raise the error I commented out above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could load_images_dp
be registered and mocked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would require registering functions like that just for testing purposes. Not sure if we should go that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then, how about making info
including another dictionary of Resource
s? And, load_images_dp
could load resource from info
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be a possibility, yes. cc @NicolasHug
@@ -10,6 +10,7 @@ | |||
from torch.utils.data.graph import traverse | |||
from torchdata.datapipes.iter import Shuffler, ShardingFilter | |||
from torchvision.prototype import transforms, datasets | |||
from torchvision.prototype.datasets.utils._internal import TakerDataPipe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NivekT If I recall correctly, we wanted to upstream this to torchdata
. Any progress on that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will have a look at the implementation sometime today
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can upstream Taker
in torchdata. But, just a heads-up, we are aligning the API on the functionality of this DataPipe with the internal team. The name may be changed to Limiter
with functional API as limit
when alignment is settled down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please tag me in the PR so I can make the changes here after it is landed.
def _generate_categories(self) -> List[Tuple[str, ...]]: | ||
self._split = "val" | ||
resources = self._resources() | ||
def generate_categories(root: Union[str, pathlib.Path], **kwargs: Any) -> List[Tuple[str, ...]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is currently not working. Before, we checked for this method and called it through the already exposed dataset class. Now, we either need to expose this explicitly or find another way to call it.
In one of the very first iterations I simply added a if __name__ == "__main__": ...
to the module and performed everything there. This way we wouldn't have an option to generate multiple category files at once, but that is probably not a significant issue.
images = ImageNetResource( | ||
file_name=f"ILSVRC2012_img_{name}.tar", | ||
sha256=self._IMAGES_CHECKSUMS[name], | ||
def imagenet(root: Union[str, pathlib.Path], *, split: str = "train", **kwargs: Any) -> TakerDataPipe: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know yet whether the function should have the same name as the one available in load()
(i.e. imagenet), or if we should keep the same names as our current dataset to minimize friction (i.e. ImageNet). We can decide that later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather not. I'm ok with adding a thin deprecation layer like
def ImageNet(...):
warnings.warn("This is deprecated. Use datasets.load("imagenet") instead.")
return imagenet(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way we'll handle the deprecation / transition is related indeed. That's also something we'll need to decide on. But there is value in conserving the same name, so let's keep an open mind on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But there is value in conserving the same name, so let's keep an open mind on this.
The new datasets are BC breaking in so many ways and the consensus from #5040 was that having a compatibility layer is not worth it. Thus, I don't think that the name is the problem here given that they need to do some more work on the entire pipeline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we don't want a compatibility layer.
Still, it's always good to preserve what we can preserve if we can do it for cheap. Again, let's please keep an open mind here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still, it's always good to preserve what we can preserve if we can do it for cheap.
I only partially agree. Although it is cheap it brings 2 problems:
- The old and the new API cannot coexist. Thus, we would have a hard BC break even if users are fine with sticking to the old API for now.
- Having the same name with different functionality is IMO more confusing than having two separate names one of which will be deprecated in favor of the other.
Again, let's please keep an open mind here.
👍
@@ -10,6 +10,7 @@ | |||
from torch.utils.data.graph import traverse | |||
from torchdata.datapipes.iter import Shuffler, ShardingFilter | |||
from torchvision.prototype import transforms, datasets | |||
from torchvision.prototype.datasets.utils._internal import TakerDataPipe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can upstream Taker
in torchdata. But, just a heads-up, we are aligning the API on the functionality of this DataPipe with the internal team. The name may be changed to Limiter
with functional API as limit
when alignment is settled down.
# with unittest.mock.patch.object(datasets.utils.Dataset2, "__init__"): | ||
# required_file_names = { | ||
# resource.file_name for resource in datasets.load(self.name, root=root, **config)._resources() | ||
# } | ||
# available_file_names = {path.name for path in root.glob("*")} | ||
# missing_file_names = required_file_names - available_file_names | ||
# if missing_file_names: | ||
# raise pytest.UsageError( | ||
# f"Dataset '{self.name}' requires the files {sequence_to_str(sorted(missing_file_names))} " | ||
# f"for {config}, but they were not created by the mock data function." | ||
# ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could load_images_dp
be registered and mocked?
Superseded by #5778. |
First dataset that implements the "function design" we discussed in our latest sync. I didn't find any blockers, but there are still two issues I currently don't have a solution for.