merge DatasetInfo into Dataset #5369
Conversation
💊 CI failures summary and remediations

As of commit 36f5c5f (more details on the Dr. CI page):

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages: unittest_prototype (1/1)
Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

| Job | Step | Action |
|---|---|---|
| circleci_consistency | Check CircleCI config consistency | 🔁 rerun |
I only implemented the new design for VOC and ImageNet. If we agree on the changes, I'll address the remaining datasets.
self.configs = []
# Generate every combination of valid option values and keep only the
# combinations that the dataset accepts as a valid config.
for combination in itertools.product(*[option.valid for option in self.dataset.options]):
    options = dict(zip([option.name for option in self.dataset.options], combination))
    with contextlib.suppress(Exception):
        self.configs.append(self.dataset.make_config(**options))
This is the only part of the test suite changes that needs reviewing. All other changes are just hacks to temporarily test only VOC and ImageNet.
if arg not in option.valid:
    raise ValueError(f"Invalid argument '{arg}' for option '{name}' of dataset '{self.name}'. {option.doc}")
Maybe we could also have this logic on the `DatasetOption` class.
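For illustration, here is a minimal sketch of what moving the check onto the option class could look like; the `validate` method and the constructor signature are assumptions, not code from this PR:

```python
class DatasetOption:
    def __init__(self, name, valid, doc=""):
        self.name = name
        self.valid = set(valid)
        self.doc = doc

    def validate(self, arg, dataset_name):
        # Same check as above, but it lives next to the option's own data.
        if arg not in self.valid:
            raise ValueError(
                f"Invalid argument '{arg}' for option '{self.name}' "
                f"of dataset '{dataset_name}'. {self.doc}"
            )
```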
SENTINEL = object()


class DatasetOption:
What is the reason for having a `DatasetOption` class instead of relying on standard Python parameter passing + docstrings?
I'm not sure we should try to automate input validation or docstring generation at all costs, TBH. If it forces us to deviate from our usual Python development practices, it may not be worth it?
Right now, we instantiate all `Dataset` objects at import time:
vision/torchvision/prototype/datasets/_api.py
Lines 21 to 23 in 44e93be
for name, obj in _builtin.__dict__.items():
    if not name.startswith("_") and isinstance(obj, type) and issubclass(obj, Dataset) and obj is not Dataset:
        register(obj())
At this stage no specific configuration is known. So something like
class MyDataset(Dataset):
    def __init__(self, foo="foo", bar=1):
        if not isinstance(bar, int):
            raise TypeError
is not possible.
Now, we can of course change the API to only instantiate the objects at runtime. But then we also need a different way to access the static data such as `categories` or the like, since they should be accessible without a specific configuration.
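For illustration only, one conceivable way to keep such static data reachable without a configured instance is to attach it at the class level; all names below are hypothetical:

```python
class Dataset:  # stand-in for the prototype base class
    pass


class MyDataset(Dataset):
    # Static meta data lives on the class, so no configuration is needed.
    _CATEGORIES = ("cat", "dog")

    @classmethod
    def categories(cls):
        return cls._CATEGORIES


# Usable without ever choosing a configuration:
assert MyDataset.categories() == ("cat", "dog")
```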
> But then we also need a different way to access the static data such as `categories` or the like, since they should be accessible without a specific configuration.
Would it be reasonable to expect all parameters to have a default? This would allow us to instantiate the dataset objects whenever we need their static data.
Also, about the need for accessing meta-data without a config: perhaps we don't have that yet, but I could see a dataset that has different classes (or different "extra" field) depending on a parameter. Are we supporting this use-case?
> Would it be reasonable to expect all parameters to have a default?

IMO, yes. We shouldn't have a dataset that is not loadable with `datasets.load("foo")`. Looking through our old datasets, I think there are two different scenarios where this is violated:
- Datasets that cannot be automatically downloaded and need multiple files as input:
  vision/torchvision/datasets/coco.py, Lines 27 to 28 in 44e93be:
  root: str, annFile: str,
- Datasets with arguments that could very well have a default but don't, for a probably unknown reason:
  vision/torchvision/datasets/mnist.py, Line 293 in 44e93be:
  def __init__(self, root: str, split: str, **kwargs: Any) -> None:
Since the download is handled differently and we no longer need to pass paths to the dataset, I currently don't see a reason not to assume it. If there ever is a case where we actually have no good default value, we can probably hack around it.
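As an illustration of such a hack, the `SENTINEL` object from the diff above could serve as a placeholder default, deferring the error until the dataset is actually used; the `load` method here is purely hypothetical:

```python
SENTINEL = object()


class MyDataset:
    def __init__(self, split=SENTINEL):
        # The object can still be instantiated at import time for registration.
        self.split = split

    def load(self):
        # Defer the error until the value is actually needed.
        if self.split is SENTINEL:
            raise ValueError("'split' has no default and must be passed explicitly")
        ...
```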
> Also, about the need for accessing meta-data without a config: perhaps we don't have that yet, but I could see a dataset that has different classes (or different "extra" field) depending on a parameter. Are we supporting this use-case?
Currently this is not supported. But I'm not sure what would be a good way to add this. With the design discussed in this thread, `datasets.info()` would give you the `Dataset` object instantiated with default parameters. Thus, we would always get the same meta data back. One way would be to accept `**options` there.
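Hypothetically, accepting `**options` there might look like the sketch below; the `_registry` mapping is an assumption about the internals, not actual prototype code:

```python
_registry = {}  # name -> Dataset class, filled by register() at import time


def info(name, **options):
    cls = _registry[name]
    # Without options this matches the behavior described above (default
    # parameters); with options, the returned meta data may differ.
    return cls(**options)
```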
Perhaps it would help to see how we would be documenting these new datasets before we can make a decision on this.
I'm sceptical in general about not using normal parameter passing, defaults, and docs. Especially if we ever allow users to create their own classes, we don't want them to learn yet another framework (but it's relevant for developers too). If it makes the docs much easier to write, then we might have a reason, but for now it does not appear obvious to me.
Let me write minimalistic docs to showcase this.
See https://1173984-73328905-gh.circle-artifacts.com/0/docs/datasets_prototype/index.html. Of course, if we had more meta information filled in, such as the citation, license, description, and so on, the pages would not look as empty.
Thanks a lot @pmeier.
I have a few questions:
- how would we document the type of the parameters?
- how would we document extra methods on the datasets (if relevant)? What about extra attributes (again not sure if relevant)?
- how would we document additional info that is specific to each dataset? Stuff like "This dataset contains X images of cars and Y images of trucks".
> how would we document the type of the parameters?

I would put another column in the table for the type. `DatasetOption` could have an `annotation` field for that. Let me add it to the examples.
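As a rough sketch of that idea (only the name `annotation` comes from the comment above; the rest is assumed):

```python
class DatasetOption:
    def __init__(self, name, valid, doc="", annotation="str"):
        self.name = name
        self.valid = valid
        self.doc = doc
        # Rendered by the doc generator as an extra "type" column in the options table.
        self.annotation = annotation


split = DatasetOption("split", valid=("train", "test"), annotation="str")
```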
> how would we document extra methods on the datasets (if relevant)?

IMO, we shouldn't have extra methods on the dataset objects. Currently, I don't think we have datasets that need this. Did you have anything in mind that other datasets might provide?
> What about extra attributes (again not sure if relevant)?

In our current stable API we sometimes add them to the docstring in the `Attributes:` section. We could simply have another table, similar to `Options`, that holds these. Let me add them to my examples.
> how would we document additional info that is specific to each dataset? Stuff like "This dataset contains X images of cars and Y images of trucks".

Given that this is very individual, I would simply have a `description` field which takes a string input and puts it at the top of the generated documentation. Let me add an example.
See https://1181665-73328905-gh.circle-artifacts.com/0/docs/datasets_prototype/index.html. I've taken the examples from https://paperswithcode.com/ for demo purposes.
Superseded by the development on the https://github.com/pytorch/vision/tree/prototype-datasets-inheritance branch.
Addresses points 1 to 3 from #5281.