merge DatasetInfo into Dataset #5369

Closed
wants to merge 7 commits from datasets/refactor-info

Conversation

@pmeier (Collaborator) commented Feb 3, 2022

Addresses Points 1. to 3. from #5281.

@facebook-github-bot commented Feb 3, 2022

💊 CI failures summary and remediations

As of commit 36f5c5f (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build unittest_prototype (1/1), step "Run tests":

=============================== 1...0.47s ===============================
    pytest.param(DatasetMock().default_config, None, id="default"),
  File "/home/circleci/project/test/test_prototype_datasets_api.py", line 167, in __init__
    self._info = info or make_minimal_dataset_info(valid_options=dict(split=("train", "test")))
  File "/home/circleci/project/test/test_prototype_datasets_api.py", line 9, in make_minimal_dataset_info
    return datasets.utils.DatasetInfo(name, type=type, categories=categories or [], **kwargs)
TypeError: DatasetInfo() takes no arguments
------ generated xml file: /home/circleci/project/test-results/junit.xml -------
=========================== short test summary info ============================
ERROR test/test_prototype_datasets_api.py - TypeError: DatasetInfo() takes no...
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
=============================== 1 error in 0.47s ===============================


Exited with code exit status 2


1 failure not recognized by patterns:

Job: CircleCI circleci_consistency
Step: Check CircleCI config consistency

This comment was automatically generated by Dr. CI.

@pmeier (Collaborator, Author) left a comment:

I only implemented the new design for VOC and ImageNet. If we agree on the changes, I'll address the remaining datasets.

Comment on lines +44 to +48
self.configs = []
for combination in itertools.product(*[option.valid for option in self.dataset.options]):
    options = dict(zip([option.name for option in self.dataset.options], combination))
    with contextlib.suppress(Exception):
        self.configs.append(self.dataset.make_config(**options))
@pmeier (Collaborator, Author):

This is the only part of the test suite changes that needs reviewing. All other changes are just hacks to temporarily test only VOC and ImageNet.
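
To illustrate what this loop produces, here is a minimal self-contained sketch (the Option class and the option values are hypothetical stand-ins for the real DatasetOption objects): every combination from the Cartesian product of the valid values becomes a config, and combinations the dataset rejects are silently skipped.

import contextlib
import itertools

class Option:  # hypothetical stand-in for the real DatasetOption
    def __init__(self, name, valid):
        self.name = name
        self.valid = valid

options = [Option("split", ("train", "test")), Option("year", ("2007", "2012"))]

configs = []
for combination in itertools.product(*[option.valid for option in options]):
    config = dict(zip([option.name for option in options], combination))
    # the real code calls self.dataset.make_config(**config) here, which may
    # raise for invalid combinations; those are silently skipped
    with contextlib.suppress(Exception):
        configs.append(config)

# configs now holds all four combinations:
# [{'split': 'train', 'year': '2007'}, {'split': 'train', 'year': '2012'},
#  {'split': 'test', 'year': '2007'}, {'split': 'test', 'year': '2012'}]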

Comment on lines +113 to +114
if arg not in option.valid:
    raise ValueError(f"Invalid argument '{arg}' for option '{name}' of dataset '{self.name}'. {option.doc}")
@pmeier (Collaborator, Author):

Maybe we could also have this logic on the DatasetOption class.
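
A minimal sketch of what that could look like (the validate method and its signature are hypothetical, not part of this PR):

class DatasetOption:
    def __init__(self, name, valid, doc=""):
        self.name = name
        self.valid = valid
        self.doc = doc

    def validate(self, arg, *, dataset_name):
        # same check as above, but owned by the option itself
        if arg not in self.valid:
            raise ValueError(
                f"Invalid argument '{arg}' for option '{self.name}' "
                f"of dataset '{dataset_name}'. {self.doc}"
            )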

SENTINEL = object()


class DatasetOption:
Member:

What is the reason for having a DatasetOption class instead of relying on standard Python parameter passing + docstrings?

I'm not sure we should try to automate input validation or docstring generation at all costs, TBH. If it forces us to deviate from our usual Python development practices, it may not be worth it?

@pmeier (Collaborator, Author):

Right now, we instantiate all Dataset objects at import time:

for name, obj in _builtin.__dict__.items():
    if not name.startswith("_") and isinstance(obj, type) and issubclass(obj, Dataset) and obj is not Dataset:
        register(obj())

At this stage no specific configuration is known. So something like

class MyDataset(Dataset):
    def __init__(self, foo="foo", bar=1):
        if not isinstance(bar, int):
            raise TypeError

is not possible.

Now, we could of course change the API to only instantiate the objects at runtime. But then we would also need a different way to access static data such as the categories, since that should be accessible without a specific configuration.

Member:

But then we also need a different way to access the static data such as categories or the like, since they should be accessible without a specific configuration.

Would it be reasonable to expect all parameters to have a default? This would allow us to instantiate the dataset objects whenever we need their static data.

Also, about the need for accessing metadata without a config: perhaps we don't have that yet, but I could see a dataset that has different classes (or a different "extra" field) depending on a parameter. Are we supporting this use case?
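
For illustration, a minimal sketch of that idea (the class and values are hypothetical): if every parameter has a default, a bare instantiation is always valid, so the static data stays reachable without committing to a configuration.

class MyDataset:  # stands in for a prototype Dataset subclass
    # static data that does not depend on any particular configuration
    categories = ["cat", "dog", "truck"]

    def __init__(self, split="train", year="2012"):
        self.split = split
        self.year = year

# every parameter has a default, so MyDataset() is always constructible
categories = MyDataset().categories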

@pmeier (Collaborator, Author) commented Feb 3, 2022:

Would it be reasonable to expect all parameters to have a default?

IMO, yes. We shouldn't have a dataset that is not loadable with datasets.load("foo"). Looking through our old datasets, I think there are two different scenarios where this is violated:

  1. Datasets that cannot be automatically downloaded and need multiple files as input

    root: str,
    annFile: str,

  2. Datasets with arguments that could very well have a default but don't, for no apparent reason

    def __init__(self, root: str, split: str, **kwargs: Any) -> None:

Since the download is handled differently and we no longer need to pass paths to the dataset, I currently don't see a reason not to assume it. If there ever is a case where we actually have no good default value, we can probably hack around it.

Also, about the need for accessing meta-data without a config: perhaps we don't have that yet, but I could see a dataset that has different classes (or different "extra" field) depending on a parameter. Are we supporting this use-case?

Currently, this is not supported, and I'm not sure what would be a good way to add it. With the design discussed in this thread, datasets.info() would give you the Dataset object instantiated with default parameters. Thus, we would always get the same metadata back. One way would be to accept **options there.
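
A sketch of what accepting **options there might look like (the registry and the signature are assumptions, not part of this PR):

# hypothetical registry, filled at import time: name -> Dataset instance
_registry = {}

def info(name, **options):
    # re-instantiate the registered dataset with the requested options, so the
    # returned metadata may depend on them (e.g. a config-dependent category list)
    dataset = _registry[name]
    return type(dataset)(**options) if options else dataset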

@NicolasHug (Member) commented Feb 4, 2022:

Perhaps it would help to see how we would be documenting these new datasets, before we can make a decision on this.

I'm sceptical in general about not using normal parameter passing, defaults, and docs. Especially if we ever allow users to create their own classes, we don't want them to have to learn yet another framework (and it's relevant for developers too). If it made the docs much easier to write, we might have a reason, but for now that benefit is not obvious to me.

@pmeier (Collaborator, Author):

Let me write minimalistic docs to showcase this.

@pmeier (Collaborator, Author):

See https://1173984-73328905-gh.circle-artifacts.com/0/docs/datasets_prototype/index.html. Of course, if we had more meta information filled in, such as the citation, license, description, and so on, the pages would not look as empty.

Member:

Thanks a lot @pmeier.

I have a few questions:

  • how would we document the type of the parameters?
  • how would we document extra methods on the datasets (if relevant)? What about extra attributes (again not sure if relevant)?
  • how would we document additional info that is specific to each dataset? Stuff like "This dataset contains X images of cars and Y images of trucks".

@pmeier (Collaborator, Author):

how would we document the type of the parameters?

I would put another column in the table for the type. DatasetOption could have an annotation field for that. Let me add it to the examples.

how would we document extra methods on the datasets (if relevant)?

IMO, we shouldn't have extra methods on the dataset objects. Currently, I don't think we have datasets that need this. Did you have anything in mind that other datasets might provide?

What about extra attributes (again not sure if relevant)?

In our current stable API we sometimes add them to the docstring in the Attributes: section. We could simply have another table, similar to Options, that holds these. Let me add them to my examples.

how would we document additional info that is specific to each dataset? Stuff like "This dataset contains X images of cars and Y images of trucks".

Given that this is very individual, I would simply have a description field that takes a string and puts it at the top of the generated documentation. Let me add an example.
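
Taken together, a hypothetical sketch of how these fields could hang together (illustrative only; names are assumptions and none of this exists in the PR yet):

class DatasetOption:
    def __init__(self, name, valid, doc="", annotation="str"):
        self.name = name
        self.valid = valid
        self.doc = doc
        # rendered as an extra "Type" column in the generated options table
        self.annotation = annotation

class MyDataset:  # stands in for a prototype Dataset subclass
    # free-form text placed at the top of the generated documentation page
    description = "This dataset contains X images of cars and Y images of trucks."
    options = [
        DatasetOption("split", ("train", "test"), doc="The dataset split.", annotation="str"),
    ]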

@NicolasHug mentioned this pull request Feb 3, 2022
@pmeier (Collaborator, Author) commented Feb 25, 2022

Superseded by the development on the https://github.com/pytorch/vision/tree/prototype-datasets-inheritance branch.

@pmeier closed this Feb 25, 2022
@pmeier deleted the datasets/refactor-info branch February 25, 2022 09:22