merge DatasetInfo into Dataset #5369
Conversation
💊 CI failures summary and remediations

As of commit 36f5c5f (more details on the Dr. CI page):

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages: unittest_prototype (1/1)
Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

| Job | Step | Action |
|---|---|---|
| circleci_consistency | Check CircleCI config consistency | 🔁 rerun |
I only implemented the new design for VOC and ImageNet. If we agree on the changes, I'll address the remaining datasets.
self.configs = []
# Generate every combination of valid option values and keep only the
# combinations that the dataset accepts as a valid config.
for combination in itertools.product(*[option.valid for option in self.dataset.options]):
    options = dict(zip([option.name for option in self.dataset.options], combination))
    with contextlib.suppress(Exception):
        self.configs.append(self.dataset.make_config(**options))
This is the only part of the test suite changes that needs reviewing. All other changes are just hacks to temporarily test only VOC and ImageNet.
if arg not in option.valid:
    raise ValueError(f"Invalid argument '{arg}' for option '{name}' of dataset '{self.name}'. {option.doc}")
Maybe we could also have this logic on the `DatasetOption` class.
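For illustration, here is a minimal sketch of what moving the check onto the option class could look like; the `validate` method and the constructor signature are assumptions, not code from this PR:

```python
class DatasetOption:
    def __init__(self, name, valid, doc=""):
        self.name = name
        self.valid = set(valid)
        self.doc = doc

    def validate(self, arg, dataset_name):
        # Same check as above, but it lives next to the option's own data.
        if arg not in self.valid:
            raise ValueError(
                f"Invalid argument '{arg}' for option '{self.name}' "
                f"of dataset '{dataset_name}'. {self.doc}"
            )
```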
SENTINEL = object()


class DatasetOption:
What is the reason for having a `DatasetOption` class instead of relying on standard Python parameter passing + docstrings?
I'm not sure we should try to automate input validation or docstring generation at all costs, TBH. If it forces us to deviate from our usual Python development practices, it may not be worth it?
Right now, we instantiate all `Dataset` objects at import time:
vision/torchvision/prototype/datasets/_api.py
Lines 21 to 23 in 44e93be
for name, obj in _builtin.__dict__.items():
    if not name.startswith("_") and isinstance(obj, type) and issubclass(obj, Dataset) and obj is not Dataset:
        register(obj())
At this stage no specific configuration is known. So something like
class MyDataset(Dataset):
    def __init__(self, foo="foo", bar=1):
        if not isinstance(bar, int):
            raise TypeError
is not possible.
Now, we can of course change the API to only instantiate the objects at runtime. But then we also need a different way to access the static data such as `categories` or the like, since they should be accessible without a specific configuration.
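For illustration only, one conceivable way to keep such static data reachable without a configured instance is to attach it at the class level; all names below are hypothetical:

```python
class Dataset:  # stand-in for the prototype base class
    pass


class MyDataset(Dataset):
    # Static meta data lives on the class, so no configuration is needed.
    _CATEGORIES = ("cat", "dog")

    @classmethod
    def categories(cls):
        return cls._CATEGORIES


# Usable without ever choosing a configuration:
assert MyDataset.categories() == ("cat", "dog")
```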
> But then we also need a different way to access the static data such as `categories` or the like, since they should be accessible without a specific configuration.
Would it be reasonable to expect all parameters to have a default? This would allow us to instantiate the dataset objects whenever we need their static data.
Also, about the need for accessing meta-data without a config: perhaps we don't have that yet, but I could see a dataset that has different classes (or different "extra" field) depending on a parameter. Are we supporting this use-case?
> Would it be reasonable to expect all parameters to have a default?

IMO, yes. We shouldn't have a dataset that is not loadable with `datasets.load("foo")`. Looking through our old datasets, I think there are two different scenarios where this is violated:
- Datasets that cannot be automatically downloaded and need multiple files as input:
  vision/torchvision/datasets/coco.py, Lines 27 to 28 in 44e93be:
  root: str, annFile: str,
- Datasets with arguments that could very well have a default but don't, for a probably unknown reason:
  vision/torchvision/datasets/mnist.py, Line 293 in 44e93be:
  def __init__(self, root: str, split: str, **kwargs: Any) -> None:
Since the download is handled differently and we no longer need to pass paths to the dataset, I currently don't see a reason not to assume it. If there ever is a case where we actually have no good default value, we can probably hack around it.
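As an illustration of such a hack, the `SENTINEL` object from the diff above could serve as a placeholder default, deferring the error until the dataset is actually used; the `load` method here is purely hypothetical:

```python
SENTINEL = object()


class MyDataset:
    def __init__(self, split=SENTINEL):
        # The object can still be instantiated at import time for registration.
        self.split = split

    def load(self):
        # Defer the error until the value is actually needed.
        if self.split is SENTINEL:
            raise ValueError("'split' has no default and must be passed explicitly")
        ...
```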
> Also, about the need for accessing meta-data without a config: perhaps we don't have that yet, but I could see a dataset that has different classes (or different "extra" field) depending on a parameter. Are we supporting this use-case?
Currently this is not supported. But I'm not sure what would be a good way to add this. With the design discussed in this thread, `datasets.info()` would give you the `Dataset` object instantiated with default parameters. Thus, we would always get the same meta data back. One way would be to accept `**options` there.
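Hypothetically, accepting `**options` there might look like the sketch below; the `_registry` mapping is an assumption about the internals, not actual prototype code:

```python
_registry = {}  # name -> Dataset class, filled by register() at import time


def info(name, **options):
    cls = _registry[name]
    # Without options this matches the behavior described above (default
    # parameters); with options, the returned meta data may differ.
    return cls(**options)
```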
Perhaps it would help to see how we would be documenting these new datasets before we can make a decision on this.
I'm sceptical in general about not using normal parameter passing, defaults, and docs. Especially if we ever allow users to create their own classes, we don't want them to learn yet another framework (but it's relevant for developers too). If it makes the docs much easier to write, then we might have a reason, but for now it does not appear obvious to me.
Let me write minimalistic docs to showcase this.
See https://1173984-73328905-gh.circle-artifacts.com/0/docs/datasets_prototype/index.html. Of course, if we had more meta information filled in, such as the citation, license, description, and so on, the pages would not look as empty.
Thanks a lot @pmeier.
I have a few questions:
- how would we document the type of the parameters?
- how would we document extra methods on the datasets (if relevant)? What about extra attributes (again not sure if relevant)?
- how would we document additional info that is specific to each dataset? Stuff like "This dataset contains X images of cars and Y images of trucks".
> how would we document the type of the parameters?

I would put another column in the table for the type. `DatasetOption` could have an `annotation` field for that. Let me add it to the examples.
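As a rough sketch of that idea (only the name `annotation` comes from the comment above; the rest is assumed):

```python
class DatasetOption:
    def __init__(self, name, valid, doc="", annotation="str"):
        self.name = name
        self.valid = valid
        self.doc = doc
        # Rendered by the doc generator as an extra "type" column in the options table.
        self.annotation = annotation


split = DatasetOption("split", valid=("train", "test"), annotation="str")
```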
> how would we document extra methods on the datasets (if relevant)?

IMO, we shouldn't have extra methods on the dataset objects. Currently, I don't think we have datasets that need this. Did you have anything in mind that other datasets might provide?
> What about extra attributes (again not sure if relevant)?

In our current stable API we sometimes add them to the docstring in the `Attributes:` section. We could simply have another table, similar to `Options`, that holds these. Let me add them to my examples.
> how would we document additional info that is specific to each dataset? Stuff like "This dataset contains X images of cars and Y images of trucks".

Given that this is very individual, I would simply have a `description` field which takes a string input and puts it at the top of the generated documentation. Let me add an example.
See https://1181665-73328905-gh.circle-artifacts.com/0/docs/datasets_prototype/index.html. I've taken the examples from https://paperswithcode.com/ for demo purposes.
Superseded by the development on the https://github.com/pytorch/vision/tree/prototype-datasets-inheritance branch.
Addresses points 1 to 3 from #5281.