Add sun397 prototype datapipe #5667
base: main
Conversation
💊 CI failures summary and remediations. As of commit 2573b16 (more details on the Dr. CI page):
1 failure not recognized by patterns.
This comment was automatically generated by Dr. CI. Please report bugs/suggestions to the (internal) Dr. CI Users group.
Hey @YosuaMichael and welcome to the torchvision team from me 🤗 The PR looks good in general. I've added a few comments inline.
One major thing that we need to address is the last point in #5164 (comment). cc @NicolasHug. I'm sorry, I forgot to mention this on the porting issue. The point is that the dataset defines a train and test split as well as 10 different folds. We postponed handling this for the legacy dataset, but should handle it now.
Hey @pmeier, thanks a lot for your review! Regarding your comment:
From the dataset homepage (https://vision.princeton.edu/projects/2010/SUN/) I understand that they define train and test partitions (10 different train/test pairs). Currently my idea is to have a split (train, test) and a fold (1-10). Will try to do this.
@YosuaMichael The overarching change that simplified the mock data generation is similar to what we discussed yesterday offline: if the dataset provides "keys" for the data, it is preferable to use them. See detailed comments inline.
test/builtin_dataset_mocks.py (outdated)
for fold in range(1, 11):
    random.shuffle(keys)

    for split, keys_in_split in random_group(keys, ["train", "test"]).items():
This randomly splits all our keys into two groups while making sure that each group has at least one element. Thus, we get a random number of samples for each config, which makes the common tests more robust.
What's the reason for having random subsets here? In general, we need our tests to be deterministic.
We randomize the images that are part of each split/fold configuration. If we had the same subsets for these configurations, the number-of-samples test could not enforce that the correct config was loaded from the dataset.
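For illustration, here is a minimal sketch of a helper with the behaviour described above. The name and call shape follow the random_group call in the snippet, but the PR's actual implementation may differ:

import random

def random_group(keys, groups):
    # Sketch: shuffle a copy, then split at random interior cut points so
    # that each group receives at least one item (assumes len(keys) >= len(groups)).
    items = list(keys)
    random.shuffle(items)
    cuts = sorted(random.sample(range(1, len(items)), len(groups) - 1))
    bounds = [0, *cuts, len(items)]
    return {group: items[start:stop] for group, (start, stop) in zip(groups, zip(bounds, bounds[1:]))}

# e.g. random_group(["img0", "img1", "img2"], ["train", "test"])
# -> {"train": ["img2"], "test": ["img0", "img1"]} (sizes vary per call)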
test/datasets_utils.py (outdated)
groups: Collection of group keys.

Returns:
    Dictionary with ``groups`` as keys. Each value is a list of random items from ``collection`` without
    overlap to the other values. Each list has at least length ``1``.
Perhaps a docstring example would be helpful to understand what this really does? Also, do we need the groups logic here? It looks like the core of this util is to create a random partition into n randomly-sized subsets. Perhaps leaving out the groups logic would make it more reusable?
I refactored to have a random_subsets function rather than random_group. The docstring now also includes a usage example. LMK if it is sufficient.
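The refactored helper is not part of this excerpt. Assuming it takes a collection and the number of subsets to produce, a minimal sketch of random_subsets, together with how the old grouping behaviour reduces to it, could look like this:

import random

def random_subsets(collection, num_subsets):
    # Sketch: shuffle a copy, then cut at random interior positions so that
    # every subset is non-empty (assumes len(collection) >= num_subsets).
    items = list(collection)
    random.shuffle(items)
    cuts = sorted(random.sample(range(1, len(items)), num_subsets - 1))
    bounds = [0, *cuts, len(items)]
    return tuple(items[start:stop] for start, stop in zip(bounds, bounds[1:]))

# The earlier grouping behaviour then becomes a one-liner:
# dict(zip(["train", "test"], random_subsets(keys, 2)))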
split=("train", "test"),
fold=tuple(str(fold) for fold in range(1, 11)),
In the original implementation of this dataset we decided not to include the fold parameter and also removed the split param (#5164 (comment)). Should we keep things consistent here?
That would ignore a lot of information the dataset provides. IIRC, the only reason we were OK with removing it was that this was one of the last datasets in the old API and FLAVA didn't need the functionality. Given that this no longer applies, IMO we should handle this now.
From what I recall, we decided not to include a fold parameter because:
- it's not clear what most users would use it for - FLAVA is our only known use-case ATM, and they don't need it
- we can add it in a BC way if users show a clear need for it
- the use of the fold parameter would set a precedent, and might conflict with other datasets like DTD, which also partition the dataset, but in a different way. So by not supporting this we're also minimizing the chances of inconsistencies in our API.
I think all of these still apply today, so unless we have a strong and obvious use-case for including it, I would prefer to go with the status quo here, which would give future us maximal flexibility.
Hey @NicolasHug @pmeier, I changed the valid_options to the following: there is now only a split option, whose value can be:
- "all" (default): this gives the same dataset as the old API, and since it is the default, it stays backward compatible with the old API
- "train-{fold}", e.g. "train-5": this gives the train dataset at fold=5 according to the paper
- "test-{fold}", e.g. "test-5": this gives the test dataset at fold=5 according to the paper
What are your thoughts on this?
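For illustration, here is a hedged sketch of how these values could be generated and parsed; all names below are made up for this example, not taken from the PR:

# All 21 valid values for the proposed split option: "all" plus
# "train-1" ... "train-10" and "test-1" ... "test-10".
valid_splits = ("all", *(f"{split}-{fold}" for split in ("train", "test") for fold in range(1, 11)))

def parse_split(value):
    # "all" selects every image; "train-5"/"test-5" select the train or
    # test partition of fold 5 from the paper.
    if value == "all":
        return None
    split, _, fold = value.partition("-")
    return split, int(fold)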
Related issues: #5351
This PR adds the SUN397 dataset to the builtin prototype datapipes.