
Add sun397 prototype datapipe #5667

Open · wants to merge 15 commits into main
Conversation

YosuaMichael (Contributor)

Related issues: #5351

Adding SUN397 dataset into the builtin prototype datapipe.

@facebook-github-bot

facebook-github-bot commented Mar 23, 2022

💊 CI failures summary and remediations

As of commit 2573b16 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

1 failure not recognized by patterns:

Job: CircleCI unittest_prototype
Step: Run tests

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@YosuaMichael YosuaMichael marked this pull request as draft March 23, 2022 14:00
@pmeier (Collaborator) left a comment

Hey @YosuaMichael and welcome to the torchvision team from me 🤗 The PR looks good in general. I've added a few comments inline.

One major thing that we need to address is the last point in #5164 (comment). cc @NicolasHug I'm sorry, I forgot to mention this on the porting issue. The point is that the dataset defines a train and test split as well as 10 different folds. We postponed handling this for the legacy dataset, but should handle this now.

Review threads (outdated, resolved):

  • torchvision/prototype/datasets/utils/_resource.py
  • test/builtin_dataset_mocks.py
  • torchvision/prototype/datasets/_builtin/sun397.py
  • test/builtin_dataset_mocks.py
@pmeier pmeier linked an issue Mar 23, 2022 that may be closed by this pull request
@YosuaMichael (Contributor, Author)

Hey @pmeier thanks a lot for your review!

Regarding your comment:

One major thing that we need to address is the last point in #5164 (comment). cc @NicolasHug I'm sorry, I forgot to mention this on the porting issue. The point is that the dataset defines a train and test split as well as 10 different folds. We postponed handling this for the legacy dataset, but should handle this now.

I understand from the dataset homepage (https://vision.princeton.edu/projects/2010/SUN/) that they describe the train and test partitions (there are 10 different train/test pairs). My current idea is to have a split parameter (train, test) and a fold parameter (1-10). Will try to do this.
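For illustration, here is a minimal, hypothetical sketch of how a (split, fold) pair could be mapped to the partition lists that ship with SUN397. The file names `Training_01.txt` … `Testing_10.txt` are assumed from the dataset homepage; the function name is made up for this example, not part of the PR.

```python
def partition_file_name(split: str, fold: int) -> str:
    """Map a (split, fold) pair to an assumed SUN397 partition list file name."""
    if split not in {"train", "test"}:
        raise ValueError(f"split should be 'train' or 'test', got {split!r}")
    if not 1 <= fold <= 10:
        raise ValueError(f"fold should be in [1, 10], got {fold}")
    # The partitions are assumed to be shipped as Training_XX.txt / Testing_XX.txt
    # with zero-padded fold numbers.
    prefix = "Training" if split == "train" else "Testing"
    return f"{prefix}_{fold:02d}.txt"
```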

@YosuaMichael YosuaMichael marked this pull request as ready for review March 23, 2022 22:11
@pmeier (Collaborator) left a comment

@YosuaMichael The overarching change that simplified the mock data generation is similar to what we discussed yesterday offline: if the dataset provides "keys" for the data, it is preferable to use them. See detailed comments inline.

Review threads:

  • test/builtin_dataset_mocks.py (three resolved threads)
  • test/datasets_utils.py (outdated, resolved)
for fold in range(1, 11):
    random.shuffle(keys)

    for split, keys_in_split in random_group(keys, ["train", "test"]).items():
Collaborator:

This randomly splits all our keys into two groups while making sure that each group has at least one element. Thus, we get a random number of samples for each config, which makes the common tests more robust.

Member:

What's the reason for having random subsets here? In general, we need our tests to be deterministic.

Collaborator:

We randomize the images that are part of each split / fold configuration. If we used the same subsets for these configurations, the number-of-samples test could not enforce that the correct config was loaded from the dataset.

Review thread (resolved): test/builtin_dataset_mocks.py
    groups: Collection of group keys.

Returns:
    Dictionary with ``groups`` as keys. Each value is a list of random items from ``collection`` without overlap
    with the other values. Each list has length at least ``1``.
@NicolasHug (Member) commented Mar 25, 2022:

Perhaps a docstring example would be helpful to understand what this really does?
Also, do we need the groups logic here? Looks like the core of this util is to create a random partition of n randomly-sized subsets. But perhaps leaving out the groups logic would make it more re-usable?

Collaborator:

I refactored to have a random_subsets function rather than random_groups. The docstring now also includes a usage example. LMK if it is sufficient.
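The helper discussed above is not reproduced in this thread, so here is a minimal sketch of the described behavior (a random partition into disjoint, non-empty subsets of random size) under the name `random_subsets`; this is an illustration of the idea, not the actual torchvision test utility.

```python
import random


def random_subsets(collection, num_subsets, *, rng=None):
    """Randomly partition ``collection`` into ``num_subsets`` disjoint subsets.

    Every subset has at least one element, and subset sizes are random, which
    mirrors the robustness argument made for the mock-data tests above.
    """
    items = list(collection)
    if num_subsets > len(items):
        raise ValueError("cannot build more non-empty subsets than there are items")
    rng = rng or random.Random()
    rng.shuffle(items)
    # Choose num_subsets - 1 distinct cut points; slicing between consecutive
    # cut points yields non-empty, non-overlapping subsets that cover all items.
    cuts = sorted(rng.sample(range(1, len(items)), num_subsets - 1))
    return [items[start:stop] for start, stop in zip([0, *cuts], [*cuts, len(items)])]
```

Usage example: `random_subsets(keys, 2)` would stand in for splitting the mock keys into random-sized "train" and "test" groups.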

Comment on lines 22 to 23
split=("train", "test"),
fold=tuple(str(fold) for fold in range(1, 11)),
Member:

@pmeier

In the original implementation of this dataset we decided not to include the fold parameter and also removed the split param #5164 (comment). Should we keep things consistent here?

Collaborator:

That would ignore a lot of information the dataset provides. IIRC, the only reason we were OK with removing it was that this was one of the last datasets in the old API and FLAVA didn't need the functionality. Given that this no longer applies, IMO we should handle this now.

Member:

From what I recall, we decided not to include a fold parameter because:

  • it's not clear what it would be used for by most users - FLAVA is our only known use-case ATM, and they don't need it
  • we can add it in a BC way if users show a clear need for it
  • the use of the fold parameter would set a precedent, and might conflict with other datasets like DTD, which also partition the dataset, but in a different way. So by not supporting this we're also minimizing the chances of inconsistencies in our API.

I think all of these still apply today, so unless we have a strong and obvious use-case for including it, I would prefer to go with the status quo here, which would give us the maximal flexibility for future us.

@YosuaMichael (Contributor, Author) commented Mar 30, 2022:

Hey @NicolasHug @pmeier, I changed the valid_options to the following.
It now only has a split option, whose value can be:

  • "all" (default) -> this gives the same dataset as the old API, and since it is the default, it is backward compatible with the old API
  • "train-{fold}", e.g. "train-5" -> this gives the train dataset at fold=5 according to the paper
  • "test-{fold}", e.g. "test-5" -> this gives the test dataset at fold=5 according to the paper

What are your thoughts on this?
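For illustration, a minimal sketch of how such a single-string split option could be validated and decomposed. The helper name `parse_split` and the returned tuple shape are assumptions for this example, not code from the PR.

```python
# Valid values under the proposed scheme: "all", "train-1" ... "train-10",
# "test-1" ... "test-10" (21 options in total).
VALID_SPLITS = ("all",) + tuple(
    f"{name}-{fold}" for name in ("train", "test") for fold in range(1, 11)
)


def parse_split(split: str):
    """Return (name, fold), where fold is None for the "all" split."""
    if split not in VALID_SPLITS:
        raise ValueError(f"Unknown split {split!r}; valid values: {VALID_SPLITS}")
    if split == "all":
        return "all", None
    name, fold = split.split("-")
    return name, int(fold)
```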

Successfully merging this pull request may close these issues: SUN397
5 participants