Add sun397 prototype datapipe #5667
base: main
Conversation
💊 CI failures summary and remediations. As of commit 2573b16 (more details on the Dr. CI page):
1 failure not recognized by patterns.
This comment was automatically generated by Dr. CI. Please report bugs/suggestions to the (internal) Dr. CI Users group.
Hey @YosuaMichael and welcome to the torchvision team from me 🤗 The PR looks good in general. I've added a few comments inline.
One major thing that we need to address is the last point in #5164 (comment). cc @NicolasHug. I'm sorry, I forgot to mention this on the porting issue. The point is that the dataset defines a train and test split as well as 10 different folds. We postponed handling this for the legacy dataset, but should handle it now.
Hey @pmeier, thanks a lot for your review! Regarding your comment:
From the dataset homepage (https://vision.princeton.edu/projects/2010/SUN/) I understand that they define train and test partitions (10 different train/test pairs). Currently my idea is to have a split (train, test) and a fold (1-10). Will try to do this.
@YosuaMichael The overarching change that simplified the mock data generation is similar to what we discussed yesterday offline: if the dataset provides "keys" for the data, it is preferable to use them. See detailed comments inline.
test/builtin_dataset_mocks.py (outdated)
for fold in range(1, 11):
    random.shuffle(keys)

    for split, keys_in_split in random_group(keys, ["train", "test"]).items():
This randomly splits all our keys into two groups while making sure that each group has at least one element. Thus, we get a random number of samples for each config, which makes the common tests more robust.
What's the reason for having random subsets here? In general, we need our tests to be deterministic.
We randomize the images that are part of each split/fold configuration. If we had the same subsets for these configurations, the number-of-samples test could not enforce that the correct config was loaded from the dataset.
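For illustration, here is a minimal sketch of a helper with the behaviour described above. The name and call shape follow the random_group call in the snippet, but the PR's actual implementation may differ:

import random

def random_group(keys, groups):
    # Sketch: shuffle a copy, then split at random interior cut points so
    # that each group receives at least one item (assumes len(keys) >= len(groups)).
    items = list(keys)
    random.shuffle(items)
    cuts = sorted(random.sample(range(1, len(items)), len(groups) - 1))
    bounds = [0, *cuts, len(items)]
    return {group: items[start:stop] for group, (start, stop) in zip(groups, zip(bounds, bounds[1:]))}

# e.g. random_group(["img0", "img1", "img2"], ["train", "test"])
# -> {"train": ["img2"], "test": ["img0", "img1"]} (sizes vary per call)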
test/datasets_utils.py (outdated)
groups: Collection of group keys.

Returns:
    Dictionary with ``groups`` as keys. Each value is a list of random items from ``collection`` without
    overlap to the other values. Each list has at least length ``1``.
Perhaps a docstring example would be helpful to understand what this really does? Also, do we need the groups logic here? It looks like the core of this util is to create a random partition into n randomly-sized subsets. Perhaps leaving out the groups logic would make it more reusable?
I refactored to have a random_subsets function rather than random_group. The docstring now also includes a usage example. LMK if it is sufficient.
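The refactored helper is not part of this excerpt. Assuming it takes a collection and the number of subsets to produce, a minimal sketch of random_subsets, together with how the old grouping behaviour reduces to it, could look like this:

import random

def random_subsets(collection, num_subsets):
    # Sketch: shuffle a copy, then cut at random interior positions so that
    # every subset is non-empty (assumes len(collection) >= num_subsets).
    items = list(collection)
    random.shuffle(items)
    cuts = sorted(random.sample(range(1, len(items)), num_subsets - 1))
    bounds = [0, *cuts, len(items)]
    return tuple(items[start:stop] for start, stop in zip(bounds, bounds[1:]))

# The earlier grouping behaviour then becomes a one-liner:
# dict(zip(["train", "test"], random_subsets(keys, 2)))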
split=("train", "test"),
fold=tuple(str(fold) for fold in range(1, 11)),
In the original implementation of this dataset we decided not to include the fold parameter and also removed the split param (#5164 (comment)). Should we keep things consistent here?
That would ignore a lot of information the dataset provides. IIRC, the only reason we were OK with removing it was that this was one of the last datasets in the old API and FLAVA didn't need the functionality. Given that this no longer applies, IMO we should handle this now.
From what I recall, we decided not to include a fold parameter because:
- it's not clear what most users would use it for - FLAVA is our only known use-case ATM, and they don't need it
- we can add it in a BC way if users show a clear need for it
- the use of the fold parameter would set a precedent, and might conflict with other datasets like DTD, which also partition the dataset, but in a different way. So by not supporting this we're also minimizing the chances of inconsistencies in our API.
I think all of these still apply today, so unless we have a strong and obvious use-case for including it, I would prefer to go with the status quo here, which would give future us maximal flexibility.
Hey @NicolasHug @pmeier, I changed the valid_options to the following: there is now only a split option, whose value can be:
- "all" (default): this gives the same dataset as the old API, and since it is the default, it stays backward compatible with the old API
- "train-{fold}", e.g. "train-5": this gives the train dataset at fold=5 according to the paper
- "test-{fold}", e.g. "test-5": this gives the test dataset at fold=5 according to the paper
What are your thoughts on this?
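For illustration, here is a hedged sketch of how these values could be generated and parsed; all names below are made up for this example, not taken from the PR:

# All 21 valid values for the proposed split option: "all" plus
# "train-1" ... "train-10" and "test-1" ... "test-10".
valid_splits = ("all", *(f"{split}-{fold}" for split in ("train", "test") for fold in range(1, 11)))

def parse_split(value):
    # "all" selects every image; "train-5"/"test-5" select the train or
    # test partition of fold 5 from the paper.
    if value == "all":
        return None
    split, _, fold = value.partition("-")
    return split, int(fold)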
Related issues: #5351
This PR adds the SUN397 dataset to the builtin prototype datapipes.