Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sketch my suggestions #564

Conversation

jpcbertoldo
Copy link
Contributor

a sketch of my suggestions to PR #558

@github-actions github-actions bot added the Data label Sep 19, 2022
@jpcbertoldo jpcbertoldo mentioned this pull request Sep 19, 2022
12 tasks
Copy link
Collaborator

@ashwinvaidya17 ashwinvaidya17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like these changes but as @djdameln has been working on this task I'll leave the design discussion to him. Especially about moving _create_samples to AnomalibDataset

Comment on lines +27 to +41
TASK_CLASSIFICATION = "classification"
TASK_SEGMENTATION = "segmentation"
TASKS = (TASK_CLASSIFICATION, TASK_SEGMENTATION)

SPLIT_TRAIN = "train"
SPLIT_VAL = "val"
SPLIT_TEST = "test"
SPLITS = (SPLIT_TRAIN, SPLIT_VAL, SPLIT_TEST)

LABEL_NORMAL = 0
LABEL_ANOMALOUS = 1

_SAMPLES_MANDATORY_COLUMNS_CLASSIFICATION = ["image_path", "label"]
_SAMPLES_MANDATORY_COLUMNS_SEGMENTATION = _SAMPLES_MANDATORY_COLUMNS_CLASSIFICATION + ["mask_path"]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our style guide is based on google'. We should try to upload it somewhere. According to it, we should try to avoid global variables https://google.github.io/styleguide/pyguide.html#25-global-variables. I'd be in favour of having enums. Then we can have Task.Classification and Task.Segmentation.

self.task = task
self.split = split
self.pre_process = pre_process
self.samples: DataFrame = self._create_samples(self.root, self.split)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main problem with having _create_samples as an abstract method of AnomalibDataset, is that this method involves randomly assigning the samples to one of the subsets (train, val, test). This means that we would need to pass a seed to the dataset instance to make sure that the samples in each of the subsets remain mutually exclusive. This also means that we would generate the exact same samples table in each call of _create_samples, which would be inefficient.

The more sensible approach, therefore, would be to create the samples once in a common location, then apply the subset splitting, and then only pass the samples of each split to the constructors of the respective dataset instances. In line with this reasoning, we moved _create_samples to the datamodule.

I do agree that it would be better to decouple AnomalibDataModule and AnomalibDataset to make the dataset more standalone. I'm also not very happy that we're passing a dataframe around between classes, so we might need yet another design for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm.. ok i see, good point

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't how fast you want to get over with this pr, but here is a possible alternative:

AnomalibDataset is not aware of these dynamically defined splits and deals with dataset-defines splits (mvtec would have train and test only, mvtec loco has train, val, and test)

AnomalibDataset gets still gets a df at the init like your version but should provide an abstract class method to load the splits mentioned above.

AnomalivDataset has an interface or method to generate subsplits (get two instances from one such that they dont have common samples)

this interface could take two lists of indices so the seed that generates them stays out of this function

(Probably better) alternative: the list of indices for subsampling can be given at the abstract classmethod mentioned before

AnomalibDatamodule would deal with the subspliting at the init, by only generating the lists of indices from above (must know the nb of samples in advance but no big deal)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want, let's make a call to discuss

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpcbertoldo I've outlined a new design proposal in #572. We could continue the discussion there. I will close this PR now.

@djdameln djdameln closed this Sep 22, 2022
@jpcbertoldo jpcbertoldo deleted the up/da/refactor-datamodules branch February 9, 2024 12:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants