sketch my suggestions #564

jpcbertoldo · 2022-09-19T01:04:58Z

ashwinvaidya17

I like these changes but as @djdameln has been working on this task I'll leave the design discussion to him. Especially about moving _create_samples to AnomalibDataset

ashwinvaidya17 · 2022-09-19T06:51:35Z

anomalib/data/base.py

+TASK_CLASSIFICATION = "classification"
+TASK_SEGMENTATION = "segmentation"
+TASKS = (TASK_CLASSIFICATION, TASK_SEGMENTATION)
+
+SPLIT_TRAIN = "train"
+SPLIT_VAL = "val"
+SPLIT_TEST = "test"
+SPLITS = (SPLIT_TRAIN, SPLIT_VAL, SPLIT_TEST)
+
+LABEL_NORMAL = 0
+LABEL_ANOMALOUS = 1
+
+_SAMPLES_MANDATORY_COLUMNS_CLASSIFICATION = ["image_path", "label"]
+_SAMPLES_MANDATORY_COLUMNS_SEGMENTATION = _SAMPLES_MANDATORY_COLUMNS_CLASSIFICATION + ["mask_path"]
+


Our style guide is based on google'. We should try to upload it somewhere. According to it, we should try to avoid global variables https://google.github.io/styleguide/pyguide.html#25-global-variables. I'd be in favour of having enums. Then we can have Task.Classification and Task.Segmentation.

djdameln · 2022-09-19T16:58:35Z

anomalib/data/base.py

        self.task = task
        self.split = split
        self.pre_process = pre_process
+        self.samples: DataFrame = self._create_samples(self.root, self.split)


The main problem with having _create_samples as an abstract method of AnomalibDataset, is that this method involves randomly assigning the samples to one of the subsets (train, val, test). This means that we would need to pass a seed to the dataset instance to make sure that the samples in each of the subsets remain mutually exclusive. This also means that we would generate the exact same samples table in each call of _create_samples, which would be inefficient.

The more sensible approach, therefore, would be to create the samples once in a common location, then apply the subset splitting, and then only pass the samples of each split to the constructors of the respective dataset instances. In line with this reasoning, we moved _create_samples to the datamodule.

I do agree that it would be better to decouple AnomalibDataModule and AnomalibDataset to make the dataset more standalone. I'm also not very happy that we're passing a dataframe around between classes, so we might need yet another design for this.

hmm.. ok i see, good point

I don't how fast you want to get over with this pr, but here is a possible alternative:

AnomalibDataset is not aware of these dynamically defined splits and deals with dataset-defines splits (mvtec would have train and test only, mvtec loco has train, val, and test)

AnomalibDataset gets still gets a df at the init like your version but should provide an abstract class method to load the splits mentioned above.

AnomalivDataset has an interface or method to generate subsplits (get two instances from one such that they dont have common samples)

this interface could take two lists of indices so the seed that generates them stays out of this function

(Probably better) alternative: the list of indices for subsampling can be given at the abstract classmethod mentioned before

AnomalibDatamodule would deal with the subspliting at the init, by only generating the lists of indices from above (must know the nb of samples in advance but no big deal)

If you want, let's make a call to discuss

@jpcbertoldo I've outlined a new design proposal in #572. We could continue the discussion there. I will close this PR now.

sketch my suggestions

e2700cb

github-actions bot added the Data label Sep 19, 2022

jpcbertoldo mentioned this pull request Sep 19, 2022

Refactor data modules #558

Closed

12 tasks

ashwinvaidya17 reviewed Sep 19, 2022

View reviewed changes

ashwinvaidya17 requested a review from djdameln September 19, 2022 06:54

djdameln reviewed Sep 19, 2022

View reviewed changes

djdameln closed this Sep 22, 2022

jpcbertoldo deleted the up/da/refactor-datamodules branch February 9, 2024 12:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sketch my suggestions #564

sketch my suggestions #564

jpcbertoldo commented Sep 19, 2022

ashwinvaidya17 left a comment

ashwinvaidya17 Sep 19, 2022

djdameln Sep 19, 2022

jpcbertoldo Sep 19, 2022

jpcbertoldo Sep 20, 2022

jpcbertoldo Sep 20, 2022

djdameln Sep 22, 2022

sketch my suggestions #564

sketch my suggestions #564

Conversation

jpcbertoldo commented Sep 19, 2022

ashwinvaidya17 left a comment

Choose a reason for hiding this comment

ashwinvaidya17 Sep 19, 2022

Choose a reason for hiding this comment

djdameln Sep 19, 2022

Choose a reason for hiding this comment

jpcbertoldo Sep 19, 2022

Choose a reason for hiding this comment

jpcbertoldo Sep 20, 2022

Choose a reason for hiding this comment

jpcbertoldo Sep 20, 2022

Choose a reason for hiding this comment

djdameln Sep 22, 2022

Choose a reason for hiding this comment