-
Notifications
You must be signed in to change notification settings - Fork 704
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sketch my suggestions #564
sketch my suggestions #564
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like these changes but as @djdameln has been working on this task I'll leave the design discussion to him. Especially about moving _create_samples
to AnomalibDataset
TASK_CLASSIFICATION = "classification" | ||
TASK_SEGMENTATION = "segmentation" | ||
TASKS = (TASK_CLASSIFICATION, TASK_SEGMENTATION) | ||
|
||
SPLIT_TRAIN = "train" | ||
SPLIT_VAL = "val" | ||
SPLIT_TEST = "test" | ||
SPLITS = (SPLIT_TRAIN, SPLIT_VAL, SPLIT_TEST) | ||
|
||
LABEL_NORMAL = 0 | ||
LABEL_ANOMALOUS = 1 | ||
|
||
_SAMPLES_MANDATORY_COLUMNS_CLASSIFICATION = ["image_path", "label"] | ||
_SAMPLES_MANDATORY_COLUMNS_SEGMENTATION = _SAMPLES_MANDATORY_COLUMNS_CLASSIFICATION + ["mask_path"] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our style guide is based on google'. We should try to upload it somewhere. According to it, we should try to avoid global variables https://google.github.io/styleguide/pyguide.html#25-global-variables. I'd be in favour of having enums. Then we can have Task.Classification
and Task.Segmentation
.
self.task = task | ||
self.split = split | ||
self.pre_process = pre_process | ||
self.samples: DataFrame = self._create_samples(self.root, self.split) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main problem with having _create_samples
as an abstract method of AnomalibDataset, is that this method involves randomly assigning the samples to one of the subsets (train, val, test). This means that we would need to pass a seed to the dataset instance to make sure that the samples in each of the subsets remain mutually exclusive. This also means that we would generate the exact same samples table in each call of _create_samples
, which would be inefficient.
The more sensible approach, therefore, would be to create the samples once in a common location, then apply the subset splitting, and then only pass the samples of each split to the constructors of the respective dataset instances. In line with this reasoning, we moved _create_samples
to the datamodule.
I do agree that it would be better to decouple AnomalibDataModule
and AnomalibDataset
to make the dataset more standalone. I'm also not very happy that we're passing a dataframe around between classes, so we might need yet another design for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm.. ok i see, good point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't how fast you want to get over with this pr, but here is a possible alternative:
AnomalibDataset is not aware of these dynamically defined splits and deals with dataset-defines splits (mvtec would have train and test only, mvtec loco has train, val, and test)
AnomalibDataset gets still gets a df at the init like your version but should provide an abstract class method to load the splits mentioned above.
AnomalivDataset has an interface or method to generate subsplits (get two instances from one such that they dont have common samples)
this interface could take two lists of indices so the seed that generates them stays out of this function
(Probably better) alternative: the list of indices for subsampling can be given at the abstract classmethod mentioned before
AnomalibDatamodule would deal with the subspliting at the init, by only generating the lists of indices from above (must know the nb of samples in advance but no big deal)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want, let's make a call to discuss
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jpcbertoldo I've outlined a new design proposal in #572. We could continue the discussion there. I will close this PR now.
a sketch of my suggestions to PR #558