ML info: split #45
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
I was thinking we would do something much simpler for splits: specify it on RecordSet as a simple enum field, something like […]. Do you see a strong reason to define splits at other levels?
…cordSet based on downloaded files content
I have added the Oxford 102 flower category dataset as another example, this time with splits defined at the recordSet level. This is more complicated than I had first remembered: they use lookups to define each record's split. I tentatively went with a possible syntax that will need to be fixed, but it could help the conversation.

In both examples, we are still missing a semantic understanding of the splits, to account for the different wordings people use for the same concept ("val", "validation", "train", "dev", ...).

As for defining a recordSet for every split, that would be possible, but it might lead to duplication in datasets that are sometimes already complex, basically defining three times (sometimes more) the same […]. Another possibility would be to define generic attribute values on File and FileSet, and have the recordSet refer to those in one of its fields.
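To make the "different wordings of the same concept" point concrete, here is a minimal Python sketch of one way split names could be normalized to a canonical label. The alias table, the `canonical_split` helper, and the choice to treat "dev" as a validation split are all assumptions made for illustration; none of them come from this PR.

```python
# Hypothetical alias table: maps the wordings people use to one canonical label.
# Treating "dev" as validation is an assumption, not something the PR defines.
SPLIT_ALIASES = {
    "train": "train",
    "training": "train",
    "val": "validation",
    "valid": "validation",
    "validation": "validation",
    "dev": "validation",
    "test": "test",
    "testing": "test",
}


def canonical_split(name: str) -> str:
    """Return the canonical split label for a free-form split name."""
    key = name.strip().lower()
    if key not in SPLIT_ALIASES:
        raise ValueError(f"unknown split name: {name!r}")
    return SPLIT_ALIASES[key]
```

With such a table, `canonical_split("Val")` and `canonical_split("dev")` would both resolve to the same label, while an unrecognized name raises an error instead of being silently accepted.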
As per offline discussions, I'll update the PR to push the split definitions to the […]
Please take another look, thanks!
Open question: Also, I have used mlcommons.org/definitions, which does not exist, because I was not able to find a Wikidata entry for individual dataset splits (train, test, or validation).
Add COCO2014 as an example of what split definition could look like when split is defined at the fileSet level.
In the end, I think we want to be able to define splits at all levels: FileObject, FileSet, and RecordSet Field, since splits can cut across the other definitions. Once we agree on this, I can send one example where splits are defined as a RecordSet Field and another where they are defined on the FileObject directly.
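As a rough illustration of what "splits at all levels" could mean, the Python sketch below lets a split label sit either on a file-level source (standing in for a FileObject or FileSet) or in a per-record field. The class names, the `split` attribute, and the precedence rule (a record-level value wins over the file-level one) are assumptions for the sake of discussion, not syntax proposed in this PR.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Source:
    """Hypothetical stand-in for a FileObject or a FileSet."""
    name: str
    split: Optional[str] = None  # file-level split, if any


@dataclass
class Record:
    """Hypothetical record produced from a source."""
    source: Source
    fields: dict = field(default_factory=dict)


def split_of(record: Record) -> Optional[str]:
    # Assumed precedence: a per-record "split" field wins,
    # otherwise fall back to the split declared on the source.
    return record.fields.get("split") or record.source.split


# File-level split: every record from this source is in "train".
train_images = Source("train_images", split="train")
# No file-level split: each record carries its own split value.
flowers = Source("flowers_archive")

r1 = Record(train_images, {"image": "a.jpg"})
r2 = Record(flowers, {"image": "b.jpg", "split": "test"})
```

Here `split_of(r1)` falls back to the source-level label, while `split_of(r2)` reads the per-record field; a record with neither yields `None`.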