ML info: split #45

Merged: pierrot0 merged 12 commits into main from pierrot_ml_semantics on Jun 6, 2023

Conversation

pierrot0
Contributor

@pierrot0 pierrot0 commented Jun 1, 2023

Add COCO2014 as an example of what split definition could look like when split is defined at the fileSet level.

In the end I think we want to be able to define split at all levels: FileObject, FileSet, and RecordSet Field, as this can be transversal to other definitions. So once we agree on this, I can send an example where splits are defined as a RecordSet Field, and another example where splits are defined on the FileObject directly.

@pierrot0 pierrot0 requested a review from a team as a code owner June 1, 2023 09:44
@github-actions

github-actions bot commented Jun 1, 2023

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@pierrot0 pierrot0 requested a review from benjelloun June 1, 2023 09:44
@benjelloun
Contributor

I was thinking we would do something much simpler for splits: specify it on the RecordSet as a simple enum field, something like "split": "TEST". This requires the dataset creator to always define a RecordSet for each split, but I think that's useful anyway, since we want the data to be accessible via the library.

Do you see a strong reason to define splits at other levels?
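
For illustration, the RecordSet-level proposal above could be sketched as follows. This is only a rough sketch and not actual Croissant syntax: the `Split` enum, the `TRAIN` and `VALIDATION` values, the record set names, and the helper `record_sets_for` are all hypothetical (only "split": "TEST" comes from the comment).

```python
from enum import Enum


class Split(Enum):
    # Only "TEST" appears in the discussion; the other values are assumed.
    TRAIN = "TRAIN"
    VALIDATION = "VALIDATION"
    TEST = "TEST"


# Hypothetical RecordSet descriptions: one RecordSet per split, each
# carrying a plain enum-valued "split" field.
record_sets = [
    {"name": "images_train", "split": "TRAIN"},
    {"name": "images_validation", "split": "VALIDATION"},
    {"name": "images_test", "split": "TEST"},
]


def record_sets_for(split, record_sets):
    """Return the names of all RecordSets declared for the given split."""
    # Split(value) raises ValueError on an unknown split string, which
    # doubles as validation of the enum field.
    return [rs["name"] for rs in record_sets if Split(rs["split"]) is split]
```

With this shape, a library could select data by split with a single lookup, at the cost of one RecordSet definition per split.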

@pierrot0 pierrot0 linked an issue Jun 2, 2023 that may be closed by this pull request
@pierrot0
Contributor Author

pierrot0 commented Jun 2, 2023

I have added the Oxford 102 Flower Category dataset as another example, this time with splits defined at the RecordSet level.

This is more complicated than I had first remembered: they used lookups to define a record's split. I tentatively went with a possible syntax, which will need to be fixed but could help the conversation.

In both examples, we are still missing the semantic understanding of the splits, to account for different wordings of the same concept people use ("val", "validation", "train", "dev", ...).
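
To make the wording problem concrete, here is a minimal sketch of normalizing split names to a canonical form. The synonym table is an assumption made for illustration (in particular, treating "dev" as a validation split), and `normalize_split` is a hypothetical helper, not part of any existing library.

```python
# Assumed synonym table; a real vocabulary would need to be agreed upon.
CANONICAL_SPLITS = {
    "train": "train",
    "training": "train",
    "val": "validation",
    "valid": "validation",
    "validation": "validation",
    "dev": "validation",  # assumption: "dev" is used as a validation set
    "test": "test",
    "testing": "test",
}


def normalize_split(name: str) -> str:
    """Map a free-form split name to a canonical one, or raise ValueError."""
    key = name.strip().lower()
    if key not in CANONICAL_SPLITS:
        raise ValueError(f"Unknown split name: {name!r}")
    return CANONICAL_SPLITS[key]
```

A lookup table like this only covers the wordings someone has listed, which is one reason a shared semantic definition of each split would be preferable.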

As for defining a RecordSet for every split: that is a possibility, but it might duplicate definitions in datasets that are already complex. We would essentially define the same RecordSets three times (sometimes more), with the split as the only difference between them.

One possibility would be to define generic attribute values on File and FileSet, and let the RecordSet refer to them in one of its fields.

@pierrot0
Contributor Author

pierrot0 commented Jun 2, 2023

As per offline discussions, I'll update the PR to push the split definitions to the RecordSet level, possibly using filename and full_path_name or something similar to extract the needed information there.
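
As a rough illustration of extracting the split from a path, the sketch below assumes a COCO 2014 style layout in which images live in directories such as train2014/ or val2014/. The pattern and the helper `split_from_path` are hypothetical and do not reflect the syntax used in the PR.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# Assumed directory naming: a split name followed by a four-digit year,
# e.g. "train2014" or "val2014".
SPLIT_DIR_PATTERN = re.compile(r"^(train|val|test)\d{4}$")


def split_from_path(full_path: str) -> Optional[str]:
    """Return the split encoded in a directory component, if any."""
    # Inspect every component except the file name itself.
    for part in PurePosixPath(full_path).parts[:-1]:
        match = SPLIT_DIR_PATTERN.match(part)
        if match:
            return match.group(1)
    return None
```

For example, `split_from_path("train2014/COCO_train2014_000000000009.jpg")` yields `"train"`, while a path with no matching directory yields `None`.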

@pierrot0
Contributor Author

pierrot0 commented Jun 5, 2023

Please take another look, thanks!
I have kept only COCO; I can reintroduce the flowers example in another PR.

@pierrot0
Contributor Author

pierrot0 commented Jun 5, 2023

Open question:
How do we distinguish between "#{csvfile.csv}/filename" meaning the column named filename and the actual name of the file?
Maybe we want a specific syntax to extract content based on file type (CSV, JSON, etc.)?

Also, I have used a nonexistent mlcommons.org/definitions URL, because I was not able to find a Wikidata entry for individual dataset splits (train, test, or validation).

@pierrot0 pierrot0 merged commit 4357ea3 into main Jun 6, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Jun 6, 2023
@pierrot0 pierrot0 deleted the pierrot_ml_semantics branch June 6, 2023 20:35
Development

Successfully merging this pull request may close these issues.

Add support for splits to Croissant format
4 participants