ML info: split #45

pierrot0 · 2023-06-01T09:44:02Z

Add COCO2014 as an example of what a split definition could look like when the split is defined at the FileSet level.

In the end, I think we want to be able to define splits at all levels (FileObject, FileSet, and RecordSet Field), since splits can cut across the other definitions. Once we agree on this, I can send an example where splits are defined as a RecordSet Field, and another where splits are defined directly on the FileObject.

github-actions · 2023-06-01T09:44:15Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

benjelloun · 2023-06-01T11:48:58Z

I was thinking we would do something much simpler for splits: specify it on the RecordSet as a simple enum field, something like "split": "TEST". This requires the dataset creator to always define a RecordSet for each split, but I think that's useful anyway, since we want the data to be accessible via the library.
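To make the enum idea concrete, here is a minimal sketch in Python. Only the "split" key and the "TEST" value come from the comment above; the RecordSet names, the other enum values, and the helper function are hypothetical and not part of any agreed format.

```python
# Hypothetical, simplified RecordSet descriptions. Only the "split" key and
# the "TEST" value come from the discussion; every other name is illustrative.
record_sets = [
    {"name": "images_train", "split": "TRAIN"},
    {"name": "images_validation", "split": "VALIDATION"},
    {"name": "images_test", "split": "TEST"},
]

# Assumed set of allowed enum values.
ALLOWED_SPLITS = {"TRAIN", "VALIDATION", "TEST"}


def record_sets_for_split(record_sets, split):
    """Return the names of the RecordSets tagged with the given split."""
    if split not in ALLOWED_SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return [rs["name"] for rs in record_sets if rs.get("split") == split]
```

With one RecordSet per split, a loader can select the data for a split with a plain equality check, which is the simplicity this proposal is after.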

Do you see a strong reason to define splits at other levels?

pierrot0 · 2023-06-02T08:35:39Z

I have added the Oxford 102 Category Flower dataset as another example, this time with splits defined at the RecordSet level.

This is more complicated than I remembered: the dataset uses lookups to define each record's split. I tentatively went with a possible syntax, which will need to be fixed but could help the conversation.

In both examples, we are still missing a semantic understanding of the splits, to account for the different wordings people use for the same concept ("val", "validation", "train", "dev", ...).
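As a rough illustration of what that semantic layer has to do, the sketch below maps free-form split names to a small canonical vocabulary. The alias table is an assumption for illustration (in particular, treating "dev" as validation), not something defined in this PR.

```python
from typing import Optional

# Assumed alias table: maps spellings seen in the wild to a canonical name.
# Treating "dev" as a synonym of "validation" is a common convention, but it
# is an assumption here, not something this PR defines.
_SPLIT_ALIASES = {
    "train": "train",
    "training": "train",
    "val": "validation",
    "valid": "validation",
    "validation": "validation",
    "dev": "validation",
    "test": "test",
    "testing": "test",
}


def canonical_split(name: str) -> Optional[str]:
    """Return the canonical split name, or None if the wording is unknown."""
    return _SPLIT_ALIASES.get(name.strip().lower())
```

Returning None for unknown wordings, rather than guessing, leaves it to the dataset author to state the meaning explicitly.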

As for defining a RecordSet for every split: that would be possible, but it might lead to duplication in datasets that are sometimes already complex, essentially defining the same RecordSets three times (or more), with the split as the only difference between them.

One possibility would be to define generic attribute values on File and FileSet, and let the RecordSet refer to them in one of its fields.

datasets/coco2014/metadata.json

pierrot0 · 2023-06-02T12:52:51Z

As per offline discussions, I'll update the PR to push the split definitions to the RecordSet level, possibly using filename and full_path_name or something similar to extract the needed information there.
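A minimal sketch of extracting the split from a file name, assuming the usual COCO 2014 image naming (for example `COCO_train2014_000000000009.jpg`). The function name and the regular expression are illustrative; they are not the syntax used in metadata.json.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# Assumed naming convention for COCO 2014 images:
#   COCO_<split>2014_<zero-padded image id>.jpg
_SPLIT_PATTERN = re.compile(r"^COCO_(train|val|test)2014_\d+\.jpg$")


def split_from_path(full_path: str) -> Optional[str]:
    """Return "train", "val" or "test" based on the file name, else None."""
    filename = PurePosixPath(full_path).name  # drop any directory components
    match = _SPLIT_PATTERN.match(filename)
    return match.group(1) if match else None
```

The same expression could be applied to the full path instead of the file name if a dataset encodes the split in a directory name.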

datasets/coco2014/README.md

pierrot0 · 2023-06-05T10:01:57Z

Please take another look, thanks!
I have kept only COCO; I can reintroduce flowers in another PR.

pierrot0 · 2023-06-05T11:05:14Z

Open question:
how do we distinguish between the column referenced by "#{csvfile.csv}/filename" and the actual filename?
Maybe we want a specific syntax for extracting content based on file type (CSV, JSON, etc.)?
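To illustrate the ambiguity, the sketch below builds a small in-memory CSV that happens to have a column called `filename`, and shows the two things a reference ending in `/filename` could mean. The CSV content and variable names are made up for illustration.

```python
import csv
import io

# Made-up CSV file whose header happens to contain a "filename" column.
csv_name = "csvfile.csv"
csv_text = "filename,split\nimg_1.jpg,train\nimg_2.jpg,val\n"

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Reading 1: "filename" is a column inside the CSV (one value per row).
column_values = [row["filename"] for row in rows]

# Reading 2: "filename" is a property of the file itself (a single value).
file_property = csv_name
```

Both readings are valid for the same reference string, which is why a syntax that distinguishes file properties from file content would remove the ambiguity.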

Also, I have used a nonexistent mlcommons.org/definitions, because I was not able to find Wikidata entries for the individual dataset splits (train, test, or validation).

coco2014 for splits

8bb963c

pierrot0 requested a review from a team as a code owner June 1, 2023 09:44

pierrot0 requested a review from benjelloun June 1, 2023 09:44

Add oxford 102 category flower example for splits being defined in recordSet based on downloaded files content

b4ba35c

pierrot0 linked an issue Jun 2, 2023 that may be closed by this pull request

Add support for splits to Croissant format #44

Closed

marcenacp reviewed Jun 2, 2023

datasets/coco2014/metadata.json (outdated, resolved)

josvandervelde reviewed Jun 2, 2023

datasets/coco2014/README.md (outdated, resolved)

pierrot0 added 2 commits June 5, 2023 08:30

coco2014: move split definition to RecordSet level.

40265e5

Add semantic understanding of data splits.

e62e58c

pierrot0 added a commit that referenced this pull request Jun 5, 2023

oxford_102_category_flower dataset initially from pr #45.

15898e0

pierrot0 added 2 commits June 5, 2023 09:23

remove oxford 102 category flower, I will send another PR for this one.

b01be25

only keep split definition at RecordSet level.

17fbe31

fix file names

01b7c8e

marcenacp reviewed Jun 5, 2023

datasets/coco2014/metadata.json (outdated, resolved)

josvandervelde approved these changes Jun 5, 2023

pierrot0 added 3 commits June 5, 2023 12:57

Merge remote-tracking branch 'origin/main' into pierrot0/issue44

a7547b8

define splits in separate csv file

6ea5b81

misc fixes

aa128ea

marcenacp reviewed Jun 6, 2023

datasets/coco2014/README.md (outdated, resolved)

marcenacp approved these changes Jun 6, 2023

pierrot0 added 2 commits June 6, 2023 14:37

fix caption field dataType (text not int)

41d37a2

remove README.md, place warning in dataset description

d2860a8

pierrot0 merged commit 4357ea3 into main Jun 6, 2023

github-actions bot locked and limited conversation to collaborators Jun 6, 2023

pierrot0 deleted the pierrot_ml_semantics branch June 6, 2023 20:35
