Tabular data to Croissant #664

sadda · 2024-05-28T15:04:07Z

Hello, I wanted to ask whether there is a simple tutorial on how to convert tabular metadata (CSV format) into the Croissant format. Each row in the metadata corresponds to one image depicting exactly one animal. The information provided for each image is its path, the identity of the depicted animal (the class in the context of ML), the split (train/test) and additional information such as date, species or animal orientation (left, top, ...). I would also need the order of the records in the Croissant metadata to stay the same as in the original CSV file.

Thanks a lot,

Lukas

pierrot0 (Maintainer) · 2024-05-29T14:43:55Z

Hi Lukas,

https://github.com/mlcommons/croissant/blob/main/datasets/1.0/pass-mini/metadata.json is a similar dataset: JPEGs from two tar files are joined with a CSV containing other information. There is no split in that dataset, but a split is just a semantic type, as defined in https://github.com/mlcommons/croissant/blob/main/datasets/1.0/recipes/simple-split.json.
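
For a CSV-only layout like the one in the question, the CSV part of such a file can be sketched as below. This is a minimal sketch under stated assumptions, not the code that produced the pass-mini metadata.json: the column names, the dataset name, the `images` record set id and the helper `csv_to_croissant_skeleton` are all hypothetical, the `@context` is abbreviated, and the image files themselves (a file set, as in pass-mini) are left out.

```python
import csv
import io
import json

# Hypothetical sample data; column names follow the question (path, identity, split, ...).
SAMPLE_CSV = """path,identity,split,date
images/a/001.jpg,lion_01,train,2021-05-01
images/b/002.jpg,lion_02,test,2021-05-03
"""


def csv_to_croissant_skeleton(csv_text, dataset_name, csv_name="metadata.csv"):
    """Build a partial Croissant 1.0 description with one field per CSV column."""
    header = next(csv.reader(io.StringIO(csv_text)))
    fields = [
        {
            "@type": "cr:Field",
            "@id": f"images/{column}",
            "name": column,
            # Every column is declared as text here; refine dataType by hand
            # (for example sc:Date for dates) once the skeleton exists.
            "dataType": "sc:Text",
            "source": {
                "fileObject": {"@id": csv_name},
                "extract": {"column": column},
            },
        }
        for column in header
    ]
    return {
        # Abbreviated on purpose: copy the full @context from the Croissant
        # spec or from an existing metadata.json before using the result.
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        "@type": "sc:Dataset",
        "name": dataset_name,
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": csv_name,
                "name": csv_name,
                "contentUrl": csv_name,
                "encodingFormat": "text/csv",
                # A real file also needs a checksum (sha256) for the CSV.
            }
        ],
        "recordSet": [
            {"@type": "cr:RecordSet", "@id": "images", "name": "images", "field": fields}
        ],
    }


metadata = csv_to_croissant_skeleton(SAMPLE_CSV, "animal-reid-example")
print(json.dumps(metadata, indent=2))
```

The result is only a starting point: it should be completed and then checked with the Croissant tooling before being published.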

The order in which the data examples are yielded is implementation-dependent, as the Croissant spec doesn't specify anything at the moment. It's probably either the order of the examples in the JPEG folder (or archive) or the order of the CSV. Defining the order could become a feature of the format. Do you want to open a feature request and give a few examples of when that would be needed?
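
Until the format defines an order, one possible workaround is to store the row position as an explicit column and sort on it after loading. This is an assumption on top of the thread, not something the spec or the maintainers prescribe; the column name `row_index` and both helper functions are hypothetical.

```python
import csv
import io

# Hypothetical sample data in the order that should be preserved.
SAMPLE_CSV = """path,identity,split
images/b/002.jpg,lion_02,test
images/a/001.jpg,lion_01,train
images/c/003.jpg,lion_03,train
"""


def add_row_index(csv_text, index_column="row_index"):
    """Return the CSV text with the original row position as a first column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = [index_column] + list(reader.fieldnames or [])
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for position, row in enumerate(reader):
        writer.writerow({index_column: position, **row})
    return out.getvalue()


def restore_order(records, index_column="row_index"):
    """Sort loaded records back into the original CSV order."""
    return sorted(records, key=lambda record: int(record[index_column]))


indexed_csv = add_row_index(SAMPLE_CSV)

# Simulate a loader that yields the records in a different order.
shuffled = list(csv.DictReader(io.StringIO(indexed_csv)))[::-1]
restored = restore_order(shuffled)
print([record["path"] for record in restored])
```

The index column then has to be declared as a field in the Croissant metadata like any other CSV column.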

sadda (Author) · 2024-06-03T06:56:00Z

Hi Pierre,

thanks a lot for the link to the pass-mini dataset. Is there some code that generated the metadata.json file? Also, the dataset seems to be a bit different from what I have, as its files are stored in two tar files and not in folders with a hierarchical structure.

To be honest, Croissant is now required for the NeurIPS 2024 Datasets and Benchmarks Track and I am struggling to understand it :(

Best,

Lukas

benjelloun (Maintainer) · 2024-06-03T09:02:44Z

Hi Lukas,

You can use the Croissant editor to create your JSON file:

https://huggingface.co/spaces/MLCommons/croissant-editor
Or you can build it locally from https://github.com/mlcommons/croissant/tree/main/editor

If you upload your files on the "Resources" tab, the editor will try to infer the corresponding Croissant definitions, but you may need to correct them.
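
One way to spot definitions that need correcting is to compare the CSV header with the columns that the generated JSON actually declares. The check below is a minimal sketch: the sample CSV, the `INFERRED_JSON` fragment and the helper `undeclared_columns` are hypothetical, and it assumes that `recordSet` and `field` are lists and that fields point at columns through `source.extract.column`.

```python
import csv
import io
import json

# Hypothetical sample data and a hypothetical fragment of an inferred Croissant file.
SAMPLE_CSV = "path,identity,split\nimages/a/001.jpg,lion_01,train\n"
INFERRED_JSON = """
{
  "recordSet": [
    {
      "@type": "cr:RecordSet",
      "name": "images",
      "field": [
        {"name": "path", "source": {"extract": {"column": "path"}}},
        {"name": "identity", "source": {"extract": {"column": "identity"}}}
      ]
    }
  ]
}
"""


def undeclared_columns(csv_text, croissant):
    """List CSV columns that no field in the Croissant metadata refers to."""
    header = next(csv.reader(io.StringIO(csv_text)))
    declared = set()
    for record_set in croissant.get("recordSet", []):
        for field in record_set.get("field", []):
            column = field.get("source", {}).get("extract", {}).get("column")
            if column is not None:
                declared.add(column)
    # Keep the CSV column order so the report is easy to compare by eye.
    return [column for column in header if column not in declared]


missing = undeclared_columns(SAMPLE_CSV, json.loads(INFERRED_JSON))
print(missing)  # ['split']
```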

Please let us know how it goes.

Best,
Omar

sadda (Author) · 2024-06-07T06:04:26Z

Hi Omar,

thanks a lot for the reply, it was very helpful. After checking the editor, I realized that if I upload the dataset to Kaggle, I can download the metadata directly from there :) When I went back to your GitHub readme, I realized that you hint at this in the "Integrations" section. I would suggest stressing this much more, so that people who are not so skilled with technology, like me, can realize that the Croissant format is automatically generated by Kaggle and HuggingFace (and possibly others) and users do not need to handle it manually :)

Thanks a lot once again and keep up the good work :)

Lukas
