dataset metadata for reproducibility #4129

nbroad1881 · 2022-04-08T14:17:28Z

When pulling a dataset from the Hub, it would be useful to have metadata recording exactly which dataset and version were used. That metadata could be passed to the Trainer and then saved to a model card. This would help people who run many experiments on different versions (commits/branches) of the same dataset.

The dataset could carry a list of “source datasets” as metadata. This list would ignore whatever happens to the data before it reaches the Trainer (i.e. mapping, filtering, etc.).

Here is a basic representation (made by @lhoestq):

>>> from datasets import load_dataset
>>> 
>>> my_dataset = load_dataset(...)["train"]
>>> my_dataset = my_dataset.map(...)
>>> 
>>> my_dataset.sources
[HFHubDataset(repo_id=..., revision=..., arguments={...})]
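To make the "ignore mapping, filtering, etc." part concrete, here is a minimal sketch in plain Python. Both `HFHubDataset` (borrowed from the mock-up above) and `TrackedDataset` are hypothetical stand-ins, not classes that exist in `datasets`; the only point is that transformations return new data while the `sources` list is carried over unchanged.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class HFHubDataset:
    """Hypothetical source record; not part of the real `datasets` API."""
    repo_id: str
    revision: str
    arguments: Dict[str, Any] = field(default_factory=dict)


class TrackedDataset:
    """Toy stand-in for a dataset: a list of rows plus its source records."""

    def __init__(self, rows: List[dict], sources: List[HFHubDataset]):
        self.rows = rows
        self.sources = sources

    def map(self, fn: Callable[[dict], dict]) -> "TrackedDataset":
        # The rows change, but the sources are copied over untouched.
        return TrackedDataset([fn(row) for row in self.rows], list(self.sources))

    def filter(self, predicate: Callable[[dict], bool]) -> "TrackedDataset":
        # Dropping rows does not change where the data came from.
        kept = [row for row in self.rows if predicate(row)]
        return TrackedDataset(kept, list(self.sources))
```

With this shape, a Trainer-like consumer would only need to read `sources` from whatever dataset object it receives, regardless of how many processing steps came before.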


davanstrien · 2023-09-29T09:23:55Z

+1 on this idea. This could be a powerful way to track the datasets used for model training, and it would help with automatic model card creation.

One possible way of doing this would be to store some, most, or all of the arguments passed to load_dataset when a Hub ID is passed, i.e. store the Hub ID, configuration, etc.
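A rough sketch of what storing those arguments might look like, written as a wrapper so it runs without any library changes. The helper name `load_with_provenance` and the shape of the returned record are assumptions made for illustration; `path`, `name`, and `revision` mirror the parameters of `load_dataset`, and the loader is passed in so the idea can be tried with a stub.

```python
from typing import Any, Callable, Dict, Tuple


def load_with_provenance(
    loader: Callable[..., Any], path: str, **kwargs: Any
) -> Tuple[Any, Dict[str, Any]]:
    """Call `loader(path, **kwargs)` and return the result plus a record of the call.

    Hypothetical helper: `loader` would typically be `datasets.load_dataset`.
    """
    record = {
        "path": path,                        # Hub ID (or local path) as given
        "name": kwargs.get("name"),          # configuration name, if any
        "revision": kwargs.get("revision"),  # branch, tag, or commit, if any
        "arguments": dict(kwargs),           # everything else, kept verbatim
    }
    return loader(path, **kwargs), record
```

Note that if no `revision` is passed, the record holds `None` and so does not pin an exact commit; resolving the commit actually used would need an extra lookup that this sketch leaves out.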

cc @tomaarsen

nbroad1881 added the "enhancement" (New feature or request) label Apr 8, 2022

nbroad1881 mentioned this issue Apr 8, 2022

add dataset metadata to model card huggingface/transformers#16665

Closed

davanstrien mentioned this issue May 1, 2022

Track metadata davanstrien/hugit-cli#29

Open

lhoestq mentioned this issue Sep 29, 2023

Add repo_id to DatasetInfo #6268

Draft

2 tasks
