When pulling a dataset from the Hub, it would be useful to have some metadata about the specific dataset and version being used. That metadata could then be passed to the Trainer and saved to a model card. This is useful for people who run many experiments on different versions (commits/branches) of the same dataset.
The dataset could carry a list of “source datasets” as metadata, ignoring whatever happens to them before they arrive in the Trainer (i.e. ignoring mapping, filtering, etc.).
Here is a basic representation (made by @lhoestq)
+1 on this idea. This could be a powerful way to better track the datasets used for model training, and it could help with automatic model card creation.
One possible way of doing this would be to store some, most, or all of the arguments passed to load_dataset whenever a Hub ID is passed, i.e. store the Hub ID, configuration, etc.
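To make that suggestion concrete, here is a minimal sketch of what storing the load arguments might look like. It does not call the `datasets` library at all: `SourceDataset`, `record_source`, and `to_model_card_metadata` are hypothetical names invented for illustration, and `"user/my-dataset"` is a placeholder Hub ID. The keyword names (`path`, `name`, `split`, `revision`) are chosen to mirror the arguments one would typically pass to load_dataset.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class SourceDataset:
    """Hypothetical record of where a dataset came from (not a real `datasets` class)."""
    hub_id: str
    config_name: Optional[str] = None
    split: Optional[str] = None
    revision: Optional[str] = None  # branch, tag, or commit hash


def record_source(path, name=None, split=None, revision=None):
    """Capture the arguments one would pass to load_dataset, without loading anything."""
    return SourceDataset(hub_id=path, config_name=name, split=split, revision=revision)


def to_model_card_metadata(sources):
    """Turn source records into a plain dict that could be written to a model card."""
    return {
        "source_datasets": [
            # Drop unset fields so the card only lists what was actually specified.
            {key: value for key, value in asdict(source).items() if value is not None}
            for source in sources
        ]
    }


# Placeholder Hub ID and arguments, for illustration only.
sources = [record_source("user/my-dataset", name="default", split="train", revision="main")]
metadata = to_model_card_metadata(sources)
print(metadata)
```

Because the record is created at load time and is immutable, it would stay the same regardless of any mapping or filtering applied afterwards, which matches the "ignore what happens before the Trainer" idea in the original request.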