Allow manual configuration of Dataset Viewer for datasets not created with the `datasets` library #7315

diarray-hub · 2024-12-07T16:37:12Z

Problem Description

Currently, the Hugging Face Dataset Viewer automatically interprets dataset fields for datasets created with the datasets library. However, for datasets pushed directly via git, the Viewer:

- Defaults to generic columns like label with null values if no explicit mapping is provided.
- Does not allow dataset creators to configure field mappings or suppress default fields unless the dataset is recreated and pushed using the datasets library.

This creates a limitation for creators who:

- Use custom workflows to prepare datasets (e.g., manifest files with audio-transcription mappings).
- Push large datasets directly via git and cannot easily restructure them to conform to the datasets library format.

Proposed Solution

Introduce a feature that allows dataset creators to manually configure the Dataset Viewer behavior for datasets not created with the datasets library. This could be achieved by:

Using the YAML Metadata in README.md:

Add support for defining the dataset's field mappings directly in the README.md YAML section.

Example:

viewer:
  fields:
    - name: "audio"
      type: "audio_path"  # or "text"
      source: "manifest['audio']"
    - name: "bambara_transcription"
      type: "text"
      source: "manifest['bambara']"
    - name: "french_translation"
      type: "text"
      source: "manifest['french']"

Here, manifest is a CSV- or JSON-like file in the repository, so the viewer knows to look up the values of each field in that file.
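To make the proposal concrete, here is a minimal sketch of how such a mapping could be resolved against a JSON Lines manifest. It is only an illustration of the idea: the `viewer.fields` configuration is not an existing Hub feature, the helper names (`source_key`, `resolve_rows`) are hypothetical, and the manifest keys and sample values are placeholders.

```python
import json
import re

# Hypothetical field specs mirroring the YAML proposal (not a supported Hub feature).
FIELDS = [
    {"name": "audio", "type": "audio_path", "source": "manifest['audio']"},
    {"name": "bambara_transcription", "type": "text", "source": "manifest['bambara']"},
    {"name": "french_translation", "type": "text", "source": "manifest['french']"},
]

# Only the simple form manifest['key'] is understood in this sketch.
SOURCE_PATTERN = re.compile(r"^manifest\['([^']+)'\]$")


def source_key(source):
    """Extract the manifest key from an expression such as manifest['audio']."""
    match = SOURCE_PATTERN.match(source)
    if match is None:
        raise ValueError(f"unsupported source expression: {source!r}")
    return match.group(1)


def resolve_rows(manifest_lines, fields):
    """Build one viewer row per manifest entry, renaming keys as configured."""
    rows = []
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        # Missing keys become None instead of raising.
        rows.append({f["name"]: entry.get(source_key(f["source"])) for f in fields})
    return rows


# Placeholder manifest content, one JSON object per line.
manifest_lines = [
    '{"audio": "clips/0001.wav", "bambara": "bambara text 1", "french": "french text 1"}',
    '{"audio": "clips/0002.wav", "bambara": "bambara text 2"}',
]
rows = resolve_rows(manifest_lines, FIELDS)
```

Each resulting row uses the configured column names, and a key missing from a manifest entry shows up as a null value rather than an error.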

Benefits

- Improves flexibility for dataset creators who push datasets via git.
- Enhances dataset discoverability and usability on the Hugging Face Hub by allowing creators to present meaningful field mappings without restructuring their data.
- Reduces overhead for creators of large or complex datasets.

Examples of Use Case

An audio dataset with transcriptions in multiple languages stored in a manifest.json file, where the user wants the Viewer to:
- Display the audio column and explicitly map the features they defined, such as bambara_transcription and french_translation, from the manifest.


Wauplin · 2024-12-09T07:15:18Z

Hi @diarray-hub , thanks for opening the issue :) Let me ping @lhoestq and @severo from the dataset viewer team 🤗

diarray-hub · 2024-12-09T12:00:04Z

amazing :)

lhoestq · 2024-12-09T16:34:40Z

Hi! Why not modify the manifest.json file directly? That way, users see the dataset as is in the viewer, which makes it easier to use with e.g. the datasets library.

diarray-hub · 2024-12-10T09:46:31Z

Can I create and push the dataset with the datasets library while also pushing the dataset directory, maintaining its structure and all the files as with git?

Wauplin · 2024-12-10T10:43:05Z

(I transferred the issue to the datasets repo as it's not related to huggingface_hub)

lhoestq · 2024-12-10T10:58:02Z

> Can I create and push the dataset with the datasets library while also pushing the dataset directory, maintaining its structure and all the files as with git?

yes push_to_hub simply uploads Parquet files in a directory named "data" in the git repository

diarray-hub · 2024-12-10T11:09:22Z

That's the problem, actually. I need the data to stay in the same format, and the directories to keep the same structure, so that users of Nvidia's Nemo framework can start training without writing any preprocessing code. That's why I used git instead of push_to_hub, so that other Nemo users and I can just:

git clone
asr_model.setup_training_data(train_data_config={'manifest_filepath': training_manifest_filepath})

And start training right away. It may not be very kind of me to prioritize users of a specific framework, but I noticed that it takes much more code to convert a Hugging Face dataset with Parquet files to the Nemo manifest format than the other way around :haha:

lhoestq · 2024-12-10T11:29:26Z

Happy to help if you think the Nemo dataset format should be supported in datasets (and therefore in the HF Viewer that is based on datasets). Maybe the Nemo team could help as well

Though I'm not sure there is only one format: aren't there actually many formats/structures in Nemo, depending on the task?

diarray-hub · 2024-12-10T11:38:59Z

Yeah, you're right Quentin, it depends on the task. This one is for ASR. And yes, maybe they can help. I noticed that they already share their models through HF, so maybe someone on your team already has a contact there. Anyway, it's not really a big issue, since people can easily understand the dataset and its format from the dataset card, but it's a little annoying for those who want to visually explore each feature with the viewer, as with regular HF datasets.

lhoestq · 2024-12-10T13:41:08Z

In that case I'd recommend uploading the dataset in Nemo format and:

- adding the "nemo" tag
- explaining how to use the dataset in Nemo in the dataset README.md

The viewer is likely to show the audio content by default but without the transcriptions. You can also configure the viewer to show the transcriptions instead (without the audio).

diarray-hub · 2024-12-10T13:49:35Z

I already did, it's just a little bit "dommage" (hope you'll understand, you speak French, right? Because I don't know any English word for this) that I have to choose which one the viewer displays. But it's no problem for the usability of the dataset. Thanks Quentin 👍

lhoestq · 2024-12-10T14:12:57Z

It's "dommage" for now, but feel free to ping the Nemo people if you think there is room for making this better together :)

Kinda related, but the datasets AudioFolder structure looks similar and simply asks for a metadata.jsonl with a field named "file_name" to link the transcriptions to the audio files - you could also add this file to your repository to make the viewer show audio + transcripts.

Alternatively maybe we can expand the AudioFolder configuration to allow you to set the metadata file to be the "manifest.json" and the linking field to be "audio_file_name" (we just need to agree on something general - not just for Nemo)
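As a rough sketch of the metadata.jsonl route mentioned above, the script below derives AudioFolder-style metadata lines from a JSON Lines manifest. The `manifest_to_metadata` helper and the sample entry are hypothetical, and the `audio_filepath` key is an assumption about the manifest (pass a different `audio_key` if yours differs). The paths written to `file_name` are assumed to already be relative to the directory that will hold metadata.jsonl.

```python
import json


def manifest_to_metadata(manifest_lines, audio_key="audio_filepath"):
    """Turn manifest lines (one JSON object each) into metadata.jsonl lines.

    The audio path is moved to the "file_name" field; all other manifest
    fields (e.g. transcriptions) are copied unchanged.
    """
    out = []
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        record = {"file_name": entry[audio_key]}
        record.update({k: v for k, v in entry.items() if k != audio_key})
        out.append(json.dumps(record, ensure_ascii=False))
    return out


# Placeholder manifest entry; real manifests would be read from a file.
sample = ['{"audio_filepath": "clips/a.wav", "duration": 1.5, "text": "hello"}', ""]
metadata_lines = manifest_to_metadata(sample)
```

Writing `metadata_lines` to a metadata.jsonl file next to the audio leaves the original manifest and directory layout untouched, so the repository should remain usable from Nemo as before.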

diarray-hub · 2024-12-11T11:05:21Z

Right, actually that was my idea when I opened this issue. That's what I suggested, taking my case as an example, but you should think of a more general approach, like adding a field to the metadata (in the dataset card), or a config.yaml or JSON file, to configure the viewer as you wish. With a level of abstraction like the solution I proposed, or even higher, it would allow for more customizability :)

Wauplin transferred this issue from huggingface/huggingface_hub Dec 10, 2024
