Allow manual configuration of Dataset Viewer for datasets not created with the datasets library #7315

Open
diarray-hub opened this issue Dec 7, 2024 · 13 comments

Comments

@diarray-hub

Problem Description

Currently, the Hugging Face Dataset Viewer automatically interprets dataset fields for datasets created with the datasets library. However, for datasets pushed directly via git, the Viewer:

  • Defaults to generic columns like label with null values if no explicit mapping is provided.
  • Does not allow dataset creators to configure field mappings or suppress default fields unless the dataset is recreated and pushed using the datasets library.

This creates a limitation for creators who:

  • Use custom workflows to prepare datasets (e.g., manifest files with audio-transcription mappings).
  • Push large datasets directly via git and cannot easily restructure them to conform to the datasets library format.

Proposed Solution

Introduce a feature that allows dataset creators to manually configure the Dataset Viewer behavior for datasets not created with the datasets library. This could be achieved by:

  1. Using the YAML Metadata in README.md:
    • Add support for defining the dataset's field mappings directly in the README.md YAML section.

    • Example:

      viewer:
        fields:
          - name: "audio"
            type: "audio_path"  # or "text"
            source: "manifest['audio']"
          - name: "bambara_transcription"
            type: "text"
            source: "manifest['bambara']"
          - name: "french_translation"
            type: "text"
            source: "manifest['french']"

Here, manifest refers to a CSV- or JSON-like file in the repository, so that the viewer knows to look up the values of each field in that file.
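As a rough illustration of what the proposed mapping would ask of the viewer, here is a minimal Python sketch. It assumes the manifest is a JSON-lines file and that source only ever takes the form manifest['key']. FIELDS, source_key and resolve_rows are hypothetical names for this example, not part of the Hub or the datasets library.

```python
import json
import re

# Hypothetical field spec mirroring the proposed `viewer.fields` YAML.
# This is NOT an existing Hub feature; it only illustrates the proposal.
FIELDS = [
    {"name": "audio", "type": "audio_path", "source": "manifest['audio']"},
    {"name": "bambara_transcription", "type": "text", "source": "manifest['bambara']"},
    {"name": "french_translation", "type": "text", "source": "manifest['french']"},
]

# Assumption: a source expression is always of the form manifest['<key>'].
_SOURCE_RE = re.compile(r"^manifest\['([^']+)'\]$")


def source_key(source):
    """Extract the manifest key from a string such as "manifest['audio']"."""
    match = _SOURCE_RE.match(source)
    if match is None:
        raise ValueError(f"unsupported source expression: {source!r}")
    return match.group(1)


def resolve_rows(manifest_lines, fields=FIELDS):
    """Yield one viewer row per manifest record, renamed per the field spec.

    Keys missing from a record become None instead of raising.
    """
    keys = [(field["name"], source_key(field["source"])) for field in fields]
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the manifest
        record = json.loads(line)
        yield {name: record.get(key) for name, key in keys}
```

Calling list(resolve_rows(open("manifest.json", encoding="utf-8"))) would then give the rows the viewer could display, one dictionary per manifest entry.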

Benefits

  • Improves flexibility for dataset creators who push datasets via git.
  • Enhances dataset discoverability and usability on the Hugging Face Hub by allowing creators to present meaningful field mappings without restructuring their data.
  • Reduces overhead for creators of large or complex datasets.

Examples of Use Case

  • An audio dataset with transcriptions in multiple languages stored in a manifest.json file, where the user wants the Viewer to:
    • Display the audio column and explicitly map the features they defined, such as bambara_transcription and french_translation, from the manifest.
@Wauplin
Contributor

Wauplin commented Dec 9, 2024

Hi @diarray-hub, thanks for opening the issue :) Let me ping @lhoestq and @severo from the dataset viewer team 🤗

@diarray-hub
Author

amazing :)

@lhoestq
Member

lhoestq commented Dec 9, 2024

Hi! Why not modify the manifest.json file directly? That way users see the dataset as is in the viewer, which makes it easier to use with e.g. the datasets library.

@diarray-hub
Author

diarray-hub commented Dec 10, 2024

Can I create and push the dataset with the datasets library while also pushing the dataset directory, maintaining its structure and all the files as with git?

Wauplin transferred this issue from huggingface/huggingface_hub Dec 10, 2024
@Wauplin
Contributor

Wauplin commented Dec 10, 2024

(I transferred the issue to the datasets repo as it's not related to huggingface_hub)

@lhoestq
Member

lhoestq commented Dec 10, 2024

Can I create and push the dataset with the datasets library while also pushing the dataset directory, maintaining its structure and all the files as with git?

Yes, push_to_hub simply uploads Parquet files to a directory named "data" in the git repository.

@diarray-hub
Author

That's the problem, actually. I need the data to stay in the same format, and the directories to keep the same structure, so that users of Nvidia's Nemo framework can start training quickly without writing any preprocessing code. That's why I used git instead of push_to_hub, so that I and other users working with Nemo can just:

  1. git clone
  2. asr_model.setup_training_data(train_data_config={'manifest_filepath': training_manifest_filepath})

And start training right away. It may not be very kind of me to prioritize users of a specific framework, but I noticed that it takes much more code to convert a Hugging Face dataset stored as Parquet files to the Nemo manifest format than the inverse :haha:

@lhoestq
Member

lhoestq commented Dec 10, 2024

Happy to help if you think the Nemo dataset format should be supported in datasets (and therefore in the HF Viewer that is based on datasets). Maybe the Nemo team could help as well

Though I'm not sure there is only one: aren't there actually many formats/structures in Nemo, depending on the task?

@diarray-hub
Author

Yeah, you're right Quentin, it depends on the task. This one is for ASR. And yes, maybe they can help. I noticed that they already share their models through HF, so maybe someone on your team already has a contact there. Anyway, it's not really a big issue, since people can easily understand the dataset and its format from the dataset card, but it's a little annoying for those who want to visually explore each feature with the viewer, as with regular HF datasets.

@lhoestq
Member

lhoestq commented Dec 10, 2024

In that case I'd recommend uploading the dataset in Nemo format and:

  1. add the "nemo" tag
  2. add how to use the dataset in Nemo in the dataset README.md

The viewer is likely to show the audio content by default but without the transcriptions. You can also configure the viewer to show the transcriptions instead (without the audio).

@diarray-hub
Author

I already did. It's just a little bit "dommage" (hope you'll understand, you speak French, right? I don't know any English word for this) that I have to choose which one the viewer displays. But it's no problem for the usability of the dataset. Thanks Quentin 👍

@lhoestq
Member

lhoestq commented Dec 10, 2024

It's "dommage" for now, but feel free to ping the Nemo people if you think there is room for making this better together :)

Kinda related, but the datasets AudioFolder structure looks similar and simply asks for a metadata.jsonl with a field named "file_name" to link the transcriptions to the audio files - you could also add this file to your repository to make the viewer show audio + transcripts.

Alternatively maybe we can expand the AudioFolder configuration to allow you to set the metadata file to be the "manifest.json" and the linking field to be "audio_file_name" (we just need to agree on something general - not just for Nemo)
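The metadata.jsonl suggestion above could be sketched roughly as follows, using only the Python standard library. The sketch assumes the manifest is a JSON-lines file, that the audio path is stored under a key named "audio" (taken from the example in this issue, so adjust it to the real manifest), and that those paths are relative to the directory where metadata.jsonl is written. manifest_to_metadata is a hypothetical helper name.

```python
import json
from pathlib import Path


def manifest_to_metadata(manifest_path, metadata_path, audio_key="audio"):
    """Write an AudioFolder-style metadata.jsonl from a JSON-lines manifest.

    The value under `audio_key` is moved to "file_name"; every other key is
    kept unchanged as an extra column. Returns the number of rows written.
    Assumption: audio paths are relative to the metadata file's directory.
    """
    manifest_path = Path(manifest_path)
    metadata_path = Path(metadata_path)
    count = 0
    with manifest_path.open(encoding="utf-8") as src, \
            metadata_path.open("w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            # "file_name" is the linking field AudioFolder looks for.
            row = {"file_name": record.pop(audio_key)}
            row.update(record)
            dst.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count
```

With this approach the original manifest.json stays untouched for Nemo users, and the extra metadata.jsonl is only there for the viewer.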

@diarray-hub
Author

Right, actually that was my idea when I opened this issue. That's what I suggested, taking my case as an example, but you could think of a more general approach, like adding a field in the metadata (in the dataset card), or a config.yaml or JSON file, to configure the viewer as you wish. With a level of abstraction like the solution I proposed, or even higher, it would allow for more customizability :)


3 participants