Allow manual configuration of Dataset Viewer for datasets not created with the datasets library #7315
Comments
Hi @diarray-hub , thanks for opening the issue :) Let me ping @lhoestq and @severo from the dataset viewer team 🤗 |
amazing :) |
Hi! Why not modify the manifest.json file directly? This way users see the dataset as is in the viewer, which makes it easier to use, e.g. using the |
Can I create and push the dataset with the datasets library while also pushing the dataset directory, maintaining its structure and all the files as with git? |
(I transferred the issue to the |
yes push_to_hub simply uploads Parquet files in a directory named "data" in the git repository |
That's the problem actually: I need the data to stay in the same format, and the directories to keep the same structure, so that Nemo training can start quickly and users of Nvidia's Nemo framework don't need to write any preprocessing code first. That's why I used git instead of push_to_hub, so that other Nemo users and I can just:
And start training right away. It may not be very kind of me to prioritize users of a specific framework, but I noticed that it takes much more code to convert a Hugging Face dataset stored as Parquet files to the Nemo manifest format than the inverse :haha: |
Happy to help if you think the Nemo dataset format should be supported in Though I'm not sure whether there is only one format, or actually many formats/structures in Nemo depending on the task? |
Yeah, you're right Quentin, it depends on the task. This one is for ASR. And yes, maybe they can help. I noticed that they already share their models through HF. Maybe someone on your team already has a contact point there. Anyway, it's not really a big issue since people can easily understand the dataset and its format from the dataset card, but it's a little annoying for those who wanna visually explore each feature with the viewer, as with regular HF datasets |
In that case I'd recommend you to upload the dataset in Nemo format and
The viewer is likely to show the audio content by default but without the transcriptions. You can also configure the viewer to show the transcriptions instead (without the audio). |
I already did, it's just a little bit "dommage" (hope you'll understand, you speak French, right? Cause I don't know any English word for this) that I have to choose which one the viewer displays. But it's no problem for the usability of the dataset. Thanks Quentin 👍 |
It's "dommage" for now, but feel free to ping the Nemo people if you think there is room for making this better together :) Kinda related, but the Alternatively maybe we can expand the AudioFolder configuration to allow you to set the metadata file to be the "manifest.json" and the linking field to be "audio_file_name" (we just need to agree on something general - not just for Nemo) |
Right, actually that was my idea when I opened this issue. That's what I suggested, taking my case as an example, but you should think of a more general approach, like adding a field to configure the viewer as you wish in the metadata (in the dataset card) or in a config.yaml or JSON file. With a level of abstraction like the solution I proposed, or an even higher one, it would allow for more customizability :) |
Problem Description
Currently, the Hugging Face Dataset Viewer automatically interprets dataset fields for datasets created with the datasets library. However, for datasets pushed directly via git, the Viewer:
- [...] label with null values if no explicit mapping is provided.
- [...] datasets library.
This creates a limitation for creators who:
- [...] git and cannot easily restructure them to conform to the datasets library format.
Proposed Solution
Introduce a feature that allows dataset creators to manually configure the Dataset Viewer behavior for datasets not created with the datasets library. This could be achieved by:
- [...] README.md: Add support for defining the dataset's field mappings directly in the README.md YAML section.
Example:
With manifest being a CSV- or JSON-like file in the repository, so that the viewer understands that it should look up the values of each field in that file.
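To make the idea concrete, here is a minimal Python sketch of how a viewer could resolve such a mapping against a manifest. Everything in it is an assumption for illustration: the mapping layout, the resolve_rows helper, the audio_file_name key, the one-JSON-object-per-line manifest layout, and the sample entries. None of it is an existing datasets or Dataset Viewer API.

```python
import json

# Hypothetical mapping: viewer column name -> key in each manifest entry.
# This layout is an assumption, not an existing Dataset Viewer option.
FIELD_MAPPING = {
    "audio": "audio_file_name",
    "bambara_transcription": "bambara_transcription",
    "french_translation": "french_translation",
}


def resolve_rows(manifest_text, mapping):
    """Build viewer rows from a JSON-lines manifest using the mapping."""
    rows = []
    for line in manifest_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        # Keys absent from an entry become None (shown as null in a viewer);
        # manifest keys that are not mapped are simply not displayed.
        rows.append({column: entry.get(key) for column, key in mapping.items()})
    return rows


# Made-up sample entries, only to exercise the sketch.
SAMPLE_MANIFEST = "\n".join(
    json.dumps(entry)
    for entry in [
        {
            "audio_file_name": "clips/0001.wav",
            "bambara_transcription": "sample bam 1",
            "french_translation": "sample fr 1",
            "duration": 2.5,
        },
        {"audio_file_name": "clips/0002.wav", "bambara_transcription": "sample bam 2"},
    ]
)

rows = resolve_rows(SAMPLE_MANIFEST, FIELD_MAPPING)
for row in rows:
    print(row)
```

Only the mapped columns reach the viewer rows, so default columns with no source in the manifest never appear.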
Benefits
- [...] git.
Examples of Use Case
- [...] manifest.json file, where the user wants the Viewer to:
  - [...] audio column and explicitly map features that he defined, such as bambara_transcription and french_translation, from the manifest.