timit_asr dataset only includes one text phrase #2913

Closed
margotwagner opened this issue Sep 14, 2021 · 2 comments
Labels
bug Something isn't working

Comments

@margotwagner

Describe the bug

The 'timit_asr' dataset appears to contain only one text phrase: the transcription "Would such an act of refusal be useful?" is repeated across examples instead of each example having its own phrase.

Steps to reproduce the bug

Note: I am following the tutorial https://huggingface.co/blog/fine-tune-wav2vec2-english

  1. Install the datasets library and other packages
!pip install datasets>=1.5.0
!pip install transformers==4.4.0
!pip install soundfile
!pip install jiwer
  2. Load the dataset
from datasets import load_dataset, load_metric

timit = load_dataset("timit_asr")
  3. Remove the columns that we don't want
timit = timit.remove_columns(["phonetic_detail", "word_detail", "dialect_region", "id", "sentence_type", "speaker_id"])
  4. Write a short function to display some random samples of the dataset.
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    display(HTML(df.to_html()))

show_random_elements(timit["train"].remove_columns(["file"]))
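
The retry loop in `show_random_elements` draws indices until it has enough distinct ones. The same selection can be sketched with `random.sample`, which returns distinct items directly. `pick_distinct_indices` and its `seed` parameter are hypothetical names used for illustration; they are not part of the tutorial or of `datasets`.

```python
import random

def pick_distinct_indices(n_items, num_examples=10, seed=None):
    """Return num_examples distinct indices in the range [0, n_items)."""
    if num_examples > n_items:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # a seed makes the picks reproducible
    return rng.sample(range(n_items), num_examples)

# Stand-in size; with the real dataset one would pass len(timit["train"]).
picks = pick_distinct_indices(100, num_examples=10, seed=0)
print(len(picks), len(set(picks)))
```

Because the indices are distinct by construction, identical rows in the displayed table cannot be explained by the same index being picked twice.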

Expected results

10 different, randomly chosen transcription phrases.

Actual results

10 of the same transcription phrase "Would such an act of refusal be useful?"
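
A way to confirm the symptom without relying on random sampling is to count the distinct transcriptions. This is a minimal sketch: `summarize_transcriptions` is a hypothetical helper, and the `text` column name mentioned in the comment is an assumption based on the tutorial.

```python
from collections import Counter

def summarize_transcriptions(texts):
    """Return (row count, distinct transcription count, most common transcription)."""
    counts = Counter(texts)
    most_common = counts.most_common(1)[0][0] if counts else None
    return len(texts), len(counts), most_common

# Stand-in data mimicking the reported behavior; with the real dataset one
# would pass timit["train"]["text"] instead (column name assumed).
sample = ["Would such an act of refusal be useful?"] * 10
total, distinct, top = summarize_transcriptions(sample)
print(f"{total} rows, {distinct} distinct transcription(s)")
```

A distinct count of 1 over the whole split would indicate that the loaded data itself is affected, not just the random picks.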

Environment info

  • datasets version: 1.4.1
  • Platform: macOS-10.15.7-x86_64-i386-64bit
  • Python version: 3.8.5
  • PyArrow version: not listed
@margotwagner added the bug label on Sep 14, 2021
@bhavitvyamalik
Contributor

bhavitvyamalik commented Sep 15, 2021

Hi @margotwagner,
This bug was fixed in #1995. Upgrading datasets should resolve it (ideally to v1.8.0 or later).

@albertvillanova
Member

Hi @margotwagner,

Yes, as @bhavitvyamalik has commented, this bug was fixed in datasets version 1.5.0. You need to update it, as your current version is 1.4.1:

Environment info

  • datasets version: 1.4.1
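
To check which version is actually installed before and after upgrading, something like the following can be run in the notebook. It is a sketch under stated assumptions: `version_tuple` and `is_at_least` are hypothetical helpers that compare only the leading numeric components, and the 1.5.0 threshold is the one quoted in the comment above.

```python
from importlib.metadata import PackageNotFoundError, version

def version_tuple(v):
    """Parse the leading numeric components, e.g. '1.4.1' -> (1, 4, 1)."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at suffixes such as 'dev0'
        parts.append(int(digits))
    # Pad to three components so that '1.5' compares equal to '1.5.0'.
    return tuple(parts + [0] * (3 - len(parts)))

def is_at_least(installed, minimum):
    """True if the installed version is the minimum version or newer."""
    return version_tuple(installed) >= version_tuple(minimum)

try:
    installed = version("datasets")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("datasets is not installed")
elif is_at_least(installed, "1.5.0"):
    print(f"datasets {installed} is at or above 1.5.0")
else:
    print(f"datasets {installed} predates 1.5.0; an upgrade is needed")
```

If an upgrade is needed, `pip install --upgrade "datasets>=1.5.0"` is one way to do it. The quotes matter: a shell treats an unquoted `>` as output redirection, so `pip install datasets>=1.5.0` runs `pip install datasets` without any version constraint and writes its output to a file named `=1.5.0`. After upgrading, restart the notebook kernel so the new version is the one imported.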
