timit_asr dataset only includes one text phrase #2913

margotwagner · 2021-09-14T21:06:07Z

Describe the bug

The dataset 'timit_asr' contains only one text phrase. Every example has the transcription "Would such an act of refusal be useful?" instead of a different phrase per recording.

Steps to reproduce the bug

Note: I am following the tutorial https://huggingface.co/blog/fine-tune-wav2vec2-english

Install the dataset and other packages

!pip install datasets>=1.5.0
!pip install transformers==4.4.0
!pip install soundfile
!pip install jiwer

Load the dataset

from datasets import load_dataset, load_metric

timit = load_dataset("timit_asr")

Remove columns that we don't want

timit = timit.remove_columns(["phonetic_detail", "word_detail", "dialect_region", "id", "sentence_type", "speaker_id"])

Write a short function to display some random samples of the dataset.

from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw num_examples distinct indices at random.
    picks = random.sample(range(len(dataset)), num_examples)
    
    df = pd.DataFrame(dataset[picks])
    display(HTML(df.to_html()))

show_random_elements(timit["train"].remove_columns(["file"]))

Expected results

10 different, randomly chosen transcription phrases.

Actual results

The same transcription phrase 10 times: "Would such an act of refusal be useful?"
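
A quick way to confirm the symptom beyond a 10-row sample is to count distinct transcriptions over a whole split. The following is a minimal sketch, not part of the original report: the helper name `summarize_transcriptions` is hypothetical, and the sample list is synthetic stand-in data.

```python
from collections import Counter

def summarize_transcriptions(texts):
    """Return (number of distinct phrases, most common phrase, its count)."""
    counts = Counter(texts)
    if not counts:
        return 0, None, 0
    phrase, freq = counts.most_common(1)[0]
    return len(counts), phrase, freq

# Synthetic stand-in for the affected column (hypothetical data, not real output).
sample = ["Would such an act of refusal be useful?"] * 10
distinct, phrase, freq = summarize_transcriptions(sample)
print(f"{distinct} distinct phrase(s); most common appears {freq} times")
```

With the loaded dataset this could be called as summarize_transcriptions(timit["train"]["text"]), assuming the transcription column is named "text" as in the tutorial. A result of 1 distinct phrase would match the behaviour described here.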

Environment info

datasets version: 1.4.1
Platform: macOS-10.15.7-x86_64-i386-64bit
Python version: 3.8.5
PyArrow version: not listed


bhavitvyamalik · 2021-09-15T06:09:24Z

Hi @margotwagner,
This bug was fixed in #1995. Upgrading datasets should resolve it (ideally to v1.8.0 or later).

albertvillanova · 2021-09-15T08:05:18Z

Hi @margotwagner,

Yes, as @bhavitvyamalik has commented, this bug was fixed in datasets version 1.5.0. You need to update it, as your current version is 1.4.1:

Environment info

datasets version: 1.4.1
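
To check which version is actually installed before re-running the tutorial, a small sketch along these lines may help. The helpers `parse_version` and `is_at_least` are hypothetical names written for illustration, and the 1.5.0 threshold is taken from the comment above.

```python
from importlib.metadata import PackageNotFoundError, version

def parse_version(text):
    """Turn '1.4.1' into (1, 4, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)

def is_at_least(installed, minimum):
    """Compare dotted versions numerically, padding the shorter one with zeros.

    Simplification: suffixes such as 'dev0' or 'rc1' are ignored.
    """
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b

try:
    installed = version("datasets")
    status = "ok" if is_at_least(installed, "1.5.0") else "too old, upgrade"
    print(f"datasets {installed}: {status}")
except PackageNotFoundError:
    print("datasets is not installed")
```

When upgrading from a notebook cell, quoting the requirement, as in !pip install "datasets>=1.5.0", keeps the shell from treating the unquoted > as output redirection, which would otherwise drop the version constraint.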

margotwagner added the bug label Sep 14, 2021

albertvillanova closed this as completed Sep 15, 2021
