Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prepare timit manifests #324

Merged
merged 13 commits into from
Jul 19, 2021
Merged

prepare timit manifests #324

merged 13 commits into from
Jul 19, 2021

Conversation

luomingshuang
Copy link
Contributor

@luomingshuang luomingshuang commented Jun 29, 2021

In timit.py, I use phonemes as the supervisions.


def download_and_unzip(
target_dir: Pathlike = '.',
force_download: Optional[bool] = False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are force_download and base_url Optional?
What if Optional is removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the timit zip exits, we needn't to download. Optional in here means that we can download or not download. And the base_url is not unique.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional here means you can pass None to the argument. Obviously, base_url cannot be None.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the types should be plain bool and str

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
force_download: Optional[bool] = False,
force_download: bool = False,

force_download: Optional[bool] = False,
base_url: Optional[str] = 'https://data.deepai.org/timit.zip') -> None:
"""
Download and unzip the dataset, supporting both TIMIT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does supporting both TIMIT mean? Is there something missing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Em, yes, I am rewriting.

output_dir: Optional[Pathlike] = None,
num_jobs: int = 1
) -> Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]:
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some documentation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will do it.

assert corpus_dir.is_dir(), f'No such directory: {corpus_dir}'

splits_dir = Path(splits_dir)
assert corpus_dir.is_dir(), f'No such directory: {splits_dir}'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert corpus_dir.is_dir(), f'No such directory: {splits_dir}'
assert splits_dir.is_dir(), f'No such directory: {splits_dir}'

from collections import defaultdict

import os
import glob
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove packages that are not used.


with ThreadPoolExecutor(num_jobs) as ex:
for part in dataset_parts:
wav_files = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
wav_files = []

with ThreadPoolExecutor(num_jobs) as ex:
for part in dataset_parts:
wav_files = []
file_name = ''
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
file_name = ''

Copy link
Collaborator

@pzelasko pzelasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Left some comments. Can you also update the table in docs/corpus.rst and add a CLI mode for TIMIT (see lhotse/bin/modes/recipes for a lot of examples)

from lhotse.utils import Pathlike, urlretrieve_progress


def download_and_unzip(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to name it “download_timit”

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool!


def download_and_unzip(
target_dir: Pathlike = '.',
force_download: Optional[bool] = False,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the types should be plain bool and str

file_name = ''

if part == 'TRAIN':
file_name = os.path.join(splits_dir, 'train_samples.txt')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can replace all os.path.join(a, b) with: a / b (this is possible because we use Path objects)

items = wav_file.split('/')
idx = items[-2] + '-' + items[-1][:-4]
speaker = items[-2]
transcript_file = wav_file[:-3] + 'PHN' ###the phone file
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wav_file.with_suffix(‘.PHN’)

lhotse/recipes/timit.py Outdated Show resolved Hide resolved
output_dir = Path(output_dir)
output_dir.mkdir(parents=True, exist_ok=True)

manifests = defaultdict(dict)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does a plain dict work here?
Any reason to use a defaultdict?

wav_files = []
with open(file_name, 'r') as f:
lines = f.readlines()
for line in lines:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for line in f

No need to read all lines at once.

@@ -98,8 +98,8 @@ def prepare_timit(
for wav_file in tqdm(wav_files):
items = wav_file.split('/')
idx = items[-2] + '-' + items[-1][:-4]
speaker = items[-2]
transcript_file = wav_file[:-3] + 'PHN' ###the phone file
speaker = items[-2]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove ALL leading and trailing spaces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do it.

"""
target_dir = Path(target_dir)
target_dir.mkdir(parents=True, exist_ok=True)
tar_name = f'timit.zip'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe f-string is not necessarily needed here.

- tar_name = f'timit.zip'
+ tar_name = 'timit.zip'

file_name = ''

if part == 'TRAIN':
file_name = splits_dir/'train_samples.txt'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to add a space before and after opterator "/".

-file_name = splits_dir/'train_samples.txt'
+ file_name = splits_dir / 'train_samples.txt'

Copy link
Contributor

@glynpu glynpu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@danpovey
Copy link
Collaborator

Sometimes TIMIT has logic for using a 48-phone version or some other size of phone set.
Does it make sense to include that logic here?

@jtrmal
Copy link
Collaborator

jtrmal commented Jun 30, 2021 via email

@pzelasko
Copy link
Collaborator

Let me know when it is ready to merge

@luomingshuang
Copy link
Contributor Author

Yes,here, maybe I should add three options(60, 48, 39) for choosing the number of phonemes. The attach phones_60_48_39.txt is the same as the phones.60-48-39.map in kaldi.
phones_60_48_39.txt

@luomingshuang
Copy link
Contributor Author

I will set 48-phone version as the default.

@pzelasko pzelasko added this to the v0.8 milestone Jul 2, 2021
@luomingshuang
Copy link
Contributor Author

@pzelasko , I add the function which can convert the phones to 48 (default) or 39. And I think it is ready to merge.

corpus_dir: Pathlike,
splits_dir: Pathlike,
output_dir: Optional[Pathlike] = None,
num_phones: int = 48,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some documentation about the possible options for num_phones.


return manifests

def get_phonemes(num_phones):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this function also handle 60 phones?

Also please raise an exception if somebody passes a different number than 39 / 48 (and maybe 60 if it makes sense).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document this function.

Copy link
Collaborator

@pzelasko pzelasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please resolve my two last comments and I'm OK to merge it.

Copy link
Contributor Author

@luomingshuang luomingshuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please resolve my two last comments and I'm OK to merge it.

Thanks for your advice! Done it.

from lhotse.utils import Pathlike, urlretrieve_progress


def download_and_unzip(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool!

@@ -98,8 +98,8 @@ def prepare_timit(
for wav_file in tqdm(wav_files):
items = wav_file.split('/')
idx = items[-2] + '-' + items[-1][:-4]
speaker = items[-2]
transcript_file = wav_file[:-3] + 'PHN' ###the phone file
speaker = items[-2]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do it.

@luomingshuang
Copy link
Contributor Author

Please resolve my two last comments and I'm OK to merge it.


phones_dict = get_phonemes(num_phones)
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a need to use try .. except here?
What happens if it is removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Em....it is to remind people to set num_phones in [60, 48, 39]. If the value of num_phones not in [60, 48, 39], the error happens.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @csukuangfj that this try/except block is unnecessary. Please change:

    try:
        if num_phones in [60, 48, 39]:
             phones_dict = get_phonemes(num_phones)
         else:
             raise ValueError("The value of num_phones must be in [60, 48, 39].")
     except ValueError as e:
         print("Exception: ", repr(e))
         raise 

into

        if num_phones in [60, 48, 39]:
             phones_dict = get_phonemes(num_phones)
         else:
             raise ValueError("The value of num_phones must be in [60, 48, 39].")

import os
import zipfile
import logging
import string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
import string


def download_and_unzip(
target_dir: Pathlike = '.',
force_download: Optional[bool] = False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
force_download: Optional[bool] = False,
force_download: bool = False,

base_url: Optional[str] = 'https://data.deepai.org/timit.zip') -> None:
"""
Download and unzip the dataset TIMIT.
:param target_dir: Pathlike, the path of the dir to storage the dataset.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:param target_dir: Pathlike, the path of the dir to storage the dataset.
:param target_dir: Pathlike, the path of the dir to store the dataset.

"""
Download and unzip the dataset TIMIT.
:param target_dir: Pathlike, the path of the dir to storage the dataset.
:param force_download: Bool, if True, download the zips no matter if the zips exists.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:param force_download: Bool, if True, download the zips no matter if the zips exists.
:param force_download: bool, if True, download the zips no matter if the zips exists.

Download and unzip the dataset TIMIT.
:param target_dir: Pathlike, the path of the dir to storage the dataset.
:param force_download: Bool, if True, download the zips no matter if the zips exists.
:param base_url: str, the url of the TIMIT download for free.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:param base_url: str, the url of the TIMIT download for free.
:param base_url: str, the URL of the TIMIT dataset to download.

lines = f.readlines()
for line in lines:
items = line.strip().split(' ')
wav = os.path.join(corpus_dir, items[-1])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
wav = os.path.join(corpus_dir, items[-1])
wav = corpus_dir / items[-1]

items = line.strip().split(' ')
wav = os.path.join(corpus_dir, items[-1])
wav_files.append(wav)
print(f'{part} dataset manifest generation.')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
print(f'{part} dataset manifest generation.')
logging.debug(f'{part} dataset manifest generation.')


return manifests

def get_phonemes(num_phones):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document this function.

phonemes["zh"] = "sh"

else:
print("Using 60 phones for modeling!")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please be consistent: You're using both logging and print. Please choose either one, not both.

@luomingshuang
Copy link
Contributor Author

@csukuangfj and @pzelasko , thanks for your advice! Done it.

@pzelasko
Copy link
Collaborator

Cool, thank you @luomingshuang! I am merging, if any issues come up later we can do a follow up PR.

@pzelasko pzelasko merged commit 424abf6 into lhotse-speech:master Jul 19, 2021
@luomingshuang
Copy link
Contributor Author

luomingshuang commented Jul 19, 2021 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants