prepare timit manifests #324

luomingshuang · 2021-06-29T01:34:24Z

In timit.py, I use phonemes as the supervisions.

csukuangfj · 2021-06-29T02:39:31Z

lhotse/recipes/timit.py

+
+def download_and_unzip(
+        target_dir: Pathlike = '.',
+        force_download: Optional[bool] = False,


Why are force_download and base_url Optional?
What if Optional is removed?

If the TIMIT zip exists, we don't need to download it. Optional here means that we can either download or not download. And the base_url is not unique.

Optional here means you can pass None to the argument. Obviously, base_url cannot be None.

Yeah the types should be plain bool and str

Suggested change

force_download: Optional[bool] = False,

force_download: bool = False,
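As background for the point made in this thread: in Python's typing module, Optional[X] is shorthand for Union[X, None], so it states that None is an accepted value, not that the argument can be left out. A minimal sketch follows; download_stub is a hypothetical stand-in, not the recipe's function.

```python
from typing import Optional, Union

# Optional[bool] is exactly Union[bool, None]; it does not mean "has a default".
assert Optional[bool] == Union[bool, None]


def download_stub(force_download: bool = False) -> str:
    """Hypothetical stand-in: the default value alone makes the argument omittable."""
    return "download" if force_download else "skip if the zip exists"


print(download_stub())      # argument omitted, so the default False is used
print(download_stub(True))  # explicit request to download again
```

The plain bool annotation keeps the argument omittable because of its default, while making it clear that None is not a meaningful value.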

csukuangfj · 2021-06-29T02:40:12Z

lhotse/recipes/timit.py

+        force_download: Optional[bool] = False,
+        base_url: Optional[str] = 'https://data.deepai.org/timit.zip') -> None:
+    """
+    Download and unzip the dataset, supporting both TIMIT


What does supporting both TIMIT mean? Is there something missing?

Em, yes, I am rewriting.

csukuangfj · 2021-06-29T02:41:11Z

lhotse/recipes/timit.py

+        output_dir: Optional[Pathlike] = None,
+        num_jobs: int = 1
+) -> Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]:
+    """


Could you add some documentation?

Ok, I will do it.

csukuangfj · 2021-06-29T02:41:45Z

lhotse/recipes/timit.py

+    assert corpus_dir.is_dir(), f'No such directory: {corpus_dir}'
+
+    splits_dir = Path(splits_dir)
+    assert corpus_dir.is_dir(), f'No such directory: {splits_dir}'


Suggested change

assert corpus_dir.is_dir(), f'No such directory: {splits_dir}'

assert splits_dir.is_dir(), f'No such directory: {splits_dir}'

csukuangfj · 2021-06-29T02:43:29Z

lhotse/recipes/timit.py

+from collections import defaultdict
+
+import os
+import glob


Please remove packages that are not used.

csukuangfj · 2021-06-29T02:44:53Z

lhotse/recipes/timit.py

+
+    with ThreadPoolExecutor(num_jobs) as ex:
+        for part in dataset_parts:
+          wav_files = []


Suggested change

wav_files = []

csukuangfj · 2021-06-29T02:45:01Z

lhotse/recipes/timit.py

+    with ThreadPoolExecutor(num_jobs) as ex:
+        for part in dataset_parts:
+          wav_files = []
+          file_name = ''


Suggested change

file_name = ''

pzelasko

Thanks! Left some comments. Can you also update the table in docs/corpus.rst and add a CLI mode for TIMIT? (See lhotse/bin/modes/recipes for a lot of examples.)

pzelasko · 2021-06-29T03:09:10Z

lhotse/recipes/timit.py

+from lhotse.utils import Pathlike, urlretrieve_progress
+
+
+def download_and_unzip(


I suggest naming it "download_timit".

pzelasko · 2021-06-29T03:10:07Z

lhotse/recipes/timit.py

+
+def download_and_unzip(
+        target_dir: Pathlike = '.',
+        force_download: Optional[bool] = False,


Yeah the types should be plain bool and str

pzelasko · 2021-06-29T03:12:39Z

lhotse/recipes/timit.py

+          file_name = ''
+
+          if part == 'TRAIN':
+            file_name = os.path.join(splits_dir, 'train_samples.txt') 


You can replace all os.path.join(a, b) with: a / b (this is possible because we use Path objects)
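A minimal sketch of the equivalence described above, assuming splits_dir is already a pathlib.Path (the directory name here is hypothetical):

```python
import os
from pathlib import Path

splits_dir = Path("splits")  # hypothetical directory name

joined_with_os = os.path.join(splits_dir, "train_samples.txt")  # returns a str
joined_with_path = splits_dir / "train_samples.txt"             # returns a Path

# Both spellings point at the same location; only the result type differs.
assert Path(joined_with_os) == joined_with_path
```

Staying with Path objects also means later calls such as is_dir() or with_suffix() are available without converting back from str.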

pzelasko · 2021-06-29T03:14:27Z

lhotse/recipes/timit.py

+                items = wav_file.split('/')
+                idx = items[-2] + '-' + items[-1][:-4]
+                speaker = items[-2]
+                transcript_file = wav_file[:-3] + 'PHN' ###the phone file


wav_file.with_suffix('.PHN')
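A sketch of how the string slicing in the hunk above could be expressed with pathlib, assuming wav_file is a path object rather than a str. The example path is hypothetical, and PurePosixPath is used only so the sketch behaves the same on every platform.

```python
from pathlib import PurePosixPath

# Hypothetical TIMIT-style path; the real directory layout may differ.
wav_file = PurePosixPath("TRAIN/DR1/FCJF0/SA1.WAV")

speaker = wav_file.parent.name                  # replaces items[-2]
idx = f"{speaker}-{wav_file.stem}"              # replaces items[-2] + '-' + items[-1][:-4]
transcript_file = wav_file.with_suffix(".PHN")  # replaces wav_file[:-3] + 'PHN'
```

Unlike the slice-based version, this does not depend on the extension being exactly three characters long or on '/' being the path separator.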

csukuangfj · 2021-06-29T03:24:39Z

lhotse/recipes/timit.py

+        output_dir = Path(output_dir)
+        output_dir.mkdir(parents=True, exist_ok=True)
+
+    manifests = defaultdict(dict)


Does a plain dict work here?
Any reason to use a defaultdict?
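For context on the question, a short sketch of the difference between the two containers. The keys and placeholder values are hypothetical; whether the recipe actually needs defaultdict depends on how it assigns entries, which this hunk does not show.

```python
from collections import defaultdict

# With defaultdict(dict), a missing outer key is created on first access.
manifests_default = defaultdict(dict)
manifests_default["TRAIN"]["recordings"] = "recording-set placeholder"

# With a plain dict, the inner dict has to be created explicitly.
manifests_plain = {}
manifests_plain["TRAIN"] = {"recordings": "recording-set placeholder"}

assert dict(manifests_default) == manifests_plain
```

If every part is assigned as a whole dictionary in one statement, the plain dict is sufficient.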

csukuangfj · 2021-06-29T03:25:47Z

lhotse/recipes/timit.py

+          wav_files = []
+          with open(file_name, 'r') as f:
+            lines = f.readlines()
+            for line in lines:


for line in f

No need to read all lines at once.
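A minimal sketch of the suggested loop. io.StringIO stands in for the opened split file, and its contents are hypothetical.

```python
import io

# io.StringIO stands in for an opened split file; the contents are hypothetical.
f = io.StringIO("TRAIN/DR1/FCJF0/SA1.WAV\nTRAIN/DR1/FCJF0/SA2.WAV\n")

wav_files = []
for line in f:  # yields one line at a time instead of building a list first
    wav_files.append(line.strip())
```

Iterating over the file object gives the same lines as readlines() without holding the whole file in memory as an intermediate list.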

csukuangfj · 2021-06-29T06:50:11Z

lhotse/recipes/timit.py

@@ -98,8 +98,8 @@ def prepare_timit(
                for wav_file in tqdm(wav_files):
                    items = wav_file.split('/')
                    idx = items[-2] + '-' + items[-1][:-4]
-                    speaker = items[-2]
-                    transcript_file = wav_file[:-3] + 'PHN' ###the phone file
+                    speaker = items[-2] 


Please remove ALL leading and trailing spaces.

glynpu · 2021-06-29T07:05:01Z

lhotse/recipes/timit.py

+    """
+    target_dir = Path(target_dir)
+    target_dir.mkdir(parents=True, exist_ok=True)
+    tar_name = f'timit.zip'


Maybe an f-string is not needed here, since the string has no placeholders.

- tar_name = f'timit.zip'
+ tar_name = 'timit.zip'

glynpu · 2021-06-29T07:11:08Z

lhotse/recipes/timit.py

+            file_name = ''
+
+            if part == 'TRAIN': 
+                file_name = splits_dir/'train_samples.txt'


I would suggest adding a space before and after the operator "/".

- file_name = splits_dir/'train_samples.txt'
+ file_name = splits_dir / 'train_samples.txt'

glynpu

+1

danpovey · 2021-06-30T13:49:39Z

Sometimes TIMIT has logic for using a 48-phone version or some other size of phone set.
Does it make sense to include that logic here?

jtrmal · 2021-06-30T13:51:23Z

Just IMO, yes.

pzelasko · 2021-06-30T22:12:04Z

Let me know when it is ready to merge

luomingshuang · 2021-07-01T01:51:56Z

Yes, maybe I should add three options (60, 48, 39) for choosing the number of phonemes. The attached phones_60_48_39.txt is the same as phones.60-48-39.map in Kaldi.
phones_60_48_39.txt

luomingshuang · 2021-07-01T01:54:34Z

I will set 48-phone version as the default.
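One way such a three-way option could be driven from a mapping file is sketched below. It assumes a whitespace-separated layout with the 60-, 48-, and 39-phone labels in columns one to three; load_phone_map and the sample rows are hypothetical illustrations, and the real rows come from the attached file, which is not reproduced here.

```python
from typing import Dict

# Illustrative rows in the assumed "60 48 39" column layout; the real rows
# come from the attached phones_60_48_39.txt.
SAMPLE_MAP = """\
aa aa aa
ao ao aa
zh zh sh
"""


def load_phone_map(text: str, num_phones: int = 48) -> Dict[str, str]:
    """Map each 60-set phone to its label in the requested phone set."""
    column = {60: 0, 48: 1, 39: 2}[num_phones]  # KeyError for any other size
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # this sketch skips blank or incomplete rows
        mapping[fields[0]] = fields[column]
    return mapping
```

For example, load_phone_map(SAMPLE_MAP, 39) folds "zh" into "sh", while the default 48-phone setting keeps "zh" unchanged.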

luomingshuang · 2021-07-15T05:56:14Z

@pzelasko, I added a function that converts the phones to 48 (default) or 39. I think it is ready to merge.

pzelasko · 2021-07-17T02:11:54Z

lhotse/recipes/timit.py

+        corpus_dir: Pathlike,
+        splits_dir: Pathlike,
+        output_dir: Optional[Pathlike] = None,
+        num_phones: int = 48,


Please add some documentation about the possible options for num_phones.

pzelasko · 2021-07-17T02:12:59Z

lhotse/recipes/timit.py

+
+    return manifests
+
+def get_phonemes(num_phones):


Should this function also handle 60 phones?

Also please raise an exception if somebody passes a different number than 39 / 48 (and maybe 60 if it makes sense).

Please document this function.

pzelasko

Please resolve my two last comments and I'm OK to merge it.

luomingshuang

Please resolve my two last comments and I'm OK to merge it.

Thanks for your advice! Done.

csukuangfj · 2021-07-18T07:28:36Z

lhotse/recipes/timit.py


-    phones_dict = get_phonemes(num_phones)
+    try:


Is there a need to use try .. except here?
What happens if it is removed?

It is there to remind people to set num_phones to one of [60, 48, 39]. If the value of num_phones is not in [60, 48, 39], an error is raised.

I agree with @csukuangfj that this try/except block is unnecessary. Please change:

try:
    if num_phones in [60, 48, 39]:
        phones_dict = get_phonemes(num_phones)
    else:
        raise ValueError("The value of num_phones must be in [60, 48, 39].")
except ValueError as e:
    print("Exception: ", repr(e))
    raise

into

if num_phones in [60, 48, 39]:
    phones_dict = get_phonemes(num_phones)
else:
    raise ValueError("The value of num_phones must be in [60, 48, 39].")
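A self-contained sketch of the simplified check, showing that the caller still sees the ValueError without the surrounding try/except. Both function names are hypothetical; get_phonemes_stub stands in for the real get_phonemes.

```python
def get_phonemes_stub(num_phones: int) -> dict:
    """Hypothetical stand-in for get_phonemes; returns an empty mapping."""
    return {}


def select_phones(num_phones: int = 48) -> dict:
    if num_phones not in (60, 48, 39):
        # Raising directly is enough; the traceback reaches the caller unchanged.
        raise ValueError("The value of num_phones must be in [60, 48, 39].")
    return get_phonemes_stub(num_phones)


try:
    select_phones(40)
except ValueError as e:
    print(repr(e))
```

Catching the exception only to print and re-raise it adds nothing, because an uncaught exception already reports its message.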

csukuangfj · 2021-07-18T14:19:06Z

lhotse/recipes/timit.py

+import os
+import zipfile
+import logging
+import string


Suggested change

import string

csukuangfj · 2021-07-18T14:20:41Z

lhotse/recipes/timit.py

+
+def download_and_unzip(
+        target_dir: Pathlike = '.',
+        force_download: Optional[bool] = False,


Suggested change

force_download: Optional[bool] = False,

force_download: bool = False,

csukuangfj · 2021-07-18T14:21:03Z

lhotse/recipes/timit.py

+        base_url: Optional[str] = 'https://data.deepai.org/timit.zip') -> None:
+    """
+    Download and unzip the dataset TIMIT.
+    :param target_dir: Pathlike, the path of the dir to storage the dataset.


Suggested change

:param target_dir: Pathlike, the path of the dir to storage the dataset.

:param target_dir: Pathlike, the path of the dir to store the dataset.

csukuangfj · 2021-07-18T14:21:28Z

lhotse/recipes/timit.py

+    """
+    Download and unzip the dataset TIMIT.
+    :param target_dir: Pathlike, the path of the dir to storage the dataset.
+    :param force_download: Bool, if True, download the zips no matter if the zips exists.


Suggested change

:param force_download: Bool, if True, download the zips no matter if the zips exists.

:param force_download: bool, if True, download the zips no matter if the zips exists.

csukuangfj · 2021-07-18T14:21:54Z

lhotse/recipes/timit.py

+    Download and unzip the dataset TIMIT.
+    :param target_dir: Pathlike, the path of the dir to storage the dataset.
+    :param force_download: Bool, if True, download the zips no matter if the zips exists.
+    :param base_url: str, the url of the TIMIT download for free.


Suggested change

:param base_url: str, the url of the TIMIT download for free.

:param base_url: str, the URL of the TIMIT dataset to download.

csukuangfj · 2021-07-18T14:24:36Z

lhotse/recipes/timit.py

+                lines = f.readlines() 
+                for line in lines:
+                    items = line.strip().split(' ')
+                    wav = os.path.join(corpus_dir, items[-1])


Suggested change

wav = os.path.join(corpus_dir, items[-1])

wav = corpus_dir / items[-1]

csukuangfj · 2021-07-18T14:24:53Z

lhotse/recipes/timit.py

+                    items = line.strip().split(' ')
+                    wav = os.path.join(corpus_dir, items[-1])
+                    wav_files.append(wav)
+                print(f'{part} dataset manifest generation.')


Suggested change

print(f'{part} dataset manifest generation.')

logging.debug(f'{part} dataset manifest generation.')
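One consequence of this suggestion worth keeping in mind: with the standard logging module, DEBUG messages are hidden unless the effective level is lowered, since the root logger defaults to WARNING. A small sketch using a hypothetical logger name:

```python
import logging

logger = logging.getLogger("timit_sketch")  # hypothetical logger name

logger.setLevel(logging.WARNING)
hidden_at_warning = not logger.isEnabledFor(logging.DEBUG)

logger.setLevel(logging.DEBUG)
shown_at_debug = logger.isEnabledFor(logging.DEBUG)

part = "TRAIN"
logger.debug(f"{part} dataset manifest generation.")  # emitted only at DEBUG level
```

Users who want to see the progress message would therefore need to configure logging at DEBUG level, for example with logging.basicConfig(level=logging.DEBUG).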

csukuangfj · 2021-07-18T14:25:52Z

lhotse/recipes/timit.py

+
+    return manifests
+
+def get_phonemes(num_phones):


Please document this function.

csukuangfj · 2021-07-18T14:26:33Z

lhotse/recipes/timit.py

+        phonemes["zh"] = "sh"
+
+    else:
+        print("Using 60 phones for modeling!")


Please be consistent: You're using both logging and print. Please choose either one, not both.
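A minimal sketch of routing this message through logging rather than print, so the module uses a single mechanism. announce_phone_set is a hypothetical helper, and the INFO level is an assumption.

```python
import logging

logging.basicConfig(level=logging.INFO)


def announce_phone_set(num_phones: int) -> str:
    """Hypothetical helper: build the message once and send it through logging."""
    message = f"Using {num_phones} phones for modeling!"
    logging.info(message)
    return message


announce_phone_set(60)
```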

luomingshuang · 2021-07-19T01:47:08Z

@csukuangfj and @pzelasko, thanks for your advice! Done.

pzelasko · 2021-07-19T02:15:12Z

Cool, thank you @luomingshuang! I am merging, if any issues come up later we can do a follow up PR.

luomingshuang · 2021-07-19T04:35:30Z

Good!

luomingshuang and others added 2 commits June 29, 2021 09:29

prepare timit manifests

bd83e95

Update timit.py

539304f

csukuangfj suggested changes Jun 29, 2021

pzelasko reviewed Jun 29, 2021

csukuangfj reviewed Jun 29, 2021

csukuangfj suggested changes Jun 29, 2021

luomingshuang added 4 commits June 29, 2021 13:00

Update __init__.py

f65fe93

Update timit.py

16a5d78

Update timit.py

51146b2

Update timit.py

ba57dc7

csukuangfj reviewed Jun 29, 2021

glynpu reviewed Jun 29, 2021

luomingshuang added 2 commits June 29, 2021 15:05

Update timit.py

752ce90

Update timit.py

9554ad5

glynpu reviewed Jun 29, 2021

Update timit.py

123fb1b

pzelasko added this to the v0.8 milestone Jul 2, 2021

Update timit.py

69479c8

pzelasko reviewed Jul 17, 2021

Update timit.py

73ce2bb

luomingshuang commented Jul 18, 2021

luomingshuang closed this Jul 18, 2021

luomingshuang reopened this Jul 18, 2021

csukuangfj reviewed Jul 18, 2021

csukuangfj suggested changes Jul 18, 2021

Update timit.py

7789a2c

Merge branch 'master' into timit

d8b50f3

pzelasko merged commit 424abf6 into lhotse-speech:master Jul 19, 2021
