new dataset format with librispeech and commonvoice #303


Merged
vincentqb merged 28 commits into pytorch:master from datasets on Oct 23, 2019

Conversation

Contributor

@vincentqb vincentqb commented Oct 5, 2019

Introducing a new dataset format based on generators, along with two new datasets: LibriSpeech and CommonVoice.

Related to

return item


def download(urls, root_path):
Contributor

If you want to write an abstraction that explicitly downloads a file to disk, I'd call this "download_to_file".

In general, a generic download function that returns a stream is a bit tricky: if the user doesn't consume the stream, the connection might break down (?). Also, when writing a file straight to disk from the web, we can potentially bypass system memory.

However, having something that does indeed give you a stream of data from the web is very useful.

If you want to get fancy you could also try to auto-discover the filename to be used, as in here: https://github.com/pytorch/text/blob/fd31bf3722e17dfd5998b8c4f32c0431f3016d59/torchtext/utils.py#L33
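For illustration, a minimal sketch of what such a download_to_file could look like (the name, chunk size, and filename fallback here are placeholders, not part of this PR):

import os
import urllib.request


def download_to_file(url, root_path, filename=None, chunk_size=16 * 1024):
    # Hypothetical helper: fall back to the last URL component if no
    # filename is given; a server-provided name could instead be read
    # from the Content-Disposition header, as the torchtext util does.
    if filename is None:
        filename = os.path.basename(url)
    path = os.path.join(root_path, filename)
    with urllib.request.urlopen(url) as response, open(path, "wb") as f:
        # Copy chunk by chunk so the full payload never sits in memory.
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            f.write(chunk)
    return path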

yield file, folder


def extract(files):
Contributor
@cpuhrsch cpuhrsch Oct 8, 2019

You should add a flag that lets the user choose a format, and I don't think you need to make this specific to files. Have a look at https://github.com/pytorch/text/blob/fd31bf3722e17dfd5998b8c4f32c0431f3016d59/torchtext/utils.py#L152, which, regrettably, still requires files.

For an example of how to extract "on-the-fly" from an arbitrary file / buffer, check out this snippet:

import io
import tarfile

import requests


def get_data(URL):
    # Download the archive into memory and untar it without touching disk.
    r = requests.get(URL)
    file_like_object = io.BytesIO(r.content)
    tar = tarfile.open(fileobj=file_like_object)
    d = {}
    for member in tar.getmembers():
        # Keep only the CSV members and bucket them into train/test splits.
        if member.isfile() and member.name.endswith('csv'):
            k = 'train' if 'train' in member.name else 'test'
            d[k] = tar.extractfile(member)
    return d

With something generic like this, users can do:

for tar_buffer in extract(open("/path/to/file.tar", "rb")):
    tar_buffer.write_to_disk("/path...")

In general, the io module (https://docs.python.org/3/library/io.html) is a good library to look at for this.

If you want to make it explicit that this extracts from and to a file, I'd change the name to extract_file_to_disk or such.

Input: path
Output: path, file name identifying a row of data
"""
for path in paths:
Contributor

You can potentially write this as a list comprehension, which might be more performant.

In general this function is more akin to the http://man7.org/linux/man-pages/man1/find.1.html tool, so that might be a better name?
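For illustration, such a find-like generator might look roughly like this (the name find and the glob pattern argument are hypothetical, not the exact utility in this PR):

import fnmatch
import os


def find(root, pattern="*"):
    # Yield (path, filename) pairs for every file under root whose
    # name matches the glob pattern, similar to `find root -name ...`.
    for path, _, files in os.walk(root):
        for f in fnmatch.filter(files, pattern):
            yield path, f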

yield path, f


def shuffle(generator):
Contributor

I don't think this is necessary, except that random.shuffle is annoying because it doesn't give you a reference to the shuffled object.

Instead of "generator" you can also say "iterable".
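For reference, a variant that accepts any iterable and returns a reference to the shuffled result could be written as (a sketch, not the PR's implementation):

import random


def shuffle(iterable):
    # random.shuffle works in place and returns None, so materialize
    # the iterable into a list and hand back a reference to it.
    items = list(iterable)
    random.shuffle(items)
    return items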

yield g


def filtering(fileids, reference):
Contributor

This is similar to https://linux.die.net/man/1/comm and can also be extended to work on generic streams of text plus one constant source. It's similar to set intersection etc.
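Reading this comment as a stream-versus-constant-set intersection, a sketch could look like the following (assuming the reference source fits in memory):

def filtering(fileids, reference):
    # Keep only the ids that also appear in the constant reference
    # source, akin to `comm -12` or a set intersection on streams.
    reference = set(reference)
    for fileid in fileids:
        if fileid in reference:
            yield fileid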

return len(self._cache)


class Buffer:
Contributor

For this to provide a performance gain it will probably want to make use of multithreading. This does introduce various constraints on the given generator that need to be determined. I imagine this to be most useful when applied to functions that perform IO operations, such as reading from multiple files, downloading multiple URLs, etc.

See the following snippet:

import time
import concurrent.futures
import urllib.request

URL = "<your_image_url>"  # placeholder: substitute a real URL before running

def read_50(fn):
    # Sequential baseline: issue the 50 requests one after another.
    for _ in range(50):
        yield fn()

def read_fn():
    with urllib.request.urlopen(URL) as response:
        return response.read()

def read_50_threaded(fn):
    # Overlap the 50 requests across a small pool of worker threads.
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(fn) for _ in range(50)]
        for future in futures:
            yield future.result()

t1 = time.time()
results = list(read_50(read_fn))
t2 = time.time()

print("t: " + str(t2 - t1))

t1 = time.time()
results = list(read_50_threaded(read_fn))
t2 = time.time()

print("t: " + str(t2 - t1))

import torchaudio


class Cache:
Contributor

You could even make this a function decorator that assumes the input/output relationship is deterministic.
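A sketch of the decorator idea, using the standard library's memoization and assuming the path-to-audio mapping is deterministic (the function name here is hypothetical):

import functools

import torchaudio


@functools.lru_cache(maxsize=None)
def read_audio_cached(path):
    # Safe to memoize only because the same path always yields the
    # same (tensor, sample_rate) pair, i.e. the mapping is deterministic.
    return torchaudio.load(path)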

self._cache.append(file)

os.makedirs(self.location, exist_ok=True)
with open(file, "wb") as file:
Contributor

Pickling and files are a relatively strong assumption; lots of things can't be pickled. In terms of caching things on disk for each input argument, you might end up with a lot of small files for something like images. If you want to write a read-only file cache (which is what this kind of is), you want a single file you append to, but random seek is not very good when it comes to files.

I'd say for now we stick to in-memory-only caching.

Contributor

Or if you want to serialize it, read out the entire generator into a list first and then save that, or append to an open file etc.

Contributor Author

For reference: SpooledTemporaryFile.
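For context, SpooledTemporaryFile keeps data in memory until it grows past a threshold and only then spills to disk, e.g.:

import tempfile

# Stays in memory up to ~16 MiB, then transparently rolls over to disk.
buffer = tempfile.SpooledTemporaryFile(max_size=16 * 1024 * 1024)
buffer.write(b"some bytes")
buffer.seek(0)
data = buffer.read()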

Contributor Author
@vincentqb vincentqb Oct 17, 2019

PyTorch also has torch.save for serialization.
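Following the suggestion above of reading the generator out first, a sketch with torch.save (the helper name and path are illustrative):

import torch


def cache_to_disk(generator, path):
    # Materialize the generator, then serialize the whole list at once.
    data = list(generator)
    torch.save(data, path)
    return data

# Later: data = torch.load(path)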



def read_audio(fp, downsample=True):
    if downsample:
Contributor

I wonder if we can use our resample transform or if this is necessary for this dataset

return sig, sr


def load_txts(dir):
Contributor

I think you could first run a "find" function to get a dictionary of file paths plus the corresponding text (in UTF-8 format).

Then somewhere you need a map between transcription files and audio files.

Contributor Author

This is part of the legacy datasets, which I simply moved into a legacy folder.

if not found:
    from warnings import warn

    warn("Translation not found for {}.".format(fileid))
Contributor

Should this ever happen? If not, I'd not check for existence at all and just have the open function error out.

def check_integrity(fpath, md5=None):
    if not os.path.isfile(fpath):
        return False
    if md5 is None:
Contributor

Is this actually desired?

@vincentqb vincentqb changed the title [WIP] new dataset format. [WIP] new dataset format with librispeech and commonvoice Oct 9, 2019
@ryanleary

For best efficiency (if that's a goal here), it might make some sense to format the audio such that utterances are shuffled and packed into one or more tars (or something like it) so that sequential I/O can be used. Especially for these larger datasets, dealing with thousands of files and accessing them randomly over a network can become a significant challenge.
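For a rough picture of the sequential-I/O pattern described here, reading pre-shuffled utterances straight out of a tar could look like this (a sketch, not part of this PR):

import tarfile


def iter_tar_utterances(tar_path):
    # Mode "r|*" forces purely sequential reads, which is what you want
    # when the archive lives on network storage.
    with tarfile.open(tar_path, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member).read()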

torchaudio.datasets.utils.download_url(_url, root)
torchaudio.datasets.utils.extract_archive(archive)

walker = torchaudio.datasets.utils.walk_files(
Contributor

Not sure walking here is the right approach. This is unfortunately one of those situations where the "input" to the pipeline is a list of paths that are then randomized (sometimes based on some per-path metadata). We could also let the user provide these paths as an argument, and reference the walk_files util as a means of retrieving them.


def calculate_md5(fpath, chunk_size=1024 * 1024):
    md5 = hashlib.md5()
    with open(fpath, "rb") as f:
Contributor

I assume this is a copy. In general this shouldn't be restricted to files; it should work on binary blobs. We might very easily want to provide this functionality for files downloaded on the fly.
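Generalizing the hashing to any binary stream could be as simple as the following (a sketch; a path-based calculate_md5 could then wrap it):

import hashlib


def calculate_md5_from_fileobj(fileobj, chunk_size=1024 * 1024):
    # Works for anything with a .read() method: open files,
    # io.BytesIO buffers, HTTP responses, etc.
    md5 = hashlib.md5()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        md5.update(chunk)
    return md5.hexdigest()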

return filename.endswith(".zip")


def extract_archive(from_path, to_path=None, overwrite=False):
Contributor

This doesn't need to be restricted to paths.

if to_path is None:
    to_path = os.path.dirname(from_path)

if from_path.endswith((".tar.gz", ".tgz")):
Contributor

It's possible to try...except opening something as a tar file. The tar library will read the first few bytes and then decide whether it's a tarfile or not. We shouldn't make this dependent on the extension.
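Something along these lines, where each library sniffs the header bytes instead of trusting the extension (a sketch; the function name is hypothetical):

import tarfile
import zipfile


def open_archive(from_path):
    # tarfile inspects the first block to decide whether this is a tar,
    # so no extension check is needed; fall through to zip otherwise.
    try:
        return tarfile.open(from_path)
    except tarfile.ReadError:
        pass
    try:
        return zipfile.ZipFile(from_path)
    except zipfile.BadZipFile:
        raise RuntimeError("Unknown archive format: {}".format(from_path))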

Contributor Author

Good point.

)


class Cache:
Contributor

I'd give this a more specific name. Like "DiskCache".

Contributor Author

Good point.

@vincentqb vincentqb changed the title [WIP] new dataset format with librispeech and commonvoice new dataset format with librispeech and commonvoice Oct 22, 2019
@cpuhrsch
Contributor

@ryanleary - I can't agree more that pipelining and chunking are necessary optimizations. For now we're aiming for decreased complexity and coverage for these new datasets without breaking backwards compatibility. After that we'll work on performance.

@vincentqb
Contributor Author

I've defined the interface to be close to the existing one, added a few deprecation warnings, and removed the "legacy" folder. @cpuhrsch thoughts?

Contributor
@cpuhrsch cpuhrsch left a comment

Looks great to me :)

@vincentqb vincentqb merged commit 8920802 into pytorch:master Oct 23, 2019
@vincentqb vincentqb deleted the datasets branch October 23, 2019 17:55