new dataset format with librispeech and commonvoice #303


Merged
vincentqb merged 28 commits into pytorch:master from datasets on Oct 23, 2019

Conversation

Contributor

@vincentqb vincentqb commented Oct 5, 2019

Introducing a new dataset format based on generators, along with two new datasets: LibriSpeech and CommonVoice.

Related to

return item


def download(urls, root_path):
Contributor

If you want to write an abstraction that explicitly downloads a file to disk, I'd call this "download_to_file".

In general, a generic download function that returns a stream is a bit tricky: if the user doesn't consume the stream, the connection might break down (?). Also, when writing a file straight to disk from the web, we can potentially bypass system memory.

However, having something that does indeed give you a stream of data from the web is very useful.

If you want to get fancy you could also try to auto-discover the filename to be used, as in here: https://github.com/pytorch/text/blob/fd31bf3722e17dfd5998b8c4f32c0431f3016d59/torchtext/utils.py#L33
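For illustration, a minimal sketch of what such a download_to_file could look like (the name, chunk size, and filename fallback here are placeholders, not part of this PR):

import os
import urllib.request


def download_to_file(url, root_path, filename=None, chunk_size=16 * 1024):
    # Hypothetical helper: fall back to the last URL component if no
    # filename is given; a server-provided name could instead be read
    # from the Content-Disposition header, as the torchtext util does.
    if filename is None:
        filename = os.path.basename(url)
    path = os.path.join(root_path, filename)
    with urllib.request.urlopen(url) as response, open(path, "wb") as f:
        # Copy chunk by chunk so the full payload never sits in memory.
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            f.write(chunk)
    return path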

yield file, folder


def extract(files):
Contributor
@cpuhrsch cpuhrsch Oct 8, 2019

You should add a flag that lets the user choose a format, and I don't think you need to make this specific to files. Have a look at https://github.com/pytorch/text/blob/fd31bf3722e17dfd5998b8c4f32c0431f3016d59/torchtext/utils.py#L152, which, regrettably, still requires files.

For an example of how to extract "on-the-fly" from an arbitrary file / buffer, check out this snippet:

import io
import tarfile

import requests


def get_data(URL):
    # Download the archive into memory and untar it without touching disk.
    r = requests.get(URL)
    file_like_object = io.BytesIO(r.content)
    tar = tarfile.open(fileobj=file_like_object)
    d = {}
    for member in tar.getmembers():
        # Keep only the CSV members and bucket them into train/test splits.
        if member.isfile() and member.name.endswith('csv'):
            k = 'train' if 'train' in member.name else 'test'
            d[k] = tar.extractfile(member)
    return d

With something generic like this, users can do:

for tar_buffer in extract(open("/path/to/file.tar", "rb")):
    tar_buffer.write_to_disk("/path...")

In general, the io module (https://docs.python.org/3/library/io.html) is a good library to look at for this.

If you want to make it explicit that this extracts from and to a file, I'd change the name to extract_file_to_disk or such.

Input: path
Output: path, file name identifying a row of data
"""
for path in paths:
Contributor

You can potentially write this as a list comprehension, which might be more performant.

In general this function is more akin to the http://man7.org/linux/man-pages/man1/find.1.html tool, so that might be a better name?
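For illustration, such a find-like generator might look roughly like this (the name find and the glob pattern argument are hypothetical, not the exact utility in this PR):

import fnmatch
import os


def find(root, pattern="*"):
    # Yield (path, filename) pairs for every file under root whose
    # name matches the glob pattern, similar to `find root -name ...`.
    for path, _, files in os.walk(root):
        for f in fnmatch.filter(files, pattern):
            yield path, f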

yield path, f


def shuffle(generator):
Contributor

I don't think this is necessary, except that random.shuffle is annoying because it doesn't give you a reference to the shuffled object.

Instead of "generator" you can also say "iterable".
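For reference, a variant that accepts any iterable and returns a reference to the shuffled result could be written as (a sketch, not the PR's implementation):

import random


def shuffle(iterable):
    # random.shuffle works in place and returns None, so materialize
    # the iterable into a list and hand back a reference to it.
    items = list(iterable)
    random.shuffle(items)
    return items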

yield g


def filtering(fileids, reference):
Contributor

This is similar to https://linux.die.net/man/1/comm and can also be extended to work on generic streams of text plus one constant source. It's similar to set intersection etc.
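Reading this comment as a stream-versus-constant-set intersection, a sketch could look like the following (assuming the reference source fits in memory):

def filtering(fileids, reference):
    # Keep only the ids that also appear in the constant reference
    # source, akin to `comm -12` or a set intersection on streams.
    reference = set(reference)
    for fileid in fileids:
        if fileid in reference:
            yield fileid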

return len(self._cache)


class Buffer:
Contributor

For this to provide a performance gain it will probably want to make use of multithreading. This does introduce various constraints on the given generator that need to be determined. I imagine this to be most useful when applied to functions that perform IO operations, such as reading from multiple files, downloading multiple URLs, etc.

See the following snippet:

import time
import concurrent.futures
import urllib.request

URL = "<your_image_url>"  # placeholder: substitute a real URL before running

def read_50(fn):
    # Sequential baseline: issue the 50 requests one after another.
    for _ in range(50):
        yield fn()

def read_fn():
    with urllib.request.urlopen(URL) as response:
        return response.read()

def read_50_threaded(fn):
    # Overlap the 50 requests across a small pool of worker threads.
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(fn) for _ in range(50)]
        for future in futures:
            yield future.result()

t1 = time.time()
results = list(read_50(read_fn))
t2 = time.time()

print("t: " + str(t2 - t1))

t1 = time.time()
results = list(read_50_threaded(read_fn))
t2 = time.time()

print("t: " + str(t2 - t1))

import torchaudio


class Cache:
Contributor

You could even make this a function decorator that assumes the input/output relationship is deterministic.
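A sketch of the decorator idea, using the standard library's memoization and assuming the path-to-audio mapping is deterministic (the function name here is hypothetical):

import functools

import torchaudio


@functools.lru_cache(maxsize=None)
def read_audio_cached(path):
    # Safe to memoize only because the same path always yields the
    # same (tensor, sample_rate) pair, i.e. the mapping is deterministic.
    return torchaudio.load(path)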

self._cache.append(file)

os.makedirs(self.location, exist_ok=True)
with open(file, "wb") as file:
Contributor

Pickling and files are a relatively strong assumption; lots of things can't be pickled. In terms of caching things on disk for each input argument, you might end up with a lot of small files for something like images. If you want to write a read-only file cache (which is what this kind of is), you want a single file you append to, but random seek is not very good when it comes to files.

I'd say for now we stick to in-memory-only caching.

Contributor

Or if you want to serialize it, read out the entire generator into a list first and then save that, or append to an open file etc.

Contributor Author

For reference: SpooledTemporaryFile.
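For context, SpooledTemporaryFile keeps data in memory until it grows past a threshold and only then spills to disk, e.g.:

import tempfile

# Stays in memory up to ~16 MiB, then transparently rolls over to disk.
buffer = tempfile.SpooledTemporaryFile(max_size=16 * 1024 * 1024)
buffer.write(b"some bytes")
buffer.seek(0)
data = buffer.read()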

Contributor Author
@vincentqb vincentqb Oct 17, 2019

PyTorch also has torch.save for serialization.
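Following the suggestion above of reading the generator out first, a sketch with torch.save (the helper name and path are illustrative):

import torch


def cache_to_disk(generator, path):
    # Materialize the generator, then serialize the whole list at once.
    data = list(generator)
    torch.save(data, path)
    return data

# Later: data = torch.load(path)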



def read_audio(fp, downsample=True):
    if downsample:
Contributor

I wonder if we can use our resample transform or if this is necessary for this dataset

return sig, sr


def load_txts(dir):
Contributor

I think you could first run a "find" function to get a dictionary of file paths plus the corresponding text (in UTF-8 format).

Then somewhere you need a map between transcription files and audio files.

Contributor Author

This is part of the legacy datasets, which I simply moved into a legacy folder.

if not found:
    from warnings import warn

    warn("Translation not found for {}.".format(fileid))
Contributor

Should this ever happen? If not, I'd not check for existence at all and just have the open function error out.

def check_integrity(fpath, md5=None):
    if not os.path.isfile(fpath):
        return False
    if md5 is None:
Contributor

Is this actually desired?

@vincentqb vincentqb changed the title [WIP] new dataset format. [WIP] new dataset format with librispeech and commonvoice Oct 9, 2019
@ryanleary

For best efficiency (if that's a goal here), it might make some sense to format the audio such that utterances are shuffled and packed into one or more tars (or something like it) so that sequential I/O can be used. Especially for these larger datasets, dealing with thousands of files and accessing them randomly over a network can become a significant challenge.
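For a rough picture of the sequential-I/O pattern described here, reading pre-shuffled utterances straight out of a tar could look like this (a sketch, not part of this PR):

import tarfile


def iter_tar_utterances(tar_path):
    # Mode "r|*" forces purely sequential reads, which is what you want
    # when the archive lives on network storage.
    with tarfile.open(tar_path, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member).read()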

torchaudio.datasets.utils.download_url(_url, root)
torchaudio.datasets.utils.extract_archive(archive)

walker = torchaudio.datasets.utils.walk_files(
Contributor

Not sure walking here is the right approach. This is unfortunately one of those situations where the "input" to the pipeline is a list of paths that are then randomized (sometimes based on some per-path metadata). We could also let the user provide these paths as an argument, and reference the walk_files util as a means of retrieving them.


def calculate_md5(fpath, chunk_size=1024 * 1024):
    md5 = hashlib.md5()
    with open(fpath, "rb") as f:
Contributor

I assume this is a copy. In general this shouldn't be restricted to files; it should work on binary blobs. We might very easily want to provide this functionality for files downloaded on the fly.
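Generalizing the hashing to any binary stream could be as simple as the following (a sketch; a path-based calculate_md5 could then wrap it):

import hashlib


def calculate_md5_from_fileobj(fileobj, chunk_size=1024 * 1024):
    # Works for anything with a .read() method: open files,
    # io.BytesIO buffers, HTTP responses, etc.
    md5 = hashlib.md5()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        md5.update(chunk)
    return md5.hexdigest()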

return filename.endswith(".zip")


def extract_archive(from_path, to_path=None, overwrite=False):
Contributor

This doesn't need to be restricted to paths.

if to_path is None:
    to_path = os.path.dirname(from_path)

if from_path.endswith((".tar.gz", ".tgz")):
Contributor

It's possible to try...except opening something as a tar file. The tar library will read the first few bytes and then decide whether it's a tarfile or not. We shouldn't make this dependent on the extension.
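Something along these lines, where each library sniffs the header bytes instead of trusting the extension (a sketch; the function name is hypothetical):

import tarfile
import zipfile


def open_archive(from_path):
    # tarfile inspects the first block to decide whether this is a tar,
    # so no extension check is needed; fall through to zip otherwise.
    try:
        return tarfile.open(from_path)
    except tarfile.ReadError:
        pass
    try:
        return zipfile.ZipFile(from_path)
    except zipfile.BadZipFile:
        raise RuntimeError("Unknown archive format: {}".format(from_path))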

Contributor Author

Good point.

)


class Cache:
Contributor

I'd give this a more specific name. Like "DiskCache".

Contributor Author

Good point.

@vincentqb vincentqb changed the title [WIP] new dataset format with librispeech and commonvoice new dataset format with librispeech and commonvoice Oct 22, 2019
@cpuhrsch
Contributor

@ryanleary - I can't agree more that pipelining and chunking are necessary optimizations. For now we're aiming for decreased complexity and coverage for these new datasets without breaking backwards compatibility. After that we'll work on performance.

@vincentqb
Contributor Author

I've defined the interface to be close to the existing one, added a few deprecation warnings, and removed the "legacy" folder. @cpuhrsch thoughts?

Contributor
@cpuhrsch cpuhrsch left a comment

Looks great to me :)

@vincentqb vincentqb merged commit 8920802 into pytorch:master Oct 23, 2019
@vincentqb vincentqb deleted the datasets branch October 23, 2019 17:55