new dataset format with librispeech and commonvoice #303
Conversation
torchaudio/datasets/datasets.py
Outdated
return item


def download(urls, root_path):
If you want to write an abstraction that explicitly downloads a file to disk, I'd call this "download_to_file".
In general, a generic download function that returns a stream is a bit tricky: if a user doesn't consume the stream, the connection might break down. Also, when writing a file straight to disk from the web, we can potentially bypass system memory.
However, having something that does indeed give you a stream of data from the web is very useful.
If you want to get fancy, you could also try to auto-discover the filename to be used, as done here: https://github.com/pytorch/text/blob/fd31bf3722e17dfd5998b8c4f32c0431f3016d59/torchtext/utils.py#L33
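As a rough sketch of that kind of abstraction (hypothetical name; assuming the requests library is available), streaming the response body straight to disk in chunks avoids holding the whole payload in memory:

import requests


def download_to_file(url, path, chunk_size=1024 * 1024):
    # stream the response body to disk chunk by chunk, so the
    # full payload never has to fit in system memory
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(path, "wb") as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)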
torchaudio/datasets/datasets.py
Outdated
yield file, folder


def extract(files):
You should add a flag to have the user choose a format, and I also don't think you need to make this specific to files. Have a look at https://github.com/pytorch/text/blob/fd31bf3722e17dfd5998b8c4f32c0431f3016d59/torchtext/utils.py#L152, which, however, regrettably still requires files.
For an example of how to extract "on-the-fly" from an arbitrary file / buffer, check out this snippet:
import io
import tarfile

import requests


def get_data(URL):
    # download the archive into memory and wrap it in a file-like object
    r = requests.get(URL)
    file_like_object = io.BytesIO(r.content)
    tar = tarfile.open(fileobj=file_like_object)
    d = {}
    for member in tar.getmembers():
        # extract only the csv members, without touching the disk
        if member.isfile() and member.name.endswith('csv'):
            k = 'train' if 'train' in member.name else 'test'
            d[k] = tar.extractfile(member)
    return d
With something generic like this, users can do

for tar_buffer in extract(open("/path/to/file.tar")):
    tar_buffer.write_to_disk("/path...")
In general, the io module (https://docs.python.org/3/library/io.html) is a good place to look for this.
If you want to make it explicit that this extracts from and to a file, I'd change the name to extract_file_to_disk or such.
torchaudio/datasets/datasets.py
Outdated
Input: path
Output: path, file name identifying a row of data
"""
for path in paths:
You can potentially write this as a list comprehension, which might be more performant.
In general this function is more akin to the find tool (http://man7.org/linux/man-pages/man1/find.1.html), so that might be a better name?
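For illustration, a minimal sketch of such a find-style helper (hypothetical name and signature):

import os


def find(root, suffix=""):
    # walk a directory tree and yield (path, filename) pairs,
    # roughly what find(1) does with a -name filter
    for dirpath, _, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(suffix):
                yield dirpath, filename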
torchaudio/datasets/datasets.py
Outdated
yield path, f


def shuffle(generator):
I don't think this is necessary, except that random.shuffle is annoying because it shuffles in place and doesn't give you a reference to the shuffled object.
Instead of "generator" you can also say "iterable".
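A minimal sketch of such a wrapper (hypothetical name), accepting any iterable:

import random


def shuffled(iterable):
    # materialize the iterable, shuffle it in place, and return it,
    # since random.shuffle itself returns None
    items = list(iterable)
    random.shuffle(items)
    return items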
torchaudio/datasets/datasets.py
Outdated
yield g


def filtering(fileids, reference):
This is similar to the comm tool (https://linux.die.net/man/1/comm) and can also be extended to work on generic streams of text plus one constant source. It's similar to set intersection, etc.
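For example, a sketch of such a filter (hypothetical name), with the reference acting as the constant source:

def filter_by_reference(stream, reference):
    # keep only items from the stream that appear in the constant
    # reference, i.e. a lazy set intersection akin to comm(1)
    reference = set(reference)
    return (item for item in stream if item in reference)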
torchaudio/datasets/utils.py
Outdated
return len(self._cache)


class Buffer:
For this to provide a performance gain it will probably want to make use of multithreading. This does introduce various constraints on the given generator that need to be determined. I imagine this to be most useful when applied to functions that perform IO operations, such as reading from multiple files, downloading multiple URLs, etc.
See the following snippet:
import time
import concurrent.futures
import urllib.request


def read_50(fn):
    # sequential baseline: 50 downloads, one after another
    for _ in range(50):
        yield fn()


def read_fn():
    with urllib.request.urlopen(<your_image_url>) as response:
        return response.read()


def read_50_threaded(fn):
    # threaded variant: overlap the 50 downloads across 5 workers
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(fn) for _ in range(50)]
        for future in futures:
            yield future.result()


t1 = time.time()
results = list(read_50(read_fn))
t2 = time.time()
print("t: " + str(t2 - t1))

t1 = time.time()
results = list(read_50_threaded(read_fn))
t2 = time.time()
print("t: " + str(t2 - t1))
torchaudio/datasets/datasets.py
Outdated
import torchaudio


class Cache:
You could even make this a function decorator that assumes the input/output relationship is deterministic.
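A minimal sketch of that idea as a decorator (assuming deterministic functions with hashable positional arguments; functools.lru_cache offers the same behavior with eviction built in):

import functools


def cached(fn):
    # memoize results keyed on the positional arguments, assuming
    # fn is deterministic and its arguments are hashable
    memo = {}

    @functools.wraps(fn)
    def wrapper(*args):
        if args not in memo:
            memo[args] = fn(*args)
        return memo[args]

    return wrapper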
torchaudio/datasets/datasets.py
Outdated
self._cache.append(file)

os.makedirs(self.location, exist_ok=True)
with open(file, "wb") as file:
Pickling and files are a relatively strong assumption; lots of things can't be pickled. In terms of caching things on disk for each input argument, you might end up with a lot of small files for something like images. If you want to write a read-only file cache (which is what this kind of is), you want to have a single file you append to. But random seek is not very good when it comes to files.
I'd say for now we stick to in-memory-only caching.
Or, if you want to serialize it, read out the entire generator into a list first and then save that, or append to an open file, etc.
For reference: SpooledTemporaryFile.
PyTorch also has torch.save for serialization.
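For instance (with a stand-in generator and a hypothetical file name):

import torch

# a stand-in for the generator being cached
generator = (i * i for i in range(10))

# read out the entire generator, then serialize the list in one go
items = list(generator)
torch.save(items, "cache.pt")

# later: load everything back with a single read
items = torch.load("cache.pt")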
torchaudio/datasets/legacy/vctk.py
Outdated
def read_audio(fp, downsample=True):
    if downsample:
I wonder if we can use our resample transform, or whether this is necessary for this dataset.
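For reference, a sketch of what that could look like (hypothetical file name; assuming the Resample transform's orig_freq/new_freq arguments):

import torchaudio

waveform, sample_rate = torchaudio.load("p225_001.wav")
# resample to 16 kHz with the existing transform instead of a
# dataset-specific downsampling path
resample = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
downsampled = resample(waveform)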
torchaudio/datasets/legacy/vctk.py
Outdated
return sig, sr


def load_txts(dir):
I think you could first run a "find" function to get a dictionary of file paths plus the corresponding text (in UTF-8 format).
Then somewhere you need a map between transcription files and audio files.
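A sketch of that first step (hypothetical helper), building a fileid-to-text dictionary that the audio files can then be matched against:

import os


def load_transcripts(root):
    # map each transcription file's id to its UTF-8 contents
    texts = {}
    for dirpath, _, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(".txt"):
                continue
            fileid = os.path.splitext(filename)[0]
            with open(os.path.join(dirpath, filename), encoding="utf-8") as f:
                texts[fileid] = f.read().strip()
    return texts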
This is part of the legacy datasets, which I simply moved into a legacy folder.
torchaudio/datasets/librispeech.py
Outdated
if not found:
    from warnings import warn

    warn("Translation not found for {}.".format(fileid))
Should this ever happen? If not, I'd not check for existence at all and just have the open function error out.
torchaudio/datasets/utils.py
Outdated
def check_integrity(fpath, md5=None):
    if not os.path.isfile(fpath):
        return False
    if md5 is None:
Is this actually desired?
For best efficiency (if that's a goal here), it might make sense to format the audio such that utterances are shuffled and packed into one or more tars (or something like it) so that sequential I/O can be used. Especially for these larger datasets, dealing with thousands of files and accessing them randomly over a network can become a significant challenge.
torchaudio/datasets/librispeech.py
Outdated
torchaudio.datasets.utils.download_url(_url, root)
torchaudio.datasets.utils.extract_archive(archive)

walker = torchaudio.datasets.utils.walk_files(
Not sure walking here is the right approach. This is unfortunately one of those situations where the "input" to the pipeline is a list of paths that are then randomized (sometimes based on some per-path metadata). We could let the user provide these paths as an argument as well, and reference the walk_files util as a means of retrieving them.
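A sketch of what that could look like (hypothetical constructor; assuming walk_files accepts a root and a suffix, as used in this PR):

import torchaudio


class LIBRISPEECH:
    def __init__(self, root, paths=None):
        # let the caller supply the (possibly pre-randomized) paths;
        # fall back to walking the extracted archive otherwise
        if paths is None:
            paths = torchaudio.datasets.utils.walk_files(root, suffix=".flac")
        self._paths = list(paths)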
torchaudio/datasets/utils.py
Outdated
def calculate_md5(fpath, chunk_size=1024 * 1024):
    md5 = hashlib.md5()
    with open(fpath, "rb") as f:
I assume this is a copy. In general this shouldn't be restricted to files, but should work on binary blobs; we might very easily want to provide this functionality for files downloaded on the fly.
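A sketch of the blob-friendly variant (hypothetical name), hashing any binary file-like object:

import hashlib


def calculate_md5_stream(stream, chunk_size=1024 * 1024):
    # works on anything with a read() method: open files,
    # io.BytesIO buffers, responses streamed from the web
    md5 = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        md5.update(chunk)
    return md5.hexdigest()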
return filename.endswith(".zip")


def extract_archive(from_path, to_path=None, overwrite=False):
This doesn't need to be restricted to paths.
torchaudio/datasets/utils.py
Outdated
if to_path is None:
    to_path = os.path.dirname(from_path)

if from_path.endswith((".tar.gz", ".tgz")):
It's possible to try...except opening something as a tar file. The tar library will read the first few bytes and then decide whether it's a tarfile or not. We shouldn't make this dependent on the extension.
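A sketch of that approach (hypothetical helper), letting the libraries sniff the format rather than trusting the extension:

import tarfile
import zipfile


def extract_any(from_path, to_path):
    # tarfile reads the first few bytes and raises ReadError
    # if the input isn't a tar archive; fall back to zip
    try:
        with tarfile.open(from_path) as tar:
            tar.extractall(to_path)
        return
    except tarfile.ReadError:
        pass
    with zipfile.ZipFile(from_path) as zf:
        zf.extractall(to_path)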
Good point.
torchaudio/datasets/utils.py
Outdated
)


class Cache:
I'd give this a more specific name, like "DiskCache".
Good point.
@ryanleary - I couldn't agree more that pipelining and chunking are necessary optimizations. For now we're aiming for decreased complexity and coverage for these new datasets without breaking backwards compatibility. After that we'll work on performance.
I've defined the interface close to the existing one, added a few deprecation warnings, and removed the "legacy" folder. @cpuhrsch, thoughts?
Looks great to me :)
Introducing a new dataset format based on generators, along with two new datasets: LibriSpeech and CommonVoice.
Related to