ValidateWavs takes too long on a large corpus. #466

joanise · 2024-06-13T20:55:27Z

Working with cml_tts_dataset_french_v0.1, which has 110k sentences/audio files, the ValidateWavsStep takes a surprisingly long time to run without providing any feedback to the user that it's doing anything. Noticed by @roedoejet while testing #464

Agreed with AP: we'll sample 100 wav files randomly to check.

The main use case for this feature was that you gave the parent or child directory of the intended directory, and that'll get caught with the first file.
A second use case is if you added some data to the file list and forgot to add it to your wavs dir. If you're missing >10% of the wav files, a 100 wav file sample will find at least one missing file with very high probability. If you're missing <1% of the wav files, we probably don't actually care: you'll just have a warning in the preprocessing logs, should you happen to look at them.
I'll run some tests, and increase 100 to something bigger if it's fast enough: my goal is that the validation delay should not be too noticeable, yet we're likely to catch problems we care about.
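
To make the sampling argument above concrete, here is a minimal sketch of the detection probability, assuming the sample is drawn uniformly without replacement. The function name `prob_detect_missing` and its parameters are hypothetical, chosen for illustration; this is not code from the project.

```python
def prob_detect_missing(total: int, missing: int, sample_size: int) -> float:
    """Probability that a uniform sample drawn without replacement
    contains at least one of the missing files (hypothetical helper)."""
    sample_size = min(sample_size, total)
    present = total - missing
    if sample_size > present:
        # The sample is larger than the number of present files,
        # so it must include at least one missing file.
        return 1.0
    # P(no missing file in the sample) is a product of successive draws
    # that all land on present files.
    p_none = 1.0
    for i in range(sample_size):
        p_none *= (present - i) / (total - i)
    return 1.0 - p_none


if __name__ == "__main__":
    corpus = 110_000  # corpus size mentioned in this issue
    for fraction, n in [(0.10, 100), (0.01, 100), (0.01, 1000)]:
        p = prob_detect_missing(corpus, int(corpus * fraction), n)
        print(f"missing={fraction:.0%} sample={n}: P(detect)={p:.5f}")
```

Under these assumptions, a sample of 100 is very likely to catch a corpus missing 10% of its files, is far less reliable when only 1% are missing, and a sample of 1000 closes most of that gap.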


For a large corpus, e.g., our 110k French sentence corpus, checking for the presence of all audio files takes a long time and is pointless. So check only a sample of 1000 when there are more than 1000. Fixes #466
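
The sampled check described in that commit message could look roughly like the sketch below. It is an illustration only, not the code merged in the fix: the function name `sample_missing_wavs`, its parameters, and the assumption that the file list holds wav file names relative to a single directory are all hypothetical.

```python
import random
from pathlib import Path
from typing import List, Optional, Sequence


def sample_missing_wavs(
    wav_dir: str,
    basenames: Sequence[str],
    max_checked: int = 1000,
    seed: Optional[int] = None,
) -> List[str]:
    """Return the sampled file names that do not exist in wav_dir.

    Hypothetical helper: checks every file when there are at most
    max_checked of them, otherwise a uniform random sample of that size.
    """
    names = list(basenames)
    if len(names) > max_checked:
        # Sample without replacement so no file is checked twice.
        names = random.Random(seed).sample(names, max_checked)
    root = Path(wav_dir)
    return [name for name in names if not (root / name).is_file()]
```

An empty result means no problem was found in the files that were checked. For a corpus larger than `max_checked` it does not guarantee that every file is present.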

joanise self-assigned this Jun 13, 2024

joanise mentioned this issue Jun 14, 2024

Dev.ej/466 sample wavs #471

Merged

joanise closed this as completed in cc1fd9a Jun 18, 2024
