ValidateWavs takes too long on a large corpus. #466

Closed
joanise opened this issue Jun 13, 2024 · 0 comments · Fixed by #471

joanise commented Jun 13, 2024

Working with cml_tts_dataset_french_v0.1, which has 110k sentences/audio files, the ValidateWavsStep takes a surprisingly long time to run and gives the user no feedback that it is doing anything. Noticed by @roedoejet while testing #464.

Agreed with AP: we'll sample 100 wav files randomly to check.

  • The main use case for this feature is when you gave the parent or a child directory of the intended directory, and that will get caught with the first file.
  • A second use case is when you added some data to the file list and forgot to add it to your wavs dir. If you're missing >10% of the wav files, a 100 wav file sample will find at least one missing file with very high probability. If you're missing <1% of the wav files, we probably don't actually care: you'll just have a warning in the preprocessing logs, should you happen to look at them.
  • I'll run some tests, and increase 100 to something bigger if it's fast enough: my goal is that the validation delay should not be too noticeable, yet we're likely to catch problems we care about.
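
To put numbers on "very high probability", here is a minimal sketch of the detection odds, assuming the sample is drawn uniformly at random without replacement. The function name `prob_detect_missing` is hypothetical and not part of the project. With 10% of 110k files missing, a 100-file sample misses every gap with probability at most 0.9^100, so detection is above 99.99%; with 1% missing, a 100-file sample catches a gap only about two times in three.

```python
def prob_detect_missing(total: int, missing: int, sample_size: int) -> float:
    """Chance that a random sample (without replacement) of `sample_size`
    files out of `total` includes at least one of the `missing` files."""
    sample_size = min(sample_size, total)
    if missing <= 0 or sample_size <= 0:
        return 0.0
    if sample_size > total - missing:
        # More draws than present files: a missing one must be hit.
        return 1.0
    # P(every sampled file is present) as a running product of draw odds.
    p_all_present = 1.0
    for i in range(sample_size):
        p_all_present *= (total - missing - i) / (total - i)
    return 1.0 - p_all_present


if __name__ == "__main__":
    # Illustrative corpus size taken from the issue (110k files).
    for missing_fraction in (0.10, 0.01):
        for n in (100, 1000):
            p = prob_detect_missing(110_000, int(110_000 * missing_fraction), n)
            print(f"missing={missing_fraction:.0%} sample={n}: P(detect)={p:.5f}")
```

Raising the sample size from 100 to 1000 is what makes the 1% case likely to be caught as well.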
@joanise joanise self-assigned this Jun 13, 2024
joanise added a commit that referenced this issue Jun 14, 2024
For a large corpus, e.g., our 110k French sentence corpus, checking for the
presence of all audio files takes a long time and is pointless. So check only a
sample of 1000 when there are more than 1000.

Fixes #466
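
The approach in the commit message can be sketched roughly as follows. This is not the code from the actual commit: the function name `find_missing_wavs`, its parameters, and the assumption that each entry maps to `<basename>.wav` directly inside the wavs directory are all hypothetical, chosen for illustration.

```python
import random
from pathlib import Path
from typing import Iterable, List, Optional


def find_missing_wavs(
    basenames: Iterable[str],
    wavs_dir: str,
    max_checks: int = 1000,
    seed: Optional[int] = None,
) -> List[str]:
    """Return the basenames whose .wav file is absent from wavs_dir.

    When there are more than `max_checks` entries, only a random sample of
    `max_checks` of them is checked, so the cost stays bounded on a large
    corpus. Assumes (hypothetically) files are named '<basename>.wav'.
    """
    names = list(basenames)
    if len(names) > max_checks:
        # Uniform sample without replacement; seed only for reproducibility.
        names = random.Random(seed).sample(names, max_checks)
    root = Path(wavs_dir)
    return [name for name in names if not (root / f"{name}.wav").is_file()]
```

A caller could report the fraction of sampled files that are missing and ask the user to confirm the wavs directory if any are.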