Speed up DSAlign #38

Add a unit test (dsalign_lib_test.py) to check that these speed-ups actually work. Add Cython as a dependency in order to make the Smith-Waterman function faster. This makes "sw_align" approximately 10x faster. "sw_align_old" is retained in case we ever want to check that the new function exactly matches the old output (I already checked that it does with the unit test).

These are the current results from profiling dsalign_lib_test.py. Previously we took over 200 seconds to align this segment, but now we are at around 120 seconds.

Ordered by: internal time

    ncalls            tottime  percall  cumtime  percall  filename:lineno(function)
    52003              48.730    0.001  111.226    0.002  text.py:184(similarity)
    69031918           39.739    0.000   57.229    0.000  utils.py:105(enweight)
    69215860           12.950    0.000   12.993    0.000  text.py:152(ngrams)
    670                10.514    0.016   10.523    0.016  search.py:49(sw_align)
    68719564            4.510    0.000    4.510    0.000  {built-in method builtins.abs}
    13218994            2.369    0.000    2.369    0.000  {built-in method builtins.min}
    30927766            2.250    0.000    2.250    0.000  __init__.py:570(__missing__)
    601                 1.939    0.003   12.488    0.021  search.py:107(find_best)
    52003               0.422    0.000  111.648    0.002  dsalign_lib.py:196(<lambda>)
    52576               0.206    0.000    0.206    0.000  {built-in method builtins.sum}
    104678              0.187    0.000    0.269    0.000  __init__.py:550(__init__)
    1                   0.170    0.170    0.238    0.238  text.py:63(add_original_text)
    1278000/1277989     0.150    0.000    0.150    0.000  {built-in method builtins.len}
    312018              0.139    0.000    0.139    0.000  text.py:168(weighted_ngrams)
    52003               0.103    0.000  111.786    0.002  dsalign_lib.py:215(<lambda>)
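The check described above (old and new alignment functions agree, then profile by internal time) can be sketched roughly as follows. This is an illustration only, not the code in this PR: `sw_score_old` and `sw_score_new` are hypothetical stand-ins that return just the best local alignment score, the scoring parameters are assumed, and the "new" variant is a plain-Python two-row rewrite rather than the Cython implementation.

```python
import cProfile
import io
import pstats
import random


def sw_score_old(a, b, match=2, mismatch=-1, gap=-1):
    """Reference Smith-Waterman local alignment score (full DP matrix)."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best


def sw_score_new(a, b, match=2, mismatch=-1, gap=-1):
    """Same recurrence, but only the previous and current rows are kept."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            if score > best:
                best = score
        prev = cur
    return best


def check_equivalence(trials=200, seed=0):
    """Compare both implementations on random strings; return the trial count."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = "".join(rng.choice("abcd") for _ in range(rng.randint(0, 30)))
        b = "".join(rng.choice("abcd") for _ in range(rng.randint(0, 30)))
        assert sw_score_old(a, b) == sw_score_new(a, b), (a, b)
    return trials


def profile_by_internal_time(func, *args):
    """Run func under cProfile and return a report sorted by tottime."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("tottime").print_stats(10)
    return out.getvalue()
```

Sorting by `"tottime"` is what produces the "Ordered by: internal time" header seen in the profile above; calling `print(profile_by_internal_time(check_equivalence, 50))` would print a similar table for this toy example.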

Ignore [noise] and other silence-like words. Convert them to silence in the ctm file. Do basic text normalization with gruut. Split only on word boundaries in forced alignment. Disable the gap alignment stage of DSAlign (just by commenting it out). I have not found it helpful: it tends to include or exclude text that isn't part of the original audio.
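As a rough illustration of the ctm conversion mentioned above, the sketch below rewrites silence-like words in CTM lines. It assumes the usual CTM field order (recording id, channel, start, duration, word, optional confidence); the set of noise words and the `<eps>` silence token are placeholders chosen for the example, not values taken from this PR.

```python
# Hypothetical set of silence-like tokens; the real list used here may differ.
SILENCE_LIKE = frozenset({"[noise]", "[laughter]", "[silence]"})


def silence_noise_in_ctm(lines, silence_token="<eps>", noise_words=SILENCE_LIKE):
    """Replace silence-like words in CTM lines with a single silence token.

    Each CTM line is assumed to look like:
        <recording-id> <channel> <start> <duration> <word> [<confidence>]
    Lines with fewer than five fields are passed through unchanged
    (apart from whitespace normalization).
    """
    rewritten = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 5 and fields[4].lower() in noise_words:
            fields[4] = silence_token
        rewritten.append(" ".join(fields))
    return rewritten
```

For example, `silence_noise_in_ctm(["utt1 1 0.50 0.20 [noise]"])` leaves the timing fields alone and only swaps the word field for the silence token.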

Lots of changes in here that I did not do a good job of documenting. Sorry.

Kaldi requires data to be sorted by key. Keeping the tar file data sorted by key makes that easier to support (i.e., the HDD won't have to seek around as much). Remove " " from key names, since Kaldi doesn't support " " in key names. Add an option to output audio in whatever codec you want. Right now, it's WAV, because Ceron encountered some issues with loading FLAC from tar files in NeMo. Comment out some Bazel targets that we don't need right now (sorry...).
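A minimal sketch of the key handling described above, using only the standard `tarfile` module. The function names, the in-memory `samples` mapping, and the `.wav` suffix are assumptions for illustration; the actual pipeline in this PR may be organized quite differently. Python's default string sort is by code point, which should line up with the C-locale ordering Kaldi tools expect.

```python
import io
import tarfile


def sanitize_key(key):
    """Drop spaces from a key, since Kaldi keys cannot contain them."""
    return key.replace(" ", "")


def write_sorted_tar(fileobj, samples, suffix=".wav"):
    """Write {key: bytes} to a tar archive with members in sorted key order.

    Returns the member names in the order they were written.
    """
    cleaned = {}
    for key, payload in samples.items():
        new_key = sanitize_key(key)
        if new_key in cleaned:
            # Removing spaces can make two distinct keys collide.
            raise ValueError(f"duplicate key after sanitizing: {new_key!r}")
        cleaned[new_key] = payload

    names = []
    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        for key in sorted(cleaned):
            info = tarfile.TarInfo(name=key + suffix)
            info.size = len(cleaned[key])
            tar.addfile(info, io.BytesIO(cleaned[key]))
            names.append(info.name)
    return names
```

Writing members in sorted order means a reader can stream the archive sequentially and still see keys in the order Kaldi needs, without a separate sort pass.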

Commits on Sep 20, 2021

fixup: Format Python code with Black

autoblack committed Sep 20, 2021


Commit 21b8878


Commits on Sep 16, 2021

Commits on Sep 19, 2021
