Speed up DSAlign #38

Open: wants to merge 6 commits into main

galv (Collaborator) commented Jul 6, 2021

Add a unit test (dsalign_lib_test.py) to check that these speedups
actually work.

Add Cython as a dependency to make the Smith-Waterman function
faster. This makes "sw_align" run approximately 10x faster.
"sw_align_old" is retained in case we ever want to check that the new
function exactly matches the old output (I already checked with the
unit test that it does).
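For reference, the kind of old-versus-new check described above could look roughly like the sketch below. This is not the code in this PR: `sw_score_reference` and `sw_score_rolling` are hypothetical stand-ins for "sw_align_old" and the Cython "sw_align", they return only the best local-alignment score, and the match/mismatch/gap values are assumptions.

```python
import random


def sw_score_reference(a, b, match=2, mismatch=-1, gap=-1):
    """Best Smith-Waterman local-alignment score, using the full matrix."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best


def sw_score_rolling(a, b, match=2, mismatch=-1, gap=-1):
    """Same recurrence, but only the previous row is kept in memory."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ch in a:
        cur = [0]
        for j, other in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch == other else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


def outputs_match(trials=200, seed=0):
    """Compare both implementations on random short strings."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = "".join(rng.choice("abc") for _ in range(rng.randint(0, 12)))
        b = "".join(rng.choice("abc") for _ in range(rng.randint(0, 12)))
        if sw_score_reference(a, b) != sw_score_rolling(a, b):
            return False
    return True
```

Keeping the slow implementation around makes this kind of randomized comparison cheap to rerun whenever the fast path changes.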

These are the current results from profiling dsalign_lib_test.py.
Previously we took over 200 seconds to align this segment; now we are
at around 120 seconds.

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    52003   48.730    0.001  111.226    0.002 text.py:184(similarity)
 69031918   39.739    0.000   57.229    0.000 utils.py:105(enweight)
 69215860   12.950    0.000   12.993    0.000 text.py:152(ngrams)
      670   10.514    0.016   10.523    0.016 search.py:49(sw_align)
 68719564    4.510    0.000    4.510    0.000 {built-in method builtins.abs}
 13218994    2.369    0.000    2.369    0.000 {built-in method builtins.min}
 30927766    2.250    0.000    2.250    0.000 __init__.py:570(__missing__)
      601    1.939    0.003   12.488    0.021 search.py:107(find_best)
    52003    0.422    0.000  111.648    0.002 dsalign_lib.py:196(<lambda>)
    52576    0.206    0.000    0.206    0.000 {built-in method builtins.sum}
   104678    0.187    0.000    0.269    0.000 __init__.py:550(__init__)
        1    0.170    0.170    0.238    0.238 text.py:63(add_original_text)
1278000/1277989    0.150    0.000    0.150    0.000 {built-in method builtins.len}
   312018    0.139    0.000    0.139    0.000 text.py:168(weighted_ngrams)
    52003    0.103    0.000  111.786    0.002 dsalign_lib.py:215(<lambda>)
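
A listing like the one above, sorted by internal time, can be produced with the standard library's cProfile and pstats modules. The sketch below is only an illustration: `profile_by_internal_time` and `toy_workload` are hypothetical names, and the exact command used for this PR is not shown here.

```python
import cProfile
import io
import pstats


def profile_by_internal_time(func, *args, limit=15, **kwargs):
    """Run func under cProfile and return (result, report sorted by tottime)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # "tottime" is time spent inside a function itself, excluding callees;
    # pstats labels this ordering "internal time" in its header.
    stats.sort_stats("tottime").print_stats(limit)
    return result, stream.getvalue()


def toy_workload(n):
    """Hypothetical stand-in for the alignment call being profiled."""
    return sum(abs(i - n // 2) for i in range(n))


if __name__ == "__main__":
    _, report = profile_by_internal_time(toy_workload, 100_000)
    print(report)
```

Sorting by "tottime" rather than "cumtime" surfaces the functions whose own bodies are hot, which is how similarity and enweight show up at the top.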

github-actions bot commented Jul 6, 2021

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

Ciroye previously approved these changes Jul 7, 2021
@galv galv force-pushed the daniel/speed-up-alignment branch 2 times, most recently from ae4cdc4 to c4bf91a Compare September 14, 2021 21:18
Ignore [noise] and other silence-like words. Convert them to silence
in the ctm file.
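
A minimal sketch of that conversion, assuming the usual CTM row layout (utterance, channel, start, duration, word, optional confidence). The `SILENCE_LIKE` set and the `<eps>` silence token are assumptions for illustration, not the values used in this PR.

```python
# Hypothetical set of silence-like tokens; the real list is not given here.
SILENCE_LIKE = frozenset({"[noise]", "[laughter]", "[silence]"})


def silence_noise_in_ctm(lines, silence_like=SILENCE_LIKE, silence_token="<eps>"):
    """Replace silence-like words in CTM rows with a single silence token."""
    out = []
    for line in lines:
        fields = line.split()
        # CTM rows: utterance, channel, start, duration, word[, confidence]
        if len(fields) >= 5 and fields[4].lower() in silence_like:
            fields[4] = silence_token
        out.append(" ".join(fields))
    return out
```

Timing fields are left untouched, so the silence entry still occupies the span where the noise token was aligned.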

Do basic text normalization with gruut

Split only on word boundaries in forced alignment.

Disable (by commenting it out) the gap alignment stage of DSAlign. I
have not found it helpful. It tends to include or exclude text that
isn't part of the original audio.
Lots of changes in here that I did not do a good job of documenting. Sorry.
galv and others added 2 commits September 19, 2021 20:35
Kaldi requires data in sorted order according to key. Keeping the tar
file data sorted by key makes that easier to support (i.e., the HDD
won't have to seek around as much).

Remove " " from key names, since Kaldi doesn't support " " in key
names.
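
The two points above (sorted keys, no spaces) can be sketched with the standard tarfile module. `clean_key` and `write_sorted_tar` are hypothetical helpers for illustration, not functions from this PR, and they assume every entry fits in memory as a `(name, bytes)` pair.

```python
import io
import tarfile


def clean_key(name):
    """Drop spaces from a key name."""
    return name.replace(" ", "")


def write_sorted_tar(entries, fileobj):
    """Write (name, bytes) pairs to a tar stream in sorted key order."""
    cleaned = [(clean_key(name), data) for name, data in entries]
    # Python compares strings by code point, which for ASCII keys matches
    # byte-wise ordering.
    cleaned.sort(key=lambda entry: entry[0])
    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        for name, data in cleaned:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
```

Keys are cleaned before sorting so that the order on disk reflects the names a reader will actually see.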

Add an option to output audio in whatever codec you want. Right now
the output is in WAV format, because Ceron encountered some issues
loading FLAC from tar files in NeMo.

Comment out some bazel targets that we don't need right now (sorry...)