Speed up DSAlign #38
Open
galv wants to merge 6 commits into main from daniel/speed-up-alignment
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Ciroye previously approved these changes on Jul 7, 2021
galv force-pushed the daniel/speed-up-alignment branch 2 times, most recently from ae4cdc4 to c4bf91a on September 14, 2021 21:18
Add a unit test (dsalign_lib_test.py) to check that these speed-ups actually work.

I add cython as a dependency in order to make the Smith-Waterman function faster. This makes the runtime of "sw_align" approximately 10x faster. "sw_align_old" is retained in case we ever want to check that the new function exactly matches the old output (I already checked that it does with the unit test).

These are the current results from profiling dsalign_lib_test.py. Previously we took over 200 seconds to align this segment, but now we are at around 120 seconds.

Ordered by: internal time

         ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
          52003   48.730    0.001  111.226    0.002  text.py:184(similarity)
       69031918   39.739    0.000   57.229    0.000  utils.py:105(enweight)
       69215860   12.950    0.000   12.993    0.000  text.py:152(ngrams)
            670   10.514    0.016   10.523    0.016  search.py:49(sw_align)
       68719564    4.510    0.000    4.510    0.000  {built-in method builtins.abs}
       13218994    2.369    0.000    2.369    0.000  {built-in method builtins.min}
       30927766    2.250    0.000    2.250    0.000  __init__.py:570(__missing__)
            601    1.939    0.003   12.488    0.021  search.py:107(find_best)
          52003    0.422    0.000  111.648    0.002  dsalign_lib.py:196(<lambda>)
          52576    0.206    0.000    0.206    0.000  {built-in method builtins.sum}
         104678    0.187    0.000    0.269    0.000  __init__.py:550(__init__)
              1    0.170    0.170    0.238    0.238  text.py:63(add_original_text)
1278000/1277989    0.150    0.000    0.150    0.000  {built-in method builtins.len}
         312018    0.139    0.000    0.139    0.000  text.py:168(weighted_ngrams)
          52003    0.103    0.000  111.786    0.002  dsalign_lib.py:215(<lambda>)
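For anyone who wants to produce a profile in the same layout, here is a minimal sketch using the standard-library cProfile and pstats modules. The helper name `profile_call` and the `toy_workload` function are hypothetical stand-ins, not code from this PR; in practice you would pass whatever alignment entry point dsalign_lib_test.py exercises.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report is sorted by "tottime", which pstats labels
    "Ordered by: internal time".
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("tottime").print_stats(top)
    return result, buffer.getvalue()


def toy_workload(n):
    # Hypothetical stand-in for the alignment call under test.
    return sum(abs(i - n // 2) for i in range(n))


if __name__ == "__main__":
    _, report = profile_call(toy_workload, 100_000)
    print(report)
```

Sorting by internal time rather than cumulative time puts the functions that burn CPU themselves at the top, which is how `similarity` and `enweight` stand out in the table above.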
Ignore [noise] and other silence-like words. Convert them to silence in the ctm file.
Do basic text normalization with gruut.
Split only on word boundaries in forced alignment.
Disable (just by commenting out) the gap alignment stage of DSAlign. I have not found it helpful. It tends to include or exclude text that isn't part of the original audio.
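The silence conversion could look roughly like the sketch below. It assumes the common CTM layout `<utterance> <channel> <start> <duration> <word> [<confidence>]`. The label set and the `<eps>` silence token are illustrative assumptions; the commit message only names `[noise]` explicitly and does not say which token is written out.

```python
# Hypothetical label set; only "[noise]" is named in the commit message.
SILENCE_LIKE = frozenset({"[noise]", "[laughter]", "[silence]"})
# Assumed placeholder for silence; the real token may differ.
SILENCE_TOKEN = "<eps>"


def normalize_ctm_line(line, silence_like=SILENCE_LIKE,
                       silence_token=SILENCE_TOKEN):
    """Replace the word field of one CTM line if it is silence-like."""
    fields = line.split()
    if len(fields) < 5:
        # Blank or malformed line: return it unchanged (minus the newline).
        return line.rstrip("\n")
    if fields[4].lower() in silence_like:
        fields[4] = silence_token
    return " ".join(fields)


def normalize_ctm(lines):
    """Apply normalize_ctm_line to every line of a CTM file."""
    return [normalize_ctm_line(line) for line in lines]
```

Timing fields are left untouched, so the silence entry still occupies the span the noise label covered.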
Lots of changes in here that I did not do a good job of documenting. Sorry.
galv force-pushed the daniel/speed-up-alignment branch from c4bf91a to b9263ff on September 16, 2021 17:50
Kaldi requires data in sorted order according to key. Keeping the tar file data sorted by key makes that easier to support (i.e., the HDD won't have to seek around as much).
Remove " " from key names, since Kaldi doesn't support " " in key names.
Add an option to output audio in whatever codec you want. Right now, it's the wav file format, because Ceron encountered some issues with loading flac from tar files in nemo.
Comment out some bazel targets that we don't need right now (sorry...)
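As an illustration of the key handling described above, here is a small sketch using the standard-library tarfile module. The function names and the in-memory `{key: bytes}` input are assumptions for the example, not the PR's actual code. Python's default string sort orders by code point, which matches the byte-wise (C locale) ordering Kaldi expects for UTF-8 keys.

```python
import io
import tarfile


def sanitize_key(key):
    # Kaldi keys cannot contain spaces, so drop them.
    return key.replace(" ", "")


def write_sorted_tar(fileobj, items):
    """Write a {key: bytes} mapping to a tar stream, members sorted by key."""
    cleaned = {sanitize_key(key): payload for key, payload in items.items()}
    if len(cleaned) != len(items):
        # Two distinct keys became identical once spaces were removed.
        raise ValueError("key collision after removing spaces")
    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        for key in sorted(cleaned):
            payload = cleaned[key]
            info = tarfile.TarInfo(name=key)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


def list_keys(fileobj):
    """Return member names in the order they are stored in the archive."""
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        return tar.getnames()
```

Because members are written in key order, a reader that walks the archive sequentially sees the keys already sorted and never has to seek backwards.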