Speed up DSAlign #38
Open
galv wants to merge 6 commits into main from daniel/speed-up-alignment
Commits on Sep 16, 2021
Add a unit test (dsalign_lib_test.py) to check that these speed-ups actually work.

I add cython as a dependency in order to make the smithwaterman function faster. This makes "sw_align" approximately 10x faster. "sw_align_old" is retained in case we ever want to check that the new function exactly matches the old output (I already checked with the unit test that it does).

These are the current results from profiling dsalign_lib_test.py. Previously we took over 200 seconds to align this segment, but now we are at around 120 seconds.

    Ordered by: internal time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        52003   48.730    0.001  111.226    0.002 text.py:184(similarity)
     69031918   39.739    0.000   57.229    0.000 utils.py:105(enweight)
     69215860   12.950    0.000   12.993    0.000 text.py:152(ngrams)
          670   10.514    0.016   10.523    0.016 search.py:49(sw_align)
     68719564    4.510    0.000    4.510    0.000 {built-in method builtins.abs}
     13218994    2.369    0.000    2.369    0.000 {built-in method builtins.min}
     30927766    2.250    0.000    2.250    0.000 __init__.py:570(__missing__)
          601    1.939    0.003   12.488    0.021 search.py:107(find_best)
        52003    0.422    0.000  111.648    0.002 dsalign_lib.py:196(<lambda>)
        52576    0.206    0.000    0.206    0.000 {built-in method builtins.sum}
       104678    0.187    0.000    0.269    0.000 __init__.py:550(__init__)
            1    0.170    0.170    0.238    0.238 text.py:63(add_original_text)
    1278000/1277989    0.150    0.000    0.150    0.000 {built-in method builtins.len}
       312018    0.139    0.000    0.139    0.000 text.py:168(weighted_ngrams)
        52003    0.103    0.000  111.786    0.002 dsalign_lib.py:215(<lambda>)
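For readers who want to reproduce this kind of check, here is a minimal sketch of the pattern the commit message describes: keep a reference implementation, compare an optimized one against it, and profile with cProfile sorted by internal time. It is not the code from this PR. The function names, the scoring parameters, and the two-row variant (standing in for the Cython version) are all assumptions made for illustration.

```python
import cProfile
import io
import pstats


def sw_score_reference(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score using a full DP matrix.

    The scoring values are illustrative defaults, not those used in the PR.
    """
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best


def sw_score_fast(a, b, match=2, mismatch=-1, gap=-1):
    """Same recurrence, keeping only the previous row.

    Hypothetical stand-in for an optimized implementation.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            if score > best:
                best = score
        prev = cur
    return best


def profile_report(func, *args):
    """Run func under cProfile and return the report sorted by internal time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("tottime").print_stats(10)
    return buf.getvalue()
```

Sorting by `tottime` is what produces the "Ordered by: internal time" header seen in the profile above; it ranks functions by time spent in their own bodies, which is why `similarity` and `enweight` show up ahead of `sw_align`.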
Commit 885ef64
Several alignment improvements
Ignore [noise] and other silence-like words. Convert them to silence in the ctm file.
Do basic text normalization with gruut.
Split only on word boundaries in forced alignment.
Disable (just by commenting out) the gap alignment stage of DSAlign. I have not found it helpful. It tends to include or exclude text that isn't part of the original audio.
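As a rough illustration of the first item, the sketch below rewrites noise-like words in CTM lines to a silence symbol. It assumes the usual CTM field order (utterance, channel, start, duration, word, optional confidence). The token set, the silence symbol, and the function name are hypothetical and are not taken from this PR.

```python
# Hypothetical set of noise-like tokens; the PR's actual list is not shown.
NOISE_TOKENS = {"[noise]", "[laughter]", "<unk>"}
# Hypothetical silence symbol; the real one depends on the lexicon in use.
SILENCE = "<eps>"


def silence_noise_in_ctm(lines):
    """Return CTM lines with noise-like words replaced by the silence symbol."""
    out = []
    for line in lines:
        # Leave CTM comment lines (starting with ";;") untouched.
        if line.lstrip().startswith(";;"):
            out.append(line)
            continue
        fields = line.split()
        # Field 5 (index 4) is the word; anything after it is kept as-is.
        if len(fields) >= 5 and fields[4].lower() in NOISE_TOKENS:
            fields[4] = SILENCE
        out.append(" ".join(fields))
    return out
```

Timing fields are preserved, so the silenced span still occupies its original position in the alignment.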
Commit 8ca78bf
Segment flac files Creation stages.
Lots of changes in here that I did not do a good job of documenting. Sorry.
Commit 8ef3d04
Commit b9263ff
Commits on Sep 19, 2021
Kaldi requires data in sorted order according to key. Keeping the tar file data sorted by key makes that easier to support (i.e., the HDD won't have to seek around as much).
Remove " " from key names, since Kaldi doesn't support " " in key names.
Add an option to write the audio in whatever codec you want. Right now, it's wav file format because Ceron encountered some issues with loading flac from tar files in nemo.
Comment out some bazel targets that we don't need right now (sorry...)
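The key handling described above can be sketched roughly as follows, using only the standard `tarfile` module. This is an assumption-laden illustration rather than the PR's code: the function names, the in-memory `{key: bytes}` input, and the default `wav` extension are hypothetical. Keys are sorted on their UTF-8 bytes to approximate the C-locale ordering that Kaldi tools expect.

```python
import io
import tarfile


def clean_key(key):
    """Drop spaces, which the commit message says Kaldi key names cannot contain."""
    return key.replace(" ", "")


def write_sorted_tar(fileobj, items, extension="wav"):
    """Write {key: bytes} to a tar stream with members ordered by cleaned key.

    Returns the member names in the order they were written.
    """
    cleaned = {}
    for key, payload in items.items():
        new_key = clean_key(key)
        # Removing spaces can make two distinct keys identical; fail loudly.
        if new_key in cleaned:
            raise ValueError(f"key collision after removing spaces: {new_key!r}")
        cleaned[new_key] = payload

    # Byte-wise ordering, comparable to sorting with LC_ALL=C.
    ordered = sorted(cleaned, key=lambda k: k.encode("utf-8"))
    names = []
    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        for key in ordered:
            info = tarfile.TarInfo(name=f"{key}.{extension}")
            info.size = len(cleaned[key])
            tar.addfile(info, io.BytesIO(cleaned[key]))
            names.append(info.name)
    return names
```

Writing members in key order means a reader that consumes keys in sorted order can stream the archive front to back instead of seeking.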
Commit 5ef6b7f
Commits on Sep 20, 2021
fixup: Format Python code with Black
autoblack committed Sep 20, 2021
Commit 21b8878