Speed up DSAlign #38

Open
wants to merge 6 commits into
base: main

Commits on Sep 16, 2021

  1. Speed up DSAlign

    Add a unit test (dsalign_lib_test.py) to check that these speedups
    actually work.
    
    I added Cython as a dependency in order to make the Smith-Waterman
    function faster. "sw_align" now runs approximately 10x faster.
    "sw_align_old" is retained in case we ever want to check that the
    new function exactly matches the old output (I already checked that
    it does with the unit test).
    
    These are the current results from profiling dsalign_lib_test.py.
    Previously it took over 200 seconds to align this segment; it now
    takes around 120 seconds.
    
       Ordered by: internal time

             ncalls  tottime  percall  cumtime  percall filename:lineno(function)
              52003   48.730    0.001  111.226    0.002 text.py:184(similarity)
           69031918   39.739    0.000   57.229    0.000 utils.py:105(enweight)
           69215860   12.950    0.000   12.993    0.000 text.py:152(ngrams)
                670   10.514    0.016   10.523    0.016 search.py:49(sw_align)
           68719564    4.510    0.000    4.510    0.000 {built-in method builtins.abs}
           13218994    2.369    0.000    2.369    0.000 {built-in method builtins.min}
           30927766    2.250    0.000    2.250    0.000 __init__.py:570(__missing__)
                601    1.939    0.003   12.488    0.021 search.py:107(find_best)
              52003    0.422    0.000  111.648    0.002 dsalign_lib.py:196(<lambda>)
              52576    0.206    0.000    0.206    0.000 {built-in method builtins.sum}
             104678    0.187    0.000    0.269    0.000 __init__.py:550(__init__)
                  1    0.170    0.170    0.238    0.238 text.py:63(add_original_text)
    1278000/1277989    0.150    0.000    0.150    0.000 {built-in method builtins.len}
             312018    0.139    0.000    0.139    0.000 text.py:168(weighted_ngrams)
              52003    0.103    0.000  111.786    0.002 dsalign_lib.py:215(<lambda>)
    galv committed Sep 16, 2021
    885ef64
  2. Several alignment improvements

    Ignore [noise] and other silence-like words. Convert them to silence
    in the ctm file.
    
    Do basic text normalization with gruut.
    
    Split only on word boundaries in forced alignment.
    
    Disable (just by commenting out) the gap alignment stage of DSAlign. I
    have not found it helpful. It tends to include or exclude text that
    isn't part of the original audio.
    galv committed Sep 16, 2021
    8ca78bf
  3. Segment flac files Creation stages.

    Lots of changes in here that I did not do a good job of documenting. Sorry.
    galv committed Sep 16, 2021
    8ef3d04
  4. Rerun black.

    galv committed Sep 16, 2021
    b9263ff
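
The first Sep 16 commit describes speeding up the Smith-Waterman step ("sw_align") and profiling the unit test. As a rough illustration only, the sketch below shows a plain-Python Smith-Waterman local-alignment score and a cProfile report sorted by internal time, which is the ordering shown in that commit message. The names smith_waterman_score and profile_report, and the match/mismatch/gap values, are assumptions made for this example; this is not the project's sw_align.

```python
import cProfile
import io
import pstats


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local-alignment score between sequences a and b.

    Hypothetical reference implementation; scoring values are illustrative.
    """
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: scores are clamped at zero.
            curr[j] = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def profile_report(func, *args, top=5):
    """Run func(*args) under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()
    stream = io.StringIO()
    # "tottime" is the sort key that pstats labels "internal time".
    pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats(top)
    return result, stream.getvalue()


if __name__ == "__main__":
    score, report = profile_report(smith_waterman_score, "abcd" * 50, "abd" * 50)
    print(score)
    print(report)
```

A slow reference of this kind can be kept next to a faster port so that a unit test can compare the two outputs on the same inputs, which is the role the commit message gives to "sw_align_old".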

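The second Sep 16 commit (8ca78bf) mentions converting [noise] and other silence-like words to silence in the ctm file. A minimal sketch of that idea follows, assuming the usual CTM column order (recording id, channel, start, duration, word, optional confidence). The token set, the "<eps>" silence label, and the function name are hypothetical and are not taken from the PR.

```python
# Hypothetical set of tokens to treat as silence; the PR only names "[noise]".
SILENCE_LIKE = frozenset({"[noise]", "[laughter]"})
# Hypothetical label used for silence in the output.
SILENCE_TOKEN = "<eps>"


def silence_noise_in_ctm(lines, silence_like=SILENCE_LIKE, silence_token=SILENCE_TOKEN):
    """Replace silence-like words in CTM lines with a single silence label."""
    converted = []
    for line in lines:
        fields = line.split()
        # Field 4 (zero-based) is the word; shorter lines are passed through.
        if len(fields) >= 5 and fields[4].lower() in silence_like:
            fields[4] = silence_token
        converted.append(" ".join(fields))
    return converted


if __name__ == "__main__":
    example = [
        "utt1 1 0.00 0.50 hello",
        "utt1 1 0.50 0.30 [noise]",
        "utt1 1 0.80 0.40 world 0.98",
    ]
    for row in silence_noise_in_ctm(example):
        print(row)
```
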
Commits on Sep 19, 2021

  1. Sort by key name.

    Kaldi requires data in sorted order according to key. Keeping the tar
    file data sorted by key makes that easier to support (i.e., the HDD
    won't have to seek around as much).
    
    Remove " " from key names, since Kaldi doesn't support " " in key
    names.
    
    Add an option to write the output audio in whatever format you want.
    Right now, it's wav file format because Ceron encountered some issues
    with loading flac from tar files in NeMo.
    
    Comment out some bazel targets that we don't need right now (sorry...)
    galv committed Sep 19, 2021
    5ef6b7f
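
The Sep 19 commit describes keeping tar entries sorted by key and removing " " from key names. The sketch below shows one way this could look with Python's standard tarfile module. The function name write_sorted_tar and the in-memory dict input are assumptions for illustration and are not the PR's code. Plain string sorting is used here; whether that matches the ordering Kaldi expects for non-ASCII keys is not checked.

```python
import io
import tarfile


def write_sorted_tar(fileobj, items):
    """Write {key: bytes} into a tar archive in sorted key order.

    Spaces are removed from keys first. Hypothetical helper for illustration.
    """
    cleaned = {}
    for key, data in items.items():
        new_key = key.replace(" ", "")
        if new_key in cleaned:
            # Two keys that differ only by spaces would collide.
            raise ValueError(f"duplicate key after removing spaces: {new_key!r}")
        cleaned[new_key] = data

    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        for key in sorted(cleaned):
            info = tarfile.TarInfo(name=key)
            info.size = len(cleaned[key])
            tar.addfile(info, io.BytesIO(cleaned[key]))


if __name__ == "__main__":
    buf = io.BytesIO()
    write_sorted_tar(buf, {"utt b.wav": b"22", "utt a.wav": b"1"})
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r") as tar:
        print(tar.getnames())
```

Because members are appended in sorted order, a reader that streams the archive front to back sees the keys already sorted and does not need to seek.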

Commits on Sep 20, 2021

  1. fixup: Format Python code with Black

    autoblack committed Sep 20, 2021
    21b8878