Commit
Spellchecking ASR customization model (#6179)
* bug fixes Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>
* fix bugs, add preparation and evaluation scripts, add readme Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>
* small fixes Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* add real coverage calculation, small fixes, more debug information Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* add option to pass a filelist and output folder - to handle inference from multiple input files Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* added preprocessing for yago wikipedia articles - finding yago entities and their subphrases Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* yago wiki preprocessing, sampling, pseudonormalization Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* more scripts for preparation of training examples Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* bug fixes Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* add some alphabet checks Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* add bert on subwords, concatenate it to bert on characters Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* add calculation of character_pos_to_subword_pos Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* bug fix Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* bug fix Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* pdb Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* tensor join bug fix Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* double hidden_size in classifier Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* pdb Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* default index value 0 instead of -1 because index cannot be negative Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* pad index value 0 instead of -1 because index cannot be negative Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* remove pdb Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* fix bugs, add creation of tarred dataset Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* add possibility to change sequence len at inference Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* change sampling of dummy candidates at inference, add candidate info file Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* fix import Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* fix bug Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* update transcription now uses info Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* write path Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* 1. add tarred dataset support(untested). 2. fix bug with ban_ngrams in indexing Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* skip short_sent if no real candidates Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* fix import Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* add braceexpand Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* fixes Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* fix bug Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* fix bug Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* fix bug in np.ones Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* fix bug in collate Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* change tensor type to long because of error in torch.gather Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* fix for empty spans tensor Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* same fixes in _collate_fn for tarred dataset Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* fix bug from previous commit Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* change int types to be shorter to minimize tar size Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* refactoring of datasets and inference Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* bug fix Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* bug fix Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* bug fix Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* tar by 100k examples, small fixes Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* small fixes, add analytics script Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* Add functions for dynamic programming comparison to get best path by ngrams Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* fixes Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* small fix Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* fixes to support testing on SPGISpeech Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* add preprocessing for userlibri Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* some refactoring Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* some refactoring Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* move some functions to utils to reuse from other project Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* move some functions to utils to reuse from other project Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* move some functions to utils to reuse from other project Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* small refactoring before pr. Add bash-scripts reproducing evaluation Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* style fix Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* small fixes in inference Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* bug fix - didn't move window on last symbol Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix bug - shuffle was before truncation of sorted candidates Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* refactoring, fix some bugs Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* various fixes. Add word_indices at inference Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* add candidate positions Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Move data preparation and evaluation to other repo Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* add infer_reproduce_paper. Refactoring Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* refactor inference using fragment indices Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* add some helper functions Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix bug with parameters order Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix bugs Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* refactoring, fix bug Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* add multiple variants of adjusting start/end positions Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* more fixes Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* add unit tests, other fixes Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* fix CodeQL warnings Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* add script for full inference pipeline, refactoring Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* add tutorial Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* take example data from HuggingFace Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* add docs Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* fix comment Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* fix bug Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* small fixes for PR Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* add some more tests Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* try to fix tests adding with_downloads Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
* skip tests with tokenizer download Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
---------
Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>
Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
Co-authored-by: Alexandra Antonova <aleksandraa@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>