Introduction

The text_search project can be used to create ASR (automatic speech recognition) datasets from long-form audio recordings and even longer texts.

The core of text_search is a general audio alignment pipeline. It aligns audio files to their corresponding text, splits them into short segments, and excludes any audio segments that do not correspond exactly to the aligned text.
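The idea of keeping only the exactly matching stretches can be illustrated with a toy sketch. This is not the text_search algorithm or its API: it assumes the audio has already been transcribed into words, it uses difflib from the Python standard library as a stand-in matcher, and the function name align_words and the min_run parameter are hypothetical.

```python
from difflib import SequenceMatcher


def align_words(hyp_words, ref_words, min_run=3):
    """Return (hyp_start, ref_start, length) runs where both word lists agree exactly.

    Hypothetical helper for illustration only; runs shorter than min_run
    words are dropped, mimicking the exclusion of unreliable segments.
    """
    matcher = SequenceMatcher(a=hyp_words, b=ref_words, autojunk=False)
    return [
        (m.a, m.b, m.size)
        for m in matcher.get_matching_blocks()
        if m.size >= min_run
    ]


# Reference text and an assumed noisy transcript containing two filler words.
reference = "it was the best of times it was the worst of times".split()
hypothesis = "um it was the best of times uh it was the worst of times".split()

for hyp_start, ref_start, length in align_words(hypothesis, reference):
    # Each run is a candidate segment whose words match the text exactly.
    print(hyp_start, ref_start, " ".join(reference[ref_start:ref_start + length]))
```

The filler words "um" and "uh" fall outside every matching run, so they would be left out of the resulting segments. A real pipeline also has to map word positions back to audio timestamps, which this sketch does not attempt.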

Installation

With pip

pip install fasttextsearch

For developers

pip install numpy

git clone https://github.com/danpovey/text_search
cd text_search

mkdir build
cd build
cmake ..
make -j
make test

# set PYTHONPATH so that you can use "import textsearch"

export PYTHONPATH=$PWD/../textsearch/python:$PWD/lib:$PYTHONPATH

You can now check that textsearch is importable:

python3 -c "import textsearch; print(textsearch.__file__)"

Caution: This setup uses neither python3 setup.py install nor pip install. It only sets the PYTHONPATH environment variable, so textsearch is importable only in shells where that variable has been exported.
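Because the developer setup depends entirely on PYTHONPATH, it can help to confirm which file Python would actually load before importing anything. The following is a minimal sketch using only the standard library; the helper name find_module_file is hypothetical and is not part of textsearch.

```python
import importlib.util


def find_module_file(name):
    """Return the file a top-level module would be imported from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Prints a path inside the build tree when PYTHONPATH is set as described above,
# or None when the module cannot be found in the current environment.
print(find_module_file("textsearch"))
```

If this prints None, the PYTHONPATH export has most likely not been run in the current shell.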

Recipes

References

More explanations are available in the following paper:

@misc{kang2023libriheavy,
      title={Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context}, 
      author={Wei Kang and Xiaoyu Yang and Zengwei Yao and Fangjun Kuang and Yifan Yang and Liyong Guo and Long Lin and Daniel Povey},
      year={2023},
      eprint={2309.08105},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}
