SpeechFlow is an advanced speech-to-text API that offers exceptional accuracy for businesses of all sizes and industries. It transcribes audio and video content into text with high precision, making it an ideal solution for companies that need to quickly and accurately convert speech into text for purposes such as captioning, transcription, and analysis. With support for multiple languages and dialects, SpeechFlow is a versatile tool that caters to a wide range of businesses and industries.
Spoken Language Identification (LID) is the task of detecting the language spoken in an audio clip by an unknown speaker, regardless of the speaker's gender, age, or manner of speaking. It has numerous applications in speech recognition, multilingual machine translation, and speech-to-speech translation.
Our model currently supports 13 languages: English, Spanish, Italian, French, German, Portuguese, Russian, Turkish, Vietnamese, Indonesian, Chinese, Japanese, and Korean.
The model combines convolutional and recurrent neural networks and was trained on roughly two thousand hours of private speech data, approximately 150 hours of supervision per language.
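For illustration, the sketch below shows what a convolutional-plus-recurrent LID classifier of this general shape can look like in TensorFlow 2.4-style Keras. The layer sizes, input length, and function name here are hypothetical; the actual architecture used by this repository may differ.

import tensorflow as tf

NUM_LANGUAGES = 13  # the 13 supported languages listed above

def build_lid_model(num_frames=300, num_mels=80):
    """Hypothetical CNN+RNN language-identification classifier sketch."""
    inputs = tf.keras.Input(shape=(num_frames, num_mels, 1))
    # Convolutional front end: extract local time-frequency patterns.
    x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.MaxPooling2D((2, 2))(x)
    x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.MaxPooling2D((2, 2))(x)
    # Fold the frequency axis into features so the RNN sees a time sequence.
    x = tf.keras.layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
    # Recurrent back end: summarize the whole utterance into one vector.
    x = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(128))(x)
    outputs = tf.keras.layers.Dense(NUM_LANGUAGES, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)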
The figure below shows an accuracy (ACC) breakdown by language on the FLEURS test set using the pretrained model.
FLEURS dataset downloads can be found here: Downloads
The models are implemented in TensorFlow.
To use all of the functionality of the library, you should have:
tensorflow==2.4.1
tensorflow-gpu==2.4.1
tensorflow-addons==0.15.0
matplotlib==3.5.0
numpy==1.19.5
scikit-learn==1.0.1
librosa==0.8.1
SoundFile==0.10.3.post1
PyYAML==6.0
Download the codebase and open a terminal in the root directory. Make sure Python 3.7 is installed in the current environment, then execute:
pip install -r requirements.txt
The wav files have a 16 kHz sampling rate, a single channel, and 16-bit signed integer PCM encoding.
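You can quickly verify that a file meets this format with the SoundFile package from the requirements above; the helper below is just a convenience sketch.

import soundfile as sf

def check_audio_format(path):
    # Expect 16 kHz, mono, 16-bit signed integer PCM, as described above.
    info = sf.info(path)
    ok = (info.samplerate == 16000 and info.channels == 1
          and info.subtype == "PCM_16")
    print(path, info.samplerate, info.channels, info.subtype,
          "OK" if ok else "needs conversion")
    return ok

check_audio_format("test_audios/chinese.wav")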
As speech features, 80-dimensional log mel-filterbank outputs were computed from a 25 ms window every 10 ms. These features were then normalized to zero mean and unit variance over the training partition of the dataset.
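The snippet below sketches how such features can be computed with the pinned librosa version: at 16 kHz, a 25 ms window is 400 samples and a 10 ms hop is 160 samples. Note that it normalizes per utterance only to keep the example self-contained, whereas the description above normalizes over the training partition.

import librosa
import numpy as np

def log_mel_features(path, sr=16000, n_mels=80):
    # 80-dim log mel-filterbank features: 25 ms window, 10 ms hop.
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, win_length=400, hop_length=160, n_mels=n_mels)
    log_mel = np.log(mel + 1e-6)
    # Per-utterance normalization for brevity; the model described above
    # uses mean/variance statistics computed over the training set instead.
    log_mel = (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)
    return log_mel.T  # shape: (num_frames, 80)

features = log_mel_features("test_audios/chinese.wav")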
You must prepare your own data before training the model; refer to the 'data/demo_txt/demo_train.txt' file for the expected format.
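As a rough illustration, manifests of this kind are often lists of audio-path/label pairs. The reader below assumes one tab-separated pair per line; that layout is an assumption, so check it against the demo file before relying on it.

def read_manifest(path):
    # Hypothetical parser: assumes "<wav_path>\t<language_label>" per line;
    # verify against data/demo_txt/demo_train.txt before using.
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                wav_path, label = line.split("\t")
                pairs.append((wav_path, label))
    return pairs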
To get started, configure the 'configs/config.yml' file, then simply run this command in the console:
python train.py
This will train the Spoken_language_identification model on the data in 'data/demo_txt/demo_train.txt', store the model in the saved_weights folder, perform inference on 'demo_txt/demo_test.txt', print the inference results, and save the averaged accuracy to a text file.
A pretrained model is provided with this project; simply run this command:
python predict_by_pb.py test_audios/chinese.wav
or
python predict_by_weights.py test_audios/chinese.wav
Your audio must meet the audio format requirements above, as the provided chinese.wav does. If your audio file is not in wav format (e.g., mp3), you can convert it to wav with ffmpeg. Run the following command in your audio directory to convert it:
ffmpeg -i audio.mp3 -ab 256k -ar 16000 -ac 1 -f wav audio.wav
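To convert many files at once, you can wrap the same ffmpeg invocation in a short script (a sketch; adjust the input extension to match your files):

import pathlib
import subprocess

# Convert every .mp3 in the current directory to 16 kHz mono 16-bit wav,
# reusing the ffmpeg flags from the command above.
for mp3 in pathlib.Path(".").glob("*.mp3"):
    subprocess.run(
        ["ffmpeg", "-i", str(mp3), "-ab", "256k", "-ar", "16000",
         "-ac", "1", "-f", "wav", str(mp3.with_suffix(".wav"))],
        check=True)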
If you don't have ffmpeg installed, install it first:
sudo apt-get update
sudo apt-get install ffmpeg
Spoken_language_identification is released under the Apache License, version 2.0. The Apache license is a popular BSD-like license. Spoken_language_identification can be redistributed for free, even for commercial purposes, although you cannot remove the license headers (and under some circumstances, you may have to distribute a license document).