Code for ICSME 2020 "CrossASR: Efficient Differential Testing of Automatic Speech Recognition via Text-To-Speech" by Muhammad Hilmi Asyrofi, Ferdian Thung, David Lo, and Lingxiao Jiang
Automatic speech recognition (ASR) systems are a ubiquitous part of modern life. They can be found in our smartphones, desktops, and smart home systems. To ensure that they recognize speech correctly, ASR systems need to be tested. Testing an ASR system requires test cases in the form of audio files and their transcribed texts. Building these test cases manually, however, is tedious and time-consuming.
To address this challenge, we propose CrossASR, an approach that capitalizes on existing Text-To-Speech (TTS) systems to automatically generate test cases for ASR systems. CrossASR is a differential testing solution that compares the outputs of multiple ASR systems to uncover erroneous behaviors among ASRs. CrossASR aims to uncover failures with as few generated tests as possible; it does so by employing a failure probability predictor to pick the texts with the highest likelihood of leading to failed test cases. As a black-box approach, CrossASR can generate test cases for any ASR, including when the ASR model is not available (e.g., when evaluating the reliability of various third-party ASR services).
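The differential comparison described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the labelling rule (a case is indeterminable when no ASR reproduces the text, otherwise each ASR either succeeds or fails) is an assumption based on the three outcome names this README uses, and `normalize` and `classify_test_case` are hypothetical helper names.

```python
from typing import Dict


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace.

    This is a simplification for illustration only.
    """
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def classify_test_case(text: str, transcriptions: Dict[str, str]) -> Dict[str, str]:
    """Label each ASR's result for one synthesized text.

    Assumed rule: if no ASR reproduces the text, the synthesized audio itself
    may be at fault, so the case is "indeterminable" for every ASR. Otherwise
    an ASR whose output matches is "success" and any other ASR is "failed".
    """
    expected = normalize(text)
    matches = {asr: normalize(out) == expected for asr, out in transcriptions.items()}
    if not any(matches.values()):
        return {asr: "indeterminable" for asr in transcriptions}
    return {asr: "success" if ok else "failed" for asr, ok in matches.items()}


# Made-up transcriptions for illustration.
outputs = {
    "deepspeech": "hello world google",
    "wit": "Hello world, Google!",
    "wav2letter": "hello word google",
}
labels = classify_test_case("hello world google", outputs)
print(labels)
```

Because at least one ASR gets the text right in this example, the mismatch from the third ASR can be attributed to that ASR rather than to the TTS audio.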
We evaluated the performance of CrossASR on 20,000 English texts (i.e., sentences) from the Europarl corpus. We used 4 TTSes (i.e., Google, ResponsiveVoice, Festival, and Espeak) and 4 ASRs (i.e., Deepspeech, Deepspeech2, wav2letter++, and wit). We used more than one TTS to avoid bias from any particular TTS.
sudo apt update
sudo apt install python3-dev python3-pip python3-venv
Create a new virtual environment by choosing a Python interpreter and making a ./env directory to hold it:
python3 -m venv --system-site-packages ~/./env
Activate the virtual environment using a shell-specific command:
source ~/./env/bin/activate # sh, bash, or zsh
We use gTTS (Google Text-to-Speech), a Python library and CLI tool that interfaces with Google Translate's text-to-speech API.
pip install gTTS
if [ ! -d "audio/" ]; then
  mkdir audio
fi
mkdir audio/google/
gtts-cli 'hello world google' --output audio/google/hello.mp3
ffmpeg -i audio/google/hello.mp3 -acodec pcm_s16le -ac 1 -ar 16000 audio/google/hello.wav -y
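The ffmpeg options above (`-acodec pcm_s16le -ac 1 -ar 16000`) request 16-bit PCM, one channel, and a 16 kHz sample rate. A small check like the one below, using only Python's standard `wave` module, can confirm that a converted file has that format before it is passed to an ASR. The helper name `is_16khz_mono_pcm16` is hypothetical, and the demo runs on a generated silent file rather than on real TTS output.

```python
import os
import tempfile
import wave


def is_16khz_mono_pcm16(path: str) -> bool:
    """Return True if the WAV file is mono, 16-bit, and sampled at 16 kHz."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == 1
            and wav.getsampwidth() == 2  # 2 bytes per sample = 16-bit PCM
            and wav.getframerate() == 16000
        )


# Demo on a generated file: one second of silence in the target format.
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

print(is_16khz_mono_pcm16(demo_path))
```

To check a real file, call the function with a path such as audio/google/hello.wav.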
We use rvTTS, a CLI tool for converting text to mp3 files using ResponsiveVoice's API.
pip install rvtts
mkdir audio/rv/
rvtts --voice english_us_male --text "hello responsive voice trial" -o audio/rv/hello.mp3
ffmpeg -i audio/rv/hello.mp3 -acodec pcm_s16le -ac 1 -ar 16000 audio/rv/hello.wav -y
Festival is a free TTS written in C++. It is developed by The Centre for Speech Technology Research at the University of Edinburgh. Festival is distributed under an X11-type licence allowing unrestricted commercial and non-commercial use alike. Festival is a command-line program that is already installed on Ubuntu 16.04.
sudo apt install festival
mkdir audio/festival/
festival -b "(utt.save.wave (SayText \"hello festival \") \"audio/festival/hello.wav\" 'riff)"
eSpeak is a compact open source software speech synthesizer for English and other languages.
sudo apt install espeak
mkdir audio/espeak/
# ffmpeg cannot overwrite its own input, so write the raw espeak output to a temporary file first
espeak "hello e speak" --stdout > audio/espeak/temp.wav
ffmpeg -i audio/espeak/temp.wav -acodec pcm_s16le -ac 1 -ar 16000 audio/espeak/hello.wav -y
DeepSpeech is an open source Speech-To-Text engine, using a model trained by machine learning techniques based on Baidu's Deep Speech research paper. CrossASR uses Deepspeech-0.6.1
pip install deepspeech===0.6.1
if [ ! -d "models/" ]; then
  mkdir models
fi
cd models
mkdir deepspeech
cd deepspeech
curl -LO
tar xvf deepspeech-0.6.1-models.tar.gz
cd ../../
Please follow this link for more detailed installation instructions.
deepspeech --model models/deepspeech/deepspeech-0.6.1-models/output_graph.pbmm --lm models/deepspeech/deepspeech-0.6.1-models/lm.binary --trie models/deepspeech/deepspeech-0.6.1-models/trie --audio audio/google/hello.wav
DeepSpeech2 is an open-source implementation of end-to-end Automatic Speech Recognition (ASR) engine, based on Baidu's Deep Speech 2 paper, with PaddlePaddle platform.
cd models/
git clone
cp DeepSpeech/
cd DeepSpeech/models/librispeech/
cd ../../../../
cd models/DeepSpeech/models/lm
cd ../../../../
docker pull paddlepaddle/paddle:1.6.2-gpu-cuda10.0-cudnn7
# please remove --gpus '"device=1"' if you only have one gpu
docker run --name deepspeech2 --rm --gpus '"device=1"' -it -v $(pwd)/models/DeepSpeech:/DeepSpeech -v $(pwd)/audio/:/DeepSpeech/audio/ -v $(pwd)/data/:/DeepSpeech/data/ paddlepaddle/paddle:1.6.2-gpu-cuda10.0-cudnn7 /bin/bash
apt-get update
apt-get install git -y
cd DeepSpeech
apt-get install libsndfile1-dev -y
in case you found error when running the
Error solution for ImportError: No module named swig_decoders
pip install paddlepaddle-gpu==1.6.2.post107
cd DeepSpeech
pip install soundfile
pip install llvmlite===0.31.0
pip install resampy
pip install python_speech_features
tar xvzf swig-3.0.12.tar.gz
cd swig-3.0.12
apt-get install automake -y
make install
cd ../decoders/swig/
cd ../../
pip install flask
--mean_std_path='models/librispeech/mean_std.npz' \
--vocab_path='models/librispeech/vocab.txt' \
--model_path='models/librispeech' \
Then detach from the Docker container using ctrl+p & ctrl+q once the server log shows Running on (Press CTRL+C to quit)
docker exec -it deepspeech2 curl "http://localhost:5000/transcribe?fpath=audio/google/hello.wav"
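The same request can be issued from Python. The sketch below only assumes what the curl command above shows: a `/transcribe` endpoint on port 5000 that takes an `fpath` query parameter. The function names are hypothetical, the response is returned as raw text because its format is not documented here, and Python 3 is assumed. Note that the `docker run` command above does not publish port 5000, so `transcribe` would need to run inside the container (or the container would need a port mapping) to reach the server.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_transcribe_url(fpath: str, host: str = "localhost", port: int = 5000) -> str:
    """Build the transcription URL, percent-encoding the audio path."""
    return f"http://{host}:{port}/transcribe?{urlencode({'fpath': fpath})}"


def transcribe(fpath: str, timeout: float = 60.0) -> str:
    """Send the request and return the raw response body as text.

    Requires the DeepSpeech2 server to be reachable at the built URL.
    """
    with urlopen(build_transcribe_url(fpath), timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


# Only builds the URL; call transcribe(...) once the server is running.
print(build_transcribe_url("audio/google/hello.wav"))
```

Encoding the query string in code also avoids the shell-quoting issues that an unquoted `?` can cause on the command line.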
wav2letter++ is a highly efficient end-to-end automatic speech recognition (ASR) toolkit written entirely in C++ by Facebook Research, leveraging ArrayFire and flashlight.
Please find the latest wav2letter Docker image.
cd models/
mkdir wav2letter
cd wav2letter
for f in acoustic_model.bin tds_streaming.arch decoder_options.json feature_extractor.bin language_model.bin lexicon.txt tokens.txt ; do wget${f} ; done
ls -sh
cd ../../
docker run --name wav2letter -it --rm -v $(pwd)/audio/:/root/host/audio/ -v $(pwd)/models/:/root/host/models/ --ipc=host -a stdin -a stdout -a stderr wav2letter/wav2letter:inference-latest
Then detach from the docker using ctrl+p & ctrl+q
docker exec -it wav2letter sh -c "cat /root/host/audio/google/hello.wav | /root/wav2letter/build/inference/inference/examples/simple_streaming_asr_example --input_files_base_path /root/host/models/wav2letter/"
Detail of wav2letter++ installation and wav2letter++ inference
Wit provides an API for ASR. We use pywit, the Python SDK for Wit. You need to create a Wit account to get an access token.
pip install wit===5.1.0
export WIT_ACCESS_TOKEN=<your Wit access token>
curl -XPOST '' \
-i -L \
-H "Authorization: Bearer $WIT_ACCESS_TOKEN" \
-H "Content-Type: audio/wav" \
--data-binary "@audio/google/hello.wav"
Success Response
HTTP/1.1 100 Continue
Date: Fri, 11 Sep 2020 05:55:51 GMT
HTTP/1.1 200 OK
Content-Type: application/json
Date: Fri, 11 Sep 2020 05:55:52 GMT
Connection: keep-alive
Content-Length: 85
{
  "entities": {},
  "intents": [],
  "text": "hello world google",
  "traits": {}
}
python models/
pip install numpy
pip install pandas
pip install scikit-learn
pip install normalise
normalise has several nltk data dependencies. Install these by running the following Python commands (inside python):
import nltk
for dependency in ("brown", "names", "wordnet", "averaged_perceptron_tagger", "universal_tagset"):
    nltk.download(dependency)
python -t <tts> -o audio/<tts>/icsme.wav
Example on Google
python -t google -o audio/google/icsme.wav
python -a <asr> -i audio/<tts>/icsme.wav
Example on Deepspeech2
python -a paddledeepspeech -i audio/google/icsme.wav
We already provide corpus/europarl-20k.txt in our GitHub repository, so you can skip this step. Please check the corpus/ folder to make sure the dataset is available.
If you want to reproduce the dataset generation, please follow the next steps.
Download the Europarl raw data, then extract it inside the main folder. You will get europarl-parallel-corpus-19962011/
This code will generate the full Europarl corpus corpus/europarl-full.csv and the 20k texts corpus/europarl-20k.txt for our experiment.
This script will generate audio files in the form of audio/without_classifier/<tts>/audio-<id>.wav. The transcription is saved at result/without_classifier/<dataset name>/<tts>/<asr>/data.csv. The statistic (number of failed, successful, and indeterminable test cases) is saved at result/without_classifier/<dataset name>/<tts>/<asr>/data.csv
pip install torch
pip install simpletransformers
This script will generate audio files in the form of audio/with_classifier/<tts>/audio-<id>.wav. The transcription is saved at result/with_classifier/<dataset name>/<tts>/<asr>/data.csv. The statistic (number of failed, successful, and indeterminable test cases) is saved at result/with_classifier/<dataset name>/<tts>/<asr>/data.csv
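The classifier-guided run relies on the idea stated in the overview: pick the texts most likely to lead to failed test cases. The selection step alone can be sketched as below. This is not the repository's code; `select_texts` is a hypothetical helper, the scoring function stands in for whatever trained predictor is used, and the probabilities are made up for illustration.

```python
import heapq
from typing import Callable, List, Sequence


def select_texts(
    texts: Sequence[str],
    failure_probability: Callable[[str], float],
    budget: int,
) -> List[str]:
    """Return the `budget` texts with the highest predicted failure probability,
    ordered from most to least likely to fail."""
    return heapq.nlargest(budget, texts, key=failure_probability)


# Made-up scores standing in for a trained predictor's output.
scores = {
    "a short sentence": 0.10,
    "an unusually worded parliamentary remark": 0.85,
    "hello world": 0.05,
    "names such as Strasbourg and Maastricht": 0.60,
}
picked = select_texts(list(scores), scores.get, budget=2)
print(picked)
```

Only the selected texts would then be synthesized and transcribed, which is where the savings in generated test cases come from.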
python --tts <tts name> --output-dir <output dir location> --lower-bound <lower bound id> --upper-bound <upper bound id>
python --tts google --output-dir audio/data/ --lower-bound 0 --upper-bound 20000
python --tts <tts name> --asr <asr name> --input-dir <input audio dir location> --output-dir <output transcription dir location> --lower-bound <lower bound id> --upper-bound <upper bound id>
python --tts google --asr paddledeepspeech --input-dir audio/data/ --output-dir transcription/ --lower-bound 0 --upper-bound 20000
