Yoruba Speech Dataset for Low-Resource Speech Tasks

This repo contains over 3 hours of recorded yoruba text from 1 native yoruba male speaker.

Downloading

Go to your terminal and enter;

git clone https://github.com/ogunlao/yoruba_speech_project.git

This adds a folder called "yoruba_speech_project" which contain the files to your local directory.

Statistics

1079 recordings of a maximum of 20 word length per recording
3.06 total recording hours
22 folders each containing 49 recorded audio with their respective metadata.

Recordings are divided into folders for easy access.

Data Collection

Text dataset used for recording was collected from Niger Volta LTI. It had a smooth narrative which made the text interesting to read.

Recording was done using the Lig-Aikuma Android app. It is an easy-to-use app with good interface for recording and elicitation. It offers 6 modes of usage;

Recording
Respeaking
Translating
Elicitation
Check
Share

The elicitation mode was used in data collection where a displayed text is read carefully by the speaker during recording.

Data Preprocessing

Preprocessing involved;

validating data for errors and removing corrupt files
merging folders
splitting data into train, val and test samples

Two main dataset directory with subdirectories;

recordings contain the unprocessed recorded files and metadata
data/records contain split data

Raw speech dataset split into;

1 hour of speech data (split into train and val)
1 hour of speech as test set
Additional extra 1 hour of speech as extra data
each contained in data/records/train, data/records/val, data/records/extra respectively

All preprocessing can be accessed through the notebook yoruba_speech_preprocessing and script yor_processor.py

Note that a python installation is required with some dependencies;

numpy
wave

Application

The dataset can be used majorly for low-resource speech model experiments or for cross-lingual ASR. Also, a pretrained model can be fine-tuned on this dataset to learn phonetic representations for yoruba language.

Problems Encountered

Difficulty in avoiding noise from environment when recording. Some recordings had to be performed multiple times to get a clearer audio
Few audio files were corrupted after recording. I filtered them out during preprocessing. Be aware of this when using the raw (unprocessed) audio files.
Some words in text were difficult to pronounce without proper accents.
Recording speech takes time and can become uninteresting to perform quickly.

Contributing

If you will like to contribute to this project by recording more audio files, I have text files under slit_text folder where you can easily read short sentences. Upload the text files to the lig-aikuma android app and start recording. Currently at yor_split_21 for this project. You can make a pull request and I will be happy to add you to the project.

Original Author

Sewade Olaolu Ogun

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
data		data
language_model		language_model
project		project
recordings		recordings
split_text		split_text
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
yor_CTC.ipynb		yor_CTC.ipynb
yor_processor.py		yor_processor.py
yor_split.txt		yor_split.txt
yor_trans.txt		yor_trans.txt
yoruba_speech_preprocessing.ipynb		yoruba_speech_preprocessing.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Yoruba Speech Dataset for Low-Resource Speech Tasks

Downloading

Statistics

Data Collection

Data Preprocessing

Application

Problems Encountered

Contributing

Original Author

License

About

Releases

Languages

License

ogunlao/yoruba_speech_project

Folders and files

Latest commit

History

Repository files navigation

Yoruba Speech Dataset for Low-Resource Speech Tasks

Downloading

Statistics

Data Collection

Data Preprocessing

Application

Problems Encountered

Contributing

Original Author

License

About

Resources

License

Stars

Watchers

Forks

Releases

Languages