AsoSoft/CKB-Sentence-Dataset

CKB Sentences Corpus for TTS and ASR

Overview

The CKB Sentences Corpus is a text dataset for natural language processing (NLP) applications, with a focus on Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) systems. It contains 1000 sentences in Central Kurdish (CKB) spread across 16 topics. The topical spread is intended to provide diverse linguistic content for training and evaluating TTS and ASR models.

Corpus Details

The corpus includes sentences from the following topics, each contributing a specific number of sentences:

Topic                     Number of Sentences
History                   50
Geography                 50
Sports                    100
Religious                 50
General                   50
News                      50
Health                    50
Weather                   50
Arts                      50
Science and Technology    50
Poetry                    100
Economy                   50
Very Common               50
Facebook Comment          50
Government                50
Normal                    100
Total                     1000
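
As a quick sanity check on the table, the per-topic counts can be tallied in a few lines of Python. The rows as listed sum to 950 rather than the stated total of 1000. This README does not say which figure needs correcting, so the sketch below only reports the difference. The names `TOPIC_COUNTS`, `STATED_TOTAL`, and `summarize` are illustrative and are not part of the repository.

```python
# Per-topic sentence counts, copied row by row from the table above.
TOPIC_COUNTS = {
    "History": 50,
    "Geography": 50,
    "Sports": 100,
    "Religious": 50,
    "General": 50,
    "News": 50,
    "Health": 50,
    "Weather": 50,
    "Arts": 50,
    "Science and Technology": 50,
    "Poetry": 100,
    "Economy": 50,
    "Very Common": 50,
    "Facebook Comment": 50,
    "Government": 50,
    "Normal": 100,
}

# Total as stated in the last row of the table.
STATED_TOTAL = 1000


def summarize(counts):
    """Return the sum of all counts and each topic's share of that sum."""
    total = sum(counts.values())
    shares = {topic: n / total for topic, n in counts.items()}
    return total, shares


total, shares = summarize(TOPIC_COUNTS)
print(f"topics: {len(TOPIC_COUNTS)}, listed sum: {total}, stated total: {STATED_TOTAL}")
print(f"difference: {STATED_TOTAL - total}")
for topic, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{topic:<24}{share:6.1%}")
```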

Usage

Text-to-Speech (TTS)

The corpus can be used to train TTS systems by providing diverse and phonetically rich sentences. It covers a wide range of topics, ensuring that the generated speech can handle various vocabulary and sentence structures. This diversity helps in creating a more natural and intelligible synthetic voice for CKB.
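
One common way to turn a sentence list into a compact recording script is to pick sentences greedily so that each new one adds as many unseen characters as possible. The sketch below illustrates that idea under stated assumptions: character coverage is only a rough stand-in for phonetic coverage, the function names are hypothetical, and the sample strings are illustrative placeholders, not sentences from the corpus.

```python
from collections import Counter


def visible_chars(sentence):
    """Set of non-whitespace characters in a sentence."""
    return {ch for ch in sentence if not ch.isspace()}


def char_counts(sentences):
    """Count how often each non-whitespace character occurs overall."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ch for ch in sentence if not ch.isspace())
    return counts


def greedy_select(sentences, k):
    """Pick up to k sentences, each adding the most not-yet-covered characters.

    Stops early once no remaining sentence adds a new character.
    """
    covered = set()
    chosen = []
    remaining = list(sentences)
    while remaining and len(chosen) < k:
        best = max(remaining, key=lambda s: len(visible_chars(s) - covered))
        if not visible_chars(best) - covered:
            break
        chosen.append(best)
        covered |= visible_chars(best)
        remaining.remove(best)
    return chosen


# Illustrative placeholder strings, NOT taken from the corpus.
sample = ["سڵاو", "بەخێربێیت", "سوپاس"]
print(char_counts(sample).most_common(5))
print(greedy_select(sample, k=2))
```

In practice the input would be the corpus sentences, and a phoneme-level inventory (if one is available) would replace raw characters.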

Automatic Speech Recognition (ASR)

For ASR, this corpus serves as a valuable resource for training and evaluating models. The sentences include a wide range of phonetic and syntactic structures, which are essential for developing robust ASR systems capable of understanding different accents and speaking styles in CKB.
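
When the sentences are used as reference transcripts for evaluation, a typical metric is word error rate (WER): the word-level edit distance between reference and hypothesis, divided by the number of reference words. This README does not state which metric the authors use, so the following is a minimal, dependency-free sketch of the standard definition; `word_error_rate` is an illustrative name, and whitespace tokenization is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the ref prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens stand in for a corpus sentence and a recognizer output.
print(word_error_rate("w1 w2 w3 w4", "w1 wX w3"))
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.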

Other Applications

In addition to TTS and ASR, the CKB Sentences Corpus can be utilized for:

  • Language Modeling: Developing models that can predict the next word or sentence in a sequence.
  • Speech Translation: Training models to translate spoken CKB into other languages.
  • Voice Conversion: Converting one speaker's voice to another within the CKB language.
  • Speech Synthesis Research: Analyzing and improving the quality of synthetic speech.
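
To make the language-modeling item concrete, here is a minimal bigram sketch that counts which word follows which and predicts the most frequent follower. It assumes whitespace tokenization and uses hypothetical helper names. A corpus of this size is small for language modeling, so this is an illustration of the idea rather than a recommended model.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count word-to-next-word transitions, with sentence boundary markers."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for current, following in zip(tokens, tokens[1:]):
            model[current][following] += 1
    return model


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = model.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Placeholder tokens stand in for corpus sentences.
model = train_bigrams(["w1 w2 w3", "w1 w2 w4", "w1 w2 w3"])
print(predict_next(model, "w2"))
```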

How to Access

You can access the corpus by cloning this repository or downloading the dataset directly from the provided links. Please adhere to the data usage policies and cite this repository if you use the data in your research.

git clone https://github.com/AsoSoft/CKB-Sentence-Dataset.git

Contribution

We welcome contributions to enhance the quality and scope of this corpus. If you have suggestions for new sentences, corrections, or additional topics, please submit a pull request or open an issue.

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Citation

If you use this corpus in your research, please cite the following paper:

Abdullah, A.A., Veisi, H. and Rashid, T., 2024. Breaking Walls: Pioneering Automatic Speech Recognition for Central Kurdish: End-to-End Transformer Paradigm. arXiv preprint arXiv:2406.02561.

Contact

For any questions or additional information, please contact us at info@asosoft.com.

