Arabic Text Diacritization

This repository contains the dataset, helpers, and systems comparison for our paper on Arabic Text Diacritization:

"Arabic Text Diacritization Using Deep Neural Networks", Ali Fadel, Ibraheem Tuffaha, Bara' Al-Jawarneh, and Mahmoud Al-Ayyoub, ICCAIS 2019.

Files

dataset

train.txt - Contains 50,000 lines of diacritized Arabic text which can be used as training dataset
val.txt - Contains 2,500 lines of diacritized Arabic text which can be used as validation dataset
test.txt - Contains 2,500 lines of diacritized Arabic text which can be used as testing dataset

helpers

constants
- ARABIC_LETTERS_LIST.pickle - Contains list of Arbaic letters
- CLASSES_LIST.pickle - Contains list of all possible classes
- DIACRITICS_LIST.pickle - Contains list of all diacritics
count_characters.py - Counts the number of Arabic letters and diacritics in a file
count_fathatan.py - Counts the number of fathatan occurrences before and after Alif in all files from a folder
diacritization_stat.py - Calculates DER and WER using the gold data and the predicted output
diacritics_rate_extractor.py - Keeps lines with p% diacritics to Arabic characters rate or more in all files from a folder
file_lookup.py - Searches for a line in all files from a folder
fix_fathatan.py - Changes after-Alif fathatan to before-Alit fathatan in a file
remove_diacritics.py - Removes diacritics from a file
transliteration.py - Converts a file from Arabic text to Buckwalter transliteration and vice-versa
pre_process_tashkeela_corpus.ipynb - Pre-process Tashkeela Corpus data

existing_systems

ali-soft - Contains some bugs that exist in Ali-Soft system
farasa - Contains Farasa system output, fixed output, and DER/WER statistics
harakat - Contains Harakat system testing script, output, fixed output, and DER/WER statistics
madamira - Contains MADAMIRA system output, fixed output, and DER/WER statistics
mishkal - Contains Mishkal system output, fixed output, and DER/WER statistics
shakkala - Contains Shakkala system data splitting script, output, fixed output, and DER/WER statistics
tashkeela_model - Contains Tashkeela-Model system output, fixed output, and DER/WER statistics for each n-gram model provided by them

Note: All codes in this repository tested on Ubuntu 18.04

Contributors

License

The project is available as open source under the terms of the MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 52 Commits
dataset		dataset
existing_systems		existing_systems
helpers		helpers
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Arabic Text Diacritization

Files

dataset

helpers

existing_systems

Note: All codes in this repository tested on Ubuntu 18.04

Contributors

License

About

Releases

Packages

Languages

License

resemble-ai/arabic-text-diacritization

Folders and files

Latest commit

History

Repository files navigation

Arabic Text Diacritization

Files

dataset

helpers

existing_systems

Note: All codes in this repository tested on Ubuntu 18.04

Contributors

License

About

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages