resemble-ai/arabic-text-diacritization

Arabic Text Diacritization

This repository contains the dataset, helpers, and systems comparison for our paper on Arabic Text Diacritization:

"Arabic Text Diacritization Using Deep Neural Networks", Ali Fadel, Ibraheem Tuffaha, Bara' Al-Jawarneh, and Mahmoud Al-Ayyoub, ICCAIS 2019.

Files

  • train.txt - Contains 50,000 lines of diacritized Arabic text, which can be used as the training set
  • val.txt - Contains 2,500 lines of diacritized Arabic text, which can be used as the validation set
  • test.txt - Contains 2,500 lines of diacritized Arabic text, which can be used as the test set
  • constants
    • ARABIC_LETTERS_LIST.pickle - Contains the list of Arabic letters
    • CLASSES_LIST.pickle - Contains list of all possible classes
    • DIACRITICS_LIST.pickle - Contains list of all diacritics
  • count_characters.py - Counts the number of Arabic letters and diacritics in a file
  • count_fathatan.py - Counts the number of fathatan occurrences before and after Alif in all files from a folder
  • diacritization_stat.py - Calculates DER and WER using the gold data and the predicted output
  • diacritics_rate_extractor.py - Keeps only the lines whose diacritics-to-Arabic-characters rate is p% or more, in all files from a folder
  • file_lookup.py - Searches for a line in all files from a folder
  • fix_fathatan.py - Changes after-Alif fathatan to before-Alif fathatan in a file
  • remove_diacritics.py - Removes diacritics from a file
  • transliteration.py - Converts a file from Arabic text to Buckwalter transliteration and vice-versa
  • pre_process_tashkeela_corpus.ipynb - Pre-process Tashkeela Corpus data
  • ali-soft - Contains some bugs that exist in Ali-Soft system
  • farasa - Contains Farasa system output, fixed output, and DER/WER statistics
  • harakat - Contains Harakat system testing script, output, fixed output, and DER/WER statistics
  • madamira - Contains MADAMIRA system output, fixed output, and DER/WER statistics
  • mishkal - Contains Mishkal system output, fixed output, and DER/WER statistics
  • shakkala - Contains Shakkala system data splitting script, output, fixed output, and DER/WER statistics
  • tashkeela_model - Contains Tashkeela-Model system output, fixed output, and DER/WER statistics for each n-gram model provided by them
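
As a rough illustration of what helpers such as remove_diacritics.py and diacritics_rate_extractor.py do, the sketch below strips diacritics from a string and computes a diacritics-to-letters rate. It is not the repository's code: the function names `strip_diacritics` and `diacritics_rate` are hypothetical, and the character sets are assumed from Unicode ranges rather than loaded from the pickles in the constants folder.

```python
# Assumption: the diacritics are the eight marks U+064B..U+0652 (the three
# tanween forms, fatha, damma, kasra, shadda, sukun). The repository's
# DIACRITICS_LIST.pickle may define a different set.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}

# Assumption: Arabic letters are U+0621..U+063A and U+0641..U+064A. The
# repository's ARABIC_LETTERS_LIST.pickle may differ.
ARABIC_LETTERS = (
    {chr(cp) for cp in range(0x0621, 0x063B)}
    | {chr(cp) for cp in range(0x0641, 0x064B)}
)


def strip_diacritics(text: str) -> str:
    """Return text with every character in DIACRITICS removed."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def diacritics_rate(text: str) -> float:
    """Return (number of diacritic marks) / (number of Arabic letters).

    Every mark is counted, so a letter carrying shadda plus a vowel adds two
    and the rate can exceed 1.0. Returns 0.0 when there are no Arabic letters.
    """
    letters = sum(ch in ARABIC_LETTERS for ch in text)
    marks = sum(ch in DIACRITICS for ch in text)
    return marks / letters if letters else 0.0


if __name__ == "__main__":
    # "kataba" written with a fatha on each of its three letters.
    sample = "\u0643\u064E\u062A\u064E\u0628\u064E"
    print(strip_diacritics(sample))
    print(diacritics_rate(sample))
```

A line filter in the spirit of diacritics_rate_extractor.py could then keep a line when `diacritics_rate(line) >= p / 100`, although the actual script may define the rate differently.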

Note: All code in this repository was tested on Ubuntu 18.04.
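
The DER (diacritic error rate) and WER (word error rate) metrics mentioned for diacritization_stat.py can be sketched in a simplified form as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: the names `split_diacritics`, `der`, and `wer` are hypothetical, the character sets are assumed from Unicode ranges, and the sketch ignores any variants the actual script may support (for example, special handling of word endings or of undiacritized letters).

```python
# Assumption: diacritics are U+064B..U+0652 and Arabic letters are
# U+0621..U+063A plus U+0641..U+064A; the repository's pickles may differ.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}
ARABIC_LETTERS = (
    {chr(cp) for cp in range(0x0621, 0x063B)}
    | {chr(cp) for cp in range(0x0641, 0x064B)}
)


def split_diacritics(text):
    """Split text into (base_char, frozenset_of_marks) pairs.

    Marks are attached to the preceding base character; a mark with no
    preceding base character is dropped. Using a set makes the comparison
    insensitive to mark order (shadda+fatha vs. fatha+shadda).
    """
    pairs = []
    for ch in text:
        if ch in DIACRITICS:
            if pairs:
                base, marks = pairs[-1]
                pairs[-1] = (base, marks | {ch})
        else:
            pairs.append((ch, frozenset()))
    return pairs


def der(gold: str, pred: str) -> float:
    """Fraction of Arabic letters whose marks differ between gold and pred."""
    g, p = split_diacritics(gold), split_diacritics(pred)
    if [b for b, _ in g] != [b for b, _ in p]:
        raise ValueError("gold and pred differ in their undiacritized text")
    scored = [(gm, pm) for (b, gm), (_, pm) in zip(g, p) if b in ARABIC_LETTERS]
    if not scored:
        return 0.0
    return sum(gm != pm for gm, pm in scored) / len(scored)


def wer(gold: str, pred: str) -> float:
    """Fraction of whitespace-separated words with at least one mark error."""
    gw, pw = gold.split(), pred.split()
    if len(gw) != len(pw):
        raise ValueError("gold and pred have different word counts")
    if not gw:
        return 0.0
    wrong = sum(split_diacritics(a) != split_diacritics(b) for a, b in zip(gw, pw))
    return wrong / len(gw)


if __name__ == "__main__":
    gold = "\u0643\u064E\u062A\u064E\u0628\u064E"  # fatha on all three letters
    pred = "\u0643\u064E\u062A\u064F\u0628\u064E"  # damma on the second letter
    print(der(gold, pred), wer(gold, pred))
```

Both functions assume the gold and predicted strings share the same undiacritized text, so a system that changes letters or tokenization would need its output aligned first.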

Contributors

  1. Ali Hamdi Ali Fadel.
  2. Ibraheem Tuffaha.
  3. Bara' Al-Jawarneh.
  4. Mahmoud Al-Ayyoub.

License

The project is available as open source under the terms of the MIT License.

About

Benchmark Arabic text diacritization dataset
