Skip to content

yoosif0/process_arabic_discretized_corpora

Repository files navigation

Process arabic corpora for pronouncing dictionary

Pre-usage

Tashkeela

  • Add corpus in a folder named "tashkeela-raw" in the root folder

Nawar's Halabi Corpus

  • Add corpus in a folder named "tashkeela-raw" in the root folder

Kasct Corpus

  • Add corpus in a folder named "kasct-raw" in the root folder
  • Open both files and in VS Code
    • Reopen with encoding "Arabic Windows(1256)"
    • Save with encoding "utf-8"

Usage

  • python .\process_corpus.py --corpus nawar
  • python .\process_corpus.py --corpus kasct
  • python .\process_corpus.py --corpus tashkeela
  • python .\concat_corpora.py

Post Usage

  • Use ara pronounciaiton tool to produce the dict file using python corpus2cmudict.py -i output_with_last_haraka.txt -p haraka

About

Process Arabic diacretized corpora for static dictionary

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages