GOUD.MA: A NEWS ARTICLE DATASET FOR SUMMARIZATION IN MOROCCAN DARIJA

This repository holds the training code for the paper Goud.ma: a News Article Dataset for Summarization in Moroccan Darija.

Dataset

The Goud dataset contains 158k articles and their headlines, extracted from the Goud.ma news website. The articles are written in the Arabic script. All headlines are in Moroccan Darija, while articles may be in Moroccan Darija, in Modern Standard Arabic, or a mix of both (code-switched Moroccan Darija).

You can find the models and the dataset on the Goud Hugging Face organization.

Data Splits

Split	Number of Instances
Train	139,288
Validation	9,497
Test	9,497
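
As a quick sanity check, the split sizes above add up to the "158k" figure quoted earlier and correspond to roughly an 88/6/6 split. The snippet below only redoes that arithmetic; the dictionary keys are illustrative names, not identifiers taken from the dataset.

```python
# Split sizes copied from the table above.
splits = {"train": 139_288, "validation": 9_497, "test": 9_497}

total = sum(splits.values())
# Share of each split as a percentage, rounded to one decimal place.
shares = {name: round(100 * size / total, 1) for name, size in splits.items()}

print(total)   # 158282
print(shares)  # {'train': 88.0, 'validation': 6.0, 'test': 6.0}
```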

Characteristics

Statistic	Articles	Headlines
Total number of tokens	26,780,273	2,143,493
Number of unique tokens	1,229,993	236,593
Minimum number of tokens	32	4
Maximum number of tokens	6,025	74
Average number of tokens	169.19	13.54
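
Statistics of this kind can be computed with a few lines of Python. The sketch below assumes whitespace tokenization, which may differ from the tokenization used to produce the table, and runs on a toy list of strings rather than the real corpus. The last two lines check that the reported averages agree with the reported totals divided by the number of instances.

```python
from collections import Counter

def token_stats(texts):
    """Corpus statistics over whitespace-separated tokens (an assumed tokenization)."""
    lengths = []
    vocab = Counter()
    for text in texts:
        tokens = text.split()
        lengths.append(len(tokens))
        vocab.update(tokens)
    total = sum(lengths)
    return {
        "tokens": total,
        "unique_tokens": len(vocab),
        "min_tokens": min(lengths),
        "max_tokens": max(lengths),
        "avg_tokens": round(total / len(lengths), 2),
    }

# Toy input for illustration only.
stats = token_stats(["a b c", "a b", "d e f g"])

# Consistency check on the reported numbers: total tokens / number of instances.
n_instances = 139_288 + 9_497 + 9_497
avg_article = round(26_780_273 / n_instances, 2)   # 169.19
avg_headline = round(2_143_493 / n_instances, 2)   # 13.54
```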

Models

We train encoder-decoder baselines, which are available on Hugging Face. Each model is warm-started from a pretrained BERT checkpoint and fine-tuned for text summarization. This approach is described in the paper Leveraging Pre-trained Checkpoints for Sequence Generation Tasks.
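
Warm-starting can be sketched with the `EncoderDecoderModel` class from the transformers library. This is a minimal illustration under assumptions, not the repository's `train.py`: the helper name `warm_start` is hypothetical, and the checkpoint names are left as parameters because the exact checkpoints used for the baselines are not listed here.

```python
def warm_start(encoder_checkpoint, decoder_checkpoint, tokenizer):
    """Build an encoder-decoder model whose two halves start from pretrained BERT weights."""
    # Imported here so this module can be loaded without transformers installed.
    from transformers import EncoderDecoderModel

    model = EncoderDecoderModel.from_encoder_decoder_pretrained(
        encoder_checkpoint, decoder_checkpoint
    )
    # BERT has no dedicated BOS/EOS tokens, so [CLS] and [SEP] are commonly reused.
    model.config.decoder_start_token_id = tokenizer.cls_token_id
    model.config.eos_token_id = tokenizer.sep_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    return model
```

Passing the same BERT checkpoint name for both arguments initializes the encoder and the decoder from the same weights. The decoder's cross-attention layers have no pretrained counterpart, so they are newly initialized and learned during fine-tuning.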

Results

Test-set results of warm-starting the encoder and decoder with three different BERT checkpoints:

BERT checkpoint	ROUGE-1	ROUGE-2	ROUGE-L
AraBERT	23.08	8.98	22.06
DarijaBERT	19.41	6.64	18.48
DziriBERT	17.98	5.83	17.22
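
For readers unfamiliar with the metric, ROUGE-N measures n-gram overlap between a generated headline and the reference headline. The function below is a minimal sketch, not the evaluation code behind the table: it assumes whitespace tokenization, applies no stemming or normalization, and returns an F1 score between 0 and 1.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """ROUGE-N F1 between two whitespace-tokenized strings."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    # Clipped matches: each n-gram counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```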

Training

The code in this repository can be used to reproduce the results presented above.

Clone repo

git clone https://github.com/issam9/goud-summarization-dataset.git

Install requirements

cd goud-summarization-dataset
pip install -r requirements.txt

Launch training

python train.py

configs/default.yaml contains the default training configuration. You can override these defaults from the command line, for example:

python train.py trainer.num_epochs=10 generate.num_beams=3
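
The dotted `key.path=value` form addresses nested entries of the configuration. The sketch below illustrates that merging behavior in plain Python. It is not the parser used by `train.py`, and the default values shown are hypothetical; the real ones live in the repository's YAML file.

```python
import copy

def apply_overrides(defaults, overrides):
    """Return a copy of `defaults` with dotted `key.path=value` overrides applied."""
    config = copy.deepcopy(defaults)
    for item in overrides:
        path, raw = item.split("=", 1)
        *parents, leaf = path.split(".")
        node = config
        for key in parents:
            node = node.setdefault(key, {})
        # Interpret integers; keep anything else as a string.
        node[leaf] = int(raw) if raw.lstrip("-").isdigit() else raw
    return config

# Hypothetical defaults, for illustration only.
defaults = {"trainer": {"num_epochs": 3}, "generate": {"num_beams": 1}}
config = apply_overrides(defaults, ["trainer.num_epochs=10", "generate.num_beams=3"])
```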

How to use

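The models are available on the Hugging Face Hub. The following example loads the AraBERT-based model and generates a headline for an article: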

from transformers import EncoderDecoderModel, BertTokenizer

article = """توصل الاتحاد الأوروبي، في وقت مبكر من اليوم السبت، إلى اتفاق تاريخي يستهدف خطاب الكراهية والمعلومات المضللة والمحتويات الضارة الأخرى الموجودة على شبكة الإنترنيت.
وحسب تقارير صحفية، سيجبر القانون شركات التكنولوجيا الكبرى على مراقبة نفسها بشكل أكثر صرامة، ويسهل على المستخدمين الإبلاغ عن المشاكل، ويمكن الاتفاق المنظمين من معاقبة الشركات غير الممتثلة بغرامات تقدر بالملايير.
ويركز الاتفاق على قواعد جديدة تتطلب من شركات التكنولوجيا العملاقة بذل المزيد من الجهد لمراقبة المحتوى على منصاتها ودفع رسوم للجهات المنظمة التي تراقب مدى امتثالها.
ويعد قانون الخدمات الرقمية الشق الثاني من إستراتيجية المفوضة الأوروبية لشؤون المنافسة، مارغريت فيستاغر، للحد من هيمنة وحدة غوغل التابعة لألفابت، وميتا (فيسبوك سابقا) وغيرهما من شركات التكنولوجيا الأمريكية العملاقة.
وقالت فيستاغر في تغريدة “توصلنا إلى اتفاق بشأن قانون الخدمات الرقمية، موضحة أن القانون سيضمن أن ما يعتبر غير قانوني في حالة عدم الاتصال بالشبكة ينظر إليه أيضا ويتم التعامل معه على أنه غير قانوني عبر الشبكة (الإنترنت) – ليس كشعار (ولكن) كواقع”.
وتواجه الشركات بموجب قانون الخدمات الرقمية غرامات تصل إلى 6 في المائة من إجمالي عملياتها على مستوى العالم لانتهاك القواعد بينما قد تؤدي الانتهاكات المتكررة إلى حظرها من ممارسة أعمالها في الاتحاد الأوروبي.
وأيدت دول الاتحاد والمشرعون الشهر الماضي القواعد التي طرحتها فيستاغر والمسماة قانون الأسواق الرقمية التي قد تجبر غوغل وأمازون وأبل وميتا وميكروسوفت على تغيير ممارساتها الأساسية في أوروبا.
"""

tokenizer = BertTokenizer.from_pretrained("Goud/AraBERT-summarization-goud")
model = EncoderDecoderModel.from_pretrained("Goud/AraBERT-summarization-goud")

# Tokenize the article, truncating it to the tokenizer's maximum input length
inputs = tokenizer(article, return_tensors="pt", truncation=True)
# Generate the headline token ids, then decode them back to text
generated = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask)[0]
output = tokenizer.decode(generated, skip_special_tokens=True)
print(output)
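
To summarize several articles at once, the same calls can be wrapped in a small helper. This is a sketch under assumptions: the function name `summarize` and the default values for `num_beams` and `max_new_tokens` are illustrative choices, not settings documented for these models.

```python
def summarize(articles, tokenizer, model, num_beams=3, max_new_tokens=64):
    """Generate one headline per article using beam search."""
    # Pad to the longest article in the batch and truncate overly long ones.
    inputs = tokenizer(
        articles, return_tensors="pt", truncation=True, padding=True
    )
    generated = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,  # mask out padding positions
        num_beams=num_beams,
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

With the `tokenizer`, `model`, and `article` objects from the example above, `summarize([article], tokenizer, model)[0]` returns a single headline.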

Citation

@inproceedings{issam2022goudma,
  title={Goud.ma: a News Article Dataset for Summarization in Moroccan Darija},
  author={Abderrahmane Issam and Khalil Mrini},
  booktitle={3rd Workshop on African Natural Language Processing},
  year={2022},
  url={https://openreview.net/forum?id=BMVq5MELb9}
}
