AfriTeVa-Keji

This repository contains code to reproduce "Better Quality Pre-training Data and T5 Models for African Languages", which appears in the Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP).

AfriTeVa V2 was trained on 20 languages (16 African languages plus Arabic, English, French and Portuguese) as the successor to AfriTeVa, and evaluated on text classification, summarisation, reading comprehension and machine translation.

We release the following models:

Setup

Create a conda environment. Note that this repo has only been tested with Python 3.9.

conda create -n teva python=3.9 -y
conda activate teva

Install JAX and t5x for your device.

# For TPU
pip install -r requirements/requirements-tpu.txt

# For GPU. This installs JAX for CUDA 12.
# For other CUDA versions, you may need to edit requirements/requirements-gpu.txt.
pip install -r requirements/requirements-gpu.txt

Install teva.

# For normal installation
pip install .

# For development installation
pip install -e .

There are a few environment variables you may need to set. See .example.env.
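One way to load such variables into the current shell is sketched below. This is only an illustration: `load_env` is a hypothetical helper that is not part of this repository, and it assumes the env file uses plain, shell-compatible `KEY=VALUE` lines. The variable names themselves come from `.example.env` and are not reproduced here.

```shell
# Hypothetical helper (not part of this repo): export every assignment
# found in a dotenv-style file into the current shell.
# Assumes plain, shell-compatible KEY=VALUE lines.
load_env() {
    env_file="${1:-.env}"
    if [ ! -f "$env_file" ]; then
        echo "env file not found: $env_file" >&2
        return 1
    fi
    # A bare filename would be looked up on PATH by `.`, so anchor it
    # to the current directory.
    case "$env_file" in
        */*) ;;
        *) env_file="./$env_file" ;;
    esac
    set -a    # auto-export every variable assigned while sourcing
    . "$env_file"
    set +a    # restore the default behaviour
}
```

A typical use would be to copy `.example.env` to `.env`, fill in the values, and then call `load_env .env` before running any of the scripts.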

Experiments

Datasets

AfriTeVa-Keji was pretrained on the Wúrà dataset, which is available on the Hugging Face Hub.

Language Modelling

To pretrain AfriTeVa V2, complete the setup instructions above, then run:

bash scripts/pretrain.sh

If you need to convert the t5x checkpoint to Flax, run the following command:

python -m transformers.models.t5.convert_t5x_checkpoint_to_flax \
--t5x_checkpoint_path /path/to/your/trained/base/model \
--config_name config/models/t5_1_1/base.json \
--flax_dump_folder_path /path/to/your/converted/model
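After conversion, a quick sanity check on the output folder can catch a failed or partial run. The sketch below assumes the converter saves the model through the usual Transformers `save_pretrained` path, which would leave a `config.json` and a `flax_model.msgpack` in the dump folder. `check_flax_dump` is a hypothetical helper, not something this repository provides.

```shell
# Hypothetical helper (not part of this repo): verify that a converted
# Flax checkpoint folder contains the files Transformers normally writes.
# Assumption: the dump holds config.json and flax_model.msgpack.
check_flax_dump() {
    dump_dir="$1"
    status=0
    for f in config.json flax_model.msgpack; do
        # -s is true only if the file exists and is non-empty
        if [ ! -s "$dump_dir/$f" ]; then
            echo "missing or empty: $dump_dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}
```

Calling `check_flax_dump /path/to/your/converted/model` returns a non-zero status and names the missing file if either one is absent.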

Text Classification

AfriTeVa-Keji was evaluated on MasakhaNEWS 2.0, which covers 16 languages widely spoken in Africa.

# This will train and evaluate a classifier for each language over three seeds.
bash scripts/tasks/masahanews_ft.sh
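The "each language over several seeds" pattern can be sketched as a nested loop. This is not taken from the script above, whose internals may differ. `run_grid` and its arguments are hypothetical names used purely for illustration.

```shell
# Hypothetical sketch (the real fine-tuning script may be organised
# differently): run one command for every (language, seed) pair.
#   $1: space-separated language codes
#   $2: space-separated seeds
#   $3: command or function invoked as "<command> <language> <seed>"
run_grid() {
    for language in $1; do
        for seed in $2; do
            # Stop at the first failing run so errors are not buried.
            "$3" "$language" "$seed" || return 1
        done
    done
}
```

For instance, `run_grid "hau yor" "1 2 3" my_finetune` would call a user-supplied `my_finetune` six times, once per language and seed. The language codes and seeds here are placeholders.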

Summarisation

AfriTeVa-Keji was evaluated on 15 of the languages in XL-SUM.

# This will perform multilingual finetuning over 50,000 steps.
bash scripts/tasks/xlsum_xlingual.sh

Machine Translation

AfriTeVa V2 was evaluated on MAFAND-MT.

bash scripts/tasks/lafand_mt.sh

Citation

@inproceedings{oladipo-etal-2023-better,
    title = "Better Quality Pre-training Data and T5 Models for {A}frican Languages",
    author = "Oladipo, Akintunde  and
      Adeyemi, Mofetoluwa  and
      Ahia, Orevaoghene  and
      Owodunni, Abraham  and
      Ogundepo, Odunayo  and
      Adelani, David  and
      Lin, Jimmy",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.11",
    pages = "158--168",
    abstract = "In this study, we highlight the importance of enhancing the quality of pretraining data in multilingual language models. Existing web crawls have demonstrated quality issues, particularly in the context of low-resource languages. Consequently, we introduce a new multilingual pretraining corpus for 16 African languages, designed by carefully auditing existing pretraining corpora to understand and rectify prevalent quality issues. To compile this dataset, we undertake a rigorous examination of current data sources for thirteen languages within one of the most extensive multilingual web crawls, mC4, and extract cleaner data through meticulous auditing and improved web crawling strategies. Subsequently, we pretrain a new T5-based model on this dataset and evaluate its performance on multiple downstream tasks. Our model demonstrates better downstream effectiveness over existing pretrained models across four NLP tasks, underscoring the critical role data quality plays in pretraining language models in low-resource scenarios. Specifically, on cross-lingual QA evaluation, our new model is more than twice as effective as multilingual T5. All code, data and models are publicly available at https://github.com/castorini/AfriTeVa-keji.",
}
