Update Queries (#19)
ToluClassics authored Apr 23, 2023
1 parent 17ffc45 commit 94c0278
Showing 35 changed files with 2,647 additions and 26 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -135,7 +135,7 @@ wandb/
 *.out
 dumps/
 collections/
-queries/
+queries/*
 tranlation_script/twi_translated.csv
 tranlation_script/test.csv

395 changes: 395 additions & 0 deletions LICENSE

69 changes: 50 additions & 19 deletions README.md
@@ -1,13 +1,63 @@
# AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages

[![CC BY 4.0][cc-by-shield]][cc-by]

This work is licensed under a
[Creative Commons Attribution 4.0 International License][cc-by].

[![CC BY 4.0][cc-by-image]][cc-by]

[cc-by]: http://creativecommons.org/licenses/by/4.0/
[cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png
[cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg


AfriQA is the first cross-lingual question answering (QA) dataset with a focus on African languages. The dataset includes over 12,000 XOR QA examples across 10 African languages, making it an invaluable resource for developing more equitable QA technology.
African languages have historically been underserved in the digital landscape, with far less in-language content available online. This makes it difficult for QA systems to provide accurate information to users in their native language. However, cross-lingual open-retrieval question answering (XOR QA) systems can help fill this gap by retrieving answer content from other languages.
AfriQA focuses specifically on African languages where cross-lingual answer content is the only high-coverage source of information. Previous datasets have primarily focused on languages where cross-lingual QA augments coverage from the target language, but AfriQA highlights the importance of African languages as a realistic use case for XOR QA.

## Languages

There are currently 10 languages covered in AfriQA:

- Bemba (bem)
- Fon (fon)
- Hausa (hau)
- Igbo (ibo)
- Kinyarwanda (kin)
- Swahili (swa)
- Twi (twi)
- Wolof (wol)
- Yorùbá (yor)
- Zulu (zul)

## Dataset Download

Question-answer pairs for each language and `train-dev-test` split are in the [data directory](data/queries) in `jsonlines` format.

- Dataset naming convention: `queries.afriqa.{lang_code}.{en/fr}.{split}.json`
- Data format:
  - `id`: question ID
  - `question`: question in the African language
  - `translated_question`: question translated into the pivot language (English/French)
  - `answers`: answers in the African language
  - `lang`: language of the datapoint (African language), e.g. `bem`
  - `split`: dataset split
  - `translated_answer`: answers in the pivot language
  - `translation_type`: translation type of the question and answers


```json
{
  "id": 0,
  "question": "Bushe icaalo ca Egypt caali tekwapo ne caalo cimbi?",
  "translated_question": "Has the country of Egypt been colonized before?",
  "answers": "['Emukwai']",
  "lang": "bem",
  "split": "dev",
  "translated_answer": "['yes']",
  "translation_type": "human_translation"
}
```
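
To work with these files, a minimal loading sketch is shown below (the files are JSON Lines, i.e. one JSON object per line, despite the `.json` extension; the example path points at one of the Swahili dev files added in this commit):

```python
import json

def load_queries(path):
    """Read an AfriQA query file: JSON Lines, one example per line."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))
    return examples

# Example usage:
queries = load_queries("data/queries/swa/queries.afriqa.swa.en.dev.json")
print(len(queries), queries[0]["question"], queries[0]["answers"])
```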

## Environment and Repository Setup

- Set up a virtual environment using Conda or Virtualenv or
@@ -48,25 +98,6 @@ To download:
- [French](https://huggingface.co/datasets/ToluClassics/masakhane_wiki_100/resolve/main/masakhane_wiki_100-french/corpus.jsonl)
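
To fetch the pre-processed corpus programmatically instead of via the direct links, something like the sketch below should work (it assumes the `huggingface_hub` package is installed; the repository id and file name are taken from the French link above):

```python
from huggingface_hub import hf_hub_download

# Download the pre-processed French passage corpus from the Hugging Face Hub.
# The English corpus can be fetched the same way from its link above.
corpus_path = hf_hub_download(
    repo_id="ToluClassics/masakhane_wiki_100",
    filename="masakhane_wiki_100-french/corpus.jsonl",
    repo_type="dataset",
)
print(corpus_path)  # local path of the cached corpus.jsonl
```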


If you would rather run the processing pipeline yourself, we adopt the same processing used in the [Dense Passage Retriever paper](https://arxiv.org/pdf/2004.04906.pdf).
The pipeline is bundled into this [script](scripts/download_process_dumps.sh), which you can run with the command below:

```bash
bash scripts/generate_process_dumps.sh /path/to/dir_containing_dumps
```

[This document](docs/process_wiki_dumps.md) provides a detailed breakdown of the individual steps.

## Retriever

### BM25

### mDPR

### Hybrid

## Reader


## BibTeX entry and citation info

417 changes: 417 additions & 0 deletions data/queries/swa/queries.afriqa.swa.en.dev.json

302 changes: 302 additions & 0 deletions data/queries/swa/queries.afriqa.swa.en.test.json

415 changes: 415 additions & 0 deletions data/queries/swa/queries.afriqa.swa.en.train.json

361 changes: 361 additions & 0 deletions data/queries/yor/queries.afriqa.yor.en.dev.json

332 changes: 332 additions & 0 deletions data/queries/yor/queries.afriqa.yor.en.test.json

360 changes: 360 additions & 0 deletions data/queries/yor/queries.afriqa.yor.en.train.json

14 changes: 14 additions & 0 deletions docs/process_wiki_dumps.md
@@ -1,5 +1,19 @@
# Processing Wiki Dumps

The English and French passages for this project are drawn from Wikipedia snapshots of 2022-05-01 and 2022-04-20, respectively, and are downloaded from the [Internet Archive](https://archive.org/) to enable open-domain experiments.
The raw documents can be downloaded from the following URLs:

- https://archive.org/download/enwiki-20220501/enwiki-20220501-pages-articles-multistream.xml.bz2
- https://archive.org/download/frwiki-20220420/frwiki-20220420-pages-articles-multistream.xml.bz2

To run the Wikipedia processing pipeline, we adopt the same processing used in the [Dense Passage Retriever paper](https://arxiv.org/pdf/2004.04906.pdf) and in [Pre-processing Matters! Improved Wikipedia Corpora for Open-Domain Question Answering](https://link.springer.com/chapter/10.1007/978-3-031-28241-6_11).

The pipeline is bundled into this [script](scripts/download_process_dumps.sh), which you can run with the command below:

```bash
bash scripts/generate_process_dumps.sh /path/to/dir_containing_dumps
```

This document describes how to convert the downloaded XML Wikipedia dumps into 100-token passages stored in JSON files.

For processing, we extract the Wikipedia articles into multiple JSON Lines files. The articles are then preprocessed, cleaned, and stored in a SQLite database, after which they are chunked into 100-token passages.
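
As a rough illustration of that final chunking step, here is a simplified sketch (whitespace tokenization instead of the tokenizer used by the actual pipeline, and illustrative field names rather than the pipeline's exact schema):

```python
import json

def chunk_article(text, passage_len=100):
    """Split a cleaned article into consecutive passages of ~passage_len
    whitespace tokens."""
    tokens = text.split()
    for start in range(0, len(tokens), passage_len):
        yield " ".join(tokens[start:start + passage_len])

def write_passages(articles, out_path):
    """`articles` is an iterable of (title, text) pairs from the cleaned dump;
    each passage is written as one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as out:
        pid = 0
        for title, text in articles:
            for passage in chunk_article(text):
                out.write(json.dumps({"id": pid, "title": title, "contents": passage}) + "\n")
                pid += 1
```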
6 changes: 0 additions & 6 deletions scripts/generate_gold_paragraphs.sh

This file was deleted.
