Update Queries (#19)
ToluClassics authored Apr 23, 2023
1 parent 17ffc45 commit 94c0278
Showing 35 changed files with 2,647 additions and 26 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -135,7 +135,7 @@ wandb/
 *.out
 dumps/
 collections/
-queries/
+queries/*
 tranlation_script/twi_translated.csv
 tranlation_script/test.csv

395 changes: 395 additions & 0 deletions LICENSE

69 changes: 50 additions & 19 deletions README.md
@@ -1,13 +1,63 @@
# AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages

[![CC BY 4.0][cc-by-shield]][cc-by]

This work is licensed under a
[Creative Commons Attribution 4.0 International License][cc-by].

[![CC BY 4.0][cc-by-image]][cc-by]

[cc-by]: http://creativecommons.org/licenses/by/4.0/
[cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png
[cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg


AfriQA is the first cross-lingual question answering (QA) dataset with a focus on African languages. The dataset includes over 12,000 XOR QA examples across 10 African languages, making it an invaluable resource for developing more equitable QA technology.
African languages have historically been underserved in the digital landscape, with far less in-language content available online. This makes it difficult for QA systems to provide accurate information to users in their native language. However, cross-lingual open-retrieval question answering (XOR QA) systems can help fill this gap by retrieving answer content from other languages.
AfriQA focuses specifically on African languages where cross-lingual answer content is the only high-coverage source of information. Previous datasets have primarily focused on languages where cross-lingual QA augments coverage from the target language, but AfriQA highlights the importance of African languages as a realistic use case for XOR QA.

## Languages

There are currently 10 languages covered in AfriQA:

- Bemba (bem)
- Fon (fon)
- Hausa (hau)
- Igbo (ibo)
- Kinyarwanda (kin)
- Swahili (swa)
- Twi (twi)
- Wolof (wol)
- Yorùbá (yor)
- Zulu (zul)

## Dataset Download

Question-answer pairs for each language and `train-dev-test` split are in the [data directory](data/queries) in `jsonlines` format.

- Dataset naming convention: `queries.afriqa.{lang_code}.{en/fr}.{split}.json`
- Data format:
  - `id`: question ID
  - `question`: question in the African language
  - `translated_question`: question translated into the pivot language (English/French)
  - `answers`: answers in the African language
  - `lang`: language of the datapoint (African language), e.g. `bem`
  - `split`: dataset split
  - `translated_answer`: answers in the pivot language
  - `translation_type`: translation type of the question and answers


```json
{
  "id": 0,
  "question": "Bushe icaalo ca Egypt caali tekwapo ne caalo cimbi?",
  "translated_question": "Has the country of Egypt been colonized before?",
  "answers": "['Emukwai']",
  "lang": "bem",
  "split": "dev",
  "translated_answer": "['yes']",
  "translation_type": "human_translation"
}
```
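
To work with these files, a minimal loading sketch is shown below (the files are JSON Lines, i.e. one JSON object per line, despite the `.json` extension; the example path points at one of the Swahili dev files added in this commit):

```python
import json

def load_queries(path):
    """Read an AfriQA query file: JSON Lines, one example per line."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))
    return examples

# Example usage:
queries = load_queries("data/queries/swa/queries.afriqa.swa.en.dev.json")
print(len(queries), queries[0]["question"], queries[0]["answers"])
```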

## Environment and Repository Setup

- Set up a virtual environment using Conda or Virtualenv or
@@ -48,25 +98,6 @@ To download:
- [French](https://huggingface.co/datasets/ToluClassics/masakhane_wiki_100/resolve/main/masakhane_wiki_100-french/corpus.jsonl)
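
To fetch the pre-processed corpus programmatically instead of via the direct links, something like the sketch below should work (it assumes the `huggingface_hub` package is installed; the repository id and file name are taken from the French link above):

```python
from huggingface_hub import hf_hub_download

# Download the pre-processed French passage corpus from the Hugging Face Hub.
# The English corpus can be fetched the same way from its link above.
corpus_path = hf_hub_download(
    repo_id="ToluClassics/masakhane_wiki_100",
    filename="masakhane_wiki_100-french/corpus.jsonl",
    repo_type="dataset",
)
print(corpus_path)  # local path of the cached corpus.jsonl
```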


If you would rather run the processing pipeline yourself, we adopt the same processing used in the [Dense Passage Retriever paper](https://arxiv.org/pdf/2004.04906.pdf).
The pipeline is bundled into this [script](scripts/download_process_dumps.sh), which you can run with the command below:

```bash
bash scripts/generate_process_dumps.sh /path/to/dir_containing_dumps
```

[This document](docs/process_wiki_dumps.md) provides a detailed breakdown of the individual steps.

## Retriever

### BM25

### mDPR

### Hybrid

## Reader


## BibTeX entry and citation info

417 changes: 417 additions & 0 deletions data/queries/swa/queries.afriqa.swa.en.dev.json

302 changes: 302 additions & 0 deletions data/queries/swa/queries.afriqa.swa.en.test.json

415 changes: 415 additions & 0 deletions data/queries/swa/queries.afriqa.swa.en.train.json

361 changes: 361 additions & 0 deletions data/queries/yor/queries.afriqa.yor.en.dev.json

332 changes: 332 additions & 0 deletions data/queries/yor/queries.afriqa.yor.en.test.json

360 changes: 360 additions & 0 deletions data/queries/yor/queries.afriqa.yor.en.train.json

14 changes: 14 additions & 0 deletions docs/process_wiki_dumps.md
@@ -1,5 +1,19 @@
# Processing Wiki Dumps

The English and French passages for this project are drawn from Wikipedia snapshots of 2022-05-01 and 2022-04-20, respectively, and are downloaded from the [Internet Archive](https://archive.org/) to enable open-domain experiments.
The raw documents can be downloaded from the following URLs:

- https://archive.org/download/enwiki-20220501/enwiki-20220501-pages-articles-multistream.xml.bz2
- https://archive.org/download/frwiki-20220420/frwiki-20220420-pages-articles-multistream.xml.bz2

To run the Wikipedia processing pipeline, we adopt the same processing used in the [Dense Passage Retriever paper](https://arxiv.org/pdf/2004.04906.pdf) and in [Pre-processing Matters! Improved Wikipedia Corpora for Open-Domain Question Answering](https://link.springer.com/chapter/10.1007/978-3-031-28241-6_11).

The pipeline is bundled into this [script](scripts/download_process_dumps.sh), which you can run with the command below:

```bash
bash scripts/generate_process_dumps.sh /path/to/dir_containing_dumps
```

This document describes how to convert the downloaded XML Wikipedia dumps into 100-token passages stored in JSON files.

For processing, we extract the Wikipedia articles into multiple JSON Lines files. The articles are then preprocessed, cleaned, and stored in a SQLite database, after which they are chunked into 100-token passages.
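
As a rough illustration of that final chunking step, here is a simplified sketch (whitespace tokenization instead of the tokenizer used by the actual pipeline, and illustrative field names rather than the pipeline's exact schema):

```python
import json

def chunk_article(text, passage_len=100):
    """Split a cleaned article into consecutive passages of ~passage_len
    whitespace tokens."""
    tokens = text.split()
    for start in range(0, len(tokens), passage_len):
        yield " ".join(tokens[start:start + passage_len])

def write_passages(articles, out_path):
    """`articles` is an iterable of (title, text) pairs from the cleaned dump;
    each passage is written as one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as out:
        pid = 0
        for title, text in articles:
            for passage in chunk_article(text):
                out.write(json.dumps({"id": pid, "title": title, "contents": passage}) + "\n")
                pid += 1
```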
6 changes: 0 additions & 6 deletions scripts/generate_gold_paragraphs.sh

This file was deleted.
