ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages
ChroniclingAmetricaQA, is a large-scale question-answering dataset comprising question-answer pairs over a collection of historical American newspapers to facilitate the development of QA and MRC systems over historical texts.
Structured as JSON files, the ChricinclingAmericaQA dataset includes train.json
, dev.json
, and test.json
for training, validation, and testing phases, respectively.
- Data Structure:
[
{
"query_id": "",
"question": "",
"answer": "",
"org_answer": "",
"para_id": "",
"context": "",
"raw_ocr": "",
"publication_date": "",
"trans_que": "",
"trans_ans": "",
"url": ""
}
]
Training | Development | Test | |
---|---|---|---|
Num. of Questions | 439,302 | 24,111 | 24,084 |
If you find the dataset helpful, please consider citing our paper.
@inproceedings{10.1145/3626772.3657891,
author = {Piryani, Bhawna and Mozafari, Jamshid and Jatowt, Adam},
title = {ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages},
year = {2024},
isbn = {9798400704314},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3626772.3657891},
doi = {10.1145/3626772.3657891},
booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2038–2048},
numpages = {11},
keywords = {heritage collections, ocr text, question answering},
location = {Washington DC, USA},
series = {SIGIR '24}
}
This project is licensed under the MIT License - see the LICENSE file for details.