ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages

DataScienceUIBK/ChroniclingAmericaQA

Huggingface

ChroniclingAmericaQA is a large-scale question-answering dataset of question-answer pairs built over a collection of historical American newspapers. It is intended to facilitate the development of QA and machine reading comprehension (MRC) systems for historical texts.

Download Links

Dataset

The ChroniclingAmericaQA dataset is distributed as three JSON files: train.json, dev.json, and test.json, used for training, validation, and testing, respectively.

  • Data Structure:
[
    {
        "query_id": "",
        "question": "",
        "answer": "",
        "org_answer": "",
        "para_id": "",
        "context": "",
        "raw_ocr": "",
        "publication_date": "",
        "trans_que": "",
        "trans_ans": "",
        "url": ""
    }
]
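
As a minimal sketch of how a split file with this layout could be read, the snippet below loads one JSON file into a list of records and reports which of the listed fields a record lacks. The helper names `load_split` and `missing_fields` are hypothetical and not part of the repository; the field names are copied from the structure above, and the file is assumed to be a single top-level JSON array as shown.

```python
import json
from pathlib import Path

# Field names copied from the record layout shown above.
EXPECTED_FIELDS = {
    "query_id", "question", "answer", "org_answer", "para_id", "context",
    "raw_ocr", "publication_date", "trans_que", "trans_ans", "url",
}


def load_split(path):
    """Load one split file (assumed to be a JSON array of records)."""
    with Path(path).open(encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError(f"{path}: expected a top-level JSON array")
    return records


def missing_fields(record):
    """Return the expected field names that a record lacks, sorted."""
    return sorted(EXPECTED_FIELDS - set(record))
```

With the files downloaded locally, `load_split("train.json")` would return the training records, and `missing_fields(records[0])` should return an empty list if the first record follows the layout above.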

Dataset Statistics

                    Training   Development   Test
Num. of Questions   439,302    24,111        24,084
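
The split sizes above can be turned into a total and per-split shares with a few lines of arithmetic. This is only an illustrative sketch: the names `SPLIT_SIZES` and `split_shares` are hypothetical, and the keys assume that the Training, Development, and Test columns correspond to train.json, dev.json, and test.json.

```python
# Question counts taken from the statistics table above.
SPLIT_SIZES = {"train": 439_302, "dev": 24_111, "test": 24_084}


def split_shares(sizes):
    """Return each split's fraction of the total number of questions."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total = sum(SPLIT_SIZES.values())
for name, share in split_shares(SPLIT_SIZES).items():
    print(f"{name}: {SPLIT_SIZES[name]:,} questions ({share:.1%})")
print(f"total: {total:,} questions")
```

The three splits sum to 487,497 questions, which works out to roughly a 90/5/5 division between training, development, and test.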

Citation

If you find the dataset helpful, please consider citing our paper.

@inproceedings{10.1145/3626772.3657891,
author = {Piryani, Bhawna and Mozafari, Jamshid and Jatowt, Adam},
title = {ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages},
year = {2024},
isbn = {9798400704314},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3626772.3657891},
doi = {10.1145/3626772.3657891},
booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2038–2048},
numpages = {11},
keywords = {heritage collections, ocr text, question answering},
location = {Washington DC, USA},
series = {SIGIR '24}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
