GitHub - Update-For-Integrated-Business-AI/CORU

CORU: Comprehensive Post-OCR Parsing and Receipt Understanding Dataset

In the fields of Optical Character Recognition (OCR) and Natural Language Processing (NLP), integrating multilingual capabilities remains a critical challenge, especially when considering languages with complex scripts such as Arabic. This paper introduces the Comprehensive Post-OCR Parsing and Receipt Understanding Dataset (CORU), a novel dataset specifically designed to enhance OCR and information extraction from receipts in multilingual contexts involving Arabic and English. CORU consists of over 20,000 annotated receipts from diverse retail settings in Egypt, including supermarkets and clothing stores, alongside 30,000 annotated images for OCR that were utilized to recognize each detected line, and 10,000 items annotated for detailed information extraction. These annotations capture essential details such as merchant names, item descriptions, total prices, receipt numbers, and dates. They are structured to support three primary computational tasks: object detection, OCR, and information extraction. We establish the baseline performance for a range of models on CORU to evaluate the effectiveness of traditional methods, like Tesseract OCR, and more advanced neural network-based approaches. These baselines are crucial for processing the complex and noisy document layouts typical of real-world receipts and for advancing the state of automated multilingual document processing.

Dataset Overview

CORU is divided into three challenges:

Key Information Detection
Large-Scale OCR Dataset
Item Information Extraction

Dataset Statistics

Category	Training	Validation	Test
Object Detection	12,600	3,700	3,700
OCR	21,000	4,500	4,500
IE	7,000	1,500	1,500
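
As a quick sanity check, the split sizes in the table can be summed per task and compared with the totals quoted in the abstract (about 20,000 receipts for detection, 30,000 OCR images, 10,000 items for IE). The following is a minimal sketch; the names `SPLITS`, `summarize`, and `SUMMARY` are illustrative and are not part of the CORU release.

```python
# Split sizes copied from the Dataset Statistics table above.
# The dictionary layout and key names are assumptions made for this sketch.
SPLITS = {
    "Object Detection": {"train": 12_600, "val": 3_700, "test": 3_700},
    "OCR": {"train": 21_000, "val": 4_500, "test": 4_500},
    "IE": {"train": 7_000, "val": 1_500, "test": 1_500},
}


def summarize(splits):
    """Return the total size and per-split fraction for each task."""
    summary = {}
    for task, counts in splits.items():
        total = sum(counts.values())
        fractions = {name: n / total for name, n in counts.items()}
        summary[task] = {"total": total, "fractions": fractions}
    return summary


SUMMARY = summarize(SPLITS)

for task, info in SUMMARY.items():
    shares = ", ".join(f"{name} {frac:.1%}" for name, frac in info["fractions"].items())
    print(f"{task}: {info['total']:,} total ({shares})")
```

Under these numbers, the OCR and IE tasks follow a 70/15/15 split, while Object Detection uses 63/18.5/18.5.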

Sample Images from the Dataset

Here are five examples from the dataset, showcasing the variety of receipts included:

Download Links

Key Information Detection

We will continue uploading updated receipts.
Training Set: Download (⚠️ Note: This link was updated on August 28, 2024, to include receipts with redacted PII.)
Validation Set: Download (⚠️ Note: This link was updated on August 11, 2024, to include receipts with redacted PII.)
Test Set: Download (⚠️ Note: This link was updated on August 30, 2024, to include receipts with redacted PII.)

OCR Dataset

Training Set: Download
Validation Set: Download
Test Set: Download

Item Information Extraction

Training Set: Download
Validation Set: Download
Test Set: Download

Citation

If you find this code or data useful, please consider citing our paper:

@misc{abdallah2024coru,
    title={CORU: Comprehensive Post-OCR Parsing and Receipt Understanding Dataset},
    author={Abdelrahman Abdallah and Mahmoud Abdalla and Mahmoud SalahEldin Kasem and Mohamed Mahmoud and Ibrahim Abdelhalim and Mohamed Elkasaby and Yasser ElBendary and Adam Jatowt},
    year={2024},
    eprint={2406.04493},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
