DocGenome: A Large Benchmark for Multi-Modal Language Models in Real-World Academic Document Understanding

We present DocGenome, a structured document dataset constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community, using our custom auto-labeling pipeline DocParser. DocGenome features four characteristics:

1. Completeness: It is the first dataset to structure data from all modalities including 13 layout attributes along with their \LaTeX\ source codes.
1. Logicality: It provides 6 logical relationships between different entities within each scientific document.
1. Diversity: It covers various document-oriented tasks, including document classification, visual grounding, document layout detection, document transformation, open-ended single-page QA and multi-page QA.
1. Correctness: It undergoes rigorous quality control checks conducted by a specialized team.

👇🏻DocGenome-train Download

DocGenome-train

👇🏻DocGenome-Test Download

DocGenome-test

Please refer to Evaluation_README for the evaluation details.

File Structure

Please refer to Dataset_Details_README for the detalied explanation about the different file structures in DocGenome.

DocGenome Benchmark Introduction

Datasets	# Discipline	# Category of Units	# Pages in Train-set	# Pages in Test-set	# Task	# Used Metric	Publication	Entity Relations

DocVQA	-	N/A	11K	1K	1	2	1960-2000	❎
DocLayNet	-	11	80K	8K	1	1	-	❎
DocBank	-	13	0.45M	50K	3	1	2014-2018	❎
PubLayNet	-	5	0.34M	12K	1	1	-	❎
VRDU	-	10	7K	3K	3	1	-	❎
DUDE	-	N/A	20K	6K	3	3	1860-2022	❎
D^4LA	-	27	8K	2K	1	3	-	❎
Fox Benchmark	-	5	N/A (No train-set)	0.2K	3	5	-	❎
ArXivCap	32	N/A	6.4M*	N/A	4	3	-	❎
DocGenome (ours)	153	13	6.8M	9K	7	7	2007-2022	✅

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
assets		assets
docs		docs
Dataset_Details_README.md		Dataset_Details_README.md
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DocGenome: A Large Benchmark for Multi-Modal Language Models in Real-World Academic Document Understanding

👇🏻DocGenome-train Download

👇🏻DocGenome-Test Download

File Structure

DocGenome Benchmark Introduction

About

Releases

Packages

License

renqiux0302/DocGenome

Folders and files

Latest commit

History

Repository files navigation

DocGenome: A Large Benchmark for Multi-Modal Language Models in Real-World Academic Document Understanding

👇🏻DocGenome-train Download

👇🏻DocGenome-Test Download

File Structure

DocGenome Benchmark Introduction

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Packages