This repository contains data and simple scripts accompanying the "CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data" paper.
The data provided here is a subset of the data made public by the Common Crawl organization; see https://commoncrawl.org/2022/06/may-2022-crawl-archive-now-available/
ccpdf.tsv - metadata of CCpdf files
run.sh - main script for downloading CCpdf files from publicly available sources
download-from-crawl.sh - script for the actual downloading
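
Before running the download scripts, it can help to look at the metadata file first. The following is a minimal sketch for previewing ccpdf.tsv; it assumes only that the file is tab-separated UTF-8 text and makes no assumption about its columns or whether it has a header row. The function name `preview_tsv` is hypothetical and not part of this repository.

```python
import csv
from itertools import islice


def preview_tsv(path, n=5):
    """Return the first n rows of a tab-separated file as lists of strings.

    Hypothetical helper: the column layout of ccpdf.tsv is not assumed here,
    so rows are returned as raw fields without interpreting them.
    """
    with open(path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps quote characters in fields as literal text.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(islice(reader, n))
```

Calling `preview_tsv("ccpdf.tsv")` from the repository root returns the first five rows, which should be enough to see how many fields each record has and what they contain.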