Data and scripts accompanying the CCpdf paper

The data provided here is a subset of the data made public by the Common Crawl organization; see https://commoncrawl.org/2022/06/may-2022-crawl-archive-now-available/

Files

ccpdf.tsv.xz: metadata of CCpdf files (xz-compressed TSV)
run.sh: main script for downloading CCpdf files from publicly available sources
download-from-crawl.sh: script that performs the actual downloading
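
Before running the download scripts, it can help to look at the metadata file. The following is a minimal sketch, assuming the metadata ships as an xz-compressed, tab-separated text file named `ccpdf.tsv.xz`. The function name `preview_tsv` is hypothetical and not part of this repository, and the column layout is not assumed: rows are returned as plain lists of strings, and the first row may or may not be a header.

```python
import csv
import lzma
from itertools import islice


def preview_tsv(path, limit=5):
    """Return the first `limit` rows of a tab-separated file as lists of strings.

    Paths ending in ".xz" are decompressed on the fly; other paths are read
    as plain text. No assumption is made about the meaning of the columns.
    """
    opener = lzma.open if str(path).endswith(".xz") else open
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE treats quote characters as ordinary data.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(islice(reader, limit))
```

Calling `preview_tsv("ccpdf.tsv.xz")` from the repository directory would then return the first five rows, which should be enough to see which columns the file actually contains.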
