Browsing CommonCrawl for Swedish government's PDF reports

ESVdatalabb/myndighetscrawl

Myndighetscrawl

Myndighetscrawl is an attempt to extract information about Swedish government agencies from CommonCrawl. More specifically, it iterates over the agencies' domain names and searches the CommonCrawl index for links to PDF documents hosted on those domains.
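
The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code in run.py: the collection id in `INDEX_ENDPOINT`, the helper names `build_query_url` and `parse_pdf_links`, and the domain `example.se` are all assumptions made for the example. It assumes the CommonCrawl index server returns one JSON record per line with `url` and `mime` fields.

```python
import json
from urllib.parse import urlencode

# Assumption: one example CommonCrawl collection; the project may query others.
INDEX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"


def build_query_url(domain, endpoint=INDEX_ENDPOINT):
    """Build an index query for PDF captures on a domain and its subdomains."""
    params = {
        "url": f"*.{domain}",              # wildcard prefix: include subdomains
        "output": "json",                  # one JSON record per line
        "filter": "mime:application/pdf",  # ask the server for PDF captures only
    }
    return f"{endpoint}?{urlencode(params)}"


def parse_pdf_links(body):
    """Return the URLs of PDF records from a line-delimited JSON response body."""
    links = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Re-check the MIME type locally in case the server filter was loose.
        if record.get("mime") == "application/pdf":
            links.append(record["url"])
    return links
```

Fetching `build_query_url("example.se")` with any HTTP client and passing the response text to `parse_pdf_links` would then give the list of PDF links for that (hypothetical) agency domain.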

Because the index includes links that were fetched more than 10 years ago, some of them are no longer valid.
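
One way to filter out dead links is to send a HEAD request to each one before using it. The sketch below is an assumption about how this could be done, not something the project is known to do; `is_live` is a hypothetical helper, and the `opener` parameter exists only so the check can be exercised without network access.

```python
import urllib.error
import urllib.request


def is_live(url, opener=urllib.request.urlopen, timeout=10):
    """Return True if a HEAD request to url succeeds with a 2xx or 3xx status."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, TimeoutError, ValueError):
        # HTTP errors (4xx/5xx), DNS failures, timeouts and malformed URLs
        # are all treated as "link is not valid any more".
        return False
```

Some servers reject HEAD requests even for documents that still exist, so a False result is a hint rather than proof that the PDF is gone.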

Set up

pip install -r requirements.txt

Usage

python run.py
