
ASTRA scraper

Contains a simple scraper for the FEDRO website. With a few tweaks, you should be able to run it on the websites of most federal departments. It grabs everything it can find on the website itself and on linked Fedlex data fields.

How to use it

Using a Docker container (for a full copy of FEDRO data)

To do this for the FEDRO website, clone the repo and run the following commands in its directory:

docker build -t name_of_your_image .
docker run --rm -it -v $(pwd)/path_to_your_folder:/crawler/crawled_data name_of_your_image
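
For example, assuming you name the image kg-crawler and want the output in ./crawled_data (both names are arbitrary):

docker build -t kg-crawler .
docker run --rm -it -v $(pwd)/crawled_data:/crawler/crawled_data kg-crawler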

Using it directly

Set up your virtual environment and run

pip install -r ./requirements.txt
python crawly.py --write_dir='path_to_your_write_dir'

along with any optional arguments you need.
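
If crawly.py parses its flags with argparse (an assumption; check the script for the exact set of options), the available arguments can be listed with

python crawly.py --help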

What it does

  1. Scrapes pages in a sequential manner (not very efficient, but sufficient for a single base page)

  2. Stores the following filetypes

    • .pdf
    • .html
    • .xml
    • legal documents
    • .zip
    • Excel (xlsx, xls)
    • Word (docx, doc, dotx)
    • PowerPoint (pptx, ppt)
    • images (jpg, png, mpg)
    • CAD files (dxf, dwg)
  3. HTML and legal texts are stored as BeautifulSoup objects

  4. Keeps a Python "knowledge" dictionary, a dict that contains entries like the following (see the first sketch after this list for how an entry could be built):

    url: {
      "storage_location": path_where_file_is_stored_on_machine,
      "hash": a hash of the content (makes it easier to keep track of changes),
      "neighbours": [a list of all urls of neighbours]
    }
    
  5. For legal documents, there is an additional crawler that uses the Fedlex SPARQL endpoint to collect the full set of legal texts along with the dependencies specified by the JoLux model (see the SPARQL sketch after this list).
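
The snippet below is a minimal sketch (not the repo's actual API) of how such a knowledge entry could be built; the helper name make_entry and the example URLs are purely illustrative.

import hashlib

def make_entry(url, content, storage_location, neighbours):
    # hash the raw bytes so changes on the remote side are easy to detect
    return {
        url: {
            "storage_location": storage_location,
            "hash": hashlib.sha256(content).hexdigest(),
            "neighbours": neighbours,
        }
    }

knowledge = {}
knowledge.update(make_entry(
    "https://www.astra.admin.ch/astra/en/home.html",  # illustrative URL
    b"<html>...</html>",                              # raw page content
    "crawled_data/home.html",                         # where the file was written
    ["https://www.astra.admin.ch/astra/en/home/themen.html"],
))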
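
For the legal part, a minimal sketch of querying the public Fedlex SPARQL endpoint with SPARQLWrapper is shown below. The endpoint URL, the jolux: prefix and the jolux:Act class are assumptions made for illustration; the queries the crawler actually runs live in src/legal/sparqlqueries.py.

from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://fedlex.data.admin.ch/sparqlendpoint"  # public Fedlex endpoint (assumed)

QUERY = """
PREFIX jolux: <http://data.legilux.public.lu/resource/ontology/jolux#>
SELECT ?act
WHERE { ?act a jolux:Act . }
LIMIT 10
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["act"]["value"])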

Get the docs

Docs can be recreated with Sphinx from the docs source (if need be):

cd ./docs
sphinx-build -b html source build
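
With the default Sphinx layout, the generated pages end up in ./docs/build; open index.html there to browse them.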

Overview

The directory tree of the repo is outlined below:

.
├── README.md
├── crawly.py
├── requirements.txt
└── src
    ├── __init__.py
    ├── legal
    │   ├── __init__.py
    │   ├── helpers.py
    │   └── sparqlqueries.py
    ├── scraper.py
    └── utils
        ├── __init__.py
        └── adminlink.py
