Contains a simple scraper for the FEDRO webpage. With a few tweaks, you should be able to run it on most webpages of federal departments. It essentially grabs everything it can find on the webpage itself and on linked Fedlex data fields.
To do this for the FEDRO webpage, clone the repo and run the following commands in the repository directory:
docker build -t name_of_your_image .
docker run --rm -it -v $(pwd)/path_to_your_folder:/crawler/crawled_data name_of_your_image
Alternatively, to run the scraper locally, set up your virtual environment and run
pip install -r ./requirements.txt
python crawly.py --write_dir='path_to_your_write_dir'
plus any optional arguments.
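For reference, here is a minimal, hypothetical sketch of how a `--write_dir` flag like the one above can be wired up with argparse. Only the flag name comes from the invocation above; everything else (help text, directory creation) is illustrative and not necessarily how crawly.py implements it.

```python
# Illustrative only: a minimal argparse setup for a --write_dir flag like the
# one used above. The actual option handling in crawly.py may differ.
import argparse
import pathlib

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Crawl a federal webpage.")
    parser.add_argument(
        "--write_dir",
        required=True,
        help="Directory where crawled files are written.",
    )
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    # Make sure the target directory exists before the crawl starts.
    pathlib.Path(args.write_dir).mkdir(parents=True, exist_ok=True)
    print(f"Writing crawled data to {args.write_dir}")
```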
- Scrapes pages in a sequential manner (not very efficient, but it suffices for a single base page); a minimal crawl sketch follows this list
- Stores the following filetypes:
- .html
- .xml
- legal documents
- .zip
- excel (xlsx, xls)
- word (docx, doc, dotx)
- powerpoint (pptx, ppt)
- images (jpg, png, mpg)
- CAD formats (dxf, dwg)
- HTML and legal texts are stored as BeautifulSoup objects
- Keeps a Python "knowledge" dictionary, a dict keyed by URL whose entries look like the following (see the sketch after this list):
  - "storage_location": the path where the file is stored on the machine
  - "hash": a hash of the content (makes it easier to keep track of changes)
  - "neighbours": a list of the URLs of all neighbours
- For legal documents, there is an additional crawler that uses the Fedlex SPARQL endpoint to collect the full set of legal texts together with the dependencies specified by the JoLux model (a minimal SPARQL sketch follows this list)
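To make the sequential crawl and the knowledge dictionary concrete, here is a minimal sketch. It is illustrative only: the start URL, the file naming, the page limit, and the same-host rule are assumptions for this example, not the behaviour of src/scraper.py.

```python
# Illustrative sketch of a sequential crawl that builds the "knowledge" dict
# described above. Not the actual src/scraper.py implementation; the start URL,
# file naming, page limit and same-host rule are assumptions for this example.
import hashlib
import pathlib
import urllib.parse

import requests
from bs4 import BeautifulSoup

START_URL = "https://www.astra.admin.ch/astra/en/home.html"  # assumed entry point
WRITE_DIR = pathlib.Path("crawled_data")
WRITE_DIR.mkdir(exist_ok=True)
MAX_PAGES = 50  # keep the example small

knowledge: dict[str, dict] = {}
queue = [START_URL]

while queue and len(knowledge) < MAX_PAGES:
    url = queue.pop(0)  # FIFO queue -> plain sequential crawl
    if url in knowledge:
        continue
    content = requests.get(url, timeout=30).content

    # Store the raw page and remember where it went.
    path = WRITE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    path.write_bytes(content)

    # Collect all links ("neighbours") found on the page.
    soup = BeautifulSoup(content, "html.parser")
    neighbours = [
        urllib.parse.urljoin(url, a["href"])
        for a in soup.find_all("a", href=True)
    ]

    knowledge[url] = {
        "storage_location": str(path),
        "hash": hashlib.sha256(content).hexdigest(),  # for change tracking
        "neighbours": neighbours,
    }

    # Only follow links on the same host, so the crawl stays on the base page.
    queue.extend(
        n for n in neighbours
        if urllib.parse.urlparse(n).netloc == urllib.parse.urlparse(url).netloc
    )
```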
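The Fedlex side can be explored with any SPARQL client; the crawler's actual queries live in src/legal/sparqlqueries.py. Below is a minimal sketch using SPARQLWrapper; the endpoint URL and the deliberately trivial query are assumptions for illustration, while the real queries resolve JoLux dependencies instead.

```python
# Illustrative only: query the Fedlex SPARQL endpoint with SPARQLWrapper.
# The endpoint URL and the trivial query are assumptions for demonstration;
# the crawler's real (JoLux-aware) queries live in src/legal/sparqlqueries.py.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://fedlex.data.admin.ch/sparqlendpoint"  # assumed public endpoint

sparql = SPARQLWrapper(ENDPOINT)
sparql.setReturnFormat(JSON)
# A deliberately trivial query: fetch a handful of triples to check connectivity.
sparql.setQuery("""
    SELECT ?s ?p ?o
    WHERE { ?s ?p ?o }
    LIMIT 10
""")

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["s"]["value"], binding["p"]["value"], binding["o"]["value"])
```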
The docs can be rebuilt with Sphinx from the doc sources (if need be):
cd ./docs
sphinx-build -b html source build
The directory tree of the repo is outlined below:
.
├── README.md
├── crawly.py
├── requirements.txt
└── src
├── __init__.py
├── legal
│ ├── __init__.py
│ ├── helpers.py
│ └── sparqlqueries.py
├── scraper.py
└── utils
├── __init__.py
└── adminlink.py