This is a Kedro project to classify documents using various text- and image-based models.
You need a recent version of Python (3.8+ should work).
Install all system dependencies as defined in django-filingcabinet's `default.nix`.

Then install the dependencies using

```bash
pip install -r src/requirements.txt
```
The input data needs to be placed in `data/01_raw`.
The folders in `data/` follow the layered data engineering convention.
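For reference, the layered convention typically produces a tree like the one below; this project at least uses `data/01_raw` for inputs and `data/06_models` for trained models (see the prediction commands further down). The exact set of folders may differ:

```
data/
├── 01_raw            # place input documents here
├── 02_intermediate
├── 03_primary
├── 04_feature
├── 05_model_input
├── 06_models         # trained classifier/clustering models
├── 07_model_output
└── 08_reporting
```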
A script to download the annotated documents from the fcdocs-annotate API is provided in `scripts/download_data.py`. Assuming your server runs on 127.0.0.1:8000, you can use the following command:

```bash
python scripts/download_data.py --document-endpoint http://127.0.0.1:8000/api/document/ --feature-endpoint http://127.0.0.1:8000/api/feature/
```
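If you want to fetch the documents programmatically instead, a minimal sketch of such a download might look like the following. This is not the actual logic of `scripts/download_data.py`; it assumes the endpoints return DRF-style paginated JSON with `results` and `next` fields, so check the fcdocs-annotate API for the real schema:

```python
import requests

def fetch_all(endpoint: str) -> list:
    """Collect all items from an API endpoint.

    Assumes DRF-style pagination, i.e. JSON of the form
    {"results": [...], "next": "<url or null>"}.
    """
    items = []
    url = endpoint
    while url:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        payload = response.json()
        items.extend(payload["results"])
        url = payload.get("next")  # None on the last page
    return items

documents = fetch_all("http://127.0.0.1:8000/api/document/")
features = fetch_all("http://127.0.0.1:8000/api/feature/")
print(f"Fetched {len(documents)} documents and {len(features)} features")
```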
This project was developed by the FragDenStaat team at Open Knowledge Foundation Deutschland e.V. FragDenStaat provides a simple interface for making and publishing freedom-of-information requests to public bodies.
You can use the following script to download a set of attachments from the FragDenStaat.de API:

```bash
python scripts/download_data_fds.py
```
> ℹ️ Also see the Configuration section
The project currently consists of three pipelines:

- `data_processing` (dp): Cleans the input data and calculates some features from it
- `classifier` (cf): Trains and evaluates a classification model
- `clustering` (cs): Trains and evaluates a clustering model
You can run them all using

```bash
kedro run
```
To run only one of the pipelines, add the `--pipeline` parameter with the short name of a pipeline (`dp`, `cf`, `cs`).
For example, to run only the classifier pipeline, use

```bash
kedro run --pipeline cf
```
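The short names work because they are registered as keys in the project's pipeline registry. A minimal sketch of how that typically looks in a Kedro project follows; module paths and function names here are illustrative, not necessarily this project's:

```python
# src/<package_name>/pipeline_registry.py (illustrative sketch)
from typing import Dict

from kedro.pipeline import Pipeline

# Hypothetical imports; the real module paths may differ.
from .pipelines import classifier, clustering, data_processing

def register_pipelines() -> Dict[str, Pipeline]:
    dp = data_processing.create_pipeline()
    cf = classifier.create_pipeline()
    cs = clustering.create_pipeline()
    return {
        "dp": dp,
        "cf": cf,
        "cs": cs,
        "__default__": dp + cf + cs,  # what a plain `kedro run` executes
    }
```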
> ℹ️ You need to run the data processing pipeline at least once before running the model pipelines.
You can configure the models used and their parameters in the `conf/base/parameters.yml` file.
By default, the data processing pipeline uses 4 threads for PDF conversion.
If you have more CPU cores, you can change this number by creating a `conf/local/parameters.yml` file with the following content (replace YOUR_NUMBER_OF_WORKERS):

```yaml
data_processing:
  max_workers: YOUR_NUMBER_OF_WORKERS
```
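As a sketch of how such a parameter typically reaches a node (the actual conversion code in this project may look different), the value could simply size a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

def convert_single_pdf(path: str) -> str:
    """Placeholder for the real per-file conversion (text/image extraction)."""
    ...

def convert_pdfs(pdf_paths: List[str], max_workers: int = 4) -> List[str]:
    # `max_workers` would be injected from data_processing.max_workers
    # in parameters.yml; 4 matches the documented default.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(convert_single_pdf, pdf_paths))
```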
You can run tests (see `src/tests/`) with

```bash
kedro test
```
> Note: We currently don't have tests.
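Should tests be added later, a minimal pytest-style example under `src/tests/` might look like this; the helper below is hypothetical and stands in for a real function from the pipelines:

```python
# src/tests/test_data_processing.py (illustrative sketch)

def clean_text(text: str) -> str:
    """Hypothetical helper; replace with a real import once tests exist."""
    return " ".join(text.split())

def test_clean_text_normalizes_whitespace():
    assert clean_text("  hello \n world ") == "hello world"
```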
To generate or update the dependency requirements, run:

```bash
kedro build-reqs
```

This will `pip-compile` the contents of `src/requirements.txt` into a new file `src/requirements.lock`. You can see the output of the resolution by opening `src/requirements.lock`.
Further information about project dependencies
> Note: Using `kedro jupyter` or `kedro ipython` to run your notebook provides these variables in scope: `context`, `catalog`, and `startup_error`.

Jupyter, JupyterLab, and IPython are already included in the project requirements by default, so once you have run

```bash
pip install -r src/requirements.txt
```

you will not need to take any extra steps before you use them.
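Inside such a session, the `catalog` object gives direct access to the datasets defined in the project's catalog configuration, for example:

```python
# Available in a `kedro jupyter` / `kedro ipython` session:
catalog.list()                        # names of all registered datasets
data = catalog.load("example_data")   # "example_data" is a hypothetical name
context.params                        # parameters from conf/*/parameters.yml
```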
To use Jupyter notebooks in your Kedro project, you need to install Jupyter:

```bash
pip install jupyter
```

After installing Jupyter, you can start a local notebook server:

```bash
kedro jupyter notebook
```

To use JupyterLab, you need to install it:

```bash
pip install jupyterlab
```

You can also start JupyterLab:

```bash
kedro jupyter lab
```

And if you want to run an IPython session:

```bash
kedro ipython
```
You can move notebook code over into a Kedro project structure using a mixture of cell tagging and Kedro CLI commands.

By adding the `node` tag to a cell and running the command below, the cell's source code will be copied over to a Python file within `src/<package_name>/nodes/`:

```bash
kedro jupyter convert <filepath_to_my_notebook>
```
> Note: The name of the Python file matches the name of the original notebook.
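For example, a notebook cell tagged `node` (via the notebook's cell tag editor) containing the following would be copied into `src/<package_name>/nodes/<notebook_name>.py`:

```python
# Contents of a notebook cell tagged `node`.
# Hypothetical example; any function defined in a tagged cell
# is copied over verbatim by `kedro jupyter convert`.
def count_pages(document: dict) -> int:
    return len(document["pages"])
```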
Alternatively, you may want to transform all your notebooks in one go. Run the following command to convert all notebook files found in the project root directory and under any of its sub-folders:

```bash
kedro jupyter convert --all
```
To automatically strip out all output cell contents before committing to `git`, you can run `kedro activate-nbstripout`. This will add a hook in `.git/config` which will run `nbstripout` before anything is committed to `git`.
> Note: Your output cells will be retained locally.
Take a look at the Kedro documentation to get started.
Further information about building project documentation and packaging your project
After you have trained a classification model with `kedro run --pipeline cf`, you can test it on some PDFs using

```bash
kedro predict-with-classifier data/06_models/classifier/ YOUR_PDF1 [YOUR_PDF2 ...]
```
This will load the newest version of your model and make a prediction on your PDF files.
If you want to use a specific version of your model, you can specify it using the `--load-version` option:

```bash
kedro predict-with-classifier data/06_models/classifier/ --load-version 2022-07-08T21.22.07.918Z YOUR_PDFS
```
After you have trained a clustering model with `kedro run --pipeline cs`, you can test it on some PDFs using

```bash
kedro predict-with-clustering data/06_models/clustering/ YOUR_PDF1 [YOUR_PDF2 ...]
```
This will load the newest version of your model and make a prediction on your PDF files.
If you want to use a specific version of your model, you can specify it using the `--load-version` option:

```bash
kedro predict-with-clustering data/06_models/clustering/ --load-version 2022-07-08T21.22.07.918Z YOUR_PDFS
```
To package a model into a single file, which can be imported into fcdocs-annotate, you can use the `package-model` subcommand:

```bash
kedro package-model data/06_models/clustering/ OUTPUT_FILENAME
```