Inspect ML Pipelines in Python in the form of a DAG
Prerequisite: Python 3.10
- Clone this repository
- Set up the environment
cd mlinspect
python -m venv venv
source venv/bin/activate
- If you want to use the visualisation functions we provide, install graphviz, which cannot be installed via pip
Linux:
apt-get install graphviz
macOS:
brew install graphviz
- Install pip dependencies
SETUPTOOLS_USE_DISTUTILS=stdlib pip install -e ".[dev]"
- To ensure everything works, you can run the tests (without graphviz, the visualisation test will fail)
python setup.py test
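Because the visualisation test depends on graphviz, it can help to confirm that graphviz is reachable before running the tests. The following is a minimal sketch, assuming the graphviz installation puts its `dot` executable on your PATH; `graphviz_available` is a hypothetical helper written for this example and is not part of mlinspect.

```python
import shutil


def graphviz_available() -> bool:
    """Return True if the Graphviz `dot` executable is found on PATH."""
    # Assumption: a system-level graphviz install exposes `dot` on PATH.
    return shutil.which("dot") is not None


if graphviz_available():
    print("graphviz found: the visualisation test should be able to run")
else:
    print("graphviz not found: expect the visualisation test to fail")
```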
mlinspect makes it easy to analyze your pipeline and automatically check for common issues.
from mlinspect import PipelineInspector
from mlinspect.inspections import MaterializeFirstOutputRows
from mlinspect.checks import NoBiasIntroducedFor
IPYNB_PATH = ...
inspector_result = PipelineInspector\
.on_pipeline_from_ipynb_file(IPYNB_PATH)\
.add_required_inspection(MaterializeFirstOutputRows(5))\
.add_check(NoBiasIntroducedFor(['race']))\
.execute()
extracted_dag = inspector_result.dag
dag_node_to_inspection_results = inspector_result.dag_node_to_inspection_results
check_to_check_results = inspector_result.check_to_check_results
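The three result attributes above can then be walked to produce a quick overview. The following is a rough sketch, assuming that `dag_node_to_inspection_results` behaves like a dict mapping each DAG node to a dict of inspection results, and that `check_to_check_results` behaves like a dict mapping each check to its result. `summarize_results` and the stand-in data are hypothetical and only illustrate the traversal; they are not part of mlinspect.

```python
from typing import Any, Dict, List


def summarize_results(
    dag_node_to_inspection_results: Dict[Any, Dict[Any, Any]],
    check_to_check_results: Dict[Any, Any],
) -> List[str]:
    """Build a plain-text summary of inspection and check results."""
    lines = []
    # One line per (DAG node, inspection) pair.
    for dag_node, inspection_results in dag_node_to_inspection_results.items():
        for inspection, result in inspection_results.items():
            lines.append(f"{dag_node} | {inspection}: {result}")
    # One line per check.
    for check, check_result in check_to_check_results.items():
        lines.append(f"{check} -> {check_result}")
    return lines


# Hypothetical stand-in data so the sketch runs without mlinspect installed.
example_inspections = {"node_0": {"MaterializeFirstOutputRows(5)": "<first 5 rows>"}}
example_checks = {"NoBiasIntroducedFor(['race'])": "<check result>"}

summary = summarize_results(example_inspections, example_checks)
for line in summary:
    print(line)
```

With a real run, you would pass `dag_node_to_inspection_results` and `check_to_check_results` from the inspector result in place of the stand-in dicts.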
We prepared a demo notebook to showcase mlinspect and its features.
mlinspect already supports a selection of API functions from pandas and scikit-learn. Extending mlinspect to support more API functions and libraries is an ongoing effort. However, mlinspect won't just crash when it encounters functions it doesn't recognize yet. For more information, please see here.
- For debugging in PyCharm, set the pytest flag --no-cov
- Stefan Grafberger, Paul Groth, Julia Stoyanovich, Sebastian Schelter (2022). Data Distribution Debugging in Machine Learning Pipelines. The VLDB Journal — The International Journal on Very Large Data Bases (Special Issue on Data Science for Responsible Data Management).
- Stefan Grafberger, Shubha Guha, Julia Stoyanovich, Sebastian Schelter (2021). mlinspect: a Data Distribution Debugger for Machine Learning Pipelines. ACM SIGMOD (demo).
- Stefan Grafberger, Julia Stoyanovich, Sebastian Schelter (2020). Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines. Conference on Innovative Data Systems Research (CIDR).
This library is licensed under the Apache 2.0 License.