This repo contains query strategies and utilities for active learning, as well as a widget for dataset annotation in the Jupyter IDE. The repo is tightly integrated with the libact Python library.
Example of active learning annotation of the MNIST dataset with the Jupyter widget.
Active learning (AL) is an interactive approach to simultaneously building a labeled dataset and training a machine learning model. AL algorithm:
1. A relatively large unlabeled dataset is gathered.
2. A domain expert labels a few positive examples in the dataset.
3. A classifier is trained on the labeled samples.
4. The classifier is applied to the rest of the corpus.
5. A few of the most “useful” examples are selected (e.g., those expected to increase classification performance).
6. The expert labels the selected examples, and they are added to the training set.
7. Go to step 3.
The procedure repeats until the performance of the classifier stops improving or the expert is bored.
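The loop described above can be sketched in a few lines with scikit-learn alone. This is a minimal illustration, not this toolbox's or libact's API: the helper names `least_confident` and `run_active_learning` are hypothetical, least-confidence uncertainty sampling stands in for the "most useful" selection criterion, the expert is simulated by looking up the true labels of the scikit-learn digits dataset, and a fixed number of rounds replaces the stopping rule.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def least_confident(probabilities, batch_size):
    """Return indices of the rows whose highest class probability is lowest."""
    confidence = probabilities.max(axis=1)
    return np.argsort(confidence)[:batch_size]


def run_active_learning(n_rounds=5, batch_size=20, n_initial=20, seed=0):
    """Run a simulated AL loop; return a list of (n_labeled, test_accuracy)."""
    rng = np.random.RandomState(seed)

    # Step 1: gather a pool of samples. The held-out split is used only to
    # track accuracy; a real annotation session would not have it.
    X, y = load_digits(return_X_y=True)
    X = X / 16.0  # pixel values are in 0..16; scaling helps convergence
    X_pool, X_test, y_pool, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed)

    # Step 2: the "expert" labels a few initial examples
    # (assumption: simulated by revealing the true labels from y_pool).
    initial = rng.choice(len(X_pool), size=n_initial, replace=False)
    labeled = [int(i) for i in initial]
    labeled_set = set(labeled)
    unlabeled = [i for i in range(len(X_pool)) if i not in labeled_set]

    history = []
    for _ in range(n_rounds):
        # Step 3: train a classifier on the labeled samples.
        model = LogisticRegression(max_iter=1000)
        model.fit(X_pool[labeled], y_pool[labeled])
        history.append((len(labeled), model.score(X_test, y_test)))

        # Step 4: apply the classifier to the rest of the pool.
        probabilities = model.predict_proba(X_pool[unlabeled])

        # Step 5: select the examples the model is least sure about.
        picked = least_confident(probabilities, batch_size)
        queried = [unlabeled[i] for i in picked]

        # Step 6: the (simulated) expert labels them; add them to the training set.
        labeled.extend(queried)
        queried_set = set(queried)
        unlabeled = [i for i in unlabeled if i not in queried_set]

    return history


if __name__ == "__main__":
    for n_labeled, accuracy in run_active_learning():
        print(f"{n_labeled} labeled samples: test accuracy {accuracy:.3f}")
```

In an interactive session, the simulated lookup in step 6 would be replaced by the expert's answers collected through the annotation widget, and the fixed round count by a check on whether accuracy has stopped improving.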
Requirements:
- Python 3.6 (the package has not been tested with earlier versions)
- numpy (1.12.1)
- pandas (0.20.1)
- sklearn (0.18.1)
- scipy (0.19.0)
- Pillow (4.2.1)
- Jupyter (4.3.0)
- LibAct from the fork (pip install git+https://github.com/windj007/libact)
The Jupyter widgets are not enabled by default. To install and activate them, do the following:
pip install ipywidgets
jupyter nbextension enable --py --sys-prefix widgetsnbextension
For further details, please refer to the jupyter-widgets repo.
To install the library and the widget, run the following on the command line with root privileges:
pip install git+https://github.com/IINemo/active_learning_toolbox
See the example for MNIST dataset annotation and the example for 20 Newsgroups annotation.
If you have Docker installed, you can test the examples with windj007/jupyter-keras-tools:
cd <package dir>/examples
docker run -ti --rm -v `pwd`:/notebook -p 8888:8888 windj007/jupyter-keras-tools
Then open http://localhost:8888 in a browser (this launches the Jupyter IDE) and open an example notebook.
If you use the active learning toolbox in academic work, please cite (to be published):
BibTeX:
@inproceedings{suvorovshelmanov2017ainl,
title={Active Learning with Adaptive Density Weighted Sampling for Information Extraction from Scientific Papers},
author={Roman Suvorov and Artem Shelmanov and Ivan Smirnov},
booktitle={Proceedings of AINL: Artificial Intelligence and Natural Language Conference},
publisher = {Springer, Communications in Computer and Information Science},
year={2017}
}
Russian GOST:
Suvorov R., Shelmanov A., Smirnov I. Active learning with adaptive density weighted sampling for information extraction from scientific papers // Proceedings of AINL: Artificial Intelligence and Natural Language Conference. — Springer, Communications in Computer and Information Science, 2017.