robin is a RObust document image BINarization tool, written in Python.
- robin - fast document image binarization tool;
- dataset - links for DIBCO 2009-2018, Palm Leaf Manuscript and my own datasets with original and ground-truth images; scripts for creating training data from datasets;
- articles - selected binarization articles that helped me a lot;
- weights - pretrained weights for robin;
robin uses a number of open source projects to work properly:
- Keras - high-level neural networks API;
- Tensorflow - open-source machine-learning framework;
- OpenCV - a library of programming functions mainly aimed at real-time computer vision;
- Augmentor - a collection of augmentation algorithms;
robin requires Python v3.5+ to run.
Getting and installing robin is even easier now with the following commands.
$ git clone https://github.com/venkatakolagotla/robin.git
$ cd robin
$ python setup.py install
Replace the dummy weight file with a real weight file with the same name in the weights directory and you are good to start binarizing some documents with robin!
robin consists of two main files: src/unet/train.py, which generates weights for the U-net model from 128x128 pairs of original and ground-truth images, and src/unet/binarize.py, which binarizes a group of input document images. The model works with 128x128 images, so the binarization tool first splits input images into 128x128 pieces. You can easily rewrite the code for a different U-net image size, but research suggests that 128x128 is the best size.
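The splitting step described above can be sketched as follows. This is a minimal illustration, not the code from binarize.py: the function names `tile_boxes` and `crop_with_padding`, the white padding value, and the nested-list image representation are all assumptions made for the example.

```python
TILE = 128  # tile side used by the U-net model, per the text above


def tile_boxes(height, width, tile=TILE):
    """Yield (top, left, bottom, right) boxes that cover a height x width image.

    Boxes on the bottom and right edges may extend past the image,
    so those tiles need padding when they are cut out.
    """
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            yield top, left, top + tile, left + tile


def crop_with_padding(img, box, pad_value=255):
    """Cut one tile out of a grayscale image stored as a list of rows.

    Pixels outside the image are filled with pad_value (white here,
    which is an assumption of this sketch).
    """
    top, left, bottom, right = box
    height, width = len(img), len(img[0])
    return [
        [img[y][x] if y < height and x < width else pad_value
         for x in range(left, right)]
        for y in range(top, bottom)
    ]
```

For a 300x200 image this produces a grid of 3 x 2 tiles. The same boxes can be used to slice an array loaded with OpenCV, and to place the binarized tiles back at their original positions.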
It is really hard to find a good document binarization dataset (DBD), so here I give links to the datasets below, marked up in a single convenient format. All input image names satisfy the [\d]*_in.png regexp, and all ground-truth image names satisfy the [\d]*_gt.png regexp.
- DIBCO - 2009 - 2018 competition datasets;
- Palm Leaf Manuscript - Palm Leaf Manuscript dataset from ICHFR2016 competition;
- Borders - a small dataset containing bad text boundaries. It can be used together with the bigger DIBCO or Palm Leaf Manuscript images;
- Improved LRDE - LRDE 2013 magazines dataset. I improved its ground-truths for better usage;
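Given the naming convention above, input and ground-truth images can be paired by their shared numeric prefix. The sketch below is only an illustration: `find_pairs` is a hypothetical helper, and the pattern is a slightly stricter form of the regexp quoted above (anchored, with the dot escaped).

```python
import re

# Stricter variant of the [\d]*_in.png pattern from the text (an assumption).
IN_RE = re.compile(r"^(\d*)_in\.png$")


def find_pairs(names):
    """Return (input, ground_truth) file name pairs that share a numeric prefix.

    Inputs without a matching ground-truth file are skipped.
    """
    present = set(names)
    pairs = []
    for name in sorted(present):
        match = IN_RE.match(name)
        if match is None:
            continue
        gt_name = f"{match.group(1)}_gt.png"
        if gt_name in present:
            pairs.append((name, gt_name))
    return pairs
```

A typical call would pass the result of `os.listdir()` for a dataset directory.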
There is also a simple script, src/dataset/dataset.py, that quickly generates train, validation and test data from the provided datasets. The expected workflow is to train a simple robin model on a new dataset, create a new dataset with binarize.py, correct the generated ground truths, and then train robin again on the new pairs of input and ground-truth images.
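A train-validation-test split of image pairs can be sketched as below. This does not reproduce dataset.py: the function name `split_pairs`, the 10% validation and test fractions, and the fixed seed are assumptions chosen for the example.

```python
import random


def split_pairs(pairs, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle (input, ground_truth) pairs and split them into three parts.

    The fractions and the seed are illustrative defaults, not values
    taken from robin.
    """
    items = list(pairs)
    random.Random(seed).shuffle(items)  # seeded, so the split is repeatable
    n_val = int(len(items) * val_frac)
    n_test = int(len(items) * test_frac)
    return {
        "test": items[:n_test],
        "validation": items[n_test:n_test + n_val],
        "train": items[n_test + n_val:],
    }
```

Keeping each input together with its ground truth in one tuple ensures that both files always land in the same part.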
While I was working on robin, I constantly read some scientific articles. Here I give links to all of them.
- DIBCO - 2009 - 2018 competition articles;
- DIBCO metrics - articles about 2 non-standard DIBCO metrics: pseudo F-Measure and DRD (PSNR and F-Measure are really easy to find on the Web);
- U-net - articles about U-net convolutional network architecture;
- CTPN - articles about CTPN, a fast neural network for finding text in images (my neural network doesn't use it, but it is great and I began my research with it);
- ZF_UNET_224 - I think this is the best U-net implementation in the world;
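For reference, the two standard metrics mentioned in the list above can be sketched as follows. This is a minimal illustration that assumes flat sequences of 0/1 pixels where 1 marks text, and a foreground-background difference of 1 in the PSNR formula; the function names are hypothetical and not part of robin.

```python
import math


def f_measure(pred, gt):
    """F-Measure = 2 * precision * recall / (precision + recall).

    pred and gt are flat sequences of 0/1 pixels, 1 = text (assumption).
    """
    tp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gt) if p == 0 and g == 1)
    if tp == 0:
        return 0.0  # no correctly detected text pixels
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def psnr(pred, gt):
    """PSNR = 10 * log10(C^2 / MSE), with C = 1 for 0/1 images."""
    mse = sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt)
    return math.inf if mse == 0 else 10 * math.log10(1.0 / mse)
```

Pseudo F-Measure and DRD need additional weighting and are not covered by this sketch.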
Training a neural network is not cheap, because you need a powerful GPU and CPU, so I provide some pretrained weights. For training I used two hardware combinations: Nvidia 1050 Ti 4 GB + Intel Core I7 7700 HQ + 8 GB RAM, and Nvidia 1080 Ti SLI + Intel Xeon E2650 + 128 GB RAM.
- Base - weights after training the NN on DIBCO and borders data for 256 epochs with batch size 128 and augmentation enabled. It is trained for A4 300 DPI images, so your input data must have good resolution;
Keras has some problems with parallel data augmentation: it creates too many processes. I hope it will be fixed soon, but for now it is better to use a zero value for the --extraprocesses flag (the default value).
- Igor Vishnyakov and Mikhail Pinchukov - my scientific directors;
- Chen Jian - DIBCO 2017 article finder;