A design and implementation of a super lightweight algorithm for "overlapped handwritten signature extraction from scanned documents" using OpenCV and scikit-image on python. Please contact if you need professional signature detection & recognition & segmentation & counting project with the super high accuracy!
- Input = The scanned document
- Output = The signatures exist on the input
TODOs:
- "Outliar Removal" module will be improved to boost the signature extraction algorithm.
- CNN based "Signature Recognition" module will be developed.
- "Signature Spoofing Detection" algorithm will be developed.
- "Signature Detector (bounding box) & Counter" module will be developed.
- "Accuracy of detection on SigSA: On-line Handwritten Signature Database" will be calculated and shared.
You can find a sample project that is developed on top of "signature extractor" algorithm to extract the signatures on the digital photo of the document. Here are the functionalities of this sample project:
Explanation: For this case, the signature extraction algorithm can extract the 3 different handwritten signatures successfully. Just a very small portion of the signature, which is located at top-left, is lost because this part is not connected with the whole signature line so the algorithm interprets it is not a part of the signature.
Explanation: For this case, signature extraction algorithm can extract 2 handwriteetn signatures from the whole textual data but it can not remove the lines, that are located at bottom-center, because the signature has big connected pixels so the algorithm sees them as signatures.
Explanation: Some parts of the signatures are lost because they are not connected with the big connected components so the algorithm sees that they are not a part of signatures. They can cathc by setting the threshold value to a bigger value.
As already mentioned that the algorithm can extract the signatures from scanned documents based on "connected component analysis" so what is connected component algorithm then?: In image processing, a connected components algorithm finds regions of connected pixels which have the same value!
You can find more detailed information about the connected component analysis in here.
Thus, the connected components can be found and labelled by a cool functionality that is provided by scikit-image library! But why do we need it? Please just check the scanned documents, you can see that the biggest connect components are belongs the handwritten signatures! If we can get the biggest connected components, we can get the signatures from whole documents! However, we can also get the undesired lines or different shapes that have big connected components, right? So we also need a threshold value to get rid of them...
I've calculated the threshold value to detect the outliars (any lines, shapes and texts are not a part of the signatures) via performing many experiments. I've got an equation ,are calculated based experiement results, which are works pretty good for most of the scanned documents are a4 sized.
Detect and remove the small size outliars:
Here the code parts that start on signature_extractor.py - line#60:
# experimental-based ratio calculation, modify it for your cases
# a4_small_size_outliar_constant is used as a threshold value to remove connected outliar connected pixels
# are smaller than a4_small_size_outliar_constant for A4 size scanned documents
a4_small_size_outliar_constant = ((average/constant_parameter_1)*constant_parameter_2)+constant_parameter_3
print("a4_small_size_outliar_constant: " + str(a4_small_size_outliar_constant))
I determined the equation (x stands for scanned document size such as A4 or A0):
- ax_small_size_outliar_constant = ((average/constant_parameter_1) * constant_parameter_2) + constant_parameter_3
based full of my experiments. You can modify it for your cases and also the scanned document size such as for A0 and so on... Just configure the constants:
- constant_parameter_1
- constant_parameter_2
- constant_parameter_3
perform many experiments with the different parameter values till get the highest accuracy!
Detect and remove the big size outliars:
Here the code parts that start on signature_extractor.py - line#66:
# experimental-based ratio calculation, modify it for your cases
# a4_big_size_outliar_constant is used as a threshold value to remove outliar connected pixels
# are bigger than a4_big_size_outliar_constant for A4 size scanned documents
a4_big_size_outliar_constant = a4_small_size_outliar_constant*constant_parameter_4
print("a4_big_size_outliar_constant: " + str(a4_big_size_outliar_constant))
I determined the equation (x stands for scanned document size such as A4 or A0):
- ax_big_size_outliar_constant = ax_small_size_outliar_constant*constant_parameter_4
based full of my experiments. You can modify it for your cases and also the scanned document size such as for A0 and so on... Just configure the constant:
- constant_parameter_4
perform many experiments with the different parameter values till get the highest accuracy!
1.) Python and pip
Python is automatically installed on Ubuntu. Take a moment to confirm (by issuing a python -V command) that one of the following Python versions is already installed on your system:
- Python 3.3+
The pip or pip3 package manager is usually installed on Ubuntu. Take a moment to confirm (by issuing a pip -V or pip3 -V command) that pip or pip3 is installed. We strongly recommend version 8.1 or higher of pip or pip3. If Version 8.1 or later is not installed, issue the following command, which will either install or upgrade to the latest pip version:
$ sudo apt-get install python3-pip python3-dev # for Python 3.n
2.) scikit-image
On all other systems, install it via shell/command prompt:
pip install scikit-image
If you are running Anaconda or miniconda, use:
conda install -c conda-forge scikit-image
See details in here.
-
After completing these 2 installation steps that are given at above, you can test the project by this command:
python3 signature_extractor.py
If you use this code for your publications, please cite it as:
@ONLINE{hse,
author = "Ahmet Özlü",
title = "Overlapped handwritten signature extraction from scanned documents",
year = "2018",
url = "https://github.com/ahmetozlu/signature_extractor"
}
Ahmet Özlü
This system is available under the MIT license. See the LICENSE file for more info.