or-toledano/visual-data-extractor


visual-data-extractor - Computer Vision for Data Extraction

Phone screens/documents (rectangles) data extraction

My attempt at region-of-interest (ROI) detection for rectangular text boxes, without deep learning (DL).

This was made primarily for the purpose of learning OpenCV.

For a fast and robust method in real-world use, prefer a DL model such as EAST (available through OpenCV).

demo

The pipeline

Detect quads (blur, threshold, find contours, approxPolyDP), warp the perspective of each quad,
preprocess for OCR (threshold), run OCR, output the text.
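The perspective-warp step needs the four corners of each detected quad in a consistent order, plus a target size for the rectified image. The sketch below shows one common way to do that in plain Python. It is an illustration under assumptions, not this package's actual code: the names `order_corners` and `target_size` are hypothetical, and corners are assumed to be `(x, y)` tuples in image coordinates.

```python
import math


def _distance(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def order_corners(points):
    """Return corners as [top-left, top-right, bottom-right, bottom-left].

    Uses the common sum/difference heuristic, which assumes the quad is
    not rotated close to 45 degrees (image coordinates, y grows downward).
    """
    if len(points) != 4:
        raise ValueError("expected exactly four corner points")
    top_left = min(points, key=lambda p: p[0] + p[1])
    bottom_right = max(points, key=lambda p: p[0] + p[1])
    top_right = max(points, key=lambda p: p[0] - p[1])
    bottom_left = min(points, key=lambda p: p[0] - p[1])
    return [top_left, top_right, bottom_right, bottom_left]


def target_size(ordered):
    """Width and height of the upright rectangle to warp the quad onto."""
    tl, tr, br, bl = ordered
    width = max(_distance(tl, tr), _distance(bl, br))
    height = max(_distance(tl, bl), _distance(tr, br))
    return round(width), round(height)


# Hypothetical, unordered corners such as approxPolyDP might return.
quad = [(210, 40), (20, 30), (15, 330), (220, 345)]
ordered = order_corners(quad)
width, height = target_size(ordered)
# Destination corners, in the same order as `ordered`.
destination = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
```

With OpenCV, `ordered` and `destination` (converted to float32 arrays) would typically be passed to `cv2.getPerspectiveTransform`, and the resulting matrix to `cv2.warpPerspective` together with `(width, height)`.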

TODO:

Serialize the output; save the contour tree structure as JSON (switching from RETR_EXTERNAL to RETR_TREE); implement a HoughLines-based method for quad detection as an alternative.
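As a rough idea of what the JSON item could look like: with `RETR_TREE`, `cv2.findContours` returns a hierarchy in which each row is `[next, previous, first_child, parent]` and `-1` means "none". The sketch below turns such rows into a nested structure and serializes it. The function name `hierarchy_to_tree` and the output layout are assumptions made for illustration, and the rows are hand-written rather than produced from a real image.

```python
import json


def hierarchy_to_tree(hierarchy):
    """Build a nested forest from OpenCV-style hierarchy rows.

    Each row is [next, previous, first_child, parent], with -1 meaning none.
    For real cv2.findContours output, pass hierarchy[0].
    """
    def build(index):
        children = []
        # int() so that NumPy integers from OpenCV stay JSON-serializable.
        child = int(hierarchy[index][2])
        while child != -1:
            children.append(build(child))
            child = int(hierarchy[child][0])  # move on to the next sibling
        return {"contour": int(index), "children": children}

    # Top-level contours are the ones without a parent.
    return [build(i) for i, row in enumerate(hierarchy) if int(row[3]) == -1]


# Hand-written example: contour 0 contains 1 and 2, contour 2 contains 3,
# and contour 4 is a second top-level contour.
rows = [
    [4, -1, 1, -1],
    [2, -1, -1, 0],
    [-1, 1, 3, 0],
    [-1, -1, -1, 2],
    [-1, 0, -1, -1],
]
tree = hierarchy_to_tree(rows)
serialized = json.dumps(tree, indent=2)
```

Each node stores only the contour index here; recognized text or corner coordinates could be attached to the same dictionaries before dumping.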

Installation

Note: on Arch Linux, python and pip point to the latest versions (i.e. python3, pip3); on Debian they do not, so use python3 and pip3 explicitly there.

git clone https://github.com/or-toledano/visual-data-extractor.git
pip install visual-data-extractor/

System dependencies: tesseract, tesseract-ocr-eng.

Usage

The --rotate flag controls rotation of the image and of each individual ROI. The ROI orientation can be estimated, but the estimate is not always correct, so use --rotate to get the results for all of the rotations. I didn't have much luck with pytesseract.image_to_osd and still need to figure out a rotation fix based on minAreaRect. With the current --rotate flag, the code tries all 4 rolls for each ROI using rectified_roi_manual_roll, which isn't optimal.
TL;DR:

python -m visualextract --rotate --path <path to image>

Or:

from visualextract.extract import extract_data
data = extract_data("/image/path/", rotate=True)  # True for now; False after a future fix
for text in data:
    print(text)
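To illustrate the idea behind trying every rotation of an ROI, here is a minimal sketch in plain Python. It is not how `rectified_roi_manual_roll` is implemented (that code is not shown here). The ROI is modelled as a list of rows, and `recognize` is a hypothetical stand-in for the OCR call; all function names are made up for this example.

```python
def rotate_90_clockwise(grid):
    """Rotate a row-major 2D grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def all_rotations(grid):
    """Return the grid at 0, 90, 180 and 270 degrees clockwise."""
    rotations = [[list(row) for row in grid]]
    for _ in range(3):
        rotations.append(rotate_90_clockwise(rotations[-1]))
    return rotations


def read_all_rotations(roi, recognize):
    """Run `recognize` (a stand-in for the OCR call) on each rotation."""
    return [recognize(rotated) for rotated in all_rotations(roi)]


# Toy ROI and a fake recognizer that only reports the shape it was given.
roi = [[1, 2, 3],
       [4, 5, 6]]
shapes = read_all_rotations(roi, lambda g: (len(g), len(g[0])))
```

On real images the rotation itself would more likely be done with `np.rot90` or `cv2.rotate`. The point of the sketch is only that all four candidates go through OCR, so the caller can keep whichever result reads sensibly.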

SPDX-License-Identifier: GPLv3-or-later
Copyright © 2020 Or Toledano
