This app lets you upload an image and run OCR on it using Tesseract, and it is deployed using Docker. It uses Flask, a lightweight web framework, for development purposes only. OpenCV is used to reduce noise in the image so that pytesseract can process it better. Uploads on AWS are limited to 2MB. Below are 3 images of a job posting, taken on a Pixel 2XL phone and reduced in size with GIMP by adjusting the quality.
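As a rough illustration of the flow described above, the following Python sketch checks an upload against the 2MB limit, then denoises the image with OpenCV before handing it to pytesseract. It is not the app's actual code. The helper names `is_acceptable_upload` and `ocr_image`, the allowed extensions, the reading of 2MB as 2 * 1024 * 1024 bytes, and the choice of a median blur followed by an Otsu threshold are all assumptions made for illustration.

```python
from pathlib import Path

# Assumption: "2MB" is read as 2 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 2 * 1024 * 1024
# Assumption: the accepted image types are not stated in this README.
ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def is_acceptable_upload(filename: str, size_bytes: int) -> bool:
    """Return True if the file has an image extension and fits the size limit."""
    has_image_extension = Path(filename).suffix.lower() in ALLOWED_EXTENSIONS
    return has_image_extension and 0 < size_bytes <= MAX_UPLOAD_BYTES


def ocr_image(path: str) -> str:
    """Reduce noise in an image with OpenCV, then extract text with pytesseract."""
    # Imported here so the size check above works without the OCR stack installed.
    import cv2
    import pytesseract

    image = cv2.imread(path)
    if image is None:
        raise ValueError(f"could not read image: {path}")

    # Grayscale, then a small median blur to suppress speckle noise.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    denoised = cv2.medianBlur(gray, 3)

    # Otsu's method picks a global threshold and yields a black-and-white image.
    _, binary = cv2.threshold(
        denoised, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU
    )
    return pytesseract.image_to_string(binary)
```

With OpenCV, pytesseract, and the Tesseract binary installed, `ocr_image("posting.jpg")` would return the recognized text as a string; the file name here is hypothetical.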
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
You will need Docker installed on your system and a command-line terminal:
- Docker
- Git Bash (on Windows)
- Terminal (Linux or Mac)
You can clone this repository or download a zip file, then build and run the Docker image:
$ docker build -t ocr-tesseract-docker .
$ docker run -d -p 5000:5000 ocr-tesseract-docker
Alternatively, you can pull and run the Docker image from my repository on Docker Hub:
$ docker pull ricktorzynski/ocr-tesseract-docker
$ docker run -d -p 5000:5000 ricktorzynski/ocr-tesseract-docker
Then open a browser to http://localhost:5000
You can use these images to test it - these are photos of a job posting:
This app was deployed to AWS Elastic Beanstalk, but is no longer available.
- Python
- Flask
- Pytesseract
- OpenCV
- Bootstrap
- Docker
Here are some helpful resources on the web that I used for this project.
- Deep Learning based Text Recognition (OCR) using Tesseract and OpenCV
- Using Tesseract OCR with Python
- Dockerize your Flask Application
- Dockerize Simple Flask App
I would like to thank Matt Berseth and Robert Marsh of NLP Logix for inspiring me to build this application.