OCR-on-pdf

How does it work?

This program processes an input PDF file and outputs every character found on each page of the document, together with each character's bounding-box coordinates.

My approach treats each page of the PDF file as an image. I then apply a simple segmentation method, based on binary thresholding and morphological processing, to extract the regions of interest (ROIs) from the image. The steps are as follows:

Step 1: Download research papers in PDF format from https://arxiv.org/ and store them in the "inputdata" directory.
Step 2: For each PDF file, create a directory named after the PDF's filename.
Step 3: Convert every page of the PDF file to an image and save the images in an "images" folder inside that directory.
Step 4: Threshold each image to binary, then apply morphological processing to the binary image. Here I use image dilation to expand the ROIs.
Step 5: Find the boundary of each ROI and draw a rectangle around it.
Step 6: Apply the OCR library pytesseract to each ROI. The results are saved to the "outputdata" directory in text format.
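
Steps 4 and 5 can be illustrated with a small, dependency-free sketch. This is not the code from OCRpdf.py: the function names (`threshold`, `dilate`, `bounding_boxes`), the cutoff of 128, and the toy page are all assumptions made for illustration. A real pipeline would more likely use OpenCV calls such as `cv2.threshold`, `cv2.dilate`, `cv2.findContours` and `cv2.boundingRect`, and then pass each cropped ROI to `pytesseract.image_to_string`.

```python
from collections import deque


def threshold(gray, cutoff=128):
    """Binarize a grayscale page: dark pixels (ink) become 1, light pixels 0."""
    return [[1 if px < cutoff else 0 for px in row] for row in gray]


def dilate(binary, radius=1):
    """Square-kernel dilation: every ink pixel switches on its neighbourhood."""
    h, w = len(binary), len(binary[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if binary[y][x]:
                for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[ny][nx] = 1
    return out


def bounding_boxes(binary):
    """Return (x, y, width, height) for each 4-connected region of 1-pixels."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not binary[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill, tracking the extent of the region.
            seen[y][x] = True
            queue = deque([(x, y)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if 0 <= nx < w and 0 <= ny < h and binary[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1 - x0 + 1, y1 - y0 + 1))
    return boxes


# Toy 5x9 "page" (hypothetical data): 255 is paper, 0 is ink.
# Three vertical strokes sit at columns 1, 3 and 7.
page = [
    [255] * 9,
    [255, 0, 255, 0, 255, 255, 255, 0, 255],
    [255, 0, 255, 0, 255, 255, 255, 0, 255],
    [255] * 9,
    [255] * 9,
]

boxes = bounding_boxes(dilate(threshold(page)))
print(boxes)  # [(0, 0, 5, 4), (6, 0, 3, 4)]
```

Without dilation the three strokes would give three separate boxes. After dilation the two strokes that sit close together merge into one ROI, which is why dilation helps group nearby marks before OCR. The kernel size controls how aggressively neighbouring marks are merged.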

nhthanh0809/OCR-on-pdf
