Skip to content

nhthanh0809/OCR-on-pdf

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 

Repository files navigation

OCR-on-pdf

How does it work?

This's my program which process input pdf file then give all the characters and each character’s bounding box coordinates which will be found in every pages of a pdf document.

In my approach, i will process each page of pdf file as an image. Then i will apply a simple segmentation method based on binary threshold & morphological processing algorithm to extract ROI on the image. I will explain my approach by the following steps:

  • Step 1: Download research paper in pdf format from https://arxiv.org/ then store in "inputdata" directory
  • Step 2: Based on filename of each pdf file, create directory with name corresponding to pdf filename
  • Step 3: Convert all pages of the pdf file to images and save it in "images" folder inside above directory
  • Step 4: Threshold the image to binary then apply morphological processing for the binary image. In here i will using image dilation to expand ROI of image
  • Step 5: Find boundary of ROI and draw rectangle for each finding out ROI.
  • Step 6: Apply OCR library pytesseract to each ROI. The output results will be saved to "outputdata" directory under text format

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages