Create searchable PDF with tesseract #61

minamotorin · 2021-09-28T19:01:56Z

This pull request lets GoBooDo create a searchable PDF using OCR. See also: #58.

How does this work

Tesseract creates a searchable PDF from each page image, and the resulting PDFs are merged with PyPDF2.
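The two steps above could be sketched roughly as follows. This is not the code in makePDF.py; it is a minimal illustration that assumes the `tesseract` command-line tool is on the PATH and that PyPDF2 2.0 or later is installed (older releases name the class `PdfFileMerger`). The function names `build_tesseract_command`, `ocr_pages`, and `merge_pdfs` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image, out_base, lang="eng", dpi=1200):
    """Return the argv for one page: tesseract IMAGE OUTBASE -l LANG --dpi DPI pdf."""
    return ["tesseract", str(image), str(out_base),
            "-l", lang, "--dpi", str(dpi), "pdf"]


def ocr_pages(images, work_dir, lang="eng", dpi=1200):
    """Run Tesseract on every image and return the per-page PDF paths."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    page_pdfs = []
    for image in images:
        out_base = work_dir / Path(image).stem
        subprocess.run(build_tesseract_command(image, out_base, lang, dpi),
                       check=True)
        # Tesseract appends ".pdf" to the output base name.
        page_pdfs.append(Path(str(out_base) + ".pdf"))
    return page_pdfs


def merge_pdfs(page_pdfs, output_path):
    """Concatenate the per-page PDFs into one file."""
    # Imported lazily so the rest of the module works without PyPDF2.
    from PyPDF2 import PdfMerger  # assumption: PyPDF2 >= 2.0

    merger = PdfMerger()
    for pdf in page_pdfs:
        merger.append(str(pdf))
    merger.write(str(output_path))
    merger.close()
```

A caller would run `merge_pdfs(ocr_pages(images, "ocr_tmp", lang="eng"), "book.pdf")`, where the directory and file names are placeholders.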

Usage

If lang is missing from settings.json or is empty, GoBooDo creates an unsearchable PDF (the same as now).

If lang is not empty, GoBooDo creates a searchable PDF. OCR is run on the assumption that the book is written in the language given in lang.
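The rule above can be illustrated with a small sketch. It assumes that the value of lang is a Tesseract language code such as `"eng"`; the helper name `ocr_language` is hypothetical and is not taken from GoBooDo.py.

```python
import json


def ocr_language(settings):
    """Return the OCR language, or None when OCR should be skipped."""
    lang = settings.get("lang")
    if not isinstance(lang, str):
        return None  # key missing (or not a string): keep the old behaviour
    lang = lang.strip()
    return lang or None  # empty string also means "no OCR"


# Example: a settings.json fragment with OCR enabled for English.
settings = json.loads('{"lang": "eng"}')
mode = "searchable" if ocr_language(settings) else "unsearchable"
```

With `"lang": ""` or no `lang` key at all, `ocr_language` returns `None` and the PDF is built as before.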

Note

This pull request adds a dependency (PyPDF2). If a user updates GoBooDo without installing PyPDF2, makePDF.py will fail with a missing-module error.
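One possible way to soften this, which this pull request does not do, is to check for the module before using it and to fall back to the unsearchable PDF. The sketch below uses only the standard library; `module_available` and `choose_pdf_mode` are hypothetical names.

```python
import importlib.util


def module_available(name):
    """True if a top-level module with this name can be imported."""
    return importlib.util.find_spec(name) is not None


def choose_pdf_mode(lang, have_pypdf2):
    """Pick the PDF mode from the lang setting and the installed modules."""
    if not lang:
        return "unsearchable"
    if not have_pypdf2:
        # OCR was requested but the merge step cannot run.
        return "unsearchable (PyPDF2 missing, run: pip install PyPDF2)"
    return "searchable"


mode = choose_pdf_mode("eng", module_available("PyPDF2"))
```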

OCR takes time, so running it before GoBooDo has finished downloading all the images wastes both time and electricity (#59).

Users who want to run OCR in a language other than English need to install additional language data. There are also alternative language data sets that trade speed for accuracy: some are more accurate but slower, others are faster.
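Whether the required language data is installed can be checked with `tesseract --list-langs`. The sketch below parses that output, on the assumption that the first line is a header and each following line is one language code, and it treats a `+`-joined value such as `eng+jpn` as several languages. The helper names are hypothetical.

```python
import subprocess


def parse_list_langs(output):
    """Parse the text printed by `tesseract --list-langs` into a set of codes."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return set(lines[1:])  # the first line is the "List of available..." header


def missing_languages(lang, installed):
    """Return the requested codes (joined with '+') that are not installed."""
    return [code for code in lang.split("+") if code and code not in installed]


def installed_languages():
    """Ask the local Tesseract which language data it can find."""
    result = subprocess.run(["tesseract", "--list-langs"],
                            capture_output=True, text=True, check=True)
    # The listing may be printed on stdout or stderr depending on the version.
    return parse_list_langs(result.stdout or result.stderr)
```

For example, `missing_languages("eng+jpn", installed_languages())` returns `["jpn"]` on a system that only has the English data.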

For OCR, the default page_resolution is probably not high enough. I use 1200.

Some of my English sentences may need correction, so feedback on the wording is welcome.

vaibhavk97 · 2022-11-19T18:39:40Z

Thanks for your contribution! The use case is compelling. Could you please add some tests too, so that we can make sure these changes do not break the current functionality? Thanks!

minamotorin added 5 commits September 28, 2021 18:15

Update README.md for OCR

66d7b87

Update settings.json for OCR

03423ae

Update requirements.txt (add PyPDF2) for OCR

6a2942f

Update makePDF.py for OCR

5fe2bde

Update GoBooDo.py for OCR

9ddc8b8
