About

PDF books and articles found online are usually poorly rendered on small e-readers (e.g. Kindle Oasis), because the whole PDF page is displayed at once on the small screen.

This library uses OCR to correct the skew angle of each page, crop around the text, and re-paginate, in order to give the best reading experience on small e-readers.
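To illustrate the "crop around the text" step, here is a minimal sketch. It is not this library's implementation: the library relies on OCR, whereas this sketch swaps in a much simpler technique, the bounding box of dark pixels on a rendered page image. The class `ContentCropSketch`, its methods, and the `threshold` and `margin` parameters are all hypothetical names chosen for the example.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

/** Hypothetical helper for illustration only; not part of this library. */
final class ContentCropSketch {

    /**
     * Returns the smallest rectangle containing every pixel whose average
     * RGB brightness is below the threshold (0-255), or null if there is none.
     */
    static Rectangle contentBounds(BufferedImage page, int threshold) {
        int minX = page.getWidth(), minY = page.getHeight();
        int maxX = -1, maxY = -1;
        for (int y = 0; y < page.getHeight(); y++) {
            for (int x = 0; x < page.getWidth(); x++) {
                int rgb = page.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                if ((r + g + b) / 3 < threshold) {
                    minX = Math.min(minX, x);
                    minY = Math.min(minY, y);
                    maxX = Math.max(maxX, x);
                    maxY = Math.max(maxY, y);
                }
            }
        }
        if (maxX < 0) {
            return null; // blank page: no dark pixel found
        }
        return new Rectangle(minX, minY, maxX - minX + 1, maxY - minY + 1);
    }

    /**
     * Crops the page to its dark content plus a margin, clamped to the page.
     * A blank page is returned unchanged.
     */
    static BufferedImage cropToContent(BufferedImage page, int threshold, int margin) {
        Rectangle box = contentBounds(page, threshold);
        if (box == null) {
            return page;
        }
        Rectangle padded = new Rectangle(
                box.x - margin, box.y - margin,
                box.width + 2 * margin, box.height + 2 * margin)
                .intersection(new Rectangle(0, 0, page.getWidth(), page.getHeight()));
        return page.getSubimage(padded.x, padded.y, padded.width, padded.height);
    }

    public static void main(String[] args) {
        // Build a white page with a dark block standing in for text.
        BufferedImage page = new BufferedImage(200, 300, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 300; y++) {
            for (int x = 0; x < 200; x++) {
                boolean dark = x >= 40 && x < 160 && y >= 50 && y < 250;
                page.setRGB(x, y, dark ? 0x000000 : 0xFFFFFF);
            }
        }
        BufferedImage cropped = cropToContent(page, 128, 10);
        System.out.println(cropped.getWidth() + " x " + cropped.getHeight());
    }
}
```

A pixel-based bounding box like this is sensitive to scanner noise and page borders, which is one reason an OCR-driven approach can locate the text area more reliably.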

The code was initially written in 2018 in Java, alongside an online converter website that I later took down because it was costly to run (OCR and image processing are quite resource-intensive). I also couldn't maintain it while working full time.

Therefore, the project probably needs a bit of a cleanup.

The unit tests that use full PDF books cannot be shared publicly, so I will re-add them later, using only individual pages rather than complete books.

Examples

Example 1

Input

download PDF

Output

download PDF

Example 2

Input

download PDF

Output

download PDF

Example 3

Input

download PDF

Output

download PDF

Requirements

    sudo apt-get install tesseract-ocr

The data in tessdata/ comes from https://github.com/tesseract-ocr/tessdata_best
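As a quick sanity check of this setup, the sketch below lists which Tesseract language models are present in a tessdata directory, based on the `<lang>.traineddata` file naming used by tessdata_best. The class `TessdataCheckSketch` and its method are hypothetical and not part of this library; the directory path is whatever you pass in.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper for illustration only; not part of this library. */
final class TessdataCheckSketch {

    private static final String SUFFIX = ".traineddata";

    /**
     * Returns the sorted language codes (e.g. "eng") for which a
     * *.traineddata file exists directly inside the given directory.
     * A missing directory yields an empty list.
     */
    static List<String> availableLanguages(Path tessdataDir) throws IOException {
        if (!Files.isDirectory(tessdataDir)) {
            return List.of();
        }
        try (Stream<Path> files = Files.list(tessdataDir)) {
            return files
                    .map(p -> p.getFileName().toString())
                    .filter(name -> name.endsWith(SUFFIX))
                    .map(name -> name.substring(0, name.length() - SUFFIX.length()))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to ./tessdata when no directory is given.
        Path dir = Paths.get(args.length > 0 ? args[0] : "tessdata");
        System.out.println("Languages found: " + availableLanguages(dir));
    }
}
```

An empty result suggests the tessdata/ folder is missing or does not contain the model files.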

Usage

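    // Select the input PDF, the page range to process, and enable skew correction
    RequestConfig requestConfig = RequestConfig
        .builder()
        .pdfFile(file)
        .minPage(minPage)
        .maxPage(maxPage)
        .correctAngle(true)
        .build();

    // Start processing, then wait for it to complete
    // (joinThread presumably blocks until the worker thread is done)
    Processor processor = new Processor(requestConfig);
    processor.process();
    processor.joinThread();

    // Write the optimized pages to a new PDF file
    File outputFile = processor.writeToPDFFile(fileName + "_optimized.pdf");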

TODO

~~Move to Gradle~~
Re-add unit tests that can be shared publicly, adapt the other ones
Add language as a parameter
Create a user-friendly runnable
Move to Kotlin
Finish picture detection

benckx/optimize-pdf-ereaders

Folders and files

Latest commit

History

Repository files navigation

About

Examples

Example 1

Input

Output

Example 2

Input

Output

Example 3

Input

Output

Requirements

Usage

TODO

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages