Better text extraction quality by PDFlib Text and Image Extraction Toolkit (TET) #14

Open
wants to merge 4 commits into
base: master

Conversation

jasonge27

Extended parse_pdf_to_text.py to support TET. TET can be used without a commercial license as a fully functional evaluation version, restricted to documents of at most 10 pages and 1 MB, so I wrote a split-and-combine procedure to use TET in evaluation mode.

Jian Ge and others added 2 commits December 20, 2015 11:30
text extraction, especially for TET in evaluation mode.
TET is observed to perform much better than pdftotext in text extraction.

Caution: TET in evaluation mode can only process PDF documents with
up to 10 pages and 1 MB in size. Although we split the PDF file into
single pages before using TET, a single page can still be larger than
1 MB. TET will print error messages to stdout in these cases.
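
For readers who want the shape of the split-and-combine idea without opening the diff, here is a minimal sketch. It is not the code from this PR. It assumes the PDF has already been split into single-page files by some other tool, and `combine_page_texts`, `extract_page` and `MAX_PAGE_BYTES` are hypothetical names introduced only for illustration. `extract_page` stands in for whatever wrapper actually invokes TET on one file. Treating "1 MB" as 1,000,000 bytes is also an assumption, since the exact threshold is not stated in this thread.

```python
import os
from typing import Callable, Iterable, List, Tuple

# Assumption: "1 MB" is treated as 1,000,000 bytes. The exact threshold
# applied in evaluation mode is not stated in this thread.
MAX_PAGE_BYTES = 1_000_000


def combine_page_texts(
    page_files: Iterable[str],
    extract_page: Callable[[str], str],
    max_bytes: int = MAX_PAGE_BYTES,
) -> Tuple[str, List[str]]:
    """Extract text page by page and join the results in page order.

    page_files: paths to single-page PDFs, already split by another tool.
    extract_page: hypothetical callable that runs the extractor on one
        file and returns its text (for example, a wrapper around TET).
    Returns the combined text and the list of pages that were skipped
    because they exceed max_bytes.
    """
    texts: List[str] = []
    skipped: List[str] = []
    for path in page_files:
        if os.path.getsize(path) > max_bytes:
            # A single page can still be over the size limit, so record
            # it and move on instead of handing it to the extractor.
            skipped.append(path)
            continue
        texts.append(extract_page(path))
    return "\n".join(texts), skipped
```

Checking the size up front means oversized pages are reported explicitly in the returned list rather than only showing up as error messages on stdout.
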
@karpathy
Owner

I am hesitant to change my previous default functionality. If you made this into an independent script, maybe pdf_to_text_tet.py or something, then I'd be happy to merge it as an alternative.

@jasonge27
Author

I hope you've had a great holiday! I put the TET utility in a separate script as you suggested.
