Better text extraction quality by PDFLib Text and Image Extraction Toolkit(TET) #14

jasonge27 · 2015-12-20T18:09:52Z

Extended parse_pdf_to_text.py to support TET. Because TET can be run without a commercial license as a fully functional evaluation version (restricted to 10 pages and 1 MB per document), a split-and-combine procedure was written to use TET in evaluation mode.

text extraction, especially for TET in evaluation mode. TET is observed to perform much better than pdftotext at text extraction. Caution: TET in evaluation mode can only process PDF documents of up to 10 pages and 1 MB in size. Although we split the PDF file into single pages before using TET, a single page can still be larger than 1 MB. In that case TET prints error messages to stdout.
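The split-and-combine idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: `extract_pagewise`, `extract_page`, and `MAX_PAGE_BYTES` are hypothetical names, the exact byte value of the 1 MB limit is an assumption, and the actual call into TET is left to a caller-supplied function because the real invocation is not shown in this thread.

```python
from typing import Callable, Iterable, List, Tuple

# Assumption: "1 MB" is taken as 1024 * 1024 bytes; the exact threshold
# enforced by TET in evaluation mode is not stated in this thread.
MAX_PAGE_BYTES = 1024 * 1024


def extract_pagewise(
    page_files: Iterable[Tuple[str, bytes]],
    extract_page: Callable[[bytes], str],
    max_bytes: int = MAX_PAGE_BYTES,
) -> Tuple[str, List[str]]:
    """Run an extractor on single-page PDFs and join the results.

    page_files   -- (name, pdf_bytes) pairs, one per already-split page
    extract_page -- hypothetical wrapper that sends one page to TET
                    and returns the extracted text
    Returns the combined text and the names of pages that were skipped
    because they exceed the size limit.
    """
    texts: List[str] = []
    skipped: List[str] = []
    for name, data in page_files:
        if len(data) > max_bytes:
            # A single page can still be over the limit; record it
            # instead of letting the extractor fail on it.
            skipped.append(name)
            continue
        texts.append(extract_page(data))
    return "\n".join(texts), skipped
```

Checking the page size up front lets the caller report which pages were left out, rather than relying only on the error messages TET writes to stdout.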

karpathy · 2015-12-21T01:23:54Z

I am hesitant to change my previous default functionality. If you made this into an independent script, maybe pdf_to_text_tet.py or something, then I'd be happy to merge it as an alternative.

jasonge27 · 2015-12-28T08:50:47Z

I hope you've had a great holiday! I put the TET utility in a separate script as you suggested.

Jian Ge and others added 2 commits December 20, 2015 11:30

remove unused import statement in parse_pdf_to_text.py

38977cb

Put TET utility in the separate script parse_pdf_to_text_TET.py.

29c19e6

Parse XML or Bibtex file from PDFs

4fe586c
