pdfExtraction

Extract information from PDF files and convert unstructured text into structured information.
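
To make "unstructured text into structured information" concrete, here is a minimal sketch using only the Python standard library. It assumes the text has already been extracted from a PDF and that the document uses simple `Key: value` lines. The sample text, the field names, and the `text_to_record` helper are hypothetical illustrations, not part of this repository's code.

```python
import re

# Hypothetical sample: the kind of raw text a PDF text extractor might return.
SAMPLE_TEXT = """\
Invoice No: INV-001
Date: 2019-03-14
Total: 1250.50
"""

# One "Key: value" pair per line; spaces around the key and value are trimmed.
FIELD_PATTERN = re.compile(
    r"^[ \t]*([^:\n]+?)[ \t]*:[ \t]*(.+?)[ \t]*$", re.MULTILINE
)


def text_to_record(text: str) -> dict:
    """Collect 'Key: value' lines into a dict with normalized keys."""
    record = {}
    for key, value in FIELD_PATTERN.findall(text):
        # Normalize "Invoice No" to "invoice_no" so keys are easy to query.
        record[key.lower().replace(" ", "_")] = value
    return record


if __name__ == "__main__":
    print(text_to_record(SAMPLE_TEXT))
```

Real PDFs rarely come out this clean, so a pattern like this usually needs adapting per document layout.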

Feature & ToDo List

xpdf: a free PDF viewer and toolkit, including a text extractor, image converter, HTML converter, and more.
tika: detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
tika-python: a Python port of the Apache Tika library that makes Tika available using the Tika REST Server.
tabula-py: a simple wrapper of tabula-java that extracts tables from PDFs into pandas DataFrames.
pdfminer/pdfminer.six: a Python PDF parser.
PyPDF2: a utility to read and write PDFs with Python.
pdfplumber: plumb a PDF for detailed information about each char, rectangle, line, et cetera, and easily extract text and tables.
pdftabextract: a set of tools for extracting tables from PDF files, helping to do data mining on (OCR-processed) scanned documents.
OCRmyPDF: adds an OCR text layer to scanned PDF files, allowing them to be searched.
tesseract: Tesseract Open Source OCR Engine (main repository).
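
As an illustration of how one of the tools above might fit into a pipeline, the sketch below uses pdfplumber's `Page.extract_tables()` and turns each table into a list of dictionaries. It assumes pdfplumber is installed and that the first row of every table is a header row, which will not hold for every PDF. The helper names `table_to_records` and `extract_table_records` are hypothetical and are not taken from `pdfExtraction.py`.

```python
from typing import Dict, List, Optional

# pdfplumber returns each table as a list of rows; cells may be None.
Row = List[Optional[str]]


def table_to_records(table: List[Row]) -> List[Dict[str, str]]:
    """Treat the first row as the header and map each later row onto it."""
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        if any(cells):  # skip rows where every cell is empty
            records.append(dict(zip(header, cells)))
    return records


def extract_table_records(pdf_path: str) -> List[Dict[str, str]]:
    """Collect records from every table on every page of the PDF."""
    import pdfplumber  # third-party dependency, imported lazily

    records: List[Dict[str, str]] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                records.extend(table_to_records(table))
    return records
```

Calling `extract_table_records("some.pdf")` (the path is a placeholder) would return one dictionary per table row. Scanned PDFs without a text layer would first need OCR, for example with OCRmyPDF.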

Programming with PDFMiner
[DATA MINING PDFS – THE SIMPLE CASES](https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-pdfs-using-pdftabextract-to-liberate-tabular-data-from-scanned-documents/)
[DATA MINING OCR PDFS – USING PDFTABEXTRACT TO LIBERATE TABULAR DATA FROM SCANNED DOCUMENTS](https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-pdfs-using-pdftabextract-to-liberate-tabular-data-from-scanned-documents/)
[StackOverflow: how to extract text from a PDF file](https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file)
Python: OCR for PDF or Compare textract, pytesseract, and pyocr
