echo-ray/pdfExtraction

pdfExtraction

Extract information from PDF files and convert the unstructured text into structured information.

Feature & ToDo List

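  • Table extraction from PDF documents

    • Incomplete table detection

      Some tables in Chinese documents do not use ruling lines consistently, so they have to be detected from text-alignment patterns. This can also cause table-of-contents pages to be misidentified as tables.

    • Row-splitting errors caused by line breaks inside cells

      If, in a given row, every column that holds a valid value contains a \n line break, merge the next row into that row and set the next row to NaN.

    • Table title extraction

      Get the table's bbox and take the line above it as the table title. The title should be on the left or in the center, not on the right (text on the right and below is usually a note on the table contents).

    • to be added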
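  • Text extraction from PDF documents

    • Text extraction
    • Splitting and cleaning
      • Split on full stops, long runs of spaces, and line breaks, and remove the long runs of spaces
      • to be added
    • Post-processing:
      • Store the text in elasticsearch for search
      • NLP etc.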
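  • Image processing in PDF documents

    • Get the bbox and extract the images in PNG format
    • Process the images and extract text

      Following this article, consider PyOCR + tesseract + LSTM

    • Pass the results back to the text processing and image processing steps above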
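
The merge rule for cells broken by line breaks, described under table extraction above, can be sketched in plain Python. This is a minimal sketch and not the repository's implementation: the helper names (`is_missing`, `row_is_wrapped`, `merge_cells`, `merge_wrapped_rows`), the `sep` parameter, and the representation of a table as a rectangular list of rows are all assumptions made for illustration, and `math.nan` stands in for NaN.

```python
import math


def is_missing(cell):
    """Return True for None or NaN, the markers used here for an empty cell."""
    return cell is None or (isinstance(cell, float) and math.isnan(cell))


def row_is_wrapped(row):
    """A row counts as wrapped when it has valid cells and all of them contain '\\n'."""
    valid = [c for c in row if not is_missing(c)]
    return bool(valid) and all("\n" in str(c) for c in valid)


def merge_cells(upper, lower, sep=""):
    """Join the cell of a wrapped row with the cell directly below it."""
    if is_missing(upper):
        return lower
    text = str(upper).replace("\n", sep).strip()
    if is_missing(lower):
        return text
    return text + sep + str(lower)


def merge_wrapped_rows(rows, sep=""):
    """Merge the row after each wrapped row into it and blank the merged row.

    `sep` is a hypothetical knob: "" suits CJK text, " " suits English.
    Rows are assumed to have equal length.
    """
    out = [list(r) for r in rows]
    i = 0
    while i < len(out) - 1:
        if row_is_wrapped(out[i]):
            out[i] = [merge_cells(u, l, sep) for u, l in zip(out[i], out[i + 1])]
            out[i + 1] = [math.nan] * len(out[i + 1])  # next row becomes NaN
            i += 2  # the blanked row is skipped
        else:
            i += 1
    return out


if __name__ == "__main__":
    table = [
        ["Net\n", math.nan, "2019\n"],
        ["profit", math.nan, "total"],
        ["Revenue", "x", "100"],
    ]
    for row in merge_wrapped_rows(table, sep=" "):
        print(row)
```

The blanked rows are kept in place rather than dropped so that row indices still line up with the extractor's output; dropping all-NaN rows afterwards is a separate step.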
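
The table title heuristic above (take the line above the table's bbox, and prefer text on the left or in the center over text on the right) might look roughly like the following. It is a sketch under stated assumptions: boxes are `(x0, top, x1, bottom)` with y growing downward, the `Line` type, the `max_gap` limit, and the "right third of the table" threshold are hypothetical choices that are not taken from the repository.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Line(NamedTuple):
    """One text line with its bounding box (hypothetical structure)."""
    text: str
    x0: float
    top: float
    x1: float
    bottom: float


def find_table_title(
    table_bbox: Tuple[float, float, float, float],
    lines: Sequence[Line],
    max_gap: float = 40.0,
) -> Optional[str]:
    """Return the text of the closest line above the table that is not right-aligned."""
    tx0, ttop, tx1, _ = table_bbox
    # Assumed threshold: lines centered in the right third are treated as notes.
    right_zone_start = tx0 + 2 * (tx1 - tx0) / 3

    best_gap = None
    best_text = None
    for line in lines:
        gap = ttop - line.bottom
        if gap < 0 or gap > max_gap:
            continue  # below the table top, or too far above it
        center = (line.x0 + line.x1) / 2
        if center >= right_zone_start:
            continue  # right-aligned text is usually a unit or note
        if best_gap is None or gap < best_gap:
            best_gap, best_text = gap, line.text
    return best_text


if __name__ == "__main__":
    table = (50.0, 200.0, 550.0, 400.0)
    page_lines = [
        Line("Table 3. Revenue by segment", 60, 170, 300, 185),
        Line("Unit: CNY 10,000", 450, 186, 550, 198),
        Line("Note: unaudited figures", 50, 405, 200, 415),
    ]
    print(find_table_title(table, page_lines))
```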
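
The splitting and cleaning step listed under text extraction above (split on full stops, long runs of spaces, and line breaks, then remove the runs of spaces) can be sketched with the standard `re` module. The function name `split_and_clean`, the `min_spaces` threshold of 3, and the decision to treat an ASCII "." as a full stop only when whitespace or the end of the text follows are assumptions for illustration.

```python
import re


def split_and_clean(text, min_spaces=3):
    """Split extracted text into cleaned segments.

    Boundaries: after a full stop ("。", or "." before whitespace or end of text),
    at runs of at least `min_spaces` spaces, and at line breaks.
    """
    # Put a line break after each full stop so one split handles every boundary.
    marked = re.sub(r"(。|\.(?=\s|$))", r"\1\n", text)
    # Treat long runs of spaces (ASCII, tab, full-width) as a boundary too.
    long_run = r"[ \t\u3000]{" + str(min_spaces) + r",}"
    marked = re.sub(long_run, "\n", marked)

    segments = []
    for piece in re.split(r"[\r\n]+", marked):
        # Collapse any remaining short runs of spaces and trim the ends.
        piece = re.sub(r"[ \t\u3000]+", " ", piece).strip()
        if piece:
            segments.append(piece)
    return segments


if __name__ == "__main__":
    sample = "第一句。第二句      第三句\n第四句。"
    for segment in split_and_clean(sample):
        print(segment)
```

Splitting on every line break also cuts sentences that the PDF wrapped across lines, so a later step may need to rejoin segments that do not end in a full stop.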

Note

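  • tesseract is an OCR library that HP and Google open-sourced many years ago; it is mature and still steadily updated. ImageMagick can be used to extract the scanned images, which tesseract then recognizes.

  • The xpdf project provides a fairly mature and stable way to convert text-based PDFs into plain text.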
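
The two routes in these notes can be driven from Python through the command-line tools. The sketch below only builds and optionally runs the commands; it assumes that `pdftotext` (from xpdf), `convert` (ImageMagick; newer releases may ship it as `magick`), and `tesseract` are on the PATH, and that the `chi_sim` language data is installed for Chinese OCR. The function names and default values are hypothetical.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Command for a text-based PDF: pdftotext [-layout] in.pdf out.txt."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # keep the physical layout, which helps with tables
    return cmd + [pdf_path, txt_path]


def imagemagick_command(pdf_path, png_pattern, dpi=300, binary="convert"):
    """Command that rasterizes a scanned PDF into PNG pages."""
    return [binary, "-density", str(dpi), pdf_path, png_pattern]


def tesseract_command(image_path, output_base, lang="chi_sim"):
    """Command that writes the OCR result to <output_base>.txt."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_if_available(cmd):
    """Run cmd when its executable is installed; return False otherwise."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(pdftotext_command("report.pdf", "report.txt"))
    print(imagemagick_command("scan.pdf", "page-%03d.png"))
    print(tesseract_command("page-000.png", "page-000"))
```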

Related Projects

  • xpdf Xpdf is a free PDF viewer and toolkit, including a text extractor, image converter, HTML converter, and more.

  • tika Detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

  • tika-python A Python port of the Apache Tika library that makes Tika available using the Tika REST Server.

  • tabula-py Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame

  • pdfminer/pdfminer.six
    Python PDF Parser

  • PyPDF2
    A utility to read and write PDFs with Python

  • pdfplumber Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

  • pdftabextract
    A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.

  • OCRmyPDF
    OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

  • tesseract Tesseract Open Source OCR Engine (main repository)

Reference
