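Since there was no convenient tool for extracting financial data from financial-report PDFs, I went through the available material and wrote small extraction tools in two languages.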
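Note: to convert a PDF of more than 10 pages to Excel, you can modify the code yourself and call the methods provided by spire.pdf-3.8.5.jar inside a for loop (the free API is limited to 10 pages).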
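One way to read the note above is to process a long document in batches of at most 10 pages. The sketch below only computes the page ranges for such a loop; the helper name page_batches is hypothetical, and how each batch is handed to spire.pdf is not shown in the source, so that part is left out.

```python
def page_batches(total_pages, batch_size=10):
    """Return inclusive, 1-indexed (first, last) page ranges.

    batch_size defaults to 10 to match the free-API limit mentioned above.
    """
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size >= 1")
    return [
        (start, min(start + batch_size - 1, total_pages))
        for start in range(1, total_pages + 1, batch_size)
    ]
```

For a 25-page report this yields three ranges, so the loop body would run three times, once per range.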
========================================================================
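I found an open-source tool, Tabula, that extracts tables from PDFs and converts them to Excel. It is very easy to use and better suited to beginners: just download and install it. In the quick test I just ran, it extracted the tables from a PDF and converted them to an Excel file with 100% accuracy.
- Requirements: install a Java environment first, then download the Windows package tabula-win.zip, unzip it, and double-click tabula.exe. That's it.
Note: there are plenty of tutorials online for installing Java, so just search for one. If you are really stuck, here is a reference tutorial: Installing Java 8 on Windows 10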
- Windows & Linux users will need a copy of Java installed. You can download Java here. (Java is included in the Mac version.)
- Download tabula-win.zip from https://tabula.technology/. Unzip the whole thing and open the tabula.exe file inside. A browser should automatically open to http://127.0.0.1:8080/ . If not, open your web browser of choice and visit that link.
- To close Tabula, just go back to the console window and press "Control-C" (as if to copy).
========================================================================
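For complex tables, extraction with the Tabula tool can still leave some of the formatting scrambled. So I looked for an alternative and found tabula-py, a Python library that wraps tabula-java.
Install the library in your Python environment: pip install tabula-py
Use the API provided by tabula-py to read the PDF and extract the table data, then clean the data to suit your own needs. The development environment requirements are: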
- Java 8+
- Python 3.7+
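The two requirements above can be checked before running any extraction. This is a minimal sketch, not part of tabula-py: the function name check_environment is hypothetical, and it only checks that a java executable is on PATH, not that its version is 8 or later.

```python
import shutil
import sys

MIN_PYTHON = (3, 7)  # from the requirements list above


def check_environment():
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        problems.append(
            "Python %d.%d+ is required, found %d.%d"
            % (MIN_PYTHON + tuple(sys.version_info[:2]))
        )
    # Assumption: Java is considered installed if `java` is found on PATH.
    if shutil.which("java") is None:
        problems.append("java executable not found on PATH")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment looks OK"]:
        print(line)
```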
tabula-py enables you to extract tables from a PDF into a DataFrame, or a JSON. It can also extract tables from a PDF and save the file as a CSV, a TSV, or a JSON.
import tabula
# Read pdf into list of DataFrame
dfs = tabula.read_pdf("test.pdf", pages='all')
# Read remote pdf into list of DataFrame
dfs2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")
# convert PDF into CSV file
tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages='all')
# convert all PDFs in a directory
tabula.convert_into_by_batch("input_directory", output_format='csv', pages='all')
See the example notebook for more details. I also recommend reading the tutorial article.
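As for the cleaning step mentioned earlier, amounts in financial reports often come out of the CSV as strings such as "1,234.50" or "(200)". The sketch below shows one possible cleanup using only the standard library. The helper names parse_amount and clean_rows are hypothetical, and it assumes the first row is a header, that parentheses mark negative amounts, and that a lone dash means an empty value; adjust these to the report at hand.

```python
import csv
import io


def parse_amount(cell):
    """Turn '1,234.50' into 1234.5 and '(200)' into -200.0.

    Blank cells and lone dashes become None; anything that is not a
    number (for example a row label) is returned as a stripped string.
    """
    text = cell.strip()
    if text in ("", "-", "--"):
        return None
    negative = text.startswith("(") and text.endswith(")")
    digits = text[1:-1] if negative else text
    digits = digits.replace(",", "")  # drop thousands separators
    try:
        value = float(digits)
    except ValueError:
        return text  # not a number: keep the original text
    return -value if negative else value


def clean_rows(csv_text):
    """Parse CSV text, keep the header row as-is, clean every other cell."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [header] + [[parse_amount(c) for c in row] for row in body]
```

To apply it to the file written by convert_into above, read output.csv with open("output.csv", newline="", encoding="utf-8") and pass its contents to clean_rows.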