tabula-py
is a simple Python wrapper of tabula-java, which can read table of PDF.
You can read tables from PDF and convert into pandas's DataFrame. tabula-py also enables you to convert a PDF file into CSV/TSV/JSON file.
- Java
- Confirmed working with Java 7, 8
- pandas
- urllib3
- distro
I confirmed working on macOS and Ubuntu. But some people confirm it works on Windows 10. See also following the setting procedure.
pip install tabula-py
If you want to become a contributor, you can install dependency for development of tabula-py as follows:
pip install -r requirements.txt -c constraints.txt
tabula-py enables you to extract table from PDF into DataFrame and JSON. It also can extract tables from PDF and save file as CSV, TSV or JSON.
import tabula
# Read pdf into DataFrame
df = tabula.read_pdf("test.pdf", options)
# Read remote pdf into DataFrame
df2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")
# convert PDF into CSV
tabula.convert_into("test.pdf", "output.csv", output_format="csv")
# convert all PDFs in a directory
tabula.convert_into_by_batch("input_directory", output_format='csv')
See example notebook
This instruction is originally written by @lahoffm. Thanks!
- If you don't have it already, install Java
- Try to run example code (replace the appropriate PDF file name).
- If there's a
FileNotFoundError
when it callsread_pdf()
, and when you typejava
on command line it says'java' is not recognized as an internal or external command, operable program or batch file
, you should setPATH
environment variable to point to the Java directory. - Find the main Java folder like
jre...
orjdk...
. On Windows 10 it was underC:\Program Files\Java
- On Windows 10: Control Panel -> System and Security -> System -> Advanced System Settings -> Environment Variables -> Select PATH --> Edit
- Add the
bin
folder likeC:\Program Files\Java\jre1.8.0_144\bin
, hit OK a bunch of times. - On command line,
java
should now print a list of options, andtabula.read_pdf()
should run.
- pages (str, int,
list
ofint
, optional)- An optional values specifying pages to extract from. It allows
str
,int
,list
ofint
. - Example: 1, '1-2,3', 'all' or [1,2]. Default is 1
- An optional values specifying pages to extract from. It allows
- guess (bool, optional):
- Guess the portion of the page to analyze per page. Default
True
- Guess the portion of the page to analyze per page. Default
- area (
list
offloat
, optional):- Portion of the page to analyze(top,left,bottom,right).
- Example: [269.875, 12.75, 790.5, 561] or [[12.1,20.5,30.1,50.2],[1.0,3.2,10.5,40.2]]. Default is entire page
- relative_area (bool, optional):
- If all area values are between 0-100 (inclusive) and preceded by '%', input will be taken as % of actual height or width of the page. Default
False
.
- If all area values are between 0-100 (inclusive) and preceded by '%', input will be taken as % of actual height or width of the page. Default
- lattice (bool, optional):
- [
spreadsheet
option is deprecated] Force PDF to be extracted using lattice-mode extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet).
- [
- stream (bool, optional):
- [
nospreadsheet
option is deprecated] Force PDF to be extracted using stream-mode extraction (if there are no ruling lines separating each cell, as in a PDF of an Excel spreadsheet)
- [
- password (bool, optional):
- Password to decrypt document. Default is empty
- silent (bool, optional):
- Suppress all stderr output.
- columns (list, optional):
- X coordinates of column boundaries.
- Example: [10.1, 20.2, 30.3]
- output_format (str, optional):
- Format for output file or extracted object.
- For
read_pdf()
:json
,dataframe
- For
convert_into()
:csv
,tsv
,json
- output_path (str, optional):
- Output file path. File format of it is depends on
format
. - Same as
--outfile
option of tabula-java.
- Output file path. File format of it is depends on
- java_options (
list
, optional):- Set java options like
-Xmx256m
.
- Set java options like
- pandas_options (
dict
, optional):- Set pandas options like
{'header': None}
.
- Set pandas options like
- multiple_tables (bool, optional):
- (Experimental) Extract multiple tables.
- This option uses JSON as an intermediate format, so if tabula-java output format will change, this option doesn't work.
There are several possible reasons, but tabula-py
is just a wrapper of tabula-java
, make sure you've installed Java and you can use java
command on your terminal. Many issue reporters forget to set PATH for java
command.
You can check whether tabula-py can call java
from Python process with tabula.environment_info()
function.
If you've installed tabula
, it will be conflict the namespace. You should install tabula-py
after removing tabula
.
pip uninstall tabula
pip install tabula-py
tabula-py
set guess
option True
by default, for beginners. It is known to make a conflict between stream
option. If you feel something strange with your result, please set guess=False
.
Yes. You can use options
argument as following. The format is same as cli of tabula-java.
read_pdf(file_path, options="--columns 10.1,20.2,30.3")
In short, you can extract with area
and spreadsheet
option.
In [4]: tabula.read_pdf('./table.pdf', spreadsheet=True, area=(337.29, 226.49, 472.85, 384.91))
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Out[4]:
Unnamed: 0 Col2 Col3 Col4 Col5
0 A B 12 R G
1 NaN R T 23 H
2 B B 33 R A
3 C T 99 E M
4 D I 12 34 M
5 E I I W 90
6 NaN 1 2 W h
7 NaN 4 3 E H
8 F E E4 R 4
How to use area
option
According to tabula-java wiki, there is a explain how to specify the area: https://github.com/tabulapdf/tabula-java/wiki/Using-the-command-line-tabula-extractor-tool#grab-coordinates-of-the-table-you-want
For example, using macOS's preview, I got area information of this PDF:
java -jar ./target/tabula-1.0.1-jar-with-dependencies.jar -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename
given
Note the left, top, height, and width parameters and calculate the following:
y1 = top
x1 = left
y2 = top + height
x2 = left + width
I confirmed with tabula-java:
java -jar ./tabula/tabula-1.0.1-jar-with-dependencies.jar -a "337.29,226.49,472.85,384.91" table.pdf
Without -r
(same as --spreadsheet
) option, it does not work properly.
This error occurs pandas trys to extract multiple tables with different column size at once.
Use multiple_tables
option, then you can avoid this error.
Set java_options=["-Djava.awt.headless=true"]
. kudos @jakekara
If the encoding of PDF is UTF-8, you should set chcp 65001
on your terminal before launching a Python process.
chcp 65001
Then you can extract UTF-8 PDF with java_options="-Dfile.encoding=UTF8"
option. This option will be added with encoding='utf-8'
option, which is also set by default.
# This is an example for java_options is set explicitly
df = read_pdf(file_path, java_options="-Dfile.encoding=UTF8")
Replace 65001
and UTF-8
appropriately, if the file encoding isn't UTF-8.
You should escape file/directory name yourself.
Interested in helping out? I'd love to have your help!
You can help by:
- Reporting a bug.
- Adding or editing documentation.
- Contributing code via a Pull Request.
- Write a blog post or spreading the word about
tabula-py
to people who might be able to benefit from using it.
You can also support our continued work on tabula-py
with a donation on Patreon.