Use the extract_table() method to parse out such a table #268
Comments
Hi @CuteyBoy, could you confirm which version of pdfplumber you are using? Running on 0.5.23 and using the following code:

    import pdfplumber

    pdf = pdfplumber.open("file.pdf")
    p = pdf.pages[0]

    # Infer both column and row boundaries from text alignment
    ts = {
        "vertical_strategy": "text",
        "horizontal_strategy": "text",
    }

    # Save a debug image showing the tables the finder detects
    im = p.to_image(resolution=150)
    im.reset().debug_tablefinder(ts)
    im.save("out.png", format="PNG")

    # Extract the tables with the same settings and print each row
    tables = p.extract_tables(table_settings=ts)
    for table in tables:
        for row in table:
            print(row)

I am getting the following result, which appears to be almost correct.
The bottom line is missing, which is related to #265. As a workaround, you can crop the bottom portion of the page (
Does this resolve your issue?
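The cropping workaround mentioned above could look roughly like the sketch below. It is only an illustration: the helper names `trim_bottom` and `extract_tables_cropped` and the `trim` value of 20 points are assumptions, not part of pdfplumber, and the right amount to trim depends on the page. It relies on `page.bbox`, `page.crop()` and `extract_tables()`.

```python
def trim_bottom(bbox, trim):
    """Return a copy of bbox (x0, top, x1, bottom) with `trim` units cut off the bottom."""
    x0, top, x1, bottom = bbox
    if trim < 0 or trim >= bottom - top:
        raise ValueError("trim must be non-negative and smaller than the box height")
    return (x0, top, x1, bottom - trim)


def extract_tables_cropped(path, trim=20, table_settings=None, page_index=0):
    """Crop the bottom of one page, then extract its tables.

    `trim` is a hypothetical default; tune it for the document at hand.
    """
    import pdfplumber  # imported here so trim_bottom stays usable on its own

    if table_settings is None:
        table_settings = {
            "vertical_strategy": "text",
            "horizontal_strategy": "text",
        }
    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index]
        cropped = page.crop(trim_bottom(page.bbox, trim))
        return cropped.extract_tables(table_settings=table_settings)
```

Calling `extract_tables_cropped("file.pdf")` would then take the place of the `p.extract_tables(table_settings=ts)` call in the snippet above.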
Thank you very much!
When I parse the PDF with the extract_table() method, it does not matter which strategy I pass in the table settings (for example { "vertical_strategy": "lines", "horizontal_strategy": "lines", }): neither lines nor text reads the complete table.

The PDF in question is as follows:
error_dpf_3.pdf
With lines, the table is not detected at all; with text, the parsed result is wrong:
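平安银行股份有限公司 (Ping An Bank Co., Ltd.)
资产负债表 (Balance Sheet)
2019年12月31日(除特别注明外,金额单位均为人民币百万元) (31 December 2019; amounts in millions of RMB unless otherwise stated)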
Can you help me work out how a table like this one can be parsed?
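For reference, one way to compare the lines and text strategies side by side is to run every vertical/horizontal pairing and see how many tables each one finds. This is only a debugging sketch: the helper names `strategy_combinations` and `count_tables_per_setting` are made up for illustration, and nothing here guarantees that any pairing parses this particular PDF correctly.

```python
from itertools import product

STRATEGIES = ("lines", "text")


def strategy_combinations(strategies=STRATEGIES):
    """Yield a table_settings dict for every vertical/horizontal pairing."""
    for vertical, horizontal in product(strategies, repeat=2):
        yield {
            "vertical_strategy": vertical,
            "horizontal_strategy": horizontal,
        }


def count_tables_per_setting(path, page_index=0):
    """Return (settings, number_of_tables) for each pairing on one page."""
    import pdfplumber  # imported here so the generator above has no dependency

    results = []
    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index]
        for settings in strategy_combinations():
            tables = page.extract_tables(table_settings=settings)
            results.append((settings, len(tables)))
    return results
```

Running `count_tables_per_setting("error_dpf_3.pdf")` and then inspecting the most promising pairing with `debug_tablefinder()` may help narrow down which edges are being missed.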