Use the extract_table() method to parse out such a table #268

Closed
wuliKingQin opened this issue Sep 9, 2020 · 2 comments

@wuliKingQin

When I parse the PDF with the extract_table() method, neither the "lines" strategy (table settings { "vertical_strategy": "lines", "horizontal_strategy": "lines" }) nor the "text" strategy reads the complete table.
The PDF in question is attached:
error_dpf_3.pdf

With "lines", the table is not detected at all; with "text", the parsed result is wrong:

平安银行股份有限公司
资产负债表
2019年12月31日(除特别注明外,金额单位均为人民币百万元)

Can you help me work out how to parse the table in a PDF like this?

@wuliKingQin wuliKingQin added the bug label Sep 9, 2020
@samkit-jain
Collaborator

Hi @CuteyBoy, could you confirm which version of pdfplumber you are using?

I am running 0.5.23 with the following code:

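import pdfplumber

pdf = pdfplumber.open("file.pdf")
p = pdf.pages[0]

# Infer both column and row boundaries from text alignment, not ruled lines
ts = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}

# Save a debug image showing the table cells that pdfplumber detects
im = p.to_image(resolution=150)
im.reset().debug_tablefinder(ts)
im.save("out.png", format="PNG")

tables = p.extract_tables(table_settings=ts)

# Print every extracted row
for table in tables:
    for row in table:
        print(row)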

I get the following output, which appears to be almost correct:

['平安银行股份有限公司', '', '', '']
['资产负债表', '', '', '']
['2019年12月31日', '', '', '']
['(除特别注明外,金额单位均为人民币百万元)', '', '', '']
['', '附注三', '2019年12月31日', '2018年12月31日']
['资产', '', '', '']
['现金及存放中央银行款项', '1', '252,230', '278,528']
['存放同业款项', '2', '85,684', '85,098']
['贵金属', '', '51,191', '56,835']
['拆出资金', '3', '79,369', '72,934']
['衍生金融资产', '4', '18,500', '21,460']
['买入返售金融资产', '5', '62,216', '36,985']
['发放贷款和垫款', '6', '2,259,349', '1,949,757']
['金融投资:', '', '', '']
['交易性金融资产', '7', '206,682', '148,768']
['债权投资', '8', '656,290', '629,366']
['其他债权投资', '9', '182,264', '70,664']
['其他权益工具投资', '10', '1,844', '1,519']
['投资性房地产', '11', '247', '194']
['固定资产', '12', '11,092', '10,899']
['使用权资产', '13', '7,517', '-']
['无形资产', '14', '4,361', '4,771']
['商誉', '15', '7,568', '7,568']
['递延所得税资产', '16', '34,725', '29,468']
['其他资产', '17', '17,941', '13,778']

The bottom row of the table is missing, which is related to #265.

As a workaround, you can crop off the bottom portion of the page (p = p.crop((0, 0, p.width, p.height-100))) and then rerun the extraction, which gives the following output:

['平安银行股份有限公司', '', '', '']
['资产负债表', '', '', '']
['2019年12月31日', '', '', '']
['(除特别注明外,金额单位均为人民币百万元)', '', '', '']
['', '附注三', '2019年12月31日', '2018年12月31日']
['资产', '', '', '']
['现金及存放中央银行款项', '1', '252,230', '278,528']
['存放同业款项', '2', '85,684', '85,098']
['贵金属', '', '51,191', '56,835']
['拆出资金', '3', '79,369', '72,934']
['衍生金融资产', '4', '18,500', '21,460']
['买入返售金融资产', '5', '62,216', '36,985']
['发放贷款和垫款', '6', '2,259,349', '1,949,757']
['金融投资:', '', '', '']
['交易性金融资产', '7', '206,682', '148,768']
['债权投资', '8', '656,290', '629,366']
['其他债权投资', '9', '182,264', '70,664']
['其他权益工具投资', '10', '1,844', '1,519']
['投资性房地产', '11', '247', '194']
['固定资产', '12', '11,092', '10,899']
['使用权资产', '13', '7,517', '-']
['无形资产', '14', '4,361', '4,771']
['商誉', '15', '7,568', '7,568']
['递延所得税资产', '16', '34,725', '29,468']
['其他资产', '17', '17,941', '13,778']
['资产总计', '', '3,939,070', '3,418,592']
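
For reference, the crop workaround above can be wrapped in a small helper. This is a minimal sketch, not code from the thread: the function names bottom_cropped_bbox and extract_rows are hypothetical, the 100-point margin is simply the value used above, and it assumes the page's bounding box starts at (0, 0), as the crop call above does.

```python
def bottom_cropped_bbox(width, height, margin=100):
    """Return an (x0, top, x1, bottom) box that drops `margin` points from the page bottom.

    Hypothetical helper; mirrors p.crop((0, 0, p.width, p.height - 100)).
    """
    if not 0 <= margin < height:
        raise ValueError("margin must be non-negative and smaller than the page height")
    return (0, 0, width, height - margin)


def extract_rows(path, page_index=0, margin=100):
    """Extract table rows from one page after cropping off its bottom strip."""
    import pdfplumber  # imported lazily so the helper above works without it

    # Same text-based strategies as in the snippet earlier in this comment
    table_settings = {
        "vertical_strategy": "text",
        "horizontal_strategy": "text",
    }
    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index]
        cropped = page.crop(bottom_cropped_bbox(page.width, page.height, margin))
        tables = cropped.extract_tables(table_settings=table_settings)
    # Flatten all detected tables into a single list of rows
    return [row for table in tables for row in table]


# Usage (assumes a local file named "file.pdf"):
# for row in extract_rows("file.pdf"):
#     print(row)
```

The margin probably needs tuning per document: too small and the last row may still be dropped, too large and real rows are cut off.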

Does this resolve your issue?

@wuliKingQin
Author

Thank you very much!
