Use the extract_table() method to parse out such a table #268

wuliKingQin · 2020-09-09T03:51:58Z

When I parse the PDF with extract_table(), neither the "lines" strategy (table settings { "vertical_strategy": "lines", "horizontal_strategy": "lines" }) nor the "text" strategy reads the complete table.
The pdf in question is as follows:
error_dpf_3.pdf

With "lines", the table is not detected; with "text", the parsed result is wrong:

平安银行股份有限公司
资产负债表
2019年12月31日(除特别注明外，金额单位均为人民币百万元)

Could you help me work out how a table like this one can be parsed?


samkit-jain · 2020-09-09T06:12:42Z

Hi @CuteyBoy, could you confirm which version of pdfplumber you are using?

I am running 0.5.23 with the following code:

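import pdfplumber

pdf = pdfplumber.open("file.pdf")
p = pdf.pages[0]

# Infer both column and row boundaries from text alignment,
# since the page has no ruling lines to detect.
ts = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}

# Save a debug image showing the cells the table finder detects.
im = p.to_image(resolution=150)
im.reset().debug_tablefinder(ts)
im.save("out.png", format="PNG")

tables = p.extract_tables(table_settings=ts)

# Print every row of every detected table.
for table in tables:
    for row in table:
        print(row)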

I get the following output, which appears to be almost correct:

['平安银行股份有限公司', '', '', '']
['资产负债表', '', '', '']
['2019年12月31日', '', '', '']
['(除特别注明外，金额单位均为人民币百万元)', '', '', '']
['', '附注三', '2019年12月31日', '2018年12月31日']
['资产', '', '', '']
['现金及存放中央银行款项', '1', '252,230', '278,528']
['存放同业款项', '2', '85,684', '85,098']
['贵金属', '', '51,191', '56,835']
['拆出资金', '3', '79,369', '72,934']
['衍生金融资产', '4', '18,500', '21,460']
['买入返售金融资产', '5', '62,216', '36,985']
['发放贷款和垫款', '6', '2,259,349', '1,949,757']
['金融投资：', '', '', '']
['交易性金融资产', '7', '206,682', '148,768']
['债权投资', '8', '656,290', '629,366']
['其他债权投资', '9', '182,264', '70,664']
['其他权益工具投资', '10', '1,844', '1,519']
['投资性房地产', '11', '247', '194']
['固定资产', '12', '11,092', '10,899']
['使用权资产', '13', '7,517', '-']
['无形资产', '14', '4,361', '4,771']
['商誉', '15', '7,568', '7,568']
['递延所得税资产', '16', '34,725', '29,468']
['其他资产', '17', '17,941', '13,778']

The bottom row is missing, which is related to #265.

As a workaround, you can crop off the bottom portion of the page (p = p.crop((0, 0, p.width, p.height-100))) and rerun the extraction, which gives the following output:

['平安银行股份有限公司', '', '', '']
['资产负债表', '', '', '']
['2019年12月31日', '', '', '']
['(除特别注明外，金额单位均为人民币百万元)', '', '', '']
['', '附注三', '2019年12月31日', '2018年12月31日']
['资产', '', '', '']
['现金及存放中央银行款项', '1', '252,230', '278,528']
['存放同业款项', '2', '85,684', '85,098']
['贵金属', '', '51,191', '56,835']
['拆出资金', '3', '79,369', '72,934']
['衍生金融资产', '4', '18,500', '21,460']
['买入返售金融资产', '5', '62,216', '36,985']
['发放贷款和垫款', '6', '2,259,349', '1,949,757']
['金融投资：', '', '', '']
['交易性金融资产', '7', '206,682', '148,768']
['债权投资', '8', '656,290', '629,366']
['其他债权投资', '9', '182,264', '70,664']
['其他权益工具投资', '10', '1,844', '1,519']
['投资性房地产', '11', '247', '194']
['固定资产', '12', '11,092', '10,899']
['使用权资产', '13', '7,517', '-']
['无形资产', '14', '4,361', '4,771']
['商誉', '15', '7,568', '7,568']
['递延所得税资产', '16', '34,725', '29,468']
['其他资产', '17', '17,941', '13,778']
['资产总计', '', '3,939,070', '3,418,592']
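
The crop workaround above can be wrapped in a couple of small helpers. This is only a sketch: the names crop_bottom, extract_rows, and rows_from_pdf are hypothetical and not part of pdfplumber, and the default margin of 100 points is simply the value used in this thread, so it will likely need tuning for other pages.

```python
def crop_bottom(page, margin=100):
    """Return `page` cropped so the bottom `margin` points are excluded."""
    # pdfplumber bounding boxes are (x0, top, x1, bottom), measured from the top-left.
    return page.crop((0, 0, page.width, page.height - margin))


def extract_rows(page, table_settings, margin=100):
    """Crop the page, then flatten every detected table into one list of rows."""
    cropped = crop_bottom(page, margin)
    rows = []
    for table in cropped.extract_tables(table_settings):
        rows.extend(table)
    return rows


def rows_from_pdf(path, margin=100):
    """Open `path` and extract rows from its first page using the text strategy."""
    import pdfplumber  # imported lazily so the helpers above accept any page-like object

    settings = {"vertical_strategy": "text", "horizontal_strategy": "text"}
    with pdfplumber.open(path) as pdf:
        return extract_rows(pdf.pages[0], settings, margin)
```

Calling rows_from_pdf("file.pdf") should then return the rows as a single list. Note that the crop assumes the page's bounding box starts at (0, 0), as in the snippet earlier in this thread.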

Does this resolve your issue?

wuliKingQin · 2020-09-09T07:43:07Z

Thank you very much!

wuliKingQin added the bug label Sep 9, 2020

wuliKingQin closed this as completed Sep 9, 2020
