Is there any way to include blank lines when extracting texts? #516

flycattt · 2021-10-16T12:24:58Z

Hi, I am using this fabulous library to extract text from PDFs. In my PDFs, records are separated by blank lines. However, after extraction, I only get a single line break where the blank line should be, as marked in the screenshot. This makes the records hard to parse, because some of them start without a recognizable pattern string. I looked through the manual but didn't find a solution. I'd much appreciate it if you could help me with this!

jsvine · 2021-10-16T18:06:47Z

Hi! Something like this has been a longstanding request — see, e.g., #10 from 2016. I think it's probably time to really try adding this feature! Or at least something useful enough, if not perfect. Thanks for the nudge. In the meantime, there are a few ways you could handle this, though the best approach will depend on your specific PDF. One approach you might try:

Use utils.cluster_objects(page.chars, attr="doctop", tolerance=??), where ?? is an integer slightly less than 2x the line height.
For each list of chars returned by the step above, run utils.extract_text(chars).
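
The two steps above can be sketched without a PDF at hand. This is a minimal standard-library stand-in, not pdfplumber's own code: `cluster_by_doctop`, `cluster_text` and `make_line` are hypothetical helpers, and the char dicts are hand-written, carrying only the `text`, `x0` and `doctop` keys. It assumes a line pitch of 14 points, so a tolerance of 20 (less than 2x the pitch) keeps adjacent lines together and splits wherever a blank line was skipped.

```python
def cluster_by_doctop(chars, tolerance):
    """Group chars so that consecutive 'doctop' values within a group
    differ by at most `tolerance` (a chained 1-D clustering)."""
    ordered = sorted(chars, key=lambda c: (c["doctop"], c["x0"]))
    clusters = []
    for ch in ordered:
        if clusters and ch["doctop"] - clusters[-1][-1]["doctop"] <= tolerance:
            clusters[-1].append(ch)
        else:
            clusters.append([ch])
    return clusters


def cluster_text(chars, line_tolerance=3):
    """Rebuild the text of one cluster: split it into lines, then read
    each line left to right."""
    lines = cluster_by_doctop(chars, line_tolerance)
    return "\n".join(
        "".join(c["text"] for c in sorted(line, key=lambda c: c["x0"]))
        for line in lines
    )


def make_line(text, doctop):
    """Build fake char dicts for one line of text (illustration only)."""
    return [{"text": t, "x0": 10.0 * i, "doctop": doctop} for i, t in enumerate(text)]


# Two records of two lines each; the jump from 114 to 142 is the blank line.
chars = (
    make_line("AB", 100) + make_line("CD", 114)
    + make_line("EF", 142) + make_line("GH", 156)
)

records = [cluster_text(c) for c in cluster_by_doctop(chars, tolerance=20)]
print(records)
```

With real data, `page.chars` would take the place of the hand-built list, and the two library calls named above would replace the helpers.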

Closing this issue due to the similarity to #10, but feel free to continue the discussion here.

abtpltd · 2023-10-26T07:34:18Z

doctop_clusters:
{'text': u'Q', 'object_type': 'char', 'height': Decimal('12.268'), 'upright': 1, 'y1': Decimal('500.152'), 'y0': Decimal('487.884'), 'x0': Decimal('250.765'), 'x1': Decimal('257.541'), 'size': Decimal('15.379'), 'adv': Decimal('6.776'), 'fontname': 'ABCDEE+Segoe UI', 'doctop': Decimal('291.848'), 'bottom': Decimal('304.116'), 'top': Decimal('291.848'), 'width': Decimal('6.776'), 'page_number': 1}, {'text': u'I', 'object_type': 'char', 'height': Decimal('12.268'), 'upright': 1, 'y1': Decimal('500.152'), 'y0': Decimal('487.884'), 'x0': Decimal('257.541'), 'x1': Decimal('259.935'), 'size': Decimal('15.379'), 'adv': Decimal('2.394'), 'fontname': 'ABCDEE+Segoe UI', 'doctop': Decimal('291.848'), 'bottom': Decimal('304.116'), 'top': Decimal('291.848'), 'width': Decimal('2.394'), 'page_number': 1}

What values of x_tolerance and y_tolerance (instead of 3 and 3) do I need to pass so that blank lines are exported?
page.extract_text(x_tolerance=3, y_tolerance=10)

jsvine · 2023-10-26T13:54:22Z

@flycattt Try using page.extract_text(layout=True); you probably do not need to specify the x_tolerance or y_tolerance parameters.
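
Once the extracted text keeps its blank lines, the records from the original question can be split on them. The following is a minimal sketch under stated assumptions: `split_records` is a hypothetical helper, the sample string is hand-written rather than real pdfplumber output, and blank lines are assumed to possibly contain only spaces.

```python
import re


def split_records(text):
    """Split text into records separated by one or more blank lines.
    A line holding only whitespace counts as blank."""
    records = []
    for block in re.split(r"\n\s*\n", text):
        # Drop empty lines and trailing padding; keep leading indentation.
        lines = [ln.rstrip() for ln in block.splitlines() if ln.strip()]
        if lines:
            records.append("\n".join(lines))
    return records


# Hand-written sample: padded lines, a whitespace-only line, a double blank.
sample = "  Alice  \n  line two\n      \n  Bob\n\n\n  Carol\n"

records = split_records(sample)
print(records)
```

Leading spaces are kept on purpose, since in a layout-style extraction they may carry column information.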

abtpltd · 2023-10-26T15:56:48Z

I'm using Python 2.7 and pdfplumber version_info = (0, 5, 11).
Now it's crashing. Please help.
page = pdf.pages[0]
bounding_box = (d['x1'], d['y1'], d['x2'],d['y2'])
crop_area = page.crop(bounding_box)
print crop_area.extract_text(layout=True)

jsvine · 2023-10-26T19:14:00Z

That's a very old version of pdfplumber (and of Python). I suggest updating.

flycattt added the feature-request label Oct 16, 2021

jsvine closed this as completed Oct 16, 2021
