Is there any way to include blank lines when extracting texts? #516

Closed
flycattt opened this issue Oct 16, 2021 · 5 comments
Labels
feature-request All feature requests receive this label initially, can be upgraded to "enhancement"

Comments

flycattt commented Oct 16, 2021

Hi, I am using this fabulous library to extract text from PDFs. In my PDFs, records are separated by blank lines. However, after extraction, I only get a single line break, as marked in the screenshot. This makes parsing the records difficult, because some records start without a pattern string. I looked through the manual but didn't find a solution. I would much appreciate your help with this!
image
image

flycattt added the feature-request label Oct 16, 2021
Owner

jsvine commented Oct 16, 2021

Hi! Something like this has been a longstanding request — see, e.g., #10 from 2016. I think it's probably time to really try adding this feature! Or at least something useful enough, if not perfect. Thanks for the nudge. In the meantime, there are a few ways you could handle this, though the best approach will depend on your specific PDF. One approach you might try:

  • Use utils.cluster_objects(page.chars, attr="doctop", tolerance=??), where ?? is an integer slightly less than 2x the line height.
  • For each list of chars returned by the step above, run utils.extract_text(chars).
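The two steps above can be sketched without a PDF at hand. This is only an illustration under stated assumptions: `split_into_records` and `record_text` are hypothetical pure-Python stand-ins for `utils.cluster_objects` and `utils.extract_text`, the char dicts are synthetic (real ones come from `page.chars`), and the 12 pt line height and tolerance of 20 are made up for the example.

```python
# Hypothetical stand-ins for the pdfplumber utilities mentioned above.
# Assumption: each char is a dict with "text", "x0" and "doctop" keys,
# which matches the keys seen on pdfplumber char objects.

def split_into_records(chars, tolerance):
    """Group chars whose consecutive doctop values differ by <= tolerance."""
    groups = []
    last_doctop = None
    for char in sorted(chars, key=lambda c: c["doctop"]):
        if last_doctop is None or char["doctop"] - last_doctop > tolerance:
            groups.append([])  # vertical gap too large: start a new group
        groups[-1].append(char)
        last_doctop = char["doctop"]
    return groups


def record_text(chars, line_tolerance=3):
    """Naive text for one record: split into lines, then order each line by x0."""
    lines = split_into_records(chars, line_tolerance)
    return "\n".join(
        "".join(c["text"] for c in sorted(line, key=lambda c: c["x0"]))
        for line in lines
    )


def make_line(text, doctop):
    """Build synthetic chars for one line of text (6 pt per character)."""
    return [
        {"text": ch, "x0": 6.0 * i, "doctop": float(doctop)}
        for i, ch in enumerate(text)
    ]


# Two records with a 12 pt line height; the blank line doubles the gap to 24 pt.
chars = (
    make_line("AB", 100) + make_line("CD", 112)
    + make_line("EF", 136) + make_line("GH", 148)
)

# Tolerance slightly less than 2x the line height, as suggested above.
records = split_into_records(chars, tolerance=20)
text = "\n\n".join(record_text(record) for record in records)
print(text)
```

With a real page, the first step would be the `utils.cluster_objects` call from the list above, and the right tolerance depends on the line height in your particular PDF.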

Closing this issue due to the similarity to #10, but feel free to continue the discussion here.

jsvine closed this as completed Oct 16, 2021

abtpltd commented Oct 26, 2023

doctop_clusters:
{'text': u'Q', 'object_type': 'char', 'height': Decimal('12.268'), 'upright': 1, 'y1': Decimal('500.152'), 'y0': Decimal('487.884'), 'x0': Decimal('250.765'), 'x1': Decimal('257.541'), 'size': Decimal('15.379'), 'adv': Decimal('6.776'), 'fontname': 'ABCDEE+Segoe UI', 'doctop': Decimal('291.848'), 'bottom': Decimal('304.116'), 'top': Decimal('291.848'), 'width': Decimal('6.776'), 'page_number': 1}, {'text': u'I', 'object_type': 'char', 'height': Decimal('12.268'), 'upright': 1, 'y1': Decimal('500.152'), 'y0': Decimal('487.884'), 'x0': Decimal('257.541'), 'x1': Decimal('259.935'), 'size': Decimal('15.379'), 'adv': Decimal('2.394'), 'fontname': 'ABCDEE+Segoe UI', 'doctop': Decimal('291.848'), 'bottom': Decimal('304.116'), 'top': Decimal('291.848'), 'width': Decimal('2.394'), 'page_number': 1}

What values of x_tolerance and y_tolerance do I need to use, in place of 3, so that blank lines are exported? I am currently trying:
page.extract_text(x_tolerance=3, y_tolerance=10)

Owner

jsvine commented Oct 26, 2023

@flycattt Try using page.extract_text(layout=True); you probably do not need to specify the x_tolerance or y_tolerance parameters.
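Once blank lines survive extraction, splitting the text into records is straightforward. A minimal sketch, assuming the output of `page.extract_text(layout=True)` keeps the separators as empty or whitespace-only lines; `split_records` is a hypothetical helper and `sample` is a made-up string, not real pdfplumber output.

```python
def split_records(text):
    """Split text into records at runs of empty or whitespace-only lines."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.rstrip())  # drop trailing padding only
        elif current:
            records.append("\n".join(current))
            current = []
    if current:
        records.append("\n".join(current))
    return records


# In real use, `sample` would come from something like:
#     text = page.extract_text(layout=True)
sample = "  Alice   42\n  Paris\n      \n  Bob     17\n\n\n  Carol   9\n"

for record in split_records(sample):
    print(repr(record))
```

Leading whitespace is kept on purpose, since the layout mode uses it to mirror horizontal position on the page.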

abtpltd commented Oct 26, 2023

I'm using Python 2.7 and pdfplumber version_info = (0, 5, 11).
Now it's crashing. Please help.
page = pdf.pages[0]
bounding_box = (d['x1'], d['y1'], d['x2'],d['y2'])
crop_area = page.crop(bounding_box)
print crop_area.extract_text(layout=True)

Owner

jsvine commented Oct 26, 2023

That's a very old version of pdfplumber (and of Python). I suggest updating.
