Paragraph clustering will join empty lines into paragraphs #449

Closed
R0bk opened this issue Jun 28, 2020 · 5 comments

R0bk commented Jun 28, 2020

In some misgenerated PDFs I've found empty lines placed in between the actual lines. This causes the paragraph clustering algorithm to group the empty lines into a single large paragraph, which breaks any downstream tools that rely on paragraph bounding boxes.

[image: out_lines]
Above you can see the lines that are detected from the PDF (including the empty ones), and below you can see the detected paragraphs.

[image: out]
I think that the correct behaviour should be for pdfminer to exclude empty lines (or lines just containing white space) from paragraph clustering.

Here is an example pdf (sorry for the redaction but the problem is still apparent).
test.pdf
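
For downstream tools that depend on paragraph bounding boxes, one possible stopgap is to recompute each paragraph's box from its non-blank lines only. The sketch below is an assumption about how that could look, not part of pdfminer: `tight_bbox` and `StubLine` are hypothetical names, and the function only relies on each line having `get_text()` and a `bbox` attribute, which pdfminer's text line objects provide.

```python
from typing import Iterable, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def tight_bbox(lines: Iterable) -> Optional[BBox]:
    """Bounding box of the non-blank lines, or None if every line is blank."""
    # Skip lines whose text is empty or whitespace only.
    boxes = [line.bbox for line in lines if line.get_text().strip()]
    if not boxes:
        return None
    x0s, y0s, x1s, y1s = zip(*boxes)
    return (min(x0s), min(y0s), max(x1s), max(y1s))


class StubLine:
    """Hypothetical stand-in for a text line, so the sketch runs without pdfminer."""

    def __init__(self, text: str, bbox: BBox):
        self.text = text
        self.bbox = bbox

    def get_text(self) -> str:
        return self.text


paragraph = [
    StubLine("Hello\n", (10, 100, 60, 110)),
    StubLine(" \n", (10, 20, 12, 30)),  # whitespace-only line far below
    StubLine("World\n", (10, 85, 70, 95)),
]
print(tight_bbox(paragraph))
```

This only repairs the box of a paragraph that already exists; it does not undo the case where blank lines pulled unrelated lines into the same paragraph.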

@pietermarsman
Member

Can you print the objects that make up these empty lines? I think these are LTAnno objects. These represent objects that are not actually in the PDF. They are inserted to make the textual output look better.

In PDFs, the space and newline characters are often not explicitly printed; there is just whitespace between the characters.
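
A minimal sketch of how the objects in a line could be printed to tell real characters from inserted ones. `summarize_line` is a hypothetical helper; it only assumes that iterating a line yields objects with a `get_text()` method, as pdfminer's `LTChar` and `LTAnno` do. The two small classes below are stand-ins with the same names so the sketch runs without pdfminer installed.

```python
def summarize_line(line):
    """Return (class name, text) for every object that makes up a line."""
    return [(type(obj).__name__, obj.get_text()) for obj in line]


# Stand-ins that only mimic get_text() of the pdfminer classes of the same name.
class LTChar:
    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


class LTAnno:
    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


# A line holding one real space character followed by an inserted newline.
line = [LTChar(" "), LTAnno("\n")]
for name, text in summarize_line(line):
    print(name, repr(text))
```

If a blank line contains an `LTChar` whose text is a space, the space is really in the PDF; if it contains only `LTAnno` objects, it was inserted during layout analysis.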

@R0bk
Author

R0bk commented Jun 30, 2020

Here's one of the empty lines

{'x0': 127.56, 'y0': 635.0954096999999, 'x1': 130.16243445, 'y1': 644.4566847, 'width': 2.602434450000004, 'height': 9.361275000000091, 'bbox': (127.56, 635.0954096999999, 130.16243445, 644.4566847), '_objs': [<LTChar 127.560,635.095,130.162,644.457 matrix=[9.36,0.00,0.00,9.36, (127.56,637.08)] font='TgiIHCRX+ArialMT' adv=0.278 text=' '>], 'word_margin': 0.1, '_x1': 130.16243445}

I believe this is an LTChar, but I'm not very familiar with pdfminer or the PDF spec, so any info is helpful.

I also noticed that if I filter the neighbour clustering inside LTTextLineHorizontal to ignore empty lines it seems to solve my issue. See below:

        return [obj for obj in objs
                if (isinstance(obj, LTTextLineHorizontal) and
                    self._is_same_height_as(obj, tolerance=d) and
                    (self._is_left_aligned_with(obj, tolerance=d) or
                     self._is_right_aligned_with(obj, tolerance=d) or
                     self._is_centrally_aligned_with(obj, tolerance=d)))]

to (looks slightly different since I have a slightly older build of pdfminer locally)

        return [obj for obj in objs
                if (isinstance(obj, LTTextLineHorizontal) and
                    abs(obj.height-self.height) < d and
                    (abs(obj.x0-self.x0) < d or
                     abs(obj.x1-self.x1) < d)) and
                    ''.join([o._text for o in obj._objs]).strip() != '']

This changes the output from (blue boxes are lines, orange boxes are paragraphs):
[image: demo_git1]
to
[image: demo_git]

Maybe I should open a PR to do this (I'd clean it up a bunch first) or do you think that there is a better way to solve this?
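
The effect of the extra condition can be tried in isolation. The sketch below is a simplified stand-alone model, not pdfminer's code: `Line` and `neighbors` are hypothetical names, the spatial lookup that pdfminer does before filtering is left out, and only the height, alignment and blank-text checks from the older-build snippet above are reproduced.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """Hypothetical stand-in for a horizontal text line."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str

    @property
    def height(self) -> float:
        return self.y1 - self.y0


def neighbors(line: Line, candidates: List[Line], d: float) -> List[Line]:
    """Candidates with similar height and left/right alignment, skipping blank ones."""
    return [
        obj for obj in candidates
        if abs(obj.height - line.height) < d
        and (abs(obj.x0 - line.x0) < d or abs(obj.x1 - line.x1) < d)
        and obj.text.strip() != ""  # the added check: ignore whitespace-only lines
    ]


first = Line(10, 100, 200, 110, "first line")
blank = Line(10, 88, 12.6, 98, " ")  # same height and left edge, but only a space
second = Line(10, 76, 190, 86, "second line")

for obj in neighbors(first, [first, blank, second], d=5):
    print(obj.text)
```

Without the last condition the blank line would pass the height and left-alignment checks and act as a bridge between otherwise separate lines.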

pietermarsman added the "component: converter" (Related to any PDFLayoutAnalyzer) and "type: new feature" labels on Jul 5, 2020
@pietermarsman
Member

It could definitely be that PDFs with space characters give counter-intuitive results, and the layout algorithm should ignore these characters. So this feature request is accepted!

If you have time to work on it, feel free!

I'm not sure what the fix should be. I think the space must be part of the output when enumerating all the objects, but not part of the output when analyzing/converting to text. Maybe an approach similar to LAParams.all_texts would work. If all_texts=False, the layout analysis algorithm stops at LTFigure objects and thus excludes text inside them. In a similar fashion, the layout algorithm could stop at empty LTText objects.
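
One way to read this suggestion, sketched under assumptions: blank lines are split off before grouping and handed back separately, so they still appear when enumerating objects but never influence the paragraphs. `layout_without_blanks`, `TextStub` and the `group` callback are hypothetical names for illustration; this is not how pdfminer is implemented.

```python
from typing import Callable, List, Sequence, Tuple


def layout_without_blanks(
    lines: Sequence, group: Callable[[List], List]
) -> Tuple[List, List]:
    """Group only non-blank lines; return (groups, blank_lines)."""
    content = [ln for ln in lines if ln.get_text().strip()]
    blanks = [ln for ln in lines if not ln.get_text().strip()]
    # Blank lines are kept for enumeration but never passed to the grouping step.
    return group(content), blanks


class TextStub:
    """Hypothetical stand-in for a text line; only get_text() is needed."""

    def __init__(self, text: str):
        self.text = text

    def get_text(self) -> str:
        return self.text


a, space, b = TextStub("first\n"), TextStub(" \n"), TextStub("second\n")

# A trivial grouping step that puts all of its input into one paragraph.
groups, blanks = layout_without_blanks([a, space, b], lambda ls: [ls])
print(len(groups[0]), len(blanks))
```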

@jwyawney

jwyawney commented Nov 9, 2021

I know this is a long-outstanding issue, but I just submitted a PR for it because I was running into similar problems in my code. Please let me know if you see any issues with the changes; I can likely make any required changes some time this week or early next.

@pietermarsman
Member

Fixed by #689
