Paragraph clustering will join empty lines into paragraphs #449

R0bk · 2020-06-28T23:22:28Z

In some misgenerated PDFs I've found empty lines placed between the actual lines. This causes the paragraph clustering algorithm to group the empty lines into a single large paragraph, which breaks any downstream tools that rely on paragraph bounding boxes.

You can see above the lines that are detected from the PDF (including empty ones), and below you can see the detected paragraphs.

I think the correct behaviour would be for pdfminer to exclude empty lines (or lines containing only whitespace) from paragraph clustering.

Here is an example pdf (sorry for the redaction but the problem is still apparent).
test.pdf

pietermarsman · 2020-06-29T18:37:50Z

Can you print the objects that make up these empty lines? I think these are LTAnno objects. These represent objects that are not actually in the PDF. They are inserted to make the textual output look better.

In PDFs, the space and newline characters are often not explicitly printed. There is just whitespace between characters.
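A minimal sketch of how the children of a line could be listed to tell the two cases apart. It assumes each line is iterable and that every child exposes `get_text()`, as pdfminer's `LTTextLine`, `LTChar` and `LTAnno` do. `describe_line` is a hypothetical helper name, not part of pdfminer.

```python
def describe_line(line):
    """Return (class name, text, bbox) for each child object of a text line.

    Assumption: `line` is iterable and each child has get_text().
    Objects without a position (LTAnno has no bbox) report None,
    which is one way to spot characters that are not really in the PDF.
    """
    return [
        (type(obj).__name__, obj.get_text(), getattr(obj, "bbox", None))
        for obj in line
    ]
```

With pdfminer installed, the lines could come from iterating the text containers yielded by `pdfminer.high_level.extract_pages`. A child reported as `LTChar` with text `' '` is a real space glyph in the PDF, while an `LTAnno` is one that pdfminer inserted.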

R0bk · 2020-06-30T00:38:22Z

Here's one of the empty lines

{'x0': 127.56, 'y0': 635.0954096999999, 'x1': 130.16243445, 'y1': 644.4566847, 'width': 2.602434450000004, 'height': 9.361275000000091, 'bbox': (127.56, 635.0954096999999, 130.16243445, 644.4566847), '_objs': [<LTChar 127.560,635.095,130.162,644.457 matrix=[9.36,0.00,0.00,9.36, (127.56,637.08)] font='TgiIHCRX+ArialMT' adv=0.278 text=' '>], 'word_margin': 0.1, '_x1': 130.16243445}

I believe this is an LTChar, but I'm not very familiar with pdfminer or the PDF spec, so any info is helpful.

I also noticed that if I filter the neighbour clustering inside LTTextLineHorizontal to ignore empty lines it seems to solve my issue. See below:

pdfminer.six/pdfminer/layout.py

Lines 441 to 446 in ac2b20a

    
        return [obj for obj in objs
                if (isinstance(obj, LTTextLineHorizontal) and
                    self._is_same_height_as(obj, tolerance=d) and
                    (self._is_left_aligned_with(obj, tolerance=d) or
                     self._is_right_aligned_with(obj, tolerance=d) or
                     self._is_centrally_aligned_with(obj, tolerance=d)))]

to (looks slightly different since I have a slightly older build of pdfminer locally)

        return [obj for obj in objs
                if (isinstance(obj, LTTextLineHorizontal) and
                    abs(obj.height-self.height) < d and
                    (abs(obj.x0-self.x0) < d or
                     abs(obj.x1-self.x1) < d)) and
                    ''.join([o._text for o in obj._objs]).strip() != '']
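The whitespace test in the last line of that comprehension could also be written against the public `get_text()` method rather than the private `_text` and `_objs` attributes. This is only a sketch: `has_visible_text` and `drop_blank_neighbours` are hypothetical names, and it assumes each candidate line exposes `get_text()` as pdfminer's text lines do.

```python
def has_visible_text(line):
    """True when the line contains at least one non-whitespace character."""
    return line.get_text().strip() != ""


def drop_blank_neighbours(neighbours):
    """Keep only the neighbouring lines that carry visible text.

    Meant to be applied on top of the height and alignment checks,
    so whitespace-only lines never join a paragraph cluster.
    """
    return [obj for obj in neighbours if has_visible_text(obj)]
```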

Changes the output from: (Blue boxes are lines, orange are paragraphs)

to

Maybe I should open a PR to do this (I'd clean it up a bunch first) or do you think that there is a better way to solve this?

pietermarsman · 2020-07-05T11:31:52Z

It could definitely be that PDFs with space characters give counter-intuitive results, and the layout algorithm should ignore these characters. So this feature request is accepted!

If you have time to work on it, feel free!

I'm not sure what the fix should be. I think the space must be part of the output when enumerating all the objects, but not part of the output when analyzing/converting to text. Maybe an approach similar to LAParams.all_texts works. If all_texts=False the layout analysis algorithm stops at LTFigure objects and thus excludes text inside them. In a similar fashion the layout algorithm could stop at empty LTText objects.
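One way to read that idea, as a rough sketch only and not necessarily what the eventual fix does: split the lines before grouping, so whitespace-only lines remain available when enumerating objects but never reach the clustering step. `partition_lines`, `group_ignoring_blank` and the `group` callback are hypothetical names used for illustration, and each line is assumed to expose `get_text()`.

```python
def partition_lines(lines):
    """Split lines into (textual, blank) while preserving their order."""
    textual, blank = [], []
    for line in lines:
        if line.get_text().strip() == "":
            blank.append(line)
        else:
            textual.append(line)
    return textual, blank


def group_ignoring_blank(lines, group):
    """Cluster only the textual lines and return the blank ones separately.

    `group` is any callable that turns a list of lines into paragraphs.
    The blank lines are handed back untouched so the caller can still
    list them alongside the paragraphs.
    """
    textual, blank = partition_lines(lines)
    return group(textual), blank
```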

jwyawney · 2021-11-09T22:18:20Z

I know this is a long outstanding issue but I just submitted the PR for this as I was experiencing similar issues in my code. Please let me know if you see any issues with the changes and I can likely make any required changes some time this week or early next.

pietermarsman · 2022-02-22T20:20:45Z

Fixed by #689

pietermarsman added the component: converter and type: new feature labels Jul 5, 2020

pietermarsman added the type: bug label Jul 10, 2020

jwyawney mentioned this issue Nov 9, 2021

Added in checks for sprurious lines in malformed PDFs #689

Merged


pietermarsman closed this as completed Feb 22, 2022
