Paragraph clustering will join empty lines into paragraphs #449
Comments
Can you print the objects that make up these empty lines? I think these are In PDFs, the space and newline characters are often not explicitly printed; there is just whitespace between characters.
Here's one of the empty lines
I believe this is an LTChar, but I'm not very familiar with the pdfminer or PDF specs, so any info is helpful. I also noticed that changing the neighbour filter in pdfminer.six/pdfminer/layout.py (lines 441 to 446 in ac2b20a) to the following (it looks slightly different since I have a slightly older build of pdfminer locally):

    return [obj for obj in objs
            if (isinstance(obj, LTTextLineHorizontal) and
                abs(obj.height-self.height) < d and
                (abs(obj.x0-self.x0) < d or
                 abs(obj.x1-self.x1) < d)) and
            ''.join([o._text for o in obj._objs]).strip() != '']

changes the output (blue boxes are lines, orange boxes are paragraphs).

Maybe I should open a PR to do this (I'd clean it up a bunch first), or do you think that there is a better way to solve this?
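The filter above can be tried in isolation, without a PDF. The following is only a minimal sketch of the same idea: `Line`, `is_blank`, and the standalone `find_neighbors` function are hypothetical stand-ins written for illustration, not pdfminer classes or the code that was eventually merged.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """Hypothetical stand-in for a horizontal text line (not a pdfminer class)."""
    text: str
    x0: float
    x1: float
    height: float


def is_blank(line: Line) -> bool:
    """True when the line holds nothing but whitespace."""
    return line.text.strip() == ""


def find_neighbors(line: Line, candidates: List[Line], d: float) -> List[Line]:
    """Keep candidates of similar height that share a left or right edge
    with `line`, skipping whitespace-only candidates."""
    return [
        obj for obj in candidates
        if not is_blank(obj)
        and abs(obj.height - line.height) < d
        and (abs(obj.x0 - line.x0) < d or abs(obj.x1 - line.x1) < d)
    ]


lines = [
    Line("First line", 72.0, 300.0, 12.0),
    Line("   \n", 72.0, 80.0, 12.0),   # whitespace-only line
    Line("Second line", 72.0, 310.0, 12.0),
]
neighbors = find_neighbors(lines[0], lines, d=3.0)
print([n.text for n in neighbors])
```

The whitespace-only line shares its left edge and height with the others, so without the `is_blank` check it would be returned as a neighbour too.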
It could definitely be that PDFs with space characters give counter-intuitive results, and the layout algorithm should ignore these characters. So this feature request is accepted! If you have time to work on it, feel free! I'm not sure what the fix should be. I think the space must be part of the output when enumerating all the objects, but not part of the output when analyzing/converting to text. Maybe an approach similar to
I know this is a long-outstanding issue, but I just submitted the PR for this, as I was experiencing similar issues in my code. Please let me know if you see any issues with the changes; I can likely make any required changes some time this week or early next.
Fixed by #689
In some misgenerated PDFs, I've found that empty lines are placed in between the actual lines. This causes the paragraph clustering algorithm to group the empty lines into a single large paragraph, which breaks any downstream tools that rely on paragraph bounding boxes.
You can see above the lines that are detected from the PDF (including empty ones), and below you can see the detected paragraphs.
I think that the correct behaviour should be for pdfminer to exclude empty lines (or lines containing only whitespace) from paragraph clustering.
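To illustrate the effect on bounding boxes, here is a toy sketch. It uses a deliberately simple gap-based grouping, which is an assumption made for illustration and not pdfminer's actual clustering algorithm; `Line`, `group_lines`, and `bbox` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Line:
    """Hypothetical text line with a PDF-style bounding box (y grows upward)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_lines(lines: List[Line], max_gap: float) -> List[List[Line]]:
    """Toy grouping: walk lines top to bottom and start a new group
    whenever the vertical gap to the previous line exceeds max_gap."""
    groups: List[List[Line]] = []
    for line in sorted(lines, key=lambda l: -l.y1):
        if groups and groups[-1][-1].y0 - line.y1 <= max_gap:
            groups[-1].append(line)
        else:
            groups.append([line])
    return groups


def bbox(group: List[Line]) -> Tuple[float, float, float, float]:
    """Bounding box (x0, y0, x1, y1) enclosing every line in the group."""
    return (min(l.x0 for l in group), min(l.y0 for l in group),
            max(l.x1 for l in group), max(l.y1 for l in group))


# Two real paragraphs separated by a run of whitespace-only lines.
lines = [
    Line("Paragraph one, line one", 72, 700, 400, 712),
    Line("Paragraph one, line two", 72, 686, 400, 698),
    Line(" ", 72, 672, 76, 684),
    Line(" ", 72, 658, 76, 670),
    Line(" ", 72, 644, 76, 656),
    Line(" ", 72, 630, 76, 642),
    Line(" ", 72, 616, 76, 628),
    Line("Paragraph two, line one", 72, 600, 400, 612),
    Line("Paragraph two, line two", 72, 586, 400, 598),
]

with_blanks = group_lines(lines, max_gap=5.0)
without_blanks = group_lines([l for l in lines if l.text.strip()], max_gap=5.0)

print(len(with_blanks), len(without_blanks))
print([bbox(g) for g in without_blanks])
```

With the blank lines included, the blanks bridge the gap and everything collapses into one group; with them filtered out first, the two paragraphs get separate bounding boxes.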
Here is an example pdf (sorry for the redaction but the problem is still apparent).
test.pdf