Layout speedup rebased #315
Conversation
Thanks for rebasing! I have a couple of questions that I'd like answered before merging this:
I can confirm that it is faster. I've tried a PDF with 48 pages and it goes from 69 seconds to 36.
I have my doubts about removing the ordering from the list. I think the order is important because it determines which page elements are merged first. In the original implementation, two elements are merged if they have the lowest distance of all possible combinations of elements. In your implementation, arbitrary elements are merged (an element being an original PDF element or a group of those elements that are already merged). I'll try to create an example.
Found an example by shuffling the input.
Output of the PR without shuffle:
Example run with shuffle:
I think that merging boxes with low distance first yields a more intuitive text order than using an arbitrary order. Thus, I am against merging this PR in its current state. That being said, I think this code should still be optimised. Alternative speedup:
@mikkkee do you want to work on this?
Thanks @pietermarsman. Nice finding! Apologies, I didn't dig much into the sorted-list part. Let me try to get this done over the weekend.
@pietermarsman On how to monitor the performance: what do you think of adding CI checks on test time?
Yes, I was thinking about that too. That guarantees it does not get much worse. But with travis-ci I expect the run times to vary, based on how many jobs are running and on which hardware they run. And having tests that fail occasionally because of the environment is just annoying.
Hi @pietermarsman, thanks for the advice. I've applied your suggestion. I didn't change the plane part, because a profile shows that the improvement from removing it is small. For performance monitoring, I think the first step can be just to add timing to the tests to observe the stability of the CI environment. No need to set any failure criteria on test running time for now.
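A minimal sketch of that "timing first, no failure criteria" idea (illustrative only, not part of this PR): a decorator that logs each test's wall-clock time without asserting on it.

```python
import functools
import time


def timed(fn):
    """Wrap a function and print how long it took.

    Purely observational: it records timings so the stability of the CI
    environment can be judged first, without making slow runs fail.
    """
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print("{}: {:.3f}s".format(fn.__name__, elapsed))  # logged, not asserted
        return result
    return wrapper
```

Once the timings look stable, thresholds could be added later.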
Your changes look good! But I hope you have some time to polish this function even more, because I think it is not very intuitive.
Could you try improving the docstring of the methods, the variable names and the overall flow of the method? I've made some suggestions about this in the review.
I also added a checklist to the description of the PR. This is the default todo list for all PRs.
I created a new issue for improving the distance function (#322) and possibly removing the `isany` method, so you don't have to worry about that.
pdfminer/layout.py
Outdated
```diff
 removed = set([id1, id2])
 dists = [ ele for ele in dists
           if ((ele[2] not in removed) and (ele[3] not in removed)) ]
 for other in plane:
-    dists.add((0, dist(group, other), group, other))
+    dists.append((0, dist(group, other), id(group), id(other), group, other))
 heapq.heapify(dists)
```
I think this can be even faster by keeping a set of object ids that are already done. Then dists does not need to be rebuilt every time (saving sorting time), new dists can be pushed onto the heap immediately, and after popping a dist off the heap you can check whether the object is already done (fast using a set).
This code is then simplified to:

```python
heapq.heappush(dists, (0, dist(group, other), id(group), id(other), group, other))
```
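The suggested pattern (push new pairs immediately, skip stale entries on pop instead of rebuilding the heap) can be sketched standalone. `merge_closest`, the tuple-based "group", and the toy distance below are illustrative names and stand-ins, not pdfminer's actual API:

```python
import heapq


def merge_closest(items, dist):
    """Repeatedly merge the closest pair of items into a tuple.

    Sketch of the lazy-deletion heap pattern, assuming a symmetric
    `dist` function. Heap entries put id() before the objects so tuple
    comparison breaks ties on the ids and never compares objects that
    lack __lt__.
    """
    heap = []
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            heapq.heappush(heap, (dist(a, b), id(a), id(b), a, b))
    done = set()        # ids of objects already merged away
    live = list(items)  # stand-in for pdfminer's Plane
    while heap and len(live) > 1:
        d, ida, idb, a, b = heapq.heappop(heap)
        if ida in done or idb in done:
            continue    # stale entry: skip it instead of rebuilding the heap
        done.update((ida, idb))
        live.remove(a)
        live.remove(b)
        group = (a, b)  # stand-in for building a real group object
        for other in live:
            heapq.heappush(heap, (dist(group, other), id(group), id(other), group, other))
        live.append(group)
    return live[0] if live else None


# Toy distance on (possibly nested) tuples of numbers, for demonstration.
def flatten(x):
    return [x] if not isinstance(x, tuple) else [v for y in x for v in flatten(y)]


def num_dist(a, b):
    return abs(min(flatten(a)) - min(flatten(b)))
```

With `merge_closest([1.0, 2.0, 10.0], num_dist)`, the closest pair 1.0/2.0 merges first, giving `((1.0, 2.0), 10.0)`, which is the ordering guarantee the shuffle example above is about.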
Nice, it seems it only takes 5 s now (vs 81 s).
Wow, that's really fast!
pdfminer/layout.py
Outdated
```python
# Appending id before obj for tuple comparison as __lt__ is disabled for obj
# by default.
```
Move comments to docstring.
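As a sketch of that suggestion (`push_pair` is a hypothetical helper for illustration, not pdfminer's code), the inline comment becomes part of the docstring:

```python
import heapq


def push_pair(dists, obj1, obj2, d, priority=0):
    """Push a candidate pair of layout objects onto the distance heap.

    Heap entries are (priority, distance, id(obj1), id(obj2), obj1, obj2).
    The ids sit before the objects so tuple comparison always resolves on
    the ids and never falls through to the objects themselves, which do
    not define __lt__.
    """
    heapq.heappush(dists, (priority, d, id(obj1), id(obj2), obj1, obj2))
```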
pdfminer/layout.py
Outdated
```diff
 for other in plane:
-    dists.add((0, dist(group, other), group, other))
+    dists.append((0, dist(group, other), id(group), id(other), group, other))
 heapq.heapify(dists)
 plane.add(group)
 assert len(plane) == 1, str(len(plane))
```
Could you remove this assertion and the one at the start of the function? Having these inline asserts is annoying, because they are not very helpful if something goes wrong.
pdfminer/layout.py
Outdated
```python
# Appending id before obj for tuple comparison as __lt__ is disabled for obj
# by default.
dists.append((0, dist(obj1, obj2), id(obj1), id(obj2), obj1, obj2))
heapq.heapify(dists)
plane = Plane(self.bbox)
plane.extend(boxes)
while dists:
```
Replace `while dists:` by `while len(dists) > 0`.
Indeed, this is a good start. For starters, it would make it easier to check whether a PR changes the execution time. I've created issue #323 for this.
Hi @pietermarsman, please check the diff of the latest commits, which reflect the items in the checklist. I have updated my forked branch to the main repo's develop branch, so many commits got included here. Let me know if you want me to send another PR. Thanks.
Thanks a lot. This looks good. I will take some time tonight to check it properly.
I think the best way is to use an interactive rebase. This allows you to pick the commits of the upstream develop branch first (in order), and then your own commits, and force-push that to your own develop branch. Are you able to do that?
Speedup by: 1/ Using a heap instead of a SortedList and avoiding rebuilding the heap in each iteration. 2/ Avoiding a potentially huge number of variable assignments in list comprehensions. 3/ Avoiding repeatedly evaluating `obj is obj` in list comprehensions by storing `id(obj)`.
Yup, done. Thanks!
I've tested it with the PDF reference (1000 pages). The execution time goes from 2m30s to 2m15s. Not a big difference.
What file are you using to test it? Is there anything special about it? Going from 80s to 5s is a much bigger difference than I am measuring.
The file I used is 01.pdf. It is a 31x21 table. The script I used for testing is:

```python
import time

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams

start = time.time()
with open('01.pdf', "rb") as f:
    parser = PDFParser(f)
    document = PDFDocument(parser)
    laparams = LAParams(
        char_margin=1.0,
        line_margin=0.5,
        word_margin=0.1,
        detect_vertical=True,
        all_texts=True,
    )
    rsrcmgr = PDFResourceManager()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
end = time.time()
print("Time elapsed: {}s".format(end - start))
```
Not sure what exactly makes the file so slow, but it shows up clearly in my test profile.
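For reproducing such a profile, a generic sketch with the standard-library profiler (the `profile_top` helper and the workload are illustrative, not part of pdfminer):

```python
import cProfile
import pstats


def profile_top(fn, limit=10):
    """Run fn under cProfile and print the top cumulative-time entries.

    Useful for locating hotspots like the grouping loop discussed here;
    results and the return value of fn are passed through unchanged.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    result = fn()
    profiler.disable()
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(limit)
    return result
```

Wrapping the page loop from the test script above in such a call would show where the time goes for a given file.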
Wow, on my laptop it goes from 200s to 10s 👍 |
See #302. Plus a small speedup from using a set of removed objects in the loops.