Layout speedup rebased #315
Conversation
Thanks for rebasing! I have a couple of questions that I'd like answered before merging this:
I can confirm that it is faster. I've tried a PDF with 48 pages and it goes from 69 seconds to 36.
I have my doubts about removing the ordering from the list. I think the order is important because it determines which page elements are merged first. In the original implementation, two elements are merged if they have the lowest distance of all possible combinations of elements. In your implementation, arbitrary elements are merged (an element being an original PDF element or a group of those elements that are already merged). I'll try to create an example.
Found an example by shuffling the input.
Output of the PR without shuffle:
Example run with shuffle:
I think that merging boxes with low distance first yields a more intuitive text order than using an arbitrary order. Thus, I am against merging this PR in its current state. That being said, I think this code should still be optimised. Alternative speedup:
@mikkkee do you want to work on this?
Thanks @pietermarsman. Nice finding! Apologies, I didn't dig much into the sorted-list part. Let me try to get this done over the weekend.
@pietermarsman On how to monitor the performance: what do you think of adding CI checks on test time?
Yes, I was thinking about that too. That guarantees it does not get much worse. But with travis-ci I expect the run times to vary, based on how many jobs are running and on which hardware they run. And having tests that fail occasionally because of the environment is just annoying.
Hi @pietermarsman, thanks for the advice. I've applied your suggestion. I didn't change the plane part, because a profile shows that the improvement from removing it is small. For performance monitoring, I think the first step can be just to add timing to the tests to observe the stability of the CI environment. No need to set any failure criteria on test running time for now.
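A minimal sketch of that "timing first, no failure criteria" idea (illustrative only, not part of this PR): a decorator that logs each test's wall-clock time without asserting on it.

```python
import functools
import time


def timed(fn):
    """Wrap a function and print how long it took.

    Purely observational: it records timings so the stability of the CI
    environment can be judged first, without making slow runs fail.
    """
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print("{}: {:.3f}s".format(fn.__name__, elapsed))  # logged, not asserted
        return result
    return wrapper
```

Once the timings look stable, thresholds could be added later.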
Your changes look good! But I hope you have some time to polish this function even more, because I think it is not very intuitive.
Could you try improving the docstring of the methods, the variable names and the overall flow of the method? I've made some suggestions about this in the review.
I also added a checklist to the description of the PR. This is the default todo list for all PRs.
I created a new issue for improving the distance function (#322) and possibly removing the `isany` method, so you don't have to worry about that.
pdfminer/layout.py
Outdated
```diff
 removed = set([id1, id2])
 dists = [ ele for ele in dists
           if ((ele[2] not in removed) and (ele[3] not in removed)) ]
 for other in plane:
-    dists.add((0, dist(group, other), group, other))
+    dists.append((0, dist(group, other), id(group), id(other), group, other))
 heapq.heapify(dists)
```
I think this can be even faster by keeping a set of object ids that are already done. Then dists does not need to be rebuilt every time (saving sorting time), new dists can be pushed onto the heap immediately, and after popping a dist off the heap you can check whether the object is already done (fast using a set).
This code is then simplified to:

```python
heapq.heappush(dists, (0, dist(group, other), id(group), id(other), group, other))
```
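The suggested pattern (push new pairs immediately, skip stale entries on pop instead of rebuilding the heap) can be sketched standalone. `merge_closest`, the tuple-based "group", and the toy distance below are illustrative names and stand-ins, not pdfminer's actual API:

```python
import heapq


def merge_closest(items, dist):
    """Repeatedly merge the closest pair of items into a tuple.

    Sketch of the lazy-deletion heap pattern, assuming a symmetric
    `dist` function. Heap entries put id() before the objects so tuple
    comparison breaks ties on the ids and never compares objects that
    lack __lt__.
    """
    heap = []
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            heapq.heappush(heap, (dist(a, b), id(a), id(b), a, b))
    done = set()        # ids of objects already merged away
    live = list(items)  # stand-in for pdfminer's Plane
    while heap and len(live) > 1:
        d, ida, idb, a, b = heapq.heappop(heap)
        if ida in done or idb in done:
            continue    # stale entry: skip it instead of rebuilding the heap
        done.update((ida, idb))
        live.remove(a)
        live.remove(b)
        group = (a, b)  # stand-in for building a real group object
        for other in live:
            heapq.heappush(heap, (dist(group, other), id(group), id(other), group, other))
        live.append(group)
    return live[0] if live else None


# Toy distance on (possibly nested) tuples of numbers, for demonstration.
def flatten(x):
    return [x] if not isinstance(x, tuple) else [v for y in x for v in flatten(y)]


def num_dist(a, b):
    return abs(min(flatten(a)) - min(flatten(b)))
```

With `merge_closest([1.0, 2.0, 10.0], num_dist)`, the closest pair 1.0/2.0 merges first, giving `((1.0, 2.0), 10.0)`, which is the ordering guarantee the shuffle example above is about.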
Nice, it seems it only takes 5 s now (vs 81 s).
Wow, that's really fast!
pdfminer/layout.py
Outdated
```python
# Appending id before obj for tuple comparison as __lt__ is disabled for obj
# by default.
```
Move comments to docstring.
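As a sketch of that suggestion (`push_pair` is a hypothetical helper for illustration, not pdfminer's code), the inline comment becomes part of the docstring:

```python
import heapq


def push_pair(dists, obj1, obj2, d, priority=0):
    """Push a candidate pair of layout objects onto the distance heap.

    Heap entries are (priority, distance, id(obj1), id(obj2), obj1, obj2).
    The ids sit before the objects so tuple comparison always resolves on
    the ids and never falls through to the objects themselves, which do
    not define __lt__.
    """
    heapq.heappush(dists, (priority, d, id(obj1), id(obj2), obj1, obj2))
```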
pdfminer/layout.py
Outdated
```diff
 for other in plane:
-    dists.add((0, dist(group, other), group, other))
+    dists.append((0, dist(group, other), id(group), id(other), group, other))
 heapq.heapify(dists)
 plane.add(group)
 assert len(plane) == 1, str(len(plane))
```
Could you remove this assertion and the one at the start of the function? Having these inline asserts is annoying, because they are not very helpful if something goes wrong.
pdfminer/layout.py
Outdated
```python
# Appending id before obj for tuple comparison as __lt__ is disabled for obj
# by default.
dists.append((0, dist(obj1, obj2), id(obj1), id(obj2), obj1, obj2))
heapq.heapify(dists)
plane = Plane(self.bbox)
plane.extend(boxes)
while dists:
```
Replace `while dists:` by `while len(dists) > 0`.
Indeed, this is a good start. For starters, it would make it easier to check whether a PR changes the execution time. I've created issue #323 for this.
Hi @pietermarsman, please check the diff of the latest commits, which reflect the items in the checklist. I have updated my forked branch to the main repo's develop branch, so many commits got included here. Let me know if you want me to send another PR. Thanks.
Thanks a lot. This looks good. I will take some time tonight to check it properly.
I think the best way is to use an interactive rebase. This allows you to pick the commits of the upstream develop branch first (in order), and then your own commits, and force-push that to your own develop branch. Are you able to do that?
Speedup by: 1/ Using a heap instead of a SortedList and avoiding rebuilding the heap in each iteration. 2/ Avoiding a potentially huge number of variable assignments in list comprehensions. 3/ Avoiding repeatedly evaluating `obj is obj` in list comprehensions by storing `id(obj)`.
Yup, done. Thanks!
I've tested it with the PDF reference (1000 pages). The execution time goes from 2m30s to 2m15s. Not a big difference.
What file are you using to test it? Is there anything special about it? Going from 80s to 5s is a much bigger difference than I am measuring.
The file I used is 01.pdf. It is a 31x21 table. The script I used for testing is:

```python
import time

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams

start = time.time()
with open('01.pdf', "rb") as f:
    parser = PDFParser(f)
    document = PDFDocument(parser)
    laparams = LAParams(
        char_margin=1.0,
        line_margin=0.5,
        word_margin=0.1,
        detect_vertical=True,
        all_texts=True,
    )
    rsrcmgr = PDFResourceManager()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
end = time.time()
print("Time elapsed: {}s".format(end - start))
```
Not sure what exactly makes the file so slow, but it shows up clearly in my test profile.
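For reproducing such a profile, a generic sketch with the standard-library profiler (the `profile_top` helper and the workload are illustrative, not part of pdfminer):

```python
import cProfile
import pstats


def profile_top(fn, limit=10):
    """Run fn under cProfile and print the top cumulative-time entries.

    Useful for locating hotspots like the grouping loop discussed here;
    results and the return value of fn are passed through unchanged.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    result = fn()
    profiler.disable()
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(limit)
    return result
```

Wrapping the page loop from the test script above in such a call would show where the time goes for a given file.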
Wow, on my laptop it goes from 200s to 10s 👍 |
See #302. Plus a small speedup from using a set of removed objects in the loops.