page.search("text", regex = True) is magnitudes slower in 0.10.4 compared to 0.10.3. #1097
Comments
Hi @mikejokic. Thank you for flagging. Unfortunately, I can't seem to reproduce your findings. If anything, it runs slightly faster on

    import pdfplumber
    import time
    import sys

    start = time.time()
    with pdfplumber.open(sys.stdin.buffer) as pdf:
        for page in pdf.pages:
            results = page.search(r'word', regex=True, return_chars=False)
            if hasattr(page, "close"):
                page.close()
            else:
                page.flush_cache()
    end = time.time() - start
    print(round(end, 3))

And then
If you run the same, what do you see?
Thanks for the reply, @jsvine. I ran your code block in Docker and found similar results to yours. However, I have been able to reproduce my issue with the provided PDF. Here is code I have been able to run in Docker, changing only the pdfplumber version number. I look for a set of relevant keywords/regex patterns (repeated keywords for simplicity) and then take the surrounding line info as well. 0.10.3 runs in around 30-36 seconds, and 0.10.4 takes around 90-96 seconds.
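For readers who want to try the multi-pattern scenario described above, here is a minimal sketch of such a timing loop. It is not the code from the original comment: the helper names `time_searches` and `benchmark_pdf` are made up for illustration, and the only pdfplumber calls it relies on are `pdfplumber.open`, `pdf.pages`, and `page.search(..., regex=True, return_chars=False)` as used elsewhere in this thread.

```python
import time


def time_searches(pages, patterns):
    """Run every pattern against every page.

    Returns (elapsed_seconds, total_match_count). `pages` can be any
    iterable of objects exposing a pdfplumber-style `.search()` method.
    """
    start = time.perf_counter()
    match_count = 0
    for page in pages:
        for pattern in patterns:
            # Many searches per page: this is the case where an uncached
            # layout would be recomputed over and over.
            matches = page.search(pattern, regex=True, return_chars=False)
            match_count += len(matches)
    return time.perf_counter() - start, match_count


def benchmark_pdf(path, patterns):
    """Hypothetical convenience wrapper: time the searches on a real PDF."""
    import pdfplumber  # imported lazily; only needed for real PDFs

    with pdfplumber.open(path) as pdf:
        return time_searches(pdf.pages, patterns)
```

Running `benchmark_pdf("documentation.pdf", [r"word"] * 50)` under each pdfplumber version should make the difference visible if it exists; the pattern list and repeat count here are placeholders, not the values from the original report.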
Layout wasn't actually getting cached, now it is.
Big thanks, @mikejokic, that extra detail about looping through a bunch of Turns out 0bfffc2 introduced a bug in which the page layout calculations (necessary for
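To illustrate why an uncached layout hurts most when many patterns are searched on the same page, here is a toy sketch. `SketchPage` is a hypothetical stand-in and does not reflect pdfplumber's actual internals; the "layout" is just a split of the text, and the `cache_layout` flag exists only to contrast the two behaviors.

```python
import re


class SketchPage:
    """Hypothetical stand-in for a page; not pdfplumber's real class."""

    def __init__(self, text, cache_layout=True):
        self._text = text
        self._cache_layout = cache_layout
        self._layout = None
        self.layout_builds = 0  # how many times the expensive step ran

    def _build_layout(self):
        # Placeholder for an expensive layout calculation.
        self.layout_builds += 1
        return self._text.split()

    def get_layout(self):
        if self._cache_layout:
            # Build once, then reuse on every later call.
            if self._layout is None:
                self._layout = self._build_layout()
            return self._layout
        # Uncached: the layout is rebuilt on every call.
        return self._build_layout()

    def search(self, pattern):
        return [word for word in self.get_layout() if re.search(pattern, word)]


patterns = [r"alpha", r"beta", r"gamma"]

cached = SketchPage("alpha beta gamma", cache_layout=True)
uncached = SketchPage("alpha beta gamma", cache_layout=False)
for pattern in patterns:
    cached.search(pattern)
    uncached.search(pattern)

print(cached.layout_builds, uncached.layout_builds)
```

With a single search per page the two variants do the same amount of work, which would be consistent with the one-pattern benchmark earlier in the thread not showing a slowdown.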
Thanks @jsvine. Out of curiosity, does .search() run .extract_text() on each run, or is the text also cached?
Describe the bug
I have a pipeline to extract text and find relevant keywords from PDFs. After upgrading to the latest release, my code has slowed down 5x-10x.
Have you tried repairing the PDF?
I have repaired the PDF with Ghostscript.
Code to reproduce the problem
    import time
    import pdfplumber

    # pdfFile is the path to the PDF under test
    with pdfplumber.open(pdfFile) as pdf:
        for page in pdf.pages:
            start = time.time()
            results = page.search(r'word', regex=True, return_chars=False)
            end = time.time() - start
            print(end)
            if hasattr(page, "close"):
                page.close()  # in 0.10.4
            else:
                page.flush_cache()  # in 0.10.3
PDF file
Reproduced with any PDF
documentation.pdf
Environment
Any help, @jsvine?