page.search("text", regex = True) is magnitudes slower in 0.10.4 compared to 0.10.3. #1097

Closed
mikejokic opened this issue Feb 21, 2024 · 5 comments
mikejokic commented Feb 21, 2024

Describe the bug

I have a pipeline that extracts text and finds relevant keywords in PDFs. After upgrading to the latest release, my code has slowed down 5x-10x.

Have you tried repairing the PDF?

I have repaired the PDF with Ghostscript.

Code to reproduce the problem

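import time

import pdfplumber

pdfFile = "documentation.pdf"  # the attached PDF; any PDF reproduces it

with pdfplumber.open(pdfFile) as pdf:
    for page in pdf.pages:
        # Time a single regex search on this page.
        start = time.time()
        results = page.search(r'word', regex=True, return_chars=False)
        end = time.time() - start
        print(end)
        page.close()  # 0.10.4; on 0.10.3 use page.flush_cache() instead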

PDF file

Reproduced with any PDF.

documentation.pdf

Environment

  • pdfplumber version: 0.10.4 (slow) vs 0.10.3 (fast)
  • Python version: [e.g., 3.11]

Any help, @jsvine?

@mikejokic mikejokic added the bug label Feb 21, 2024
Owner

jsvine commented Mar 2, 2024

Hi @mikejokic. Thank you for flagging. Unfortunately, I can't seem to reproduce your findings. If anything, it runs slightly faster on 0.10.4 than 0.10.3 for me. Here's the exact code I'm running:

import pdfplumber
import time
import sys

start = time.time()
with pdfplumber.open(sys.stdin.buffer) as pdf:
    for page in pdf.pages:
        results = page.search(r'word', regex=True, return_chars=False)
        # page.close() exists only in newer versions; fall back otherwise.
        if hasattr(page, "close"):
            page.close()
        else:
            page.flush_cache()
end = time.time() - start
print(round(end, 3))

And then python test.py < documentation.pdf. On 0.10.3, I'm seeing times of around 7.9 seconds; on 0.10.4, I'm seeing closer to 7.6 seconds.

If you run the same, what do you see?

Author

mikejokic commented Mar 2, 2024

Thanks for the reply @jsvine. I ran your code block in Docker and I found similar results to yours. But I have been able to reproduce my issue with the provided pdf.

Here is code I have been able to run in Docker changing just the pdfplumber version number.

I look for a set of relevant keywords/regex patterns (the keywords are repeated here for simplicity) and also capture the surrounding line. 0.10.3 runs in around 30-36 seconds, and 0.10.4 takes around 90-96 seconds.

keywords = ['capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS''capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 
'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS']


import time
import pdfplumber
start = time.time()
with pdfplumber.open('documentation.pdf') as pdf:
    for page in pdf.pages:
        print(page, flush=True)
        # Many searches against the same page: this is what exposes the slowdown.
        for key in keywords:
            results = page.search(r'.*\b' + key + r'\b.*', regex=True, case=False, return_chars=False)
        if hasattr(page, "close"):
            page.close()
        else:
            page.flush_cache()
end = time.time() - start
print(round(end, 3))

jsvine added a commit that referenced this issue Mar 3, 2024
Layout wasn't actually getting cached, now it is.
Owner

jsvine commented Mar 3, 2024

Big thanks, @mikejokic — that extra detail about looping through a bunch of .search(...) calls per page helped me (a) reproduce your observation, (b) figure out what the problem was, and (c) fix it.

Turns out 0bfffc2 introduced a bug in which the page layout calculations (necessary for .search(...)) were no longer getting cached, so every .search(...) call recomputed them. The fix in efca277 resolves that, restoring the prior performance. It is now available on the develop branch and will be in the next release.
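
To illustrate why a missing layout cache only shows up when many searches hit the same page, here is a minimal stand-alone sketch. It is not pdfplumber's code: ToyPage, _build_layout, use_cache, and layout_builds are hypothetical names, and the "layout" is just the page's lines joined together.

```python
import re


class ToyPage:
    """Hypothetical stand-in for a page; not pdfplumber's implementation."""

    def __init__(self, lines, use_cache):
        self.lines = lines
        self.use_cache = use_cache
        self.layout_builds = 0  # how many times the expensive step ran
        self._layout = None

    def _build_layout(self):
        # Stand-in for the expensive page layout calculation.
        self.layout_builds += 1
        return "\n".join(self.lines)

    def _get_layout(self):
        if not self.use_cache:
            return self._build_layout()  # rebuilt on every search
        if self._layout is None:
            self._layout = self._build_layout()  # built once, then reused
        return self._layout

    def search(self, pattern):
        return [m.group(0) for m in re.finditer(pattern, self._get_layout())]


keywords = ["assessment", "Affidavit", "subclass"] * 50  # 150 searches
lines = ["An assessment was filed.", "The Affidavit names a subclass."]

uncached = ToyPage(lines, use_cache=False)
cached = ToyPage(lines, use_cache=True)
for key in keywords:
    uncached.search(r".*\b" + key + r"\b.*")
    cached.search(r".*\b" + key + r"\b.*")

print(uncached.layout_builds, cached.layout_builds)
```

With a single search per page the two variants do the same amount of work, which is consistent with the one-search benchmark showing no difference between versions.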

@jsvine jsvine closed this as completed Mar 3, 2024
@mikejokic
Author

Thanks @jsvine. Out of curiosity, does .search() run .extract_text() on each run or is the text also cached?

Owner

jsvine commented Mar 4, 2024

.search(...) uses the text-layout cache, which is based on the layout-dependent parameters you pass. E.g., if you run page.search("q1", x_tolerance=5) and page.search("q2", x_tolerance=5), then the .extract_text(...) is only run once, on the first search; but if you then call page.search("q2", x_tolerance=10), then .extract_text(...) is called again.
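
The behavior described above can be sketched as a cache keyed by the layout-dependent keyword arguments. This is only an illustration under assumptions: ToyCachedPage, _extract_text, and extract_calls are hypothetical names, and pdfplumber's real cache key and storage may differ.

```python
class ToyCachedPage:
    """Hypothetical sketch of a text-layout cache; not pdfplumber's code."""

    def __init__(self, words):
        self.words = words
        self.extract_calls = 0  # how many times text extraction ran
        self._layout_cache = {}

    def _extract_text(self, **layout_kwargs):
        # Stand-in for the expensive, parameter-dependent text extraction.
        self.extract_calls += 1
        return " ".join(self.words)

    def search(self, query, **layout_kwargs):
        # The cache key is built from the layout parameters only,
        # so the query string does not affect whether the cache is hit.
        key = tuple(sorted(layout_kwargs.items()))
        if key not in self._layout_cache:
            self._layout_cache[key] = self._extract_text(**layout_kwargs)
        return query in self._layout_cache[key]


page = ToyCachedPage(["q1", "q2"])
page.search("q1", x_tolerance=5)   # first search: extraction runs
page.search("q2", x_tolerance=5)   # same parameters: cache hit
calls_after_same_params = page.extract_calls
page.search("q2", x_tolerance=10)  # new parameters: extraction runs again
calls_after_new_params = page.extract_calls

print(calls_after_same_params, calls_after_new_params)
```

In this sketch, keeping the layout parameters identical across searches on a page is what keeps every search after the first cheap.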
