page.search("text", regex = True) is magnitudes slower in 0.10.4 compared to 0.10.3. #1097

mikejokic · 2024-02-21T16:39:00Z

Describe the bug

I have a pipeline to extract text and find relevant keywords from PDFs. After upgrading to the latest release, my code has slowed down 5x-10x.

Have you tried repairing the PDF?

I have repaired the PDF with Ghostscript.

Code to reproduce the problem

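import time
import pdfplumber

pdfFile = "documentation.pdf"  # placeholder path; any PDF reproduces the issue

with pdfplumber.open(pdfFile) as pdf:
    for page in pdf.pages:
        start = time.time()
        results = page.search(r'word', regex=True, return_chars=False)
        end = time.time() - start
        print(end)  # seconds spent searching this page
        # Use whichever cleanup call matches the installed version:
        page.close()  # in v0.10.4
        # page.flush_cache()  # in v0.10.3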

PDF file

Reproduced with any pdf

documentation.pdf

Environment

pdfplumber version: 0.10.4 (slow) vs 0.10.3 (fast)
Python version: [e.g., 3.11]

Any help @jsvine ?


jsvine · 2024-03-02T15:02:36Z

Hi @mikejokic. Thank you for flagging. Unfortunately, I can't seem to reproduce your findings. If anything, it runs slightly faster on 0.10.4 than 0.10.3 for me. Here's the exact code I'm running:

import pdfplumber
import time
import sys

start = time.time()
with pdfplumber.open(sys.stdin.buffer) as pdf:
    for page in pdf.pages:
        results = page.search(r'word', regex=True, return_chars=False)
        if hasattr(page, "close"):
            page.close()
        else:
            page.flush_cache()
end = time.time() - start
print(round(end, 3))

And then python test.py < documentation.pdf. On 0.10.3, I'm seeing times of around 7.9 seconds; on 0.10.4, I'm seeing closer to 7.6 seconds.

If you run the same, what do you see?
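A single time.time() delta can be noisy when comparing two versions. The sketch below shows one common alternative: run the workload several times with time.perf_counter() and keep the smallest duration. The names best_of and sample_workload are hypothetical and used only for illustration, and the workload is a placeholder for the page-search loop above.

```python
import time


def best_of(fn, repeats=3):
    """Return the smallest wall-clock duration (seconds) over `repeats` runs of fn."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        durations.append(time.perf_counter() - start)
    return min(durations)


def sample_workload():
    # Placeholder workload; swap in the page-search loop when benchmarking.
    sum(i * i for i in range(10_000))


elapsed = best_of(sample_workload, repeats=3)
print(round(elapsed, 6))
```

Taking the minimum rather than the mean tends to filter out one-off slowdowns caused by other processes, which makes a small difference between versions easier to judge.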

mikejokic · 2024-03-02T23:22:42Z

Thanks for the reply, @jsvine. I ran your code block in Docker and found results similar to yours. However, I have been able to reproduce my issue with the provided PDF.

Here is the code I ran in Docker, changing only the pdfplumber version number.

I look for a set of relevant keywords/regex patterns (the keywords are repeated here for simplicity) and then capture the surrounding line as well. 0.10.3 runs in around 30-36 seconds, and 0.10.4 takes around 90-96 seconds.

# The same eight keywords repeated (about 200 entries in total) to simulate a long keyword list.
base_keywords = ['capabilities', 'BHRS', 'Intervention', 'assessment', 'IMD', 'Affidavit', 'subclass', 'IHBS']
keywords = base_keywords * 26


import time
import pdfplumber

start = time.time()
with pdfplumber.open('documentation.pdf') as pdf:
    for page in pdf.pages:
        print(page, flush=True)
        for key in keywords:
            results = page.search(r'.*\b' + key + r'\b.*', regex=True, case=False, return_chars=False)
        if hasattr(page, "close"):
            page.close()
        else:
            page.flush_cache()
end = time.time() - start
print(round(end, 3))


jsvine · 2024-03-03T00:21:10Z

Big thanks, @mikejokic — that extra detail about looping through a bunch of .search(...) calls per page helped me (a) reproduce your observation, (b) figure out what the problem was, and (c) fix it.

Turns out 0bfffc2 introduced a bug in which the page layout calculations (necessary for .search(...)) were no longer getting cached. The fix in efca277 resolves that, restoring the prior speed/performance. Now available on the develop branch and will be in the next release.

mikejokic · 2024-03-03T03:42:25Z

Thanks @jsvine. Out of curiosity, does .search() run .extract_text() on each run or is the text also cached?

jsvine · 2024-03-04T19:53:04Z

.search(...) uses the text-layout cache, which is based on the layout-dependent parameters you pass. E.g., if you run page.search("q1", x_tolerance=5) and page.search("q2", x_tolerance=5), then the .extract_text(...) is only run once, on the first search; but if you then call page.search("q2", x_tolerance=10), then .extract_text(...) is called again.
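The caching behavior described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not pdfplumber's actual implementation: ToyPage, _layout_cache, _extract_text, and extract_calls are hypothetical names, and the "layout" step is a stand-in that just returns a fixed string.

```python
import re


class ToyPage:
    """Toy stand-in for a page; NOT pdfplumber's real implementation."""

    def __init__(self, text):
        self._text = text
        self._layout_cache = {}  # maps layout parameters -> extracted text
        self.extract_calls = 0   # counts how often the expensive step runs

    def _extract_text(self, **layout_params):
        # Placeholder for the expensive layout computation.
        self.extract_calls += 1
        return self._text

    def search(self, pattern, **layout_params):
        # Build a hashable key from the layout-dependent parameters only;
        # the search pattern itself is deliberately not part of the key.
        key = tuple(sorted(layout_params.items()))
        if key not in self._layout_cache:
            self._layout_cache[key] = self._extract_text(**layout_params)
        return re.findall(pattern, self._layout_cache[key])


page = ToyPage("q1 and q2 appear here")
page.search("q1", x_tolerance=5)
page.search("q2", x_tolerance=5)   # same layout parameters: cache hit
calls_after_same_params = page.extract_calls
page.search("q2", x_tolerance=10)  # different layout parameters: cache miss
calls_after_new_params = page.extract_calls
```

In this model, a bug that skipped the store into _layout_cache would make every search call redo the extraction. That matches the slowdown reported above when many keywords are searched on each page.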

mikejokic added the bug label Feb 21, 2024

jsvine added a commit that referenced this issue Mar 3, 2024

Fix layout-caching issue (#1097) caused by 0bfffc2

efca277

Layout wasn't actually getting cached, now it is.

jsvine closed this as completed Mar 3, 2024
