extract_words() slower when fewer extra_attrs are passed #483
-
Thanks for sharing. This is very interesting, and indeed surprising. For ease of investigation, can you share the relevant PDF / page?
-
Unfortunately, I cannot upload the document I'm using due to a privacy issue, but I will upload a sample PDF with the same problem. I will also include a Colab notebook that reproduces the issue with the steps I executed.
-
Is there any explanation for this?
-
The idea is that I'm trying to find bold and blank sections in a PDF file, so I was experimenting with the extract_words() function to group sections based on the font family. I found a way to extract bold text by grouping sections by font name and size and then picking out the bold font family, and with a similar approach I grouped sections to find the blanks between them (see the sketch below).
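Here is a minimal sketch of that grouping approach. "sample.pdf" is a placeholder path and is_bold() is a hypothetical helper I wrote for illustration, not part of pdfplumber:

```python
import pdfplumber
from itertools import groupby

def is_bold(fontname):
    # Hypothetical helper: many PDFs encode weight in the font name,
    # e.g. "ABCDEF+Arial-BoldMT", so a substring check is often enough.
    return "Bold" in fontname

# "sample.pdf" is a placeholder, not the original document.
with pdfplumber.open("sample.pdf") as pdf:
    page = pdf.pages[0]
    words = page.extract_words(keep_blank_chars=True,
                               extra_attrs=["fontname", "size"])
    # Group consecutive words that share the same font name and size,
    # then keep only the groups whose font name looks bold.
    for (fontname, size), group in groupby(words, key=lambda w: (w["fontname"], w["size"])):
        if is_bold(fontname):
            print(size, " ".join(w["text"] for w in group))
```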
But the issue I faced is a big gap in performance between the 2 methods:
Using extra_attrs=["fontname", "size"]:
sections = page.extract_words(keep_blank_chars=True, extra_attrs=["fontname", "size"])
Average line execution time: 0.5 s

Using extra_attrs=["size"]:
sections = page.extract_words(keep_blank_chars=True, extra_attrs=["size"])
Average line execution time: 5.2 s
Both statements are run against the same page.
Also, I noticed that adding more attributes that make the response larger, such as "adv", reduces the execution time even further, to about 22.7 ms per page.
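For reference, this is roughly how I measured the two calls. It is only a rough timing sketch, again using a placeholder "sample.pdf":

```python
import time
import pdfplumber

def time_extract(page, extra_attrs, repeats=5):
    # Average wall-clock time of extract_words() over a few repeats.
    start = time.perf_counter()
    for _ in range(repeats):
        page.extract_words(keep_blank_chars=True, extra_attrs=extra_attrs)
    return (time.perf_counter() - start) / repeats

# "sample.pdf" is a placeholder, not the original document.
with pdfplumber.open("sample.pdf") as pdf:
    page = pdf.pages[0]
    print("fontname + size:", time_extract(page, ["fontname", "size"]))
    print("size only:      ", time_extract(page, ["size"]))
    print("adv + fontname + size:", time_extract(page, ["adv", "fontname", "size"]))
```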
Why does an extract_words() call with more filters outperform the call with fewer filters? And is there any way to improve the speed of the second statement?