extract_words() slower when fewer extra_attrs are passed #483
-
Thanks for sharing. This is very interesting, and indeed surprising. For ease of investigation, can you share the relevant PDF / page?
-
Unfortunately, I cannot upload the document I'm using due to a privacy issue, but I will upload a sample PDF with the same problem. I will also include a Colab notebook that reproduces the issue with the steps I executed.
-
Is there any explanation for this?
-
The idea is that I'm trying to find bold and blank sections in a PDF file, so I was experimenting with the extract_words() function to group sections based on the font family. I found a way to extract bold text by grouping sections by font name and size and then picking out the bold font family, and with a similar approach I grouped sections to find the blanks between them (see the sketch below).
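Here is a minimal sketch of that grouping approach. "sample.pdf" is a placeholder path and is_bold() is a hypothetical helper I wrote for illustration, not part of pdfplumber:

```python
import pdfplumber
from itertools import groupby

def is_bold(fontname):
    # Hypothetical helper: many PDFs encode weight in the font name,
    # e.g. "ABCDEF+Arial-BoldMT", so a substring check is often enough.
    return "Bold" in fontname

# "sample.pdf" is a placeholder, not the original document.
with pdfplumber.open("sample.pdf") as pdf:
    page = pdf.pages[0]
    words = page.extract_words(keep_blank_chars=True,
                               extra_attrs=["fontname", "size"])
    # Group consecutive words that share the same font name and size,
    # then keep only the groups whose font name looks bold.
    for (fontname, size), group in groupby(words, key=lambda w: (w["fontname"], w["size"])):
        if is_bold(fontname):
            print(size, " ".join(w["text"] for w in group))
```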
But the issue I faced is a big gap in performance between the 2 methods:
Using extra_attrs=["fontname", "size"]:
sections = page.extract_words(keep_blank_chars=True, extra_attrs=["fontname", "size"])
Average line execution time: 0.5 s

Using extra_attrs=["size"]:
sections = page.extract_words(keep_blank_chars=True, extra_attrs=["size"])
Average line execution time: 5.2 s
Both statements are run against the same page.
Also, I noticed that adding more attributes that make the response larger, such as "adv", reduces the execution time even further, to about 22.7 ms per page.
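For reference, this is roughly how I measured the two calls. It is only a rough timing sketch, again using a placeholder "sample.pdf":

```python
import time
import pdfplumber

def time_extract(page, extra_attrs, repeats=5):
    # Average wall-clock time of extract_words() over a few repeats.
    start = time.perf_counter()
    for _ in range(repeats):
        page.extract_words(keep_blank_chars=True, extra_attrs=extra_attrs)
    return (time.perf_counter() - start) / repeats

# "sample.pdf" is a placeholder, not the original document.
with pdfplumber.open("sample.pdf") as pdf:
    page = pdf.pages[0]
    print("fontname + size:", time_extract(page, ["fontname", "size"]))
    print("size only:      ", time_extract(page, ["size"]))
    print("adv + fontname + size:", time_extract(page, ["adv", "fontname", "size"]))
```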
Why does an extract_words() call with more filters outperform the call with fewer filters? And is there any way to improve the speed of the second statement?