[Bug] improve performance of pdf2parquet #573

Open
1 of 2 tasks
sujee opened this issue Sep 5, 2024 · 8 comments
Labels: enhancement (New feature or request), high priority

Comments

@sujee
Contributor

sujee commented Sep 5, 2024

Search before asking

  • I searched the issues and found no similar issues.

Component

Other

What happened + What you expected to happen

Extracting text from PDFs into Parquet seems slow. It processes about 1 page per second, so a 300-page PDF takes 300 seconds (5 minutes).
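
To make the arithmetic above concrete, here is a minimal sketch that turns the reported throughput into an expected wall-clock time. The helper name `estimated_seconds` is hypothetical and only illustrates the calculation; the 1 page per second default is the rate observed in this report, not a measured constant of the tool.

```python
def estimated_seconds(num_pages: int, pages_per_second: float = 1.0) -> float:
    """Estimate extraction time from a page count and an observed throughput.

    The default of 1 page/second is the rate reported in this issue
    (an observation, not a guarantee).
    """
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return num_pages / pages_per_second


# The test PDFs are about 100 pages each; the example in the report uses 300.
for pages in (100, 300):
    secs = estimated_seconds(pages)
    print(f"{pages} pages -> {secs:.0f} s ({secs / 60:.1f} min)")
```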

This negatively affects the user experience, as PDF2PQ is usually one of the first few steps in many workflows.

Reproduction script

Data: https://github.com/sujee/data-prep-kit/tree/perf-1-pdf2pq/test/perf-pdf2pq/input
(These PDFs are about 100 pages each.)

Instructions and minimal code to reproduce the problem: https://github.com/sujee/data-prep-kit/tree/perf-1-pdf2pq/test/perf-pdf2pq

Instructions (README.md): https://github.com/sujee/data-prep-kit/blob/perf-1-pdf2pq/test/perf-pdf2pq/README.md

A py-spy generated speedscope file is attached. It can be viewed at https://www.speedscope.app/

test_pdf2pq_py.speed.txt
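
For anyone who wants to capture a comparable profile, the sketch below builds a `py-spy record` command that writes speedscope-format output. The script name `test_pdf2pq.py` and the output file name are assumptions chosen for illustration; the exact command used to produce the attached file is not stated in this issue.

```python
import shlex


def py_spy_command(script: str, output: str = "profile.speedscope.json") -> list[str]:
    """Build a py-spy invocation that records a speedscope-format profile.

    `script` and `output` are placeholders; substitute your own paths.
    """
    return [
        "py-spy", "record",
        "--format", "speedscope",  # output that https://www.speedscope.app/ can load
        "--output", output,
        "--", "python", script,
    ]


cmd = py_spy_command("test_pdf2pq.py")  # hypothetical script name
print(shlex.join(cmd))
# To actually run it (requires py-spy to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

The resulting file can then be loaded at https://www.speedscope.app/.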

Anything else

No response

OS

Ubuntu

Python

3.11.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@sujee sujee added the bug Something isn't working label Sep 5, 2024
@daw3rd daw3rd assigned daw3rd and dolfim-ibm and unassigned daw3rd Sep 12, 2024
@daw3rd daw3rd added enhancement New feature or request and removed bug Something isn't working labels Sep 12, 2024
@Bytes-Explorer
Collaborator

Bytes-Explorer commented Oct 28, 2024

@dolfim-ibm Can you share any updates on this pls?
cc @touma-I

@dolfim-ibm
Member

Soon we will update DPK to use the new Docling v2. Along with the new features (support for docx, html, pptx, etc.), it includes a new parser which is about 10x faster. See https://github.com/DS4SD/docling-parse/?tab=readme-ov-file#performance-benchmarks. Note that this is not the speedup of the full pipeline, only of one of its important pieces.
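
To illustrate why a 10x faster parser does not mean a 10x faster pipeline, here is a rough sketch using Amdahl's law. The share of total time spent in parsing is not given in this thread, so the fractions below are purely hypothetical values, and `overall_speedup` is an illustrative helper rather than anything from DPK or Docling.

```python
def overall_speedup(parse_fraction: float, parse_speedup: float = 10.0) -> float:
    """Amdahl's law: pipeline speedup when only the parsing stage gets faster.

    parse_fraction is the share of total runtime spent parsing (an assumption
    here; it would need to be measured). parse_speedup defaults to the
    roughly 10x figure quoted for the new parser.
    """
    if not 0.0 <= parse_fraction <= 1.0:
        raise ValueError("parse_fraction must be between 0 and 1")
    if parse_speedup <= 0:
        raise ValueError("parse_speedup must be positive")
    return 1.0 / ((1.0 - parse_fraction) + parse_fraction / parse_speedup)


# Hypothetical parsing shares, for illustration only.
for fraction in (0.2, 0.5, 0.8):
    print(f"parse share {fraction:.0%}: pipeline {overall_speedup(fraction):.2f}x faster")
```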

In the medium term, we are running extensive benchmarks to characterize the timing and compare it with other tools.

@Bytes-Explorer
Collaborator

Thank you @dolfim-ibm When do you expect to integrate this change?

@dolfim-ibm
Member

> Thank you @dolfim-ibm When do you expect to integrate this change?

Should be doable this week.

@Bytes-Explorer
Collaborator

Bytes-Explorer commented Oct 28, 2024 via email

@sujee
Contributor Author

sujee commented Oct 29, 2024

@dolfim-ibm Is Docling v2 supported on Windows natively?

@dolfim-ibm
Member

> @dolfim-ibm is Docling v2 supported on Windows natively?

Yes, this has been supported since v1.17.0.

@dolfim-ibm
Member

The version installed with #756 should now be faster (20-30%).

Additionally, you can use the parameter bitmap_area_threshold to run OCR only on large images.

  • This is the ratio of the bitmap area to the page area. If the ratio is larger than the threshold, the image is processed with OCR; otherwise it is skipped.
  • The default value is 0.05 (5%).
  • The default should already skip small logos, etc., but you can try a value of 0.5 (50%) to skip other embedded bitmap images that are not needed.
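
The thresholding rule described above can be sketched in a few lines. This is an illustration of the described behavior only, not Docling's actual implementation; the function name `should_run_ocr` and the page and image dimensions are made up for the example.

```python
def should_run_ocr(bitmap_area: float, page_area: float, threshold: float = 0.05) -> bool:
    """Return True when the bitmap covers more than `threshold` of the page.

    Mirrors the rule described for bitmap_area_threshold: ratio larger than
    the threshold means OCR, otherwise the image is skipped.
    """
    if page_area <= 0:
        raise ValueError("page_area must be positive")
    return bitmap_area / page_area > threshold


# Hypothetical sizes in points: an A4-like page, a small logo, a large figure.
page = 595 * 842
logo = 100 * 40      # under 1% of the page
figure = 500 * 400   # roughly 40% of the page

print(should_run_ocr(logo, page))            # skipped at the default 0.05
print(should_run_ocr(figure, page))          # processed at the default 0.05
print(should_run_ocr(figure, page, 0.5))     # skipped when the threshold is 0.5
```

Raising the threshold trades OCR coverage for speed, so text inside skipped images would not be extracted.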


4 participants