[Bug] improve performance of pdf2parquet #573

sujee · 2024-09-05T07:24:11Z

Search before asking

I searched the issues and found no similar issues.

Component

Other

What happened + What you expected to happen

Extracting text from PDFs into Parquet is slow. It processes about 1 page per second, so a 300-page PDF takes about 300 seconds (5 minutes).

This negatively affects the user experience, as PDF2PQ is usually one of the first few steps in many workflows.
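
To put those numbers in perspective, here is a minimal sketch of the arithmetic, assuming throughput stays constant at the observed rate of roughly 1 page per second. The function name `estimated_seconds` and the batch of three PDFs are illustrative only and are not part of pdf2parquet.

```python
def estimated_seconds(num_pages: int, pages_per_second: float = 1.0) -> float:
    """Estimate wall-clock extraction time at a constant page throughput."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return num_pages / pages_per_second


# A 300-page PDF at the observed rate of about 1 page/second.
single = estimated_seconds(300)
print(f"300 pages: {single:.0f} s ({single / 60:.0f} min)")

# Hypothetical batch: three PDFs of about 100 pages each, processed sequentially.
batch = sum(estimated_seconds(n) for n in [100, 100, 100])
print(f"3 x 100 pages: {batch:.0f} s")
```

At this rate the cost grows linearly with page count, so any per-page speedup translates directly into the same relative reduction in total time.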

Reproduction script

data : https://github.com/sujee/data-prep-kit/tree/perf-1-pdf2pq/test/perf-pdf2pq/input
(These PDFs are about 100 pages each)

Instructions and minimal code to reproduce the problem are here : https://github.com/sujee/data-prep-kit/tree/perf-1-pdf2pq/test/perf-pdf2pq

instructions (README.md) : https://github.com/sujee/data-prep-kit/blob/perf-1-pdf2pq/test/perf-pdf2pq/README.md

A py-spy generated speedscope file is attached. It can be viewed at https://www.speedscope.app/

test_pdf2pq_py.speed.txt

Anything else

No response

OS

Ubuntu

Python

3.11.x

Are you willing to submit a PR?

Yes I am willing to submit a PR!

Bytes-Explorer · 2024-10-28T04:52:47Z

@dolfim-ibm Can you share any updates on this pls?
cc @touma-I

dolfim-ibm · 2024-10-28T11:37:55Z

Soon we will update DPK to use the new Docling v2. As part of the new features (together with support for docx, html, pptx, etc.), we have a new parser which is about 10x faster. See https://github.com/DS4SD/docling-parse/?tab=readme-ov-file#performance-benchmarks. Note that this is not the speedup of the full pipeline, but of one of its important pieces.

Medium term, we are actually running heavy benchmarks to identify the characteristic timing and compare with other tools.

Bytes-Explorer · 2024-10-28T14:07:44Z

Thank you @dolfim-ibm When do you expect to integrate this change?

dolfim-ibm · 2024-10-28T14:35:07Z

> Thank you @dolfim-ibm When do you expect to integrate this change?

Should be doable this week.

Bytes-Explorer · 2024-10-28T15:02:37Z

Thanks!

sujee · 2024-10-29T14:33:42Z

@dolfim-ibm is Docling v2 supported on windows natively?

dolfim-ibm · 2024-10-29T14:38:01Z

> @dolfim-ibm is Docling v2 supported on windows natively?

Yes, this has been supported since v1.17.0.

dolfim-ibm · 2024-11-01T07:31:54Z

The version installed with #756 should now be faster (20-30%).

Additionally, you can use the parameter bitmap_area_threshold to run OCR only on large images.

This is the ratio of the bitmap area to the page area. If the ratio is larger than the threshold, the image is processed with OCR; otherwise it is skipped.
The default value is 0.05 (5%).
The default should already skip small logos and the like, but you can try a value of 0.5 (50%) to skip other embedded bitmap images that are not needed.
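
The rule described above can be sketched in plain Python. This only illustrates the thresholding logic as stated in this comment and is not Docling's actual implementation; the function `should_ocr` and the example page and image dimensions are hypothetical.

```python
def should_ocr(bitmap_area: float, page_area: float, threshold: float = 0.05) -> bool:
    """Return True when the bitmap covers a larger fraction of the page than threshold."""
    if page_area <= 0:
        raise ValueError("page_area must be positive")
    return bitmap_area / page_area > threshold


# Hypothetical US Letter page measured in points (612 x 792).
page_area = 612 * 792

logo = 60 * 40         # small logo, about 0.5% of the page
figure = 400 * 300     # embedded figure, about 25% of the page
full_scan = 612 * 792  # scanned page image, 100% of the page

for name, area in [("logo", logo), ("figure", figure), ("full_scan", full_scan)]:
    # Compare the default threshold (0.05) with the stricter suggested value (0.5).
    print(name, should_ocr(area, page_area, 0.05), should_ocr(area, page_area, 0.5))
```

In this sketch, raising the threshold from 0.05 to 0.5 skips the mid-sized figure while still sending the full-page scan to OCR, which is the trade-off the suggestion is aiming at.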

sujee added the "bug" label Sep 5, 2024

daw3rd assigned daw3rd and dolfim-ibm and unassigned daw3rd Sep 12, 2024

daw3rd added the "enhancement" label and removed the "bug" label Sep 12, 2024

Bytes-Explorer added the high priority label Oct 29, 2024

dolfim-ibm mentioned this issue Oct 30, 2024

Update pdf2parquet to Docling v2 #756

Merged

5 tasks
