Issues with Table Extraction in Multi-Column PDF #4293

chayennemosk · 2025-02-13T09:46:31Z


Hi everyone,

I'm trying to extract text and tables from a multi-column PDF that was likely generated from PowerPoint. Since the layout is a bit different from standard PDFs, I decided to use pymupdf4llm for structured extraction.

Right now, I’m using:
md_text = pymupdf4llm.to_markdown("financial-management-strategic-planning-budgeting.pdf")
It gets most of the text right, but I am having trouble with tables.

One example is slide 4. Instead of a properly structured table, I get:

Summary of key takeaways

Col1	What good looks like	Key actions for finance leaders

The table is not populated correctly (only a header row appears), and other tables come out merged or misaligned, for example:

Col1	Col2	Why this is important
Medium-term planning Long-term strategic (3 to 5 years) planning (> 5 years)

• [Multi-year allocations ] • [Long-term clarity ] of funding on funding, to

Text from adjacent columns is interleaved, which makes the output difficult to parse.
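To make the problem easier to reproduce, here is a minimal sketch of how the affected tables could be flagged automatically. It assumes the Markdown output uses pipe-delimited tables and that unnamed header cells are filled with Col1, Col2, and so on, as in the output above. The helper name find_placeholder_headers is hypothetical and not part of pymupdf4llm.

```python
import re

# Assumption: auto-generated header names look like "Col1", "Col2", ...
PLACEHOLDER = re.compile(r"^Col\d+$")
# A Markdown table separator row such as |---|---| or |:--|--:|
SEPARATOR = re.compile(r"\|(\s*:?-+:?\s*\|)+")


def find_placeholder_headers(md_text):
    """Return (line_number, header_cells) for every Markdown table whose
    header row contains at least one auto-generated "ColN" cell."""
    lines = md_text.splitlines()
    flagged = []
    for i in range(len(lines) - 1):
        header = lines[i].strip()
        separator = lines[i + 1].strip()
        # A table header is a pipe-delimited row followed by a separator row.
        if not (header.startswith("|") and header.endswith("|")):
            continue
        if not SEPARATOR.fullmatch(separator):
            continue
        cells = [cell.strip() for cell in header.split("|")[1:-1]]
        if any(PLACEHOLDER.match(cell) for cell in cells):
            flagged.append((i + 1, cells))  # 1-based line number
    return flagged


# Small hand-written sample modelled on the slide 4 output.
sample = (
    "Summary of key takeaways\n"
    "\n"
    "|Col1|What good looks like|Key actions for finance leaders|\n"
    "|---|---|---|\n"
)
print(find_placeholder_headers(sample))
```

On the real document this would be called with the md_text string returned by pymupdf4llm.to_markdown, which gives a quick list of the tables that need a closer look.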

Questions:

Is there a way to fine-tune pymupdf4llm for better table extraction?
Should I switch to another library (e.g., pdfplumber or pymupdf manual parsing)?

I've attached the PDF.

Many thanks in advance!

financial-management-strategic-planning-budgeting.pdf
