Issues with Table Extraction in Multi-Column PDF #4293
chayennemosk asked this question in Looking for help
Hi everyone,
I'm trying to extract text and tables from a multi-column PDF that was likely generated from PowerPoint. Since the layout is a bit different from standard PDFs, I decided to use pymupdf4llm for structured extraction.
Right now, I’m using:
import pymupdf4llm
md_text = pymupdf4llm.to_markdown("financial-management-strategic-planning-budgeting.pdf")
It gets most of the text right, but I am having trouble with tables.
One example is slide 4. Instead of a properly structured table, I get:
Summary of key takeaways
The table structure is not reproduced, and other tables produce merged or misaligned output, for example:
• [Multi-year allocations ] • [Long-term clarity ] of funding on funding, to
The data is incorrectly formatted and difficult to parse.
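To make the problem easier to pin down, here is a minimal sketch of how the Markdown output could be checked for tables at all. It assumes that extracted tables show up as GitHub-style pipe tables (a header row followed by a `|---|---|` separator row); `find_markdown_tables` and `split_row` are hypothetical helper names, not part of pymupdf4llm.

```python
import re

# A separator row such as |---|:--:|---| marks the header/body boundary
# of a pipe table (assumed output format, see the note above).
SEPARATOR_ROW = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$")


def split_row(line):
    """Split one pipe-table line into a list of stripped cell strings."""
    s = line.strip()
    if s.startswith("|"):
        s = s[1:]
    if s.endswith("|"):
        s = s[:-1]
    return [cell.strip() for cell in s.split("|")]


def find_markdown_tables(md_text):
    """Return every pipe table found in md_text as a list of rows of cells."""
    lines = md_text.splitlines()
    tables = []
    i = 0
    while i < len(lines) - 1:
        header, separator = lines[i], lines[i + 1]
        # A table starts with a line containing "|" followed by a separator row.
        if "|" in header and "|" in separator and SEPARATOR_ROW.match(separator):
            rows = [split_row(header)]
            i += 2
            # Body rows continue until the first line without a pipe.
            while i < len(lines) and "|" in lines[i]:
                rows.append(split_row(lines[i]))
                i += 1
            tables.append(rows)
        else:
            i += 1
    return tables
```

Running `find_markdown_tables(md_text)` on the output for a single slide shows whether any table was emitted for it. A bare heading or the bullet-style line above yields an empty list, which separates "no table detected" from "table detected but cells misaligned".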
Questions:
I've attached the PDF.
Many thanks in advance!
financial-management-strategic-planning-budgeting.pdf