@jamesbraza
Collaborator

Adding a logo to the media duplication PDF somehow exposes Docling not handling an edge case: a FormulaItem appearing in text.

@jamesbraza jamesbraza self-assigned this Dec 2, 2025
@jamesbraza jamesbraza added the bug Something isn't working label Dec 2, 2025
Copilot AI review requested due to automatic review settings December 2, 2025 22:15
@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Dec 2, 2025
@dosubot

dosubot bot commented Dec 2, 2025

Documentation Updates

1 document(s) were updated by changes in this PR:

paper-qa


Copilot finished reviewing on behalf of jamesbraza December 2, 2025 22:18
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes an edge case where Docling was not properly handling FormulaItem elements in PDFs. The bug was exposed by adding logos to the media duplication test PDF.

Key Changes

  • Added support for FormulaItem in both text extraction and media parsing
  • Added fallback to item.orig when formula text sanitization fails
  • Updated test expectations to account for logo images being detected
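
The fallback to item.orig described above could look roughly like the sketch below. This is not the code from reader.py: `FormulaItemStub` and `formula_text` are hypothetical names, the stub only mirrors the `text` and `orig` fields, and treating an empty or whitespace-only `text` as a failed sanitization is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FormulaItemStub:
    """Hypothetical stand-in for a Docling FormulaItem (not the real class)."""

    text: str  # sanitized formula text; assumed empty when sanitization fails
    orig: str  # original, untreated text of the formula


def formula_text(item: FormulaItemStub) -> str:
    """Return the sanitized text, falling back to the original text."""
    # Assumption: a failed sanitization leaves `text` empty or whitespace-only.
    return item.text if item.text.strip() else item.orig
```

With this shape, a formula whose sanitized text came back empty still contributes its original text to the extracted page content instead of being dropped.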

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tests/duplicate_media_template.md | Added Wikimedia logos to each page and updated pandoc command instructions |
| packages/paper-qa-docling/tests/test_paperqa_docling.py | Updated expected image counts and deduplication thresholds to account for additional logo images |
| packages/paper-qa-docling/src/paperqa_docling/reader.py | Added FormulaItem handling for text extraction and media parsing, with defensive annotation access |
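
The "defensive annotation access" mentioned for reader.py is not shown in this conversation. One common way to do it, assuming some item types may not expose an `annotations` attribute at all, is a `getattr` with a default. The stub classes and `get_annotations` below are hypothetical and only illustrate the pattern.

```python
from dataclasses import dataclass, field


@dataclass
class PictureStub:
    """Hypothetical item type that carries annotations."""

    annotations: list = field(default_factory=list)


@dataclass
class FormulaStub:
    """Hypothetical item type with no `annotations` attribute."""

    text: str = ""


def get_annotations(item: object) -> list:
    """Return the item's annotations, or an empty list if it has none."""
    # getattr with a default avoids AttributeError on types lacking the
    # attribute; `or []` also covers an attribute that is present but None.
    return list(getattr(item, "annotations", None) or [])
```

This lets the same media-parsing path handle both kinds of item without branching on the concrete type.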

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Dec 3, 2025
@jamesbraza jamesbraza merged commit 931d39f into main Dec 4, 2025
15 checks passed
@jamesbraza jamesbraza deleted the filtering-logos branch December 4, 2025 08:02