feat(xml-jats): parse XML JATS documents #967

Merged
merged 4 commits into from
Feb 17, 2025

Contributor

@ceberam ceberam commented Feb 14, 2025

Description

This pull request improves and extends the parsing of XML files that follow the PubMed and JATS definitions:

  1. Improve the XML parser backend for PubMed Central articles
    • New features: support single-level lists, respect the order of the items in the document, parse new sections (e.g., the back matter), structure the metadata as PMC typically renders it, and support equations in blocks and in table cells.
    • Fixes: missing blank spaces, missing text in captions, wrong placement assumptions for certain elements (e.g., ref-list), and parsing errors (e.g., etal in citations).
  2. Generalize the parser to other articles and books following the JATS XML DTD
    • Ensure that the document conversion detects JATS XML files through the XML doctype.
    • Extend JATS parsing with the Book Interchange Tag Suite (BITS) extension (e.g., for Springer Nature book sections).
    • Add 2 more documents with CC0 license for regression tests, from the Full Article Samples page in the JATS documentation.
  3. Rename the docling artifacts, replacing the PubMed keyword with JATS.
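
As an illustration of detecting JATS files through the doctype, here is a minimal sketch using only the Python standard library. The helper name `guess_xml_flavor` and its return values are hypothetical and are not taken from the docling code base; the sketch only assumes that the DOCTYPE declaration of a JATS or BITS file mentions `JATS` or `BITS` in its public or system identifier.

```python
import re

# Hypothetical helper for illustration; not docling's actual detection logic.
# Captures everything between "<!DOCTYPE" and the first ">".
_DOCTYPE_RE = re.compile(r"<!DOCTYPE\s+([^>]*)>", re.IGNORECASE)


def guess_xml_flavor(head: str) -> str:
    """Guess the XML flavor from the first characters of a file.

    Returns "jats", "bits", or "unknown".
    """
    match = _DOCTYPE_RE.search(head)
    if match is None:
        return "unknown"
    declaration = match.group(1)
    if "BITS" in declaration:
        return "bits"
    if "JATS" in declaration:
        return "jats"
    return "unknown"


sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving '
    'and Interchange DTD v1.2 20190208//EN" "JATS-archivearticle1.dtd">\n'
    "<article>"
)
print(guess_xml_flavor(sample))
```

Reading only the head of the file keeps the check cheap, since the DOCTYPE declaration precedes the root element.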

Further improvements may include:

  • Leverage text formatting once docling-core supports it (e.g., superscripts, italics and bold formatting)
  • Parse footnotes and inline equations
  • Broaden list support (e.g., ordered lists, different types of elements in list items)

Issue resolved by this Pull Request:
Resolves #893

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

In XML PubMed (JATS) backend, convert authors and affiliations as they
are typically rendered on PDFs.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Instead of removing the newline character from text, replace it with a space character.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
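
A small sketch of why the commit above matters: deleting newlines fuses words that were wrapped across lines in the XML source, while replacing them with a space keeps the words apart. The function names are hypothetical and only illustrate the two behaviours.

```python
def strip_newlines(text: str) -> str:
    # Previous behaviour as described in the commit: newlines are removed.
    return text.replace("\n", "")


def newlines_to_spaces(text: str) -> str:
    # New behaviour as described in the commit: newlines become spaces.
    return text.replace("\n", " ")


wrapped = "gene\nexpression"
print(strip_newlines(wrapped))
print(newlines_to_spaces(wrapped))
```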
Partially support lists, respect reading order, parse more sections, support equations, better text formatting.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
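
To make "respect reading order" and single-level list support concrete, the following sketch walks the children of a JATS `sec` element in document order with `xml.etree.ElementTree`. It is an assumption-laden illustration, not the backend's code: the function name `walk_section`, the output labels, and the sample XML are all made up for this example.

```python
import xml.etree.ElementTree as ET


def _text(elem: ET.Element) -> str:
    # Join all nested text and collapse whitespace (including newlines).
    return " ".join("".join(elem.itertext()).split())


def walk_section(sec: ET.Element) -> list[tuple[str, str]]:
    """Return (label, text) pairs in the order they appear in the section."""
    items = []
    for child in sec:  # iterating an Element yields children in document order
        if child.tag == "title":
            items.append(("heading", _text(child)))
        elif child.tag == "p":
            items.append(("paragraph", _text(child)))
        elif child.tag == "list":
            # Single-level lists only: nested lists are not expanded here.
            for list_item in child.findall("list-item"):
                items.append(("list_item", _text(list_item)))
    return items


xml = """<sec><title>Methods</title><p>We used
two steps.</p><list list-type="bullet"><list-item><p>First</p></list-item>
<list-item><p>Second</p></list-item></list><p>Done.</p></sec>"""
result = walk_section(ET.fromstring(xml))
for label, text in result:
    print(label, "|", text)
```

Iterating over the direct children, rather than querying all paragraphs first and all lists afterwards, is what preserves the interleaving of paragraphs and lists.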
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

mergify bot commented Feb 14, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

@ceberam ceberam changed the title Dev/xml jats feat(xml-jats): parse XML JATS documents Feb 14, 2025
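
The title change can be checked against the conventional commit rule quoted in the merge protections. The sketch below applies that same regular expression with Python's `re` module; the helper name `is_conventional` is hypothetical, and it assumes the rule's `~=` operator behaves like a regular expression search.

```python
import re

# Pattern copied from the "Enforce conventional commit" rule in this thread.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    # Hypothetical helper: True when the title satisfies the pattern.
    return CONVENTIONAL_TITLE.search(title) is not None


print(is_conventional("Dev/xml jats"))
print(is_conventional("feat(xml-jats): parse XML JATS documents"))
```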
Contributor

@dolfim-ibm dolfim-ibm left a comment

nice!

@ceberam ceberam merged commit 428b656 into main Feb 17, 2025
10 checks passed
@ceberam ceberam deleted the dev/xml-jats branch February 17, 2025 09:43
Development

Successfully merging this pull request may close these issues.

Support conversion of JATS format into DoclingDocument
3 participants