Link footnotes in the text #944
Thanks a lot Luca! I think that without the superscript constraint on the footnote callout, this approach cannot work (too many false attachments). Normally the superscript attribute is reliable when it is set to true, but its coverage is incomplete: there are several cases where pdfalto does not detect superscript at the moment. However, as pdfalto improves on this, the coverage of a heuristic with a superscript condition will improve too. Do you have examples of superscript attributes incorrectly set to true? They would be useful for pdfalto, as I don't have any at the moment. Note: Grobid already types the reference callouts at document level (https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/engines/citations/CalloutAnalyzer.java#L21). This would tell us whether the reference callouts are superscript too, and prevent some false positives in the rare case where both footnotes and references are superscript numbers and, unluckily, a reference and a footnote share the same number on the same page.
Thanks! I forgot to check the code in your note; I will have a look after I land in JP. I am also sharing the list of all the segmentation model training files containing footnotes:
I've reviewed the CalloutAnalyzer code and my own code, and I think the PR is ready for review.
…e; review footnote object
…note are the same)
…lout in same paragraph; fix missing paragraph content
I made quite a lot of changes:
There's still one thing to do to get it working: most of the superscript numbers will be recognized as bibliographical markers. There is a filtering of them based on the value of -> So what remains to be done is to match the note labels with the filtered-out superscript bibliographical markers.
I added the "recovery" of bibliographical callouts as footnote callouts. For instance, in this PDF, CIKM_2021_final_1085.pdf, we have 14 footnotes. We were matching only 3 of them in the text body. Now we match 10, and the 4 missing ones are footnotes that the segmentation model did not recognize, so they cannot be matched.
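To illustrate the recovery step, here is a minimal sketch in Python (not the Java code of this PR). It assumes the superscript markers that were labelled as bibliographical callouts and then filtered out are available as `(page, text)` pairs; the filtering criterion itself is not reproduced. The name `recover_footnote_callouts` and both data shapes are hypothetical.

```python
def recover_footnote_callouts(discarded_markers, note_labels_by_page):
    """Split discarded bibliographical markers into recovered footnote
    callouts and markers that stay discarded.

    discarded_markers: list of (page, text) pairs (hypothetical shape).
    note_labels_by_page: dict mapping a page number to the set of
        footnote labels found on that page (hypothetical shape).
    """
    recovered = []
    remaining = []
    for page, text in discarded_markers:
        label = text.strip()
        # A marker is recovered only if a footnote with the same label
        # exists on the same page.
        if label in note_labels_by_page.get(page, set()):
            recovered.append((page, label))
        else:
            remaining.append((page, text))
    return recovered, remaining
```

A marker whose footnote was missed by the segmentation model has no label to match, so it stays in `remaining`, which is consistent with the unmatched footnotes mentioned above.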
This PR aims to link footnotes (already extracted by the segmentation model) to the text.
The current implementation uses a heuristic: it takes the footnote number and searches for the corresponding marker in the paragraph text on the same page as the footnote.
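The heuristic can be sketched roughly as follows. This is an illustrative Python sketch, not the Java implementation in this PR: the `Token` and `Footnote` classes and the `link_footnotes` function are hypothetical names. It also assumes the superscript constraint discussed in the conversation, so only superscript tokens are treated as candidate callouts.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str          # token text as extracted from the PDF
    page: int          # page number of the token
    superscript: bool  # superscript flag (assumed to come from the PDF parser)


@dataclass
class Footnote:
    label: str  # footnote number, e.g. "1"
    page: int   # page on which the footnote appears
    text: str   # footnote content


def link_footnotes(tokens, footnotes):
    """Return (token_index, footnote) pairs for matched callouts."""
    # Index footnotes by (page, label) so a callout only matches a
    # footnote located on the same page.
    by_key = {(fn.page, fn.label): fn for fn in footnotes}
    links = []
    for i, tok in enumerate(tokens):
        if not tok.superscript:
            continue  # plain numbers in the body text are ignored
        fn = by_key.get((tok.page, tok.text.strip()))
        if fn is not None:
            links.append((i, fn))
    return links
```

In this sketch a plain "1" in the body text, or a superscript "1" on a page without a footnote labelled "1", is left unlinked.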
As an example:
The result is injected in the output XML as (using the same strategy as the references):
What still needs to be verified:
and the output is linked to the list item: