Link footnotes in the text #944

lfoppiano · 2022-08-27T19:27:08Z

This PR aims to link footnotes (already extracted from the segmentation model) to the text.
The current implementation uses a heuristic: it takes the footnote number and searches for the matching marker in the paragraph text on the same page as the footnote.
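As a rough illustration of the lookup described above (not the code in this PR), here is a minimal Python sketch. The `Footnote` and `Paragraph` classes and the `find_marker` function are hypothetical names introduced only for this example, and the sample texts are made up.

```python
import re
from dataclasses import dataclass


@dataclass
class Footnote:  # hypothetical container, not a Grobid class
    number: int  # label found at the start of the footnote
    page: int


@dataclass
class Paragraph:  # hypothetical container, not a Grobid class
    page: int
    text: str


def find_marker(footnote, paragraphs):
    """Return (paragraph index, character offset) of the first candidate marker, or None."""
    # The number must stand alone, so 5 does not match inside 1995 or 25.
    pattern = re.compile(r"(?<!\d)" + str(footnote.number) + r"(?!\d)")
    for index, paragraph in enumerate(paragraphs):
        if paragraph.page != footnote.page:
            continue  # only search the page where the footnote sits
        match = pattern.search(paragraph.text)
        if match:
            return index, match.start()
    return None


paragraphs = [
    Paragraph(page=1, text="Published in 1995 with 5 co-authors."),
    Paragraph(page=2, text="typically one to three paragraphs in length.5 Many problems"),
]
hit = find_marker(Footnote(number=5, page=2), paragraphs)
```

Because the sketch only looks at the text, any standalone number on the page is a candidate, which is where false attachments can come from.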

As an example:

The result is injected in the output XML as (using the same strategy as the references):

```xml
The building blocks for these theories are phrasal or clausal units, and the targets of the analyses are usually very short texts, typically one to three paragraphs in length.
<ref type="foot" target="#b5">5</ref> Many problems in discourse analysis, such as dialogue generation and turntaking
<ref type="bibr" target="#b47">(Moore and Pollack 1992;</ref>
```

What still needs to be verified:

- When we look up the marker in the text, we might link to the wrong place. False positives usually occur when the text contains a numbered list. Example:

and the output is linked to the list item:

   ```xml
   <p>the relative length of the document,
       <ref type="foot" target="#b2">2</ref>. the frequency of the term sets in the document, and 3. the distribution of the term sets with respect to the document and to each other.
   </p>
   ```

- The subscript/superscript attributes are not reliable, and I did not find a consistent alternative way to tell when a marker in the text is not a footnote marker.

coveralls · 2022-08-27T19:47:10Z

Coverage increased (+0.08%) to 39.961% when pulling 655ccdf on features/footnotes into 54d1c29 on master.

kermitt2 · 2022-08-28T01:01:08Z

Thanks a lot Luca!

I think without the constraint on superscript for footnote callout, this approach cannot work (too many false attachments).

Normally the superscript attribute is reliable when it is set to true, but coverage is incomplete: there are several cases where pdfalto does not yet detect superscript. However, as pdfalto improves on this, the coverage of a heuristic with a superscript condition will improve.

Do you have examples of superscript attributes incorrectly set to true? This would be useful for pdfalto as I don't have any for the moment.

Note: Grobid types the reference callouts at document level (https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/engines/citations/CalloutAnalyzer.java#L21). This would tell us whether the reference callouts are superscript too, and prevent some false positives in the rare case where both footnotes and references use superscript numbers and, unluckily, the same number appears on the same page for both a reference and a footnote.
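To make the ambiguity check in the note concrete, here is a minimal Python sketch. It does not use the CalloutAnalyzer API: the `CalloutStyle` enum and the `is_ambiguous` function are hypothetical stand-ins, and it assumes we already know the document-level reference style and the set of reference numbers.

```python
from enum import Enum


class CalloutStyle(Enum):  # hypothetical stand-in for a document-level callout type
    BRACKET = "bracket"
    PARENTHESIS = "parenthesis"
    SUPERSCRIPT = "superscript"


def is_ambiguous(number, reference_style, reference_numbers):
    """True if a superscript number could be a reference callout as well as a footnote callout.

    This can only happen when the references of the document are themselves
    superscript numbers and the same index exists in the reference list.
    """
    return reference_style is CalloutStyle.SUPERSCRIPT and number in reference_numbers


# References are bracketed in this document: a superscript 3 can only be a footnote.
safe = is_ambiguous(3, CalloutStyle.BRACKET, {1, 2, 3})
# References are superscript numbers and a reference 3 exists: leave it unlinked.
risky = is_ambiguous(3, CalloutStyle.SUPERSCRIPT, {1, 2, 3})
```

In this sketch an ambiguous number is simply left unlinked, which trades some recall for fewer wrong attachments.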

lfoppiano · 2022-08-30T16:38:32Z

Thanks!
When looking for the footnote marker in the text, I've added a constraint requiring superscript = true.
I only have examples of documents where the superscript attribute is not set at all.
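For readers following along, the superscript constraint can be sketched as below. This is a simplified Python illustration, not the Java code of this branch: the `Token` class and the `find_callouts` function are hypothetical, and the token list is a made-up version of the list example from the first comment.

```python
from dataclasses import dataclass


@dataclass
class Token:  # hypothetical layout token with the attributes used here
    text: str
    page: int
    superscript: bool


def find_callouts(tokens, number, page):
    """Return the indices of superscript tokens on `page` whose text is the footnote number."""
    label = str(number)
    return [
        index
        for index, token in enumerate(tokens)
        if token.page == page and token.superscript and token.text == label
    ]


tokens = [
    Token("document,", 3, False),
    Token("2", 3, False),  # list item number: same text, but not superscript
    Token(".", 3, False),
    Token("the", 3, False),
    Token("frequency", 3, False),
    Token("2", 3, True),  # footnote callout
]
callouts = find_callouts(tokens, number=2, page=3)
```

The list item number is no longer matched. The price is that a real callout whose superscript attribute is not set is missed as well.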

I forgot to check the code in your note, I will have a look after I land in JP.

Moreover, here is the list of all the segmentation model training files containing footnotes:

022152v1.training.segmentation.tei.xml
056150v1.training.segmentation.tei.xml
0807.3577.training.segmentation.tei.xml
0811_0088.training.segmentation.tei.xml
0911_5430.training.segmentation.tei.xml
100.v84-264.training.segmentation.tei.xml
1003._0908.0095.training.segmentation.tei.xml
1013._technote.training.segmentation.tei.xml
1022._GoNaPhSe07_CustomersGeneralProperties.training.segmentation.tei.xml
1027._PhyRevD77-064013.training.segmentation.tei.xml
104.v84-299.training.segmentation.tei.xml
1046._p33-hearst.training.segmentation.tei.xml
1050._CLEF08Working_Notes_QA_Overview.training.segmentation.tei.xml
1105.training.segmentation.tei.xml
119.v83-444.training.segmentation.tei.xml
12.v88-171.training.segmentation.tei.xml
120._10.1.1.31.3616.training.segmentation.tei.xml
121._10.1.1.47.6586.training.segmentation.tei.xml
128._61008.training.segmentation.tei.xml
130._10.1.1.31.8153.training.segmentation.tei.xml
1309.7222.training.segmentation.tei.xml
131._10.1.1.45.9641.training.segmentation.tei.xml
135._sigcomm98.training.segmentation.tei.xml
145._shade.training.segmentation.tei.xml
146._10.1.1.43.8658.training.segmentation.tei.xml
150._75067.training.segmentation.tei.xml
1512.00014.training.segmentation.tei.xml
155._10.1.1.52.3535.training.segmentation.tei.xml
167._10.1.1.25.1950.training.segmentation.tei.xml
17.v88-048.training.segmentation.tei.xml
2.v91-008.training.segmentation.tei.xml
2020.02.17.20023747v2.full.training.segmentation.tei.xml
27.v87-316.training.segmentation.tei.xml
270._45580.training.segmentation.tei.xml
3.v90-330.training.segmentation.tei.xml
31.v87-066.training.segmentation.tei.xml
368._10.1.1.49.6162.training.segmentation.tei.xml
390._woolf.training.segmentation.tei.xml
394._10.1.1.47.8740.training.segmentation.tei.xml
439._GoyalVT98.training.segmentation.tei.xml
440._10.1.1.41.3430.training.segmentation.tei.xml
452._29904.training.segmentation.tei.xml
474._10.1.1.117.8006.training.segmentation.tei.xml
491._FrH8.training.segmentation.tei.xml
492._10.1.1.62.4528.training.segmentation.tei.xml
50.v86-097.training.segmentation.tei.xml
55001267.training.segmentation.tei.xml
55001337.training.segmentation.tei.xml
60.v85-592.training.segmentation.tei.xml
71.v85-432.training.segmentation.tei.xml
9.v89-169.training.segmentation.tei.xml
9911409.training.segmentation.tei.xml
Amilhat_Parinas.training.segmentation.tei.xml
AUSSANT2014INTER.training.segmentation.tei.xml
Bioinformatics-2007-Rivals-401-7.training.segmentation.tei.xml
C02-1160.training.segmentation.tei.xml
C12-1005.training.segmentation.tei.xml
Document_image_zone_classification_A_simple_high-p.training.segmentation.tei.xml
E14-1007.training.segmentation.tei.xml
E14-1075.training.segmentation.tei.xml
ecdl_dilia.training.segmentation.tei.xml
exception-analysis-resilience-ist.training.segmentation.tei.xml
f7247483-2721-4a2f-ace0-6113d752418a.training.segmentation.tei.xml
HCII07-LongSteph.training.segmentation.tei.xml
ims.training.segmentation.tei.xml
ipamin2014_paper4.training.segmentation.tei.xml
JAP0897669-CC.training.segmentation.tei.xml
MNRAS-2015-Richard-L16-20.training.segmentation.tei.xml
nihms743075.training.segmentation.tei.xml
P98-2139.training.segmentation.tei.xml
PMC4317227.training.segmentation.tei.xml
SSRN-id1425692-2.training.segmentation.tei.xml
W00-0734.training.segmentation.tei.xml
W09-1401.training.segmentation.tei.xml
W09-1403.training.segmentation.tei.xml
W09-1417.training.segmentation.tei.xml
W12-4305.training.segmentation.tei.xml
Wang-paperAVE2008.training.segmentation.tei.xml

lfoppiano · 2022-09-07T03:43:58Z

I've reviewed the code of the CalloutAnalyzer and my code and I think it's ready for review.

kermitt2 · 2022-09-24T16:46:42Z

I made quite a lot of changes:

- I removed some unused and redundant code. There are two types of notes, foot notes and margin notes. Margin notes were using the old code, but foot notes the new one. I simplified this with a single Note object covering both types and using the same methods.
- The TEI inline serialization of foot notes did not support more than one footnote callout match per paragraph and was "eating" some paragraph content in one case. I rewrote it to support several matches, sort them by position and rebuild the paragraph part by part.
- I added a post-processing step for foot notes that are not correctly segmented (it seems that the segmentation model tends to agglutinate several foot notes together when they follow each other).
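The idea behind the rewritten serialization (several matches, sorted by position, paragraph rebuilt part by part) can be sketched in Python as follows. This is only an illustration under assumptions, not the Java serializer: `build_paragraph`, the `(start, end, note_id)` match tuples and the `note_1`/`note_2` identifiers are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_paragraph(text, matches):
    """Rebuild a <p> element from `text` and callout matches.

    `matches` is an iterable of (start, end, note_id) character spans.
    """
    paragraph = ET.Element("p")
    cursor = 0
    last_ref = None
    for start, end, note_id in sorted(matches):
        if start < cursor:
            continue  # skip a match overlapping the previous one
        chunk = text[cursor:start]
        if last_ref is None:
            paragraph.text = chunk  # text before the first callout
        else:
            last_ref.tail = chunk  # text between two callouts
        last_ref = ET.SubElement(paragraph, "ref", {"type": "foot", "target": "#" + note_id})
        last_ref.text = text[start:end]
        cursor = end
    rest = text[cursor:]
    if last_ref is None:
        paragraph.text = rest
    else:
        last_ref.tail = rest  # text after the last callout is kept, not "eaten"
    return paragraph


text = "First claim.1 Second claim.2 End."
first, second = text.index("1"), text.index("2")
# Matches are given out of order on purpose; they are sorted before rebuilding.
paragraph = build_paragraph(text, [(second, second + 1, "note_2"), (first, first + 1, "note_1")])
```

Since every character between `cursor` and the next match is copied, concatenating all text nodes of the result gives back the original paragraph text.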

There's still one thing to do to get it working: most of the superscript numbers will be recognized as bibliographical markers. These are filtered based on the value of MarkerType. If bibliographical markers are mainly in parentheses, brackets or superscript, the MarkerType is first set to this marker style to avoid mixing them with table markers, figure markers and footnote markers (which must use a different style in a proper document). So most note markers will not appear labeled as paragraph, but as bibliographical markers, which are then filtered out because they do not look like the bibliographical style of the current document.

-> So what needs to be done: match the note labels with the filtered-out superscript bibliographical markers.
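A minimal Python sketch of this matching step, under assumptions: the `Marker` class, the style strings and `split_markers` are hypothetical and only mimic the filtering described above, they are not the MarkerType code in Grobid.

```python
from dataclasses import dataclass


@dataclass
class Marker:  # hypothetical candidate labeled as a bibliographical marker
    text: str
    page: int
    style: str  # "bracket", "parenthesis" or "superscript"


def split_markers(markers, document_style, note_labels_by_page):
    """Split candidates into kept reference callouts, recovered note callouts and dropped ones."""
    kept, recovered, dropped = [], [], []
    for marker in markers:
        if marker.style == document_style:
            kept.append(marker)  # matches the bibliographical style of the document
        elif marker.style == "superscript" and marker.text in note_labels_by_page.get(marker.page, set()):
            recovered.append(marker)  # filtered out as reference, but matches a note label on its page
        else:
            dropped.append(marker)
    return kept, recovered, dropped


markers = [
    Marker("[12]", 1, "bracket"),
    Marker("3", 1, "superscript"),
    Marker("7", 1, "superscript"),
    Marker("(Smith 2020)", 2, "parenthesis"),
]
kept, recovered, dropped = split_markers(markers, "bracket", {1: {"3"}})
```

Here only the superscript `3` is recovered, because page 1 has a note labeled `3`; the superscript `7` has no matching note on its page and stays dropped.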

kermitt2 · 2022-09-24T18:25:18Z

I added the bibliographical callout "recovery" as footnote callout.
It probably needs a bit more testing, and there is a minor issue with the space after the footnote callout in the TEI.

For instance, in the PDF CIKM_2021_final_1085.pdf we have 14 foot notes. We were matching only 3 in the text body. Now we match 10, and the 4 missing ones are foot notes not recognized by the segmentation model, so not matchable.

lfoppiano added 2 commits August 28, 2022 00:17

link footnotes with heuristics

04b3500

fix @target attribute for footnotes

7df6a01

lfoppiano added 2 commits August 28, 2022 10:58

link footnotes to superscript tokens only

6403d12

cleanup

761a1f6

lfoppiano marked this pull request as ready for review September 7, 2022 03:44

lfoppiano and others added 7 commits September 9, 2022 14:57

add unit tests

8bf7e96

add integration test with sample document

0fbc152

Merge branch 'master' into features/footnotes

01699b7

clean not used and redundant code; factorize margin note and foot note; review footnote object

1207e0b

rename Footnote object to Note everywhere to avoid confusion (margin note are the same)

3cca788

fix tests

063f559

rewrite footnote callout serialization; support multiple footnote callout in same paragraph; fix missing paragraph content

77185b0

kermitt2 added 2 commits September 24, 2022 19:30

cover bibliographicla callout which are in fact foot note callout

f71fe72

fix test

e2ac939

case no ref but note

655ccdf

kermitt2 merged commit f9dc68f into master Sep 27, 2022

lfoppiano deleted the features/footnotes branch September 28, 2022 04:44
