Funders and funding #1046

Merged: 46 commits merged from funders-and-funding into master on Aug 28, 2023
Conversation

@kermitt2 (Owner) commented Aug 25, 2023

This PR introduces an additional model called funding-acknowledgement, which parses the content of the funding and acknowledgement sections. This includes identification of the mentioned entities (person, affiliation/institution, project), with a particular effort on funder and funding information. Funder names, grant numbers, funded projects, funding programs and grant names are recognized.
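For reference, a minimal sketch of exercising the new extraction against a running GROBID service, assuming the standard processFulltextDocument endpoint and the consolidateFunders parameter exposed for this feature (file name and host are illustrative):

import requests

# Hypothetical input PDF; assumes a GROBID service running locally on port 8070.
with open("article.pdf", "rb") as pdf:
    response = requests.post(
        "http://localhost:8070/api/processFulltextDocument",
        files={"input": pdf},
        data={"consolidateFunders": "1"},  # enable the CrossRef funder look-up
        timeout=120,
    )

# The response body is TEI XML, with <funder> elements in the header.
print(response.text)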

Results are serialized in the TEI with the list of funders in the TEI header:

<funder ref="#_JVMscTc">
	<orgName type="full">National Institutes of Health</orgName>
	<orgName type="abbreviated">NIH</orgName>
	<idno type="DOI" subtype="crossref">10.13039/100000002</idno>
</funder>
<funder>
	<orgName type="full">Hopkins Sommer Scholarship</orgName>
</funder>
<funder>
	<orgName type="full">Lieber Institute for Brain Development</orgName>
	<orgName type="abbreviated">LIBD</orgName>
	<idno type="DOI" subtype="crossref">10.13039/100015503</idno>
</funder>

The @ref attribute links the funder to one or more funding elements, which describe the funding with (when identified) grant number, grant name, funded project and name of the funding program:

<back>
	<div type="acknowledgement">
		<div>
			<head>Acknowledgements</head>
			<p>JL and BL are supported by <rs type="funder">NIH</rs> <rs type="grantNumber">R01 GM105705</rs>. AF is supported by a <rs type="funder">Hopkins Sommer Scholarship</rs>. AJ is supported by the <rs type="funder">Lieber Institute for Brain Development</rs></p>
		</div>
	</div>
	<listOrg type="funding">
		<org type="funding" xml:id="_JVMscTc">
			<idno type="grant-number">R01 GM105705</idno>
		</org>
	</listOrg>
	
	<div type="references">
...

As shown above, the acknowledgement and funding sections are enriched with inline markup corresponding to the identified entities.
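As a hedged illustration of how the @ref/xml:id link can be consumed downstream (not part of the PR itself; the file name is hypothetical), the funders in the TEI header can be joined with their funding descriptions in <back> using only the Python standard library:

import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

root = ET.parse("article.grobid.tei.xml").getroot()  # hypothetical output file

# Index the <org type="funding"> descriptions in <back> by their xml:id.
fundings = {}
for org in root.iter(TEI + "org"):
    if org.get("type") == "funding" and org.get(XML_ID):
        fundings[org.get(XML_ID)] = [idno.text for idno in org.findall(TEI + "idno")]

# Resolve each header <funder> to the funding it references via @ref.
for funder in root.iter(TEI + "funder"):
    name = funder.findtext(TEI + "orgName")
    ref = (funder.get("ref") or "").lstrip("#")
    print(name, "->", fundings.get(ref, []))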

In addition, the identified funders can be consolidated through a look-up, currently limited to the CrossRef Funder Registry, using the CrossRef REST API. When a funder name is matched with sufficient confidence to a registered CrossRef funder, we add the DOI of the funder as well as additional normalized names, acronym and country.
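For illustration only, the underlying registry can be queried directly; a minimal sketch against the public CrossRef REST API (GROBID performs its own matching internally, this just shows the kind of look-up involved):

import requests

response = requests.get(
    "https://api.crossref.org/funders",
    params={"query": "National Institutes of Health", "rows": 3},
    timeout=30,
)

# Each item carries the registry identifier (the funder DOI suffix),
# the normalized name and alternative names/acronyms.
for item in response.json()["message"]["items"]:
    print(item["id"], item["name"], item.get("alt-names", []))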

Consolidation will be improved via biblio-glutton in a later phase.

The PR includes a complete revision of the segmentation training data for the acknowledgement and funding sections, as well as a set of around 1,500 manually annotated funding and acknowledgement sections.

Standalone model accuracy (strict field matching): the winner is SciBERT+CRF, as is usual for NER in scientific text.

CRF funding-acknowledgement model

===== Field-level results =====

label                accuracy     precision    recall       f1           support

<affiliation>        98.36        61.84        50           55.29        94     
<funderName>         93.6         69.1         68.96        69.03        480    
<grantName>          98.64        42.22        33.93        37.62        56     
<grantNumber>        98.43        90.7         88.95        89.82        362    
<institution>        95.93        36           36.73        36.36        147    
<person>             98.34        93.69        93.84        93.77        617    
<programName>        98.81        35.14        29.55        32.1         44     
<projectName>        99.09        46.67        17.07        25           41     

all (micro avg.)     97.65        77.3         74.52        75.88        1841   
all (macro avg.)     97.65        59.42        52.38        54.87        1841   

===== Instance-level results =====

Total expected instances:   316
Correct instances:          130
Instance-level recall:      41.14

BidLSTM_CRF_FEATURES

  f1 (micro): 75.89
                  precision    recall  f1-score   support

   <affiliation>     0.7000    0.8750    0.7778        24
    <funderName>     0.7165    0.7333    0.7248       255
     <grantName>     0.3636    0.3077    0.3333        26
   <grantNumber>     0.8171    0.8938    0.8537       160
   <institution>     0.4955    0.5340    0.5140       103
        <person>     0.9416    0.9699    0.9556       266
   <programName>     0.2800    0.3043    0.2917        23
   <projectName>     0.3750    0.4412    0.4054        34

all (micro avg.)     0.7399    0.7789    0.7589       891

BidLSTM_CRF_FEATURES + ELMo

Average over 10 folds
                  precision    recall  f1-score   support

   <affiliation>     0.7120    0.8833    0.7878        24
    <funderName>     0.6911    0.8000    0.7411       255
     <grantName>     0.4220    0.4423    0.4309        26
   <grantNumber>     0.8044    0.8781    0.8396       160
   <institution>     0.5717    0.5515    0.5596       103
        <person>     0.9511    0.9643    0.9576       266
   <programName>     0.2970    0.2913    0.2924        23
   <projectName>     0.4887    0.4912    0.4894        34

all (micro avg.)     0.7483    0.8012    0.7739 

BERT (allenai/scibert_scivocab_cased)

                  precision    recall  f1-score   support

   <affiliation>     0.7368    0.8000    0.7671        35
    <funderName>     0.6900    0.7670    0.7264       206
     <grantName>     0.3143    0.4074    0.3548        27
   <grantNumber>     0.9185    0.9394    0.9288       132
   <institution>     0.4167    0.4348    0.4255        69
        <person>     0.9386    0.9701    0.9541       268
   <programName>     0.1500    0.3000    0.2000        10
   <projectName>     0.0952    0.2857    0.1429         7

all (micro avg.)     0.7449    0.8170    0.7793       754

BERT (allenai/scibert_scivocab_cased) + CRF

                  precision    recall  f1-score   support

   <affiliation>     0.7436    0.8286    0.7838        35
    <funderName>     0.6725    0.7476    0.7080       206
     <grantName>     0.3000    0.3333    0.3158        27
   <grantNumber>     0.8929    0.9470    0.9191       132
   <institution>     0.4557    0.5217    0.4865        69
        <person>     0.9628    0.9664    0.9646       268
   <programName>     0.1875    0.3000    0.2308        10
   <projectName>     0.1579    0.4286    0.2308         7

all (micro avg.)     0.7527    0.8196    0.7848       754

BERT (bert-base-cased) + CRF

                  precision    recall  f1-score   support

   <affiliation>     0.7179    0.8000    0.7568        35
    <funderName>     0.6754    0.7476    0.7097       206
     <grantName>     0.2632    0.3704    0.3077        27
   <grantNumber>     0.8936    0.9545    0.9231       132
   <institution>     0.4430    0.5072    0.4730        69
        <person>     0.9526    0.9739    0.9631       268
   <programName>     0.1304    0.3000    0.1818        10
   <projectName>     0.1818    0.5714    0.2759         7

all (micro avg.)     0.7358    0.8236    0.7772       754

@kermitt2 self-assigned this Aug 25, 2023
@coveralls commented Aug 25, 2023

Coverage Status: 40.119% (+0.1%) from 39.98% when pulling c384ff1 on funders-and-funding into 707030a on master.

@kermitt2 merged commit 25caaaf into master on Aug 28, 2023 (3 checks passed)
@lfoppiano added this to the 0.8.0 milestone on Jun 9, 2024