Funders and funding #1046

Merged: 46 commits merged from funders-and-funding into master on Aug 28, 2023
Conversation

@kermitt2 (Owner) commented Aug 25, 2023

This PR introduces an additional model called funding-acknowledgement, which parses the content of the funding and acknowledgement sections. This includes identification of the mentioned entities (person, affiliation/institution, project), with a particular effort on funder and funding information. Funder names, grant numbers, funded projects, funding programs and grant names are recognized.
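For reference, a minimal sketch of exercising the new extraction against a running GROBID service, assuming the standard processFulltextDocument endpoint and the consolidateFunders parameter exposed for this feature (file name and host are illustrative):

import requests

# Hypothetical input PDF; assumes a GROBID service running locally on port 8070.
with open("article.pdf", "rb") as pdf:
    response = requests.post(
        "http://localhost:8070/api/processFulltextDocument",
        files={"input": pdf},
        data={"consolidateFunders": "1"},  # enable the CrossRef funder look-up
        timeout=120,
    )

# The response body is TEI XML, with <funder> elements in the header.
print(response.text)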

Results are serialized in the TEI with the list of funders in the TEI header:

<funder ref="#_JVMscTc">
	<orgName type="full">National Institutes of Health</orgName>
	<orgName type="abbreviated">NIH</orgName>
	<idno type="DOI" subtype="crossref">10.13039/100000002</idno>
</funder>
<funder>
	<orgName type="full">Hopkins Sommer Scholarship</orgName>
</funder>
<funder>
	<orgName type="full">Lieber Institute for Brain Development</orgName>
	<orgName type="abbreviated">LIBD</orgName>
	<idno type="DOI" subtype="crossref">10.13039/100015503</idno>
</funder>

The @ref attribute links the funder to one or more funding elements, which describe the funding with (when identified) grant number, grant name, funded project and name of the funding program:

<back>
	<div type="acknowledgement">
		<div>
			<head>Acknowledgements</head>
			<p>JL and BL are supported by <rs type="funder">NIH</rs> <rs type="grantNumber">R01 GM105705</rs>. AF is supported by a <rs type="funder">Hopkins Sommer Scholarship</rs>. AJ is supported by the <rs type="funder">Lieber Institute for Brain Development</rs></p>
		</div>
	</div>
	<listOrg type="funding">
		<org type="funding" xml:id="_JVMscTc">
			<idno type="grant-number">R01 GM105705</idno>
		</org>
	</listOrg>
	
	<div type="references">
...

As shown above, the acknowledgement and funding sections are enriched with inline markup corresponding to the identified entities.
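As a hedged illustration of how the @ref/xml:id link can be consumed downstream (not part of the PR itself; the file name is hypothetical), the funders in the TEI header can be joined with their funding descriptions in <back> using only the Python standard library:

import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

root = ET.parse("article.grobid.tei.xml").getroot()  # hypothetical output file

# Index the <org type="funding"> descriptions in <back> by their xml:id.
fundings = {}
for org in root.iter(TEI + "org"):
    if org.get("type") == "funding" and org.get(XML_ID):
        fundings[org.get(XML_ID)] = [idno.text for idno in org.findall(TEI + "idno")]

# Resolve each header <funder> to the funding it references via @ref.
for funder in root.iter(TEI + "funder"):
    name = funder.findtext(TEI + "orgName")
    ref = (funder.get("ref") or "").lstrip("#")
    print(name, "->", fundings.get(ref, []))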

In addition, the identified funders can be consolidated through a look-up, currently limited to the CrossRef Funder Registry, using the CrossRef REST API. When a funder name is matched with sufficient confidence to a registered CrossRef funder, we add the DOI of the funder as well as additional normalized names, acronym and country.
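For illustration only, the underlying registry can be queried directly; a minimal sketch against the public CrossRef REST API (GROBID performs its own matching internally, this just shows the kind of look-up involved):

import requests

response = requests.get(
    "https://api.crossref.org/funders",
    params={"query": "National Institutes of Health", "rows": 3},
    timeout=30,
)

# Each item carries the registry identifier (the funder DOI suffix),
# the normalized name and alternative names/acronyms.
for item in response.json()["message"]["items"]:
    print(item["id"], item["name"], item.get("alt-names", []))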

Consolidation will be improved via biblio-glutton in a later phase.

The PR includes a complete revision of the segmentation training data for the acknowledgement and funding sections, as well as a set of around 1,500 manually annotated funding and acknowledgement sections.

Standalone model accuracy (strict field matching): the winner is SciBERT+CRF, as is usual for NER in scientific text.

CRF funding-acknowledgement model

===== Field-level results =====

label                accuracy     precision    recall       f1           support

<affiliation>        98.36        61.84        50           55.29        94     
<funderName>         93.6         69.1         68.96        69.03        480    
<grantName>          98.64        42.22        33.93        37.62        56     
<grantNumber>        98.43        90.7         88.95        89.82        362    
<institution>        95.93        36           36.73        36.36        147    
<person>             98.34        93.69        93.84        93.77        617    
<programName>        98.81        35.14        29.55        32.1         44     
<projectName>        99.09        46.67        17.07        25           41     

all (micro avg.)     97.65        77.3         74.52        75.88        1841   
all (macro avg.)     97.65        59.42        52.38        54.87        1841   

===== Instance-level results =====

Total expected instances:   316
Correct instances:          130
Instance-level recall:      41.14

BidLSTM_CRF_FEATURES

  f1 (micro): 75.89
                  precision    recall  f1-score   support

   <affiliation>     0.7000    0.8750    0.7778        24
    <funderName>     0.7165    0.7333    0.7248       255
     <grantName>     0.3636    0.3077    0.3333        26
   <grantNumber>     0.8171    0.8938    0.8537       160
   <institution>     0.4955    0.5340    0.5140       103
        <person>     0.9416    0.9699    0.9556       266
   <programName>     0.2800    0.3043    0.2917        23
   <projectName>     0.3750    0.4412    0.4054        34

all (micro avg.)     0.7399    0.7789    0.7589       891

BidLSTM_CRF_FEATURES + ELMo

Average over 10 folds
                  precision    recall  f1-score   support

   <affiliation>     0.7120    0.8833    0.7878        24
    <funderName>     0.6911    0.8000    0.7411       255
     <grantName>     0.4220    0.4423    0.4309        26
   <grantNumber>     0.8044    0.8781    0.8396       160
   <institution>     0.5717    0.5515    0.5596       103
        <person>     0.9511    0.9643    0.9576       266
   <programName>     0.2970    0.2913    0.2924        23
   <projectName>     0.4887    0.4912    0.4894        34

all (micro avg.)     0.7483    0.8012    0.7739 

BERT (allenai/scibert_scivocab_cased)

                  precision    recall  f1-score   support

   <affiliation>     0.7368    0.8000    0.7671        35
    <funderName>     0.6900    0.7670    0.7264       206
     <grantName>     0.3143    0.4074    0.3548        27
   <grantNumber>     0.9185    0.9394    0.9288       132
   <institution>     0.4167    0.4348    0.4255        69
        <person>     0.9386    0.9701    0.9541       268
   <programName>     0.1500    0.3000    0.2000        10
   <projectName>     0.0952    0.2857    0.1429         7

all (micro avg.)     0.7449    0.8170    0.7793       754

BERT (allenai/scibert_scivocab_cased) + CRF

                  precision    recall  f1-score   support

   <affiliation>     0.7436    0.8286    0.7838        35
    <funderName>     0.6725    0.7476    0.7080       206
     <grantName>     0.3000    0.3333    0.3158        27
   <grantNumber>     0.8929    0.9470    0.9191       132
   <institution>     0.4557    0.5217    0.4865        69
        <person>     0.9628    0.9664    0.9646       268
   <programName>     0.1875    0.3000    0.2308        10
   <projectName>     0.1579    0.4286    0.2308         7

all (micro avg.)     0.7527    0.8196    0.7848       754

BERT (bert-base-cased) + CRF

                  precision    recall  f1-score   support

   <affiliation>     0.7179    0.8000    0.7568        35
    <funderName>     0.6754    0.7476    0.7097       206
     <grantName>     0.2632    0.3704    0.3077        27
   <grantNumber>     0.8936    0.9545    0.9231       132
   <institution>     0.4430    0.5072    0.4730        69
        <person>     0.9526    0.9739    0.9631       268
   <programName>     0.1304    0.3000    0.1818        10
   <projectName>     0.1818    0.5714    0.2759         7

all (micro avg.)     0.7358    0.8236    0.7772       754

@kermitt2 self-assigned this Aug 25, 2023
@coveralls commented Aug 25, 2023

Coverage Status: 40.119% (+0.1%) from 39.98% when pulling c384ff1 on funders-and-funding into 707030a on master.

@kermitt2 merged commit 25caaaf into master on Aug 28, 2023 (3 checks passed)
@lfoppiano added this to the 0.8.0 milestone on Jun 9, 2024