Add funding statement in TEI output #959

Merged: 10 commits, Oct 19, 2022

Collaborator

@lfoppiano lfoppiano commented Oct 13, 2022

This PR adds the funding statement (already present in the training data and models) to the TEI output.

The funding output will be placed in a standardised position in the <back> of the TEI output, like the availability statement.

e.g. (with sentence segmentation)

<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Funding</head><p><s>This study was supported by the International Collaborative Research Grants Scheme with joint grants from the Wellcome Trust UK (GR071587MA) and the Australian NHMRC (268055).</s><s>The funding sources played no role in study design, data collection, analysis or interpretation, writing the report, or the decision to submit the paper for publication.</s></p></div>
</div>
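
For readers consuming this output, here is a minimal sketch of how the funding statement could be pulled out of a TEI document. It is not part of the PR: the sample document, its placeholder funding text, and the function name `funding_statements` are assumptions made for illustration, and only the `<back>` / `<div type="funding">` placement and the TEI namespace come from the example above.

```python
import xml.etree.ElementTree as ET
from typing import List

# TEI namespace, as used in the example output above.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, trimmed-down TEI document with placeholder funding text.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body><div><p>Main text.</p></div></body>
    <back>
      <div type="funding">
        <div><head>Funding</head><p><s>This work was funded by Example Trust (grant 12345).</s><s>The funders had no role in the study design.</s></p></div>
      </div>
    </back>
  </text>
</TEI>"""


def funding_statements(tei_xml: str) -> List[str]:
    """Return the text of each <p> found under <back>/<div type="funding">."""
    root = ET.fromstring(tei_xml)
    paragraphs = root.findall(
        ".//tei:back/tei:div[@type='funding']//tei:p", TEI_NS
    )
    # itertext() flattens optional <s> sentence elements; joining the pieces
    # with a space and re-splitting normalises whitespace between sentences.
    return [" ".join(" ".join(p.itertext()).split()) for p in paragraphs]


if __name__ == "__main__":
    for statement in funding_statements(SAMPLE_TEI):
        print(statement)
```

The same lookup works with or without sentence segmentation, since the `<s>` elements are flattened into plain text.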

This should fix #895 and #652.

As discussed in #957, this does not cover the case where only the header is processed.

@kermitt2
Owner

Thanks !
I think the change in TEIFormatter.java does nothing? It just moves code that does the same thing (it simply puts a default space at the inline position of a figure to avoid agglutination of text).

@kermitt2
Owner

Following the discussion in #956, I am updating the PR so that the post-processing is limited to processShort(), instead of changing the general fulltext model decoding.

@coveralls

coveralls commented Oct 15, 2022

Coverage increased (+0.02%) to 39.928% when pulling 6c8b888 on feature/funding-statement into 8e53a7d on master.

@kermitt2
Owner

The changes seem to work fine. On the example from #956 we obtain:

<div type="funding">
    <div xmlns="http://www.tei-c.org/ns/1.0">
        <p>This study was supported by the South Asian Clinical Toxicology Research Collaboration, which is funded by The Wellcome Trust/ National Health and Medical Research Council International Collaborative Research Grant GR071669MA. The funding bodies had no role in analyzing or interpreting the data or writing the article.</p>
    </div>
</div>

No runtime errors on the PMC_sample set and biorxiv_test_2000.

@kermitt2 kermitt2 marked this pull request as ready for review October 15, 2022 09:52
@lfoppiano
Collaborator Author

lfoppiano commented Oct 17, 2022

I found several cases of empty funding where the paper has the funding statement in the back of the document. In these cases the TEI serialisation calls getSectionAsTEI(), which does not use processShort().

E.g. this document: bj4360053.pdf

Unfortunately, the result is not helpful, and the hack in processShort() does not work here: the sequence starts with a different label, so fixing only the first label does not repair the rest of the sequence.

FUNDING	funding	F	FU	FUN	FUND	G	NG	ING	DING	BLOCKSTART	LINESTART	ALIGNEDLEFT	NEWFONT	HIGHERFONT	1	0	ALLCAP	NODIGIT	0	NOPUNCT	0	10	0	NUMBER	0	0	I-<section>
This	this	T	Th	Thi	This	s	is	his	This	BLOCKSTART	LINESTART	ALIGNEDLEFT	NEWFONT	LOWERFONT	0	0	INITCAP	NODIGIT	0	NOPUNCT	0	10	0	NUMBER	0	0	I-<table>
work	work	w	wo	wor	work	k	rk	ork	work	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	0	10	0	NUMBER	0	0	<table>
was	was	w	wa	was	was	s	as	was	was	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	0	10	0	NUMBER	0	0	<table>
supported	supported	s	su	sup	supp	d	ed	ted	rted	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	0	10	0	NUMBER	0	0	<table>
[...]

I'm not sure how I could fix this properly

@kermitt2
Owner

We could wait for future versions, which will remove the tables and figures from the fulltext model, or apply the same fix as in processShort() to every table and figure label (not just the ones starting the sequence). What do you think?
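
As an illustration only, the second option could look roughly like the sketch below, applied to the labeled sequence quoted earlier in the thread. This is not GROBID's code (GROBID is written in Java): the function name `relabel_as_paragraph`, the `<paragraph>` fallback label, and the (token, label) pair representation are all assumptions made for the example.

```python
from typing import List, Tuple

# Assumption: labels that should not occur inside a short text such as a
# funding statement, and the label they are rewritten to.
UNWANTED_LABELS = {"<table>", "<figure>"}
FALLBACK_LABEL = "<paragraph>"


def relabel_as_paragraph(labeled: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Rewrite every table/figure label, wherever it occurs in the sequence.

    The "I-" prefix marking the start of a labeled segment is preserved.
    """
    fixed = []
    for token, label in labeled:
        begins_segment = label.startswith("I-")
        base = label[2:] if begins_segment else label
        if base in UNWANTED_LABELS:
            label = ("I-" if begins_segment else "") + FALLBACK_LABEL
        fixed.append((token, label))
    return fixed


if __name__ == "__main__":
    # Token/label pairs taken from the first and last columns of the
    # feature lines quoted above.
    sequence = [
        ("FUNDING", "I-<section>"),
        ("This", "I-<table>"),
        ("work", "<table>"),
        ("was", "<table>"),
        ("supported", "<table>"),
    ]
    for token, label in relabel_as_paragraph(sequence):
        print(token, label)
```

Because every position is checked, the leading `I-<section>` label does not prevent the following `<table>` labels from being rewritten, which is the case the first-label-only fix misses.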

@lfoppiano
Collaborator Author

For now it is better to fix it, since some data are being lost as it stands. I can apply the same fix as in processShort() to all the figure and table labels.

@kermitt2
Owner

Actually, postProcessLabeledAbstract() already seems to do this more violent kind of fix.

@lfoppiano
Collaborator Author

lfoppiano commented Oct 18, 2022

Using the biorxiv corpus for end2end evaluation, we get this result:

label                accuracy     precision    recall       f1           support
funding_stmt         99.79        59.4         42.42        49.49        745
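
A quick sanity check on this row, assuming the three middle figures are precision, recall and F1 in that order (the usual layout of GROBID's evaluation output, which is not stated in the comment itself): the F1 value should be the harmonic mean of the other two. The helper name `f1_score` is made up for this sketch.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Values from the funding_stmt row above, in percent.
    print(round(f1_score(59.4, 42.42), 2))  # prints 49.49
```

The computed value matches the 49.49 reported in the row, which supports reading the columns in that order.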

@kermitt2
Owner

So the two problematic cases are working fine with the current version, yeah :D

I cleaned up the previous hack in processShort() and replaced it with postProcessFullTextLabeledText(), to be sure it is always applied after calling processShort(). It now applies to the funding statement in the header too; previously it was only applied to the funding statement at the end of a paper.

@kermitt2
Owner

Thanks !!

@kermitt2 kermitt2 merged commit dab259e into master Oct 19, 2022
@lfoppiano lfoppiano deleted the feature/funding-statement branch June 12, 2023 21:49