Add funding statement in TEI output #959
Thanks! |
Following the discussion in #956, I am updating the PR with post-processing limited to processShort(), instead of changing the general fulltext model decoding. |
The changes seem to work fine. On the example from #956 we obtain:
<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0">
<p>This study was supported by the South Asian Clinical Toxicology Research Collaboration, which is funded by The Wellcome Trust/ National Health and Medical Research Council International Collaborative Research Grant GR071669MA. The funding bodies had no role in analyzing or interpreting the data or writing the article.</p>
</div>
</div>
No runtime errors on the PMC_sample set and biorxiv_test_2000. |
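For anyone who wants to check the new output on their own documents, here is a minimal sketch of how the funding statement could be read back from the TEI. It is not part of this PR: the function names `local_name` and `funding_statements` are hypothetical, and it assumes the statement always sits in `<p>` elements under a `<div type="funding">`, as in the sample above. Elements are matched on local name because the outer `<div>` in the snippet carries no namespace while the inner one does.

```python
import xml.etree.ElementTree as ET

# Sample taken from the output shown above (hypothetical constant name).
SAMPLE = """<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0">
<p>This study was supported by the South Asian Clinical Toxicology Research Collaboration, which is funded by The Wellcome Trust/ National Health and Medical Research Council International Collaborative Research Grant GR071669MA. The funding bodies had no role in analyzing or interpreting the data or writing the article.</p>
</div>
</div>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def funding_statements(xml_text):
    """Return the text of every <p> found under a <div type="funding">.

    Assumption: funding text is serialised as paragraphs inside a div
    whose 'type' attribute is 'funding', with or without the TEI namespace.
    """
    root = ET.fromstring(xml_text)
    statements = []
    for elem in root.iter():
        if local_name(elem.tag) == "div" and elem.get("type") == "funding":
            for p in elem.iter():
                if local_name(p.tag) == "p":
                    # Join nested text (e.g. sentence elements) and
                    # normalise whitespace.
                    text = " ".join("".join(p.itertext()).split())
                    if text:
                        statements.append(text)
    return statements
```

Calling `funding_statements(SAMPLE)` returns a list with the single paragraph above. An empty list is a quick way to spot the "empty funding" cases on a larger corpus.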
I found several cases of empty funding where the paper has the funding statement in the back of the document, the TEI serialisation calls E.g. this document: bj4360053.pdf Unfortunately, the result does not help, and the hack in processShort() does not work here: because the label in first position is different, it won't fix the sequence.
I'm not sure how I could fix this properly |
We could wait for future versions which will remove the tables and figures from the fulltext model, or apply the same fix as in processShort() to every table and figure label (not just the ones starting the sequence). What do you think? |
For now it is better to fix it, since as it stands some data are lost. I can apply the same fix as in processShort() to all figure and table labels. |
Actually |
Using the biorxiv corpus for end2end evaluation, we get this result:
|
So the two problematic cases are working fine with the current version, yeah :D I cleaned the previous hack in |
Thanks !! |
This PR adds the funding statement (already in training data and models) to the TEI output.
The funding output will be placed in a standardised position in the <back> of the TEI output, like the availability statement, e.g. (with sentence segmentation)
This should fix #895 #652
As discussed in #957, this does not cover the case when only the header is processed.