Funding, acknowledgement statements are not split into sentences #1090
Good question. Maybe we could ask Tim about sentence-level granularity in general in any section. It does come at some cost. Maybe we should support 2 modes. I don't know what the significance of splitting into sentences is in the system. I know it doesn't play a role in deliverables to customers (unless it influences the rules - e.g. number of sentences with a specific value). It may be primarily for debugging purposes. |
I also found that the acknowledgement is not split into sentences. I'm assuming it can be the same case. |
Digging deeper, I notice that the funding statement is correctly split into sentences; however, the sentences are lost when it is passed through the acknowledgment/funding parser:

fundingStmt = getSectionAsTEI("funding",
        "\t\t\t",
        doc,
        SegmentationLabels.FUNDING,
        teiFormatter,
        resCitations,
        config);

if (fundingStmt.length() > 0) {
    MutablePair<Element, MutableTriple<List<Funding>,List<Person>,List<Affiliation>>> localResult =
        parsers.getFundingAcknowledgementParser().processingXmlFragment(fundingStmt.toString(), config);
    if (localResult != null && localResult.getLeft() != null) {
        String local_tei = localResult.getLeft().toXML();
        local_tei = local_tei.replace(" xmlns=\"http://www.tei-c.org/ns/1.0\"", "");
        annexStatements.add(local_tei);
    } else {
        annexStatements.add(fundingStmt.toString());
    } |
Hello, indeed, everywhere the funding-acknowledgement parser is applied, the sentence segmentation is ignored. The reason is that it would require taking into account the (numerous) annotations produced by this model when re-segmenting into sentences, which is not supported by the current sentence segmentation (it only supports reference marker annotations). As the current sentence segmentation is already quite complex, I thought about another approach: a more generic sentence segmentation, which I developed to work on the final TEI XML directly and which I think supports any existing and future inline markup - this is available here: One idea would be to move to this simple, generic sentence segmentation instead of extending and complexifying the existing one. (As visible on Pub2TEI, the other advantage of the generic approach working on TEI XML directly is that it can be applied to any TEI XML from Pub2TEI or from LaTeXML, making sentence segmentation consistent across all these sources, even if they introduce unexpected/new inline markup in the future.) |
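The generic approach described above could be sketched roughly as follows: protect each inline element behind an opaque placeholder, run sentence splitting on the remaining text, then restore the markup inside each sentence. This is only an illustrative sketch with a naive punctuation-based splitter; all names are hypothetical and this is not the actual implementation:

```java
import java.util.*;
import java.util.regex.*;

// Hypothetical sketch of markup-agnostic sentence segmentation on TEI XML:
// inline elements are replaced by placeholders so the splitter can never
// break inside them, then the markup is restored per sentence.
public class GenericTeiSegmenter {
    static String segment(String paragraph) {
        // Protect inline elements (e.g. <rs>...</rs>, <ref>...</ref>).
        List<String> protectedSpans = new ArrayList<>();
        Matcher m = Pattern.compile("<(\\w+)[^>]*>.*?</\\1>").matcher(paragraph);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            protectedSpans.add(m.group());
            m.appendReplacement(sb, "\u241F" + (protectedSpans.size() - 1) + "\u241F");
        }
        m.appendTail(sb);
        // Naive sentence split; a real segmenter would be smarter.
        String[] sentences = sb.toString().split("(?<=[.!?])\\s+");
        StringBuilder out = new StringBuilder();
        for (String s : sentences) {
            // Restore the protected inline markup inside this sentence.
            for (int i = 0; i < protectedSpans.size(); i++)
                s = s.replace("\u241F" + i + "\u241F", protectedSpans.get(i));
            out.append("<s>").append(s).append("</s>");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String p = "We thank <rs type=\"person\">Dr. Korth</rs> warmly. He helped a lot.";
        System.out.println(segment(p));
    }
}
```

Note that the period in "Dr. Korth" cannot trigger a split here, because the whole `<rs>` element is hidden behind a placeholder while splitting.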
Understood. It became more clear once I saw the
Sure, at the moment the current segmentation was just extended to avoid URLs being split between sentences (#1097), because once the offset positions are collected it is just a matter of extending the list of forbidden positions.
|
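The forbidden-position idea above could be sketched as follows: collect the character extents of annotations such as URLs, then discard any candidate sentence-break offset that falls inside one of them. The names are illustrative, not the actual GROBID API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: filter candidate sentence-break offsets against
// "forbidden" spans (e.g. URL extents) so no boundary lands inside one.
public class ForbiddenBreaks {

    static List<Integer> filterBreaks(List<Integer> candidateBreaks, List<int[]> forbidden) {
        List<Integer> kept = new ArrayList<>();
        for (int offset : candidateBreaks) {
            boolean inside = false;
            for (int[] f : forbidden) {
                // f is a forbidden span as [start, end) character offsets
                if (offset > f[0] && offset < f[1]) { inside = true; break; }
            }
            if (!inside) kept.add(offset);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Candidate breaks at 10, 25, 40; a URL annotation covers [20, 30).
        List<Integer> breaks = Arrays.asList(10, 25, 40);
        List<int[]> forbidden = new ArrayList<>();
        forbidden.add(new int[]{20, 30});
        System.out.println(filterBreaks(breaks, forbidden)); // [10, 40]
    }
}
```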
I think that with this approach (segmenting after the "final" markup is built) we won't be able to generate coordinates for each sentence, because we have lost the layout token information after the transformation to XML. One solution that comes to mind would be to work on the layout tokens before the TEI transformation: collect all the items in a list and apply them in order, given that they are not overlapping, the same way I did here:
This would require removing any TEI dependency from the funding/acknowledgment parser and dealing with the transformation to TEI outside the parser, instead of processing the Element/Node XML. @kermitt2 please let me know if you have any comments. |
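The "collect and apply in order" idea above could be sketched like this: given plain text and a list of non-overlapping annotations sorted by start offset, walk the text once and wrap each annotated span in `<rs>`. This is an illustrative sketch only, not GROBID's actual code:

```java
import java.util.*;

// Hypothetical sketch: apply non-overlapping, offset-sorted annotations
// to plain text, emitting TEI-like inline markup in a single pass.
public class ApplyAnnotations {

    static class Annotation {
        final int start, end; final String type;
        Annotation(int start, int end, String type) { this.start = start; this.end = end; this.type = type; }
    }

    static String applyInOrder(String text, List<Annotation> annotations) {
        // Assumes annotations are sorted by start offset and do not overlap.
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (Annotation a : annotations) {
            out.append(text, cursor, a.start);
            out.append("<rs type=\"").append(a.type).append("\">")
               .append(text, a.start, a.end).append("</rs>");
            cursor = a.end;
        }
        out.append(text, cursor, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "We thank Carsten Korth and Nick Brandon.";
        List<Annotation> anns = Arrays.asList(
            new Annotation(9, 22, "person"),
            new Annotation(27, 39, "person"));
        System.out.println(applyInOrder(text, anns));
    }
}
```

The non-overlap assumption is what makes a single left-to-right pass sufficient; with overlapping spans, some splitting or nesting policy would be needed.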
After a few days trying different solutions, I implemented it by modifying the
This approach also preserves the sentence coordinates and the reference markers that were lost as well. |
I've started testing and noticed that in rare (although possible) cases, the sentence segmentation, which is performed before the funding-acknowledgment model, results in sentence boundaries that fall inside funding-acknowledgment annotations. E.g., here is the original version, without the sentence segmentation:

<div type="acknowledgement">
<div>
<head coords="31,72.00,491.09,114.40,12.58">Acknowledgments:</head>
<p coords="31,191.82,493.44,347.12,9.57;31,72.00,522.72,81.26,9.57">We thank
<rs type="person">Drs. Carsten Korth</rs> and
<rs type="person">Nick Brandon</rs> for generously providing anti- DISC1 antibodies.
</p>
</div>
</div>

Here the first sentence boundary falls inside the annotation "Drs. Carsten Korth":

<div type="acknowledgement">
<div>
<head>Acknowledgments:</head>
<p>
<s>We thank Drs.</s>
<s>Carsten Korth and
<rs type="person">Nick Brandon</rs> for generously providing anti- DISC1 antibodies.
</s>
</p>
</div>
</div>

I've then worked out a solution that allows merging and updating sentences in this situation, including their coordinates. Here is the result:

<div type="acknowledgement">
<div>
<head coords="31,72.00,491.09,114.40,12.58">Acknowledgments:</head>
<p>
<s coords="31,191.82,493.44,63.87,9.57;31,258.46,493.44,280.48,9.57;31,72.00,522.72,81.26,9.57">We thank
<rs type="person">Drs.Carsten Korth</rs> and
<rs type="person">Nick Brandon</rs> for generously providing anti- DISC1 antibodies.
</s>
</p>
</div>
</div> |
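The merge step described above could be sketched as follows: whenever an annotation span straddles the boundary between two consecutive sentences, extend the previous sentence to absorb the next one (a real implementation would also merge the coords attributes). This is an illustrative sketch, not the actual code:

```java
import java.util.*;

// Hypothetical sketch: merge consecutive sentences (modelled as [start, end)
// offsets) when an annotation span crosses their shared boundary.
public class MergeSentences {

    static List<int[]> merge(List<int[]> sentences, List<int[]> annotations) {
        List<int[]> result = new ArrayList<>();
        for (int[] s : sentences) {
            int[] current = new int[]{s[0], s[1]};
            if (!result.isEmpty()) {
                int[] prev = result.get(result.size() - 1);
                boolean straddles = false;
                for (int[] a : annotations) {
                    // annotation starts inside prev and ends inside current
                    if (a[0] < prev[1] && a[1] > current[0]) { straddles = true; break; }
                }
                if (straddles) {
                    prev[1] = current[1]; // extend the previous sentence
                    continue;
                }
            }
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        // "We thank Drs." = [0,13), "Carsten Korth ..." = [14,60);
        // the person annotation "Drs. Carsten Korth" covers [9,27).
        List<int[]> sentences = new ArrayList<>(Arrays.asList(new int[]{0, 13}, new int[]{14, 60}));
        List<int[]> annotations = Arrays.asList(new int[]{9, 27});
        List<int[]> merged = merge(sentences, annotations);
        System.out.println(merged.size()); // 1
        System.out.println(merged.get(0)[0] + ".." + merged.get(0)[1]); // 0..60
    }
}
```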
I've noticed that while the data availability statement is split into sentences, the funding statement is not. Is this by design, or should it be implemented?
Example:
energies-14-08509.pdf