Data and code availability statement zone #951

Merged
merged 84 commits into from
Oct 7, 2022

Conversation

kermitt2
Owner

This PR adds recognition of data and code availability statements, located either in the header or at the end of an article, and marks the section in a normalized place in the TEI result (similarly to the acknowledgement section).

This involves the segmentation model (to recognize the zone as an additional section after the main article body) and the header model (when the zone is located within the header).

Thanks @lfoppiano :D

lfoppiano and others added 30 commits August 1, 2022 14:18
…itt2/grobid into feature/data-availability-statement
@coveralls

Coverage Status

Coverage decreased (-0.005%) to 39.879% when pulling f427ad7 on feature/data-availability-statement into 54d1c29 on master.

@coveralls

coveralls commented Sep 25, 2022

Coverage Status

Coverage decreased (-0.03%) to 39.928% when pulling 1925296 on feature/data-availability-statement into f9dc68f on master.

@coveralls

Coverage Status

Coverage decreased (-0.006%) to 39.878% when pulling 40331eb on feature/data-availability-statement into 54d1c29 on master.

@kermitt2
Owner Author

kermitt2 commented Sep 25, 2022

Hi @lfoppiano !

I reviewed everything and made a few changes. In particular, I put the header labels back into TaggingLabels.java to avoid a breaking change in all the current grobid modules, which use labels from the class TaggingLabels. (A few years ago I actually wanted to put the segmentation labels back into TaggingLabels too, to simplify things, but I didn't, to avoid a breaking change.)

Currently, when the availability statement is in the header, it is correctly labeled and stored in BiblioItem.java, but it is not serialized in the final TEI because the method used only works with segmentation labels:

https://github.com/kermitt2/grobid/blob/feature/data-availability-statement/grobid-core/src/main/java/org/grobid/core/engines/FullTextParser.java#L2473

            // data availability statements
            StringBuilder dataAvailability = new StringBuilder();
            if (StringUtils.isNotBlank(resHeader.getDataAvailability())) {
                dataAvailability = getSectionAsTEI("availability", "\t\t\t", doc, TaggingLabels.HEADER_AVAILABILITY,
                    teiFormatter, resCitations, config);
            } else {
                dataAvailability = getSectionAsTEI("availability", "\t\t\t", doc, SegmentationLabels.AVAILABILITY,
                    teiFormatter, resCitations, config);
            }

getSectionAsTEI() then uses:

https://github.com/kermitt2/grobid/blob/feature/data-availability-statement/grobid-core/src/main/java/org/grobid/core/engines/FullTextParser.java#L2517

SortedSet<DocumentPiece> sectionPart = doc.getDocumentPart(taggingLabel);

Only the SegmentationLabels values work with doc.getDocumentPart(); using TaggingLabels.HEADER_AVAILABILITY here will return nothing.
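To illustrate why the lookup comes back empty, here is a minimal, self-contained sketch. It assumes (this is not confirmed against the GROBID internals) that document parts are indexed only by the labels produced by the segmentation model. SketchDocument, the two label strings and sampleDocument() are hypothetical stand-ins, not the real Document, TaggingLabels or SegmentationLabels classes, and the real method may signal a miss differently (for example with null).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DocumentPartLookupSketch {
    // Hypothetical label names, for illustration only (not the real GROBID constants).
    public static final String SEGMENTATION_AVAILABILITY = "segmentation:availability";
    public static final String HEADER_AVAILABILITY = "header:availability";

    // Hypothetical simplified document: zones are indexed by segmentation label only.
    public static final class SketchDocument {
        private final Map<String, List<String>> partsBySegmentationLabel = new HashMap<>();

        public void addPart(String segmentationLabel, List<String> pieces) {
            partsBySegmentationLabel.put(segmentationLabel, pieces);
        }

        // Returns an empty list when the label was never assigned by the segmentation model.
        public List<String> getDocumentPart(String label) {
            return partsBySegmentationLabel.getOrDefault(label, Collections.<String>emptyList());
        }
    }

    // Builds a document whose availability zone was found by the segmentation model.
    public static SketchDocument sampleDocument() {
        SketchDocument doc = new SketchDocument();
        doc.addPart(SEGMENTATION_AVAILABILITY, Arrays.asList("Data are available on request."));
        return doc;
    }

    public static void main(String[] args) {
        SketchDocument doc = sampleDocument();
        // Lookup with the segmentation label finds the zone.
        System.out.println(doc.getDocumentPart(SEGMENTATION_AVAILABILITY));
        // Lookup with the header label finds nothing, since header labels are not keys here.
        System.out.println(doc.getDocumentPart(HEADER_AVAILABILITY));
    }
}
```

Under this assumption, passing a header-model label to a segmentation-level lookup is not an error, it simply matches no zone, which would explain a silently missing TEI section.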

To retrieve the layout tokens of the availability statement found in the header, which are stored in BiblioItem.java, we normally use:

https://github.com/kermitt2/grobid/blob/feature/data-availability-statement/grobid-core/src/main/java/org/grobid/core/data/BiblioItem.java#L4238

List<LayoutToken> headerAvailabilityStatementTokens = biblio.getLayoutTokens(TaggingLabels.HEADER_AVAILABILITY);
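One possible way to combine the two sources is sketched below: prefer the tokens recorded by the header model, and fall back to the zone found by the segmentation model. This is only an illustration of the idea discussed here, not the fix that was actually committed. SketchBiblio and selectAvailabilityTokens() are hypothetical names, and plain strings stand in for LayoutToken objects.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class AvailabilitySourceSketch {
    // Hypothetical stand-in for BiblioItem: it only keeps the header availability tokens.
    public static final class SketchBiblio {
        private final List<String> headerAvailabilityTokens;

        public SketchBiblio(List<String> headerAvailabilityTokens) {
            this.headerAvailabilityTokens = headerAvailabilityTokens;
        }

        public List<String> getHeaderAvailabilityTokens() {
            return headerAvailabilityTokens;
        }
    }

    // Prefer the statement labeled in the header; otherwise use the segmentation zone.
    public static List<String> selectAvailabilityTokens(SketchBiblio biblio,
                                                        List<String> segmentationZoneTokens) {
        List<String> headerTokens = biblio.getHeaderAvailabilityTokens();
        if (headerTokens != null && !headerTokens.isEmpty()) {
            return headerTokens;
        }
        return segmentationZoneTokens;
    }

    public static void main(String[] args) {
        List<String> zone = Arrays.asList("Code", "is", "on", "GitHub");

        SketchBiblio withHeader = new SketchBiblio(Arrays.asList("Data", "are", "available"));
        System.out.println(selectAvailabilityTokens(withHeader, zone));

        SketchBiblio withoutHeader = new SketchBiblio(Collections.<String>emptyList());
        System.out.println(selectAvailabilityTokens(withoutHeader, zone));
    }
}
```

Keeping the choice of source separate from the TEI formatting means the serialization step does not need to know which model found the statement.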

@kermitt2
Owner Author

The latest commits fix the TEI serialization of the header availability statement.
createTraining for the header and segmentation models now also pre-labels availability statement and funding sections.

@kermitt2 kermitt2 marked this pull request as ready for review September 26, 2022 05:20
@kermitt2 kermitt2 merged commit 2a16015 into master Oct 7, 2022
@lfoppiano lfoppiano deleted the feature/data-availability-statement branch October 7, 2022 06:47