Feature Request: general back section (section) #698

Open
de-code opened this issue Jan 22, 2021 · 8 comments

Comments

@de-code
Collaborator

de-code commented Jan 22, 2021

It only occurs to me now that there doesn't seem to be a generic back section (section).

The Annotation guidelines for the 'segmentation' model do not mention any back section. The existing training data wraps elements in a back section, probably to keep the general TEI structure.

It does support the following specific back section elements:

  • listBibl
  • annex
  • acknowledgment

Out of those, acknowledgment is probably a sort of general back section (section).
But there could be others, e.g. relating to:

  • funding
  • competing interests
  • author contributions

It would be good to be able to just extract general back sections (with title and paragraph(s)).
(Then acknowledgment could just be a special case of that)
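To make the request concrete, here is a minimal sketch of what "extract general back sections (with title and paragraphs)" could look like on the consumer side. It assumes a TEI document in which back sections are `div` elements with a `head` and one or more `p` children inside `back`. That layout, the sample document, and the function name `extract_back_sections` are illustrative assumptions, not GROBID's confirmed output or API.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# TEI namespace (standard for TEI P5 documents).
TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}

# Hypothetical sample: the div/head/p layout inside <back> is an assumption.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text>
<body><div><head>Introduction</head><p>Body text.</p></div></body>
<back><div type="annex">
  <div><head>Funding</head><p>Supported by a hypothetical grant.</p></div>
  <div><head>Competing interests</head><p>None declared.</p></div>
</div></back>
</text></TEI>"""


def extract_back_sections(tei_xml: str) -> List[Dict]:
    """Return every titled or non-empty div found under <back>."""
    root = ET.fromstring(tei_xml)
    sections = []
    for back in root.iter("{%s}back" % TEI):
        for div in back.iter("{%s}div" % TEI):
            # Only look at direct children, so wrapper divs are skipped.
            head = div.find("tei:head", NS)
            paragraphs = [
                " ".join("".join(p.itertext()).split())
                for p in div.findall("tei:p", NS)
            ]
            if head is None and not paragraphs:
                continue
            title = "" if head is None else "".join(head.itertext()).strip()
            sections.append({"title": title, "paragraphs": paragraphs})
    return sections


if __name__ == "__main__":
    for section in extract_back_sections(SAMPLE_TEI):
        print(section["title"], "->", section["paragraphs"])
```

With this shape, acknowledgment, funding, competing interests and author contributions would all come back the same way, distinguished only by their title.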

@de-code
Collaborator Author

de-code commented Jan 22, 2021

Hi @kermitt2, in #652 (comment) you suggested using annex for the funding section. Would you suggest using that for all back sections?

@kermitt2
Owner

Hi @de-code! Yes, the approach is to use annex for all these "back matter" sections and, once enough training data is available, to add a distinct label that explicitly recognizes the type of section (acknowledgment was also annex at some point, if I remember correctly).
I was not a fan of the "front" and "back" terminology found in many XML article encodings (TEI included), because it conveys the idea of layout position/presentation criteria. So I used header and annex instead.

@de-code
Collaborator Author

de-code commented Jan 22, 2021

Okay, thank you for that.

In that case, how do you differentiate between those "back matter" sections and the appendix? (I somehow thought annex was for the appendix, i.e. app-group / app in JATS.)
Or do you just see the appendix as one of those "back matter" sections?

@kermitt2
Owner

Well, this is for the training data; when sections are recognized explicitly, they fall in the right place in the TEI result.
When we started creating training data, the goal was to have something simplified to make manual annotation easier, and then to refine the annotation over time as we get more data and are able to consider more ML labels. This is why back matter and annexes (in the sense of appendix/supplement) were in the same pot at the beginning.

@de-code
Collaborator Author

de-code commented Jan 29, 2021

I have a similar question relating to figures and tables. The annotation guideline specifies that they should be part of the body. But there are figures or tables that belong to the back section / appendix (e.g. in DOI: 10.1101/188706). Can they be annotated as annex in that case?

@de-code
Collaborator Author

de-code commented Feb 12, 2021

Relating to my last question, there seems to be a problem (or I may be misunderstanding the guideline).
For example, we may have a Figure legends or Supplemental data section title that is followed by figure information.
I am now annotating the section title as annex (as it's part of the back section), but the figure as body.
In my case the model then learns that, but GROBID probably thinks that the section is empty and doesn't include it in the response.

@kermitt2
Owner

For the segmentation model, figures and tables normally go in the "zone" where they belong (where they are primarily referenced), as mentioned here -> https://grobid.readthedocs.io/en/latest/training/segmentation/#tables-and-figures. So, for instance, they go in the header if a figure is part of the abstract, or in an annex if they are part of it.

Maybe the guidelines are not drafted clearly enough, because the general rule (figures/tables go in the body) is emphasized too much? In preprint/submission formats, all the figures frequently appear at the very end of the article (sometimes separated from their captions). In this case they should be labelled as "body", as they are usually figures/tables for the body part, even though they come after the bibliographical section and annex for formatting reasons.

@de-code
Collaborator Author

de-code commented Feb 15, 2021

Okay, maybe I have misinterpreted the general rule as the overriding rule.
It is also a good point that preprint / submissions may locate figure descriptions at the end while the figures would otherwise belong to the body.

Perhaps we could say that figures and tables belong where they are referenced first? I.e. if a figure is referenced from a body section, then it belongs to the body. But if it is only referenced from a back section, then it belongs there?
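As a rough illustration of that "first reference wins" idea, here is a small sketch. The zone names follow the labels used in this thread, but the function `assign_zone`, the reference list format, and the fallback to "body" for unreferenced figures are assumptions made for the example. None of this is existing GROBID behaviour.

```python
from typing import List, Tuple


def assign_zone(
    figure_id: str,
    references: List[Tuple[str, str]],
    default: str = "body",
) -> str:
    """Return the zone of the first reference to figure_id.

    references holds (zone, referenced_id) pairs in reading order.
    The default for never-referenced figures is an assumption that
    mirrors the guideline's general "figures go in the body" rule.
    """
    for zone, referenced_id in references:
        if referenced_id == figure_id:
            return zone
    return default


if __name__ == "__main__":
    # Hypothetical reading-order references, not taken from a real paper.
    refs = [
        ("body", "fig1"),
        ("body", "figS1"),
        ("annex", "figS1"),
        ("annex", "figS2"),
    ]
    for fig in ("fig1", "figS1", "figS2", "fig9"):
        print(fig, "->", assign_zone(fig, refs))
```

Under this rule a supplementary figure that is cited from the body before it is cited from a back section ends up in the body, which may or may not be what we want.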

172379v1 (DOI: 10.1101/172379) is an example (from the bioRxiv 10k validation set) that I am actually not so sure about.
It has a Figure legends section with Figure 1 etc.
But it also has a Supplemental data section with Figure S1 etc.
From that, it would appear that Figure S1 should belong to the back section, although it is referenced by a body section.

(There may also be the question of whether it makes sense to extract section titles like Figure legends.)


2 participants