processShort may yield loss of text in some cases #956
Comments
My quick solution could be to combine PARAGRAPH, FIGURE and TABLE as follows:

```
} else if (clusterLabel.equals(TaggingLabels.PARAGRAPH)
        || clusterLabel.equals(TaggingLabels.FIGURE)
        || clusterLabel.equals(TaggingLabels.TABLE)) {
    if (clusterLabel.equals(TaggingLabels.FIGURE)
            || clusterLabel.equals(TaggingLabels.TABLE)) {
        //figureBlock = true;
        if (curParagraph != null) {
            curParagraph.appendChild(new Text(" "));
            lastClusterLabel = cluster.getTaggingLabel();
            continue;
        }
    }
```
Hi Luca, thanks! Yes you're right, this is something that bothers me since the My initial problem is that we might have figures in abstract, so I kept figures/tables. I thought also about creating a specific distinct model for "short text" based on restricted labels from the fulltext training data (it's why there is a model Currently I plan to fix this the ongoing branch Your quick solution is good I think to avoid weird stuff in short term (although no figure then in abstract), but we would need to add the
Ah, sorry, it is not clear from the above comment, if there is no
Ah then it is doing nothing... the processing of figures/tables is done separately with
Before, if label was TABLE or FIGURE, it was aggregating to the
Sorry, I understand now! But if
Something not impacting the normal fulltext processing and decoding would be to post-process the shortText() result, to rewrite the
Ah, I assumed that the
I did not understand this part, you mean to rewrite the labels in
I was proposing to rewrite labels in I don't have a better idea :D
@kermitt2 now I understand 😄 Yes, I think it is possible to hack the result in this way.
It's hacking the hack in I would prefer to keep one hack in
Yeah, that's better with your solution 😅
Working on the funding, I found a small glitch while investigating loss of text.
In my case, the funding section was empty.
Example PDF from the PMC set: main.pdf
This is a minor problem, but it could impact several parts:
Basically, when we use `processShort()` followed by `TEIFormatter.processTEIDivSection()`:
grobid/grobid-core/src/main/java/org/grobid/core/engines/FullTextParser.java
Line 2476 in b1c0cfd
grobid/grobid-core/src/main/java/org/grobid/core/engines/FullTextParser.java
Line 2506 in b1c0cfd
The method called within `TEIFormatter.processTEIDivSection()`, in particular `toTEITextPiece()`, around line `TEIFormatter:1159`, expects plain text (not table, not figure); however, `processShort` could output unexpected labels.
In the example above the funding statement is tagged as `<table>` for the whole chunk, which results in it being lost in the final output.
Chunks tagged this way are usually assigned to the previous paragraph; otherwise they are lost.
grobid/grobid-core/src/main/java/org/grobid/core/document/TEIFormatter.java
Lines 1540 to 1548 in 0f6c1b4
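To make the failure mode concrete, here is a minimal sketch of the behavior described above. It is not GROBID's code: the `Cluster` class, the label strings and `toParagraphs` are hypothetical stand-ins, and the logic is simplified to "table/figure text joins the previous paragraph, or is dropped when there is none".

```java
import java.util.ArrayList;
import java.util.List;

final class ShortTextLossSketch {

    /** Hypothetical stand-in for a labeled cluster; not GROBID's own class. */
    static final class Cluster {
        final String label;
        final String text;

        Cluster(String label, String text) {
            this.label = label;
            this.text = text;
        }
    }

    /** Simplified formatter: only paragraph clusters can open a paragraph. */
    static List<String> toParagraphs(List<Cluster> clusters) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = null;
        for (Cluster c : clusters) {
            if ("<paragraph>".equals(c.label)) {
                if (current != null) {
                    paragraphs.add(current.toString());
                }
                current = new StringBuilder(c.text);
            } else if ("<table>".equals(c.label) || "<figure>".equals(c.label)) {
                if (current != null) {
                    // Attach the text to the paragraph that is already open.
                    current.append(' ').append(c.text);
                }
                // No open paragraph: the text has nowhere to go and is dropped.
            }
        }
        if (current != null) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // A funding statement tagged entirely as <table>, as in the example PDF.
        List<Cluster> funding =
                List.of(new Cluster("<table>", "This work was funded by grant X."));
        System.out.println(toParagraphs(funding)); // prints []
    }
}
```

When the same statement follows a `<paragraph>` cluster it survives, which is consistent with the loss only showing up when the whole chunk is mislabeled.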
I don't think it is a big issue; however, it could occur more often than we expect, and maybe in other parts (e.g. the abstract?).
After checking and editing this post several times, I'm not sure what the best solution is.
One thing is not clear: if `processShort` is supposed to process only text, we could consider restricting certain labels that might be wrongly assigned in `processShort()`, such as `TABLE` and `FIGURE`, or perhaps having a model for text only, which would use the same data as fulltext without TABLE/FIGURE.
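The label restriction could be sketched roughly as below, as a post-processing pass over the labeled tokens. Everything here is an assumption for illustration: the `{token, label}` row layout, the label strings, the `I-` begin prefix and the `restrictToText` helper are hypothetical and do not correspond to an existing GROBID method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class ShortLabelRestrictionSketch {

    // Labels assumed to be invalid for short text (illustrative strings).
    static final Set<String> NON_TEXT_LABELS = Set.of("<table>", "<figure>");

    /**
     * Each row is {token, label}. Returns a copy in which table/figure labels
     * are rewritten to <paragraph>, keeping an assumed "I-" begin prefix.
     */
    static List<String[]> restrictToText(List<String[]> labeled) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : labeled) {
            String label = row[1];
            boolean begin = label.startsWith("I-");
            String base = begin ? label.substring(2) : label;
            if (NON_TEXT_LABELS.contains(base)) {
                base = "<paragraph>";
            }
            out.add(new String[] {row[0], (begin ? "I-" : "") + base});
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> labeled = List.of(
                new String[] {"This", "I-<table>"},
                new String[] {"work", "<table>"});
        for (String[] row : restrictToText(labeled)) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```

This keeps the text at the cost of losing genuine figures or tables in short sections, which is the trade-off already raised for abstracts.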