processShort may yeld loss of text in some cases #956

lfoppiano · 2022-10-13T03:33:26Z

Working on the funding, I found a small glitch while investigating loss of text.
In my case, the funding section was empty.

Example PDF from the PMC set: main.pdf

This is a minor problem, but it could impact several parts:

Basically, when we use processShort() followed by TEIFormatter.processTEIDivSection():

example on the availability statement

grobid/grobid-core/src/main/java/org/grobid/core/engines/FullTextParser.java

Line 2476 in b1c0cfd

Pair<String, List<LayoutToken>> headerAvailabilityProcessed = processShort(headerAvailabilityStatementTokens, doc);
example on the funding:

grobid/grobid-core/src/main/java/org/grobid/core/engines/FullTextParser.java

Line 2506 in b1c0cfd

Pair<String, List<LayoutToken>> headerFundingProcessed = processShort(headerFundingTokens, doc);

The method called within TEIFormatter.processTEIDivSection(), in particular toTEITextPiece(), around line TEIFormatter:1159 expects basically text (not table, not figure) however the processShort could output crazy stuff.

In the example above the funding statement is tagged as <table> for the whole chunk which results in being lost in the final output.

This	this	T	Th	Thi	This	s	is	his	This	BLOCKIN	LINEIN	ALIGNEDLEFT	NEWFONT	HIGHERFONT	0	0	INITCAP	NODIGIT	0	NOPUNCT	0	10	0	NUMBER	0	0	I-<table>
study	study	s	st	stu	stud	y	dy	udy	tudy	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	0	10	0	NUMBER	0	0	<table>
was	was	w	wa	was	was	s	as	was	was	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	0	10	0	NUMBER	0	0	<table>
supported	supported	s	su	sup	supp	d	ed	ted	rted	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	0	10	0	NUMBER	0	0	<table>
by	by	b	by	by	by	y	by	by	by	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	0	10	0	NUMBER	0	0	<table>
the	the	t	th	the	the	e	he	the	the	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	1	10	0	NUMBER	0	0	<table>
South	south	S	So	Sou	Sout	h	th	uth	outh	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	1	10	0	NUMBER	0	0	<table>
Asian	asian	A	As	Asi	Asia	n	an	ian	sian	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	1	10	0	NUMBER	0	0	<table>
Clinical	clinical	C	Cl	Cli	Clin	l	al	cal	ical	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	1	10	0	NUMBER	0	0	<table>
Toxicology	toxicology	T	To	Tox	Toxi	y	gy	ogy	logy	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	1	10	0	NUMBER	0	0	<table>
Research	research	R	Re	Res	Rese	h	ch	rch	arch	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	2	10	0	NUMBER	0	0	<table>
Collaboration	collaboration	C	Co	Col	Coll	n	on	ion	tion	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	2	10	0	NUMBER	0	0	<table>
,	,	,	,	,	,	,	,	,	,	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	COMMA	3	10	0	NUMBER	0	0	<table>
which	which	w	wh	whi	whic	h	ch	ich	hich	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	3	10	0	NUMBER	0	0	<table>
is	is	i	is	is	is	s	is	is	is	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	3	10	0	NUMBER	0	0	<table>
funded	funded	f	fu	fun	fund	d	ed	ded	nded	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	3	10	0	NUMBER	0	0	<table>
by	by	b	by	by	by	y	by	by	by	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	3	10	0	NUMBER	0	0	<table>
The	the	T	Th	The	The	e	he	The	The	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	3	10	0	NUMBER	0	0	<table>
Wellcome	wellcome	W	We	Wel	Well	e	me	ome	come	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	4	10	0	NUMBER	0	0	<table>
Trust	trust	T	Tr	Tru	Trus	t	st	ust	rust	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	4	10	0	NUMBER	0	0	<table>
/	/	/	/	/	/	/	/	/	/	BLOCKIN	LINEEND	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	NOPUNCT	4	10	0	NUMBER	0	0	<table>
National	national	N	Na	Nat	Nati	l	al	nal	onal	BLOCKIN	LINESTART	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	4	10	0	NUMBER	0	0	<table>
Health	health	H	He	Hea	Heal	h	th	lth	alth	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	5	10	0	NUMBER	0	0	<table>
and	and	a	an	and	and	d	nd	and	and	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	5	10	0	NUMBER	0	0	<table>
Medical	medical	M	Me	Med	Medi	l	al	cal	ical	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	5	10	0	NUMBER	0	0	<table>
Research	research	R	Re	Res	Rese	h	ch	rch	arch	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	5	10	0	NUMBER	0	0	<table>
Council	council	C	Co	Cou	Coun	l	il	cil	ncil	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	6	10	0	NUMBER	0	0	<table>
International	international	I	In	Int	Inte	l	al	nal	onal	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	6	10	0	NUMBER	0	0	<table>
Collaborative	collaborative	C	Co	Col	Coll	e	ve	ive	tive	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	6	10	0	NUMBER	0	0	<table>
Research	research	R	Re	Res	Rese	h	ch	rch	arch	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	7	10	0	NUMBER	0	0	<table>
Grant	grant	G	Gr	Gra	Gran	t	nt	ant	rant	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	7	10	0	NUMBER	0	0	<table>
GR071669MA	gr071669ma	G	GR	GR0	GR07	A	MA	9MA	69MA	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	CONTAINSDIGITS	0	NOPUNCT	8	10	0	NUMBER	0	0	<table>
.	.	.	.	.	.	.	.	.	.	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	DOT	8	10	0	NUMBER	0	0	<table>
The	the	T	Th	The	The	e	he	The	The	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	8	10	0	NUMBER	0	0	<table>
funding	funding	f	fu	fun	fund	g	ng	ing	ding	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	8	10	0	NUMBER	0	0	<table>
bodies	bodies	b	bo	bod	bodi	s	es	ies	dies	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	8	10	0	NUMBER	0	0	<table>
had	had	h	ha	had	had	d	ad	had	had	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<table>
no	no	n	no	no	no	o	no	no	no	BLOCKIN	LINEEND	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<table>
role	role	r	ro	rol	role	e	le	ole	role	BLOCKIN	LINESTART	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<table>
in	in	i	in	in	in	n	in	in	in	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<table>
analyzing	analyzing	a	an	ana	anal	g	ng	ing	zing	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<table>
or	or	o	or	or	or	r	or	or	or	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	10	10	0	NUMBER	0	0	<table>
interpreting	interpreting	i	in	int	inte	g	ng	ing	ting	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	10	10	0	NUMBER	0	0	<table>
the	the	t	th	the	the	e	he	the	the	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	10	10	0	NUMBER	0	0	<table>
data	data	d	da	dat	data	a	ta	ata	data	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	10	10	0	NUMBER	0	0	<table>
or	or	o	or	or	or	r	or	or	or	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	11	10	0	NUMBER	0	0	<table>
writing	writing	w	wr	wri	writ	g	ng	ing	ting	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	11	10	0	NUMBER	0	0	<table>
the	the	t	th	the	the	e	he	the	the	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	11	10	0	NUMBER	0	0	<table>
article	article	a	ar	art	arti	e	le	cle	icle	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	11	10	0	NUMBER	0	0	<table>
.	.	.	.	.	.	.	.	.	.	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	DOT	11	10	0	NUMBER	0	0	<table>

These are usually assigned to the previous paragraph, else they are lost.

grobid/grobid-core/src/main/java/org/grobid/core/document/TEIFormatter.java

Lines 1540 to 1548 in 0f6c1b4

    
               } else if (clusterLabel.equals(TaggingLabels.FIGURE) || clusterLabel.equals(TaggingLabels.TABLE)) { 
        
                   //figureBlock = true; 
        
                   if (curParagraph != null) 
        
                       curParagraph.appendChild(new Text(" ")); 
        
               } 
        
               lastClusterLabel = cluster.getTaggingLabel(); 
        
           }

I don't think is a big issue, however it could occur more ofter we expect, and, maybe in other parts (E.g. abstract?).

After checking and editing this post several time I'm not sure what's the best solution.
One thing is not clear, if processShort is supposed to process only text, we can consider to restrict certain labels that might be wrongly assigned in processShort(), such as TABLE, FIGURE or perhaps having a model for modeling only text which would use the same data as fulltext without TABLE/FIGURE.

The text was updated successfully, but these errors were encountered:

lfoppiano · 2022-10-13T05:39:33Z

My quick solution could be to combine PARAGRAPH and FIGURE and TABLE as follows:

            } else if (clusterLabel.equals(TaggingLabels.PARAGRAPH)
                || clusterLabel.equals(TaggingLabels.FIGURE)
                || clusterLabel.equals(TaggingLabels.TABLE)) {

                if (clusterLabel.equals(TaggingLabels.FIGURE)
                    || clusterLabel.equals(TaggingLabels.TABLE)) {
                    //figureBlock = true;
                    if (curParagraph != null) {
                        curParagraph.appendChild(new Text(" "));
                        lastClusterLabel = cluster.getTaggingLabel();
                        continue;
                    }
                }
                ```

kermitt2 · 2022-10-13T06:59:26Z

Hi Luca, thanks !

Yes you're right, this is something that bothers me since the processShort() is used, but as you say it's not very visible in general.

My initial problem is that we might have figures in abstract, so I kept figures/tables. I thought also about creating a specific distinct model for "short text" based on restricted labels from the fulltext training data (it's why there is a model shorttext declared in the GrobidModels class).

Currently I plan to fix this the ongoing branch fix-vector-graphics which takes out figure and table out of the fulltext model (figure and table are then recognized prior to the segmentation model, so no problem with figure in abstract anymore). But it's not progressing rapidly :D

Your quick solution is good I think to avoid weird stuff in short term (although no figure then in abstract), but we would need to add the clusterContent to curParagraph to avoid loss of text ?

lfoppiano · 2022-10-13T07:04:49Z

Your quick solution is good I think to avoid weird stuff in short term (although no figure then in abstract), but we would need to add the clusterContent to curParagraph to avoid loss of text ?

Ah, sorry, is not clear from the above comment, if there is no curParagraph it proceeds with the same approach as for the label PARAGRAPH. See the commit: 14f5c5a

kermitt2 · 2022-10-13T07:07:46Z

Ah then it is doing nothing... the processing of figures/tables is done separately with processFigures()and processTables().

lfoppiano · 2022-10-13T07:14:01Z

Before, if label was TABLE or FIGURE, it was aggregating to the curParagraph only if curParagraph != null (the text was not added anywhere if curParagraph == null)
Now if TABLE or FIGURE and currentParagraph == null, it's just treating this text as it was annotated as PARAGRAPH from the fulltext model.

kermitt2 · 2022-10-13T07:42:17Z

Sorry I understand now !

But if curParagraph == null (a figure or table before text) this would inject the figure and table content 2 times then for the normal full text case.

kermitt2 · 2022-10-13T07:47:25Z

Something not impacting the normal fulltext processing and decoding would be to post-process the shortText() result, to rewrite the <table> and <figure> labels into <paragraph> if they start the labels in the labeled sequence. It would be also hacky, but the hack is then confined in shortText() scenario.

lfoppiano · 2022-10-14T00:17:59Z

But if curParagraph == null (a figure or table before text) this would inject the figure and table content 2 times then for the normal full text case.

Ah, I assumed that the processShort() imply having only text, so if the figure is before the text then the full-text model would detect that and separate them before the processShort() is called.

Something not impacting the normal fulltext processing and decoding would be to post-process the shortText() result, to rewrite the <table> and <figure> labels into <paragraph> if they start the labels in the labeled sequence. It would be also hacky, but the hack is then confined in shortText() scenario.

I did not understand this part, you mean to rewrite the labels in processShort() or use the shortText model and apply this hack?

kermitt2 · 2022-10-14T05:47:58Z

I did not understand this part, you mean to rewrite the labels in processShort() or use the shortText model and apply this hack?

I was proposing to rewrite labels in processShort(), because in the PR now, the change affects the normal full text pieces and introduces a possible bug when the piece starts with a figure with the normal full text/annex decoding.

I don't have a better idea :D

lfoppiano · 2022-10-14T07:21:52Z

@kermitt2 now I understand 😄 Yes I think is possible to hack the result in this way.
I've have another suggestion, not sure is less hacky... How about supply a parameter to toTEITextPiece() (e.g., exclude tableAndFigure=true/false) to enable/disable this fix?

kermitt2 · 2022-10-15T07:10:07Z

How about supply a parameter to toTEITextPiece() (e.g., exclude tableAndFigure=true/false) to enable/disable this fix?

It's hacking the hack in toTEITextPiece() :D

I would prefer to keep one hack in processShort() because this is specific to processShort, instead of changing the general case and adding one more unreadable parameter (readTableAndFigureAsParagraphWhenNoTextBefore) in toTEITextPiece (there are already far too many in toTEITextPiece, with 13 parameters !).

lfoppiano · 2022-10-17T03:00:29Z

Yeah, that's better with your solution 😅

kermitt2 mentioned this issue Oct 15, 2022

Add funding statement in TEI output #959

Merged

lfoppiano added the bug From Hemiptera and especially its suborder Heteroptera label Oct 18, 2022

lfoppiano linked a pull request Oct 18, 2022 that will close this issue

Add funding statement in TEI output #959

Merged

kermitt2 closed this as completed in #959 Oct 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

processShort may yeld loss of text in some cases #956

processShort may yeld loss of text in some cases #956

lfoppiano commented Oct 13, 2022 •

edited

Loading

lfoppiano commented Oct 13, 2022

kermitt2 commented Oct 13, 2022

lfoppiano commented Oct 13, 2022

kermitt2 commented Oct 13, 2022

lfoppiano commented Oct 13, 2022

kermitt2 commented Oct 13, 2022

kermitt2 commented Oct 13, 2022

lfoppiano commented Oct 14, 2022 •

edited

Loading

kermitt2 commented Oct 14, 2022

lfoppiano commented Oct 14, 2022 •

edited

Loading

kermitt2 commented Oct 15, 2022

lfoppiano commented Oct 17, 2022

processShort may yeld loss of text in some cases #956

processShort may yeld loss of text in some cases #956

Comments

lfoppiano commented Oct 13, 2022 • edited Loading

lfoppiano commented Oct 13, 2022

kermitt2 commented Oct 13, 2022

lfoppiano commented Oct 13, 2022

kermitt2 commented Oct 13, 2022

lfoppiano commented Oct 13, 2022

kermitt2 commented Oct 13, 2022

kermitt2 commented Oct 13, 2022

lfoppiano commented Oct 14, 2022 • edited Loading

kermitt2 commented Oct 14, 2022

lfoppiano commented Oct 14, 2022 • edited Loading

kermitt2 commented Oct 15, 2022

lfoppiano commented Oct 17, 2022

lfoppiano commented Oct 13, 2022 •

edited

Loading

lfoppiano commented Oct 14, 2022 •

edited

Loading

lfoppiano commented Oct 14, 2022 •

edited

Loading