Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

processShort may yeld loss of text in some cases #956

Closed
lfoppiano opened this issue Oct 13, 2022 · 12 comments · Fixed by #959
Closed

processShort may yeld loss of text in some cases #956

lfoppiano opened this issue Oct 13, 2022 · 12 comments · Fixed by #959
Labels
bug From Hemiptera and especially its suborder Heteroptera

Comments

@lfoppiano
Copy link
Collaborator

lfoppiano commented Oct 13, 2022

Working on the funding, I found a small glitch while investigating loss of text.
In my case, the funding section was empty.

Example PDF from the PMC set: main.pdf

This is a minor problem, but it could impact several parts:
image

Basically, when we use processShort() followed by TEIFormatter.processTEIDivSection():

The method called within TEIFormatter.processTEIDivSection(), in particular toTEITextPiece(), around line TEIFormatter:1159 expects basically text (not table, not figure) however the processShort could output crazy stuff.

In the example above the funding statement is tagged as <table> for the whole chunk which results in being lost in the final output.

This	this	T	Th	Thi	This	s	is	his	This	BLOCKIN	LINEIN	ALIGNEDLEFT	NEWFONT	HIGHERFONT	0	0	INITCAP	NODIGIT	0	NOPUNCT	0	10	0	NUMBER	0	0	I-<table>
study	study	s	st	stu	stud	y	dy	udy	tudy	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	0	10	0	NUMBER	0	0	<table>
was	was	w	wa	was	was	s	as	was	was	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	0	10	0	NUMBER	0	0	<table>
supported	supported	s	su	sup	supp	d	ed	ted	rted	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	0	10	0	NUMBER	0	0	<table>
by	by	b	by	by	by	y	by	by	by	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	0	10	0	NUMBER	0	0	<table>
the	the	t	th	the	the	e	he	the	the	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	1	10	0	NUMBER	0	0	<table>
South	south	S	So	Sou	Sout	h	th	uth	outh	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	1	10	0	NUMBER	0	0	<table>
Asian	asian	A	As	Asi	Asia	n	an	ian	sian	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	1	10	0	NUMBER	0	0	<table>
Clinical	clinical	C	Cl	Cli	Clin	l	al	cal	ical	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	1	10	0	NUMBER	0	0	<table>
Toxicology	toxicology	T	To	Tox	Toxi	y	gy	ogy	logy	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	1	10	0	NUMBER	0	0	<table>
Research	research	R	Re	Res	Rese	h	ch	rch	arch	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	2	10	0	NUMBER	0	0	<table>
Collaboration	collaboration	C	Co	Col	Coll	n	on	ion	tion	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	2	10	0	NUMBER	0	0	<table>
,	,	,	,	,	,	,	,	,	,	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	COMMA	3	10	0	NUMBER	0	0	<table>
which	which	w	wh	whi	whic	h	ch	ich	hich	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	3	10	0	NUMBER	0	0	<table>
is	is	i	is	is	is	s	is	is	is	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	3	10	0	NUMBER	0	0	<table>
funded	funded	f	fu	fun	fund	d	ed	ded	nded	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	3	10	0	NUMBER	0	0	<table>
by	by	b	by	by	by	y	by	by	by	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	3	10	0	NUMBER	0	0	<table>
The	the	T	Th	The	The	e	he	The	The	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	3	10	0	NUMBER	0	0	<table>
Wellcome	wellcome	W	We	Wel	Well	e	me	ome	come	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	4	10	0	NUMBER	0	0	<table>
Trust	trust	T	Tr	Tru	Trus	t	st	ust	rust	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	4	10	0	NUMBER	0	0	<table>
/	/	/	/	/	/	/	/	/	/	BLOCKIN	LINEEND	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	NOPUNCT	4	10	0	NUMBER	0	0	<table>
National	national	N	Na	Nat	Nati	l	al	nal	onal	BLOCKIN	LINESTART	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	4	10	0	NUMBER	0	0	<table>
Health	health	H	He	Hea	Heal	h	th	lth	alth	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	5	10	0	NUMBER	0	0	<table>
and	and	a	an	and	and	d	nd	and	and	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	5	10	0	NUMBER	0	0	<table>
Medical	medical	M	Me	Med	Medi	l	al	cal	ical	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	5	10	0	NUMBER	0	0	<table>
Research	research	R	Re	Res	Rese	h	ch	rch	arch	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	5	10	0	NUMBER	0	0	<table>
Council	council	C	Co	Cou	Coun	l	il	cil	ncil	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	6	10	0	NUMBER	0	0	<table>
International	international	I	In	Int	Inte	l	al	nal	onal	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	6	10	0	NUMBER	0	0	<table>
Collaborative	collaborative	C	Co	Col	Coll	e	ve	ive	tive	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	6	10	0	NUMBER	0	0	<table>
Research	research	R	Re	Res	Rese	h	ch	rch	arch	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	7	10	0	NUMBER	0	0	<table>
Grant	grant	G	Gr	Gra	Gran	t	nt	ant	rant	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	7	10	0	NUMBER	0	0	<table>
GR071669MA	gr071669ma	G	GR	GR0	GR07	A	MA	9MA	69MA	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	CONTAINSDIGITS	0	NOPUNCT	8	10	0	NUMBER	0	0	<table>
.	.	.	.	.	.	.	.	.	.	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	DOT	8	10	0	NUMBER	0	0	<table>
The	the	T	Th	The	The	e	he	The	The	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	8	10	0	NUMBER	0	0	<table>
funding	funding	f	fu	fun	fund	g	ng	ing	ding	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	8	10	0	NUMBER	0	0	<table>
bodies	bodies	b	bo	bod	bodi	s	es	ies	dies	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	8	10	0	NUMBER	0	0	<table>
had	had	h	ha	had	had	d	ad	had	had	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<table>
no	no	n	no	no	no	o	no	no	no	BLOCKIN	LINEEND	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<table>
role	role	r	ro	rol	role	e	le	ole	role	BLOCKIN	LINESTART	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<table>
in	in	i	in	in	in	n	in	in	in	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<table>
analyzing	analyzing	a	an	ana	anal	g	ng	ing	zing	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<table>
or	or	o	or	or	or	r	or	or	or	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	10	10	0	NUMBER	0	0	<table>
interpreting	interpreting	i	in	int	inte	g	ng	ing	ting	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	10	10	0	NUMBER	0	0	<table>
the	the	t	th	the	the	e	he	the	the	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	10	10	0	NUMBER	0	0	<table>
data	data	d	da	dat	data	a	ta	ata	data	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	10	10	0	NUMBER	0	0	<table>
or	or	o	or	or	or	r	or	or	or	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	11	10	0	NUMBER	0	0	<table>
writing	writing	w	wr	wri	writ	g	ng	ing	ting	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	11	10	0	NUMBER	0	0	<table>
the	the	t	th	the	the	e	he	the	the	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	11	10	0	NUMBER	0	0	<table>
article	article	a	ar	art	arti	e	le	cle	icle	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	11	10	0	NUMBER	0	0	<table>
.	.	.	.	.	.	.	.	.	.	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	DOT	11	10	0	NUMBER	0	0	<table>

These are usually assigned to the previous paragraph, else they are lost.

} else if (clusterLabel.equals(TaggingLabels.FIGURE) || clusterLabel.equals(TaggingLabels.TABLE)) {
//figureBlock = true;
if (curParagraph != null)
curParagraph.appendChild(new Text(" "));
}
lastClusterLabel = cluster.getTaggingLabel();
}

I don't think is a big issue, however it could occur more ofter we expect, and, maybe in other parts (E.g. abstract?).

After checking and editing this post several time I'm not sure what's the best solution.
One thing is not clear, if processShort is supposed to process only text, we can consider to restrict certain labels that might be wrongly assigned in processShort(), such as TABLE, FIGURE or perhaps having a model for modeling only text which would use the same data as fulltext without TABLE/FIGURE.

@lfoppiano
Copy link
Collaborator Author

My quick solution could be to combine PARAGRAPH and FIGURE and TABLE as follows:

            } else if (clusterLabel.equals(TaggingLabels.PARAGRAPH)
                || clusterLabel.equals(TaggingLabels.FIGURE)
                || clusterLabel.equals(TaggingLabels.TABLE)) {

                if (clusterLabel.equals(TaggingLabels.FIGURE)
                    || clusterLabel.equals(TaggingLabels.TABLE)) {
                    //figureBlock = true;
                    if (curParagraph != null) {
                        curParagraph.appendChild(new Text(" "));
                        lastClusterLabel = cluster.getTaggingLabel();
                        continue;
                    }
                }
                ```

@kermitt2
Copy link
Owner

Hi Luca, thanks !

Yes you're right, this is something that bothers me since the processShort() is used, but as you say it's not very visible in general.

My initial problem is that we might have figures in abstract, so I kept figures/tables. I thought also about creating a specific distinct model for "short text" based on restricted labels from the fulltext training data (it's why there is a model shorttext declared in the GrobidModels class).

Currently I plan to fix this the ongoing branch fix-vector-graphics which takes out figure and table out of the fulltext model (figure and table are then recognized prior to the segmentation model, so no problem with figure in abstract anymore). But it's not progressing rapidly :D

Your quick solution is good I think to avoid weird stuff in short term (although no figure then in abstract), but we would need to add the clusterContent to curParagraph to avoid loss of text ?

@lfoppiano
Copy link
Collaborator Author

Your quick solution is good I think to avoid weird stuff in short term (although no figure then in abstract), but we would need to add the clusterContent to curParagraph to avoid loss of text ?

Ah, sorry, is not clear from the above comment, if there is no curParagraph it proceeds with the same approach as for the label PARAGRAPH. See the commit: 14f5c5a

@kermitt2
Copy link
Owner

Ah then it is doing nothing... the processing of figures/tables is done separately with processFigures()and processTables().

@lfoppiano
Copy link
Collaborator Author

Before, if label was TABLE or FIGURE, it was aggregating to the curParagraph only if curParagraph != null (the text was not added anywhere if curParagraph == null)
Now if TABLE or FIGURE and currentParagraph == null, it's just treating this text as it was annotated as PARAGRAPH from the fulltext model.

@kermitt2
Copy link
Owner

Sorry I understand now !

But if curParagraph == null (a figure or table before text) this would inject the figure and table content 2 times then for the normal full text case.

@kermitt2
Copy link
Owner

Something not impacting the normal fulltext processing and decoding would be to post-process the shortText() result, to rewrite the <table> and <figure> labels into <paragraph> if they start the labels in the labeled sequence. It would be also hacky, but the hack is then confined in shortText() scenario.

@lfoppiano
Copy link
Collaborator Author

lfoppiano commented Oct 14, 2022

But if curParagraph == null (a figure or table before text) this would inject the figure and table content 2 times then for the normal full text case.

Ah, I assumed that the processShort() imply having only text, so if the figure is before the text then the full-text model would detect that and separate them before the processShort() is called.

Something not impacting the normal fulltext processing and decoding would be to post-process the shortText() result, to rewrite the <table> and <figure> labels into <paragraph> if they start the labels in the labeled sequence. It would be also hacky, but the hack is then confined in shortText() scenario.

I did not understand this part, you mean to rewrite the labels in processShort() or use the shortText model and apply this hack?

@kermitt2
Copy link
Owner

I did not understand this part, you mean to rewrite the labels in processShort() or use the shortText model and apply this hack?

I was proposing to rewrite labels in processShort(), because in the PR now, the change affects the normal full text pieces and introduces a possible bug when the piece starts with a figure with the normal full text/annex decoding.

I don't have a better idea :D

@lfoppiano
Copy link
Collaborator Author

lfoppiano commented Oct 14, 2022

@kermitt2 now I understand 😄 Yes I think is possible to hack the result in this way.
I've have another suggestion, not sure is less hacky... How about supply a parameter to toTEITextPiece() (e.g., exclude tableAndFigure=true/false) to enable/disable this fix?

@kermitt2
Copy link
Owner

How about supply a parameter to toTEITextPiece() (e.g., exclude tableAndFigure=true/false) to enable/disable this fix?

It's hacking the hack in toTEITextPiece() :D

I would prefer to keep one hack in processShort() because this is specific to processShort, instead of changing the general case and adding one more unreadable parameter (readTableAndFigureAsParagraphWhenNoTextBefore) in toTEITextPiece (there are already far too many in toTEITextPiece, with 13 parameters !).

@lfoppiano
Copy link
Collaborator Author

Yeah, that's better with your solution 😅

@lfoppiano lfoppiano added the bug From Hemiptera and especially its suborder Heteroptera label Oct 18, 2022
@lfoppiano lfoppiano linked a pull request Oct 18, 2022 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug From Hemiptera and especially its suborder Heteroptera
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants