Add possibility to process multiple raw references by http request #540

NikodemKch · 2020-02-16T22:02:42Z

The GROBID service offers the possibility to extract bibliographical metadata from a raw String.
It would be nice if the GROBID service could also extract and return multiple references from a given String.

As I understand it, this would involve rewriting this method to use processRawReferences() instead of processRawReference().

This would save clients some work when extracting multiple references via the service, since just one http call would be needed.

An example would be the integration in JabRef (a software for managing .bib-databases): JabRef/jabref#5614 (comment)

kermitt2 · 2020-02-20T16:22:54Z

Hello @NikodemKch !

Thanks for the integration in JabRef!

To clarify:

if you want to process a list of bibliographical references (e.g. multiple references in a list, or one per line in a txt file), individual async calls to the existing web service /api/processCitation, one for each reference, are normally the best option, because Grobid will parallelize the processing of the references with its pool of threads. I think this is the use case for JabRef?

In the current design, if we send a web service call to Grobid, it is handled by one dedicated thread, so a list of bibliographical references sent in a single call would be processed sequentially (note that the consolidation, however, is always parallelized, and it's what takes time when it is selected). So my guess is that a single call with a list would be slower overall. But of course it's really easy to add such an additional web service with a list, for instance taking as input a JSON array?

if you have a full bibliographical section in text format (for example extracted from a PDF via a pdf2txt process), this is not supported by Grobid right now, because the reference segmenter exploits layout features to segment individual bibliographical references and reference callouts. The positioning of the text, the indentation and the block information are used to guess where a reference starts and ends, and the model expects this layout information as features.

I understand that this second use case is not what you are requesting?
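
For the first use case, the "one call per reference, issued concurrently" approach could look roughly like the client-side sketch below. It is only an illustration under assumptions: the base URL assumes a local Grobid instance on its default port, the form fields `citations` and `consolidateCitations` follow my reading of the Grobid service documentation for /api/processCitation, and the function names `process_citation` and `process_citations` are hypothetical helpers, not part of Grobid or JabRef.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib import parse, request

# Assumption: a local Grobid instance listening on its default port.
GROBID_URL = "http://localhost:8070/api/processCitation"


def process_citation(raw, url=GROBID_URL, consolidate=0, timeout=60):
    """Send one raw reference string to Grobid and return the response body."""
    # Form-encoded POST body; field names are taken from the service docs.
    form = parse.urlencode(
        {"citations": raw, "consolidateCitations": str(consolidate)}
    ).encode("utf-8")
    req = request.Request(url, data=form, method="POST")
    with request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def process_citations(raws, worker=process_citation, max_workers=4):
    """Process several raw references with one request per reference.

    The requests are issued from a small thread pool, so the server can
    spread them over its own pool of threads. Results come back in the
    same order as the input list.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, raws))
```

A caller would pass a list of raw reference strings, e.g. `process_citations(["Author A. Title. Journal, 2010.", "Author B. Other title. 2015."])`, and get back one response body per reference. The `worker` parameter exists only so that the HTTP call can be swapped out, for instance in tests.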

NikodemKch · 2020-02-25T08:49:16Z

Hello @kermitt2
Indeed, the second use case is not what I am looking for.

As I understand it, as the caller I would have the choice between sending everything in one request, with slightly slower processing, or creating and collecting multiple async calls, with slightly faster processing. I think it would be great to have a choice between these options, and JSON arrays sound good too!

I have forwarded this question to the admins of JabRef: JabRef/jabref#5614 (comment)

NikodemKch mentioned this issue Feb 16, 2020

Add option to parse new references from plain text using GROBID service [solving #4826] JabRef/jabref#5614

Merged

