Add possibility to process multiple raw references by http request #540

Open
NikodemKch opened this issue Feb 16, 2020 · 2 comments

@NikodemKch

The GROBID service offers the possibility to extract bibliographical metadata from a raw String.
It would be nice if the GROBID service could also extract and return multiple references from a given String.

As I understand it, this would involve rewriting this method so that it uses processRawReferences() instead of processRawReference().

This would save clients some work when extracting multiple references via the service, since only one HTTP call would be needed.

An example would be the integration in JabRef (software for managing .bib databases): JabRef/jabref#5614 (comment)

@kermitt2
Owner

Hello @NikodemKch !

Thanks for the integration in JabRef!

To clarify:

  • if you want to process a list of bibliographical references (e.g. multiple references in a list, or one per line in a txt file), individual async calls to the existing web service /api/processCitation, one for each reference, are normally the best approach because Grobid will parallelize the processing of the references with its pool of threads. I think this is the use case for JabRef?

In the current design, when a web service call is sent to Grobid, it is handled by one dedicated thread, so a list of bibliographical references sent in a single call would be processed sequentially (note that the consolidation, however, is always parallelized, and it is what takes time when it is selected). So my guess is that it would be slower overall. But of course it's really easy to add such an additional web service with a list, for instance taking as input a JSON array?

  • if you have a full bibliographical section in text format (for example extracted from a PDF via a pdf2txt process), this is not supported by Grobid right now because the reference segmenter exploits layout features to segment individual bibliographical references and reference callouts. The positioning of the text, the indent, and the block information are used to guess where a reference starts and ends, and the model expects this layout information as features.

I understand that this second use case is not what you are requesting?
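
For the first use case, a rough client-side sketch of "one async call per reference" could look like the following. It is only an illustration under assumptions: it assumes a local Grobid instance on port 8070 and that /api/processCitation accepts the raw string as a form field named `citations`; `post_citation` and `process_references` are hypothetical helper names, not part of Grobid or JabRef.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib import parse, request

# Assumption: a local Grobid instance listening on port 8070.
GROBID_URL = "http://localhost:8070/api/processCitation"


def post_citation(raw_reference, url=GROBID_URL, timeout=60):
    """POST one raw reference string and return the response body as text."""
    # Assumption: the service reads the raw string from a form field "citations".
    data = parse.urlencode({"citations": raw_reference}).encode("utf-8")
    req = request.Request(url, data=data, method="POST")
    with request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def process_references(raw_references, post=post_citation, max_workers=8):
    """Send one request per reference concurrently.

    Results are returned in the same order as the input list, so the caller
    does not have to match responses back to references.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(post, raw_references))
```

Calling `process_references(["ref 1 ...", "ref 2 ..."])` then issues the requests in parallel from the client side, so the server can spread them over its thread pool. The `post` parameter is there only so the HTTP call can be swapped out, for example in tests.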

@NikodemKch
Author

Hello @kermitt2
Indeed, the second use case is not what I am looking for.

As I understand it, as the caller I would have the choice between sending everything in one request, with slightly slower processing, or creating and collecting multiple async threads, with slightly faster processing. I think it would be great to have a choice between these options, and JSON arrays sound good too!

I have forwarded this question to the admins of JabRef: JabRef/jabref#5614 (comment)
