
Production Grobid Server Configuration #443

Open
Zhenshan-Jin opened this issue Jun 24, 2019 · 2 comments
Labels
question There's no such thing as a stupid question

Comments

Zhenshan-Jin commented Jun 24, 2019

Hi Grobid Team,

Thanks for the great technical document parsing tools.

I think this isn't really an issue, but rather a configuration question. Is there any recommendation for the Grobid server configuration in production?

For example, parameters like the connection limit and timeout, and also the CPU requirements for the server machine.

What we found is that when the out-of-the-box Grobid Docker server receives too many documents at once, the server collapses and throws 503 errors. But I never have this issue when using the public server http://cloud.science-miner.com/grobid/

Thank you!

lfoppiano (Collaborator) commented:

@Zhenshan-Jin thanks for your feedback. Check also #349 for some discussion on the same topic.

The parameters depend on your requirements, actually. The connection limit defines the maximum number of concurrent connections the server can handle; above that, it will start refusing any new ones. The timeout is the maximum time a connection is allowed to last.
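As a rough illustration, the connection limit maps to a property in grobid-home/grobid.properties (the property name is the one given in the reply below; the value here is only an example, not a recommendation for your setup):

    # grobid-home/grobid.properties (excerpt, example value only)
    org.grobid.max.connections=10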

Regarding the docker image, how much memory did you allocate?
Could you give us some more information about your process? For example, how many documents are you trying to parse? Are you running parallel requests? etc.

lfoppiano added the question label on Jun 24, 2019
kermitt2 (Owner) commented Jun 24, 2019

Hello @Zhenshan-Jin !

Thanks for your interest in GROBID.

First, the 503 status returned by GROBID is not an error, and does not mean that the server has collapsed. On the contrary, it is the mechanism that prevents the service from collapsing and keeps it alive and running according to its capacity for days.

If you look at the documentation, https://grobid.readthedocs.io/en/latest/Grobid-service/#apiprocessfulltextdocument, the 503 status indicates that the service is currently using all its available threads, and the client should simply wait a bit before re-sending the query.

Handling of the 503 mechanism is implemented in the different GROBID clients listed here:

https://grobid.readthedocs.io/en/latest/Grobid-service/#clients-for-grobid-web-services

I invite you to use one of these clients, or to check for a 503 status in the response in your own client. Be sure to understand this mechanism, because scaling the system depends on it.
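As a minimal sketch of this retry-on-503 pattern (not one of the official clients; it assumes a local GROBID instance on the default port 8070, the Python requests library, and the input multipart field of the REST API):

    import time
    import requests  # third-party HTTP library

    # Default local GROBID endpoint (assumed; adjust host/port for your deployment)
    GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

    def process_pdf(pdf_path, max_retries=10, wait_seconds=3):
        """Send one PDF to GROBID, waiting and retrying whenever the server answers 503."""
        for attempt in range(max_retries):
            with open(pdf_path, "rb") as f:
                resp = requests.post(GROBID_URL, files={"input": f})
            if resp.status_code == 503:
                # All server threads are busy: expected back-pressure, not a crash.
                time.sleep(wait_seconds)
                continue
            resp.raise_for_status()
            return resp.text  # TEI XML result
        raise RuntimeError("GROBID still busy after {} retries".format(max_retries))

A production client would typically add a cap on the total waiting time and run several such workers in parallel, which is essentially what the official clients above do.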

The exact server configuration will depend on the service you want to call. Here is the configuration I used to process around 10.6 PDFs per second with processFulltextDocument (around 915,000 PDFs per day, around 20M pages per day), using the node.js client listed above, for one week on a 16-CPU machine (16 threads, 32GB RAM, no SSD).
It ran without any crash for ~7 days at this rate. In one of my projects, I processed 11.3M PDFs in a bit less than 7 days with two such 16-CPU servers.

  • if your server has 8-10 threads available, you can use the default settings of the Docker image; otherwise you will need to build and start the service yourself to tune the parameters
  • keep the concurrency at the client (number of simultaneous calls) slightly higher than the number of threads available on the server side; for instance, if the server has 16 threads, use a concurrency between 20 and 24 (it's the option n in the above-mentioned clients; in my case I used 24)
  • in grobid/grobid-home/grobid.properties, set the property org.grobid.max.connections to the number of threads available on the server side or slightly higher (e.g. 16 to 20 for a 16-thread machine; in my case I used 20)
  • set modelPreload to true in grobid/grobid-service/config/config.yaml; it will avoid some strange behavior at launch (see the configuration sketch after this list)
  • in the query, consolidateHeader can be 1 or 2 if you are using the CrossRef consolidation; it significantly improves the accuracy and adds useful metadata
  • if you want to consolidate all the bibliographical references and use consolidateCitations as 1 or 2, the CrossRef query rate limit will prevent scaling beyond 1 document per second... For scaling the bibliographical reference resolution, you will need to use a local consolidation service, https://github.com/kermitt2/biblio-glutton. The overall capacity will then depend on the biblio-glutton service and the number of elasticsearch nodes you can exploit. From experience, it is difficult to go beyond 300K PDFs per day when using consolidation for every extracted bibliographical reference.
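To make the above concrete, here is a sketch of the two server-side settings for a 16-thread machine, using the values given above (file paths as in a source checkout of GROBID; treat it as an illustration rather than a drop-in configuration):

    # grobid/grobid-home/grobid.properties (excerpt)
    org.grobid.max.connections=20

    # grobid/grobid-service/config/config.yaml (excerpt)
    modelPreload: true

On the client side, the matching choices would be a concurrency (option n) of 24 and, if you use CrossRef consolidation, consolidateHeader=1 or 2 in the query.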
