
Production Grobid Server Configuration #443

Open
Zhenshan-Jin opened this issue Jun 24, 2019 · 2 comments
Labels
question There's no such thing as a stupid question

Comments

Zhenshan-Jin commented Jun 24, 2019

Hi Grobid Team,

Thanks for the great technical document parsing tools.

I think this isn't really an issue, but rather a configuration question. Is there any recommendation for the Grobid server configuration in production?

For example, parameters like the connection limit and timeout, and also the CPU requirements for the server machine.

What we found is that when the out-of-the-box Grobid Docker server receives too many documents at once, the server collapses and throws 503 errors. But I never have this issue when using the public server http://cloud.science-miner.com/grobid/

Thank you!

lfoppiano (Collaborator) commented:

@Zhenshan-Jin thanks for your feedback. Check also #349 for some discussion on the same topic.

The parameters depend on your requirements, actually. The connection limit defines the maximum number of concurrent connections the server can handle; above that, it will start refusing any new ones. The timeout is the maximum time a connection is allowed to last.
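As a rough illustration, the connection limit maps to a property in grobid-home/grobid.properties (the property name is the one given in the reply below; the value here is only an example, not a recommendation for your setup):

    # grobid-home/grobid.properties (excerpt, example value only)
    org.grobid.max.connections=10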

Regarding the docker image, how much memory did you allocate?
Could you give us some more information about your process? For example, how many documents are you trying to parse? Are you running parallel requests? etc.

lfoppiano added the question label on Jun 24, 2019
kermitt2 (Owner) commented Jun 24, 2019

Hello @Zhenshan-Jin !

Thanks for your interest in GROBID.

First, the 503 status returned by GROBID is not an error, and does not mean that the server has collapsed. On the contrary, it is the mechanism that prevents the service from collapsing and keeps it alive and running according to its capacity for days.

If you look at the documentation, https://grobid.readthedocs.io/en/latest/Grobid-service/#apiprocessfulltextdocument, the 503 status indicates that the service is currently using all its available threads, and the client should simply wait a bit before re-sending the query.

Handling of the 503 mechanism is implemented in the different GROBID clients listed here:

https://grobid.readthedocs.io/en/latest/Grobid-service/#clients-for-grobid-web-services

I invite you to use one of these clients, or to check for a 503 status in the response in your own client. Be sure to understand this mechanism, because scaling the system depends on it.
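As a minimal sketch of this retry-on-503 pattern (not one of the official clients; it assumes a local GROBID instance on the default port 8070, the Python requests library, and the input multipart field of the REST API):

    import time
    import requests  # third-party HTTP library

    # Default local GROBID endpoint (assumed; adjust host/port for your deployment)
    GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

    def process_pdf(pdf_path, max_retries=10, wait_seconds=3):
        """Send one PDF to GROBID, waiting and retrying whenever the server answers 503."""
        for attempt in range(max_retries):
            with open(pdf_path, "rb") as f:
                resp = requests.post(GROBID_URL, files={"input": f})
            if resp.status_code == 503:
                # All server threads are busy: expected back-pressure, not a crash.
                time.sleep(wait_seconds)
                continue
            resp.raise_for_status()
            return resp.text  # TEI XML result
        raise RuntimeError("GROBID still busy after {} retries".format(max_retries))

A production client would typically add a cap on the total waiting time and run several such workers in parallel, which is essentially what the official clients above do.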

The exact server configuration will depend on the service you want to call. Here is the configuration I used to process around 10.6 PDFs per second with processFulltextDocument (around 915,000 PDFs per day, around 20M pages per day), using the node.js client listed above, for one week on a 16-CPU machine (16 threads, 32GB RAM, no SSD).
It ran without any crash for ~7 days at this rate. In one of my projects, I processed 11.3M PDFs in a bit less than 7 days with two such 16-CPU servers.

  • if your server has 8-10 threads available, you can use the default settings of the Docker image; otherwise you will need to build and start the service yourself to tune the parameters
  • keep the concurrency at the client (number of simultaneous calls) slightly higher than the number of threads available on the server side; for instance, if the server has 16 threads, use a concurrency between 20 and 24 (it's the option n in the above-mentioned clients; in my case I used 24)
  • in grobid/grobid-home/grobid.properties, set the property org.grobid.max.connections to the number of threads available on the server side or slightly higher (e.g. 16 to 20 for a 16-thread machine; in my case I used 20)
  • set modelPreload to true in grobid/grobid-service/config/config.yaml; it will avoid some strange behavior at launch (see the configuration sketch after this list)
  • in the query, consolidateHeader can be 1 or 2 if you are using the CrossRef consolidation; it significantly improves the accuracy and adds useful metadata
  • if you want to consolidate all the bibliographical references and use consolidateCitations as 1 or 2, the CrossRef query rate limit will prevent scaling beyond 1 document per second... For scaling the bibliographical reference resolution, you will need to use a local consolidation service, https://github.com/kermitt2/biblio-glutton. The overall capacity will then depend on the biblio-glutton service and the number of elasticsearch nodes you can exploit. From experience, it is difficult to go beyond 300K PDFs per day when using consolidation for every extracted bibliographical reference.
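To make the above concrete, here is a sketch of the two server-side settings for a 16-thread machine, using the values given above (file paths as in a source checkout of GROBID; treat it as an illustration rather than a drop-in configuration):

    # grobid/grobid-home/grobid.properties (excerpt)
    org.grobid.max.connections=20

    # grobid/grobid-service/config/config.yaml (excerpt)
    modelPreload: true

On the client side, the matching choices would be a concurrency (option n) of 24 and, if you use CrossRef consolidation, consolidateHeader=1 or 2 in the query.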
