Error 408 when consolidating citations #54
Just to add - a basic request with `requests` works:

```python
import requests

GROBID_URL = 'http://localhost:8070'
url = '%s/api/processReferences' % GROBID_URL
pdf = './screened_PDF/meta-chinese.pdf'
xml = requests.post(url, files={'input': open(pdf, 'rb')}, data={"consolidateCitations": "1"})
```
@LukasWallrich, the
Hello! The purpose of this client is to process a directory of files, i.e. to do a batch process, managing concurrency efficiently. I tried to make this explicit in the readme and in the
If you want to process a single PDF file, you can use
Thank you both! The input here is a folder with two files, and the other one works fine, so that does not seem to be the issue.
If it's a 408 timeout, it might simply be that the Crossref API is too slow to consolidate citations - but for 2 files, that would mean the Crossref API is very, very slow. You can improve the response time a bit by indicating your email in the GROBID config file (the "polite" usage). However, sometimes when it is not in good shape, the Crossref API takes several seconds to answer each request. With many references, the timeout (60 seconds) might be reached. Even with a Plus token, this can happen. For production, it's not really possible to use the Crossref web API, which is why biblio-glutton was developed.
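For reference, the relevant section of `grobid-home/config/grobid.yaml` looks roughly like the sketch below (field names recalled from the GROBID 0.7.x sample config - please check against your version before relying on them):

```yaml
# grobid.yaml (excerpt, layout assumed from the 0.7.x sample config)
consolidation:
  service: "crossref"          # or "glutton" to use biblio-glutton instead
  crossref:
    mailto: "you@example.org"  # your email, enables Crossref's "polite" pool
    token: ""                  # optional Crossref Plus token
```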
Thanks. Adding the email is a bit difficult, as I am on an M2 Mac and can thus only run GROBID in the Docker container, which is hard to edit. Anyway, the request through the client fails even when there is only one PDF in the folder, while the manual Python request works. Also, the server log shows that Crossref requests go through every second or so, so there might be something more specific going on. For my use case, I only need to process a couple of hundred PDFs, so I can go down the more manual route - but obviously, the client would be helpful ...
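For what it's worth, such a manual route could be sketched as below - a plain loop over a folder with `requests`, one file at a time. Paths, output naming, and the generous client-side timeout are all assumptions, not part of the client:

```python
import pathlib
import requests

GROBID_URL = "http://localhost:8070"          # assumed local GROBID server
PDF_DIR = pathlib.Path("./screened_PDF")      # hypothetical input folder
OUT_DIR = pathlib.Path("./tei_out")           # hypothetical output folder


def tei_path(pdf: pathlib.Path, out_dir: pathlib.Path) -> pathlib.Path:
    """Map an input PDF to its output TEI file name, e.g. paper.pdf -> paper.tei.xml."""
    return out_dir / (pdf.stem + ".tei.xml")


def process_all() -> None:
    """Send each PDF to /api/processReferences and save the TEI result."""
    OUT_DIR.mkdir(exist_ok=True)
    for pdf in sorted(PDF_DIR.glob("*.pdf")):
        with open(pdf, "rb") as f:
            resp = requests.post(
                f"{GROBID_URL}/api/processReferences",
                files={"input": f},
                data={"consolidateCitations": "1"},
                timeout=300,  # generous client-side timeout for slow consolidation
            )
        if resp.status_code == 200:
            tei_path(pdf, OUT_DIR).write_text(resp.text, encoding="utf-8")
        else:
            print(f"{pdf.name}: HTTP {resp.status_code}")


if __name__ == "__main__":
    process_all()
```

Processing sequentially like this sidesteps concurrency entirely, which is acceptable for a few hundred PDFs even if consolidation is slow.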
You don't need to edit the container - simply edit the config file and mount it at launch of the container, like this:

```shell
docker run --rm --gpus all -p 8080:8070 -p 8081:8071 -v /home/lopez/grobid/grobid-home/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro grobid/grobid:0.7.2-SNAPSHOT
```

(where
This is probably too slow... A good rate is to get at least 10 consolidated citations per second, to avoid painful slowness and timeouts when parallelizing processing. If it's just a few hundred PDFs, you can try the public biblio-glutton service (which synchronizes itself daily with Crossref) with a low concurrency, to avoid too heavy a load on this cheap server :D
The following file can be processed by grobid when called with `curl`, but the equivalent (?) Python command fails with error 408.

Attachment: meta-chinese.pdf

curl call:

```shell
curl -v --form input=@./meta-chinese.pdf --form consolidateCitations=1 localhost:8070/api/processReferences
```

python:

```python
client.process("processReferences", "./screened_PDF", consolidate_citations=True)
```
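As an aside, the Python client reads its settings from a `config.json`, which includes a `timeout` value that may be worth raising when consolidation is slow. The field names below follow the client's sample config but should be treated as an assumption to verify against your version:

```json
{
  "grobid_server": "http://localhost:8070",
  "batch_size": 100,
  "sleep_time": 5,
  "timeout": 180
}
```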
The server log does not show any obvious issues. The Python command works when I don't consolidate citations.
Any ideas / suggestions?