
Error 408 when consolidating citations #54

Open
LukasWallrich opened this issue Oct 4, 2022 · 7 comments

Comments

@LukasWallrich

The following file can be processed by grobid when called with curl, but the equivalent (?) Python command fails with error 408.

meta-chinese.pdf

curl call: curl -v --form input=@./meta-chinese.pdf --form consolidateCitations=1 localhost:8070/api/processReferences
python: client.process("processReferences", "./screened_PDF", consolidate_citations=True)

The server log does not show any obvious issues. The python command works when I don't consolidate citations.

Any ideas / suggestions?

@LukasWallrich
Author

LukasWallrich commented Oct 4, 2022

Just to add - a basic requests.post call works from Python. I can't quite see what the client is doing differently ...

import requests

GROBID_URL = "http://localhost:8070"
url = f"{GROBID_URL}/api/processReferences"
pdf = "./screened_PDF/meta-chinese.pdf"

with open(pdf, "rb") as fh:  # context manager so the file handle is closed after the request
    xml = requests.post(url, files={"input": fh}, data={"consolidateCitations": "1"})

@lfoppiano
Collaborator

lfoppiano commented Oct 4, 2022

@LukasWallrich, the input_path should be a directory, and the client should at least report a meaningful error when it isn't, so this is a bug. Single files can be processed by calling process_pdf, though I'm not sure process_pdf is meant to be called directly like that.

@lfoppiano lfoppiano added the bug Something isn't working label Oct 4, 2022
@kermitt2 kermitt2 removed the bug Something isn't working label Oct 5, 2022
@kermitt2
Owner

kermitt2 commented Oct 5, 2022

Hello !

The purpose of this client is to process a directory of files, i.e. to run a batch process while managing concurrency efficiently. I tried to make this explicit in the README and in the --help output:

--input INPUT         path to the directory containing PDF files or .txt
                        (for processCitationList only, one reference per line)
                        to process
  --output OUTPUT       path to the directory where to put the results
                        (optional)

If you want to process a single PDF file, you can use client.process_pdf(), but as Luca said, it is not designed to be used standalone outside a batch process, so all the arguments must be provided.
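For illustration, a minimal sketch of the single-file route. The argument names and the GrobidClient constructor below are assumptions based on my reading of the client around this time; process_pdf is not a stable public API, so check the signature in your installed version:

```python
# All the flags process_pdf expects (older releases take them explicitly,
# with no defaults) -- names are assumptions, verify against your version:
PROCESS_PDF_ARGS = dict(
    generateIDs=False,
    consolidate_header=False,
    consolidate_citations=True,   # the option that triggers Crossref lookups
    include_raw_citations=False,
    include_raw_affiliations=False,
    tei_coordinates=False,
    segment_sentences=False,
)

RUN_AGAINST_SERVER = False  # flip to True with a GROBID server on :8070

if RUN_AGAINST_SERVER:
    # Assumes the grobid_client package is installed; the constructor
    # arguments also vary by version (a config_path can be given instead).
    from grobid_client.grobid_client import GrobidClient

    client = GrobidClient(grobid_server="http://localhost:8070")
    # In the versions I looked at, process_pdf returns a
    # (pdf_file, status, text) tuple:
    pdf_file, status, tei = client.process_pdf(
        "processReferences",
        "./screened_PDF/meta-chinese.pdf",
        **PROCESS_PDF_ARGS,
    )
    print(status)
```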

@LukasWallrich
Author

Thank you both! The input here is a folder with two files - the other one works fine. So that does not seem to be the issue.

@kermitt2
Owner

kermitt2 commented Oct 5, 2022

If it's a 408 timeout, it might simply be that the Crossref API is too slow to consolidate the citations, though for 2 files that would mean the Crossref API is very, very slow. You can improve the response time a bit by indicating your email in the GROBID config file (Crossref's "polite" usage):
https://grobid.readthedocs.io/en/latest/Consolidation/#crossref-rest-api
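For reference, the relevant block of grobid.yaml looks roughly like this (a sketch based on the 0.7.x layout; field names may differ between versions, so check the page linked above):

```yaml
consolidation:
  # "crossref" to call the Crossref REST API, "glutton" for a biblio-glutton instance
  service: "crossref"
  crossref:
    mailto: "you@example.org"   # enables Crossref's "polite" pool
    token: ""                   # optional Crossref Plus token
```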

However, when it is not in good shape, the Crossref API sometimes takes several seconds to answer each request. With many references, the timeout (60 seconds) might be reached. Even with a Plus token, this can happen.

For production, it's not really feasible to use the Crossref web API, which is why biblio-glutton was developed.

@LukasWallrich
Author

Thanks. Adding the email is a bit difficult because I am on an M2 Mac and can thus only run GROBID in the Docker container, which is hard to edit. Anyway, the request through the client fails even when there is only one PDF in the folder, while the manual Python request works. Also, the server log shows that Crossref requests go through every second or so ... so there might be something more specific going on.

For my use case, I only need to process a couple of hundred PDFs, so I can go down the more manual route, but obviously, the client would be helpful ...
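For what it's worth, the manual route can be sketched with plain requests like this. The endpoint and form field mirror the curl call earlier in the thread; the paths, output layout, and helper names are illustrative:

```python
import os

GROBID_URL = "http://localhost:8070"

def tei_output_path(pdf_path, out_dir):
    """Derive an output .tei.xml path in out_dir from a PDF path."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(out_dir, stem + ".tei.xml")

def process_references(pdf_path, timeout=300):
    """POST one PDF to processReferences with citation consolidation enabled."""
    import requests  # imported here so the pure helpers above stay importable
    with open(pdf_path, "rb") as fh:
        return requests.post(
            f"{GROBID_URL}/api/processReferences",
            files={"input": fh},
            data={"consolidateCitations": "1"},
            timeout=timeout,  # generous client-side timeout; the server may still 408
        )

def process_folder(in_dir, out_dir):
    """Process every PDF in in_dir, writing TEI for successful responses."""
    os.makedirs(out_dir, exist_ok=True)
    for name in sorted(os.listdir(in_dir)):
        if not name.lower().endswith(".pdf"):
            continue
        pdf = os.path.join(in_dir, name)
        resp = process_references(pdf)
        if resp.status_code == 200:
            with open(tei_output_path(pdf, out_dir), "w", encoding="utf-8") as out:
                out.write(resp.text)
        else:
            print(name, "failed with", resp.status_code)

# e.g. process_folder("./screened_PDF", "./tei_out")
```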

@kermitt2
Owner

kermitt2 commented Oct 9, 2022

Adding the email is a bit difficult as I am on an M2 mac and can thus only run grobid in the Docker container, which is hard to edit.

You don't need to edit the container; simply edit the config file locally and mount it when launching the container, like this:

docker run --rm --gpus all -p 8080:8070 -p 8081:8071 -v /home/lopez/grobid/grobid-home/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro  grobid/grobid:0.7.2-SNAPSHOT

(where /home/lopez/grobid/grobid-home/config/grobid.yaml is your edited local config file with your email for Crossref politeness)

the server log shows that Crossref requests go through every second or so ... so there might be something more specific going on.

This is probably too slow... A good rate is at least 10 consolidated citations per second, to avoid painful slowness and timeouts when parallelizing the processing. If it's just a few hundred PDFs, you can try the public biblio-glutton instance (which synchronizes itself daily with Crossref) with a low concurrency, to avoid too heavy a load on this cheap server :D

@lfoppiano lfoppiano mentioned this issue Mar 3, 2024