Passing files directly into Grobid without downloading #47

matthieu-perso · 2022-08-09T12:42:07Z

Hey Grobid team,

Thanks again for these incredible tools. I've been testing the Python client and ran into an issue when passing a single PDF as the input, both from the CLI and from Python: I didn't receive any output.

Sample command below:

grobid_client --input ./resource/my.PDF --output ./out processFulltextDocument

While debugging, I realized that L122 of grobid_client.py expects a directory rather than a single file, as in the request below.

grobid_client --input ./resource/mypdfdir --output ./out processFulltextDocument

On GCP, I was trying to pass files directly to Grobid without downloading them first, which the current setup requires. Is there any way to stream PDFs to Grobid, or to send them as file objects? If not, I'll try to see if I can pull something off quickly and test it.


kermitt2 · 2022-08-11T04:11:10Z

Hi @MatthieuMoullecDev!

This client does indeed take a directory as input and output, as documented, because it is aimed at batch processing of many files.

For me, this client is a basis that can be adapted to different usage scenarios, so I tried to keep it simple, with zero external dependencies. You can use the client as a package and then call process_batch() or process_pdf() as convenient for your set of files and pipeline.

You can probably start sending a file to the Grobid server while it is still downloading, but Grobid will only start processing a file once it is entirely uploaded (for stability/robustness and technical reasons). So the easiest approach for your scenario is probably to download a file, add it to an executor, and then delete the file when the result is ready.
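
The download, process, delete workflow described above could be sketched roughly as follows. This is only an illustration under assumptions: `fetch` and `process` are hypothetical callables you would supply yourself (for example a GCS blob download and a call into the Grobid client or REST service), and neither name comes from grobid_client.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def process_remote_pdf(source, fetch, process):
    """Download one PDF to a temp file, process it, then delete the file.

    fetch(source) -> bytes and process(path) -> str are hypothetical
    callables provided by the caller; they are not part of grobid_client.
    """
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(fetch(source))
        return source, process(path)
    finally:
        # Runs whether the download/processing succeeded or raised.
        os.remove(path)


def process_all(sources, fetch, process, workers=4):
    """Run process_remote_pdf over many sources with a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(process_remote_pdf, source, fetch, process)
            for source in sources
        ]
        # result() re-raises any exception from the worker thread.
        return dict(future.result() for future in futures)
```

Each temporary file exists only for the duration of its own task, so disk usage stays bounded by the number of workers rather than by the size of the whole batch.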

In my experience, if no citation consolidation is used, Grobid processes a file faster than it takes to download a typical Unpaywall file.
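
For the file-object part of the question, one possible adaptation is to bypass the client and post in-memory bytes straight to the Grobid REST service, which takes the PDF as a multipart form field named `input`. Below is a minimal stdlib-only sketch, assuming a server reachable at `http://localhost:8070`; the helper names `build_multipart` and `post_pdf_bytes` are made up for illustration and are not part of grobid_client. Grobid still waits for the complete upload before it starts processing.

```python
import urllib.request
import uuid


def build_multipart(field, filename, data, boundary=None):
    """Encode one file part as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def post_pdf_bytes(pdf_bytes, server="http://localhost:8070",
                   service="processFulltextDocument", timeout=60):
    """POST in-memory PDF bytes to a Grobid service; return the response text.

    The server URL is an assumption; adjust it to your deployment.
    """
    body, content_type = build_multipart("input", "document.pdf", pdf_bytes)
    request = urllib.request.Request(
        f"{server}/api/{service}",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With something like this, bytes read from a cloud storage blob can be handed to `post_pdf_bytes` without touching the local disk, at the cost of holding each PDF in memory while it is uploaded.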

matthieu-perso mentioned this issue on Aug 9, 2022 in kermitt2/grobid#939, "Significant amounts of timeouts while using threading on Grobid Docker Service" (Open)
