Adjustments to grobid-client.py #3

darjusp · 2019-08-12T06:10:44Z

Dear Patrice,

As we discussed in the other repo, I adjusted the client so that it can parse citations from text.
The solution became a bit ugly, but now:

It reads a "txt" file as input, with one citation per line.
It groups citations by thousands (or by the specified batch_size) and saves each group in an XML file, named after the input file plus the number of the thousand (or batch).
At the end, it opens each file and adds the appropriate XML beginning and end.
TXT and PDF inputs are handled separately, after the common "process" function.
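
The batching steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked client: the helper names (`read_citations`, `batches`, `write_batches`), the output naming scheme, and the `<citations>`/`<citation>` wrapper elements are all assumptions, and the request to the GROBID service is left out entirely, so the raw citation strings are written instead of parsed results.

```python
from itertools import islice
from pathlib import Path
from xml.sax.saxutils import escape


def read_citations(path):
    """Yield one citation per non-empty line of a text file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line


def batches(items, batch_size=1000):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


def write_batches(input_path, output_dir, batch_size=1000):
    """Write one XML file per batch and return the list of paths written."""
    input_path = Path(input_path)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, chunk in enumerate(batches(read_citations(input_path), batch_size)):
        # Hypothetical naming scheme: <input name>_<batch index>.xml
        target = output_dir / f"{input_path.stem}_{index}.xml"
        with open(target, "w", encoding="utf-8") as out:
            # Hypothetical wrapper elements; a real client would write
            # the XML returned by the GROBID service here instead.
            out.write('<?xml version="1.0" encoding="UTF-8"?>\n<citations>\n')
            for citation in chunk:
                out.write(f"  <citation>{escape(citation)}</citation>\n")
            out.write("</citations>\n")
        written.append(target)
    return written
```

Writing the XML header and footer while each batch file is created avoids the second pass that reopens every file at the end.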

Issues:

I had to rename the 'input' variable to 'input2' because Python was complaining about the name.
The input file must be in TXT format.
If more than one worker is specified, the output no longer follows the order of the input file.
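
One possible way around the ordering issue, sketched here under assumptions and not taken from the linked client, is to let the standard library collect results in input order. `process_citation` is a hypothetical stand-in for the request to the GROBID service, and `process_in_order` is an illustrative name.

```python
from concurrent.futures import ThreadPoolExecutor


def process_citation(citation):
    # Hypothetical stand-in for the request to the GROBID service.
    return f"<citation>{citation}</citation>"


def process_in_order(citations, workers=4):
    """Process citations concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which worker finishes first.
        return list(pool.map(process_citation, citations))
```

The work still runs on several threads; only the collection of results is ordered, so a slow citation delays the results behind it but does not reorder them.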

Examples:
If order matters (--n 1):
python grobid-client.py --input /path/to/refs/file.txt --n 1
If order does not matter (--n greater than 1, or the default):
python grobid-client.py --input /path/to/refs/file.txt

Parsing 2 million citations with a single worker on a 2015 MacBook Pro took around 6 hours. Not so slow :)

Here is the file: https://github.com/darjusp/contribs/blob/master/grobid-client.py

kermitt2 added the "enhancement" (New feature or request) label on Jun 9, 2021

kermitt2 added the "implemented" (At least you try) label on Mar 5, 2023
