Use GROBID for extraction of metadata from PDFs #6158

tobiasdiez · 2020-03-22T21:31:59Z

Now that we have the GROBID server up and running, we can also use it to extract bibliographic metadata from PDFs.

https://grobid.readthedocs.io/en/latest/Grobid-service/
/api/processHeaderDocument

Old PR (using CERMINE instead of GROBID): #2474
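
For illustration, a call to the `/api/processHeaderDocument` endpoint mentioned above could look roughly like the following. This is only a minimal sketch using the Python standard library, not JabRef's implementation. The base URL `http://localhost:8070` and the helper names `build_header_request` and `extract_header` are assumptions made for this example; the multipart form field name `input` follows the linked GROBID service documentation.

```python
import urllib.request
import uuid


def build_header_request(pdf_bytes: bytes,
                         base_url: str = "http://localhost:8070",  # assumed server address
                         filename: str = "paper.pdf") -> urllib.request.Request:
    """Build (but do not send) a multipart POST for header extraction."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # The PDF is sent in a form field named "input".
        f'Content-Disposition: form-data; name="input"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    url = base_url.rstrip("/") + "/api/processHeaderDocument"
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


def extract_header(pdf_path: str, base_url: str = "http://localhost:8070") -> str:
    """Send the PDF to a running GROBID server and return the raw response text."""
    with open(pdf_path, "rb") as handle:
        request = build_header_request(handle.read(), base_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

Building the request separately from sending it keeps the multipart encoding easy to inspect without a running server.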

koppor · 2020-03-23T19:51:00Z

Currently, it returns TEI XML format only, not BibTeX. I can try to patch the server accordingly. (Refs kermitt2/grobid#532 (comment))

I am curious in which cases GROBID is better than JabRef's custom implementation. It worked fine for me for IEEE and Springer LNCS. Still need to add more test cases though.
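
Regarding the TEI-only output mentioned above, one client-side workaround would be to pull the needed fields out of the TEI header and write them as BibTeX. The following is a minimal sketch under stated assumptions: the sample TEI is hand-written and heavily simplified, the function names are hypothetical, and the entry type `@misc` is an arbitrary placeholder. It is not how GROBID or JabRef performs the conversion.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written, simplified sample; real GROBID output contains many more elements.
SAMPLE_TEI = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">A Sample Title</title></titleStmt>
      <sourceDesc><biblStruct><analytic>
        <author><persName>
          <forename type="first">Ada</forename><surname>Lovelace</surname>
        </persName></author>
        <author><persName>
          <forename type="first">Grace</forename><surname>Hopper</surname>
        </persName></author>
      </analytic></biblStruct></sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>
"""


def tei_header_to_fields(tei_xml: str) -> dict:
    """Extract title and authors from a TEI header into BibTeX-style fields."""
    root = ET.fromstring(tei_xml)
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""

    authors = []
    for pers in root.findall(".//tei:analytic/tei:author/tei:persName", TEI_NS):
        forenames = " ".join(
            (f.text or "").strip() for f in pers.findall("tei:forename", TEI_NS)
        ).strip()
        surname_el = pers.find("tei:surname", TEI_NS)
        surname = (surname_el.text or "").strip() if surname_el is not None else ""
        # BibTeX convention: "Surname, Forenames"
        name = f"{surname}, {forenames}" if surname and forenames else (surname or forenames)
        if name:
            authors.append(name)
    return {"title": title, "author": " and ".join(authors)}


def to_bibtex(fields: dict, key: str = "grobid") -> str:
    """Render non-empty fields as a BibTeX entry (entry type is a placeholder)."""
    lines = [f"@misc{{{key},"]
    lines += [f"  {name} = {{{value}}}," for name, value in fields.items() if value]
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_bibtex(tei_header_to_fields(SAMPLE_TEI)))
```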

tobiasdiez · 2020-03-23T23:26:49Z

It would be nice if you could change the server accordingly. GROBID is the de facto standard for metadata extraction from PDFs (and is used by ResearchGate, Mendeley, etc.). Our implementation is really naïve and only works for a few publishers.

github-actions · 2020-12-08T19:27:57Z

This issue has been inactive for half a year. Since JabRef is constantly evolving this issue may not be relevant any longer and it will be closed in two weeks if no further activity occurs.

As part of an effort to ensure that the JabRef team is focusing on important and valid issues, we would like to ask if you could update the issue if it still persists. This could be in the following form:

If there has been a longer discussion, add a short summary of the most important points as a new comment (if not yet existing).
Provide further steps or information on how to reproduce this issue.
Upvote the initial post if you like to see it implemented soon. Votes are not the only metric that we use to determine the requests that are implemented, however, they do factor into our decision-making process.
If all information is provided and still up-to-date, then just add a short comment that the issue is still relevant.

Thank you for your contribution!

DesBw · 2021-08-29T14:22:40Z

> GROBID is the de facto standard for metadata extraction from PDFs (and is used by ResearchGate, Mendeley, etc.). Our implementation is really naïve and only works for a few publishers.

Mendeley produces junk. I haven't finished cleaning up the junk Mendeley gave me 10 years ago. Using the system that Mendeley uses is a really bad idea. It never gets it right.

It is better to improve other aspects of JabRef than to waste resources on a system that will produce gibberish and unclean reference data.

Siedlerchr · 2021-12-04T16:14:55Z

JabRef now uses several sources for extracting metadata from PDFs (XMP, embedded BibTeX, DOI, GROBID) and allows comparing them.

Thank you for reporting this issue. We think that this is already fixed in our development version; consequently, the change will be included in the next release.

We would like to ask you to use a development build from https://builds.jabref.org/main and report back if it works for you. Please remember to make a backup of your library before trying out this version.

koppor · 2021-12-06T19:24:59Z

Fixed by #2838

tobiasdiez added the type: feature label Mar 22, 2020

koppor self-assigned this Apr 9, 2020

github-actions bot added the status: stale label Dec 8, 2020

tobiasdiez removed the status: stale label Dec 8, 2020

github-actions bot added the status: stale label Jun 7, 2021

tobiasdiez removed the status: stale label Jun 7, 2021

koppor removed their assignment Jun 7, 2021

koppor mentioned this issue Jun 16, 2021

Manual sync of PDF meta data koppor/jabref#506

Closed

Siedlerchr closed this as completed Dec 4, 2021

koppor moved this to Done in Features & Enhancements Nov 7, 2022

koppor added this to Features & Enhancements Nov 7, 2022
