[WIP] Use CERMINE as PDF parser #2474

tobiasdiez · 2017-01-20T14:40:05Z

This PR replaces our own PDF parser with CERMINE.
In my tests, this library was able to extract (relatively) correct information from a wide variety of articles. It had some problems with books and theses, though.
As far as I understand it, it uses neural networks that try to analyze the PDF on a structural level (e.g. the title is often placed rather prominently). More information can be found in a paper.
In summary:

Pros:

Good detection level
Works for a lot of different journals
Open source and actively developed
Provides facilities to extract bibliographic information from a citation string (i.e. plain text to BibEntry converter would be possible). Can be tested online
Also extracts document structure (i.e. sections) and a list of referenced articles! See for example

Cons:

Has problems with books/theses
Increases the size of JabRef from around 40 MB to 70 MB.

A comparison of different metadata extraction tools can be found in the following blog post, which ends with the following summary:

the machine learning based models (Cermine and Grobid) seems to give far better results than the simple approach taken by pdfextract, at the expense of being more difficult to build and run. Cermine gives a bunch of infos, too much in my opinion, and I think Grobid is given the most reusable and complete results.

I tried to include Grobid but it appears to be hard to use and has some problems on Windows (until recently?).

@koppor I added you as reviewer since you wrote the PdfImporter as far as I understood it.

Change in CHANGELOG.md described
Tests created for changes
Screenshots added (for bigger UI changes)
Manually tested changed features in running JabRef
Check documentation status (Issue created for outdated help page at help.jabref.org?)
If you changed the localization: Did you run gradle localizationUpdate?

Siedlerchr · 2017-01-20T15:07:42Z

src/main/java/net/sf/jabref/model/entry/Date.java

+import java.util.List;
+import java.util.Optional;
+
+public class Date {


Are you sure you want to call that class Date? It will create a lot of confusion, given the import hell.
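To make the naming concern concrete, here is a minimal sketch of what happens when a project class is called Date and code also needs the JDK's java.util.Date. The nested `Date` class, its fields, and the `fromLegacy` helper are hypothetical stand-ins invented for illustration; they are not taken from the PR's actual `Date.java`.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Optional;

public class DateNameClash {

    // Hypothetical stand-in for a model class named Date; the fields are assumptions.
    static final class Date {
        private final int year;
        private final Optional<Integer> month;

        Date(int year, Optional<Integer> month) {
            this.year = year;
            this.month = month;
        }

        int getYear() {
            return year;
        }

        Optional<Integer> getMonth() {
            return month;
        }
    }

    // Within this class the simple name "Date" refers to the nested class above,
    // so the JDK type must be spelled out with its package name everywhere.
    static Date fromLegacy(java.util.Date legacy) {
        ZonedDateTime utc = legacy.toInstant().atZone(ZoneOffset.UTC);
        return new Date(utc.getYear(), Optional.of(utc.getMonthValue()));
    }

    public static void main(String[] args) {
        java.util.Date legacy = new java.util.Date(0L); // the Unix epoch
        Date converted = fromLegacy(legacy);
        System.out.println(converted.getYear() + "-" + converted.getMonth().orElse(0));
    }
}
```

Every file that touches both types ends up with fully qualified names like this, and a wrong auto-import silently picks the other Date. A more specific class name would avoid both problems.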

Siedlerchr · 2017-01-20T15:10:01Z

Really like the idea!

koppor · 2017-01-20T15:39:38Z

Grobid is Apache-licensed: https://github.com/kermitt2/grobid/blob/master/LICENSE

koppor · 2017-08-18T18:00:47Z

This is still WIP, because CERMINE is AGPL-licensed. This is one of the highest risks in open-source software.

Maybe someone should retry GROBID.

koppor · 2017-08-18T18:01:30Z

We are currently trying to focus on other things. 🔥
We will come back to this topic later, so I am adding the on-hold label and closing the issue for now.

Use CERMINE as PDF parser

79147f2

tobiasdiez added the status: devcall and status: ready-for-review labels Jan 20, 2017

tobiasdiez requested a review from koppor January 20, 2017 14:40

tobiasdiez added this to the v4.0 milestone Jan 20, 2017

Siedlerchr reviewed Jan 20, 2017

tobiasdiez removed the status: devcall label Jan 20, 2017

lenhard removed the status: ready-for-review label Feb 2, 2017

tobiasdiez mentioned this pull request Mar 30, 2017

Add PDF Viewer #2692

Merged

6 tasks

koppor changed the title ~~Use CERMINE as PDF parser~~ [WIP] Use CERMINE as PDF parser Apr 17, 2017

tobiasdiez removed this from the v4.0 milestone May 15, 2017

koppor closed this Aug 18, 2017

koppor added the on-hold label Aug 18, 2017

stefan-kolb deleted the cermine branch August 25, 2017 11:27

tobiasdiez mentioned this pull request Mar 29, 2019

Make it easy to add entry based on textual representation #4826

Closed

koppor mentioned this pull request Aug 13, 2019

Replace PDFContentImporter by another library koppor/jabref#169

Closed

tobiasdiez mentioned this pull request Mar 22, 2020

Use GROBID for extraction of metadata from PDFs #6158

Closed
