[WIP] Use CERMINE as PDF parser #2474

Closed

Conversation

tobiasdiez
Member

@tobiasdiez tobiasdiez commented Jan 20, 2017

This PR replaces our own PDF parser with CERMINE.
In my tests, this library extracted (relatively) correct information from a wide variety of articles. It had some problems with books and theses, though.
As far as I understand it, it uses neural networks that try to analyze the PDF on a structural level (e.g. the title is often placed rather prominently). More information can be found in a paper.
In summary:

Pros:

  • Good detection level
  • Works for a lot of different journals
  • Open source and actively developed
  • Provides facilities to extract bibliographic information from a citation string (i.e. plain text to BibEntry converter would be possible). Can be tested online
  • Also extracts document structure (i.e. sections) and a list of referenced articles! See for example

Cons:

  • Has problems with books and theses
  • Increases the size of JabRef from around 40 MB to 70 MB

A comparison of different metadata extraction tools can be found in the following blog post, which ends with this summary:

the machine learning based models (Cermine and Grobid) seems to give far better results than the simple approach taken by pdfextract, at the expense of being more difficult to build and run. Cermine gives a bunch of infos, too much in my opinion, and I think Grobid is given the most reusable and complete results.

I tried to include Grobid, but it appears to be hard to use and has had some problems on Windows (until recently?).

@koppor I added you as reviewer since you wrote the PdfImporter as far as I understood it.

  • Change in CHANGELOG.md described
  • Tests created for changes
  • Screenshots added (for bigger UI changes)
  • Manually tested changed features in running JabRef
  • Check documentation status (Issue created for outdated help page at help.jabref.org?)
  • If you changed the localization: Did you run gradle localizationUpdate?

@tobiasdiez added the status: devcall and status: ready-for-review labels on Jan 20, 2017
@tobiasdiez added this to the v4.0 milestone on Jan 20, 2017
import java.util.List;
import java.util.Optional;

public class Date {
Member

Are you sure you want to call that Date? It will create a lot of confusion and import hell.
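
To illustrate the naming concern, here is a minimal sketch of what call sites can look like when a project class shares its simple name with a JDK type. It assumes the clash in question is with java.util.Date, which the comment does not state explicitly. The nested Date class, its getYear method, and the describe helper are hypothetical stand-ins, not code from this PR.

```java
import java.util.Optional;

public class DateNameClash {

    // Hypothetical stand-in for the PR's model class; only the name matters here.
    static final class Date {
        private final int year;

        Date(int year) {
            this.year = year;
        }

        Optional<Integer> getYear() {
            return Optional.of(year);
        }
    }

    // Within this class the simple name Date refers to the nested class above,
    // so the JDK type has to be spelled out with its package name.
    static String describe(Date parsed, java.util.Date legacy) {
        String year = parsed.getYear().map(String::valueOf).orElse("unknown");
        return year + " / " + legacy.getTime();
    }

    public static void main(String[] args) {
        Date parsed = new Date(2017);
        java.util.Date legacy = new java.util.Date(0L);
        System.out.println(describe(parsed, legacy));
    }
}
```

Any file that needs both types has to qualify one of them, and readers have to check the imports to know which Date a given line means. A more specific class name would avoid that.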

@Siedlerchr
Member

Really like the idea!

@koppor
Member

koppor commented Jan 20, 2017

@lenhard removed the status: ready-for-review label on Feb 2, 2017
@tobiasdiez mentioned this pull request on Mar 30, 2017
@koppor changed the title from "Use CERMINE as PDF parser" to "[WIP] Use CERMINE as PDF parser" on Apr 17, 2017
@tobiasdiez removed this from the v4.0 milestone on May 15, 2017
@koppor
Member

koppor commented Aug 18, 2017

This is still WIP because CERMINE is AGPL-licensed. This is one of the highest risks in open-source software.

Maybe someone should retry Grobid.

@koppor
Member

koppor commented Aug 18, 2017

We are currently trying to focus on other things. 🔥
We will come back to this topic later, so we are adding the on-hold label and closing the issue for now.
