Replace PDFContentImporter by another library #169

Closed
koppor opened this issue Sep 26, 2016 · 7 comments · Fixed by JabRef/jabref#8001
Labels
component: xmp Issues concerning the XMP PDF metadata

Comments

@koppor
Member

koppor commented Sep 26, 2016

As a user, I want to import a PDF into JabRef. Each PDF contains bibliographic information, either as embedded XMP data or just as text on the title page (author, title, DOI, ..., possibly included BibTeX: https://www.ctan.org/pkg/coverpage or https://ctan.org/pkg/authorarchive?lang=de). This information should be extracted from the PDF.

Currently, a self-written functionality is employed. This works OK for LNCS and IEEE papers, but not for other publishers.

Solution Sketch

We already have GROBID in place. This should be used. Check Apache Tika, too.

Steps:

  1. Check if the first PDF page contains @article or other direct BibTeX data (created by https://www.ctan.org/pkg/coverpage). If yes -> use that. Stop. Else continue.
  2. Check if a .bib file is embedded in the PDF. If yes, use this one. (created by https://ctan.org/pkg/authorarchive?lang=de)
  3. Check if XMP data is available. If yes -> use that. Stop. Else continue.
  4. Look for DOI in the first page. If present -> use that. Stop. Else continue.
  5. Use Apache Tika/GROBID to extract PDF. Use that data.
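
The steps above form a fallback chain: try each source in order and stop at the first one that yields data. A minimal sketch of that control flow follows. It assumes the PDF has already been parsed into a hypothetical PdfInput record. All names here (PdfImportChain, PdfInput, Result, importPdf) are illustrative and not part of JabRef. The DOI pattern is simplified, and the fallback function only stands in for a Tika/GROBID call.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PdfImportChain {
    // Hypothetical stand-in for an already parsed PDF (assumption, not a JabRef type).
    record PdfInput(String firstPageText, Optional<String> embeddedBib, Optional<String> xmpData) {}

    // Which step produced the data, and the data itself.
    record Result(String source, String data) {}

    // Start of a BibTeX entry such as "@article{" or "@inproceedings{".
    private static final Pattern BIBTEX_ENTRY = Pattern.compile("@[A-Za-z]+\\s*\\{");
    // Simplified DOI pattern (assumption); it does not strip trailing punctuation.
    private static final Pattern DOI = Pattern.compile("\\b10\\.\\d{4,9}/[^\\s\"<>]+");

    // Step 1: BibTeX text printed on the first page.
    static Optional<String> bibtexOnFirstPage(PdfInput pdf) {
        Matcher m = BIBTEX_ENTRY.matcher(pdf.firstPageText());
        return m.find() ? Optional.of(pdf.firstPageText().substring(m.start()).trim()) : Optional.empty();
    }

    // Step 4: DOI printed on the first page.
    static Optional<String> doiOnFirstPage(PdfInput pdf) {
        Matcher m = DOI.matcher(pdf.firstPageText());
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    // Runs steps 1 to 4 in order; step 5 (content extraction) is the caller-supplied fallback.
    static Result importPdf(PdfInput pdf, Function<PdfInput, String> fallback) {
        Map<String, Function<PdfInput, Optional<String>>> steps = new LinkedHashMap<>();
        steps.put("first-page BibTeX", PdfImportChain::bibtexOnFirstPage);
        steps.put("embedded .bib", PdfInput::embeddedBib);
        steps.put("XMP", PdfInput::xmpData);
        steps.put("DOI", PdfImportChain::doiOnFirstPage);
        for (Map.Entry<String, Function<PdfInput, Optional<String>>> step : steps.entrySet()) {
            Optional<String> data = step.getValue().apply(pdf);
            if (data.isPresent()) {
                return new Result(step.getKey(), data.get());
            }
        }
        return new Result("content extraction", fallback.apply(pdf));
    }
}
```

Keeping the steps in an ordered map makes the priority explicit and would also let a later merge dialog collect the results of all steps instead of stopping at the first hit.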

Improvement possibility

Offer merge dialog from the different options (e.g., XMP + PDF scraping via GROBID)

Challenges

  • Cover different cases: BibTeX text on the first page, .bib embedded, DOI on first page, different publisher PDFs (LNCS, LNI, IEEE, ACM, ...)
  • Good test cases
    • Create test PDFs

Side notes

Check current drag'n'drop behavior. In 3.8.2, the user was asked whether (s)he wants to create a new entry or link the PDF.

In http://discourse.jabref.org/t/more-control-on-the-duplicate-finder/120/4?u=koppor the tool https://github.com/CrossRef/pdfextract was recommended. At first sight, it can fully replace our PDFContentImporter.

Refs JabRef#7209

@koppor koppor changed the title Replace PDFContentImporter by Replace PDFContentImporter by pdfextract Sep 26, 2016
@tobiasdiez
Member

But isn't pdfextract a (Ruby?) program rather than a Java library that can easily be integrated?

@koppor
Member Author

koppor commented Sep 26, 2016

Depends on the definition of "easily". https://github.com/jruby/jruby/wiki/DirectJRubyEmbedding

(i) Being the initial author of PDFContentImporter, (ii) seeing that no one took over in the last years, (iii) knowing the issues and (iv) limitations of the current implementation, I think it is easier to integrate the library than to fix PDFContentImporter.

Side story: Colleagues from another department use the PDFContentImporter successfully with Springer and IEEE papers, which are the two publishers it was designed for.

@koppor
Member Author

koppor commented Dec 16, 2016

After rereading the homepage of pdfextract, the tool seems "only" to help to extract the title of the PDF. Thus, it does not help to improve the PDFContentImporter.

@koppor koppor closed this as completed Dec 16, 2016
@koppor koppor changed the title Replace PDFContentImporter by pdfextract Replace PDFContentImporter by another library Dec 16, 2016
@koppor
Member Author

koppor commented Dec 16, 2016

@koppor koppor reopened this Dec 16, 2016
@koppor koppor added the component: xmp Issues concerning the XMP PDF metadata label Apr 12, 2018
@koppor
Member Author

koppor commented Aug 13, 2019

Refs JabRef#2474

@koppor
Member Author

koppor commented Sep 2, 2020

The important thing on this issue is to categorize "related work".

  • Input: Set of representative PDFs
  • Output:
    • BibTeX generated for each PDF
    • Quality rating
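
One possible way to turn the quality rating into a number is sketched below. It assumes that a hand-checked reference entry exists for each test PDF and that entries are plain field-to-value maps. The metric (share of reference fields reproduced, ignoring case and surrounding whitespace) and the name ExtractionRating are assumptions made for illustration, not something decided in this issue.

```java
import java.util.Map;

class ExtractionRating {
    // Fraction of reference fields whose extracted value matches,
    // ignoring case and surrounding whitespace. Returns a value in [0, 1].
    static double rate(Map<String, String> expected, Map<String, String> extracted) {
        if (expected.isEmpty()) {
            return 1.0; // nothing to get wrong
        }
        long hits = expected.entrySet().stream()
                .filter(e -> {
                    String actual = extracted.get(e.getKey());
                    return actual != null && actual.trim().equalsIgnoreCase(e.getValue().trim());
                })
                .count();
        return (double) hits / expected.size();
    }
}
```

Running this once per library over the same set of PDFs would give comparable scores per publisher style (LNCS, IEEE, ACM, ...).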

@StarTrooper08
