Replace PDFContentImporter by another library #169

koppor · 2016-09-26T11:56:58Z

As a user, I want to import a PDF into JabRef. Each PDF contains bibliographic information, either as embedded XMP data or just as text on the title page (author, title, DOI, ..., maybe included BibTeX: https://www.ctan.org/pkg/coverpage or https://ctan.org/pkg/authorarchive?lang=de). This information should be extracted from the PDF.

Currently, a self-written functionality is employed. This works OK for LNCS and IEEE papers, but not for other publishers.

Solution Sketch

We have GROBID in place. This should be used. Check Apache Tika, too.

Steps:

1. Check if the first PDF page contains @article or other direct BibTeX data (created by https://www.ctan.org/pkg/coverpage). If yes, use that.
2. Check if a .bib file is embedded in the PDF (created by https://ctan.org/pkg/authorarchive?lang=de). If yes, use this one.
3. Check if XMP data is available. If yes -> use that. Stop. Else continue.
4. Look for a DOI on the first page. If present -> use that. Stop. Else continue.
5. Use Apache Tika/GROBID to extract data from the PDF. Use that data.
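
The order of the steps above can be sketched in a few lines of Java. This is only an illustration under assumptions, not JabRef's actual importer: the class `PdfImportOrder`, its `Source` enum, and all method names are hypothetical, the first-page text is assumed to have been extracted already, and the presence of an embedded .bib file or XMP data is passed in as plain flags. The DOI pattern covers the common `10.<registrant>/<suffix>` shape and may miss unusual DOIs.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the step order from the issue; all names are made up. */
final class PdfImportOrder {

    /** Possible metadata sources, listed in the priority order of the steps. */
    enum Source { BIBTEX_ON_FIRST_PAGE, EMBEDDED_BIB_FILE, XMP, DOI_ON_FIRST_PAGE, CONTENT_EXTRACTION }

    // An entry start such as "@article{" or "@inproceedings {" at the beginning of a line.
    private static final Pattern BIBTEX_ENTRY_START =
            Pattern.compile("(?m)^\\s*@[A-Za-z]+\\s*\\{");

    // Common DOI shape: "10.", a registrant code of 4 to 9 digits, "/", then a suffix.
    private static final Pattern DOI =
            Pattern.compile("\\b10\\.\\d{4,9}/[^\\s\"<>]+");

    static boolean containsBibtexEntry(String firstPageText) {
        return BIBTEX_ENTRY_START.matcher(firstPageText).find();
    }

    static Optional<String> findDoi(String firstPageText) {
        Matcher matcher = DOI.matcher(firstPageText);
        if (!matcher.find()) {
            return Optional.empty();
        }
        // Drop trailing punctuation that usually belongs to the surrounding sentence.
        return Optional.of(matcher.group().replaceAll("[.,;:)\\]]+$", ""));
    }

    /** Returns the first source that applies; content extraction is the last resort. */
    static Source chooseSource(String firstPageText, boolean hasEmbeddedBibFile, boolean hasXmpData) {
        if (containsBibtexEntry(firstPageText)) {
            return Source.BIBTEX_ON_FIRST_PAGE;
        }
        if (hasEmbeddedBibFile) {
            return Source.EMBEDDED_BIB_FILE;
        }
        if (hasXmpData) {
            return Source.XMP;
        }
        if (findDoi(firstPageText).isPresent()) {
            return Source.DOI_ON_FIRST_PAGE;
        }
        return Source.CONTENT_EXTRACTION; // e.g. Apache Tika or GROBID
    }
}
```

For example, `PdfImportOrder.chooseSource(text, false, true)` picks XMP even when the text contains a DOI, because XMP comes earlier in the list. Returning only the chosen source keeps the sketch small; a merge dialog would instead need the results of all applicable sources.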

Improvement possibility

Offer merge dialog from the different options (e.g., XMP + PDF scraping via GROBID)

Challenges

Cover different cases (BibTeX text on the first page, .bib embedded, DOI on first page, different publisher PDFs (LNCS, LNI, IEEE, ACM, ...))
Good test cases
- Create test PDFs

Side notes

Check current drag'n'drop behavior. In 3.8.2, the user was asked whether (s)he wants to create a new entry or link the PDF.

In http://discourse.jabref.org/t/more-control-on-the-duplicate-finder/120/4?u=koppor the tool https://github.com/CrossRef/pdfextract was recommended. At first sight, it can fully replace our PDFContentImporter.

Refs JabRef#7209

tobiasdiez · 2016-09-26T12:40:37Z

But pdfextract is a (Ruby?) program and not a Java library that can easily be integrated.

koppor · 2016-09-26T13:02:13Z

Depends on the definition of "easily". https://github.com/jruby/jruby/wiki/DirectJRubyEmbedding

(i) Being the initial author of PDFContentImporter, (ii) seeing that no one has taken it over in recent years, and knowing the (iii) issues and (iv) limitations of the current implementation, I think it is easier to integrate the library than to fix PDFContentImporter.

Side story: Colleagues from another department use the PDFContentImporter successfully with Springer and IEEE papers, which are the two publishers it was designed for.

koppor · 2016-12-16T07:31:32Z

After rereading the homepage of pdfextract, the tool seems to "only" help with extracting the title of the PDF. Thus, it does not help to improve the PDFContentImporter.

koppor · 2016-12-16T07:50:33Z

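There are questions on StackExchange asking for PDF2Bib:

- Is there an open source tool for producing bibtex entries from paper PDFs?
- Extract titles from each page of a PDF?
- Extracting information from PDFs of research papers

Solutions outlined:

- cs2Bib
  - They also have XMP support
- CERMINE
- https://code.google.com/p/pdfssa4met/
- pdftk
- gscholar in combination with pdftotext
- bibfetch

More solutions:

- https://github.com/academia-edu/biblicit - a wrapper around the tools
- GROBIT/GROBID - see issue #327 (Think about adding GROBIT for "PDF to BibTeX" functionality)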

koppor · 2019-08-13T04:03:15Z

Refs JabRef#2474,

koppor · 2020-09-02T07:53:13Z

The important thing on this issue is to categorize "related work".

Input: Set of representative PDFs
Output:
- BibTeX generated for each PDF
- Quality rating
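
One way such a quality rating could be computed is sketched below. The issue does not define the rating, so the metric here is an assumption: the fraction of expected BibTeX fields that a tool reproduces for a given test PDF, compared case-insensitively and with whitespace collapsed. The class `ExtractionQuality` and its methods are hypothetical names, and entries are modelled as simple field-to-value maps.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical rating of one tool's output against a hand-made reference entry. */
final class ExtractionQuality {

    /** Lower-cases and collapses whitespace so formatting differences do not count as errors. */
    static String normalize(String value) {
        return value.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    /** Fraction of expected fields reproduced by the extractor, between 0.0 and 1.0. */
    static double score(Map<String, String> expected, Map<String, String> extracted) {
        if (expected.isEmpty()) {
            return 1.0; // nothing to get wrong
        }
        long matches = expected.entrySet().stream()
                .filter(field -> {
                    String actual = extracted.get(field.getKey());
                    return actual != null
                            && normalize(actual).equals(normalize(field.getValue()));
                })
                .count();
        return (double) matches / expected.size();
    }
}
```

Averaging this score over the set of test PDFs would give one number per tool. Exact matching after normalization is deliberately strict; a real comparison might want fuzzy matching for author lists and titles.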

StarTrooper08 · 2021-03-27T17:39:28Z

There are questions on StackExchange asking for PDF2Bib:

- Is there an open source tool for producing bibtex entries from paper PDFs?
- Extract titles from each page of a PDF?
- Extracting information from PDFs of research papers

Solutions outlined:

- cs2Bib
  - They also have XMP support
- CERMINE
- https://code.google.com/p/pdfssa4met/
- pdftk
- gscholar in combination with pdftotext
- bibfetch

More solutions:

- https://github.com/academia-edu/biblicit - a wrapper around the tools
- GROBIT/GROBID - see issue #327 (Think about adding GROBIT for "PDF to BibTeX" functionality)

Can we use Python to implement the pdf2bibtex conversion?

koppor changed the title ~~Replace PDFContentImporter by~~ Replace PDFContentImporter by pdfextract Sep 26, 2016

koppor closed this as completed Dec 16, 2016

koppor changed the title ~~Replace PDFContentImporter by pdfextract~~ Replace PDFContentImporter by another library Dec 16, 2016

koppor reopened this Dec 16, 2016

koppor added the XMP label (Issues concerning the XMP PDF metadata) Apr 12, 2018

koppor mentioned this issue Jun 27, 2021: Manual sync of PDF meta data #506 (closed)

btut mentioned this issue Aug 18, 2021: Reordered Pdf-Importer priorities JabRef/jabref#8001 (merged)

Siedlerchr closed this as completed in JabRef/jabref#8001 Aug 18, 2021
