Replace PDFContentImporter by another library #169
Comments
But pdfextract is a (Ruby?) program, not a Java library that can easily be integrated.
Depends on the definition of "easily": https://github.com/jruby/jruby/wiki/DirectJRubyEmbedding. (i) Being the initial author of PDFContentImporter, (ii) seeing that no one has taken it over in recent years, and (iii) knowing the issues and limitations of the current implementation, I think it is easier to integrate the library than to fix PDFContentImporter. Side story: Colleagues from another department use the PDFContentImporter successfully with Springer and IEEE papers, which are the two publishers it was designed for.
After rereading the homepage of pdfextract: the tool seems to "only" help extract the title of the PDF. Thus, it does not help to improve the PDFContentImporter.
There are questions on StackExchange asking for PDF2Bib:
Solutions outlined:
More solutions:
Refs JabRef#2474
The important thing on this issue is to categorize "related work".
Can we use Python to implement the pdf2bibtex conversion?
As a user, I want to import a PDF into JabRef. Each PDF contains bibliographic information, either as embedded XMP data or just as the text on the title page (author, title, DOI, ..., maybe included BibTeX: https://www.ctan.org/pkg/coverpage or https://ctan.org/pkg/authorarchive?lang=de). This information should be extracted from the PDF.
Currently, a self-written functionality is employed. This works OK for LNCS and IEEE papers, but not for other publishers.
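One of the title-page cues mentioned above, the DOI, can often be picked up with a plain regular expression. The following is only a minimal sketch under assumptions: the class name `TitlePageDoiFinder` and the method `findDoi` are hypothetical and not part of JabRef, the title-page text is assumed to have been extracted already, and the pattern only covers the common modern DOI shape.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not JabRef code): finds the first DOI-like string in title-page text. */
final class TitlePageDoiFinder {
    // Heuristic for the common DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
    // It will miss unusual DOIs and is not a full validator.
    private static final Pattern DOI = Pattern.compile("\\b10\\.\\d{4,9}/[^\\s\"<>]+");

    static Optional<String> findDoi(String titlePageText) {
        Matcher m = DOI.matcher(titlePageText);
        if (!m.find()) {
            return Optional.empty();
        }
        // Strip trailing punctuation that usually belongs to the sentence, not to the DOI.
        String doi = m.group().replaceAll("[.,;:)\\]]+$", "");
        return Optional.of(doi);
    }
}
```

A DOI found this way could then be used to fetch the bibliographic data from a DOI-based service instead of scraping the remaining fields from the page.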
Solution Sketch
We already have GROBID in place. It should be used. Check Apache Tika, too.
Steps:
Check for @article or other direct BibTeX data in the PDF (created by https://www.ctan.org/pkg/coverpage).
Check whether a .bib file is embedded in the PDF. If yes, use this one. (created by https://ctan.org/pkg/authorarchive?lang=de)
Improvement possibility
Offer merge dialog from the different options (e.g., XMP + PDF scraping via GROBID)
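The data side of such a merge could look roughly like the sketch below. It is an illustration under assumptions, not JabRef's API: `MetadataMerge`, `Merged`, and `merge` are hypothetical names, each source (e.g., XMP, GROBID) is assumed to deliver its result as a simple field-to-value map, and the priority order of the sources is left to the caller.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch (not JabRef code): merges field maps from several extraction sources. */
final class MetadataMerge {

    /** Merge result: the chosen values plus the fields on which the sources disagreed. */
    static final class Merged {
        final Map<String, String> fields;
        final Set<String> conflicts;

        Merged(Map<String, String> fields, Set<String> conflicts) {
            this.fields = fields;
            this.conflicts = conflicts;
        }
    }

    /**
     * Merges the candidates in priority order: the first source that provides a field wins.
     * Fields for which a later source has a different value are collected as conflicts,
     * so that a merge dialog can ask the user about exactly those fields.
     */
    static Merged merge(List<Map<String, String>> candidatesByPriority) {
        Map<String, String> fields = new LinkedHashMap<>();
        Set<String> conflicts = new TreeSet<>();
        for (Map<String, String> candidate : candidatesByPriority) {
            for (Map.Entry<String, String> entry : candidate.entrySet()) {
                // putIfAbsent returns the earlier value if one exists (values assumed non-null).
                String existing = fields.putIfAbsent(entry.getKey(), entry.getValue());
                if (existing != null && !existing.equals(entry.getValue())) {
                    conflicts.add(entry.getKey());
                }
            }
        }
        return new Merged(fields, conflicts);
    }
}
```

With this shape, fields that only one source provides are taken over silently, and the dialog only needs to show the entries listed in `conflicts`.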
Challenges
.bib embedded, DOI on first page, different publisher PDFs (LNCS, LNI, IEEE, ACM, ...)
Side notes
Check current drag'n'drop behavior. In 3.8.2, the user was asked whether (s)he wants to create a new entry or link the PDF.
In http://discourse.jabref.org/t/more-control-on-the-duplicate-finder/120/4?u=koppor the tool https://github.com/CrossRef/pdfextract was recommended. At first sight, it can fully replace our PDFContentImporter.
Refs JabRef#7209