GSoC 2021 Improve pdf support in JabRef

Student	Benedikt Tutzer
Organization	JabRef
Primary repository	JabRef/jabref
Project name	Improve pdf support in JabRef
Project mentors	Oliver Kopp and Carl Christian Snethlage
Project page	Google Summer of Code 2021 Project Page
Status	Complete

Project summary

JabRef previously had only limited support for interacting with PDFs: it could read XMP metadata and open linked PDFs. Since PDF is a common format for sharing scientific papers, this needed to be improved. Thanks to the features implemented by Benedikt Tutzer during Google Summer of Code 2021, JabRef users can now:

write XMP metadata to PDFs from the command line
extract PDF metadata
- by sending the PDF to JabRef's Grobid server
- by importing embedded BibTeX files
- by importing a verbatim BibTeX entry given on the first page of the PDF
- by merging the metadata obtained with the methods above, either automatically or using a merge dialog
search the contents of all linked PDF documents

Pull requests to main branch

Project-related work

7814 CLI option to write XMP metadata to pdfs

This expands JabRef's CLI to allow users to write XMP metadata of selected entries in their database to linked PDFs.

2838 Search in PDF Files

Started in May 2017 by Linus Dietz, this PR implements a fulltext search feature based on Apache Lucene. The PR was taken over by Benedikt Tutzer as part of this GSoC project. Tasks done by Benedikt:

Fix and update dependencies
Redefine what fields are indexed
Synchronization of the index with the bib database
- At startup:
  - Add all PDFs to the index that were not indexed before
  - Update all index entries for PDFs that changed since they were indexed
  - Remove all index entries for PDFs that were removed
- During use:
  - Add PDFs that are linked by the user
  - Remove PDFs that are unlinked by the user
Interface to search in the index
Presentation of search results
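
The startup synchronization described above can be sketched as a comparison of two maps. This is a minimal illustration, not JabRef's actual implementation: the class name IndexSyncPlan, the use of file paths as keys, and the use of modification timestamps to detect changes are all assumptions made for the example.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for illustration only; not part of JabRef's code base. */
final class IndexSyncPlan {
    final Set<String> toAdd = new TreeSet<>();
    final Set<String> toUpdate = new TreeSet<>();
    final Set<String> toRemove = new TreeSet<>();

    /**
     * @param indexed file path -> modification time (epoch millis) recorded when the file was indexed
     * @param linked  file path -> current modification time of every PDF linked in the library
     */
    static IndexSyncPlan compute(Map<String, Long> indexed, Map<String, Long> linked) {
        IndexSyncPlan plan = new IndexSyncPlan();
        for (Map.Entry<String, Long> file : linked.entrySet()) {
            Long indexedAt = indexed.get(file.getKey());
            if (indexedAt == null) {
                // Linked but never indexed: add it.
                plan.toAdd.add(file.getKey());
            } else if (file.getValue() > indexedAt) {
                // Modified after it was indexed: re-index it.
                plan.toUpdate.add(file.getKey());
            }
        }
        for (String path : indexed.keySet()) {
            if (!linked.containsKey(path)) {
                // Indexed but no longer linked: drop it from the index.
                plan.toRemove.add(path);
            }
        }
        return plan;
    }
}
```

The same comparison covers the "during use" case: linking a file puts it into toAdd, and unlinking it puts it into toRemove. Applying the plan to the Lucene index is deliberately left out of the sketch.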

7931 Fix broken GroupDialog

This PR fixes an issue introduced with the fulltext search feature.

7980 Fulltext Index: Only index local pdf files

This PR makes sure only local PDF files are added to the index.

7981 Improved progress indication for fulltext-index operations

This PR improves the presentation of the indexing progress.

7989 Improve presentation of fulltext search results

This PR improves how results are presented to the users.

7947 Implement more pdf importers

This PR adds multiple importers that can be used to determine metadata from PDF files:

PdfVerbatimBibTextImporter looks for a verbatim BibTeX entry on the first page of the PDF
PdfEmbeddedBibFileImporter looks for an embedded BibTeX file in the PDF
PdfGrobidMetadataImporter sends the PDF to the web API at http://grobid.jabref.org to determine the metadata using the deep-learning library Grobid
PdfMergeMetadataImporter merges the metadata found by the other importers. If identifiers (DOI or ISBN) were found, metadata is fetched for the identifier as well.
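
One simple way to merge results from several importers is a priority merge, where the first non-blank value found for a field wins. The sketch below illustrates that idea only; it is not the logic of PdfMergeMetadataImporter. The class MetadataMerger and the representation of entries as field-to-value maps are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of JabRef's code base. */
final class MetadataMerger {
    /**
     * Merges candidate entries field by field.
     *
     * @param candidates field maps ordered from highest to lowest priority
     * @return a map where each field holds the first non-blank value found
     */
    static Map<String, String> merge(List<Map<String, String>> candidates) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> candidate : candidates) {
            for (Map.Entry<String, String> field : candidate.entrySet()) {
                String value = field.getValue();
                if (value != null && !value.isBlank()) {
                    // Keep the value from the higher-priority candidate if one exists.
                    merged.putIfAbsent(field.getKey(), value.strip());
                }
            }
        }
        return merged;
    }
}
```

With this scheme, the order of the candidate list determines which importer's value is kept when two importers disagree, and a blank field from a high-priority importer does not hide a usable value from a lower-priority one.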

7963 Remove DOI lookup from PdfContentImporter

As the PdfMergeMetadataImporter now looks up DOI and ISBN anyway, there is no need to do that in the individual importers anymore. This PR removes the DOI lookup from the previously existing PdfContentImporter.

7929 Implement an interface to import PDF metadata from multiple sources (XMP, Grobid, ...)

This implements an n-way merge dialog to allow the user to extract metadata from multiple sources and then select what metadata to store in the database.

8001 Reordered Pdf-Importer priorities

This PR reorders the priorities of the PDF importers.

8002 Preferences for Grobid

This PR makes all interaction with the Grobid server opt-in. This ensures that JabRef does not send PDFs to the web service without the user's clear intent to do so.

8003 Refactor processCitation in GrobidService to match processPdf

Follow-up that improves the unit tests.

More than core scope

7797 Added auto-key-generation task to task-progress

7804 JournalAbbreviation search feature

7907 Removed references to apache commons logging

8006 [PoC] Introduced read/write interface for preferences

This is a proof of concept to change how the passing of preferences objects is handled in JabRef.

Pull requests related to project in other repositories

The Grobid API returns TEI for most requests. We added BibTeX support for the request we use for metadata extraction.

800 Accept application/x-bibtex for processHeaderDocument
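
A request that asks for BibTeX instead of TEI could be built roughly as follows. This is a sketch under stated assumptions, not JabRef's GrobidService: the class GrobidHeaderRequest, the boundary string, and the hand-built multipart body are made up for the example, and the path /api/processHeaderDocument with the form field "input" is assumed from Grobid's REST interface rather than stated on this page. The code only constructs the request; it does not send anything.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not part of JabRef's code base. */
final class GrobidHeaderRequest {
    /**
     * Builds (but does not send) a multipart POST that uploads a PDF and
     * asks the server to answer with BibTeX via the Accept header.
     */
    static HttpRequest build(String baseUrl, String fileName, byte[] pdfBytes) {
        String boundary = "jabref-sketch-boundary"; // arbitrary, assumed not to occur in the PDF
        byte[] head = ("--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"input\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n").getBytes(StandardCharsets.UTF_8);
        byte[] tail = ("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8);

        // Concatenate part header, file content, and closing boundary.
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        body.write(head, 0, head.length);
        body.write(pdfBytes, 0, pdfBytes.length);
        body.write(tail, 0, tail.length);

        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/processHeaderDocument"))
                .header("Accept", "application/x-bibtex") // request BibTeX instead of TEI
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body.toByteArray()))
                .build();
    }
}
```

The resulting request can be sent with java.net.http.HttpClient, for example via HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()). In line with the opt-in preference introduced in PR 8002, such a call should only happen after the user has enabled Grobid.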

Before GSoC

6469 Fix bracket collisions

6443 Implement task progress indicator (and dialog) in the toolbar

6437 Fixed entry duplication on file download

6436 Cleanup dead code

6381 Added a download checkbox to the import dialog

Statistics

Total commits	14
Lines added	3273
Lines removed	505

(For commits made by Benedikt Tutzer during GSoC 2021 to JabRef's main branch only. Commits were squashed before counting.)

Blog posts

Project blogpost: July 04, 2021 – JabRef GSoC’21 Projects
