Add a title guess method to get "better" title (JabRef#12018)

* Add title guess method
* fix unit test
* update unit test to JDK 21 style
* update unit test
* update get title by area
* remove StringUtils.isBlank and add @AllowedToUseAwt
* add unit test
  ToDo:find a minimal pdf for test
* change to get title by font size
  add more unittest
* RemoveTestPrefix
* temp fix the unit test
  I should change the pdf used in importTwiceWorksAsExpected, or my code need to deal with the paper with same font size in AUTHOR and TITLE?
* fix the unit test and open rewrite issue
* remove commented code
* Add 5 more unittest case
* resolve all comments so far
  1. revert the temp change of unit test to original one.
  2. all of title's character should stay together, add a `isFarAway` method to it to pass the unit test for hello world case.
  3. change guess title variable name form old version `titleByPosition` to `titleByFontSize`
  4. remove all commented code.
  5. Add a `@VisibleForTesting` to `getEntryFromPDFContent`.
  6. rewrite the javaDoc for `getEntryFromPDFContent`
  7. fix the logic issue for setting title.
* remove Blank line at start of block
* rename and replace unit test file
* add bib and readme.md
* Update CHANGELOG.md
* rename the file to pass CI
* address all comments
* fix the file name in unit test

---------

Co-authored-by: Oliver Kopp <kopp.dev@gmail.com>
koppor · Oct 30, 2024 · 7232836
1 parent 709386a
commit 7232836

Showing 12 changed files with 285 additions and 23 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -61,6 +61,7 @@ Note that this project **does not** adhere to [Semantic Versioning](https://semv
 - ⚠️ We relaxed the escaping requirements for [bracketed patterns](https://docs.jabref.org/setup/citationkeypatterns), which are used for the [citation key generator](https://docs.jabref.org/advanced/entryeditor#autogenerate-citation-key) and [filename and directory patterns](https://docs.jabref.org/finding-sorting-and-cleaning-entries/filelinks#auto-linking-files). One only needs to write `\"` if a quote sign should be escaped. All other escapings are not necessary (and working) any more. [#11967](https://github.com/JabRef/jabref/pull/11967)
 - When importing BibTeX data starting from a PDF, the XMP metadata takes precedence over Grobid data. [#11992](https://github.com/JabRef/jabref/pull/11992)
 - JabRef now uses TLS 1.2 for all HTTPS connections. [#11852](https://github.com/JabRef/jabref/pull/11852)
+- We improved the functionality of getting BibTeX data out of PDF files. [#11999](https://github.com/JabRef/jabref/issues/11999)
 - We improved the display of long messages in the integrity check dialog. [#11619](https://github.com/JabRef/jabref/pull/11619)
 - We improved the undo/redo buttons in the main toolbar and main menu to be disabled when there is nothing to undo/redo. [#8807](https://github.com/JabRef/jabref/issues/8807)
 - We improved the DOI detection in PDF imports. [#11782](https://github.com/JabRef/jabref/pull/11782)

diff --git a/src/main/java/org/jabref/logic/importer/fileformat/PdfContentImporter.java b/src/main/java/org/jabref/logic/importer/fileformat/PdfContentImporter.java
@@ -26,9 +26,13 @@
 import org.jabref.model.entry.types.StandardEntryType;
 import org.jabref.model.strings.StringUtil;
 
+import com.google.common.annotations.VisibleForTesting;
 import com.google.common.base.Strings;
 import org.apache.pdfbox.pdmodel.PDDocument;
 import org.apache.pdfbox.text.PDFTextStripper;
+import org.apache.pdfbox.text.TextPosition;
+
+import static org.jabref.model.strings.StringUtil.isNullOrEmpty;
 
 /**
  * PdfContentImporter parses data of the first page of the PDF and creates a BibTeX entry.
@@ -196,7 +200,8 @@ public ParserResult importDatabase(Path filePath) {
         List<BibEntry> result = new ArrayList<>(1);
         try (PDDocument document = new XmpUtilReader().loadWithAutomaticDecryption(filePath)) {
             String firstPageContents = getFirstPageContents(document);
-            Optional<BibEntry> entry = getEntryFromPDFContent(firstPageContents, OS.NEWLINE);
+            String titleByFontSize = extractTitleFromDocument(document);
+            Optional<BibEntry> entry = getEntryFromPDFContent(firstPageContents, OS.NEWLINE, titleByFontSize);
             entry.ifPresent(result::add);
         } catch (EncryptedPdfsNotSupportedException e) {
             return ParserResult.fromErrorMessage(Localization.lang("Decryption not supported."));
@@ -208,17 +213,120 @@ public ParserResult importDatabase(Path filePath) {
         return new ParserResult(result);
     }
 
-    // make this method package visible so we can test it
-    Optional<BibEntry> getEntryFromPDFContent(String firstpageContents, String lineSeparator) {
-        // idea: split[] contains the different lines
-        // blocks are separated by empty lines
-        // treat each block
-        //   or do special treatment at authors (which are not broken)
-        //   therefore, we do a line-based and not a block-based splitting
-        // i points to the current line
-        // curString (mostly) contains the current block
-        //   the different lines are joined into one and thereby separated by " "
+    private static String extractTitleFromDocument(PDDocument document) throws IOException {
+        TitleExtractorByFontSize stripper = new TitleExtractorByFontSize();
+        return stripper.getTitleFromFirstPage(document);
+    }
+
+    private static class TitleExtractorByFontSize extends PDFTextStripper {
+
+        private final List<TextPosition> textPositionsList;
+
+        public TitleExtractorByFontSize() {
+            super();
+            this.textPositionsList = new ArrayList<>();
+        }
+
+        public String getTitleFromFirstPage(PDDocument document) throws IOException {
+            this.setStartPage(1);
+            this.setEndPage(1);
+            this.writeText(document, new StringWriter());
+            return findLargestFontText(textPositionsList);
+        }
+
+        @Override
+        protected void writeString(String text, List<TextPosition> textPositions) {
+            textPositionsList.addAll(textPositions);
+        }
+
+        private boolean isFarAway(TextPosition previous, TextPosition current) {
+            float XspaceThreshold = 3.0F;
+            float YspaceThreshold = previous.getFontSizeInPt() * 1.5F;
+            float Xgap = current.getXDirAdj() - (previous.getXDirAdj() + previous.getWidthDirAdj());
+            float Ygap = current.getYDirAdj() - (previous.getYDirAdj() - previous.getHeightDir());
+            return Xgap > XspaceThreshold && Ygap > YspaceThreshold;
+        }
+
+        private boolean isUnwantedText(TextPosition previousTextPosition, TextPosition textPosition) {
+            if (textPosition == null || previousTextPosition == null) {
+                return false;
+            }
+            // The title is usually not in the bottom 10% of a page.
+            if ((textPosition.getPageHeight() - textPosition.getYDirAdj())
+                    < (textPosition.getPageHeight() * 0.1)) {
+                return true;
+            }
+            // The characters of a title usually stay close together.
+            return isFarAway(previousTextPosition, textPosition);
+        }
+
+        private String findLargestFontText(List<TextPosition> textPositions) {
+            float maxFontSize = 0;
+            StringBuilder largestFontText = new StringBuilder();
+            TextPosition previousTextPosition = null;
+            for (TextPosition textPosition : textPositions) {
+                // Exclude unwanted text based on heuristics
+                if (isUnwantedText(previousTextPosition, textPosition)) {
+                    continue;
+                }
+                float fontSize = textPosition.getFontSizeInPt();
+                if (fontSize > maxFontSize) {
+                    maxFontSize = fontSize;
+                    largestFontText.setLength(0);
+                    largestFontText.append(textPosition.getUnicode());
+                    previousTextPosition = textPosition;
+                } else if (fontSize == maxFontSize) {
+                    if (previousTextPosition != null) {
+                        if (isThereSpace(previousTextPosition, textPosition)) {
+                            largestFontText.append(" ");
+                        }
+                    }
+                    largestFontText.append(textPosition.getUnicode());
+                    previousTextPosition = textPosition;
+                }
+            }
+            return largestFontText.toString().trim();
+        }
+
+        private boolean isThereSpace(TextPosition previous, TextPosition current) {
+            float XspaceThreshold = 0.5F;
+            float YspaceThreshold = previous.getFontSizeInPt();
+            float Xgap = current.getXDirAdj() - (previous.getXDirAdj() + previous.getWidthDirAdj());
+            float Ygap = current.getYDirAdj() - (previous.getYDirAdj() - previous.getHeightDir());
+            return Xgap > XspaceThreshold || Ygap > YspaceThreshold;
+        }
+    }
 
+    /**
+     * Parses the first page content of a PDF document and extracts bibliographic information such as title, author,
+     * abstract, keywords, and other relevant metadata. This method processes the content line-by-line and uses
+     * custom parsing logic to identify and assemble information blocks from academic papers.
+     *
+     * Idea: split[] contains the different lines. Blocks are separated by empty lines. We treat each block,
+     *       or do special treatment at authors (which are not broken).
+     *       Therefore, we do a line-based and not a block-based splitting. i points to the current line.
+     *       curString (mostly) contains the current block;
+     *       the different lines are joined into one and thereby separated by " ".
+     *
+     * <p> This method follows the structure typically found in academic paper PDFs:
+     * - First, it attempts to detect the title by font size, if available, or by text position.
+     * - Authors are then processed line-by-line until reaching the next section.
+     * - Abstract and keywords, if found, are extracted as they appear on the page.
+     * - Finally, conference details, DOI, and publication information are parsed from the lower blocks.
+     *
+     * <p> The parsing logic also identifies and categorizes entries based on keywords such as "Abstract" or "Keywords"
+     * and specific terms that denote sections. Additionally, this method can handle
+     * publisher-specific formats like Springer or IEEE, extracting data like series, volume, and conference titles.
+     *
+     * @param firstpageContents The raw content of the PDF's first page, which may contain metadata and main content.
+     * @param lineSeparator     The line separator used to format and unify line breaks in the text content.
+     * @param titleByFontSize   An optional title string determined by font size; if provided, this overrides the
+     *                          default title parsing.
+     * @return An {@link Optional} containing a {@link BibEntry} with the parsed bibliographic data if extraction
+     *         is successful. Otherwise, an empty {@link Optional}.
+     */
+    @VisibleForTesting
+    Optional<BibEntry> getEntryFromPDFContent(String firstpageContents, String lineSeparator, String titleByFontSize) {
         String firstpageContentsUnifiedLineBreaks = StringUtil.unifyLineBreaks(firstpageContents, lineSeparator);
 
         lines = firstpageContentsUnifiedLineBreaks.split(lineSeparator);
@@ -275,8 +383,11 @@ Optional<BibEntry> getEntryFromPDFContent(String firstpageContents, String lineS
         // start: title
         fillCurStringWithNonEmptyLines();
         title = streamlineTitle(curString);
-        curString = "";
         // i points to the next non-empty line
+        curString = "";
+        if (!isNullOrEmpty(titleByFontSize)) {
+            title = titleByFontSize;
+        }
 
         // after title: authors
         author = null;
@@ -393,13 +504,6 @@ Optional<BibEntry> getEntryFromPDFContent(String firstpageContents, String lineS
                     // IEEE has the conference things at the end
                     publisher = "IEEE";
 
-                    // year is extracted by extractYear
-                    // otherwise, we could it determine as follows:
-                    // String yearStr = curString.substring(curString.length()-4);
-                    // if (isYear(yearStr)) {
-                    //  year = yearStr;
-                    // }
-
                     if (conference == null) {
                         pos = curString.indexOf('$');
                         if (pos > 0) {
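
The font-size heuristic that `findLargestFontText` applies can be illustrated without PDFBox. The following is a minimal sketch under stated assumptions, not the code from this commit: the `Glyph` record and the class name `TitleByFontSizeSketch` are hypothetical stand-ins for `TextPosition` and the extractor class, only the horizontal gap is considered, and the vertical-gap check and the `isUnwantedText` filter are left out.

```java
import java.util.List;

public class TitleByFontSizeSketch {

    // Hypothetical stand-in for PDFBox's TextPosition: one glyph with its
    // font size, left edge (x) and width, all in points.
    record Glyph(String text, float fontSize, float x, float width) { }

    // Keeps only the text set in the largest font seen so far. Whenever a
    // larger font shows up, everything collected before it is discarded.
    static String largestFontText(List<Glyph> glyphs) {
        float maxFontSize = 0;
        StringBuilder result = new StringBuilder();
        Glyph previous = null;
        for (Glyph glyph : glyphs) {
            if (glyph.fontSize() > maxFontSize) {
                maxFontSize = glyph.fontSize();
                result.setLength(0);
                result.append(glyph.text());
                previous = glyph;
            } else if (glyph.fontSize() == maxFontSize) {
                // Assumed threshold of 0.5pt, mirroring the X threshold in isThereSpace.
                if (previous != null
                        && glyph.x() - (previous.x() + previous.width()) > 0.5F) {
                    result.append(' ');
                }
                result.append(glyph.text());
                previous = glyph;
            }
        }
        return result.toString().trim();
    }

    public static void main(String[] args) {
        List<Glyph> page = List.of(
                new Glyph("x", 8F, 0F, 4F),    // small header text, discarded later
                new Glyph("H", 20F, 10F, 10F), // largest font starts here
                new Glyph("i", 20F, 20F, 5F),  // touches "H", no space inserted
                new Glyph("y", 20F, 30F, 10F), // 5pt gap after "i", space inserted
                new Glyph("a", 10F, 45F, 5F)); // smaller font, ignored
        System.out.println(largestFontText(page)); // prints "Hi y"
    }
}
```

Resetting the buffer on every new maximum means a single oversized glyph, such as a drop cap or a logo letter, would win over the real title. This is presumably why the commit adds the position-based `isUnwantedText` filter on top of the plain font-size comparison.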

diff --git a/src/test/java/org/jabref/logic/importer/fileformat/PdfContentImporterTest.java b/src/test/java/org/jabref/logic/importer/fileformat/PdfContentImporterTest.java
@@ -2,20 +2,25 @@
 
 import java.nio.file.Path;
 import java.util.List;
+import java.util.Objects;
 import java.util.Optional;
+import java.util.stream.Stream;
 
 import org.jabref.model.entry.BibEntry;
 import org.jabref.model.entry.LinkedFile;
 import org.jabref.model.entry.field.StandardField;
 import org.jabref.model.entry.types.StandardEntryType;
 
 import org.junit.jupiter.api.Test;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
 
 import static org.junit.jupiter.api.Assertions.assertEquals;
 
 class PdfContentImporterTest {
 
-    private PdfContentImporter importer = new PdfContentImporter();
+    private final PdfContentImporter importer = new PdfContentImporter();
 
     @Test
     void doesNotHandleEncryptedPdfs() throws Exception {
@@ -65,7 +70,7 @@ void parsingEditorWithoutPagesorSeriesInformation() {
                 Corpus linguistics investigates human language by starting out from large
                 """;
 
-        assertEquals(Optional.of(entry), importer.getEntryFromPDFContent(firstPageContents, "\n"));
+        assertEquals(Optional.of(entry), importer.getEntryFromPDFContent(firstPageContents, "\n", ""));
     }
 
     @Test
@@ -88,7 +93,7 @@ Smith, Lucy Anna (2014) Mortality in the Ornamental Fish Retail Sector: an Analy
                 UNSPECIFIED
                 Master of Research (MRes) thesis, University of Kent,.""";
 
-        assertEquals(Optional.of(entry), importer.getEntryFromPDFContent(firstPageContents, "\n"));
+        assertEquals(Optional.of(entry), importer.getEntryFromPDFContent(firstPageContents, "\n", ""));
     }
 
     @Test
@@ -121,6 +126,29 @@ British Journal of Nutrition (2008), 99, 1–11 doi: 10.1017/S0007114507795296
                 British Journal of Nutrition
                 https://doi.org/10.1017/S0007114507795296 Published online by Cambridge University Press""";
 
-        assertEquals(Optional.of(entry), importer.getEntryFromPDFContent(firstPageContent, "\n"));
+        assertEquals(Optional.of(entry), importer.getEntryFromPDFContent(firstPageContent, "\n", ""));
+    }
+
+    @ParameterizedTest
+    @MethodSource("providePdfData")
+    void pdfTitleExtraction(String expectedTitle, String filePath) throws Exception {
+        Path file = Path.of(Objects.requireNonNull(PdfContentImporter.class.getResource(filePath)).toURI());
+        List<BibEntry> result = importer.importDatabase(file).getDatabase().getEntries();
+        assertEquals(Optional.of(expectedTitle), result.getFirst().getTitle());
+    }
+
+    private static Stream<Arguments> providePdfData() {
+        return Stream.of(
+                Arguments.of("On How We Can Teach – Exploring New Ways in Professional Software Development for Students", "/pdfs/PdfContentImporter/Kriha2018.pdf"),
+                Arguments.of("JabRef Example for Reference Parsing", "/pdfs/IEEE/ieee-paper.pdf"),
+                Arguments.of("Paper Title", "/org/jabref/logic/importer/util/LNCS-minimal.pdf"),
+                Arguments.of("Is Oil the future?", "/pdfs/example-scientificThesisTemplate.pdf"),
+                Arguments.of("Thesis Title", "/pdfs/thesis-example.pdf"),
+                Arguments.of("Recovering Trace Links Between Software Documentation And Code", "/pdfs/PdfContentImporter/Keim2024.pdf"),
+                Arguments.of("On the impact of service-oriented patterns on software evolvability: a controlled experiment and metric-based analysis", "/pdfs/PdfContentImporter/Bogner2019.pdf"),
+                Arguments.of("Pandemic programming", "/pdfs/PdfContentImporter/Ralph2020.pdf"),
+                Arguments.of("Do RESTful API design rules have an impact on the understandability of Web APIs?", "/pdfs/PdfContentImporter/Bogner2023.pdf"),
+                Arguments.of("Adopting microservices and DevOps in the cyber-physical systems domain: A rapid review and case study", "/pdfs/PdfContentImporter/Fritzsch2022.pdf")
+        );
     }
 }
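
The existing tests pass `""` as the third argument to `getEntryFromPDFContent`, which keeps the title found by the line-based parser. The precedence rule can be sketched as follows. This is a simplified illustration, assuming a hypothetical helper `chooseTitle` and class `TitleFallbackSketch` that do not exist in JabRef; the real method assigns the `title` variable inline.

```java
public class TitleFallbackSketch {

    // Hypothetical helper: the font-size title wins whenever it is non-empty,
    // otherwise the title from the line-based parsing is kept.
    static String chooseTitle(String titleFromLines, String titleByFontSize) {
        if (titleByFontSize == null || titleByFontSize.isEmpty()) {
            return titleFromLines;
        }
        return titleByFontSize;
    }

    public static void main(String[] args) {
        // Empty font-size title, as in the unit tests above: falls back.
        System.out.println(chooseTitle("Title From Lines", ""));
        // Non-empty font-size title: overrides the line-based title.
        System.out.println(chooseTitle("Title From Lines", "Paper Title"));
    }
}
```

The first call prints `Title From Lines` and the second prints `Paper Title`.
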
diff --git a/src/test/resources/pdfs/PdfContentImporter/Bogner2019.pdf b/src/test/resources/pdfs/PdfContentImporter/Bogner2019.pdf
diff --git a/src/test/resources/pdfs/PdfContentImporter/Bogner2023.pdf b/src/test/resources/pdfs/PdfContentImporter/Bogner2023.pdf
diff --git a/src/test/resources/pdfs/PdfContentImporter/Fritzsch2022.pdf b/src/test/resources/pdfs/PdfContentImporter/Fritzsch2022.pdf
diff --git a/src/test/resources/pdfs/PdfContentImporter/Keim2024.pdf b/src/test/resources/pdfs/PdfContentImporter/Keim2024.pdf
diff --git a/src/test/resources/pdfs/PdfContentImporter/Kriha2018.pdf b/src/test/resources/pdfs/PdfContentImporter/Kriha2018.pdf
diff --git a/src/test/resources/pdfs/PdfContentImporter/Ralph2020.pdf b/src/test/resources/pdfs/PdfContentImporter/Ralph2020.pdf