
Commit

Add a title guess method to get "better" title (JabRef#12018)
* Add title guess method

Add title guess method

* fix unit test

fix unit test

* update unit test to JDK 21 style

update unit test to JDK 21 style

* update unit test

update unit test

* update get title by area

update get title by area

* remove StringUtils.isBlank and add @AllowedToUseAwt

remove StringUtils.isBlank and add @AllowedToUseAwt

* add unit test

ToDo: find a minimal PDF for the test

* change to get title by font size

change to get title by font size
add more unittest

* RemoveTestPrefix

RemoveTestPrefix

* temp fix the unit test

Should I change the PDF used in importTwiceWorksAsExpected, or does my code need to handle papers where AUTHOR and TITLE have the same font size?

* fix the unit test and open rewrite issue

fix the unit test and open rewrite issue

* remove commented code

remove commented code

* Add 5 more unittest case

Add 5 more unittest case

* resolve all comments so far

1. revert the temp change of unit test to original one.
2. all of the title's characters should stay together; add an `isFarAway` method to pass the unit test for the hello world case.
3. change the guessed-title variable name from the old `titleByPosition` to `titleByFontSize`
4. remove all commented code.
5. Add a `@VisibleForTesting` to `getEntryFromPDFContent`.
6. rewrite the javaDoc for `getEntryFromPDFContent`
7. fix the logic issue for setting title.

* remove Blank line at start of block

remove Blank line at start of block

* rename and replace unit test file

rename and replace unit test file

* add bib and readme.md

add bib and readme.md

* Update CHANGELOG.md

* rename the file to pass CI

rename the file to pass CI

* address all comments

address all comments

* fix the file name in unit test

fix the file name in unit test

---------

Co-authored-by: Oliver Kopp <kopp.dev@gmail.com>
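
The "get title by font size" step described in the commit message can be illustrated without PDFBox. The following is only a minimal sketch under stated assumptions: `Glyph` is a hypothetical stand-in for PDFBox's `TextPosition`, and the class and method names are made up for illustration. It leaves out the position-based filtering and the space insertion that the commit also performs.

```java
import java.util.List;

final class LargestFontTitleSketch {

    /** Hypothetical stand-in for a PDF glyph: its text and its font size in points. */
    record Glyph(String text, float fontSize) { }

    /** Concatenates the text of all glyphs that share the largest font size seen. */
    static String largestFontText(List<Glyph> glyphs) {
        float maxFontSize = 0;
        StringBuilder result = new StringBuilder();
        for (Glyph glyph : glyphs) {
            if (glyph.fontSize() > maxFontSize) {
                // A larger font was found: discard everything collected so far.
                maxFontSize = glyph.fontSize();
                result.setLength(0);
                result.append(glyph.text());
            } else if (glyph.fontSize() == maxFontSize) {
                // Same size as the current maximum: extend the candidate title.
                result.append(glyph.text());
            }
        }
        return result.toString().trim();
    }

    public static void main(String[] args) {
        List<Glyph> page = List.of(
                new Glyph("Preprint ", 8f),
                new Glyph("Paper ", 17f),
                new Glyph("Title", 17f),
                new Glyph("Jane Doe", 12f));
        System.out.println(largestFontText(page)); // prints "Paper Title"
    }
}
```

The sketch also shows the limitation raised earlier in the commit message: if the author line uses the same font size as the title, both end up in the result unless an additional heuristic separates them.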
leaf-soba and koppor authored Oct 30, 2024
1 parent 709386a commit 7232836
Showing 12 changed files with 285 additions and 23 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -61,6 +61,7 @@ Note that this project **does not** adhere to [Semantic Versioning](https://semv
- ⚠️ We relaxed the escaping requirements for [bracketed patterns](https://docs.jabref.org/setup/citationkeypatterns), which are used for the [citation key generator](https://docs.jabref.org/advanced/entryeditor#autogenerate-citation-key) and [filename and directory patterns](https://docs.jabref.org/finding-sorting-and-cleaning-entries/filelinks#auto-linking-files). One only needs to write `\"` if a quote sign should be escaped. All other escapings are not necessary (and working) any more. [#11967](https://github.com/JabRef/jabref/pull/11967)
- When importing BibTeX data starting from a PDF, the XMP metadata takes precedence over Grobid data. [#11992](https://github.com/JabRef/jabref/pull/11992)
- JabRef now uses TLS 1.2 for all HTTPS connections. [#11852](https://github.com/JabRef/jabref/pull/11852)
- We improved the functionality of getting BibTeX data out of PDF files. [#11999](https://github.com/JabRef/jabref/issues/11999)
- We improved the display of long messages in the integrity check dialog. [#11619](https://github.com/JabRef/jabref/pull/11619)
- We improved the undo/redo buttons in the main toolbar and main menu to be disabled when there is nothing to undo/redo. [#8807](https://github.com/JabRef/jabref/issues/8807)
- We improved the DOI detection in PDF imports. [#11782](https://github.com/JabRef/jabref/pull/11782)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -26,9 +26,13 @@
import org.jabref.model.entry.types.StandardEntryType;
import org.jabref.model.strings.StringUtil;

import com.google.common.annotations.VisibleForTesting;
import com.google.common.base.Strings;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

import static org.jabref.model.strings.StringUtil.isNullOrEmpty;

/**
* PdfContentImporter parses data of the first page of the PDF and creates a BibTeX entry.
Expand Down Expand Up @@ -196,7 +200,8 @@ public ParserResult importDatabase(Path filePath) {
List<BibEntry> result = new ArrayList<>(1);
try (PDDocument document = new XmpUtilReader().loadWithAutomaticDecryption(filePath)) {
String firstPageContents = getFirstPageContents(document);
Optional<BibEntry> entry = getEntryFromPDFContent(firstPageContents, OS.NEWLINE);
String titleByFontSize = extractTitleFromDocument(document);
Optional<BibEntry> entry = getEntryFromPDFContent(firstPageContents, OS.NEWLINE, titleByFontSize);
entry.ifPresent(result::add);
} catch (EncryptedPdfsNotSupportedException e) {
return ParserResult.fromErrorMessage(Localization.lang("Decryption not supported."));
Expand All @@ -208,17 +213,120 @@ public ParserResult importDatabase(Path filePath) {
return new ParserResult(result);
}

// make this method package visible so we can test it
Optional<BibEntry> getEntryFromPDFContent(String firstpageContents, String lineSeparator) {
// idea: split[] contains the different lines
// blocks are separated by empty lines
// treat each block
// or do special treatment at authors (which are not broken)
// therefore, we do a line-based and not a block-based splitting
// i points to the current line
// curString (mostly) contains the current block
// the different lines are joined into one and thereby separated by " "
private static String extractTitleFromDocument(PDDocument document) throws IOException {
TitleExtractorByFontSize stripper = new TitleExtractorByFontSize();
return stripper.getTitleFromFirstPage(document);
}

private static class TitleExtractorByFontSize extends PDFTextStripper {

private final List<TextPosition> textPositionsList;

public TitleExtractorByFontSize() {
super();
this.textPositionsList = new ArrayList<>();
}

public String getTitleFromFirstPage(PDDocument document) throws IOException {
this.setStartPage(1);
this.setEndPage(1);
this.writeText(document, new StringWriter());
return findLargestFontText(textPositionsList);
}

@Override
protected void writeString(String text, List<TextPosition> textPositions) {
textPositionsList.addAll(textPositions);
}

private boolean isFarAway(TextPosition previous, TextPosition current) {
float xSpaceThreshold = 3.0F;
float ySpaceThreshold = previous.getFontSizeInPt() * 1.5F;
float xGap = current.getXDirAdj() - (previous.getXDirAdj() + previous.getWidthDirAdj());
float yGap = current.getYDirAdj() - (previous.getYDirAdj() - previous.getHeightDir());
return xGap > xSpaceThreshold && yGap > ySpaceThreshold;
}

private boolean isUnwantedText(TextPosition previousTextPosition, TextPosition textPosition) {
if (textPosition == null || previousTextPosition == null) {
return false;
}
// The title is usually not in the bottom 10% of a page.
if ((textPosition.getPageHeight() - textPosition.getYDirAdj())
< (textPosition.getPageHeight() * 0.1)) {
return true;
}
// The characters of a title usually stay close together.
return isFarAway(previousTextPosition, textPosition);
}

private String findLargestFontText(List<TextPosition> textPositions) {
float maxFontSize = 0;
StringBuilder largestFontText = new StringBuilder();
TextPosition previousTextPosition = null;
for (TextPosition textPosition : textPositions) {
// Exclude unwanted text based on heuristics
if (isUnwantedText(previousTextPosition, textPosition)) {
continue;
}
float fontSize = textPosition.getFontSizeInPt();
if (fontSize > maxFontSize) {
maxFontSize = fontSize;
largestFontText.setLength(0);
largestFontText.append(textPosition.getUnicode());
previousTextPosition = textPosition;
} else if (fontSize == maxFontSize) {
if (previousTextPosition != null) {
if (isThereSpace(previousTextPosition, textPosition)) {
largestFontText.append(" ");
}
}
largestFontText.append(textPosition.getUnicode());
previousTextPosition = textPosition;
}
}
return largestFontText.toString().trim();
}

private boolean isThereSpace(TextPosition previous, TextPosition current) {
float xSpaceThreshold = 0.5F;
float ySpaceThreshold = previous.getFontSizeInPt();
float xGap = current.getXDirAdj() - (previous.getXDirAdj() + previous.getWidthDirAdj());
float yGap = current.getYDirAdj() - (previous.getYDirAdj() - previous.getHeightDir());
return xGap > xSpaceThreshold || yGap > ySpaceThreshold;
}
}

/**
* Parses the first page content of a PDF document and extracts bibliographic information such as title, author,
* abstract, keywords, and other relevant metadata. This method processes the content line-by-line and uses
* custom parsing logic to identify and assemble information blocks from academic papers.
*
* Idea: split[] contains the different lines; blocks are separated by empty lines. We treat each block,
* or do special treatment for authors (which are not broken).
* Therefore, we do line-based and not block-based splitting. i points to the current line.
* curString (mostly) contains the current block;
* the different lines are joined into one and thereby separated by " ".
*
* <p> This method follows the structure typically found in academic paper PDFs:
* - First, it attempts to detect the title by font size, if available, or by text position.
* - Authors are then processed line-by-line until reaching the next section.
* - Abstract and keywords, if found, are extracted as they appear on the page.
* - Finally, conference details, DOI, and publication information are parsed from the lower blocks.
*
* <p> The parsing logic also identifies and categorizes entries based on keywords such as "Abstract" or "Keywords"
* and specific terms that denote sections. Additionally, this method can handle
* publisher-specific formats like Springer or IEEE, extracting data like series, volume, and conference titles.
*
* @param firstpageContents The raw content of the PDF's first page, which may contain metadata and main content.
* @param lineSeparator The line separator used to format and unify line breaks in the text content.
* @param titleByFontSize An optional title string determined by font size; if provided, this overrides the
* default title parsing.
* @return An {@link Optional} containing a {@link BibEntry} with the parsed bibliographic data if extraction
* is successful. Otherwise, an empty {@link Optional}.
*/
@VisibleForTesting
Optional<BibEntry> getEntryFromPDFContent(String firstpageContents, String lineSeparator, String titleByFontSize) {
String firstpageContentsUnifiedLineBreaks = StringUtil.unifyLineBreaks(firstpageContents, lineSeparator);

lines = firstpageContentsUnifiedLineBreaks.split(lineSeparator);
Expand Down Expand Up @@ -275,8 +383,11 @@ Optional<BibEntry> getEntryFromPDFContent(String firstpageContents, String lineS
// start: title
fillCurStringWithNonEmptyLines();
title = streamlineTitle(curString);
curString = "";
// i points to the next non-empty line
curString = "";
if (!isNullOrEmpty(titleByFontSize)) {
title = titleByFontSize;
}

// after title: authors
author = null;
Expand Down Expand Up @@ -393,13 +504,6 @@ Optional<BibEntry> getEntryFromPDFContent(String firstpageContents, String lineS
// IEEE has the conference things at the end
publisher = "IEEE";

// year is extracted by extractYear
// otherwise, we could it determine as follows:
// String yearStr = curString.substring(curString.length()-4);
// if (isYear(yearStr)) {
// year = yearStr;
// }

if (conference == null) {
pos = curString.indexOf('$');
if (pos > 0) {
Expand Down
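
The gap test in `isThereSpace` above decides whether a space belongs between two consecutive glyphs. The following is a minimal sketch of that check, assuming a hypothetical `Box` record in place of PDFBox's `TextPosition`. The gap formulas and thresholds (0.5 horizontally, one font size vertically) are copied from the diff. The sample coordinates are invented for illustration.

```java
final class GlyphGapSketch {

    /** Hypothetical stand-in for the TextPosition values used in the diff. */
    record Box(float x, float y, float width, float height, float fontSize) { }

    /** Mirrors isThereSpace: a space is needed if either gap exceeds its threshold. */
    static boolean needsSpace(Box previous, Box current) {
        float xSpaceThreshold = 0.5F;
        float ySpaceThreshold = previous.fontSize();
        float xGap = current.x() - (previous.x() + previous.width());
        float yGap = current.y() - (previous.y() - previous.height());
        return xGap > xSpaceThreshold || yGap > ySpaceThreshold;
    }

    public static void main(String[] args) {
        Box previous = new Box(100f, 200f, 6f, 8f, 12f);
        Box sameWord = new Box(106.1f, 200f, 6f, 8f, 12f); // horizontal gap of about 0.1
        Box nextWord = new Box(109f, 200f, 6f, 8f, 12f);   // horizontal gap of 3
        Box nextLine = new Box(50f, 214f, 6f, 8f, 12f);    // vertical gap of 22

        System.out.println(needsSpace(previous, sameWord)); // false
        System.out.println(needsSpace(previous, nextWord)); // true
        System.out.println(needsSpace(previous, nextLine)); // true
    }
}
```

`isFarAway` uses the same two gaps but combines them with `&&` and larger thresholds (3.0 horizontally, 1.5 font sizes vertically), so it only rejects a glyph that is distant in both directions.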
Original file line number Diff line number Diff line change
Expand Up @@ -2,20 +2,25 @@

import java.nio.file.Path;
import java.util.List;
import java.util.Objects;
import java.util.Optional;
import java.util.stream.Stream;

import org.jabref.model.entry.BibEntry;
import org.jabref.model.entry.LinkedFile;
import org.jabref.model.entry.field.StandardField;
import org.jabref.model.entry.types.StandardEntryType;

import org.junit.jupiter.api.Test;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;

import static org.junit.jupiter.api.Assertions.assertEquals;

class PdfContentImporterTest {

private PdfContentImporter importer = new PdfContentImporter();
private final PdfContentImporter importer = new PdfContentImporter();

@Test
void doesNotHandleEncryptedPdfs() throws Exception {
Expand Down Expand Up @@ -65,7 +70,7 @@ void parsingEditorWithoutPagesorSeriesInformation() {
Corpus linguistics investigates human language by starting out from large
""";

assertEquals(Optional.of(entry), importer.getEntryFromPDFContent(firstPageContents, "\n"));
assertEquals(Optional.of(entry), importer.getEntryFromPDFContent(firstPageContents, "\n", ""));
}

@Test
Expand All @@ -88,7 +93,7 @@ Smith, Lucy Anna (2014) Mortality in the Ornamental Fish Retail Sector: an Analy
UNSPECIFIED
Master of Research (MRes) thesis, University of Kent,.""";

assertEquals(Optional.of(entry), importer.getEntryFromPDFContent(firstPageContents, "\n"));
assertEquals(Optional.of(entry), importer.getEntryFromPDFContent(firstPageContents, "\n", ""));
}

@Test
Expand Down Expand Up @@ -121,6 +126,29 @@ British Journal of Nutrition (2008), 99, 1–11 doi: 10.1017/S0007114507795296
British Journal of Nutrition
https://doi.org/10.1017/S0007114507795296 Published online by Cambridge University Press""";

assertEquals(Optional.of(entry), importer.getEntryFromPDFContent(firstPageContent, "\n"));
assertEquals(Optional.of(entry), importer.getEntryFromPDFContent(firstPageContent, "\n", ""));
}

@ParameterizedTest
@MethodSource("providePdfData")
void pdfTitleExtraction(String expectedTitle, String filePath) throws Exception {
Path file = Path.of(Objects.requireNonNull(PdfContentImporter.class.getResource(filePath)).toURI());
List<BibEntry> result = importer.importDatabase(file).getDatabase().getEntries();
assertEquals(Optional.of(expectedTitle), result.getFirst().getTitle());
}

private static Stream<Arguments> providePdfData() {
return Stream.of(
Arguments.of("On How We Can Teach – Exploring New Ways in Professional Software Development for Students", "/pdfs/PdfContentImporter/Kriha2018.pdf"),
Arguments.of("JabRef Example for Reference Parsing", "/pdfs/IEEE/ieee-paper.pdf"),
Arguments.of("Paper Title", "/org/jabref/logic/importer/util/LNCS-minimal.pdf"),
Arguments.of("Is Oil the future?", "/pdfs/example-scientificThesisTemplate.pdf"),
Arguments.of("Thesis Title", "/pdfs/thesis-example.pdf"),
Arguments.of("Recovering Trace Links Between Software Documentation And Code", "/pdfs/PdfContentImporter/Keim2024.pdf"),
Arguments.of("On the impact of service-oriented patterns on software evolvability: a controlled experiment and metric-based analysis", "/pdfs/PdfContentImporter/Bogner2019.pdf"),
Arguments.of("Pandemic programming", "/pdfs/PdfContentImporter/Ralph2020.pdf"),
Arguments.of("Do RESTful API design rules have an impact on the understandability of Web APIs?", "/pdfs/PdfContentImporter/Bogner2023.pdf"),
Arguments.of("Adopting microservices and DevOps in the cyber-physical systems domain: A rapid review and case study", "/pdfs/PdfContentImporter/Fritzsch2022.pdf")
);
}
}
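
The three pre-existing tests above pass `""` as the new third argument, so for them the title parsed from the text is kept. The precedence rule can be sketched in isolation as follows. This is an illustration only: `chooseTitle` is a hypothetical helper, and the null-or-empty check stands in for `StringUtil.isNullOrEmpty`, whose exact behavior is assumed here.

```java
final class TitlePrecedenceSketch {

    /** Prefers the font-size-based title; falls back to the title parsed from the text. */
    static String chooseTitle(String titleFromText, String titleByFontSize) {
        // Assumption: mirrors !isNullOrEmpty(titleByFontSize) from the diff.
        if (titleByFontSize != null && !titleByFontSize.isEmpty()) {
            return titleByFontSize;
        }
        return titleFromText;
    }

    public static void main(String[] args) {
        System.out.println(chooseTitle("Parsed Title", ""));          // prints "Parsed Title"
        System.out.println(chooseTitle("Parsed Title", "Big Title")); // prints "Big Title"
    }
}
```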
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.