Search in PDF Files #2838

Merged
merged 122 commits into from
Jul 14, 2021
Changes from 1 commit
321a0f0
Add lucene to gradle.build dependencies
Braunch Aug 2, 2016
2580757
Implement first attempt on a lucene powered indexed search machine f…
Braunch Aug 2, 2016
a9d85dc
Ignore test and improve search handler and content reader to map bibt…
Braunch Aug 4, 2016
753c25d
Create search result wrapper classes for better readability of future…
Braunch Aug 4, 2016
6d1cb44
Add lucene score and collector for search meta data retrieval in the …
Braunch Aug 4, 2016
1af3f1a
Add example pdf and create test skeleton
Braunch Aug 4, 2016
40609ad
Merge branch 'PDFSearchFeature' of github.com:Braunch/jabref into lucene
LinusDietz Apr 20, 2017
8787bcd
import fulltext search from Braunch:PDFSearchFeature
LinusDietz Apr 20, 2017
d51919e
Merge branch 'master' of github.com:JabRef/jabref into lucene
LinusDietz May 10, 2017
692316b
Refactored the PDFContentReader
LinusDietz May 10, 2017
dacbc3c
Put DocumentReader under test
LinusDietz May 10, 2017
2663798
Use current Lucene Release, Indexing is working now.
LinusDietz May 10, 2017
7086a8f
Revised package structure
LinusDietz May 12, 2017
444bd39
added stem analysis
LinusDietz May 12, 2017
8547873
Implemented Searcher Test
LinusDietz May 12, 2017
c9cf893
Improve Naming
LinusDietz May 12, 2017
9eaa390
fix build
LinusDietz May 12, 2017
c83255b
First round of Feedback
LinusDietz May 14, 2017
25cdaff
Second round of Feedback
LinusDietz May 14, 2017
b7bfc9a
Removed PDF Creator from the Search index, add the UID to the search …
LinusDietz May 14, 2017
9a71cac
fixed failing test
LinusDietz May 15, 2017
aa76b72
Merge branch 'master' of https://github.com/JabRef/jabref into lucene
LinusDietz Aug 29, 2017
aed867d
update Lucene from 6.5.1 -> 6.6.0
LinusDietz Aug 29, 2017
a9a9486
integrate indexing into Jabref
LinusDietz Aug 29, 2017
b32af3c
Merge branch 'master' of https://github.com/JabRef/jabref into lucene
LinusDietz Oct 4, 2017
e64c3da
Merge branch 'master' of https://github.com/JabRef/jabref into lucene
LinusDietz Dec 20, 2017
9cd3df3
Update lucene from 6.6.0 -> 7.1.0
LinusDietz Dec 20, 2017
a12555a
Merge branch 'master' into lucene
LinusDietz Jan 23, 2018
e2662cd
Merge branch 'master' of https://github.com/JabRef/jabref into lucene
LinusDietz Jan 23, 2018
650c050
Merge branch 'lucene' of https://github.com/JabRef/jabref into lucene
LinusDietz Jan 23, 2018
a6fad30
Update lucene from 7.1.0 -> 7.2.1
LinusDietz Jan 23, 2018
74c90bd
Merge branch lucene of https://github.com/JabRef/jabref into lucene
btut Jun 16, 2021
6d9ba0b
Lucene dependencies
btut Jun 16, 2021
1a53c47
First shot for integrating lucene
koppor Jun 16, 2021
71a9cb0
Fix and document dependencies
btut Jun 18, 2021
4125ee8
Update to lucene 8.8.2
btut Jun 18, 2021
dd698ff
Checkstyle
btut Jun 18, 2021
e247540
Added fulltext-search button to GlobalSearchBar
btut Jun 18, 2021
63c2abf
Fixed typo
btut Jun 18, 2021
f6e2058
Update to lucene 8.9.0
btut Jun 21, 2021
3ba53be
Added update/remove from index for individual entries/files
btut Jun 22, 2021
4f5c8ed
Started integrating indexer
btut Jun 23, 2021
596525f
Ignore lucene index in git
btut Jun 23, 2021
516d1ca
Fixed tests
btut Jun 23, 2021
4de4dbe
Checkstyle
btut Jun 23, 2021
d83ae36
First draft of search-results tab
btut Jun 23, 2021
a32bec7
Merge branch 'main' of github.com:JabRef/jabref into lucene
btut Jun 28, 2021
addf51e
No tabs in build.gradle
btut Jun 29, 2021
4a3d3c8
Code cleanup
btut Jun 29, 2021
cb30b3b
Added highlighting dependency
btut Jun 29, 2021
8c9bf3c
Highlighted search results
btut Jun 29, 2021
651ef23
Added lib to access local app-data path
btut Jul 5, 2021
7804060
SearchRules only for single DatabaseContext
btut Jul 5, 2021
2c7dc25
Before reverting
btut Jul 5, 2021
148b4f6
Revert "SearchRules only for single DatabaseContext"
btut Jul 5, 2021
edc6fe4
Better SearchRules only for single DatabaseContext
btut Jul 5, 2021
2557fee
Fixed document duplication on update
btut Jul 5, 2021
b4a9c7a
Only write files if index is out of date
btut Jul 5, 2021
786828d
Use globals instead of passing BibDatabaseContext
btut Jul 6, 2021
25c3a40
Fixed tests
btut Jul 7, 2021
4646ab4
Checkstyle
btut Jul 7, 2021
67a7d0a
Ignore index created during tests in git
btut Jul 7, 2021
90ee454
Removed unused localization key
btut Jul 7, 2021
181e382
Listen for changes that concern the index
btut Jul 7, 2021
86d73d0
Added menu-item to rebuild index
btut Jul 7, 2021
a8781ae
Allow SearchRules to access Globals
btut Jul 8, 2021
85beb5c
Moved access to Globals to constructor
btut Jul 8, 2021
c7f3bc8
Removed most metadata from index
btut Jul 8, 2021
e511cbd
Do indexing per Library-tab
btut Jul 8, 2021
b61e14f
Removed tests for Metadata in DocumentReader
btut Jul 8, 2021
ebd9313
Consider cases where there is no open database
btut Jul 8, 2021
cffe5b5
Actually consider fulltext results in search predicate
btut Jul 8, 2021
4f546d1
Applied theme to search-results tab
btut Jul 8, 2021
8c3e76c
Files can be opened from the search-results tab
btut Jul 8, 2021
85a048b
Merge branch 'main' of github.com:JabRef/jabref into lucene
btut Jul 8, 2021
030d5b3
Fixed merge-artifact
btut Jul 8, 2021
acea590
Fixed file-type filter in indexer
btut Jul 8, 2021
18aac31
Changelog entry
btut Jul 8, 2021
6d40f7a
Fixed file types in tests
btut Jul 8, 2021
da3ed1e
Checkstyle
btut Jul 8, 2021
99decc1
Fixed benchmarks
btut Jul 8, 2021
5bde996
Removed spaces in CHANGELOG
btut Jul 8, 2021
a4cd102
Removed unnecessary code in shadow-jar for lucene
btut Jul 9, 2021
856e20f
Removed weird endless recursion
btut Jul 9, 2021
33e6c4a
Fixed Typo in comment
btut Jul 9, 2021
27f201d
Use parseLong instead of valueOf
btut Jul 9, 2021
cfc5869
Rescoped BibDatabase variable
btut Jul 9, 2021
effbfdc
Use method instance
btut Jul 9, 2021
2e5818b
Remove unnecessary return
btut Jul 9, 2021
adb195e
Remove unnecessary throws declaration
btut Jul 9, 2021
9e10107
Cleaner sort-predicate
btut Jul 9, 2021
8a8ac9c
Limit number of search result to 5
btut Jul 10, 2021
1bf636a
Apply suggestions from code review
btut Jul 10, 2021
989e266
Fixed logger formatting
btut Jul 10, 2021
5f093d7
Add log for index-location
btut Jul 10, 2021
298a0d9
Log IO exception when rebuilding index from menu
btut Jul 10, 2021
2d711d4
Localized search results tab
btut Jul 10, 2021
2b879fa
Moved logging to slf4j
btut Jul 10, 2021
8afa013
Log search-result exception
btut Jul 10, 2021
e968483
Better naming for task queue pointer
btut Jul 10, 2021
63eb1a1
Replaced deprecated classes and methods
btut Jul 10, 2021
b500ba2
Removed unnecessary wrapping of unmodifiable list
btut Jul 10, 2021
a554a36
Simplify iteration over search results
btut Jul 10, 2021
36099a8
Use TempDir to store the index during tests
btut Jul 10, 2021
44c975f
Use EnumSet for flags instead of booleans
btut Jul 10, 2021
cb1c393
Delete out-of-date indices
btut Jul 10, 2021
99e723e
Fixed import order
btut Jul 10, 2021
82b9d69
Task-queue and better message for indexing task
btut Jul 10, 2021
cec2089
Remove empty line
koppor Jul 13, 2021
e69219b
Update src/main/java/org/jabref/gui/JabRefMain.java
koppor Jul 13, 2021
533331a
Merge branch 'main' into lucene
koppor Jul 13, 2021
bfd2582
Use literal JabRef for index path
btut Jul 14, 2021
46517ff
Apply suggestions from @koppor
btut Jul 14, 2021
9c887a3
Checkstyle
btut Jul 14, 2021
dc9f53d
Removed dead code
btut Jul 14, 2021
7601ce6
Removed more dead code
btut Jul 14, 2021
5dc3852
Add annotations to index
btut Jul 14, 2021
110601c
Checkstyle
btut Jul 14, 2021
964aa1f
Error handling when reading annotations
btut Jul 14, 2021
25d5733
Remove hardcoded appdata path
btut Jul 14, 2021
79ae32b
Refine toString
koppor Jul 14, 2021
e6055fc
Rename method
koppor Jul 14, 2021
@@ -21,10 +21,13 @@
 import org.apache.lucene.document.StringField;
 import org.apache.lucene.document.TextField;
 import org.apache.pdfbox.pdmodel.PDDocument;
+import org.apache.pdfbox.pdmodel.interactive.annotation.AnnotationFilter;
+import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
 import org.apache.pdfbox.text.PDFTextStripper;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

+import static org.jabref.model.pdf.search.SearchFieldConstants.ANNOTATIONS;
 import static org.jabref.model.pdf.search.SearchFieldConstants.CONTENT;
 import static org.jabref.model.pdf.search.SearchFieldConstants.MODIFIED;
 import static org.jabref.model.pdf.search.SearchFieldConstants.PATH;
@@ -122,6 +125,20 @@ private void addContentIfNotEmpty(PDDocument pdfDocument, Document newDocument)
             if (StringUtil.isNotBlank(pdfContent)) {
                 newDocument.add(new TextField(CONTENT, pdfContent, Field.Store.YES));
             }
+            pdfDocument.getPages().forEach(page -> {
+                try {
+                    for (PDAnnotation annotation : page.getAnnotations( annotation -> {
+                        if (annotation.getContents() == null) {
+                            return false;
+                        }
+                        return annotation.getSubtype().equals("Text") || annotation.getSubtype().equals("Highlight");
+                    })) {
+                        newDocument.add(new TextField(ANNOTATIONS, annotation.getContents(), Field.Store.YES));
+                    }
+                } catch (IOException e) {
+                    e.printStackTrace();
+                }
+            });
         } catch (IOException e) {
             LOGGER.info("Could not read contents of PDF document \"{}\"", pdfDocument.toString(), e);
         }
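
The filter passed to getAnnotations in the hunk above keeps an annotation only if it has contents and its subtype is "Text" or "Highlight". Below is a minimal, dependency-free sketch of that rule. The AnnotationStub class and both method names are hypothetical stand-ins introduced for illustration (the PR itself works on PDFBox's PDAnnotation), and the constant-first equals() is a small variation that would also tolerate a missing subtype.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AnnotationFilterSketch {

    // Hypothetical stand-in for PDAnnotation: only the two values the filter reads.
    static final class AnnotationStub {
        private final String subtype;
        private final String contents;

        AnnotationStub(String subtype, String contents) {
            this.subtype = subtype;
            this.contents = contents;
        }

        String getSubtype() {
            return subtype;
        }

        String getContents() {
            return contents;
        }
    }

    // Same rule as the hunk: contents must be present and the subtype must be
    // "Text" or "Highlight". Calling equals() on the constant avoids a
    // NullPointerException if the subtype is null.
    static boolean isIndexable(AnnotationStub annotation) {
        if (annotation.getContents() == null) {
            return false;
        }
        return "Text".equals(annotation.getSubtype()) || "Highlight".equals(annotation.getSubtype());
    }

    // Collects, in order, the contents that would each become one ANNOTATIONS field.
    static List<String> indexableContents(List<AnnotationStub> annotations) {
        List<String> contents = new ArrayList<>();
        for (AnnotationStub annotation : annotations) {
            if (isIndexable(annotation)) {
                contents.add(annotation.getContents());
            }
        }
        return contents;
    }

    public static void main(String[] args) {
        List<AnnotationStub> page = Arrays.asList(
                new AnnotationStub("Text", "check this claim"),
                new AnnotationStub("Highlight", null),
                new AnnotationStub("Link", "https://example.org"),
                new AnnotationStub("Highlight", "key result"));
        System.out.println(indexableContents(page)); // prints [check this claim, key result]
    }
}
```

Keeping the predicate in a named method, rather than inline in the loop header, makes the rule testable without a PDF fixture. Whether that is worth the extra indirection in the real reader class is a judgment call.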
@@ -4,9 +4,10 @@ public class SearchFieldConstants {

     public static final String PATH = "path";
     public static final String CONTENT = "content";
+    public static final String ANNOTATIONS = "annotations";
     public static final String MODIFIED = "modified";

-    public static final String[] PDF_FIELDS = new String[]{PATH, CONTENT, MODIFIED};
+    public static final String[] PDF_FIELDS = new String[]{PATH, CONTENT, MODIFIED, ANNOTATIONS};

-    public static final String VERSION = "0.2a";
+    public static final String VERSION = "0.3a";
 }
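
The VERSION bump from "0.2a" to "0.3a" accompanies the new ANNOTATIONS field, presumably so that an index written without that field is not reused. The sketch below shows one way such a constant could drive that decision. It is not taken from the PR: the one-subdirectory-per-version layout, the class name, and both method names are assumptions made for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class IndexVersionSketch {

    // Value taken from the hunk above.
    static final String VERSION = "0.3a";

    // Assumption: each index version lives in its own subdirectory under a common root.
    static Path currentIndexDir(Path indexRoot) {
        return indexRoot.resolve(VERSION);
    }

    // Returns the version directory names that do not match VERSION,
    // i.e. candidates for deletion or a rebuild.
    static List<String> staleVersions(List<String> existingVersionDirs) {
        List<String> stale = new ArrayList<>();
        for (String name : existingVersionDirs) {
            if (!VERSION.equals(name)) {
                stale.add(name);
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        System.out.println(currentIndexDir(Paths.get("index-root")));
        System.out.println(staleVersions(Arrays.asList("0.2a", "0.3a"))); // prints [0.2a]
    }
}
```

With a layout like this, changing the set of indexed fields only requires bumping the constant. Old directories then stop matching and can be cleaned up separately.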
@@ -82,6 +82,12 @@ public void searchForSecond() throws IOException, ParseException {
         assertEquals(2, result.numSearchResults());
     }

+    @Test
+    public void searchForAnnotation() throws IOException, ParseException {
+        PdfSearchResults result = search.search("annotation", 10);
+        assertEquals(2, result.numSearchResults());
+    }
+
     @Test
     public void searchForEmptyString() throws IOException {
         PdfSearchResults result = search.search("", 10);
Binary file modified src/test/resources/pdfs/example.pdf
Binary file not shown.
Binary file modified src/test/resources/pdfs/metaData.pdf
Binary file not shown.