Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Lucene as search syntax #8206

Closed
wants to merge 11 commits into from
Closed

Add support for Lucene as search syntax #8206

wants to merge 11 commits into from

Conversation

koppor
Copy link
Member

@koppor koppor commented Nov 2, 2021

Fixes #1975

As far as I tested, the search syntax of groups can be reused (= signs also work at Lucene).

The search using regular expressions is different in Lucene. One has to use /.../ to indicate regular expressions. Thus, groups can now also use regular expressions.

Left TODOs:


  • Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

@koppor koppor linked an issue Nov 3, 2021 that may be closed by this pull request
@koppor koppor removed a link to an issue Nov 3, 2021
Copy link
Contributor

@btut btut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick and the static checker found an unused import.

CHANGELOG.md Outdated Show resolved Hide resolved
@btut
Copy link
Contributor

btut commented Nov 3, 2021

I think overall it would make sense to rework the search mechanism to use a Lucene index to improve speed.

A few thoughts on that:

  • Currently, search is done for each entry individually. So a query is executed on each entry and a boolean return value indicates whether the entry fits the query or not. With Lucene, a query is executed on the index and returns entries.
    For the Fulltext-search, I had to implement a workaround, storing all returned entries that fit a querry and then, when the query is executed on an entry, it actually checks if the entry is in that list. I dislike that workaround.
  • Lucene does not provide a binary decision whether the entry matches the query. Lucene provides a score.
    This means it cannot (should not) be used as a binary filter over the entry-table (as is done now). Instead, the entry-table should be sorted according to the scoring. To give the user feedback on where the query was found in an entry, it would make sense to leverage Lucene's...
  • Highlighting. Lucene can generate html formatted (converted to a JavaFX TextFlow for the fulltext results) highlighted string, similar to what google does when displaying search results.

@calixtus
Copy link
Member

calixtus commented Nov 3, 2021

I sense a damn f**** cool possible new (or renewed) feature for JabRef here. Sort by lucene score and/or highlight/opaque entries by score - like floating results in ancient swing times.

@btut
Copy link
Contributor

btut commented Nov 4, 2021

I agree. If done well, this could be a very neat feature and the speed-up would be icing on the cake.

Copy link
Member

@tobiasdiez tobiasdiez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also in favor of completely removing our existing search infrastructure and migrate completely to Lucene.
Not sure whats more efficient: first only except Lucene syntax in the search query, and in a second PR throw out the old stuff; or rewrite everything in one go.

CHANGELOG.md Outdated Show resolved Hide resolved
@koppor
Copy link
Member Author

koppor commented Nov 14, 2021

Current blocker - Lucene does not support == and !=.

  • = inexact match. In Lucene, a glob pattern would be needed. In the current implementation, I did not enforce that, but implemented an inexact match
  • == is an exact match. Parsing error in Lucene
  • != is NOT. Parsing error in Lucene

grafik

How to proceed?

  • Keep our search (and maybe map to Lucene for full text support)
  • Drop support for == and !=

@tobiasdiez
Copy link
Member

I would completely drop our search syntax and only use lucenes one. If users really want an equals instead of contains search, then we can add this later (for example by adding a field alias with no tokenizer).

@koppor
Copy link
Member Author

koppor commented Nov 15, 2021

Note to self: Read Lucene documentation thoroughly --> http://www.lucenetutorial.com/lucene-query-syntax.html

What is a match?

  • Currently: title:Jab --> Matches "JabRef tool"
  • Lucene: title:Jab --> does not match "JabRef tool"

  • Currently: title:Jab --> Matches "JabRef tool"
  • Lucene: title:Jab* --> Matches "JabRef tool"

  • Currently: title:JabRef --> Matches "JabRef tool"
  • Lucene: title:JabRef --> Matches "JabRef tool"

It is OK for me to implement like that. The "only" concern I have is that it does not match the UX one has from WWW search enginges (Google, Bing, ...)

@btut
Copy link
Contributor

btut commented Nov 15, 2021

If users really want an equals instead of contains search, then we can add this later (for example by adding a field alias with no tokenizer).

I think the most efficient way would be to do a fuzzy search anyhow and then do the equals search on the results of the fuzzy search as we did before. I think that would be better than storing everything twice (tokenized and untokenized).

... After writing this I did a short google search and it appears like exact matches are possible in lucene using quotes. https://stackoverflow.com/questions/37495639/how-to-match-exact-text-in-lucene-search
Isn't that exactly what we want?

@koppor
Copy link
Member Author

koppor commented Apr 4, 2022

Minor comment: I rebased this PR on main, because we merged #8636

@koppor
Copy link
Member Author

koppor commented Apr 19, 2022

# Conflicts:
#	src/main/java/org/jabref/logic/search/rules/ContainsBasedSearchRule.java
#	src/main/java/org/jabref/logic/search/rules/GrammarBasedSearchRule.java
#	src/main/java/org/jabref/logic/search/rules/RegexBasedSearchRule.java
public void findsCaseInSensitive(String query) {
assertTrue(luceneBasedSearchRuleCaseInsensitive.validateSearchStrings(query));
assertTrue(luceneBasedSearchRuleCaseInsensitive.applyRule(query, bibEntry));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 [reviewdog] <com.puppycrawl.tools.checkstyle.checks.regexp.RegexpMultilineCheck> reported by reviewdog 🐶
Blank line at end of block should be removed

void simpleFieldedLuceneQueryReturnsLuceneBasedSearchRule() {
SearchRule searchRule = SearchRules.getSearchRuleByQuery("title:test", EnumSet.noneOf(SearchRules.SearchFlags.class));
assertInstanceOf(LuceneBasedSearchRule.class, searchRule);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 [reviewdog] <com.puppycrawl.tools.checkstyle.checks.regexp.RegexpMultilineCheck> reported by reviewdog 🐶
Blank line at end of block should be removed

@koppor
Copy link
Member Author

koppor commented Jul 7, 2022

... After writing this I did a short google search and it appears like exact matches are possible in lucene using quotes.

I would object - based on https://www.lucenetutorial.com/lucene-query-syntax.html

grafik

grafik

I would follow the suggestion of #8206 (comment) and implement the interpretation of the Lucene syntax "more relaxed".

@ThiloteE
Copy link
Member

ThiloteE commented Jul 8, 2022

Question: getting rid of current search syntax will break groups that are configured to search based on

  • searching for a keyword
  • free search expression

right? Is there a way the syntax could be backwards compatible?

@btut
Copy link
Contributor

btut commented Jul 9, 2022

Question: getting rid of current search syntax will break groups that are configured to search based on

  • searching for a keyword
  • free search expression

right? Is there a way the syntax could be backwards compatible?

I never used that feature so I had to look it up first.
In general, one goal of the lucene-implementation would be to get rid of all the old search infrastructure. This is hard to do when keeping backwards compatibility. But there is hope. 'Simple' searches, just looking for a string in a field, should straight up just work in lucene as well. The problems arise when we look at the case-sensitive and regex switches and to scoring:

  • Lucene does not support case-insensitive searches. Searches are always case-sensitive. If you want a case-insensitive search, you lowercase everything that you put into the index and your search strings.
  • I do not know if the regex implementation of JabRef and Lucene match. If so, great. If not, we could have a 'migration' feature that parses search strings in the group settings on startup and if they match JabRef syntax they are translated to lucene syntax. We need the regex parser for other things anyways so we wouldn't get rid of it anyways.
  • The problem that makes the current fulltext-search less usable than it could be is the scoring. Lucene almost always finds something. It might not be a good fit but still better than other entries. We need a good scoring mechanism to filter out irrelevant results. I don't know if a simple score-threshold would suffice or we need something more sophisticated.

@koppor
Copy link
Member Author

koppor commented Sep 26, 2022

Superseeded by #8963

This PR can be used to check for similarities and differences by using the Lucene engine or (our) custom search.

@koppor koppor closed this Sep 26, 2022
@koppor koppor deleted the refine-search branch September 26, 2022 19:14
@LoayGhreeb LoayGhreeb mentioned this pull request Sep 2, 2024
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reuse query parser
6 participants