Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Terminate automaton when it can match all suffixes, and match suffixes directly. #13072

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

vsop-479
Copy link
Contributor

@vsop-479 vsop-479 commented Feb 5, 2024

For PrefixQuery, we can terminate the automaton on current term if we have matched the whole prefix, and match this term directly.
Furthermore, if there is a subBlock, we could match all its' sub terms.

Copy link

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Feb 20, 2024
@vsop-479
Copy link
Contributor Author

@jpountz Please take a look when you get a chance.

@github-actions github-actions bot removed the Stale label Feb 22, 2024
@jpountz
Copy link
Contributor

jpountz commented Feb 23, 2024

IntersectsTermsEnum is a bit scary to me, maybe @mikemccand can take a look, I expect him to be more familiar with it.

@mikemccand
Copy link
Member

I will have a look -- thanks for the ping @jpountz.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a clever optimization! You recognize that this Automaton will match all possible suffixes in this state, and so more efficiently enumerate all terms from block tree under that state.

I have concerns about storing this in Automaton itself, and the naming was confusing to me :) Could we instead store it in RunAutomaton? Or, possibly, do it on the fly in IntersectEnum by detecting a state that is both accept and has a .* transition back onto itself?

Have you tried to measure any performance change with this? E.g. you could run a luceneutil benchy with just PrefixQuery, or, Regexp/WildcardQuery that also have this property (match-all states in their automata).

@vsop-479
Copy link
Contributor Author

vsop-479 commented Feb 28, 2024

@mikemccand Thanks for your suggestion, I will try to implement it.

Have you tried to measure any performance change with this? E.g. you could run a luceneutil benchy with just PrefixQuery, or, Regexp/WildcardQuery that also have this property (match-all states in their automata).

I am working on this.

@jpountz Thanks for your reply.

@rmuir
Copy link
Member

rmuir commented Feb 28, 2024

I think the optimization may be similar to the one done in AutomatonTermsEnum?
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/AutomatonTermsEnum.java#L149-L153

When "ping-ponging" the term dictionary against the automaton, it tracks visited bitset and looks for such loops in the automaton. when it finds one, it temporarily acts like a TermRangeQuery.

I think, it works a bit more general than just prefixquery and also helps with regex and wildcard queries too.

@vsop-479
Copy link
Contributor Author

I think the optimization may be similar to the one done in AutomatonTermsEnum?

Thanks for reminding that, I will dig into AutomatonTermsEnum's optimization.

@vsop-479
Copy link
Contributor Author

vsop-479 commented Mar 1, 2024

There is still problem in the state of match all suffix of IntersectTermsEnumFrame. I am trying to figure it out.

On the other hand, I will dig into AutomatonTermsEnum's optimization.

@vsop-479
Copy link
Contributor Author

vsop-479 commented Mar 2, 2024

@mikemccand
I renamed the field used to indicate whether an accept state can match all suffixes, and detected it in RunAutomaton.
Please take a look when you get a chance.

@vsop-479
Copy link
Contributor Author

vsop-479 commented Mar 6, 2024

I think the optimization may be similar to the one done in AutomatonTermsEnum?

It seems the optimization of AutomatonTermsEnum is to improve iterating term mode, from seeking bytes from DFA and seek termsEnum(seekCeil), to simply sequential reads the termsEnum, after finding a loop(setLinear).
In both mode, AutomatonTermsEnum needs to check if the term is accepted by running automaton.

If I am not mistaken, this optimization is different from AutomatonTermsEnum's. It directly matches all reminding suffixes and sub blocks, after detecting an accept state with a .* transition back onto itself.

Or maybe you mean we can improve AutomatonTermsEnum's optimization, to implement this optimization's effect? @rmuir

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @vsop-479 -- this looks closer. I like that the opto is now contained under RunAutomaton, but I'm confused/concerned about sometimes checking for 255 max label and other times 127 depending on which query.

assert automaton.isAccept(state);
int numTransitions = automaton.getNumTransitions(state);
// Apply to PrefixQuery, TermRangeQuery.
if (numTransitions == 1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove this special case? Just let the for loop below handle the 1-transition case too?

Edit: hmm, I see, it is subtly different: this is checking for max label 255 but the loop below is checking 127, hmmm. This is a bit messy -- this low level of code shouldn't be specializing to different automata that come from the high level queries. Can we use alphabetSize-1 as the transition.max check instead? But, separately, we need to figure out why Regexp/WildcardQuery are compiling down to 127 as their max on .* suffix transitions? That is not even correct for matching UTF-8 encoded terms.

Perhaps we could also add tests cases for custom Automata passed to AutomatonQuery matching sometimes binary (non-UTF8) terms?

Copy link
Contributor Author

@vsop-479 vsop-479 Mar 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to figure out why Regexp/WildcardQuery are compiling down to 127 as their max on .* suffix transitions?

These queries' (including AutomatonQuery)Automaton like this: 3 -> 3: [0, 127]; 3 -> 4: [194, 194]; 4 -> 3: [128, 191]. assume 3 is an accept state.
It is more complex to detect whether a state can accept all remaining suffixes for these queries, because its accept states are split into many transitions like: [0, 127], [194, 223], [224, 239], [240, 243], [244], etc.

I am still working on this, any suggestion is welcome @mikemccand.

Perhaps we could also add tests cases for custom Automata passed to AutomatonQuery matching sometimes binary (non-UTF8) terms?

Added.

Copy link
Contributor Author

@vsop-479 vsop-479 Mar 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These queries' (including AutomatonQuery)Automaton like this: 3 -> 3: [0, 127]; 3 -> 4: [194, 194]; 4 -> 3: [128, 191]. assume 3 is an accept state.

@mikemccand
I can track an accept state's other transitions, to check whether these transitions can finally ended on an accept (typically transited by [128, 191]). But i am not sure whether it is enough to judge an state can match all suffix, even not sure whether it is necessary, since maybe it is equivalent to just check the [0, 127] transition's dest is an accept state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to figure out why Regexp/WildcardQuery are compiling down to 127 as their max on .* suffix transitions?

I think we split the transition([0, 1114111]) with utf8 edges in UTF32ToUTF8.convertOneEdge.

Copy link
Contributor Author

@vsop-479 vsop-479 Apr 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mikemccand
I think I can detect a match all suffix state for Regexp/WildcardQuery, in UTF32ToUTF8.convert after convertOneEdge like this:

// Writes new transitions into pendingTransitions:
convertOneEdge(utf8State, destUTF8, scratch.min, scratch.max);
        
// Set match all suffix state.
if(scratch.min == 0 && scratch.max == 1114111 && utf8.isAccept(utf8State) && utf8.isAccept(destUTF8)){
  utf8.setMatchAllSuffix(utf8State, true);
}

Which is simple and reliable, but will violate the rule below:

Everything else about Automaton today is fundamental (states, transitions, isAccept) and necessary, but this new member is more a best effort optimization?

Other plan: Checking whether a candidate state can finally ended on an accept by [128, 191], which is added in UTF32ToUTF8.all:

utf8.addTransition(lastN, end, 128, 191); // type = all*

@vsop-479
Copy link
Contributor Author

vsop-479 commented Mar 7, 2024

Have you tried to measure any performance change with this? E.g. you could run a luceneutil benchy with just PrefixQuery, or, Regexp/WildcardQuery that also have this property (match-all states in their automata).

I measured it with current implementation with wikimedium1m:

TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
Prefix3      879.20      (9.1%)      995.37      (4.2%)   13.2% (   0% -   29%) 0.062
Prefix3      924.98      (9.9%)     1042.17      (7.9%)   12.7% (  -4% -   33%) 0.083
TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
Prefix3     1480.94      (8.8%)     1559.89      (6.4%)    5.3% (  -9% -   22%) 0.195
Prefix3     1242.80      (6.9%)     1307.30      (5.3%)    5.2% (  -6% -   18%) 0.299
Prefix3      177.54      (1.3%)      202.74      (6.3%)   14.2% (   6% -   22%) 0.000

Copy link

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added Stale and removed Stale labels Apr 13, 2024
Copy link

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Apr 30, 2024
@github-actions github-actions bot removed the Stale label May 9, 2024
Copy link

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label May 24, 2024
@vsop-479
Copy link
Contributor Author

@mikemccand
I think this change is close for PrefixQuery, TermRangeQuery, and custom binary AutomatonQuery.

As for RegexpQuery and WildcardQuery, whose automata converted by UTF32ToUTF8. We just get accept transitions [0, 127] in RunAutomaton, I am not sure whether it is enough that just check the accept state is transition is [0, 127].

Another approach is set matchAllSuffix in UTF32ToUTF8, but need to add a unFundamental member to Automaton.

Or, can we just push this change for Prefix/TermRangeQuery, and leave a TODO for Regexp/WildcardQuery temporarily?

@github-actions github-actions bot removed the Stale label Jun 22, 2024
Copy link

github-actions bot commented Jul 7, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jul 7, 2024
@vsop-479
Copy link
Contributor Author

Conflicts resolved.

Copy link

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Aug 28, 2024
@vsop-479 vsop-479 changed the title Terminate automaton after matched the whole prefix for PrefixQuery. Terminate automaton when it can match all suffixes, and match suffixes directly. Sep 9, 2024
@github-actions github-actions bot removed the Stale label Sep 10, 2024
Copy link

github-actions bot commented Oct 5, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Oct 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants