Terminate automaton when it can match all suffixes, and match suffixes directly. #13072

vsop-479 · 2024-02-05T07:55:18Z

For PrefixQuery, we can terminate the automaton at the current term once the whole prefix has been matched, and match that term directly.
Furthermore, if there is a sub-block, we can match all of its sub-terms.
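
As a rough illustration of the idea (not the PR's actual code), the toy Java sketch below treats a prefix as a chain of states whose last state accepts and loops on every byte. The class and method names are hypothetical, and the byte counter exists only to make the saved work visible.

```java
import java.nio.charset.StandardCharsets;

/** Toy sketch only; these names are hypothetical and not part of Lucene. */
class PrefixEarlyExitSketch {
  /** Number of term bytes examined by the last call, kept only for illustration. */
  static int bytesExamined;

  /** Plain check: walks every byte of the term, even after the prefix is fully matched. */
  static boolean acceptsNaive(byte[] prefix, byte[] term) {
    bytesExamined = 0;
    int state = 0; // state i means "i prefix bytes matched"; prefix.length is the accept state
    for (byte b : term) {
      bytesExamined++;
      if (state < prefix.length) {
        if (b != prefix[state]) {
          return false;
        }
        state++;
      }
      // Otherwise we are in the accept state, which loops on every byte.
    }
    return state == prefix.length;
  }

  /** Early exit: once the accept state is reached, every suffix matches, so stop stepping. */
  static boolean acceptsEarlyExit(byte[] prefix, byte[] term) {
    bytesExamined = 0;
    int state = 0;
    for (byte b : term) {
      if (state == prefix.length) {
        return true; // match-all-suffix state: accept without reading the rest
      }
      bytesExamined++;
      if (b != prefix[state]) {
        return false;
      }
      state++;
    }
    return state == prefix.length;
  }

  public static void main(String[] args) {
    byte[] prefix = "ab".getBytes(StandardCharsets.UTF_8);
    byte[] term = "abcdef".getBytes(StandardCharsets.UTF_8);
    System.out.println(acceptsNaive(prefix, term) + " after " + bytesExamined + " bytes");
    System.out.println(acceptsEarlyExit(prefix, term) + " after " + bytesExamined + " bytes");
  }
}
```

Both methods give the same answer; the second one simply stops looking at term bytes once the prefix is consumed, which is the saving the PR aims for inside the block tree.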

Modify comment. Modify comment.

github-actions · 2024-02-20T00:16:44Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

vsop-479 · 2024-02-21T09:57:09Z

@jpountz Please take a look when you get a chance.

jpountz · 2024-02-23T16:15:16Z

IntersectTermsEnum is a bit scary to me. Maybe @mikemccand can take a look; I expect him to be more familiar with it.

mikemccand · 2024-02-27T15:58:22Z

I will have a look -- thanks for the ping @jpountz.

mikemccand

This is a clever optimization! You recognize that this Automaton will match all possible suffixes in this state, and so you can more efficiently enumerate all terms from the block tree under that state.

I have concerns about storing this in Automaton itself, and the naming was confusing to me :) Could we instead store it in RunAutomaton? Or, possibly, do it on the fly in IntersectTermsEnum by detecting a state that is both accept and has a .* transition back onto itself?
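
The on-the-fly check suggested above might look roughly like the following sketch. It is plain Java with a hypothetical `Transition` class rather than Lucene's `Automaton` API, and it assumes a deterministic automaton whose transitions for a state are sorted by `min` label.

```java
import java.util.Arrays;
import java.util.List;

/** Toy sketch only; these names are hypothetical and not part of Lucene. */
class MatchAllSuffixSketch {

  /** A labeled range transition: labels min..max (inclusive) lead to state dest. */
  static final class Transition {
    final int min;
    final int max;
    final int dest;

    Transition(int min, int max, int dest) {
      this.min = min;
      this.max = max;
      this.dest = dest;
    }
  }

  /**
   * Returns true if the state is an accept state and its transitions are all
   * self-loops that together cover every label in [0, alphabetSize - 1].
   * Assumes the transitions are sorted by min and do not overlap (a DFA).
   */
  static boolean matchesAllSuffixes(
      int state, boolean isAccept, List<Transition> sortedByMin, int alphabetSize) {
    if (!isAccept) {
      return false;
    }
    int nextUncovered = 0; // smallest label not yet known to loop back to this state
    for (Transition t : sortedByMin) {
      if (t.dest != state) {
        return false; // some label leaves the state, so not every suffix stays here
      }
      if (t.min > nextUncovered) {
        return false; // gap in the label coverage
      }
      nextUncovered = Math.max(nextUncovered, t.max + 1);
    }
    return nextUncovered >= alphabetSize;
  }

  public static void main(String[] args) {
    List<Transition> full = Arrays.asList(new Transition(0, 255, 3));
    List<Transition> asciiOnly = Arrays.asList(new Transition(0, 127, 3));
    System.out.println(matchesAllSuffixes(3, true, full, 256));
    System.out.println(matchesAllSuffixes(3, true, asciiOnly, 256));
  }
}
```

With an alphabet of 256 byte labels, a single self-loop over [0, 255] passes this check, while a self-loop over [0, 127] alone does not. The check is deliberately conservative: it rejects states that only return to themselves through intermediate states.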

Have you tried to measure any performance change with this? E.g. you could run a luceneutil benchy with just PrefixQuery, or, Regexp/WildcardQuery that also have this property (match-all states in their automata).

lucene/core/src/java/org/apache/lucene/util/automaton/Automaton.java

lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java

vsop-479 · 2024-02-28T02:32:27Z

@mikemccand Thanks for your suggestion, I will try to implement it.

> Have you tried to measure any performance change with this? E.g. you could run a luceneutil benchy with just PrefixQuery, or, Regexp/WildcardQuery that also have this property (match-all states in their automata).

I am working on this.

@jpountz Thanks for your reply.

rmuir · 2024-02-28T02:42:41Z

I think the optimization may be similar to the one done in AutomatonTermsEnum?
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/AutomatonTermsEnum.java#L149-L153

When "ping-ponging" the term dictionary against the automaton, it tracks a visited bitset and looks for such loops in the automaton. When it finds one, it temporarily acts like a TermRangeQuery.

I think it is a bit more general than just PrefixQuery and also helps with regex and wildcard queries.

vsop-479 · 2024-02-28T05:53:39Z

> I think the optimization may be similar to the one done in AutomatonTermsEnum?

Thanks for the reminder, I will dig into AutomatonTermsEnum's optimization.

vsop-479 · 2024-03-01T09:46:16Z

There is still a problem with the match-all-suffix state of IntersectTermsEnumFrame. I am trying to figure it out.

On the other hand, I will dig into AutomatonTermsEnum's optimization.

vsop-479 · 2024-03-02T15:56:02Z

@mikemccand
I renamed the field that indicates whether an accept state can match all suffixes, and it is now detected in RunAutomaton.
Please take a look when you get a chance.

vsop-479 · 2024-03-06T06:45:32Z

> I think the optimization may be similar to the one done in AutomatonTermsEnum?

It seems the AutomatonTermsEnum optimization changes the term iteration mode: after finding a loop (setLinear), it switches from computing the next seek bytes from the DFA and seeking the termsEnum (seekCeil) to simply reading the termsEnum sequentially.
In both modes, AutomatonTermsEnum still needs to check whether each term is accepted by running the automaton.

If I am not mistaken, this optimization is different from AutomatonTermsEnum's. It directly matches all remaining suffixes and sub-blocks after detecting an accept state with a .* transition back onto itself.

Or maybe you mean we can improve AutomatonTermsEnum's optimization to achieve the same effect as this one? @rmuir

mikemccand

Thanks @vsop-479 -- this looks closer. I like that the opto is now contained under RunAutomaton, but I'm confused/concerned about sometimes checking for 255 max label and other times 127 depending on which query.

lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/IntersectTermsEnum.java

lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java

mikemccand · 2024-03-06T11:05:43Z

lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java

+    assert automaton.isAccept(state);
+    int numTransitions = automaton.getNumTransitions(state);
+    // Apply to PrefixQuery, TermRangeQuery.
+    if (numTransitions == 1) {


Can we remove this special case? Just let the for loop below handle the 1-transition case too?

Edit: hmm, I see, it is subtly different: this is checking for max label 255 but the loop below is checking 127, hmmm. This is a bit messy -- this low level of code shouldn't be specializing to different automata that come from the high level queries. Can we use alphabetSize-1 as the transition.max check instead? But, separately, we need to figure out why Regexp/WildcardQuery are compiling down to 127 as their max on .* suffix transitions? That is not even correct for matching UTF-8 encoded terms.

Perhaps we could also add test cases for custom Automata passed to AutomatonQuery, sometimes matching binary (non-UTF8) terms?

> we need to figure out why Regexp/WildcardQuery are compiling down to 127 as their max on .* suffix transitions?

The automata for these queries (including AutomatonQuery) look like this: 3 -> 3: [0, 127]; 3 -> 4: [194, 194]; 4 -> 3: [128, 191], assuming 3 is an accept state.
Detecting whether a state can accept all remaining suffixes is more complex for these queries, because the accept state's outgoing labels are split across many transitions, such as [0, 127], [194, 223], [224, 239], [240, 243], [244], etc.
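
The label ranges quoted above line up with the lead bytes of 1-, 2-, 3- and 4-byte UTF-8 sequences. The small Java sketch below (a standalone illustration with a hypothetical class name, not code from the PR) derives them by encoding the first and last code point of each length class.

```java
import java.nio.charset.StandardCharsets;

/** Standalone illustration; the class name is hypothetical. */
class Utf8LeadByteRanges {

  /** Returns the first UTF-8 byte of the given code point as an unsigned value (0..255). */
  static int leadByte(int codePoint) {
    byte[] utf8 = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    return utf8[0] & 0xFF;
  }

  public static void main(String[] args) {
    // First and last code point for each UTF-8 encoded length (1 to 4 bytes).
    int[][] ranges = {
      {0x0000, 0x007F},
      {0x0080, 0x07FF},
      {0x0800, 0xFFFF},
      {0x10000, 0x10FFFF}
    };
    for (int[] r : ranges) {
      System.out.printf(
          "U+%04X..U+%04X -> lead bytes [%d, %d]%n",
          r[0], r[1], leadByte(r[0]), leadByte(r[1]));
    }
  }
}
```

Every byte after the lead byte of a multi-byte sequence is a continuation byte in [128, 191], which is why the example automaton returns from state 4 to state 3 on that range.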

I am still working on this, any suggestion is welcome @mikemccand.

> Perhaps we could also add tests cases for custom Automata passed to AutomatonQuery matching sometimes binary (non-UTF8) terms?

Added.

> These queries' (including AutomatonQuery)Automaton like this: 3 -> 3: [0, 127]; 3 -> 4: [194, 194]; 4 -> 3: [128, 191]. assume 3 is an accept state.

@mikemccand
I can track an accept state's other transitions to check whether they eventually end on an accept state (typically reached via [128, 191]). But I am not sure whether that is enough to conclude that a state can match all suffixes, or even whether it is necessary, since it may be equivalent to just checking that the destination of the [0, 127] transition is an accept state.

> we need to figure out why Regexp/WildcardQuery are compiling down to 127 as their max on .* suffix transitions?

I think the transition [0, 1114111] is split into UTF-8 edges in UTF32ToUTF8.convertOneEdge.

@mikemccand
I think I can detect a match all suffix state for Regexp/WildcardQuery, in UTF32ToUTF8.convert after convertOneEdge like this:

    // Writes new transitions into pendingTransitions:
    convertOneEdge(utf8State, destUTF8, scratch.min, scratch.max);
    // Set match all suffix state.
    if (scratch.min == 0
        && scratch.max == 1114111
        && utf8.isAccept(utf8State)
        && utf8.isAccept(destUTF8)) {
      utf8.setMatchAllSuffix(utf8State, true);
    }

This is simple and reliable, but it conflicts with the concern below:

> Everything else about Automaton today is fundamental (states, transitions, isAccept) and necessary, but this new member is more a best effort optimization?

Another plan: check whether a candidate state can eventually reach an accept state via [128, 191], which is added in UTF32ToUTF8.all:

utf8.addTransition(lastN, end, 128, 191); // type = all*

lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java

vsop-479 · 2024-03-07T10:01:46Z

> Have you tried to measure any performance change with this? E.g. you could run a luceneutil benchy with just PrefixQuery, or, Regexp/WildcardQuery that also have this property (match-all states in their automata).

I measured the current implementation with wikimedium1m:

Task      QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff               p-value
Prefix3         879.20  (9.1%)                   995.37  (4.2%)  13.2% (   0% -   29%)    0.062
Prefix3         924.98  (9.9%)                  1042.17  (7.9%)  12.7% (  -4% -   33%)    0.083

Task      QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff               p-value
Prefix3        1480.94  (8.8%)                  1559.89  (6.4%)   5.3% (  -9% -   22%)    0.195
Prefix3        1242.80  (6.9%)                  1307.30  (5.3%)   5.2% (  -6% -   18%)    0.299
Prefix3         177.54  (1.3%)                   202.74  (6.3%)  14.2% (   6% -   22%)    0.000

github-actions · 2024-04-13T00:15:48Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

github-actions · 2024-04-30T00:17:32Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

github-actions · 2024-05-24T00:18:59Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

vsop-479 · 2024-06-21T06:14:43Z

@mikemccand
I think this change is close to ready for PrefixQuery, TermRangeQuery, and custom binary AutomatonQuery.

As for RegexpQuery and WildcardQuery, their automata are converted by UTF32ToUTF8, so in RunAutomaton we only see the accept state's [0, 127] transition. I am not sure whether it is enough to just check that the accept state's transition is [0, 127].

Another approach is to set matchAllSuffix in UTF32ToUTF8, but that requires adding a non-fundamental member to Automaton.

Or, can we just push this change for Prefix/TermRangeQuery, and leave a TODO for Regexp/WildcardQuery for now?

github-actions · 2024-07-07T00:21:47Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

vsop-479 · 2024-08-12T08:17:05Z

Conflicts resolved.

github-actions · 2024-08-28T00:20:40Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

github-actions · 2024-10-05T00:22:46Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

Terminate after matched the whole prefix for PrefixQuery.

529399a

Modify comment. Modify comment.

github-actions bot added the Stale label Feb 20, 2024

Use terminable flag instead of value check.

a5f5e43

github-actions bot removed the Stale label Feb 22, 2024

Match sub block's entry directly.

181529e

Set minTermBlockSize to 2, maxTermBlockSize to 3, to generate subBlock

d6e69ca

mikemccand reviewed Feb 27, 2024

Detect accept state can match all suffix in RunAutomaton.

e997599

Reset frame's matchAllSuffix state.

b5e977e

mikemccand reviewed Mar 6, 2024

Add AutomatonQuery test case and improve code.

f447f8a

Merge branch 'main' into optimize_prefix_query

46036f6

github-actions bot added Stale and removed Stale labels Apr 13, 2024

github-actions bot added the Stale label Apr 30, 2024

vsop-479 added 2 commits May 8, 2024 14:13

Merge branch 'main' into optimize_prefix_query

9b40381

Set isBinary to true.

2579c57

vsop-479 added 2 commits May 8, 2024 17:07

Fix comment.

a1c587c

Add matchAllSuffix to equals and ramBytesUsed.

0199f4a

github-actions bot removed the Stale label May 9, 2024

Rename canMatchAllSuffix to detectMatchAllSuffix.

a5f30bc

github-actions bot added the Stale label May 24, 2024

vsop-479 added 2 commits June 21, 2024 13:36

Merge branch 'main' into optimize_prefix_query

7e7e924

Add matchAllSuffix to toString.

30cd976

github-actions bot removed the Stale label Jun 22, 2024

github-actions bot added the Stale label Jul 7, 2024

vsop-479 added 3 commits August 12, 2024 15:23

Revert TestWildcardQuery to resolve conflicts.

be3076a

Merge branch 'main' into optimize_prefix_query

736791c

Resolve conflicts.

0f6b434

vsop-479 requested a review from mikemccand August 12, 2024 08:17

github-actions bot removed the Stale label Aug 13, 2024

github-actions bot added the Stale label Aug 28, 2024

vsop-479 changed the title ~~Terminate automaton after matched the whole prefix for PrefixQuery.~~ Terminate automaton when it can match all suffixes, and match suffixes directly. Sep 9, 2024

github-actions bot removed the Stale label Sep 10, 2024

Merge branch 'main' into optimize_prefix_query

73b4ced

github-actions bot added the Stale label Oct 5, 2024
