Terminate automaton when it can match all suffixes, and match suffixes directly. #13072
base: main
Modify comment. Modify comment.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
@jpountz Please take a look when you get a chance.
I will have a look -- thanks for the ping @jpountz.
This is a clever optimization! You recognize that the Automaton will match all possible suffixes once it reaches this state, and so you can more efficiently enumerate all terms from the block tree under that state.
I have concerns about storing this in Automaton itself, and the naming was confusing to me :) Could we instead store it in RunAutomaton? Or, possibly, do it on the fly in IntersectEnum by detecting a state that is both accept and has a .* transition back onto itself?
Have you tried to measure any performance change with this? E.g. you could run a luceneutil benchy with just PrefixQuery, or Regexp/WildcardQuery that also have this property (match-all states in their automata).
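A minimal sketch of the on-the-fly check suggested above, assuming byte labels in [0, 255]: a state is treated as match-all when it is an accept state and has a single transition covering the whole label range that loops back onto itself. The `MatchAllSuffixSketch` class, its `Transition` type, and `isMatchAllSuffix` are hypothetical stand-ins for illustration; they are not Lucene's `Automaton` or `RunAutomaton` API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MatchAllSuffixSketch {
  /** Hypothetical transition: labels in [min, max] lead from source to dest. */
  static final class Transition {
    final int source, dest, min, max;

    Transition(int source, int dest, int min, int max) {
      this.source = source;
      this.dest = dest;
      this.min = min;
      this.max = max;
    }
  }

  private final List<Transition> transitions = new ArrayList<>();
  private final Set<Integer> acceptStates = new HashSet<>();

  void addTransition(int source, int dest, int min, int max) {
    transitions.add(new Transition(source, dest, min, max));
  }

  void setAccept(int state) {
    acceptStates.add(state);
  }

  /** True if state is an accept state with a self-loop covering [0, maxLabel]. */
  boolean isMatchAllSuffix(int state, int maxLabel) {
    if (!acceptStates.contains(state)) {
      return false;
    }
    for (Transition t : transitions) {
      if (t.source == state && t.dest == state && t.min == 0 && t.max >= maxLabel) {
        return true;
      }
    }
    return false;
  }

  /** Toy automaton for the prefix "ab": 0 -a-> 1 -b-> 2, then 2 loops on any byte. */
  static MatchAllSuffixSketch prefixAb() {
    MatchAllSuffixSketch a = new MatchAllSuffixSketch();
    a.addTransition(0, 1, 'a', 'a');
    a.addTransition(1, 2, 'b', 'b');
    a.addTransition(2, 2, 0, 255);
    a.setAccept(2);
    return a;
  }

  public static void main(String[] args) {
    MatchAllSuffixSketch a = prefixAb();
    System.out.println(a.isMatchAllSuffix(2, 255)); // accept state with a full self-loop
    System.out.println(a.isMatchAllSuffix(1, 255)); // not an accept state
  }
}
```

Passing the maximum label as a parameter keeps the check independent of which query produced the automaton. It only recognizes the single-transition shape, so it would miss states whose match-all behaviour is spread across several transitions.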
lucene/core/src/java/org/apache/lucene/util/automaton/Automaton.java
lucene/core/src/java/org/apache/lucene/util/automaton/Automaton.java
lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java
@mikemccand Thanks for your suggestion, I will try to implement it.
I am working on this. @jpountz Thanks for your reply. |
I think the optimization may be similar to the one done in AutomatonTermsEnum? When "ping-ponging" the term dictionary against the automaton, it tracks I think, it works a bit more general than just prefixquery and also helps with regex and wildcard queries too. |
Thanks for reminding that, I will dig into |
There is still problem in the state of match all suffix of On the other hand, I will dig into AutomatonTermsEnum's optimization. |
@mikemccand |
It seems the optimization of If I am not mistaken, this optimization is different from Or maybe you mean we can improve |
Thanks @vsop-479 -- this looks closer. I like that the opto is now contained under RunAutomaton, but I'm confused/concerned about sometimes checking for 255 max label and other times 127 depending on which query.
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/IntersectTermsEnum.java
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/IntersectTermsEnum.java
lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java
lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java
assert automaton.isAccept(state);
int numTransitions = automaton.getNumTransitions(state);
// Apply to PrefixQuery, TermRangeQuery.
if (numTransitions == 1) {
Can we remove this special case? Just let the for loop below handle the 1-transition case too?
Edit: hmm, I see, it is subtly different: this is checking for max label 255 but the loop below is checking 127, hmmm. This is a bit messy -- this low level of code shouldn't be specializing to different automata that come from the high level queries. Can we use alphabetSize-1 as the transition.max check instead? But, separately, we need to figure out why Regexp/WildcardQuery are compiling down to 127 as their max on .* suffix transitions? That is not even correct for matching UTF-8 encoded terms.
Perhaps we could also add test cases for custom Automata passed to AutomatonQuery, sometimes matching binary (non-UTF8) terms?
we need to figure out why Regexp/WildcardQuery are compiling down to 127 as their max on .* suffix transitions?
The Automaton for these queries (including AutomatonQuery) looks like this: 3 -> 3: [0, 127]; 3 -> 4: [194, 194]; 4 -> 3: [128, 191], assuming 3 is an accept state.
It is more complex to detect whether a state can accept all remaining suffixes for these queries, because the transitions leaving an accept state are split into many ranges, such as [0, 127], [194, 223], [224, 239], [240, 243], [244], etc.
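To make the shape of that fragment concrete, here is a small sketch that hard-codes only the three transitions quoted above and runs them over the UTF-8 bytes of a term. It assumes state 3 is both the start state and the only accept state. The class `Utf8DotStarSketch` and its methods are hypothetical, not Lucene code. A real converted automaton would have more transitions (for other lead bytes), so this toy rejects input that the full automaton would accept.

```java
import java.nio.charset.StandardCharsets;

public class Utf8DotStarSketch {
  static final int ACCEPT = 3;  // state numbering follows the fragment above
  static final int PENDING = 4; // waiting for a continuation byte
  static final int REJECT = -1; // no transition matched

  /** Applies the three quoted transitions to one unsigned byte label. */
  static int step(int state, int label) {
    if (state == ACCEPT) {
      if (label <= 127) return ACCEPT;   // 3 -> 3: [0, 127]
      if (label == 194) return PENDING;  // 3 -> 4: [194, 194]
    } else if (state == PENDING) {
      if (label >= 128 && label <= 191) return ACCEPT; // 4 -> 3: [128, 191]
    }
    return REJECT;
  }

  /** Runs the fragment over a byte sequence, starting in state 3 (assumed). */
  static boolean accepts(byte[] term) {
    int state = ACCEPT;
    for (byte b : term) {
      state = step(state, b & 0xFF);
      if (state == REJECT) return false;
    }
    return state == ACCEPT;
  }

  public static void main(String[] args) {
    // U+00A9 encodes as the bytes 194, 169, so it takes the 3 -> 4 -> 3 detour.
    System.out.println(accepts("a\u00A9b".getBytes(StandardCharsets.UTF_8)));
    // A lone lead byte leaves the run in state 4, which is not an accept state.
    System.out.println(accepts(new byte[] {'a', (byte) 0xC2}));
  }
}
```

The point of the sketch is that no single transition out of state 3 covers the full byte range, so a check that looks for one self-loop up to 255 cannot recognize this state as match-all.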
I am still working on this, any suggestion is welcome @mikemccand.
Perhaps we could also add tests cases for custom Automata passed to AutomatonQuery matching sometimes binary (non-UTF8) terms?
Added.
These queries' (including AutomatonQuery)Automaton like this: 3 -> 3: [0, 127]; 3 -> 4: [194, 194]; 4 -> 3: [128, 191]. assume 3 is an accept state.
@mikemccand
I can track an accept state's other transitions to check whether they eventually end on an accept state (typically via a [128, 191] transition). But I am not sure whether that is enough to conclude that a state can match all suffixes, or even whether it is necessary, since it may be equivalent to just checking that the [0, 127] transition's destination is an accept state.
we need to figure out why Regexp/WildcardQuery are compiling down to 127 as their max on .* suffix transitions?
I think we split the transition ([0, 1114111]) into UTF-8 edges in UTF32ToUTF8.convertOneEdge.
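The byte ranges mentioned earlier in this thread follow from where UTF-8 changes encoding length. The sketch below uses only the JDK (no Lucene classes) to encode the boundary code points of [0, 1114111] and print their unsigned byte values. The class name `Utf8BoundarySketch` and helper `utf8Bytes` are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8BoundarySketch {
  /** Returns the UTF-8 encoding of one code point as unsigned byte values. */
  static int[] utf8Bytes(int codePoint) {
    byte[] encoded =
        new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    int[] labels = new int[encoded.length];
    for (int i = 0; i < encoded.length; i++) {
      labels[i] = encoded[i] & 0xFF; // Java bytes are signed, automaton labels are not
    }
    return labels;
  }

  public static void main(String[] args) {
    // First and last code point of each UTF-8 length class (surrogates excluded).
    int[] boundaries = {0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF};
    for (int cp : boundaries) {
      System.out.println(
          "U+" + Integer.toHexString(cp).toUpperCase() + " -> " + Arrays.toString(utf8Bytes(cp)));
    }
  }
}
```

One-byte sequences end at 127, two-byte lead bytes run from 194 to 223, three-byte lead bytes from 224 to 239, and four-byte lead bytes from 240 to 244, with continuation bytes in [128, 191]. That is consistent with a .* transition over [0, 1114111] showing 127 as the max of its first-byte self-loop after conversion.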
@mikemccand
I think I can detect a match-all-suffix state for Regexp/WildcardQuery in UTF32ToUTF8.convert, after convertOneEdge, like this:
// Writes new transitions into pendingTransitions:
convertOneEdge(utf8State, destUTF8, scratch.min, scratch.max);
// Set match all suffix state.
if (scratch.min == 0 && scratch.max == 1114111 && utf8.isAccept(utf8State) && utf8.isAccept(destUTF8)) {
  utf8.setMatchAllSuffix(utf8State, true);
}
This is simple and reliable, but it would violate the rule below:
Everything else about Automaton today is fundamental (states, transitions, isAccept) and necessary, but this new member is more a best effort optimization?
Another plan: check whether a candidate state can eventually end on an accept state via [128, 191], which is added in UTF32ToUTF8.all:
utf8.addTransition(lastN, end, 128, 191); // type = all*
lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java
Outdated
Show resolved
Hide resolved
I measured it with current implementation with
@mikemccand As for Another approach is set matchAllSuffix in Or, can we just push this change for |
Conflicts resolved. |
For PrefixQuery, we can terminate the automaton on the current term once we have matched the whole prefix, and match this term directly.
Furthermore, if there is a subBlock, we could match all of its sub terms.
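As a rough illustration of the first point, the sketch below compares how many automaton steps a prefix match needs when every character of the term is stepped through, versus stopping as soon as the whole prefix has been consumed. It is a toy model over Java strings under simplifying assumptions, not the block tree or IntersectTermsEnum code. The class and method names are hypothetical, and the sub-block case is not modelled.

```java
public class PrefixEarlyTerminationSketch {
  /** Steps through every character of the term; returns steps taken, or -1 if rejected. */
  static int stepsFullRun(String term, String prefix) {
    int steps = 0;
    for (int i = 0; i < term.length(); i++) {
      // Inside the prefix the character must match; past it, any character loops.
      if (i < prefix.length() && term.charAt(i) != prefix.charAt(i)) {
        return -1;
      }
      steps++;
    }
    // A term shorter than the prefix never reaches the accept state.
    return term.length() >= prefix.length() ? steps : -1;
  }

  /** Stops once the whole prefix is matched; the remaining suffix is never inspected. */
  static int stepsEarlyExit(String term, String prefix) {
    if (term.length() < prefix.length()) {
      return -1;
    }
    for (int i = 0; i < prefix.length(); i++) {
      if (term.charAt(i) != prefix.charAt(i)) {
        return -1;
      }
    }
    return prefix.length();
  }

  public static void main(String[] args) {
    System.out.println(stepsFullRun("apple", "ap"));   // one step per character of the term
    System.out.println(stepsEarlyExit("apple", "ap")); // one step per character of the prefix
  }
}
```

Both methods accept and reject the same terms. The early-exit variant only does work proportional to the prefix length, which is the saving the description refers to.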