RegExp::toAutomaton no longer minimizes #13706
Comments
@mikemccand @rmuir - your thoughts here would be helpful, since I'm less familiar with this area of code. |
Minimization is a sure way to prove an automaton accepts all input strings because then the isTotal check is trivial [1]. You could try to trace all possible transitions, starting from the root and a full character range and see if everything in that range is always accepted... Could be fun, implementation-wise. Looking at the examples, I wonder if this has to be a strict optimization - maybe early checking for common regexp values (.*) would be sufficient and everything else would just run as an automaton (optimized or not)? If this isn't sufficient then I think you'll have to restore the minimization algorithm on ES side. |
In a similar vein, I wonder if |
@ChrisHegarty implementation of This is the only place that If you really need to minimize here, can you use something like this as a workaround? https://github.com/apache/lucene/blob/main/lucene/test-framework/src/java/org/apache/lucene/tests/util/automaton/AutomatonTestUtil.java#L338-L345 Sorry, I haven't thought about this If we need to improve
|
This is just what i'm mulling over now, relaxing
edit: struggles :) |
I had a similar thought. Looking at the code it kinda looks a little tacky, but also kinda makes some sense, e.g.
  case REGEXP_REPEAT:
+   if (exp1.kind == Kind.REGEXP_ANYCHAR && automaton_provider == null) {
+     return Automata.makeAnyString();
+   } else {
      a = Operations.repeat(exp1.toAutomaton(automata, automaton_provider));
+   }
    break; |
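The idea in the diff above can be illustrated without Lucene. This is a minimal sketch, assuming a hypothetical miniature parse tree: `RepeatShortcutSketch`, `Node`, `Kind` and `isTriviallyTotal` are invented for illustration and are not Lucene's `RegExp` internals. It recognises the shape "repeat of any-char" at the syntax level, before any automaton is built.

```java
/** Hypothetical miniature regex parse tree; not Lucene's RegExp class. */
final class RepeatShortcutSketch {
  enum Kind { ANYCHAR, CHAR, REPEAT, CONCAT }

  /** A parse node: leaf kinds leave both children null, REPEAT uses only left. */
  static final class Node {
    final Kind kind;
    final Node left;
    final Node right;

    Node(Kind kind, Node left, Node right) {
      this.kind = kind;
      this.left = left;
      this.right = right;
    }
  }

  static Node leaf(Kind kind) {
    return new Node(kind, null, null);
  }

  static Node repeat(Node operand) {
    return new Node(Kind.REPEAT, operand, null);
  }

  static Node concat(Node a, Node b) {
    return new Node(Kind.CONCAT, a, b);
  }

  /** True only for the exact shape REPEAT(ANYCHAR), i.e. the pattern ".*". */
  static boolean isTriviallyTotal(Node n) {
    return n.kind == Kind.REPEAT && n.left != null && n.left.kind == Kind.ANYCHAR;
  }
}
```

The check in this sketch is deliberately syntactic: it would not recognise an equivalent pattern such as `.*.*`, which is why a more general totality check on the automaton itself is still useful.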
Here's a round two, to prevent any error on NFA or having transitions to dead states:
|
@rmuir Just skimming your update to isTotal, it looks good. I think that it will be more generally useful, given that we minimize less now. Separately, it might also make sense to improve RegExp, as suggested earlier in this issue. |
let's fix the regexp parser first? It is easier to reason about and less scary than stuff like Previously, regexp parser was calling |
need to stare at it some more. I don't like that it uses some stuff such as And since we are checking for "total", we definitely don't need to do subsetOf twice: and I don't like that |
@ChrisHegarty I created draft PR, but I am still not happy with it yet. |
See PR: #13707, I took a different approach which solves the practical problem without doing scary stuff.
|
I like the brevity of using sameLanguage! :) I keep trying to find a counterexample to the assertion that a deterministic, total automaton must accept full language in each state reachable from root (there may be more than one transition but they must cover the full minAlphabet..maxAlphabet range, always ending in an accepting state somewhere). If so, it should be possible to implement isTotal as a full traversal of the automaton in O(num states)? So something like this would also return true:
|
I like the sameLanguage too, but I don't like the potential quadratic cost, considering we currently expect the calculation to be fast, and it is called on every automaton. I think it should be avoided in production code? As far as your counterexample, it is actually difficult to create such an automaton, you managed it with union! e.g. if you just create a state and add several ranges instead of one big range, they will be collapsed into one single range when you That's why I thought, there is something to be said for a very simple, constant-time check that will be practical as opposed to perfect: it will work for the stuff coming out of regex parser, or for "normal" stuff coming from the api (e.g. repeat). For that it needs to check two states (or same state twice) instead of one. But if you are able to implement it in linear time that solves all the cases, that would be great, let's do that instead. |
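The "practical as opposed to perfect" check described above could look roughly like the following. This is a sketch under stated assumptions, not the patch discussed in the thread: the automaton is modelled as plain arrays (`accept[s]` and `transitions[s]` holding `{min, max, dest}` triples, state 0 initial), and the name `looksTotal` is hypothetical. It inspects at most two states, so a `false` result means "not recognised", not "not total".

```java
/** Hypothetical array-based automaton model; not Lucene's Automaton class. */
final class RelaxedTotalSketch {
  /**
   * Constant-time, incomplete totality check.
   * transitions[s] lists {min, max, dest} triples for state s; state 0 is the initial state.
   */
  static boolean looksTotal(
      boolean[] accept, int[][][] transitions, int minAlphabet, int maxAlphabet) {
    if (accept.length == 0 || !accept[0] || transitions[0].length != 1) {
      return false;
    }
    int[] first = transitions[0][0];
    if (first[0] != minAlphabet || first[1] != maxAlphabet) {
      return false;
    }
    int dest = first[2];
    if (dest == 0) {
      return true; // minimal form: one accepting state looping on itself
    }
    // second form: the initial state steps to another accepting state that loops on itself
    if (!accept[dest] || transitions[dest].length != 1) {
      return false;
    }
    int[] second = transitions[dest][0];
    return second[0] == minAlphabet && second[1] == maxAlphabet && second[2] == dest;
  }
}
```

In this model a state whose full range is split across two transitions is reported as `false` even though the automaton is total, which is the kind of case a linear-time traversal would handle.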
I think the "full traversal" suggested by Dawid here would be very fast. The annoying part is probably just the reachability (e.g. regex parser produces automatons with some unreachable states), but we have some helpers for that already in Operations? |
I agree - I don't think it's a good practical replacement solution, but it's a very elegant theoretical one. :)
I think the relaxation patch is fine as a short first step - it doesn't claim to be optimal (PnP, as Mike loves to say). I'll add it to my todo list, it seems like a fun little project, although finding the time is difficult.
I don't think all states need to be considered - only those reachable from the initial state. Tracking which states have been checked already may add some overhead but even with this, it should be fast (enough)? |
Yes, this one is very important actually, you get 3 states with
lemme try adding your test, that is helpful as I was having a tough time coming up with "varieties" to test. I will take a stab at it. It would be great to fix the javadoc to not require minimization to call this function. |
You could probably create an automaton containing states with an insane number of outgoing transitions, for example one transition for each character... then resolving that such a state actually covers the full min..max range, with no gaps, could be costly. The only thing I can think of is sorting transition ranges and checking for continuity (and min/max)... this may get expensive. Whether such unrealistic automata can ever occur in reality (as a product of the set of operations we make available) is another question... |
I don't think it is too bad because transitions are already sorted and collapsed for each state when you call But when you "iterate transitions" in order (0..numTransitions) to resolve a state, you are walking them in sorted order. |
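Because the ranges are walked in sorted order, checking that they cover min..max with no gaps is a single pass. A minimal sketch, assuming the ranges are given as `{min, max}` pairs already sorted by `min` (the class and method names are hypothetical); unlike a check written only for deterministic automata, it takes the running maximum so overlapping ranges are tolerated:

```java
/** Hypothetical helper; models one state's outgoing label ranges as {min, max} pairs. */
final class RangeCoverageSketch {
  /** Returns true if the sorted ranges cover every label in minAlphabet..maxAlphabet. */
  static boolean covers(int[][] sortedRanges, int minAlphabet, int maxAlphabet) {
    // highest label known to be covered so far; long avoids overflow at the int limits
    long covered = (long) minAlphabet - 1;
    for (int[] range : sortedRanges) {
      if (range[0] > covered + 1) {
        return false; // a gap between the previous range and this one
      }
      covered = Math.max(covered, range[1]);
    }
    return covered >= maxAlphabet;
  }
}
```

The cost is linear in the number of ranges for the state, with no extra sorting step, provided the input really is sorted.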
@dweiss i coded your idea up like this:
/**
 * Returns true if the given automaton accepts all strings for the specified min/max range of the
 * alphabet.
 */
public static boolean isTotal(Automaton a, int minAlphabet, int maxAlphabet) {
  BitSet states = getLiveStates(a);
  Transition spare = new Transition();
  for (int state = states.nextSetBit(0); state >= 0; state = states.nextSetBit(state + 1)) {
    // all reachable states must be accept states
    if (a.isAccept(state) == false) return false;
    // all reachable states must contain transitions covering minAlphabet-maxAlphabet
    int previousLabel = minAlphabet - 1;
    for (int transition = 0; transition < a.getNumTransitions(state); transition++) {
      a.getTransition(state, transition, spare);
      // no gaps are allowed
      if (spare.min > previousLabel + 1) return false;
      previousLabel = spare.max;
    }
    if (previousLabel < maxAlphabet) return false;
    if (state == Integer.MAX_VALUE) {
      break; // or (state+1) would overflow
    }
  }
  // we've checked all the states, if it's non-empty, it's total
  return a.getNumStates() > 0;
}
Only surprise was the last line, so the logic is:
|
where "empty" means, at least one reachable state. Automaton can have all dead states, and that doesn't make it total :) |
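To experiment with this logic outside Lucene, the traversal can be modelled on plain arrays. This is a sketch under stated assumptions, not the Lucene method: the automaton is deterministic, `accept[s]` and `transitions[s]` (sorted `{min, max, dest}` triples, state 0 initial) are a hypothetical representation, and reachability is computed with a simple breadth-first walk rather than Lucene's live-state helper.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical array-based automaton model; not Lucene's Automaton class. */
final class TotalTraversalSketch {
  /**
   * Returns true if every state reachable from state 0 accepts and has transitions
   * covering minAlphabet..maxAlphabet without gaps.
   */
  static boolean isTotal(
      boolean[] accept, int[][][] transitions, int minAlphabet, int maxAlphabet) {
    if (accept.length == 0) {
      return false; // no states at all: accepts nothing, so it is not total
    }
    boolean[] seen = new boolean[accept.length];
    Deque<Integer> queue = new ArrayDeque<>();
    seen[0] = true;
    queue.add(0);
    while (!queue.isEmpty()) {
      int state = queue.remove();
      if (!accept[state]) {
        return false; // every reachable state must accept
      }
      long covered = (long) minAlphabet - 1; // highest label covered so far
      for (int[] t : transitions[state]) {
        if (t[1] < minAlphabet || t[0] > maxAlphabet) {
          continue; // range lies outside the alphabet of interest
        }
        if (t[0] > covered + 1) {
          return false; // gap in the alphabet
        }
        covered = Math.max(covered, t[1]);
        if (!seen[t[2]]) {
          seen[t[2]] = true;
          queue.add(t[2]);
        }
      }
      if (covered < maxAlphabet) {
        return false;
      }
    }
    return true;
  }
}
```

Each reachable state and each of its transitions is visited once, so the cost in this model is linear in states plus transitions. Unreachable states are never inspected, and an automaton with no states is reported as not total.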
Yes, I like it! I had some time to think about it before I went to bed and this implementation is actually a direct rollout of the definition of accepted language equivalence for deterministic automata - just what you mentioned at the beginning. Two equivalent (deterministic) automata must accept the same set of symbols from any state reachable for any input starting at the initial state. The automaton we compare against just happens to be repeat(anyCharacter()), so in any reachable state of automaton A we compare against the only state in automaton B - a self-connected state accepting all symbols. Consistent with the conditions you mentioned. I'm glad this worked out to be so elegant and thank you for the implementation. |
💙 |
I am trying to review this PR, but got distracted / rat-holed into this statement lol: Is In total I think its cost is actually O( |
+1, heh. |
Ahh that last line was sneaky :) |
I think this is only one of the scarier parts about it. The other scary part is that it may throw For these reasons too, I wanted to avoid its usage in something that gets called e.g. by CompiledAutomaton and proposed moving it to AutomatonTestUtil for test purposes only: #13708 |
Yeah +1 to move it to test only! |
Closed by #13707 |
There are a number of optimizations in Elasticsearch that depend upon the automaton from a RegExp being total - accepting all strings - [1] [2]. Changes in the upcoming Lucene 10, to not minimize the automaton returned by RegExp, have broken the assumption that these optimisations were building upon. At least how they stand today, and I'm not sure how best to replicate the functionality in Lucene 10.
For example this is fine:
, while this is not:
Without an API to minimise (since MinimizationOperations is now test-only), I'm not sure how to re-code such optimizations. Or if we should be attempting to provide our own minimize implementation. Or if RegExp should be returning a total automaton for .*?
[1] https://github.com/elastic/elasticsearch/blob/0426e1fbd5dbf1eb9dae07f9af3592569165f5de/x-pack/plugin/wildcard/src/main/java/org/elasticsearch/xpack/wildcard/mapper/WildcardFieldMapper.java#L383
[2] https://github.com/elastic/elasticsearch/blob/0426e1fbd5dbf1eb9dae07f9af3592569165f5de/x-pack/plugin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/expression/predicate/regex/AbstractStringPattern.java#L30