Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RegExp::toAutomaton no longer minimizes #13706

Closed
ChrisHegarty opened this issue Sep 2, 2024 · 31 comments
Closed

RegExp::toAutomaton no longer minimizes #13706

ChrisHegarty opened this issue Sep 2, 2024 · 31 comments
Assignees
Milestone

Comments

@ChrisHegarty
Copy link
Contributor

There are a number of optimization in Elasticsearch that depend upon the automaton from a RegExp being total - accepts all strings - [1] [2]. Changes in the upcoming Lucene 10, to not minimize automaton returned by RegExp, has broken the assumption that these optimisations were building upon. At least how they stand today, and I'm not sure how best to replicate the functionality in Lucene 10.

For example this is fine:

RegExp r = new RegExp("@");
Automaton a = r.toAutomaton();
assertTrue(a.isDeterministic());
assertTrue(Operations.isTotal(a));

, while this is not:

RegExp r = new RegExp(".*");
Automaton a = r.toAutomaton();
assertTrue(a.isDeterministic());
assertTrue(Operations.isTotal(a));  // <<< isTotal returns false

Without an API to minimise (since MinimizationOperations is now test-only), I'm not sure how to re-code such optimizations. Or if we should be attempting to provide our own minimize implementation. Or if RegExp should be returning a total automaton for .*?

[1] https://github.com/elastic/elasticsearch/blob/0426e1fbd5dbf1eb9dae07f9af3592569165f5de/x-pack/plugin/wildcard/src/main/java/org/elasticsearch/xpack/wildcard/mapper/WildcardFieldMapper.java#L383
[2] https://github.com/elastic/elasticsearch/blob/0426e1fbd5dbf1eb9dae07f9af3592569165f5de/x-pack/plugin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/expression/predicate/regex/AbstractStringPattern.java#L30

@ChrisHegarty ChrisHegarty changed the title RegExp::automaton no longer minimizes RegExp::toAutomaton no longer minimizes Sep 2, 2024
@ChrisHegarty
Copy link
Contributor Author

@mikemccand @rmuir - your thoughts here would be helpful, since I'm less familiar with this area of code.

@dweiss
Copy link
Contributor

dweiss commented Sep 2, 2024

Minimization is a sure way to prove an automaton accepts all input strings because then the isTotal check is trivial [1]. You could try to trace all possible transitions, starting from the root and a full character range and see if everything in that range is always accepted... Could be fun, implementation-wise.

Looking at the examples, I wonder if this has to be a strict optimization - maybe early checking for common regexp values (.*) would be sufficient and everything else would just run as an automaton (optimized or not)?

If this isn't sufficient then I think you'll have to restore the minimization algorithm on ES side.

[1] https://github.com/cs-au-dk/dk.brics.automaton/blob/master/src/dk/brics/automaton/BasicOperations.java#L583-L590

@jpountz
Copy link
Contributor

jpountz commented Sep 2, 2024

In a similar vein, I wonder if RegExp could create more efficient automata out of the box. For instance, it looks like Operations#repeat could be optimized in the case when there is a single transition to directly create a minimal automaton, and this would in-turn make RegExp(".*") create an automaton that returns true for Operations#isTotal.

@rmuir
Copy link
Member

rmuir commented Sep 2, 2024

@ChrisHegarty implementation of isTotal() method requires a minimal DFA. If the automaton is not minimal, it may return false but it should not create a problem.

This is the only place that isTotal() is called in lucene (see the comment): https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/automaton/CompiledAutomaton.java#L181

If you really need to minimize here, can you use something like this as a workaround? https://github.com/apache/lucene/blob/main/lucene/test-framework/src/java/org/apache/lucene/tests/util/automaton/AutomatonTestUtil.java#L338-L345

Sorry, I havent thought about this isTotal much to see if there is a more reasonable implementation, just need to think it over.

If we need to improve isTotal, it is definitely not necessary to minimize, e.g. the following only requires determinization + removal of dead states

boolean isTotal(Automaton a) {
  return sameLanguage(a, Automata.makeAnyString());
}

@rmuir
Copy link
Member

rmuir commented Sep 2, 2024

This is just what i'm mulling over now, relaxing isTotal to no longer require a minimal DFA:

diff --git a/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java b/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
index 2052b1c50bf..0de4ac013ee 100644
--- a/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
+++ b/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
@@ -864,15 +864,22 @@ public final class Operations {
 
   /**
    * Returns true if the given automaton accepts all strings for the specified min/max range of the
-   * alphabet. The automaton must be minimized.
+   * alphabet. The automaton must be deterministic with no transitions to dead states.
    */
   public static boolean isTotal(Automaton a, int minAlphabet, int maxAlphabet) {
-    if (a.isAccept(0) && a.getNumTransitions(0) == 1) {
+    // minimal case
+    if (a.getNumStates() == 1 && a.isAccept(0) && a.getNumTransitions(0) == 1) {
       Transition t = new Transition();
       a.getTransition(0, 0, t);
       return t.dest == 0 && t.min == minAlphabet && t.max == maxAlphabet;
     }
-    return false;
+    // deterministic case
+    Automaton a2 = new Automaton();
+    int s = a2.createState();
+    a2.setAccept(s, true);
+    a2.addTransition(s, s, minAlphabet, maxAlphabet);
+    a2.finishState();
+    return sameLanguage(a, a2);
   }
 
   /**

edit: struggles :)

@ChrisHegarty
Copy link
Contributor Author

In a similar vein, I wonder if RegExp could create more efficient automata out of the box. For instance, it looks like Operations#repeat could be optimized in the case when there is a single transition to directly create a minimal automaton, and this would in-turn make RegExp(".*") create an automaton that returns true for Operations#isTotal.

I had a similar thought. Looking at the code it kinda looks a little tacky, but also kinda makes sone sense, e.g.

      case REGEXP_REPEAT:
+        if (exp1.kind == Kind.REGEXP_ANYCHAR && automaton_provider == null) {
+          return Automata.makeAnyString();
+        } else {
          a = Operations.repeat(exp1.toAutomaton(automata, automaton_provider));
+       }
        break;

@rmuir
Copy link
Member

rmuir commented Sep 2, 2024

Here's a round two, to prevent any error on NFA or having transitions to dead states:

diff --git a/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java b/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
index 2052b1c50bf..10103628fad 100644
--- a/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
+++ b/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
@@ -864,14 +864,25 @@ public final class Operations {
 
   /**
    * Returns true if the given automaton accepts all strings for the specified min/max range of the
-   * alphabet. The automaton must be minimized.
+   * alphabet. The automaton must be deterministic with no transitions to dead states.
    */
   public static boolean isTotal(Automaton a, int minAlphabet, int maxAlphabet) {
-    if (a.isAccept(0) && a.getNumTransitions(0) == 1) {
+    // minimal case
+    if (a.getNumStates() == 1 && a.isAccept(0) && a.getNumTransitions(0) == 1) {
       Transition t = new Transition();
       a.getTransition(0, 0, t);
       return t.dest == 0 && t.min == minAlphabet && t.max == maxAlphabet;
     }
+    // deterministic case
+    if (a.isDeterministic() && hasDeadStatesFromInitial(a) == false) {
+      Automaton a2 = new Automaton();
+      int s = a2.createState();
+      a2.setAccept(s, true);
+      a2.addTransition(s, s, minAlphabet, maxAlphabet);
+      a2.finishState();
+      return sameLanguage(a, a2);
+    }
+    // NFA, or has transitions to dead states, return false
     return false;
   }

@ChrisHegarty
Copy link
Contributor Author

@rmuir Just skimming your update to isTotal, it looks good. I think that it will be more generally useful, given that we minimize less now.

Separately, I might also make sense to improve RegExp, as suggested earlier in this issue.

@rmuir
Copy link
Member

rmuir commented Sep 2, 2024

I had a similar thought. Looking at the code it kinda looks a little tacky, but also kinda makes sone sense, e.g.

      case REGEXP_REPEAT:
+        if (exp1.kind == Kind.REGEXP_ANYCHAR && automaton_provider == null) {
+          return Automata.makeAnyString();
+        } else {
          a = Operations.repeat(exp1.toAutomaton(automata, automaton_provider));
+       }
        break;

let's fix the regexp parser first? It is easier to reason about and less scary than stuff like isTotal and subsetOf.

Previously, regexp parser was calling minimize() on every...parsing...step. I removed this because it is obviously not good. But if we have a simple fix to make it emit better automaton for practical uses, let's do it?

@rmuir
Copy link
Member

rmuir commented Sep 2, 2024

@rmuir Just skimming your update to isTotal, it looks good. I think that it will be more generally useful, given that we minimize less now.

need to stare at it some more. I don't like that it uses some stuff such as sameLanguage which has a TODO to move to tests/test-framework and is currently only used by tests.

And since we are checking for "total", we definitely don't need to do subsetOf twice: subsetOf(a2, a1) && subsetOf(a1, a2).

and I don't like that subsetOf is quadratic to the number of states: it is heavy stuff. There is probably a simpler algorithm.

@rmuir
Copy link
Member

rmuir commented Sep 2, 2024

@ChrisHegarty I created draft PR, but I am still not happy with it yet.

@rmuir
Copy link
Member

rmuir commented Sep 2, 2024

See PR: #13707, I took a different approach which solves the practical problem without doing scary stuff.

   public static boolean isTotal(Automaton a, int minAlphabet, int maxAlphabet) {
+    // allows some "fuzziness" to detect non-minimal forms such as those from RegExp
     if (a.isAccept(0) && a.getNumTransitions(0) == 1) {
       Transition t = new Transition();
       a.getTransition(0, 0, t);
-      return t.dest == 0 && t.min == minAlphabet && t.max == maxAlphabet;
+      int state = t.dest;
+      if (t.min == minAlphabet
+          && t.max == maxAlphabet
+          && a.isAccept(state)
+          && a.getNumTransitions(state) == 1) {
+        a.getTransition(state, 0, t);
+        return t.dest == state && t.min == minAlphabet && t.max == maxAlphabet;
+      }
     }
     return false;
   }

@dweiss
Copy link
Contributor

dweiss commented Sep 2, 2024

I like the brevity of using sameLanguage! :)

I keep trying to find a counterexample to the assertion that a deterministic, total automaton must accept full language in each state reachable from root (there may be more than one transition but they must cover the full minAlphabet..maxAlphabet range, always ending in an accepting state somewhere.

If so, it should be possible to implement isTotal as a full traversal of the automaton in O(num states)? So something like this would also return true:

      Automaton c =
          Operations.repeat(
              Operations.union(
                  Automata.makeCharRange(Character.MIN_CODE_POINT, 100),
                  Automata.makeCharRange(101, Character.MAX_CODE_POINT)));

      System.out.println(c.toDot());

image

@rmuir
Copy link
Member

rmuir commented Sep 2, 2024

I like the sameLanguage too, but I don't like the potential quadratic cost, considering we currently expect the calculation to be fast, and it is called on every automaton. I think it should be avoided in production code?

As far as your counterexample, it is actually difficult to create such an automaton, you managed it with union! e,g, if you just create a state and add several ranges instead of one big range, they will be collapsed into one single range when you finishState: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/automaton/Automaton.java#L232

That's why I thought, there is something to be said for a very simple, constant-time check that will be practical as opposed to perfect: it will work for the stuff coming out of regex parser, or for "normal" stuff coming from the api (e.g. repeat). for that it needs to check two states (or same state twice) instead of one.

But if you are able to implement it in linear time that solves all the cases, that would be great, let's do that instead.

@rmuir
Copy link
Member

rmuir commented Sep 2, 2024

I think the "full traversal" suggested by Dawid here would be very fast. The annoying part is probably just the reachability (e.g. regex parser produces automatons with some unreachable states), but we have some helpers for that already in Operations?

@dweiss
Copy link
Contributor

dweiss commented Sep 2, 2024

I like the sameLanguage too, but I don't like the potential quadratic cost, considering we currently expect the calculation to be fast, and it is called on every automaton. I think it should be avoided in production code?

I agree - I don't think it's a good practical replacement solution, but it's a very elegant theoretical one. :)

But if you are able to implement it in linear time that solves all the cases, that would be great, let's do that instead.

I think the relaxation patch is fine as a short first step - it doesn't claim to be optimal (PnP, as Mike loves to say). I'll add it to my todo list, it seems like a fun little project, although finding the time is difficult.

The annoying part is probably just the reachability (e.g. regex parser produces automatons with some unreachable states),

I don't think all states need to be considered - only those reachable from the initial state. Tracking which states have been checked already may add some overhead but even with this, it should be fast (enough)?

@rmuir
Copy link
Member

rmuir commented Sep 2, 2024

I don't think all states need to be considered - only those reachable from the initial state. Tracking which states have been checked already may add some overhead but even with this, it should be fast (enough)?

Yes, this one is very important actually, you get 3 states with Operations.repeat(Automata.makeAnyChar()) and one is unreachable.

I think the relaxation patch is fine as a short first step - it doesn't claim to be optimal (PnP, as Mike loves to say). I'll add it to my todo list, it seems like a fun little project, although finding the time is difficult.

lemme try adding your test, that is helpful as I was having a tough time coming up with "varieties" to test. I will take a stab at it. It would be great to fix the javadoc to not require minimization to call this function.

@dweiss
Copy link
Contributor

dweiss commented Sep 2, 2024

You could probably create an automaton containing states with an insane number of outgoing transitions, for example one transition for each character... then resolving that such a state actually covers the full min..max range, with no gaps, could be costly. The only thing I can think of is sorting transition ranges and checking for continuity (and min/max)... this may get expensive.

Whether such unrealistic automata can ever occur in reality (as a product of the set of operations we make available) is another question...

@dweiss
Copy link
Contributor

dweiss commented Sep 2, 2024

Like so:

      Automaton c =
          Operations.repeat(
              Operations.union(
                  List.of(
                      Automata.makeCharRange(Character.MIN_CODE_POINT, 100),
                      Automata.makeChar(101),
                      Automata.makeChar(102),
                      Automata.makeChar(103),
                      Automata.makeCharRange(104, Character.MAX_CODE_POINT))));

image

@rmuir
Copy link
Member

rmuir commented Sep 2, 2024

I don't think it is too bad because transitions are already sorted and collapsed for each state when you call finishState(). For such an automaton you paid the price at construction time :)

But when you "iterate transitions" in order (0..numTransitions) to resolve a state, you are walking them in sorted order.

@rmuir
Copy link
Member

rmuir commented Sep 2, 2024

@dweiss i coded your idea up like this:

  /** 
   * Returns true if the given automaton accepts all strings for the specified min/max range of the
   * alphabet. 
   */   
  public static boolean isTotal(Automaton a, int minAlphabet, int maxAlphabet) {
    BitSet states = getLiveStates(a);
    Transition spare = new Transition();
    for (int state = states.nextSetBit(0); state >= 0; state = states.nextSetBit(state + 1)) {
      // all reachable states must be accept states
      if (a.isAccept(state) == false) return false;
      // all reachable states must contain transitions covering minAlphabet-maxAlphabet
      int previousLabel = minAlphabet - 1;
      for (int transition = 0; transition < a.getNumTransitions(state); transition++) {
        a.getTransition(state, transition, spare);
        // no gaps are allowed
        if (spare.min > previousLabel + 1) return false;
        previousLabel = spare.max;
      }
      if (previousLabel < maxAlphabet) return false;
      if (state == Integer.MAX_VALUE) {
        break; // or (state+1) would overflow
      }
    }
    // we've checked all the states, if its non-empty, its total
    return a.getNumStates() > 0;
  }

Only surprise was the last line, so the logic is:

  • all reachable states must be accept states
  • all reachable states must contain transitions covering the min..max range
  • automaton must not be empty

@rmuir
Copy link
Member

rmuir commented Sep 2, 2024

ce4cce0

@rmuir
Copy link
Member

rmuir commented Sep 2, 2024

  • automaton must not be empty

where "empty" means, at least one reachable state. Automaton can have all dead states, and that doesn't make it total :)

@dweiss
Copy link
Contributor

dweiss commented Sep 3, 2024

Yes, I like it!

I had some time to think about it before I went to bed and this implementation is actually a direct rollout of the definition of accepted language equivalence for deterministic automata - just what you mentioned at the beginning. Two equivalent (deterministic) automata must accept the same set of symbols from any state reachable for any input starting at the initial state. The automaton we compare against just happens to be repeat(anyCharacter()), so in any reachable state of automaton A we compare against the only state in automaton B - a self-connected state accepting all symbols. Consistent with the conditions you mentioned.

I'm glad this worked out to be so elegant and thank you for the implementation.

@ChrisHegarty
Copy link
Contributor Author

💙

@mikemccand
Copy link
Member

and I don't like that subsetOf is quadratic to the number of states: it is heavy stuff. There is probably a simpler algorithm.

I am trying to review this PR, but got distracted / rat-holed into this statement lol:

Is subsetOf really quadratic? I know its javadoc says so, but I'm not convinced. Yes it has embedded for loops (one for number of transitions leaving n1, the other for number of transitions leaving n2), but that is basically doing a merge sort so its cost is linear O(max(n1, n2)).

In total I think its cost is actually O(totalTransitionCountInA1)? Anyway, this is a (fun) distraction ... I'll try to actually review the change ;)

@mikemccand
Copy link
Member

I think the relaxation patch is fine as a short first step - it doesn't claim to be optimal (PnP, as Mike loves to say).

+1, heh.

@mikemccand
Copy link
Member

  • automaton must not be empty

Ahh that last line was sneaky :)

@rmuir
Copy link
Member

rmuir commented Sep 4, 2024

Is subsetOf really quadratic? I know its javadoc says so, but I'm not convinced.

I think this is only one of the scarier parts about it. The other scary part is that it may throw IllegalArgumentException, or in some cases AssertionError (if asserts are enabled, if not... UB?).

For these reasons too, I wanted to avoid its usage in something that gets called e.g. by CompiledAutomaton and proposed moving it to AutomatonTestUtil for test purposes only: #13708

@mikemccand
Copy link
Member

Is subsetOf really quadratic? I know its javadoc says so, but I'm not convinced.

I think this is only one of the scarier parts about it. The other scary part is that it may throw IllegalArgumentException, or in some cases AssertionError (if asserts are enabled, if not... UB?).

For these reasons too, I wanted to avoid its usage in something that gets called e.g. by CompiledAutomaton and proposed moving it to AutomatonTestUtil for test purposes only: #13708

Yeah +1 to move it to test only!

@rmuir rmuir added this to the 10.0.0 milestone Sep 5, 2024
@rmuir
Copy link
Member

rmuir commented Sep 5, 2024

Closed by #13707

@rmuir rmuir closed this as completed Sep 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants