Relax Operations.isTotal() to work with a deterministic automaton #13707

rmuir · 2024-09-02T15:55:32Z

Operations.isTotal currently returns false unless the DFA is minimal.
This makes the method almost useless and we definitely don't want to encourage minimization just to make such a check.

Can we do a better job, e.g. return true for a non-minimal DFA?
There's an example test added that fails without the change to demonstrate:

// deterministic, but not minimal
assertTrue(Operations.isTotal(Operations.repeat(Automata.makeAnyChar())));

This is a draft PR because I still don't like that it uses subsetOf: the code literally builds a minimal "total DFA" and checks that the two automata recognize the same language. Because that reference automaton is total, we only need to call subsetOf once, but I still don't like how heavy it is. Can we do better?

See #13706 for more background

rmuir · 2024-09-02T16:10:34Z

I feel like with this code there are only two options:

1. Keep isTotal O(1), to prevent traps and problems. It does document that the input must be minimized. Maybe it is enough to fix our regexp parser in another issue so that the typical use cases of the method (optimizing) still get optimized.
2. Change isTotal to O(n^2) using the code here. Add warnings about this. Remove its usage from CompiledAutomaton so that at least we don't cause internal performance problems because of it.

a third option: isTotal that runs faster than O(n^2), would be awesome :)

rmuir · 2024-09-02T16:50:16Z

OK I took a third option here in the latest commit.

Javadoc is unchanged, we just detect the kind of "total" automaton being created by RegExp, too, without involving any scary algorithms.

It just relaxes the check and detects a total automaton that looks like this (with an a-z alphabet for demonstration):

State 0: accept
  a-z -> State 1
State 1: accept
  a-z -> State 1

This is what happens with RegExp parser today, or if you do Operations.repeat(Automata.makeAnyChar()). It is only slightly "non-minimal" and I think it doesn't make the code too ugly to handle it?
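As a rough illustration of the relaxed shape check described above, here is a minimal sketch. It does not use Lucene's `Automaton` class; the array-based representation (`accept`, `transitions`) and the names `TwoStateTotalSketch` and `looksTotal` are assumptions made up for this example, and the actual commit may be structured differently.

```java
// Hypothetical stand-in for an automaton: accept[s] marks accepting states and
// transitions[s] holds {min, max, dest} triples for state s.
final class TwoStateTotalSketch {

  /** True only for the one-state minimal form or the two-state form shown above. */
  static boolean looksTotal(boolean[] accept, int[][][] transitions, int minLabel, int maxLabel) {
    if (accept.length == 1) {
      // Minimal form: a single accepting state looping to itself on every label.
      return accept[0] && isSingleFullRange(transitions[0], minLabel, maxLabel, 0);
    }
    if (accept.length == 2) {
      // Slightly non-minimal form: state 0 -> state 1 -> state 1, both accepting.
      return accept[0]
          && accept[1]
          && isSingleFullRange(transitions[0], minLabel, maxLabel, 1)
          && isSingleFullRange(transitions[1], minLabel, maxLabel, 1);
    }
    return false; // any other shape is not recognized by this narrow check
  }

  /** True if the state has exactly one transition covering minLabel..maxLabel to dest. */
  private static boolean isSingleFullRange(int[][] ts, int minLabel, int maxLabel, int dest) {
    return ts.length == 1 && ts[0][0] == minLabel && ts[0][1] == maxLabel && ts[0][2] == dest;
  }
}
```

A check like this is cheap, but it only recognizes these two exact shapes, so other non-minimal total automata would still be reported as not total.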

rmuir · 2024-09-02T20:24:28Z

iterating again, I changed the code here to @dweiss's solution and added tests based on his examples.

it runs in linear time and space (bitset), and should work generally for any DFA, so IMO it is safe.
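The thread does not show the algorithm itself. One ingredient a linear-time check plausibly needs is a per-state test that the outgoing label ranges leave no gap over the alphabet. The sketch below shows only that ingredient, under the assumption that a state's ranges are sorted by their minimum label (as they would be in a deterministic automaton); `RangeCoverageSketch` and `coversAlphabet` are hypothetical names, not Lucene API.

```java
// Hypothetical helper: ranges are {min, max} pairs, sorted by min.
final class RangeCoverageSketch {

  /** True if the sorted ranges cover every label in minLabel..maxLabel without a gap. */
  static boolean coversAlphabet(int[][] ranges, int minLabel, int maxLabel) {
    // Highest label covered so far; long avoids overflow when adding 1.
    long previousMax = (long) minLabel - 1;
    for (int[] range : ranges) {
      if (range[0] > previousMax + 1) {
        return false; // labels between previousMax and range[0] are not covered
      }
      previousMax = Math.max(previousMax, range[1]);
    }
    return previousMax >= maxLabel; // the last range must reach the end of the alphabet
  }
}
```

Each range is visited once, so running this for every state costs time proportional to the total number of transitions.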

rmuir · 2024-09-02T20:38:03Z

heh that empty case got found in a different way by the randomized check.

it isn't enough to do a.getNumStates() > 0, for the situation where all states are dead states :)

will fix.
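To illustrate why a state-count guard is not enough, here is a small sketch. It assumes the set of live states is already available as a `BitSet` and uses a hypothetical per-state predicate standing in for "accepting and covers the alphabet"; the class and method names are made up for this example.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

final class EmptyCaseSketch {

  /** Guarding on the state count only: vacuously true when no state is live. */
  static boolean naive(int numStates, BitSet live, IntPredicate stateIsOk) {
    if (numStates == 0) {
      return false;
    }
    for (int s = live.nextSetBit(0); s >= 0; s = live.nextSetBit(s + 1)) {
      if (!stateIsOk.test(s)) {
        return false;
      }
    }
    return true; // also reached when the loop body never ran
  }

  /** Counting visited states: an automaton with no live states is not total. */
  static boolean counting(BitSet live, IntPredicate stateIsOk) {
    int seenStates = 0;
    for (int s = live.nextSetBit(0); s >= 0; s = live.nextSetBit(s + 1)) {
      if (!stateIsOk.test(s)) {
        return false;
      }
      seenStates++;
    }
    return seenStates > 0;
  }
}
```

With one state that is not live, the first variant reports "total" because the loop has nothing to reject, while the second correctly reports "not total".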

rmuir · 2024-09-02T21:04:11Z

ok, the tests have found all the fun with empty cases and I think it is ready for review.

for background, this problem is similar to Operations.isEmpty():

lucene/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java

Lines 821 to 858 in 5b125f3

    
  /** Returns true if the given automaton accepts no strings. */
  public static boolean isEmpty(Automaton a) {
    if (a.getNumStates() == 0) {
      // Common case: no states
      return true;
    }
    if (a.isAccept(0) == false && a.getNumTransitions(0) == 0) {
      // Common case: just one initial state
      return true;
    }
    if (a.isAccept(0) == true) {
      // Apparently common case: it accepts the damned empty string
      return false;
    }
    ArrayDeque<Integer> workList = new ArrayDeque<>();
    BitSet seen = new BitSet(a.getNumStates());
    workList.add(0);
    seen.set(0);
    Transition t = new Transition();
    while (workList.isEmpty() == false) {
      int state = workList.removeFirst();
      if (a.isAccept(state)) {
        return false;
      }
      int count = a.initTransition(state, t);
      for (int i = 0; i < count; i++) {
        a.getNextTransition(t);
        if (seen.get(t.dest) == false) {
          workList.add(t.dest);
          seen.set(t.dest);
        }
      }
    }
    return true;
  }

to keep it simple, for Operations.isTotal() here we call getLiveStates() up-front and then just work every state individually. alternatively we could avoid that, and "traverse" automaton to look more like Operations.isEmpty() logic. but I think at least as a start, using getLiveStates() is easier to reason about. Either way you'd need a BitSet.
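For comparison, the "traverse" alternative mentioned above could look roughly like the sketch below, which mirrors the worklist structure of isEmpty(). It is not the code merged in this PR and does not use Lucene's classes: the array representation (`accept`, and `transitions` as sorted `{min, max, dest}` triples per state) and the name `TraversalTotalSketch` are assumptions for illustration. It walks states reachable from state 0 rather than calling getLiveStates(); in a deterministic automaton, any reachable state that passes the accept check is live by definition.

```java
import java.util.ArrayDeque;
import java.util.BitSet;

final class TraversalTotalSketch {

  /** True if every state reachable from state 0 accepts and covers minLabel..maxLabel. */
  static boolean isTotal(boolean[] accept, int[][][] transitions, int minLabel, int maxLabel) {
    if (accept.length == 0) {
      return false; // no states at all: accepts nothing
    }
    ArrayDeque<Integer> workList = new ArrayDeque<>();
    BitSet seen = new BitSet(accept.length);
    workList.add(0);
    seen.set(0);
    while (!workList.isEmpty()) {
      int state = workList.removeFirst();
      if (!accept[state]) {
        return false; // some string leads here and is rejected
      }
      long previousMax = (long) minLabel - 1;
      for (int[] t : transitions[state]) {
        if (t[0] > previousMax + 1) {
          return false; // gap in the labels leaving this state
        }
        previousMax = Math.max(previousMax, t[1]);
        if (!seen.get(t[2])) {
          seen.set(t[2]);
          workList.add(t[2]);
        }
      }
      if (previousMax < maxLabel) {
        return false; // the tail of the alphabet is not covered
      }
    }
    return true; // at least state 0 was visited and nothing failed
  }
}
```

Each state and transition is handled once, and the only extra memory is the BitSet and the worklist, so the cost stays linear in the size of the automaton.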

dweiss

Nice.

ChrisHegarty

Awesome!! LGTM.

(I filed an issue < 24hrs ago, had a brief exchange of messages, then went to a concert followed by a late and too short sleep, to now wake up to the delightful code in this PR. That is why I love this project. ❤️)

mikemccand · 2024-09-04T12:29:45Z

> for background, this problem is similar to Operations.isEmpty():

I love the symmetry.

mikemccand

I love the simplicity of this algo, yay! I just got confused on whether we claim to support NFAs?

lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java

rmuir · 2024-09-04T16:10:08Z

I updated the javadocs: "The automaton must be deterministic, or this method may return false."
Previous implementation was: "The automaton must be minimal, or this method may return false."

Relax Operations.isTotal() to work with a deterministic automaton

3dabc3d

solve the practical issue without any quadratic algorithm

411bef7

rmuir marked this pull request as ready for review September 2, 2024 16:50

rmuir mentioned this pull request Sep 2, 2024

RegExp::toAutomaton no longer minimizes #13706

Closed

rmuir added 2 commits September 2, 2024 16:14

switch to dawid's linear time algorithm, add fun tests

ce4cce0

fix test to ensure we check tricky4 (doesnt cover whole range). found by ecj

5b125f3

fix test failure, we must have visited at least one state (all states could be unreachable)

2ce7904

rmuir requested review from dweiss and mikemccand September 2, 2024 21:00

dweiss approved these changes Sep 3, 2024

ChrisHegarty approved these changes Sep 3, 2024

mikemccand reviewed Sep 4, 2024

mikemccand reviewed Sep 4, 2024

javadocs

7771a85

rmuir added 2 commits September 5, 2024 08:28

Merge branch 'main' into operations_is_total

982f30f

CHANGES

8e83350

rmuir merged commit ea3a9b8 into apache:main Sep 5, 2024
3 checks passed
