Relax Operations.isTotal() to work with a deterministic automaton #13707

Merged
merged 8 commits into from
Sep 5, 2024

Conversation

rmuir
Member

@rmuir rmuir commented Sep 2, 2024

Operations.isTotal currently returns false unless the DFA is minimal.
This makes the method almost useless and we definitely don't want to encourage minimization just to make such a check.

Can we do a better job, e.g. return true for a non-minimal DFA?
There's an example test added that fails without the change to demonstrate:

// deterministic, but not minimal
assertTrue(Operations.isTotal(Operations.repeat(Automata.makeAnyChar())));

This is a draft PR because I still don't like that it uses subsetOf: the code literally builds a minimal "total DFA" and checks that the two automata recognize the same language. Because that DFA is total, we only need to call subsetOf once, but I still don't like how heavy it is. Can we do better?

See #13706 for more background

@rmuir
Member Author

rmuir commented Sep 2, 2024

I feel like with this code there are only two options:

  • keep isTotal O(1), to prevent traps and problems. It does document that the input must be minimized. Maybe it is enough to fix our regexp parser in another issue so that the typical use case of the method (optimizing) still gets optimized.
  • change isTotal to O(n^2) using the code here. Add warnings about this. Remove usage from CompiledAutomaton so at least we don't cause internal performance problems because of it.

a third option, an isTotal that runs faster than O(n^2), would be awesome :)

@rmuir
Member Author

rmuir commented Sep 2, 2024

OK I took a third option here in the latest commit.

Javadoc is unchanged, we just detect the kind of "total" automaton being created by RegExp, too, without involving any scary algorithms.

It just relaxes the check and detects a total automaton that looks like this (with an a-z alphabet for demonstration):

State 0: accept
  a-z -> State 1
State 1: accept
  a-z -> State 1

This is what happens with the RegExp parser today, or if you do Operations.repeat(Automata.makeAnyChar()). It is only slightly "non-minimal" and I think it doesn't make the code too ugly to handle it?
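As an illustration of that relaxed structural check, here is a minimal self-contained sketch. It is not the code from the commit and does not use Lucene's Automaton class: the TotalShapeSketch class, the array-based DFA layout (trans[s] holds {min, max, dest} ranges), and the method names are all assumptions made for demonstration.

```java
/** Toy DFA (hypothetical): trans[s] holds {min, max, dest} ranges; state 0 is initial. */
final class TotalShapeSketch {

  /** True if state s accepts and has exactly one transition covering [min, max] that leads to dest. */
  static boolean acceptsAllTo(boolean[] accept, int[][][] trans, int s, int min, int max, int dest) {
    return accept[s]
        && trans[s].length == 1
        && trans[s][0][0] == min
        && trans[s][0][1] == max
        && trans[s][0][2] == dest;
  }

  /** Recognizes only two shapes: the minimal one-state loop and the two-state variant shown above. */
  static boolean looksTotal(boolean[] accept, int[][][] trans, int min, int max) {
    if (accept.length == 1) {
      // Minimal total DFA: a single accepting state looping on the whole alphabet.
      return acceptsAllTo(accept, trans, 0, min, max, 0);
    }
    if (accept.length == 2) {
      // Slightly non-minimal: state 0 -> state 1 on everything, state 1 loops on everything.
      return acceptsAllTo(accept, trans, 0, min, max, 1)
          && acceptsAllTo(accept, trans, 1, min, max, 1);
    }
    return false; // Any other shape is reported as "not total", even if it actually is.
  }
}
```

A check like this stays O(1), at the cost of returning false for every other non-minimal total DFA.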

@rmuir rmuir marked this pull request as ready for review September 2, 2024 16:50
@rmuir
Member Author

rmuir commented Sep 2, 2024

iterating again, I changed the code here to @dweiss's solution and added tests based on his examples.

it runs in linear time and space (bitset), and should work generally for any DFA, so IMO it is safe.

@rmuir
Member Author

rmuir commented Sep 2, 2024

heh that empty case got found in a different way by the randomized check.

it isn't enough to do a.getNumStates() > 0, for the situation where all states are dead states :)

will fix.
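To illustrate why a state count alone is not enough, here is a minimal self-contained sketch. It does not use Lucene's Automaton class: the LiveStatesSketch class, its array-based representation (edges[s] lists the destinations of state s), and its method names are hypothetical. It computes live states, meaning states reachable from state 0 that can also reach an accept state, using two breadth-first searches.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy automaton (hypothetical): edges[s] lists the destination states of s; state 0 is initial. */
final class LiveStatesSketch {

  /** Breadth-first search over adjacency lists, starting from every state set in 'start'. */
  static BitSet search(List<List<Integer>> adj, BitSet start) {
    BitSet seen = (BitSet) start.clone();
    ArrayDeque<Integer> work = new ArrayDeque<>();
    for (int s = start.nextSetBit(0); s >= 0; s = start.nextSetBit(s + 1)) {
      work.add(s);
    }
    while (!work.isEmpty()) {
      int s = work.removeFirst();
      for (int d : adj.get(s)) {
        if (!seen.get(d)) {
          seen.set(d);
          work.add(d);
        }
      }
    }
    return seen;
  }

  /** Live = reachable from state 0 AND able to reach an accept state. */
  static BitSet liveStates(int[][] edges, boolean[] accept) {
    int n = accept.length;
    if (n == 0) {
      return new BitSet();
    }
    List<List<Integer>> forward = new ArrayList<>();
    List<List<Integer>> backward = new ArrayList<>();
    for (int s = 0; s < n; s++) {
      forward.add(new ArrayList<>());
      backward.add(new ArrayList<>());
    }
    BitSet accepting = new BitSet(n);
    for (int s = 0; s < n; s++) {
      if (accept[s]) {
        accepting.set(s);
      }
      for (int d : edges[s]) {
        forward.get(s).add(d);
        backward.get(d).add(s); // reversed edge, used for the backward search
      }
    }
    BitSet initial = new BitSet(n);
    initial.set(0);
    BitSet live = search(forward, initial); // reachable from the initial state
    live.and(search(backward, accepting)); // ... and able to reach an accept state
    return live;
  }
}
```

For a two-state automaton with no accept states, the state count is 2 but liveStates returns an empty set, so a check that only iterates over live states has to verify that it visited at least one.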

@rmuir rmuir requested review from dweiss and mikemccand September 2, 2024 21:00
@rmuir
Member Author

rmuir commented Sep 2, 2024

ok, the tests have found all the fun with empty cases and I think it is ready for review.

for background, this problem is similar to Operations.isEmpty():

/** Returns true if the given automaton accepts no strings. */
public static boolean isEmpty(Automaton a) {
  if (a.getNumStates() == 0) {
    // Common case: no states
    return true;
  }
  if (a.isAccept(0) == false && a.getNumTransitions(0) == 0) {
    // Common case: just one initial state
    return true;
  }
  if (a.isAccept(0) == true) {
    // Apparently common case: it accepts the damned empty string
    return false;
  }
  ArrayDeque<Integer> workList = new ArrayDeque<>();
  BitSet seen = new BitSet(a.getNumStates());
  workList.add(0);
  seen.set(0);
  Transition t = new Transition();
  while (workList.isEmpty() == false) {
    int state = workList.removeFirst();
    if (a.isAccept(state)) {
      return false;
    }
    int count = a.initTransition(state, t);
    for (int i = 0; i < count; i++) {
      a.getNextTransition(t);
      if (seen.get(t.dest) == false) {
        workList.add(t.dest);
        seen.set(t.dest);
      }
    }
  }
  return true;
}

to keep it simple, for Operations.isTotal() here we call getLiveStates() up-front and then just work every state individually. alternatively we could avoid that, and "traverse" automaton to look more like Operations.isEmpty() logic. but I think at least as a start, using getLiveStates() is easier to reason about. Either way you'd need a BitSet.
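The "traverse" alternative mentioned above could look roughly like the following self-contained sketch. It is not the code in this PR and does not use Lucene's Automaton class: the IsTotalSketch class and its array-based DFA layout (trans[s] holds {min, max, dest} ranges sorted by min) are assumptions for illustration. It mirrors the isEmpty() traversal, but requires every reachable state to be accepting and to cover the whole alphabet without gaps.

```java
import java.util.ArrayDeque;
import java.util.BitSet;

/** Toy DFA (hypothetical): trans[s] holds {min, max, dest} ranges sorted by min; state 0 is initial. */
final class IsTotalSketch {

  /** Returns true if the DFA accepts every string over [minAlphabet, maxAlphabet]. */
  static boolean isTotal(boolean[] accept, int[][][] trans, int minAlphabet, int maxAlphabet) {
    if (accept.length == 0) {
      return false; // no states at all: accepts nothing
    }
    BitSet seen = new BitSet(accept.length);
    ArrayDeque<Integer> workList = new ArrayDeque<>();
    workList.add(0);
    seen.set(0);
    while (!workList.isEmpty()) {
      int state = workList.removeFirst();
      if (!accept[state]) {
        return false; // some string ends in a non-accepting state
      }
      long previousMax = (long) minAlphabet - 1; // long avoids overflow at the range ends
      for (int[] t : trans[state]) {
        if (t[0] > previousMax + 1) {
          return false; // gap in the labels: some character has no transition
        }
        previousMax = Math.max(previousMax, t[1]);
        if (!seen.get(t[2])) {
          seen.set(t[2]);
          workList.add(t[2]);
        }
      }
      if (previousMax < maxAlphabet) {
        return false; // the top of the alphabet is not covered
      }
    }
    return true; // every reachable state accepts and covers the full alphabet
  }
}
```

Each state is visited at most once and each transition is examined once, so the sketch runs in time linear in the size of the automaton, with one BitSet of extra space. It assumes a deterministic input with sorted, non-overlapping ranges per state.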

Contributor

@dweiss dweiss left a comment

Nice.

Contributor

@ChrisHegarty ChrisHegarty left a comment

Awesome!! LGTM.

(I filed an issue < 24hrs ago, had a brief exchange of messages, then went to a concert followed by a late and too short sleep, to now wake up to the delightful code in this PR. That is why I love this project. ❤️)

@mikemccand
Member

for background, this problem is similar to Operations.isEmpty():

I love the symmetry.

Member

@mikemccand mikemccand left a comment

I love the simplicity of this algo, yay! I just got confused on whether we claim to support NFAs?

@rmuir
Member Author

rmuir commented Sep 4, 2024

I updated the javadocs: "The automaton must be deterministic, or this method may return false."
The previous wording was: "The automaton must be minimal, or this method may return false."

@rmuir rmuir merged commit ea3a9b8 into apache:main Sep 5, 2024
3 checks passed