Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Earley: share nodes created by the scanner with the completer #1451

Merged
merged 1 commit into from
Aug 30, 2024

Conversation

chanicpanic
Copy link
Contributor

There should only be one distinct SPPF start node now.

When a start rule alternative ends in a terminal, the scanner creates the start SymbolNode, and when a start rule alternative ends in a nonterminal, the completer creates the start SymbolNode. The issue was that if both cases occurred, the scanner would create a start SymbolNode, and then the completer would also create one instead of reusing the existing node because the completer did not know the node existed. Hence, we ended up with two different start SymbolNodes. The issue is resolved by sharing the node_cache from scan with predict_and_complete.

test_multiple_start_solutions and test_consistent_derivation_order1 were adjusted because the change affected the order in which derivations were produced for some grammars. The order is still consistent across executions though.

test_cycles2 is back to having only one derivation which I believe is the correct behavior after the change.

Before the change, the SPPF was:

sppfbefore

Notice that the traversal that produces the "triple v" derivation contains two symbol nodes labeled (v, 1, 2, -inf). The existence of two symbol nodes with the same label violates the "Shared" property of SPPFs.

After the change, the SPPF is:

sppfafter

Now, there is only one symbol node labeled (v, 1, 2, -inf). A traversal that produces the 'triple v" derivation still exists, but requires traversing the cycle through (v, 1, 2, -inf) which we don't do.

@erezsh
Copy link
Member

erezsh commented Aug 18, 2024

Thanks for looking into it, @chanicpanic !

In terms of API, it definitely looks like the right approach.

In terms of implementation, it looks like something is a bit off.

In test_consistent_derivation_order1, there should be 4 derivation.

When I run it with explicit ambiguity on this PR, I get this result, which is incorrect and also repetitive:

start
├── _ambig
│   ├── a
│   └── a
│       └── b
└── _ambig
    ├── a
    └── a
        └── b

The master branch returns the actual 4 derivations correctly.

@chanicpanic
Copy link
Contributor Author

@erezsh I believe that result is correct, although I can certainly understand why it may not appear so.

The first _ambig node means that the first child of start may be a or a -> b. Likewise, the second _ambig node means that the second child of start may be a or a -> b. With two options for the first child and two options for the second, we have a total of four derivations.

Using CollapseAmbiguities, we can see that we do in fact end up with four unique derivations:

parser = Lark('''
    start: a a
    a: "." | b
    b: "."
''', ambiguity='explicit')

tree = parser.parse('..')

for t in CollapseAmbiguities().transform(tree):
    print(t.pretty())

Output:

start
  a
  a

start
  a
  a
    b

start
  a
    b
  a

start
  a
    b
  a
    b

@erezsh
Copy link
Member

erezsh commented Aug 20, 2024

@chanicpanic Yes, you're right. I apologize for my confusion.

There is just one more thing I'm unsure about. It feels like the ordering is a bit arbitrary. In test_consistent_derivation_order1 the order changed towards the shorter derivation, but in test_multiple_start_solutions it changed towards the longer derivation.

I'm worried about it because changing the default order might cause errors for users who have accidental ambiguities in their grammar. So I want to make sure we're now changing to the "right order", so we won't have to break it again in the future.

@chanicpanic
Copy link
Contributor Author

I'm worried about it because changing the default order might cause errors for users who have accidental ambiguities in their grammar. So I want to make sure we're now changing to the "right order", so we won't have to break it again in the future.

This is a very valid concern.

I believe that the new order is more consistent with the way we typically resolve ambiguities. That is, in lieu of priorities, if there is a rule with multiple possible derivations, we choose the derivation based its alternative's rule order. Concretely, if we have rule: a | b and could resolve it with a or b, we choose a because it is defined first.

In the case of test_multiple_start_solutions, the current order gives the derivation for the terminal A first. However, the grammar defines start: a | A, so we would expect the derivation for the rule a first like the new order provides.

For test_consistent_derivation_order1, the old behavior resulted in duplicate symbol nodes for the a rule. The lack of shared symbol nodes for a meant that the ambiguity was pushed up the SPPF. As a consequence, rule order could not be used to choose a derivation, and the selected derivation was nondeterministic when using non-ordered sets. With the new behavior, symbol nodes for a are properly reused, and the ambiguity is represented by the children of those nodes. Hence, given a: "." | b, the derivation using "." is preferred by rule order. As a matter of fact, with rule order in play, the order of derivations produced by the new changes is consistent even without ordered sets (I just realized that means this test no longer tests what it was intended to).

Overall, I think the new behavior is more correct (no duplicate symbol nodes in the SPPF), aligns with rule order expectations even when an alternative ends in a terminal (as in test_multiple_start_solutions), and produces a more predictable order (as in test_consistent_derivation_order1).

For reference, grammars affected by the change contain a rule:

  • that has an alternative that ends in a nonterminal
  • and an alternative that ends in a terminal
  • both of which may produce identical strings

mihailefter added a commit to mutalyzer/hgvs-parser that referenced this pull request Aug 27, 2024
erezsh added a commit that referenced this pull request Aug 30, 2024
erezsh added a commit that referenced this pull request Aug 30, 2024
@erezsh erezsh merged commit 95e9700 into lark-parser:master Aug 30, 2024
9 checks passed
@erezsh
Copy link
Member

erezsh commented Aug 30, 2024

Okay, I'm sufficiently convinced.

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants