Earley: share nodes created by the scanner with the completer #1451

chanicpanic · 2024-08-17T05:48:52Z

There should only be one distinct SPPF start node now.

When a start rule alternative ends in a terminal, the scanner creates the start SymbolNode, and when a start rule alternative ends in a nonterminal, the completer creates the start SymbolNode. The issue was that if both cases occurred, the scanner would create a start SymbolNode, and then the completer would also create one instead of reusing the existing node because the completer did not know the node existed. Hence, we ended up with two different start SymbolNodes. The issue is resolved by sharing the node_cache from scan with predict_and_complete.

test_multiple_start_solutions and test_consistent_derivation_order1 were adjusted because the change affected the order in which derivations were produced for some grammars. The order is still consistent across executions though.

test_cycles2 is back to having only one derivation which I believe is the correct behavior after the change.

Before the change, the SPPF was:

Notice that the traversal that produces the "triple v" derivation contains two symbol nodes labeled (v, 1, 2, -inf). The existence of two symbol nodes with the same label violates the "Shared" property of SPPFs.

After the change, the SPPF is:

Now, there is only one symbol node labeled (v, 1, 2, -inf). A traversal that produces the 'triple v" derivation still exists, but requires traversing the cycle through (v, 1, 2, -inf) which we don't do.

erezsh · 2024-08-18T14:48:28Z

Thanks for looking into it, @chanicpanic !

In terms of API, it definitely looks like the right approach.

In terms of implementation, it looks like something is a bit off.

In test_consistent_derivation_order1, there should be 4 derivation.

When I run it with explicit ambiguity on this PR, I get this result, which is incorrect and also repetitive:

start
├── _ambig
│   ├── a
│   └── a
│       └── b
└── _ambig
    ├── a
    └── a
        └── b

The master branch returns the actual 4 derivations correctly.

chanicpanic · 2024-08-20T01:43:06Z

@erezsh I believe that result is correct, although I can certainly understand why it may not appear so.

The first _ambig node means that the first child of start may be a or a -> b. Likewise, the second _ambig node means that the second child of start may be a or a -> b. With two options for the first child and two options for the second, we have a total of four derivations.

Using CollapseAmbiguities, we can see that we do in fact end up with four unique derivations:

parser = Lark('''
    start: a a
    a: "." | b
    b: "."
''', ambiguity='explicit')

tree = parser.parse('..')

for t in CollapseAmbiguities().transform(tree):
    print(t.pretty())

Output:

start
  a
  a

start
  a
  a
    b

start
  a
    b
  a

start
  a
    b
  a
    b

erezsh · 2024-08-20T14:29:54Z

@chanicpanic Yes, you're right. I apologize for my confusion.

There is just one more thing I'm unsure about. It feels like the ordering is a bit arbitrary. In test_consistent_derivation_order1 the order changed towards the shorter derivation, but in test_multiple_start_solutions it changed towards the longer derivation.

I'm worried about it because changing the default order might cause errors for users who have accidental ambiguities in their grammar. So I want to make sure we're now changing to the "right order", so we won't have to break it again in the future.

chanicpanic · 2024-08-21T06:04:42Z

I'm worried about it because changing the default order might cause errors for users who have accidental ambiguities in their grammar. So I want to make sure we're now changing to the "right order", so we won't have to break it again in the future.

This is a very valid concern.

I believe that the new order is more consistent with the way we typically resolve ambiguities. That is, in lieu of priorities, if there is a rule with multiple possible derivations, we choose the derivation based its alternative's rule order. Concretely, if we have rule: a | b and could resolve it with a or b, we choose a because it is defined first.

In the case of test_multiple_start_solutions, the current order gives the derivation for the terminal A first. However, the grammar defines start: a | A, so we would expect the derivation for the rule a first like the new order provides.

For test_consistent_derivation_order1, the old behavior resulted in duplicate symbol nodes for the a rule. The lack of shared symbol nodes for a meant that the ambiguity was pushed up the SPPF. As a consequence, rule order could not be used to choose a derivation, and the selected derivation was nondeterministic when using non-ordered sets. With the new behavior, symbol nodes for a are properly reused, and the ambiguity is represented by the children of those nodes. Hence, given a: "." | b, the derivation using "." is preferred by rule order. As a matter of fact, with rule order in play, the order of derivations produced by the new changes is consistent even without ordered sets (I just realized that means this test no longer tests what it was intended to).

Overall, I think the new behavior is more correct (no duplicate symbol nodes in the SPPF), aligns with rule order expectations even when an alternative ends in a terminal (as in test_multiple_start_solutions), and produces a more predictable order (as in test_consistent_derivation_order1).

For reference, grammars affected by the change contain a rule:

that has an alternative that ends in a nonterminal
and an alternative that ends in a terminal
both of which may produce identical strings

May be related: - lark-parser/lark#1443 - lark-parser/lark#1451 - lark-parser/lark#1456

Tiny refactor for PR #1451

erezsh · 2024-08-30T08:34:09Z

Okay, I'm sufficiently convinced.

Thank you!

Earley: share nodes created by the scanner with the completer

95e9700

erezsh mentioned this pull request Aug 21, 2024

Resolution order changed #1456

Open

mihailefter added a commit to mutalyzer/hgvs-parser that referenced this pull request Aug 27, 2024

Pin lark version due to extra ambiguities in 1.2.1 and 1.2.2

f615827

May be related: - lark-parser/lark#1443 - lark-parser/lark#1451 - lark-parser/lark#1456

erezsh added a commit that referenced this pull request Aug 30, 2024

Tiny refactor for PR #1451

6b52f72

erezsh added a commit that referenced this pull request Aug 30, 2024

Merge pull request #1459 from lark-parser/chanicpanic-share-nodes

906de31

Tiny refactor for PR #1451

erezsh merged commit 95e9700 into lark-parser:master Aug 30, 2024
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Earley: share nodes created by the scanner with the completer #1451

Earley: share nodes created by the scanner with the completer #1451

chanicpanic commented Aug 17, 2024

erezsh commented Aug 18, 2024

chanicpanic commented Aug 20, 2024

erezsh commented Aug 20, 2024 •

edited

Loading

chanicpanic commented Aug 21, 2024

erezsh commented Aug 30, 2024

Earley: share nodes created by the scanner with the completer #1451

Earley: share nodes created by the scanner with the completer #1451

Conversation

chanicpanic commented Aug 17, 2024

erezsh commented Aug 18, 2024

chanicpanic commented Aug 20, 2024

erezsh commented Aug 20, 2024 • edited Loading

chanicpanic commented Aug 21, 2024

erezsh commented Aug 30, 2024

erezsh commented Aug 20, 2024 •

edited

Loading