-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Nondeterministic Parsing #2241
Comments
After some thought, I've determined that an additional optimization is needed to avoid poor performance in pathological cases. Motivating example(
a = (
b = (
c = (
0
) => null
) => null
) => null
) => null As established above, the JavaScript syntax should guess that an open paren denotes a parenthesized expression. Then, if it turns out to actually be an arguments list, it would backtrack. In this case, the parser would backtrack at each level, but at each level it would redo all of the work it had backtracked. This would lead to exponential runtime in a naive implementation. While such patterns would never occur in real JavaScript, an unscrupulous person could deliberately create a pathological code sample. SolutionThe solution, fortunately, is not very complicated. We can cache the result of each "decision" for the duration of a single parse. Whenever we The correctness of this shortcut is due to the stack-based nature of the parser. If the text does not change (e.g. within a single parse), then a given context pushed at a given index will always succeed or always fail; contexts below it on the stack, including nondeterministic contexts, cannot affect this result. We do not need to record successes (the typical case). The set of failure records could be stored in any manner allowing fast lookup. Various straightforward implementations would provide effective constant-time access. In addition to handling pathological parsing, this approach should slightly improve parsing performance in typical cases. Each alternative need only be tried once per parse, limiting the total amount of backtracking that can occur. |
N.B.: Atom's new tree-sitter system does support nondeterministic context-free languages, which allows it to handle JavaScript arrow functions and similar constructs. If Sublime were to implement this proposal, then it would again be strictly more powerful than Atom's highlighter. |
How is this issue coming along? Has anyone started progress on this? Any pointers on where one could start implementing this? |
Unfortunately, this would have to be implemented in core by the Sublime HQ devs. |
Why was this issue closed? I see @FichteFoll added it to the milestones, was it implemented? |
@borela Kind of done in ST 4. If you are interested, you can find the download link in the official Discord chat room. You will need a ST 3 license to try it at this moment. |
This idea has been discussed elsewhere, both here and on Discord. I'm writing it up more formally here.
Background
Sublime's parser is effectively a deterministic pushdown automaton. This makes it capable of parsing any deterministic context-free language. Many common programming languages are deterministic context-free, but not all. Some languages like YAML and Python are context-sensitive. Parsing those languages is outside the scope of this proposal. But some, like JavaScript, are context-free but not deterministic context-free. Sublime cannot correctly parse these languages.
Motivating Example
For a concrete example, consider the following two JavaScript snippets:
The first statement contains a parenthesized expression that in turn contains an assignment expression. By the scope naming guidelines,
foo
should be scopedvariable.other
. The second statement contains an arrow function whose parameterfoo
has a default value of100
, andfoo
should be scopedvariable.parameter.function
. This distinction is highly visible in most color schemes.In very simple cases like this, we can write syntax rules that can tell the examples apart. But such rules must fail in more complex cases for two reasons. The first and best-known is that Sublime only parses one line at a time. The second and more important is that in the example,
100
could be replaced by an arbitrarily complicated expression, which cannot generally be matched by a regular expression anyway.Here is a more complex example:
Code like this is very idiomatic. In particular, it is a common way of defining a React component. I have code like this all over the place. Unfortunately, Sublime will parse the arguments list as a parenthesized expression and the function parameters will not be highlighted correctly. (The JavaScript syntax could instead guess that it was an arrow function arguments list, but this would be wrong even more often.) There is simply no way for a deterministic parser to handle this common example.
Workarounds
Currently, we work around the problem with a regexp lookahead. This fails when there are newlines between the opening paren and the eventual
=>
. It also fails when there are complex expressions within an arrow function argument list. These workarounds fail conspicuously in many common cases. Unfortunately, there is no way to substantially improve them. When in doubt, we parse an ambiguous construct as a parenthesized expression, which does fail gracefully.Implementing multiline regexp matching would improve the situation. Lookaheads could identify more common patterns. The heuristics involved would increase in complexity as well as effectiveness. Perfect parsing would still be beyond reach with a nondeterministic context-free parser.
Implementation With Oniguruma Regexps
The astute reader may observe that old-style Oniguruma regexps actually have more power than the Sublime parser itself. In theory, such a regexp could "look past" a complex expression in a way that Sublime's parser cannot. Such a regexp would be extremely complicated. A first attempt at an Oniguruma regexp that recognizes a JavaScript expression is as follows:
This is certainly a formidable regexp. Anyone extending the syntax (e.g. with YAML Macros) would likely have to modify this arcane expression. It would use the old, slower regexp engine, and its heavy reliance on recursive subexpression calls raises performance concerns. Parenthesized expressions are very common in JavaScript, so this expression would be run many times in a typical parse.
While this expression should handle valid JavaScript, it is unlikely that it would handle invalid JavaScript in exactly the same way as the syntax itself, and if it doesn't, then the user could see badly mangled highlighting while they type. Finally, this approach requires unlimited multiline lookahead; without that, it is a non-starter. For all of these reasons, I don't believe that this is a good or adequate solution to the problem.
Proposed Solution
To parse nondeterministic context-free grammars, I propose an extension to
sublime-syntax
to allow it to function as a nondeterministic pushdown automaton. This is a fully general solution.Example
Before delving into the technical details of the proposed solution, this is what it might look like when used to solve the motivating example above:
Explanation
I presume that the parser currently maintains a stack of contexts, and that it can look up
meta
information related to each context. In order to allow nondeterministic parsing, it will need to maintain a stack of (context, index) pairs, also keeping track of the character index when the context was pushed. The overhead for this should hopefully be negligible.This proposal adds two context-level meta rules,
meta_backtracking_key
andmeta_backtracking_else
. These rules would be mutuall dependent; using only one wouldn't make sense. It also adds afail
key that would be used inmatch
rules. Thisfail
key would be exclusive withpush
,pop
,set
, andembed
.When a rule with a
fail
key is matched, the parser will:meta_backtracking_key
equals the value of thefail
key.*push
the value of the popped context'smeta_backtracking_else
.* If no such exists, shout at at the syntax author; this is an avoidable error. It may be possible to detect this statically.
Discussion
This proposal should allow Sublime to parse any context-free grammar, a rare distinction. The benefits to JavaScript in particular would be significant.
Performance
When no
fail
rule is encountered, the performance impact of this proposal should be negligible. The only additional overhead is a single index per context on the stack. As a result, there should be no performance impact for existing syntax definitions.Under this proposal, the parser only does extra work when it encounters a
fail
rule, signifying a wrong guess. In this case, the penalty is proportional to the amount of backtracking that the parser must do. This, in turn, depends on the nature of the language and the syntax author's use of the feature. (In this respect, this proposal resembles a backtracking regexp engine.)For example, in JavaScript, there would be one "miss" per arrow function.** Each arrow function argument list would have to be re-parsed. Typically, these lists are short and simple, although one could construct pathological cases that would require re-parsing large chunks of code. The worst-case runtime is exponential (like backtracking regexps).
** In practice, many of these misses could be avoided via lookaheads that "short-circuit" the cases that can already be handled today. It's an open question whether such lookaheads or this implementation would be faster in such simple cases.
There are some more sophisticated solutions offering guaranteed-polynomial-time parsing, such as Earley's algorithm or GLL. If better worse-case performance were desired, it's likely that Sublime's implementation of this proposal could be extended to incorporate more sophisticated techniques. For instance, pathological reparsing could be avoided using a sort of branch prediction cache.
In a typical syntax, in a typical file, the performance should be perfectly adequate -- no worse than occasionally using an Oniguruma regexp.
Complexity
One of the main benefits of this proposal is that it does not interfere with any existing syntaxes. Syntax authors would only need (or want) to use it when no other solution exists.
When it is needed, I believe that this proposal is the simplest thing that could possibly work. I've played around with other implementations, and this was the simplest to implement and use.
Other Implications
Any parsing extension that works across multiple lines runs into a known issue: the current line limitation exists to enable an optimization. That optimization, in its most general form, is incompatible with correct parsing of JavaScript and other languages. However, we can preserve that optimization to the greatest possible degree.
For each line L, we would keep track of the earliest prior line P whose parse is affected by the contents of L (one number per line in the view). When no nondeterministic contexts are used, each line would refer to itself. However, when a context with a
meta_backtracking_name
is popped from the stack on line L, we would mark L as affecting the line P that pushed that context (which we're already keeping track of on the context stack) We would mark L thus regardless of whether the context was popped normally or as part of afail
action. Then, when doing a partial reparse that included line L, we would see that line P would also need to be reparsed, so we'd expand the reparsing region to include it. (We wouldn't have to check these newly included lines. This is an important advantage over multiline regexp matching, where this expansion could snowball.)By keeping track of dependencies like this, it should always be possible to re-use partial results from prior parses without missing changes.
Related issues
This proposal would probably solve the following issues:
pop
multiple levels orset
keepingmeta
scope. #1495 - Pop multiple contexts.In addition, in JavaScript it could also be used to reliably:
import.meta
and similar constructs with unexpected newlines.These problems could be solved with multiline regexps; however, they could be solved just as easily with this proposal, with no need to implement both features. I expect that the same is true of features in many languages that would otherwise require multiline regexps.
The text was updated successfully, but these errors were encountered: