Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parameterized contexts #2912

Open
Thom1729 opened this issue Aug 2, 2019 · 4 comments
Open

Parameterized contexts #2912

Thom1729 opened this issue Aug 2, 2019 · 4 comments

Comments

@Thom1729
Copy link

Thom1729 commented Aug 2, 2019

Background

Sublime's parser is effectively a deterministic pushdown automaton. This is sufficient to parse most languages to reasonable fidelity, but there are some languages with features that require more power. Issue #2241 discusses a proposal that would handle nondeterministic languages. This proposal addresses a different issue: context sensitivity.

Many popular coding languages have some features that are “context sensitive”. In lieu of a technical explanation, I provide the following examples of context-sensitive language features:

  • Indentation level. (Python, YAML)
  • Heredocs. (Ruby, Shell Script, PHP, Perl)
  • Matching variable-width delimiters. (Lua block comments, Markdown fenced code blocks)
  • Tag matching. (HTML, XML, JSX)

In some cases, Sublime can handle these constructs using an obscure feature involving backreferences (inherited from classic TextMate grammars), hereafter referred to as the “backreference trick”. That feature is currently essential for handling heredocs, and with some difficulty it can be used (or abused) to handle some indentation. However, it is also unintuitive, limited in scope and power, and not generally well understood by users or developers.

This proposal defines a new sublime-syntax feature that is more powerful, handles more varieties of syntax, and is easier to use. From an implementation and performance standpoint, it should not be far removed from the existing behavior.

Proposed Solution

Motivating example

The following syntax definition matches a toy XML-like syntax, handling improper close tags. When a close tag is encountered:

  • If the close tag matches the currently open tag, then highlight it and pop.
  • Otherwise, if close tag matches another open tag, then leave it and pop.
  • Otherwise, highlight it as unmatched and do not pop.
%YAML 1.2
---
name: Toy XML Example
scope: source.xml
variables:
  tag_name: \w+

contexts:
  main:
    - match: (<)({{tag_name}})(>)
      captures:
        1: punctuation.definition.tag.begin.xml
        2: entity.name.tag.xml
        3: punctuation.definition.tag.end.xml
      push: tag-body
      with_parameters:
        element_name: '\2'
        element_stack: '{{$stack}}|{{$element_name}}'

  element-body:
    - meta_scope: meta.element.xml
    - include: main

    # Close this tag
    - match: (</)({{$element_name}})(>)
      captures:
        1: punctuation.definition.tag.begin.xml
        2: entity.name.tag.xml
        3: punctuation.definition.tag.end.xml
      pop: true

    # Close a tag further up the tree
    - match: (?=</(?:{{$element_stack}})>)
      pop: true

    # Unmatched close tag
    - match: (</)({{tag_name}})(>)
      captures:
        1: punctuation.definition.tag.begin.xml
        2: entity.name.tag.xml invalid.illegal.unmatched.xml
        3: punctuation.definition.tag.end.xml

For comparison, the core XML syntax cannot keep track of the stack of elements at all. It would be possible to use the backreference trick to determine whether a close tag is the correct tag, but not to determine whether it matches another tag on the stack.

Explanation

This proposal has three components: the notion of “context parameters”, new syntax for using these parameters, and new syntax for setting parameter values.

Context parameters

Ordinarily, a context consists of a list of fixed rules. These rules behave identically no matter how the context was pushed onto the stack. (with_prototype creates new contexts with different behavior, which is why recursion is risky.) Under this proposal, a context may have one or more “parameters” that affect its behavior.

This proposal adds parameters to Sublime's parser state. A parameter state consists of a map of parameter names to string values. The parser maintains a stack of parameter states. The top parameter state on that stack is the current parameter state. When a rule pushes a context onto the stack, it may push a new parameter state. When that context is popped from the context stack, the associated parameter state is popped from the parameter stack.

If some parameter name is not assigned a value, then its value is the empty string.

Implementation notes

Each syntax definition can contain only a finite number of parameter names. These names can be interned and replaced by indices; then, a parameter state consists of a small array of pointers to parameter values. When a new parameter state is pushed, it would probably be fastest to copy the entire parameter state to optimize lookups. Alternatively, lookups could walk the stack downward until they find an assignment. Unassigned parameter names in the array should hold null pointers or another special value.

Using parameters

Currently, match rules may contain “variables” surrounded by double braces. These “variables” do not actually vary; they are substituted in at compile time. This proposal extends that syntax to allow a rule to refer to the current parameter state. When a context is pushed onto the stack, the engine will search its match expressions for the pattern {{$foo}} and replace that pattern with the value of the parameter named foo.

Substitution should also be performed in escape rules.

Implementation notes

Of course, most of the work will be done ahead of time. At compile time, Sublime can tell which contexts use parameters, just as it can tell which contexts use the backreference trick. Only contexts that use parameters need be compiled dynamically.

The result of this dynamic compilation should be cached to improve performance in the common case where a context is pushed many times with the same parameters. For instance, in a typical Python file, there need only be one compiled context per indentation level. As a further optimization, Sublime could pre-compile each parameterized context with all empty parameters, as it is likely to come up in many situations.

Setting parameter values

Any push, set, or embed rule may contain a new key, with_parameters. The value of with_parameters is a mapping from parameter names to parameter value expressions. These expressions are treated like match rules — the parser will substitute variables and parameters — and, in addition, numbered backreferences will be replaced with the value of the associated capture matched by the rule.

When such a rule is evaluated, a new parameter state will be pushed containing the values specified in with_parameters and inheriting any other values from the current parameter state. The new parameter state will apply as long as the first (bottommost) context pushed by the rule is on the stack. When that context is popped, then the parameter stack will pop and the previous parameter state will be restored.

Implementation notes

The parameter stack need only be pushed when a rule explicitly pushes a new parameter state, so the overhead should be negligible. At most, each context stack record would contain an extra flag indicating whether it has a new parameter state. When popping a context, the engine would check that flag and, if true, pop the parameter stack.

Discussion

This proposal would expand Sublime's parsing capabilities significantly, allowing it to parse many context-sensitive languages. Implementation complexity and performance should be quite reasonable.

Implementation

Sublime already implements the backreference trick, and in order for that to work, many of the pieces described here must already be in place:

  • Dynamically rewriting match rules and recompiling the context.
  • Storing regexp captures for future processing.
  • Maintaining the existing optimizations such as line-based parsing with dynamic contexts.

The new parts would be:

  • Parsing the with_parameters rules and interning parameter names.
  • Keeping a stack of parameter states. (Probably similar to how escape is implemented.)
  • Performing substitutions on parameter values and pushing the new parameter state.

Unlike #2241, which requires rewinding the token stream, there shouldn't be any substantial changes to the underlying assumptions that the parser makes. Implementation should be a purely additive process. (If the backreference trick weren't already implemented, then this would be a much bigger change.)

Performance

Performance when using parameters should be similar to the performance of the backreference trick — that is, slower than using statically compiled expressions, but reasonable when used in moderation. As with the backreference trick, parameters should only be used when they are necessary to perform a highlighting function; there is no faster alternative to compare them to.

In cases where a parameterized context would be hit very frequently, the parameterized portion could be split off into a separate smaller context. For example, in the XML example above, the element-body context could use a fixed lookahead to detect a close tag and set a minimal context containing the parameterized rules. A similar approach could be used in indentation-sensitive languages so that the expensive indentation-sensitive expression is only run once per line.

In some cases, the performance trade-off might not be worth it for the core syntax, but would be for third-party syntaxes. For instance, despite the use of an XML-like example above, the core XML syntax probably doesn't need that level of sophistication, but a syntax for a specific XML-based configuration format likely would. (I wrote this proposal while trying to create a syntax definition for AWS CloudFormation config files, which are YAML-based and hence extremely indentation-sensitive.)

Applications

Potential applications include:

  • Replacing the obscure backreference trick in the core syntaxes with something easier to work with. (We should be able to use this to provide better behavior for Ruby-style heredocs, which are otherwise hopeless.)
  • An HTML syntax that exactly follows the official parsing algorithm, so that Sublime's highlighter will interpret the file identically to a browser. (This may even let us correctly solve the issues with HTML comments in script tags.)
  • Better error handling (and highlighting while typing) for JSX and other languages with embedded tag-like syntax. (Currently, a single missing close tag will break highlighting for the rest of the file.)
  • Improved meta scopes for indentation-sensitive languages allowing for simple scope-based intelligence tools.
  • Better highlighting for YAML-based formats, including Sublime syntax definitions. (The current definition is brittle and hard-codes indentation preferences.)
  • My pet idea: a tool that takes in a JSON schema (or similar) and outputs a custom syntax highlighter for your favorite configuration files. This would allow even a novice syntax developer to create a complete, accurate definition for a complex JSON, YAML, or XML-based config format in a matter of minutes.

Other thoughts

  • The dollar sign sigil in parameter substitutions was chosen because (as far as I know) it's not valid for variables. That syntax can easily be changed as necessary.
  • In many cases, it would be useful to escape a parameter value as part of substitution. This could be done with a minor variation in syntax. For example, if the parameter tag_name has a value of bobby_tables\, then the match expression (<{{$!tag_name}}>) could evaluate to (<bobby_tables\\>).
  • More ambitiously, a user could define a regexp replacement in a parameter substition, e.g. {{indent/[ ?:]/ /}} to replace indicator characters in a YAML indent with spaces. This isn't an essential feature, but it might be reason to restrict the syntax of parameter names to leave room for future expansion.

That's the proposal. Questions/comments/complaints/requests for clarification?

@CreepySkeleton
Copy link

Yes, please.

@AmjadHD
Copy link

AmjadHD commented Dec 17, 2020

Can you describe how would a syntax definition handling nested indented blocks look like with your idea ?

@Thom1729
Copy link
Author

Example, which (obviously) I haven't tested and could contain errors:

contexts:
  main:
    - include: block

  statement:
    - match: \s+
      scope: invalid.illegal.excess-indentation
      pop: true
    - match: '(if)(:)$'
      captures:
        1: keyword.control
        2: punctuation.section.block.begin
      set: expect-block
    - match: (?=\S)
      set: expression-statement

  block:
    - meta_scope: meta.block
    - match: '^{{$indent}}'
      push: statement
    - include: else-pop # dedent

  expect-block:
    - match: '^{{$indent}}\s+'
      set:
        - block
        - statement
      with_parameters:
        indent: '\0'
    - match: '' # empty block (e.g. while typing)
      pop: true

  expression-statement:
    - match: $
      pop: true
    - include: expression

  expression:
    - match: \d+
      scope: constant.numeric
    - match: '[+-*/=]'
      scope: keyword.operator
    - match: \(
      scope: punctuation.section.group.begin
      push:
        - match: \)
          scope: punctuation.section.group.end
          pop: true
        - include: expression

This should handle a Python-inspired toy example:

2 + 2

if:
  2 + 2
  2 + (
2)
  2 + 2

@AmjadHD
Copy link

AmjadHD commented Apr 21, 2021

This looks promising, especially when later on, sbnf implements this. I am curious what the devs think ?
@wbond any comments on implementing this ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants