Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Group insides #1679

Closed
wants to merge 10 commits into from
Closed

Conversation

RunDevelopment
Copy link
Member

@RunDevelopment RunDevelopment commented Dec 26, 2018

This PR adds support for highlighting using capturing groups.

Motivation

On some occasions, inside is used to match exactly one substring of the parent pattern. This means that this substring has to be matched by the parent pattern and the inside pattern creating redundancy in the patterns (the inside pattern is just a part of the parent pattern) and taking longer than it has to (additional regex matching).

Example 1:

pattern: /foo bar/,
inside: {
    'foo': /^foo/,
    'bar': /bar$/
}

Solution

By utilizing capturing groups in the parent pattern, it is now possible to highlight the captured substring without the additional overhead of another regex operation.
The following is equal to example 1.

Example 2:

pattern: /(foo) (bar)/,
groups: {
    $1: 'foo',
    $2: 'bar'
}

For more examples see prism-javastacktrace.js.

Implementation, Syntax & usage

I choose the $n syntax because it is intuitive (replace uses the same syntax) and easy to read (in comparison to e.g. an array).
Also, it can be easily extended to support named groups.

The value of each $n (n > 0) is a Prism language definition (like inside) which will be matched against the string captured by the n-th group (match[n]).

($n: 'token-name' is just syntactic sugar for $n: { 'token-name': /[\s\S]+/ })

The resulting token streams of the capturing groups and the remaining substrings between groups will be concatenated to create a new token stream. The concatenated token stream will be the content of the new token.

Example 2 will be joined like so:

const content = [ ...tokenize('foo', groups.$1), ' ', ...tokenize('bar', groups.$2) ];

(Adjacent strings in the token stream will be concatenated later on.)

The grammar of inside (if present) will then be to tokenize all strings in content (non-recursively) replacing the strings with the items of resulting token streams.
This also means that greedy patterns in inside cannot change the tokens matched by groups. This is to avoid the hassle of how to deal with greedy re-matching.

Some details

Because it is not possible to get the offsets of capturing groups in the match array, I had to improvise and the result is getOffsets.
Assuming that there are no nested groups, we can use the order of the groups to repeatedly find the next offset using match[0].indexOf. For more details, see the implementation/doc.
This will fail if a captured string also appears before the group's actual position and after the last group; E.g.: /(b) a (a)/. But in that case is always possible to resolve this by adding more groups: e.g.: /(b) (a) (a)/

Other details:

  1. Because the positions of groups which capture an empty string cannot be determined, they will be ignored.
  2. Capturing groups in lookarounds might not be part of match[0]. These will throw errors but I guess that's ok for our highlighting porpuses.
  3. groups.$1 will be ignored if lookbehind: true.

Other ideas

Aliases (Implemented)

I'm thinking about making $n: ['token-name', 'alias-1', 'alias-2'] syntactic sugar for:

$n: {
    'token-name': {
        pattern: /[\s\S]+/,
        alias: ['alias-1', 'alias-2']
    }
}

But maybe that's too much magic? Thoughts?

Combined groups

I'm thinking about combining groups so that e.g. { $n$m: value } is syntactic sugar for

{
    $n: value,
    $m: value
}

Any number of groups are allowed. Duplicate groups will throw errors.

This is useful for cases like this:

'foo': {
    pattern: /({)([^{}]*)(})/,
    groups: {
        $1$3: 'delimiter',
        $2: Prism.languages.bar
    }
}

@RunDevelopment RunDevelopment mentioned this pull request Jan 1, 2019
@mAAdhaTTah
Copy link
Member

My initial inclination is this this adds more code than it eliminates. Is this really a benefit here?

@RunDevelopment
Copy link
Member Author

This is actually quite useful. While making a few of the new language definition, I could have used this a few times.
Of couse this is not the solution to all our problems but it does solve quite a few of mine:

Problems I lose sleep over 😪

Delimiters

There are some language syntaxes which have code surrounded by some delimiter.

An example is JS template strings where the string content and interpolated expressions are surrounded by backticks. Here I want to focus on the ${ expr } which are used for interpolation. We highlight the surrounding ${ and } as interpolation-punctuation.
The token use a fairly common way to match surrounding delimiters:

'interpolation': {
    pattern: /\${[^}]+}/,
    inside: {
        'interpolation-punctuation': {
            pattern: /^\${|}$/,
            alias: 'punctuation'
        },
        rest: Prism.languages.javascript
    }
}

The only problem is that this will match you as many of the left delimiter (here ${) as you'd like. E.g. ${${${}.
Sure, this is not valid syntax in JS but something similar is for other languages. See regex charsets below.

The correct way would be to use a wrapper token like so:

'interpolation': {
    pattern: /\${[^}]+}/,
    inside: {
        'wrapper': {
            pattern: /(^\${)[\s\S]+(?=}$)/,
            lookbehind: true,
            inside: Prism.languages.javascript
        },
        'interpolation-punctuation': {
            pattern: /\${|}/,
            alias: 'punctuation'
        }
    }
}

This is rarely done however because you have to write first pattern at least twice.

Using groups and the proposed alias syntax, it is quite easy to implement correctly without the wrapper token:

'interpolation': {
    pattern: /(\${)([^}]+)(})/,
    groups: {
        $1: ['interpolation-punctuation', 'punctuation'],
        $2: Prism.languages.javascript,
        $3: ['interpolation-punctuation', 'punctuation']
    }
}

Regex charset

Let's take an example from the regex language: Charsets.
Charsets can be negated like this [^not these chars]. Now I want to match [ and ] as punctuation, ^ as charset-negation and the content as something else.
This is pretty easy to implement using groups:

'charset': {
	pattern: /(lookbehind)(\[)(^)?(?:...content...)*(\])/,
	lookbehind: true,
	greedy: true,
	groups: {
		// starting from $2 because of the lookbehind
		$2: 'punctuation',
		$2: 'charset-negation',
		$3: { /* grammar for charset content */ },
		$4: 'punctuation'
	}
}

But it's impossible to do without them (right now) while also avoiding empty tokens and false positives.

Btw. You can have it working with either no empty tokens or no false positives but not both. (I won't proof this here, but try implementing it yourself and you'll see that you can always find a valid JS regex charset which produces false positives or empty tokens (or both if you're really unlucky). For those curious, I choose to avoid empty tokens.)

I understand that this adds a lot of code to Prism core (809 bytes or +13%) for just one feature which not even all languages profit from but I also consider this a very useful feature which can save me and others writing language definitions a lot of trouble. It makes it easier to write correct language definitions without the repetition of (more or less) complex patterns. It even enables matching which it's possible right now (charset example and this) using only one regex execution and a few string operations.

I should also mention that I only applied it to Java stack traces until now because I wanted to avoid merge conflicts (didn't quite work out) but as shown above, this feature can be applied to other languages as well.

@Golmote
Copy link
Contributor

Golmote commented Mar 28, 2019

Hello there! I came across this PR randomly, I thought I'd share my two cents.

I agree there is a need for something like this and many language definitions could benefit from it for sure.

Apart from the increase in code size, the issue I see is that your proposed solution, because of the restrictions JS regexps have, is not really bullet proof. It will probably work like a charm for simple cases, but when an issue occurs (nested captures, wrong offsets, etc.), it might be quite tricky for users to figure out why and how to fix it. What could be seen as a feature to ease the creation of lang def for beginners might instead become a neat tool for the more advanced users.

I also like you "Aliases" idea, it would be a nice shortcut. I'm less convinced by the "Combined groups" idea, which hurts the readability a lot IMO.

@RunDevelopment
Copy link
Member Author

@Golmote Thanks for the comment!

It will probably work like a charm for simple cases, but when an issue occurs (nested captures, wrong offsets, etc.) it might be quite tricky for users to figure out why and how to fix it.

Don't forget lookarounds like /(?<=(.))a(?=(.))/!
getOffsets is only a very simple approximation, so without the use of a different Regex engine or native support it will won't be perfect.
Regarding nested groups (and groups in lookarounds), we could write a test which just checks all patterns which use inside groups for nested groups.

I'm less convinced by the "Combined groups" idea, which hurts the readability a lot IMO.

I'm still looking for a good syntax... Ideas are welcome.

@RunDevelopment
Copy link
Member Author

I'll close this now.

It's a cool feature but I don't see this landing any time soon. RegExp matches indexes are behind a regex flag and browser support isn't great rn. It also adds quite a bit of code complexity to Prism Core.

@kuenzign
Copy link

This would be very beneficial if it was reevaluated. Both Pygments and Highlight.js support regex groups and trying to convert language definitions to PrismJS requires much more reworking than is ideal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants