Group insides #1679

RunDevelopment · 2018-12-26T20:38:46Z

This PR adds support for highlighting using capturing groups.

Motivation

On some occasions, inside is used to match exactly one substring of the parent pattern. This means that this substring has to be matched by the parent pattern and the inside pattern creating redundancy in the patterns (the inside pattern is just a part of the parent pattern) and taking longer than it has to (additional regex matching).

Example 1:

pattern: /foo bar/,
inside: {
    'foo': /^foo/,
    'bar': /bar$/
}

Solution

By utilizing capturing groups in the parent pattern, it is now possible to highlight the captured substring without the additional overhead of another regex operation.
The following is equal to example 1.

Example 2:

pattern: /(foo) (bar)/,
groups: {
    $1: 'foo',
    $2: 'bar'
}

For more examples see prism-javastacktrace.js.

Implementation, Syntax & usage

I choose the $n syntax because it is intuitive (replace uses the same syntax) and easy to read (in comparison to e.g. an array).
Also, it can be easily extended to support named groups.

The value of each $n (n > 0) is a Prism language definition (like inside) which will be matched against the string captured by the n-th group (match[n]).

($n: 'token-name' is just syntactic sugar for $n: { 'token-name': /[\s\S]+/ })

The resulting token streams of the capturing groups and the remaining substrings between groups will be concatenated to create a new token stream. The concatenated token stream will be the content of the new token.

Example 2 will be joined like so:

const content = [ ...tokenize('foo', groups.$1), ' ', ...tokenize('bar', groups.$2) ];

(Adjacent strings in the token stream will be concatenated later on.)

The grammar of inside (if present) will then be to tokenize all strings in content (non-recursively) replacing the strings with the items of resulting token streams.
This also means that greedy patterns in inside cannot change the tokens matched by groups. This is to avoid the hassle of how to deal with greedy re-matching.

Some details

Because it is not possible to get the offsets of capturing groups in the match array, I had to improvise and the result is getOffsets.
Assuming that there are no nested groups, we can use the order of the groups to repeatedly find the next offset using match[0].indexOf. For more details, see the implementation/doc.
This will fail if a captured string also appears before the group's actual position and after the last group; E.g.: /(b) a (a)/. But in that case is always possible to resolve this by adding more groups: e.g.: /(b) (a) (a)/

Other details:

Because the positions of groups which capture an empty string cannot be determined, they will be ignored.
Capturing groups in lookarounds might not be part of match[0]. These will throw errors but I guess that's ok for our highlighting porpuses.
groups.$1 will be ignored if lookbehind: true.

Other ideas

Aliases (Implemented)

I'm thinking about making $n: ['token-name', 'alias-1', 'alias-2'] syntactic sugar for:

$n: {
    'token-name': {
        pattern: /[\s\S]+/,
        alias: ['alias-1', 'alias-2']
    }
}

But maybe that's too much magic? Thoughts?

Combined groups

I'm thinking about combining groups so that e.g. { $n$m: value } is syntactic sugar for

{
    $n: value,
    $m: value
}

Any number of groups are allowed. Duplicate groups will throw errors.

This is useful for cases like this:

'foo': {
    pattern: /({)([^{}]*)(})/,
    groups: {
        $1$3: 'delimiter',
        $2: Prism.languages.bar
    }
}

mAAdhaTTah · 2019-03-10T16:43:23Z

My initial inclination is this this adds more code than it eliminates. Is this really a benefit here?

RunDevelopment · 2019-03-10T19:54:38Z

This is actually quite useful. While making a few of the new language definition, I could have used this a few times.
Of couse this is not the solution to all our problems but it does solve quite a few of mine:

Problems I lose sleep over 😪

Delimiters

There are some language syntaxes which have code surrounded by some delimiter.

An example is JS template strings where the string content and interpolated expressions are surrounded by backticks. Here I want to focus on the ${ expr } which are used for interpolation. We highlight the surrounding ${ and } as interpolation-punctuation.
The token use a fairly common way to match surrounding delimiters:

'interpolation': {
    pattern: /\${[^}]+}/,
    inside: {
        'interpolation-punctuation': {
            pattern: /^\${|}$/,
            alias: 'punctuation'
        },
        rest: Prism.languages.javascript
    }
}

The only problem is that this will match you as many of the left delimiter (here ${) as you'd like. E.g. ${${${}.
Sure, this is not valid syntax in JS but something similar is for other languages. See regex charsets below.

The correct way would be to use a wrapper token like so:

'interpolation': {
    pattern: /\${[^}]+}/,
    inside: {
        'wrapper': {
            pattern: /(^\${)[\s\S]+(?=}$)/,
            lookbehind: true,
            inside: Prism.languages.javascript
        },
        'interpolation-punctuation': {
            pattern: /\${|}/,
            alias: 'punctuation'
        }
    }
}

This is rarely done however because you have to write first pattern at least twice.

Using groups and the proposed alias syntax, it is quite easy to implement correctly without the wrapper token:

'interpolation': {
    pattern: /(\${)([^}]+)(})/,
    groups: {
        $1: ['interpolation-punctuation', 'punctuation'],
        $2: Prism.languages.javascript,
        $3: ['interpolation-punctuation', 'punctuation']
    }
}

Regex charset

Let's take an example from the regex language: Charsets.
Charsets can be negated like this [^not these chars]. Now I want to match [ and ] as punctuation, ^ as charset-negation and the content as something else.
This is pretty easy to implement using groups:

'charset': {
	pattern: /(lookbehind)(\[)(^)?(?:...content...)*(\])/,
	lookbehind: true,
	greedy: true,
	groups: {
		// starting from $2 because of the lookbehind
		$2: 'punctuation',
		$2: 'charset-negation',
		$3: { /* grammar for charset content */ },
		$4: 'punctuation'
	}
}

But it's impossible to do without them (right now) while also avoiding empty tokens and false positives.

Btw. You can have it working with either no empty tokens or no false positives but not both. (I won't proof this here, but try implementing it yourself and you'll see that you can always find a valid JS regex charset which produces false positives or empty tokens (or both if you're really unlucky). For those curious, I choose to avoid empty tokens.)

I understand that this adds a lot of code to Prism core (809 bytes or +13%) for just one feature which not even all languages profit from but I also consider this a very useful feature which can save me and others writing language definitions a lot of trouble. It makes it easier to write correct language definitions without the repetition of (more or less) complex patterns. It even enables matching which it's possible right now (charset example and this) using only one regex execution and a few string operations.

I should also mention that I only applied it to Java stack traces until now because I wanted to avoid merge conflicts (didn't quite work out) but as shown above, this feature can be applied to other languages as well.

Golmote · 2019-03-28T22:07:35Z

Hello there! I came across this PR randomly, I thought I'd share my two cents.

I agree there is a need for something like this and many language definitions could benefit from it for sure.

Apart from the increase in code size, the issue I see is that your proposed solution, because of the restrictions JS regexps have, is not really bullet proof. It will probably work like a charm for simple cases, but when an issue occurs (nested captures, wrong offsets, etc.), it might be quite tricky for users to figure out why and how to fix it. What could be seen as a feature to ease the creation of lang def for beginners might instead become a neat tool for the more advanced users.

I also like you "Aliases" idea, it would be a nice shortcut. I'm less convinced by the "Combined groups" idea, which hurts the readability a lot IMO.

RunDevelopment · 2019-03-28T22:53:14Z

@Golmote Thanks for the comment!

It will probably work like a charm for simple cases, but when an issue occurs (nested captures, wrong offsets, etc.) it might be quite tricky for users to figure out why and how to fix it.

Don't forget lookarounds like /(?<=(.))a(?=(.))/!
getOffsets is only a very simple approximation, so without the use of a different Regex engine or native support it will won't be perfect.
Regarding nested groups (and groups in lookarounds), we could write a test which just checks all patterns which use inside groups for nested groups.

I'm less convinced by the "Combined groups" idea, which hurts the readability a lot IMO.

I'm still looking for a good syntax... Ideas are welcome.

RunDevelopment · 2021-10-01T23:45:58Z

I'll close this now.

It's a cool feature but I don't see this landing any time soon. RegExp matches indexes are behind a regex flag and browser support isn't great rn. It also adds quite a bit of code complexity to Prism Core.

kuenzign · 2023-05-15T17:24:25Z

This would be very beneficial if it was reevaluated. Both Pygments and Highlight.js support regex groups and trying to convert language definitions to PrismJS requires much more reworking than is ideal.

RunDevelopment added 2 commits December 26, 2018 20:33

Added support for highlighting inside of groups

5f09a64

Java Stack traces now use group insides

ec45a94

RunDevelopment added enhancement needs review labels Dec 26, 2018

RunDevelopment mentioned this pull request Dec 26, 2018

Group insides #1472

Closed

RunDevelopment added 2 commits December 28, 2018 14:03

Made offsets and joinStrings private variables

56f76c7

Added some newlines

78eaba9

RunDevelopment mentioned this pull request Jan 1, 2019

C# improvements #1444

Merged

RunDevelopment added 4 commits January 22, 2019 16:47

Resolve merge conflict

ee36323

Minor refactoring

cb94223

Resolved conflict

4040e10

Fixed minor issues

1d30711

mAAdhaTTah added the needs consensus label Mar 5, 2019

RunDevelopment mentioned this pull request Mar 27, 2019

Solidity support revisited #1843

Closed

RunDevelopment added 2 commits March 29, 2019 00:17

Resolved conflict

632e96f

Added alias syntax

0350aeb

RunDevelopment mentioned this pull request Mar 30, 2019

Markdown: Too greedy italic punctuation #1849

Closed

RunDevelopment closed this Oct 1, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Group insides #1679

Group insides #1679

RunDevelopment commented Dec 26, 2018 •

edited

Loading

mAAdhaTTah commented Mar 10, 2019

RunDevelopment commented Mar 10, 2019

Delimiters

Regex charset

Golmote commented Mar 28, 2019

RunDevelopment commented Mar 28, 2019

RunDevelopment commented Oct 1, 2021

kuenzign commented May 15, 2023

Group insides #1679

Group insides #1679

Conversation

RunDevelopment commented Dec 26, 2018 • edited Loading

Motivation

Solution

Implementation, Syntax & usage

Some details

Other ideas

Aliases (Implemented)

Combined groups

mAAdhaTTah commented Mar 10, 2019

RunDevelopment commented Mar 10, 2019

Delimiters

Regex charset

Golmote commented Mar 28, 2019

RunDevelopment commented Mar 28, 2019

RunDevelopment commented Oct 1, 2021

kuenzign commented May 15, 2023

RunDevelopment commented Dec 26, 2018 •

edited

Loading