Group insides #1679
My initial inclination is that this adds more code than it eliminates. Is this really a benefit here?
This is actually quite useful. While writing a few of the new language definitions, I could have used this a few times.
Problems I lose sleep over 😪
Delimiters
There are some language syntaxes which have code surrounded by some delimiter. An example is JS template strings, where the string content and interpolated expressions are surrounded by backticks. Here I want to focus on the
'interpolation': {
pattern: /\${[^}]+}/,
inside: {
'interpolation-punctuation': {
pattern: /^\${|}$/,
alias: 'punctuation'
},
rest: Prism.languages.javascript
}
}
The only problem is that this will match you as many of the left delimiter (here
The correct way would be to use a wrapper token like so:
'interpolation': {
pattern: /\${[^}]+}/,
inside: {
'wrapper': {
pattern: /(^\${)[\s\S]+(?=}$)/,
lookbehind: true,
inside: Prism.languages.javascript
},
'interpolation-punctuation': {
pattern: /\${|}/,
alias: 'punctuation'
}
}
}
This is rarely done, however, because you have to write the first pattern at least twice. Using groups and the proposed alias syntax, it is quite easy to implement correctly without the wrapper token:
'interpolation': {
pattern: /(\${)([^}]+)(})/,
groups: {
$1: ['interpolation-punctuation', 'punctuation'],
$2: Prism.languages.javascript,
$3: ['interpolation-punctuation', 'punctuation']
}
}
Regex charset
Let's take an example from the regex language: Charsets.
'charset': {
pattern: /(lookbehind)(\[)(\^)?((?:...content...)*)(\])/,
lookbehind: true,
greedy: true,
groups: {
// starting from $2 because $1 is the lookbehind group
$2: 'punctuation',
$3: 'charset-negation',
// the content needs its own capturing group to be addressable
$4: { /* grammar for charset content */ },
$5: 'punctuation'
}
}
But it's impossible to do without them (right now) while also avoiding empty tokens and false positives.
Btw. You can have it working with either no empty tokens or no false positives but not both. (I won't prove this here, but try implementing it yourself and you'll see that you can always find a valid JS regex charset which produces false positives or empty tokens (or both if you're really unlucky). For those curious, I chose to avoid empty tokens.)
I understand that this adds a lot of code to Prism core (809 bytes or +13%) for just one feature which not all languages even profit from, but I also consider this a very useful feature which can save me and others writing language definitions a lot of trouble. It makes it easier to write correct language definitions without the repetition of (more or less) complex patterns. It even enables matching which isn't possible right now (charset example and this) using only one regex execution and a few string operations.
I should also mention that I have only applied it to Java stack traces so far because I wanted to avoid merge conflicts (didn't quite work out), but as shown above, this feature can be applied to other languages as well.
Hello there! I came across this PR randomly and thought I'd share my two cents. I agree there is a need for something like this, and many language definitions could benefit from it for sure. Apart from the increase in code size, the issue I see is that your proposed solution, because of the restrictions JS regexps have, is not really bulletproof. It will probably work like a charm for simple cases, but when an issue occurs (nested captures, wrong offsets, etc.), it might be quite tricky for users to figure out why and how to fix it. What could be seen as a feature to ease the creation of lang defs for beginners might instead become a neat tool for the more advanced users. I also like your "Aliases" idea; it would be a nice shortcut. I'm less convinced by the "Combined groups" idea, which hurts the readability a lot IMO.
@Golmote Thanks for the comment!
Don't forget lookarounds like
I'm still looking for a good syntax... Ideas are welcome.
I'll close this now. It's a cool feature but I don't see this landing any time soon. RegExp match indices are behind a regex flag and browser support isn't great rn. It also adds quite a bit of code complexity to Prism Core.
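For reference, a minimal sketch of the match indices feature mentioned above, assuming a runtime that supports the `d` (hasIndices) regex flag. The variable names are only illustrative. With this flag, the match array carries an `indices` property that holds a `[start, end]` pair for every capturing group, so the offsets do not have to be guessed.

```javascript
// Assumes an engine with support for the `d` (hasIndices) flag.
const indexedMatch = /(b) a (a)/d.exec('b a a');

// indices[n] is the [start, end) range of group n inside the input string.
const groupOneRange = indexedMatch.indices[1];
const groupTwoRange = indexedMatch.indices[2];

console.log(groupOneRange, groupTwoRange);
```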
This would be very beneficial if it was reevaluated. Both Pygments and Highlight.js support regex groups and trying to convert language definitions to PrismJS requires much more reworking than is ideal.
This PR adds support for highlighting using capturing groups.
Motivation
On some occasions, inside is used to match exactly one substring of the parent pattern. This means that this substring has to be matched by the parent pattern and the inside pattern, creating redundancy in the patterns (the inside pattern is just a part of the parent pattern) and taking longer than it has to (additional regex matching).
Example 1:
Solution
By utilizing capturing groups in the parent pattern, it is now possible to highlight the captured substring without the additional overhead of another regex operation.
The following is equal to example 1.
Example 2:
For more examples see prism-javastacktrace.js.
Implementation, Syntax & usage
I chose the $n syntax because it is intuitive (replace uses the same syntax) and easy to read (in comparison to e.g. an array).
Also, it can be easily extended to support named groups.
The value of each $n (n > 0) is a Prism language definition (like inside) which will be matched against the string captured by the n-th group (match[n]).
($n: 'token-name' is just syntactic sugar for $n: { 'token-name': /[\s\S]+/ })
The resulting token streams of the capturing groups and the remaining substrings between groups will be concatenated to create a new token stream. The concatenated token stream will be the content of the new token.
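The concatenation step could be sketched roughly as follows. This is not the PR's actual code: `joinGroups` is a hypothetical helper, the group offsets are assumed to be already known, and every group value is assumed to be a plain token name (the sugar form) rather than a full grammar.

```javascript
// Hypothetical helper for illustration only, not Prism's implementation.
// offsets[n] is assumed to be the start index of group n inside match[0].
function joinGroups(match, offsets, groups) {
  const full = match[0];
  const stream = [];
  let pos = 0;
  for (let n = 1; n < match.length; n++) {
    const name = groups['$' + n];
    // Skip groups without a definition and groups that captured nothing.
    if (!name || !match[n]) continue;
    const start = offsets[n];
    // Plain text between the previous token and this group stays a string.
    if (start > pos) stream.push(full.slice(pos, start));
    stream.push({ type: name, content: match[n] });
    pos = start + match[n].length;
  }
  // Remaining text after the last group.
  if (pos < full.length) stream.push(full.slice(pos));
  return stream;
}

const interpolationMatch = /(\$\{)([^}]+)(\})/.exec('${a + b}');
const tokenStream = joinGroups(interpolationMatch, [0, 0, 2, 7], {
  $1: 'interpolation-punctuation',
  $3: 'interpolation-punctuation'
});
console.log(tokenStream);
```

Here $2 has no definition, so its text stays a plain string in the stream and could later be tokenized by another grammar.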
Example 2 will be joined like so:
(Adjacent strings in the token stream will be concatenated later on.)
The grammar of inside (if present) will then be used to tokenize all strings in content (non-recursively), replacing the strings with the items of the resulting token streams.
This also means that greedy patterns in inside cannot change the tokens matched by groups. This is to avoid the hassle of how to deal with greedy re-matching.
Some details
Because it is not possible to get the offsets of capturing groups in the match array, I had to improvise and the result is getOffsets.
Assuming that there are no nested groups, we can use the order of the groups to repeatedly find the next offset using match[0].indexOf. For more details, see the implementation/doc.
This will fail if a captured string also appears before the group's actual position and after the last group; e.g.: /(b) a (a)/. But in that case it is always possible to resolve this by adding more groups, e.g.: /(b) (a) (a)/
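The indexOf approach and its failure case can be illustrated with a small sketch. `getOffsetsSketch` is a hypothetical re-implementation written for this example. The PR's getOffsets may differ in its details.

```javascript
// Hypothetical re-implementation for illustration; assumes no nested groups.
function getOffsetsSketch(match) {
  const offsets = [0]; // match[0] starts at offset 0 relative to itself
  let from = 0;
  for (let n = 1; n < match.length; n++) {
    if (match[n] === undefined) {
      offsets.push(-1); // group did not participate in the match
      continue;
    }
    // Search for the captured string after the end of the previous group.
    const start = match[0].indexOf(match[n], from);
    offsets.push(start);
    from = start + match[n].length;
  }
  return offsets;
}

// The second group actually starts at index 4, but the uncaptured "a" at
// index 2 is found first, so the computed offset is wrong.
const ambiguousOffsets = getOffsetsSketch(/(b) a (a)/.exec('b a a'));

// Capturing the middle "a" as well removes the ambiguity.
const resolvedOffsets = getOffsetsSketch(/(b) (a) (a)/.exec('b a a'));

console.log(ambiguousOffsets, resolvedOffsets);
```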
Other details:
match[0]. These will throw errors but I guess that's ok for our highlighting purposes.
groups.$1 will be ignored if lookbehind: true.
Other ideas
Aliases (Implemented)
I'm thinking about making $n: ['token-name', 'alias-1', 'alias-2'] syntactic sugar for:
But maybe that's too much magic? Thoughts?
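One way such shorthand values might be expanded is sketched below. Both the `expandGroupValue` helper and the expanded shape of the array form (a token object with Prism's usual `pattern` and `alias` properties) are assumptions made for illustration, not the PR's confirmed behavior. Only the string form is stated in the description above.

```javascript
// Hypothetical normalizer; the name and the array expansion are assumptions.
function expandGroupValue(value) {
  if (typeof value === 'string') {
    // $n: 'token-name'  ->  { 'token-name': /[\s\S]+/ }
    return { [value]: /[\s\S]+/ };
  }
  if (Array.isArray(value)) {
    // Assumed expansion: the first item is the token name, the rest are aliases.
    const [name, ...aliases] = value;
    return { [name]: { pattern: /[\s\S]+/, alias: aliases } };
  }
  return value; // already a full grammar object
}

const fromString = expandGroupValue('punctuation');
const fromArray = expandGroupValue(['interpolation-punctuation', 'punctuation']);
console.log(fromString, fromArray);
```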
Combined groups
I'm thinking about combining groups so that e.g. { $n$m: value } is syntactic sugar for
Any number of groups are allowed. Duplicate groups will throw errors.
This is useful for cases like this: