Group insides #1472

RunDevelopment · 2018-07-09T21:34:59Z

Group insides

I would like to propose a new way to highlight the insides of matched patterns similar to the current inside.

The idea is to allow further patterns to be applied to the text matched by a capturing group.
This would be a generalization of inside.

Syntax

inside<n> will be applied to the n-th capturing group, where group 1 is the first capturing group that is not a lookbehind group.

Example:

'method-declaration': { 
	pattern: /([A-Z]\w*)\s+(\w+)\s*\([^)]*\)/,
	inside1: {
		'class-name': /.+/
	},
	inside2: {
		'function': /.+/
	},
}

inside1 will be applied to the text of the first capturing group ([A-Z]\w*) and inside2 to the text of the second capturing group (\w+).

As a shorter version of:

inside1: {
	'token': /[^]+/
}

we could use:

inside1: 'token'
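A minimal sketch of how the shorthand and the inside<n> keys could be read from a token definition. The helper names normalizeInside and collectGroupInsides are made up for illustration and are not part of Prism; the inside<n> keys themselves are only the syntax proposed here.

```javascript
// Hypothetical helper: expands the proposed string shorthand into a grammar.
function normalizeInside(inside) {
	if (typeof inside === 'string') {
		// [^]+ matches any run of characters in JS, including line breaks.
		return { [inside]: /[^]+/ };
	}
	return inside;
}

// Hypothetical helper: collects the inside<n> grammars of a token definition,
// keyed by capturing group number.
function collectGroupInsides(definition) {
	const result = new Map();
	for (const key of Object.keys(definition)) {
		const keyMatch = /^inside(\d+)$/.exec(key);
		if (keyMatch) {
			result.set(Number(keyMatch[1]), normalizeInside(definition[key]));
		}
	}
	return result;
}

const definition = {
	pattern: /([A-Z]\w*)\s+(\w+)\s*\([^)]*\)/,
	inside1: 'class-name',
	inside2: { 'function': /.+/ }
};
const insides = collectGroupInsides(definition);
```

Here insides maps 1 to a grammar with a single 'class-name' pattern and 2 to the grammar object that was given for inside2.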

Alternatives

Maybe we could also use $n instead of inside<n>, because e.g. inside1 is kind of hard to read and 1 looks similar to l.

For the time being, I used inside<n> for simplicity's sake.

Matching

Before we start: because capturing groups are always properly nested, they can never partially overlap. Any two insides, insideA and insideB, can therefore only be related in one of two ways:

insideA and insideB are disjunct.
In this case, the order doesn't matter and they can be highlighted in any order.
insideA fully contains insideB.
In this case, it's simply more useful if insideB were to be highlighted before insideA.

From these relations, a tree emerges. An InsideMatch will be a node in this tree. The root node will be inside.
If the grammar of a node is not defined by the user, the node will be created with an empty grammar. This is to prevent child nodes without a parent node.

interface InsideMatch {
	index: number;
	length: number;
	text: string;
	children: InsideMatch[]; // disjunct children
	grammar: Object;
}
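A rough sketch of how such a tree could be assembled once the position of every capturing group is known. buildInsideTree and the span shape { index, length, grammar } are assumptions made for illustration. Span offsets are taken relative to the whole matched text, while each node's index is stored relative to its parent's text, which is one possible reading of the interface above. Empty groups are skipped.

```javascript
// Hypothetical helper: nests group spans into a tree of InsideMatch-like nodes.
// Spans are expected to be either disjunct or fully nested.
function buildInsideTree(text, spans, rootGrammar = {}) {
	const root = { index: 0, length: text.length, text, children: [], grammar: rootGrammar };
	// Outer spans first: sort by start offset, then by descending length.
	const sorted = spans
		.filter(span => span.length > 0)
		.sort((a, b) => a.index - b.index || b.length - a.length);
	// Stack of open ancestors together with their absolute start and end offsets.
	const stack = [{ node: root, start: 0, end: text.length }];
	for (const span of sorted) {
		const end = span.index + span.length;
		// Close every ancestor that ends before this span does.
		while (stack.length > 1 && end > stack[stack.length - 1].end) {
			stack.pop();
		}
		const parent = stack[stack.length - 1];
		const node = {
			index: span.index - parent.start, // relative to the parent's text
			length: span.length,
			text: text.slice(span.index, end),
			children: [],
			grammar: span.grammar || {} // empty grammar if the user defined none
		};
		parent.node.children.push(node);
		stack.push({ node, start: span.index, end });
	}
	return root;
}

const tree = buildInsideTree('abcde', [
	{ index: 2, length: 1 },
	{ index: 1, length: 3 },
	{ index: 4, length: 1 }
]);
```

In this example the root gets two disjunct children ('bcd' and 'e'), and 'bcd' in turn gets the child 'c'.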

The following pseudocode illustrates how the matching will occur.

function matchInside(inside: InsideMatch): (string | Token)[] {
	const tokens = matchDisjunctInsides(inside.text, inside.children);
	Prism.matchGrammar(inside.text, tokens, inside.grammar);
	return tokens;
}

function matchDisjunctInsides(text: string, insides: InsideMatch[]): (string | Token)[] {
	const tokens: (string | Token)[] = [];
	for (const inside of insides) {
		tokens.push(getTextBefore(text, inside));
		tokens.push(...matchInside(inside));
	}
	tokens.push(getRemainingTextAfterInsides(text, insides));
	cleanTokens(tokens);
	return tokens;
}

getTextBefore will return the text between the end of the previous InsideMatch (or the start of the text, if there is none) and the start of the given InsideMatch.
getRemainingTextAfterInsides will return the text after the last InsideMatch or the whole text if no InsideMatch was given.
cleanTokens will remove empty strings and join adjacent ones.
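The pseudocode above can be turned into a small runnable sketch. Several things are assumptions here: applyGrammar is a heavily simplified stand-in for Prism.matchGrammar (non-global patterns only, no lookbehind, no nested inside), tokens are plain { type, content } objects, and each child's index is relative to its parent's text. The two helper functions are inlined as slice calls.

```javascript
// Simplified stand-in for Prism.matchGrammar: tokenizes every plain string in
// `tokens` with the patterns of `grammar` and leaves existing tokens untouched.
function applyGrammar(tokens, grammar) {
	for (const [type, pattern] of Object.entries(grammar)) {
		for (let i = 0; i < tokens.length; i++) {
			const str = tokens[i];
			if (typeof str !== 'string') continue;
			const found = pattern.exec(str);
			if (!found || found[0].length === 0) continue;
			const before = str.slice(0, found.index);
			const after = str.slice(found.index + found[0].length);
			const replacement = [{ type, content: found[0] }];
			if (before) replacement.unshift(before);
			if (after) replacement.push(after);
			tokens.splice(i, 1, ...replacement);
			if (before) i++; // step onto the new token; `after` is visited next
		}
	}
	return tokens;
}

// Removes empty strings and joins adjacent strings.
function cleanTokens(tokens) {
	const cleaned = [];
	for (const token of tokens) {
		if (token === '') continue;
		const last = cleaned[cleaned.length - 1];
		if (typeof token === 'string' && typeof last === 'string') {
			cleaned[cleaned.length - 1] = last + token;
		} else {
			cleaned.push(token);
		}
	}
	return cleaned;
}

function matchInside(inside) {
	const tokens = matchDisjunctInsides(inside.text, inside.children);
	return applyGrammar(tokens, inside.grammar);
}

function matchDisjunctInsides(text, insides) {
	const tokens = [];
	let position = 0;
	for (const inside of insides) {
		tokens.push(text.slice(position, inside.index)); // getTextBefore
		tokens.push(...matchInside(inside));
		position = inside.index + inside.length;
	}
	tokens.push(text.slice(position)); // getRemainingTextAfterInsides
	return cleanTokens(tokens);
}

const root = {
	index: 0, length: 16, text: 'String getName()',
	grammar: { 'punctuation': /[()]/ },
	children: [
		{ index: 0, length: 6, text: 'String', children: [], grammar: { 'class-name': /.+/ } },
		{ index: 7, length: 7, text: 'getName', children: [], grammar: { 'function': /.+/ } }
	]
};
const highlighted = matchInside(root);
```

The children are highlighted first, so the root grammar only ever sees the text that no child has already turned into a token.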

Implementation

Sadly, JS does not return the position of a group, so it's a little tricky to implement.

The idea is to do the following:

/a(b+)c(d+)e/ -> /(a)(b+)(c)(d+)e/

We rewrite the pattern adding new groups to capture everything preceding a capturing group as well. Keep track of how many groups you added and you can calculate the index of each group.
Nested capturing groups can be handled as well, by applying the method described above recursively to the contents of each capturing group that itself contains a capturing group.
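To make the index calculation concrete, here is a small sketch for the flat (non-nested) case only. The pattern is rewritten by hand, and groupMap and groupPositions are names invented for this illustration. It relies on the assumption that the groups preceding a rewritten group cover all of the matched text in front of it.

```javascript
// Original pattern: /a(b+)c(d+)e/
// Rewritten by hand so that the text in front of each group is captured too.
const rewritten = /(a)(b+)(c)(d+)e/;
// Maps each original group number to its number in the rewritten pattern.
const groupMap = { 1: 2, 2: 4 };

// Hypothetical helper: computes the absolute index of every original group
// by summing the lengths of all rewritten groups in front of it.
function groupPositions(match, groupMap) {
	const positions = {};
	for (const [original, rewrittenNumber] of Object.entries(groupMap)) {
		let offset = 0;
		for (let i = 1; i < rewrittenNumber; i++) {
			offset += (match[i] || '').length; // unmatched groups count as empty
		}
		positions[original] = {
			index: match.index + offset,
			text: match[rewrittenNumber]
		};
	}
	return positions;
}

const match = rewritten.exec('xxabbbcdde');
const positions = groupPositions(match, groupMap);
```

For the input 'xxabbbcdde', the first original group ('bbb') ends up at index 3 and the second ('dd') at index 7.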

One will have to be careful because lookbehind groups are sometimes preceded by things like ^ or \b. To solve this, we can use the established assumption that everything preceding the lookbehind groups will have length 0.

Backreferences might also have to be rewritten because of the new capturing group.
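A sketch of what renumbering backreferences could look like. renumberBackreferences is a hypothetical helper, and it makes simplifying assumptions: every run of digits after a backslash outside a character class is treated as one backreference number, and named groups are ignored.

```javascript
// Hypothetical helper: rewrites \N backreferences in a regex source string
// according to groupMap (original group number -> new group number).
function renumberBackreferences(source, groupMap) {
	let result = '';
	let inClass = false;
	for (let i = 0; i < source.length; i++) {
		const char = source[i];
		if (char === '\\') {
			const digits = /^[1-9]\d*/.exec(source.slice(i + 1));
			if (digits && !inClass) {
				// A backreference: replace its number if the group was moved.
				const mapped = groupMap[Number(digits[0])];
				result += '\\' + (mapped !== undefined ? mapped : digits[0]);
				i += digits[0].length;
			} else {
				// Any other escape is copied verbatim, including the escaped char.
				result += char + (source[i + 1] || '');
				i++;
			}
			continue;
		}
		if (char === '[') inClass = true;
		else if (char === ']') inClass = false;
		result += char;
	}
	return result;
}

// /a(["'])b\1/ becomes /(a)(["'])b\2/ once the extra group (a) is added.
const renumbered = renumberBackreferences('(a)(["\'])b\\1', { 1: 2 });
const renumberedPattern = new RegExp(renumbered);
```

Escaped backslashes (as in \\1) and escapes inside character classes are left alone, since neither is a backreference.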

Rewriting the pattern won't be easy. Needless to say, we won't do this with every pattern, only where we have to.
(Maybe we could even do it with gulp?)

Limitations

Backreferences will be a fundamental limiting factor because JS only allows backreferences to the first 10 capturing groups. This means that it might not be possible to rewrite a given backreference.

Use cases

When I want to match the return type of a function in languages like Java, C or C#, I usually write 2 patterns: one to match the function name and one to match the return type. The problem is that this is inefficient, because to correctly match the return type in front of a function, I also have to match the function name itself to be sure that it is indeed the return type of a function. In the end, I match the return type once and the function name twice.

This can get even more complicated in languages like C#, where a function declaration can look like this:

ReturnType SuperInterface.Function();

Matching ReturnType, SuperInterface and Function would require 3 complicated patterns with lots of redundancy if it were to be done the current way.
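With the proposed syntax, the same declaration could be covered by a single pattern. The following is only an illustration: the inside<n> keys are the proposal from this issue and are not understood by Prism as is, and the pattern is deliberately simplified (no generics, modifiers or namespaces).

```javascript
// Hypothetical token definition using the proposed inside<n> shorthand.
const methodDeclaration = {
	pattern: /(\w+)\s+(\w+)\.(\w+)\s*\(/,
	inside1: 'class-name', // ReturnType
	inside2: 'class-name', // SuperInterface
	inside3: 'function'    // Function
};

// Each part is matched exactly once; the groups tell us which part is which.
const declarationMatch = methodDeclaration.pattern.exec('ReturnType SuperInterface.Function();');
const [, returnType, superInterface, functionName] = declarationMatch;
```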

RunDevelopment · 2018-12-26T20:39:39Z

Closed because of #1679.

RunDevelopment mentioned this issue Jul 11, 2018

C# improvements #1444

Merged

mAAdhaTTah added the enhancement label Jul 20, 2018

RunDevelopment mentioned this issue Nov 19, 2018

Add support for HCL #1594

Merged

RunDevelopment closed this as completed Dec 26, 2018
