JS: Tokenize : as an operator #2073

Merged: 1 commit into PrismJS:master from the js-operator branch on Sep 30, 2019

RunDevelopment
Member

This resolves #2072.

This will make JS and JSON consistent in how they tokenize :.

@mAAdhaTTah
Member

To make this broader: I'm not in love with changing all : from punctuation -> operator because of ternaries. Their usage in object literals is far more common than in ternaries, so punctuation is more likely to be correct.

@RunDevelopment
Member Author

I am for changing : to operator because of consistency. JSON also uses operator, and ternaries look a little strange with : as punctuation.
There is also the point that other highlighters also use something akin to operator, e.g. GitHub:

const value = { hello: 'world', foo: (bar ? 1 : 0) };

@Golmote
Contributor

Golmote commented Sep 26, 2019

It's true that the JSON component already has it tagged as an operator for some reason.

I'm only worried that in the end, someone will open another issue asking for the exact opposite of this PR.

@MattMcFarland

I agree that : should not be an operator when it is punctuation.

I'm not sure how PrismJS is tokenizing things, but I believe that in order to fix this, you would have to stop broadly assuming : is punctuation, and instead take previously tokenized items as context and apply the correct token type, "punctuation" or "operator".

@RunDevelopment
Member Author

Well, IMO the underlying issue here is that there is no clear-cut definition of what differentiates an operator from mere punctuation.
I always felt that this divide is kind of artificial but it's quite useful for highlighting.

So my question: What defines an operator vs punctuation?
The JS language spec doesn't.

@MattMcFarland

It's true that the JSON component already has it tagged as an operator for some reason.

I'm only worried that in the end, someone will open another issue asking for the exact opposite of this PR.

I am afraid of this happening too.

The reason I brought this up in an issue is that our UX designer is looking at the code snippets on our site and not seeing parity with GitHub (we are using GitHub's theme). The first thing they pointed out to us was that our colons were not matching. Since they are going over this with a fine-toothed comb, I'm sure if it does happen, it will happen much sooner rather than later lol.

@Golmote
Contributor

Golmote commented Sep 26, 2019

@RunDevelopment The JS spec does define it, I believe. The colon is part of the Conditional Operator (12.14) which is an operator as stated, but also part of the PropertyName of an ObjectLiteral (12.2.6) in which case nothing says it's an operator.
But the spec also defines a comma operator, which we do not handle. x')

@MattMcFarland The thing is Prism is regex-based, it has no notion of context. Absolute correctness is not a goal for this project, as Lea once said.

If we can handle some cases of colons used as operator, without adding too much complexity, that's good. But we most likely won't be able to handle the generic case.

If your UX designer expects your code snippets to be highlighted with absolute correctness, then I'm afraid Prism is not the right tool for the job.
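
To illustrate the point about context, here is a minimal sketch of a regex-only tokenizer. It is not Prism's implementation; the `rules` table and the `tokenize` helper are hypothetical names made up for this example. Each rule only sees the text at the current position, so the object literal's colon and the ternary's colon end up with the same token type.

```javascript
// Hypothetical mini-tokenizer: every token type is one sticky regex, tried in order.
const rules = [
	["string", /"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'/y],
	["punctuation", /[{}[\]();,.:]/y],
	["operator", /[?=]/y],
];

function tokenize(code) {
	const tokens = [];
	let pos = 0;
	while (pos < code.length) {
		let matched = false;
		for (const [type, re] of rules) {
			re.lastIndex = pos; // sticky regexes only match exactly at lastIndex
			const m = re.exec(code);
			if (m) {
				tokens.push({ type, text: m[0] });
				pos += m[0].length;
				matched = true;
				break;
			}
		}
		if (!matched) {
			pos++; // plain text is simply skipped in this sketch
		}
	}
	return tokens;
}

// Both colons get whatever type the single ":" rule assigns.
const colons = tokenize("const v = { foo: (bar ? 1 : 0) };").filter(t => t.text === ":");
console.log(colons);
```

Moving `:` from the punctuation rule to the operator rule flips both colons at once, which is the trade-off being discussed in this thread.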

@MattMcFarland

Well, IMO the underlying issue here is that there is no clear-cut definition of what differentiates an operator from mere punctuation.
I always felt that this divide is kind of artificial but it's quite useful for highlighting.

So my question: What defines an operator vs punctuation?
The JS language spec doesn't.

Not having looked at the spec myself, I believe you. I think you have a very convincing argument for broadly assuming : is an operator. Even if someone comes back saying they think it should be punctuation, it looks like : is used as an operator in GitHub's syntax highlighting, if you look at this snippet:

const foo = bar ? 0 : 1
const baz = {
  stuff: things,
  things: 'stuff',
  doStuff: () => MyThing.doStuff()
}

@MattMcFarland

@Golmote Please.. Say it ain't so! I'd much rather stick with prism if I can. :) I do think that if there are no negative consequences for switching : to operator from punctuation, then you should do it. It does look like it is treated like operator when looking at github snippets at least.

@RunDevelopment
Member Author

I also agree that we shouldn't change our tokenization to fit a specific theme.
I implemented this because of consistency.


@Golmote
The spec even has this punctuator section.

@RunDevelopment
Member Author

@Golmote @mAAdhaTTah
Do we have some guidelines to distinguish between punctuation and operator?

Because when I create a language, I just choose whatever I think looks better in Prism's themes. Even though this kinda goes against 'we shouldn't change our tokenization to fit a specific theme'.

@MattMcFarland

I agree you shouldn't change tokenization to fit a specific theme as well.

When I bring up GitHub's theme and am looking for parity, it's only to show my use case.

As for the issue, I made it a point to leave themes out of it and just stick with the facts, where an operator was being misread as punctuation. That's what I was hoping to get a fix for :)

As for parity with Github, that's something my UX designer is using as a benchmark. If they can give us some leeway that would be great, if not, it would just create more work on our board that we would probably have to defer.

@Golmote
Contributor

Golmote commented Sep 26, 2019

Consistency is a good enough argument to me, honestly. So I won't fight against this PR.

@RunDevelopment No, we don't have guidelines for this. I personally tried to stick to the specs whenever possible, but I might have made aesthetic choices too, from time to time. Again, given Prism is no linter, that's perfectly fine.

@mAAdhaTTah
Member

I'm not sure that spec reference is that helpful, cuz it shows some tokens that are definitely operators, like spread (...). Broadly, what is an operator is defined by the spec. It sounds like punctuation is... made up? for the benefit of syntax highlighting?

@RunDevelopment So no, I don't think we have guidelines for "punctuation".


It does look like it is treated like operator when looking at github snippets at least.

It looks this way because the GitHub theme highlights both with the same color. The ? is also highlighted in that color, so hypothetically, if whatever highlighter GH uses highlighted them as different things but applied the same color, the result would be the same while "treating" them as different.

@mAAdhaTTah
Member

If you need to satisfy your designer, you could easily just change the color of punctuation & operator to be the same in your theme. No conflict 😄

@MattMcFarland

@mAAdhaTTah I tried that but it didn't work. It changes commas and dots to red (github treats them differently than operator)

@RunDevelopment
Member Author

@MattMcFarland Maybe we could add a bit of functionality similar to Highlight keywords for general tokens to satisfy your UX designer? I'm thinking about the ability to modify the classes of tokens on a per-language level.
Anyway, we shouldn't discuss this here.


@Golmote

I personally tried to stick to the specs, whenever possible

Me too, but most specs just call everything an operator, which isn't very applicable. Also, where the distinction exists at all, what counts as punctuation vs. operator varies from spec to spec.

@mAAdhaTTah
Member

Ah so maybe GH's highlighter is treating : as an operator.

@RunDevelopment
Member Author

Ah so maybe GH's highlighter is treating : as an operator.

Seems like it. It's highlighted like = but differently from (){}.

@mAAdhaTTah
Member

Inconsistency between JS & JSON is a bit weird. We should probably bring them into alignment. The issue for me is it seems weird to apply ": is an operator" to JSON because it's definitely not an operator there. I know this flies in the face of the original request, but my inclination is to make them both punctuation.

@Golmote
Contributor

Golmote commented Sep 26, 2019

Funnily enough, the Rouge syntax highlighter switched from having ? and : tagged as operators to having them both tagged as punctuation in 2012.

@mAAdhaTTah
Member

lol there's your consistency: make ? punctuation!

@RunDevelopment
Member Author

make ? punctuation!

We can also do that.
But with that, I have another question: What about optional chaining and nullish coalescing?

?? is an operator for sure but ?. ?
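
For what it's worth, the optional chaining grammar lists `?.` as a punctuator and requires that it is not followed by a decimal digit, so that `a?.5:1` still parses as a ternary. A hedged sketch of patterns that respect this rule follows; the names `nullish` and `optionalChain` are hypothetical, and this is not necessarily what Prism ended up using.

```javascript
// "??" is matched as a unit.
const nullish = /\?\?/g;
// "?." only counts as optional chaining when no digit follows it.
const optionalChain = /\?\.(?!\d)/g;

console.log("a ?? b".match(nullish));
console.log("a?.b?.[0]".match(optionalChain));
// Here "?" starts a ternary and ".5" is a number, so there is no match.
console.log("a?.5:1".match(optionalChain));
```

Whichever token type is chosen for `?.`, the digit lookahead is what keeps the ternary case from being swallowed.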

@mAAdhaTTah
Member

sorry sorry i was kidding 😄

@RunDevelopment
Member Author

Seems like the discussion isn't over. 😄

But on the point of : being an operator: consider other C-style languages such as C/C++, where :: is an operator but : is punctuation, even though, to my knowledge, : can only be part of ?:.
And regarding JSON: in YAML, a superset of JSON, : is highlighted as punctuation.

Point is: If we really want to make it consistent, we have a looot of languages to go through where we can have the same discussion. Maybe we should create guidelines so we can make all languages roughly consistent with each other and not just JS and JSON.

@Golmote
Contributor

Golmote commented Sep 26, 2019

In C/C++, the colon isn't only used in ternary. (see https://stackoverflow.com/questions/1711990/what-is-this-weird-colon-member-syntax-in-the-constructor for example)

Guidelines would be nice, but what would they be based on? What kind of generic property can be used to definitively make a choice here (and one that hopefully doesn't require a really deep knowledge of each language)?

@mAAdhaTTah
Member

Seems like the discussion isn't over. 😄

Hah yeah I didn't mean to imply I was making a definitive answer here.


Part of me was thinking guidelines would be helpful here, but I actually don't know that we could come up with guidelines that apply broadly enough to be useful.


@mAAdhaTTah Is it that bad to have : tagged as an operator solely for aesthetic purpose?

No. The goal of Prism has never been correctness, so I don't really mind doing it for that reason either. A goal of consistency leads me to suggest punctuation for :, but if the goal is aesthetics then operator may be the way to go here.

Y'all are the regex experts, but I'm assuming it's impossible to highlight the ternary's colon as an operator separately from an object literal's colon as punctuation, correct?


TIL YAML is a superset of JSON.

@Golmote
Contributor

Golmote commented Sep 27, 2019

Y'all are the regex experts, but I'm assuming it's impossible to highlight the ternary's colon as an operator separately from an object literal's colon as punctuation, correct?

We can certainly handle simple cases of ternary, like literals. But we most likely won't be able to handle every complex expression, especially expressions containing other colons. Furthermore ternaries can be infinitely nested, which we can't handle either.

@Golmote
Contributor

Golmote commented Sep 27, 2019

Stuff like

a ? {b: function() { label: return 1; }, c: /d:/ ? e : "f:g"} : h;

Ugh.
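
A rough sketch of the "simple cases" idea: tag a colon as an operator only when it directly follows `?` plus one simple literal or identifier. The name `simpleTernaryColon` is hypothetical, and the pattern uses a native lookbehind assertion, which assumes a reasonably recent JS engine. On the expression above it finds the inner ternary's colon but misses the outer one, and it stays away from the colons in the object keys, the label, the regex, and the string.

```javascript
// ":" preceded by "?", optional whitespace, and one word/number or simple string.
const simpleTernaryColon = /(?<=\?\s*(?:\w+|"[^"]*"|'[^']*')\s*):/g;

function countMatches(code) {
	return (code.match(simpleTernaryColon) || []).length;
}

// Simple ternary: the colon is found.
console.log(countMatches("bar ? 1 : 0"));
// Object literal: no "?" before the key, so nothing is found.
console.log(countMatches("{ hello: 'world' }"));
// Nested case: only the colon after "? e" qualifies.
console.log(countMatches('a ? {b: function() { label: return 1; }, c: /d:/ ? e : "f:g"} : h;'));
```

Covering the outer colon would require matching the whole object literal between `?` and `:`, which is where the complexity comes from.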

@RunDevelopment
Member Author

I'm assuming it's impossible to highlight the ternary's colon as an operator separately from an object literal's colon as punctuation, correct?

The problem, like @Golmote said, is that there can be any valid JS expression between ? and :.
The infinite nesting problem can be solved by just stopping after a fixed level of nesting.
I actually wrote a little function which could transform a CF grammar into a RE with that idea. The pattern of a JS expression with a max nesting level 3 was >10kB if I remember correctly. Nesting with multiple branches means that our pattern grows exponentially. It's not fun and it becomes even worse because when I tested this with Chrome, it just wouldn't work if patterns were too long.

Point is: While possible, please don't.
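
To give a feel for the growth, here is a hypothetical sketch (not the function mentioned above) that expands a tiny bracket grammar into a regex with a fixed maximum nesting depth. It only handles balanced `()`, `[]` and `{}`, which is far simpler than a JS expression, yet the source more than triples with every extra level because each level embeds three copies of the previous one.

```javascript
// Builds a regex source that accepts brackets nested at most `depth` levels deep.
function nested(depth) {
	const atom = "[^()\\[\\]{}]"; // any character that is not a bracket
	if (depth === 0) {
		return `${atom}*`;
	}
	const inner = nested(depth - 1);
	// One copy of the previous level per bracket type.
	return `(?:${atom}|\\(${inner}\\)|\\[${inner}\\]|\\{${inner}\\})*`;
}

// Print the pattern length for each depth to see the growth.
for (let depth = 0; depth <= 4; depth++) {
	console.log(depth, nested(depth).length);
}

const depth2 = new RegExp(`^${nested(2)}$`);
console.log(depth2.test("a(b[c])d")); // nesting depth 2: accepted
console.log(depth2.test("(((x)))")); // nesting depth 3: rejected
```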

A goal of consistency leads me to suggest punctuation for :

@mAAdhaTTah Why is punctuation consistent for :? Do you mean consistent with other languages? Because most C style languages have the same problem.

This is why I said that making ? punctuation isn't such a bad idea. And we could even change this in C like to apply it to all C style languages.
It's just a little... strange. Because ?: is definitely an operator.


Regarding guidelines:
I don't think we can do much better than trivial stuff like ()[]{},; is punctuation.
But then there are languages like EBNF where , is the concatenation operator... (I tokenized it as punctuation despite that.)

The only thing we might be able to do is to go for aesthetics and utility (aka not too ugly and not all the same token type).

@mAAdhaTTah
Member

I think I've got a principle we can apply that gets us : as an operator:

Don't make any token "punctuation" if it fits into another token type. In other words, "punctuation" is a token of last resort.

If we accept this principle, we have a guideline for deciding what "punctuation" is in the future while giving us the outcome we want on JS & JSON (this should stay as an operator to be consistent).
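
One way to read this principle in code: try every other token type first and fall back to punctuation only when nothing else fits. The sketch below is hypothetical (`tokenTypes` and `classify` are made up for illustration, and the patterns are heavily simplified compared to the real JS grammar).

```javascript
// Order matters: punctuation sits last so it only catches what nothing else claims.
const tokenTypes = [
	["operator", /^(?:[-+*/%&|^!=<>]=?|[~?:])$/],
	["punctuation", /^[{}[\]();,.:]$/],
];

function classify(symbol) {
	for (const [type, pattern] of tokenTypes) {
		if (pattern.test(symbol)) {
			return type;
		}
	}
	return undefined;
}

// ":" fits both lists, but operator is tried first.
console.log(classify(":"));
// "," and "{" fit nothing else, so they fall through to punctuation.
console.log(classify(","));
console.log(classify("{"));
```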

@MattMcFarland

MattMcFarland commented Sep 30, 2019

I like that principle @mAAdhaTTah

EDIT: PS: Just realized your tag is "mad hatter" - neat!

Member

@mAAdhaTTah mAAdhaTTah left a comment

If we're all on board with that as a principle, then we can merge this PR. I do want to hear from @Golmote if he agrees with that, but I think we've got some consensus on this.

@RunDevelopment
Member Author

@mAAdhaTTah Your principle sounds good.

I'll make an issue, so we can collect these guidelines and put them on the website once we have enough.

@mAAdhaTTah
Member

If we have docs on making a new language, that's where they should live. If not, then... well, we need that lol.

@Golmote
Contributor

Golmote commented Sep 30, 2019

We only have https://prismjs.com/extending.html AFAIK.

@mAAdhaTTah's idea for the guidelines sounds good to me. This way we have the best of all worlds: this issue can be fixed, we have JS/JSON consistency and a rule to follow that "makes sense".

I still think a plugin that extends the possibilities for the users would be a nice addition.

@Golmote
Contributor

Golmote commented Sep 30, 2019

Speaking of guidelines, @RunDevelopment what rule are you following to order the operator regexp parts, if any?

@RunDevelopment
Member Author

what rule are you following to order the operator regexp parts, if any?

To order? You mean why a|b instead of b|a?
Well, above all: correctness. E.g. -|-- doesn't match --, so it has to be --|-. After correctness is ensured, I generally order them to group together as many as possible to make the regex as small as possible. E.g. [!=]=|- instead of !=|-|==.
After that is done, nothing really. I just choose whatever order as long as it's correct and fairly small.
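
A small sketch of both points, using made-up variable names: alternatives are tried left to right, so the longer operator has to come first, and grouping by a shared suffix shortens the pattern without changing what it matches.

```javascript
// With "-" first, "--" is never reached: the input is split into two "-" matches.
const shortFirst = /-|--/g;
// With "--" first, the decrement operator is matched as one token.
const longFirst = /--|-/g;

console.log("i--".match(shortFirst));
console.log("i--".match(longFirst));

// Same matches, smaller pattern: "!=" and "==" share the trailing "=".
const flat = /!=|-|==/g;
const grouped = /[!=]=|-/g;
const sample = "a != b - c == d";
console.log(JSON.stringify(sample.match(flat)) === JSON.stringify(sample.match(grouped)));
```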

@RunDevelopment
Member Author

@Golmote

I still think a plugin that extends the possibilities for the users would be a nice addition.

#2075.

@RunDevelopment
Member Author

I'll merge this now and will make another PR for ?? and ?..

We can continue the guideline discussion in #2083.

@RunDevelopment RunDevelopment merged commit 0e5c48d into PrismJS:master Sep 30, 2019
@RunDevelopment RunDevelopment deleted the js-operator branch September 30, 2019 19:09
@Golmote
Contributor

Golmote commented Sep 30, 2019

To order? You mean why a|b instead of b|a?
Well, above all: correctness. E.g. -|-- doesn't match --, so it has to be --|-. After correctness is ensured, I generally order them to group together as many as possible to make the regex as small as possible. E.g. [!=]=|- instead of !=|-|==.
After that is done, nothing really. I just choose whatever order as long as it's correct and fairly small.

Fair enough. At one point in time, I tried to group them by their starting character(s), because that would theoretically be more optimized (least amount of backtracking needed when testing each alternative).

#2075.

You're far too quick and I'm far too blind! 😂

@RunDevelopment
Member Author

that would theoretically be more optimized (least amount of backtracking needed when testing each alternative).

I would really like to know how they optimize this under the hood. Most regexes are fairly simple and could easily be transformed into a DFA.
Time to look into the V8 source I guess.

You're far too quick

Gotta go fast! blue hedgehog noises

@Golmote
Contributor

Golmote commented Sep 30, 2019

I would really like to know how they optimize this under the hood. Most regexes are fairly simple and could easily be transformed into a DFA.
Time to look into the V8 source I guess.

From "Mastering Regular Expressions (2nd edition)", by Jeffrey E. F. Friedl:

[image: book excerpt, not reproduced here]
[...]
[image: book excerpt, not reproduced here]

I don't know how true those statements are, but this is definitely interesting.

EDIT: Apparently, V8 uses https://github.com/ashinn/irregex.

DFA matching is used
when possible, otherwise a closure-compiled NFA approach is used.

@MattMcFarland

I now know where to find regex ninjas ;)

@mAAdhaTTah
Member

@MattMcFarland Yeah dude, these two are so good at this. I'm mostly just here to make sure things move along.

If you wanna learn regex, look over any of the PRs with extensive comments. Lotta good insights.

@RunDevelopment
Member Author

@Golmote
I did some perf tests.

sample        | func          | comp          | avg        | dev        | min        | max        | samples
------------- | ------------- | ------------- | ---------- | ---------- | ---------- | ---------- | ----------
prism-core.js | old operator  | 100% +-21%    | 0.07933    | 0.01674    | 0.04493    | 0.5169     | 1318
              | new operator  | 102% +-13%    | 0.08099    | 0.009933   | 0.04720    | 0.2069     | 1335
              | flat operator | 101% +-15%    | 0.08043    | 0.01168    | 0.04720    | 0.2379     | 1329
prism.js      | old operator  | 100% +-9%     | 3.470      | 0.3251     | 3.247      | 5.950      | 396
              | new operator  | 101% +-9%     | 3.503      | 0.3267     | 3.248      | 5.839      | 394
              | flat operator | 101% +-9%     | 3.495      | 0.3012     | 3.274      | 5.523      | 394

avg (average), dev (standard deviation), min, and max are all in ms. comp is the relative average and standard deviation compared to the lowest average.
old operator is the operator pattern before this PR, new operator is the operator pattern after this PR sans :, and flat operator is just all operators as a keyword list of sorts.
prism-core.js is what it sounds like, taken from master, and prism.js is the current Prism version from the download page with all languages included.
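
As a worked example of how the columns relate, the sketch below recomputes comp from the avg and dev values of the first three table rows. The helper names `stats` and `comp` are hypothetical and only mirror the description above; `stats` uses the sample standard deviation (dividing by n - 1).

```javascript
// Average and sample standard deviation of a list of timings (in ms).
function stats(samples) {
	const avg = samples.reduce((s, x) => s + x, 0) / samples.length;
	const variance = samples.reduce((s, x) => s + (x - avg) ** 2, 0) / (samples.length - 1);
	return { avg, dev: Math.sqrt(variance) };
}

// Relative average and deviation, both scaled by the lowest average.
function comp(results) {
	const min = Math.min(...results.map(r => r.avg));
	return results.map(r =>
		`${Math.round(100 * r.avg / min)}% +-${Math.round(100 * r.dev / min)}%`
	);
}

// avg/dev of the prism-core.js rows (old, new, flat operator).
const rows = [
	{ avg: 0.07933, dev: 0.01674 },
	{ avg: 0.08099, dev: 0.009933 },
	{ avg: 0.08043, dev: 0.01168 },
];
console.log(comp(rows));
console.log(stats([1, 2, 3, 4]));
```

With deviations of 13% to 21% on differences of 1% to 2%, the three patterns are not distinguishable from these numbers alone.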

Testing code
const fs = require("fs");
const { performance } = require("perf_hooks");
// force the lazy init to happen
performance.now();


const samples = fs.readdirSync('./samples').filter(f => !/^_/.test(f)).map(f => {
	return {
		title: f,
		value: fs.readFileSync('./samples/' + f, 'utf8')
	}
});

benchmark(samples, [
	toTestCase(
		"old operator",
		/-[-=]?|\+[+=]?|!=?=?|<<?=?|>>?>?=?|=(?:==?|>)?|&[&=]?|\|[|=]?|\*\*?=?|\/=?|~|\^=?|%=?|\?|\.{3}/g
	),
	toTestCase(
		"new operator",
		/--|\+\+|\*\*=?|=>|&&|\|\||[!=]==|<<=?|>>>?=?|[-+*/%&|^!=<>]=?|\.{3}|[~?]/g
	),
	toTestCase(
		"flat operator",
		/--|-=|-|\+\+|\+=|\+|!==|!=|!|<<=|<<|<=|<|>>>=|>>>|>>=|>>|>=|>|===|==|=>|=|&&|&=|&|\|\||\|=|\||\*\*=|\*\*|\*=|\*|\/=|\/|~|\^=|\^|%=|%|\?|\.\.\./g
	),
]);

/**
 *
 * @param {string} title
 * @param {RegExp} regex
 */
function toTestCase(title, regex) {
	return {
		title,
		value: text => allMatches(regex, text),
	};
}


/**
 *
 * @param {{ title: string; value: T }[]} testSamples
 * @param {{ title: string; value: (value: T) => void}[]} testFunctions
 * @template T
 */
function benchmark(testSamples, testFunctions) {
	const warmupIterations = 1_000;
	const warmupMaxTime = 200;

	const warmupSample = testSamples[0].value;
	testFunctions.forEach(({ value }) => {
		const start = performance.now();
		for (let i = 0; i < warmupIterations; i++) {
			if (performance.now() - start > warmupMaxTime) {
				break;
			}
			value(warmupSample);
		}
	});

	const maxSamples = 10_000; // samples
	const maxTimePerFunc = 2_000; // ms


	const columnWidth = [
		Math.max(...testSamples.map(s => s.title.length)),
		Math.max(...testFunctions.map(s => s.title.length)),
	];

	function printLine(...cells) {
		let s = "";
		for (let i = 0; i < cells.length; i++) {
			let cell = String(cells[i]);
			if (i < columnWidth.length) {
				cell = cell.padEnd(columnWidth[i], " ");
			}
			if (s) {
				s += " | "
			}
			s += cell;
		}
		console.log(s);
	}

	let headerPrinted = false;

	/**
	 * @param {{ title: string; value: T }} sample
	 * @param {{ title: string; value: (value: T) => void}} func
	 * @param {Object<string, any>} result
	 */
	function printResult(sample, func, result) {
		if (!headerPrinted) {
			columnWidth.push(...Object.keys(result).map(k =>
				Math.max(10, k.length, Math.ceil(String(result[k]).length * 1.25))
			));
			printLine("sample", "func", ...Object.keys(result));
			printLine(...columnWidth.map(i => "-".repeat(i)));
			headerPrinted = true;
		}
		const cells = [
			func === testFunctions[0] ? sample.title : "",
			func.title
		];

		for (const key in result) {
			cells.push(result[key]);
		}
		printLine(...cells);
	}

	/**
	 * @param {Object<string, any>[]} results
	 */
	function printResults(sample, results) {
		const min = Math.min(...results.map(r => r.comp[0]));
		results.forEach(r => {
			const [avg, dev] = r.comp;
			r.comp = `${Math.round(100 * avg / min)}% +-${Math.round(100 * dev / min)}%`;
		});
		results.forEach((r, i) => {
			printResult(sample, testFunctions[i], r);
		})
	}

	for (const sample of testSamples) {
		const results = [];
		for (const func of testFunctions) {
			/** @type {number[]} */
			let samples = [];
			const samplingStart = performance.now();

			for (let i = 0; i < maxSamples; i++) {
				if (performance.now() - samplingStart > maxTimePerFunc) {
					break;
				}
				global.gc && global.gc();

				let start = performance.now();
				func.value(sample.value);
				samples.push(performance.now() - start);
			}

			const min = Math.min(...samples);
			const max = Math.max(...samples);
			const avg = samples.reduce((x, y) => x + y) / samples.length;
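			// Sample standard deviation; the initial value 0 keeps samples[0] from being used as the starting sum.
			const dev = Math.sqrt(samples.reduce((s, x) => s + (x - avg) ** 2, 0) / (samples.length - 1));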


			const prec = (n = 0) => n.toPrecision(4);

			results.push({
				comp: [avg, dev],
				avg: prec(avg),
				dev: prec(dev),
				min: prec(min),
				max: prec(max),
				samples: samples.length,
			});
		}
		printResults(sample, results);
	}
}

function allMatches(re, text) {
	if (!re.global) {
		throw new Error(`RegExp has to be global! ${re}`);
	}
	re.lastIndex = 0;
	while (re.exec(text)) { }
}

Run using Node.js v12.11.0 with the following command: node --expose-gc bench.js.
I take control over the GC (as much as I can) to avoid minor GCs during testing.

bench.zip

Backtracking optimization or not doesn't seem to matter.
While your pattern generally seems to be just a tad faster than the other ones (like 1% faster), it's hard to tell because of the noise.

To reduce this noise, I experimented a little with v8's GC and got it down by ~10% but there's still too much to conclusively say that your pattern is faster.

@Golmote
Contributor

Golmote commented Oct 1, 2019

@RunDevelopment Nice! 😮

The results do not confirm my theory that much. I guess the engine optimizes those patterns indeed!

The conclusion seems to be the same in Firefox too.

Linked issue: PrismJS fails to tokenize operators, calls them "punctuation" (reproducable on prismjs.com)