-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JS: Tokenize :
as an operator
#2073
Conversation
To make this broader: I'm not in love with changing all |
I am for changing const value = { hello: 'world', foo: (bar ? 1 : 0) }; |
It's true that the JSON component already has it tagged as an operator for some reason. I'm only worried that in the end, someone will open another issue asking for the exact opposite of this PR. |
I agree that I'm not sure how PrismJS is tokenizing things, but, I believe that inorder to fix this, you would have to not broadly assume |
Well, IMO the underlying issue here is that there is no clear-cut definition of what differentiates an operator from mere punctuation. So my question: What defines an operator vs punctuation? |
I am afraid of this happening too. The reason why I've brought this up in an issue is because our UX designer is looking at our code snippets on our site and not seeing parity github (we are using github's theme) - the first thing they pointed out to us was that our colons were not matching. Since they are going over this with a fine-toothed comb, I'm sure if it does happen, it will happen much sooner rather than later lol. |
@RunDevelopment The JS spec does define it, I believe. The colon is part of the Conditional Operator (12.14) which is an operator as stated, but also part of the PropertyName of an ObjectLiteral (12.2.6) in which case nothing says it's an operator. @MattMcFarland The thing is Prism is regex-based, it has no notion of context. Absolute correctness is not a goal for this project, as Lea once said. If we can handle some cases of colons used as operator, without adding too much complexity, that's good. But we most likely won't be able to handle the generic case. If your UX designer expects your code snippets to be highlighted with absolute correctness, then I'm afraid Prism is not the right tool for the job. |
Having not looking at the spec myself, I believe you. I think you have a very convincing argument for broadly assuming const foo = bar ? 0 : 1
const baz = {
stuff: things,
things: 'stuff',
doStuff: () => MyThing.doStuff()
} |
@Golmote Please.. Say it ain't so! I'd much rather stick with prism if I can. :) I do think that if there are no negative consequences for switching |
I also agree that we shouldn't change our tokenization to fit a specific theme. @Golmote |
@Golmote @mAAdhaTTah Because when I create a language, I just choose whatever I think looks better in Prism's themes. Even though this kinda goes against 'we shouldn't change our tokenization to fit a specific theme'. |
I agree you shouldnt change tokenization to fit a specific theme as well. When I bring up github's theme and am looking for parity, it's only to show my use case. As for the issue, I made it a point to leave theme's out of it and just stick with the facts, where an operator was being misread as punctuation. That's what I was hoping a fix for :) As for parity with Github, that's something my UX designer is using as a benchmark. If they can give us some leeway that would be great, if not, it would just create more work on our board that we would probably have to defer. |
Consistency is a good enough argument to me, honestly. So I won't fight against this PR. @RunDevelopment No we don't have guidelines for this. I personnally tried to stick to the specs, whenever possible, but I might have made aesthetic choices too, from time to time. Again, given Prism is no linter, that's perfectly fine. |
I'm not sure that spec reference is that helpful, cuz it shows some tokens that are definitely operators, like spread @RunDevelopment So no, I don't think we have guidelines for "punctuation".
It looks this way because the GitHub theme highlights both with the same color. The |
If you need to satisfy your designer, you could easily just change the color of punctuation & operator to be the same in your theme. No conflict 😄 |
@mAAdhaTTah I tried that but it didn't work. It changes commas and dots to red (github treats them differently than operator) |
@MattMcFarland Maybe we could add a bit functionality similar to Highlight keywords for general tokens to satisfy your UX designer? I'm thinking about the ability to modify the classes of tokens on a per-language level.
Me too but most specs just call everything an operator that's not very applicable. Also, if present, what is punctuation vs operators varies from spec to spec. |
Ah so maybe GH's highlighter is treating |
Seems like it. It's highlighted like |
Inconsistency between JS & JSON is a bit weird. We should probably bring them into alignment. The issue for me is it seems weird to apply " |
Funnily enough, the Rouge syntax highlighter switched from having |
lol there's your consistency: make |
We can also do that.
|
sorry sorry i was kidding 😄 |
Seems like the discussion isn't over. 😄 But on the point of Point is: If we really want to make it consistent, we have a looot of languages to go through where we can have the same discussion. Maybe we should create guidelines so we can make all languages roughly consistent with each other and not just JS and JSON. |
In C/C++, the colon isn't only used in ternary. (see https://stackoverflow.com/questions/1711990/what-is-this-weird-colon-member-syntax-in-the-constructor for example) Guidelines would be nice, but what would they be based of? What kind of generic property can be used to definitively make a choice here (and one that hopefully doesn't require a real deep knowledge of each language)? |
Hah yeah I didn't mean to imply I was making a definitive answer here. Part of me was thinking guidelines would be helpful here, but I actually don't know we could come up with guidelines that apply broadly enough to be useful.
No. The goal of Prism has never been correctness, so I don't really mind doing it for that reason either. A goal of consistency leads me to suggest punctuation for Y'all are the regex experts, but I'm assuming it's impossible to highlight the ternary's colon as an operator separately from an object literal's colon as punctuation, correct? TIL YAML is a superset of JSON. |
We can certainly handle simple cases of ternary, like literals. But we most likely won't be able to handle every complex expression, especially expressions containing other colons. Furthermore ternaries can be infinitely nested, which we can't handle either. |
Suff like a ? {b: function() { label: return 1; }, c: /d:/ ? e : "f:g"} : h; Ugh. |
The problem, like @Golmote said, is that there can be any valid JS expression between Point is: While possible, please don't.
@mAAdhaTTah Why is punctuation consistent for This is why I said that making Regarding guidelines: The only thing we might be able to do is to go for aesthetics and utility (aka not too ugly and not all the same token type). |
I think I've got a principle we can apply that gets us
If we accept this principle, we have a guideline for deciding what "punctuation" is in the future while giving us the outcome we want on JS & JSON (this should stay as an operator to be consistent). |
I like that principle @mAAdhaTTah EDIT: PS: Just realized your tag is "mad hatter" - neat! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're all on board with that as a principle, then we can merge this PR. I do want to hear from @Golmote if he agrees with that, but I think we've got some consensus on this.
@mAAdhaTTah Your principle sounds good. I'll make an issue, so we can collect these guidelines and put them on the website once we have enough. |
If we have docs on making a new language, that's where they should live. If not, then... well, we need that lol. |
We only have https://prismjs.com/extending.html AFAIK. @mAAdhaTTah's idea for the guidelines sounds good to me. This way we have the best of all worlds: this issue can be fixed, we have JS/JSON consistency and a rule to follow that "makes sense". I still think a plugin that extends the possibilities for the users would be a nice addition. |
Speaking of guidelines, @RunDevelopment what rule are you following to order the operator regexp parts, if any? |
To order? You mean why |
I'll merge this now and will make another PR for We can continue the guideline discussion in #2083. |
Fair enough. At one point in time, I tried to group them by their starting character(s), because that would theoretically be more optimized (least amount of backtracking needed when testing each alternative). You're far too quick and I'm far too blind! 😂 |
I would really like to know how they optimize this under the hood. Most regexes are fairly simple and could easily be transformed into a DFA.
Gotta go fast! blue hedgehog noises |
From "Mastering Regular Expressions (2nd edition)", by Jeffrey E. F. Friedl: I don't know how true are those statements but this is definitely interesting. EDIT: Apparently, V8 uses https://github.com/ashinn/irregex.
|
I now know where to find regex ninjas ;) |
@MattMcFarland Yeah dude, these two are so good at this. I'm mostly just here to make sure things move along. If you wanna learn regex, look over any of the PRs with extensive comments. Lotta good insights. |
@Golmote sample | func | comp | avg | dev | min | max | samples
------------- | ------------- | ------------- | ---------- | ---------- | ---------- | ---------- | ----------
prism-core.js | old operator | 100% +-21% | 0.07933 | 0.01674 | 0.04493 | 0.5169 | 1318
| new operator | 102% +-13% | 0.08099 | 0.009933 | 0.04720 | 0.2069 | 1335
| flat operator | 101% +-15% | 0.08043 | 0.01168 | 0.04720 | 0.2379 | 1329
prism.js | old operator | 100% +-9% | 3.470 | 0.3251 | 3.247 | 5.950 | 396
| new operator | 101% +-9% | 3.503 | 0.3267 | 3.248 | 5.839 | 394
| flat operator | 101% +-9% | 3.495 | 0.3012 | 3.274 | 5.523 | 394
Testing codeconst fs = require("fs");
const { performance } = require("perf_hooks");
// force the lazy init to happen
performance.now();
const samples = fs.readdirSync('./samples').filter(f => !/^_/.test(f)).map(f => {
return {
title: f,
value: fs.readFileSync('./samples/' + f, 'utf8')
}
});
benchmark(samples, [
toTestCase(
"old operator",
/-[-=]?|\+[+=]?|!=?=?|<<?=?|>>?>?=?|=(?:==?|>)?|&[&=]?|\|[|=]?|\*\*?=?|\/=?|~|\^=?|%=?|\?|\.{3}/g
),
toTestCase(
"new operator",
/--|\+\+|\*\*=?|=>|&&|\|\||[!=]==|<<=?|>>>?=?|[-+*/%&|^!=<>]=?|\.{3}|[~?]/g
),
toTestCase(
"flat operator",
/--|-=|-|\+\+|\+=|\+|!==|!=|!|<<=|<<|<=|<|>>>=|>>>|>>=|>>|>=|>|===|==|=>|=|&&|&=|&|\|\||\|=|\||\*\*=|\*\*|\*=|\*|\/=|\/|~|\^=|\^|%=|%|\?|\.\.\./g
),
]);
/**
*
* @param {string} title
* @param {RegExp} regex
*/
function toTestCase(title, regex) {
return {
title,
value: text => allMatches(regex, text),
};
}
/**
*
* @param {{ title: string; value: T }[]} testSamples
* @param {{ title: string; value: (value: T) => void}[]} testFunctions
* @template T
*/
function benchmark(testSamples, testFunctions) {
const warmupIterations = 1_000;
const warmupMaxTime = 200;
const warmupSample = testSamples[0].value;
testFunctions.forEach(({ value }) => {
const start = performance.now();
for (let i = 0; i < warmupIterations; i++) {
if (performance.now() - start > warmupMaxTime) {
break;
}
value(warmupSample);
}
});
const maxSamples = 10_000; // samples
const maxTimePerFunc = 2_000; // ms
const columnWidth = [
Math.max(...testSamples.map(s => s.title.length)),
Math.max(...testFunctions.map(s => s.title.length)),
];
function printLine(...cells) {
let s = "";
for (let i = 0; i < cells.length; i++) {
let cell = String(cells[i]);
if (i < columnWidth.length) {
cell = cell.padEnd(columnWidth[i], " ");
}
if (s) {
s += " | "
}
s += cell;
}
console.log(s);
}
let headerPrinted = false;
/**
* @param {{ title: string; value: T }} sample
* @param {{ title: string; value: (value: T) => void}} func
* @param {Object<string, any>} result
*/
function printResult(sample, func, result) {
if (!headerPrinted) {
columnWidth.push(...Object.keys(result).map(k =>
Math.max(10, k.length, Math.ceil(String(result[k]).length * 1.25))
));
printLine("sample", "func", ...Object.keys(result));
printLine(...columnWidth.map(i => "-".repeat(i)));
headerPrinted = true;
}
const cells = [
func === testFunctions[0] ? sample.title : "",
func.title
];
for (const key in result) {
cells.push(result[key]);
}
printLine(...cells);
}
/**
* @param {Object<string, any>[]} results
*/
function printResults(sample, results) {
const min = Math.min(...results.map(r => r.comp[0]));
results.forEach(r => {
const [avg, dev] = r.comp;
r.comp = `${Math.round(100 * avg / min)}% +-${Math.round(100 * dev / min)}%`;
});
results.forEach((r, i) => {
printResult(sample, testFunctions[i], r);
})
}
for (const sample of testSamples) {
const results = [];
for (const func of testFunctions) {
/** @type {number[]} */
let samples = [];
const samplingStart = performance.now();
for (let i = 0; i < maxSamples; i++) {
if (performance.now() - samplingStart > maxTimePerFunc) {
break;
}
global.gc && global.gc();
let start = performance.now();
func.value(sample.value);
samples.push(performance.now() - start);
}
const min = Math.min(...samples);
const max = Math.max(...samples);
const avg = samples.reduce((x, y) => x + y) / samples.length;
const dev = Math.sqrt(samples.reduce((s, x) => s + (x - avg) ** 2) / (samples.length - 1))
const prec = (n = 0) => n.toPrecision(4);
results.push({
comp: [avg, dev],
avg: prec(avg),
dev: prec(dev),
min: prec(min),
max: prec(max),
samples: samples.length,
});
}
printResults(sample, results);
}
}
function allMatches(re, text) {
if (!re.global) {
throw new Error(`RegExp has to be global! ${re}`);
}
re.lastIndex = 0;
while (re.exec(text)) { }
} Run using Node.js v12.11.0 with the following command: Backtracking optimization or not doesn't seem to matter. To reduce this noise, I experimented a little with v8's GC and got it down by ~10% but there's still too much to conclusively say that your pattern is faster. |
@RunDevelopment Nice! 😮 The results do not confirm my theory that much. I guess the engine optimizes those patterns indeed! The conclusion seems to be the same in Firefox too. |
This resolves #2072.
This will make JS and JSON consistent in how they tokenize
:
.