Fix v-flag bugs #85
tests/fixtures/unicode-set.js
@@ -105,6 +105,8 @@ const unicodeSetFixtures = [
},
{
pattern: '[^[a-z][f-h]]',
matches: ["A", "\u{12345}"],
nonMatches: ["a", "z"],
expected: '(?:(?![a-z])[\\s\\S])',
The current transpiled result does not match "\u{12345}".
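The mismatch can be reproduced without the fixture runner. The snippet below is only a sketch: it assumes the runner tests the transpiled pattern against the whole string and without the u flag, which is not confirmed by this thread. Under that assumption, [\s\S] consumes a single UTF-16 code unit, so a two-unit astral character cannot match.

```javascript
// Assumption: fixtures are checked against the whole string, without the u flag.
const transpiled = '(?:(?![a-z])[\\s\\S])';
const anchored = new RegExp('^' + transpiled + '$');

const results = {
  A: anchored.test('A'),              // BMP character outside a-z
  a: anchored.test('a'),              // excluded by the negative lookahead
  astral: anchored.test('\u{12345}'), // surrogate pair, i.e. two code units
};

console.log(results); // { A: true, a: false, astral: false }
```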
);
const negativeSet = UNICODE_SET.clone().remove(singleChars);
const bmpOnly = regenerateContainsAstral(negativeSet);
update(characterClassItem, negativeSet.toString({ bmpOnly: bmpOnly }));
If the regenerate set spans from code points below the surrogate range into the astral planes, toString({ bmpOnly: false }) returns a much more verbose result, while the output of toString({ bmpOnly: true }) is already correct. I think this should be fixed in regenerate later.
const regenerate = require('regenerate');
const set = regenerate().addRange(0xd000, 0x10000);
console.log(set.toString());
// [\uD000-\uD7FF\uE000-\uFFFF]|\uD800\uDC00|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]
console.log(set.toString({ bmpOnly: true }));
// [\uD000-\uFFFF]|\uD800\uDC00
The latter appears to be correct, as it matches lone surrogates as well as U+10000. The former looks as if [\uD000-\uFFFF]|\uD800\uDC00 had been passed through the BMP pass a second time.
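As a quick sanity check, the two outputs quoted above can be compared directly as plain regular expressions, without requiring regenerate. This is a sketch under two assumptions: both patterns are anchored to the whole string, and only the handful of boundary samples listed here is tested. Unanchored, the verbose form behaves differently by design, since it refuses to match one half of a surrogate pair.

```javascript
// Both patterns are copied from the regenerate output above and anchored
// with ^...$ (the anchoring is an assumption made for this comparison).
const verboseForm = /^(?:[\uD000-\uD7FF\uE000-\uFFFF]|\uD800\uDC00|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])$/;
const shortForm = /^(?:[\uD000-\uFFFF]|\uD800\uDC00)$/;

// Boundary samples: just below the range, lone surrogates, U+10000, U+10001.
const samples = ['\uCFFF', '\uD000', '\uD800', '\uDFFF', '\uFFFF', '\u{10000}', '\u{10001}'];

// Collect every sample on which the two forms disagree.
const disagreements = samples.filter(
  (s) => verboseForm.test(s) !== shortForm.test(s)
);

console.log(disagreements.length); // 0
```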
expected: '(?:[\\0-JL-\\uD7FF\\uE000-\\uFFFF]|[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]|[\\uD800-\\uDBFF](?![\\uDC00-\\uDFFF])|(?:[^\\uD800-\\uDBFF]|^)[\\uDC00-\\uDFFF])',
matches: ["k", "\u212a", "\u{12345}", "\uDAAA", "\uDDDD"],
nonMatches: ["K"],
expected: '(?:[\\0-JL-\\uFFFF]|[\\uD800-\\uDBFF][\\uDC00-\\uDFFF])',
They are now much shorter and easier to reason about. I also added matches tests so that we can be confident the transpiled result is correct.
23feaf3 to 8501e5d
Awesome!
In this PR we reuse the unicode fixtures for the v-flag tests, based on the observation that /.../u and /.../v should yield the same result unless the set or string-property features are involved.

We also introduce the matches and nonMatches properties to the v-flag fixture runner. They list the strings that the transpiled regex is supposed to match and reject, respectively. This is useful when the transpiled regex is too verbose to understand by reading it.

This PR includes commits from #84; I will rebase once that PR is merged.

This is a draft PR, as I still haven't figured out how to avoid double-bmpifying regex strings: in negative set notation we extract single code points from UNICODE_SET, which yields surrogates in the output, but the output is then BMP-ified again in regenerate, yielding results that are longer than necessary, though they seem correct.