Improve parser support for regexp group names #184

Merged 2 commits on Aug 26, 2019
78 changes: 77 additions & 1 deletion README.md
@@ -1336,6 +1336,34 @@ We have the following node (the `name` property with value `foo` is added):
type: 'Group',
capturing: true,
name: 'foo',
nameRaw: 'foo',
number: 1,
expression: {
type: 'Char',
value: 'x',
symbol: 'x',
kind: 'simple',
codePoint: 120
}
}
```

Note: The `nameRaw` property represents the name *as parsed from the original source*, including escape sequences. The `name` property represents the canonical decoded form of the name.

For example, given the `/u` flag and the following group:

```regexp
(?<\u{03C0}>x)
```

We would have the following node:

```js
{
type: 'Group',
capturing: true,
name: 'π',
nameRaw: '\\u{03C0}',
number: 1,
expression: {
type: 'Char',
@@ -1465,6 +1493,7 @@ A node:
type: 'Group',
capturing: true,
name: 'foo',
nameRaw: 'foo',
number: 1,
expression: {
type: 'Char',
@@ -1478,7 +1507,8 @@ A node:
type: 'Backreference',
kind: 'name',
number: 1,
reference: 'foo'
reference: 'foo',
referenceRaw: 'foo'
},
{
type: 'Backreference',
@@ -1490,6 +1520,52 @@ A node:
}
```

Note: The `referenceRaw` property represents the reference *as parsed from the original source*, including escape sequences. The `reference` property represents the canonical decoded form of the reference.

For example, given the `/u` flag and the following pattern (matches `www`):

```regexp
(?<π>w)\k<\u{03C0}>\1
```

We would have the following node:

```js
{
type: 'Alternative',
expressions: [
{
type: 'Group',
capturing: true,
name: 'π',
nameRaw: 'π',
number: 1,
expression: {
type: 'Char',
value: 'w',
symbol: 'w',
kind: 'simple',
codePoint: 119
}
},
{
type: 'Backreference',
kind: 'name',
number: 1,
reference: 'π',
referenceRaw: '\\u{03C0}'
},
{
type: 'Backreference',
kind: 'number',
number: 1,
reference: 1
}
]
}
```
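
The relationship between the raw and decoded forms can be sketched in a few lines. The `decodeGroupName` helper below is hypothetical and is not part of the regexp-tree API; it only assumes that a raw name may contain `\uXXXX` and `\u{...}` escapes, and it shows how such a name maps to its canonical form.

```js
// Hypothetical helper (an assumption for illustration, not regexp-tree code):
// decodes \uXXXX and \u{...} escapes found in a raw group name.
function decodeGroupName(raw) {
  return raw.replace(
    /\\u\{([0-9a-fA-F]+)\}|\\u([0-9a-fA-F]{4})/g,
    (_, braced, fixed) => String.fromCodePoint(parseInt(braced || fixed, 16))
  );
}

const nameRaw = '\\u{03C0}'; // the six-character escape as written in the source
const name = decodeGroupName(nameRaw);

console.log(nameRaw); // \u{03C0}
console.log(name); // π
```

A name without escapes, such as `foo`, passes through unchanged, which matches the earlier node where `name` and `nameRaw` are both `'foo'`.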


#### Quantifiers

Quantifiers specify _repetition_ of a regular expression (or part of it). Below are the quantifiers which _wrap_ a parsed expression into a `Repetition` node. The quantifier itself can be of different _kinds_, and has the `Quantifier` node type.
2 changes: 2 additions & 0 deletions index.d.ts
@@ -56,6 +56,7 @@ declare module 'regexp-tree/ast' {
capturing: true;
number: number;
name?: string;
nameRaw?: string;
expression: Expression | null;
}

@@ -76,6 +77,7 @@ declare module 'regexp-tree/ast' {
kind: 'name';
number: number;
reference: string;
referenceRaw: string;
}

export type Backreference = NumericBackreference | NamedBackreference;
158 changes: 158 additions & 0 deletions scripts/generate-unicode-id-parts.js
@@ -0,0 +1,158 @@
// based on https://github.com/microsoft/TypeScript/tree/master/scripts/regenerate-unicode-identifier-parts.js

/** @param {number} i */
function toHex4Digits(i) {
let s = i.toString(16);
while (s.length < 4) {
s = '0' + s;
}
if (s.length > 4) throw new Error('Invalid Hex4Digits value');
return s;
}

class NonSurrogateRange {
/** @param {number} codePoint */
constructor(codePoint) {
this.firstCodePoint = codePoint;
this.lastCodePoint = codePoint;
}
toString() {
let text = '\\u' + toHex4Digits(this.firstCodePoint);
if (this.lastCodePoint !== this.firstCodePoint) {
text += '-\\u' + toHex4Digits(this.lastCodePoint);
}
return text;
}
}

class LeadSurrogateRange {
/** @param {number} leadSurrogate */
constructor(leadSurrogate) {
this.leadSurrogate = leadSurrogate;
/** @type {TrailSurrogateRange[]} */
this.ranges = [];
}

toString() {
return (
'\\u' +
toHex4Digits(this.leadSurrogate) +
'[' +
this.ranges.join('') +
']'
);
}
}

class TrailSurrogateRange {
/** @param {number} trailSurrogate */
constructor(trailSurrogate) {
this.firstTrailSurrogate = trailSurrogate;
this.lastTrailSurrogate = trailSurrogate;
}
toString() {
let text = '\\u' + toHex4Digits(this.firstTrailSurrogate);
if (this.lastTrailSurrogate !== this.firstTrailSurrogate) {
text += '-\\u' + toHex4Digits(this.lastTrailSurrogate);
}
return text;
}
}

class Writer {
constructor() {
/** @type {number} */
this.lastCodePoint = -1;
/** @type {NonSurrogateRange[]} */
this.nonSurrogateRanges = [];
/** @type {LeadSurrogateRange[]} */
this.surrogateRanges = [];
/** @type {NonSurrogateRange} */
this.nonSurrogateRange;
/** @type {LeadSurrogateRange} */
this.leadSurrogateRange;
/** @type {TrailSurrogateRange} */
this.trailSurrogateRange;
}

/** @param {number} codePoint */
push(codePoint) {
if (codePoint <= this.lastCodePoint)
throw new Error('Code points must be added in order.');
this.lastCodePoint = codePoint;

if (codePoint <= MAX_UNICODE_NON_SURROGATE) {
if (
this.nonSurrogateRange &&
this.nonSurrogateRange.lastCodePoint === codePoint - 1
) {
this.nonSurrogateRange.lastCodePoint = codePoint;
return;
}
this.nonSurrogateRange = new NonSurrogateRange(codePoint);
this.nonSurrogateRanges.push(this.nonSurrogateRange);
} else {
const leadSurrogate = Math.floor((codePoint - 0x10000) / 0x400) + 0xd800;
const trailSurrogate = ((codePoint - 0x10000) % 0x400) + 0xdc00;
if (
!this.leadSurrogateRange ||
this.leadSurrogateRange.leadSurrogate !== leadSurrogate
) {
this.trailSurrogateRange = undefined;
this.leadSurrogateRange = new LeadSurrogateRange(leadSurrogate);
this.surrogateRanges.push(this.leadSurrogateRange);
}

if (
this.trailSurrogateRange &&
this.trailSurrogateRange.lastTrailSurrogate === trailSurrogate - 1
) {
this.trailSurrogateRange.lastTrailSurrogate = trailSurrogate;
return;
}

this.trailSurrogateRange = new TrailSurrogateRange(trailSurrogate);
this.leadSurrogateRange.ranges.push(this.trailSurrogateRange);
}
}

toString() {
let first = this.nonSurrogateRanges.join('');
let second = this.surrogateRanges.join('|');
return first && second
? `([${first}]|${second})`
: first
? `[${first}]`
: second
? `(${second})`
: '';
}
}

const MAX_UNICODE_NON_SURROGATE = 0xffff;
const MAX_UNICODE_CODEPOINT = 0x10ffff;
const isStart = c => /\p{ID_Start}/u.test(c);
const isContinue = c => /\p{ID_Continue}/u.test(c);

let idStartWriter = new Writer();
let idContinueWriter = new Writer();

for (let cp = 0; cp <= MAX_UNICODE_CODEPOINT; cp++) {
const ch = String.fromCodePoint(cp);
if (isStart(ch)) {
idStartWriter.push(cp);
}
if (isContinue(ch)) {
idContinueWriter.push(cp);
}
}

console.log(`/**
* Generated by scripts/generate-unicode-id-parts.js on node ${
process.version
} with unicode ${process.versions.unicode}
Owner:

So this depends on the Node version, and might be flaky/inconsistent? I'm wondering if it's possible to just use a predictable subset, and update it periodically (even if manually).

Also, where/when will we run this script? Is it a part of the build script?

Contributor Author:

It was run manually. It's unlikely this will change often; usually only if you choose to target a new version of Node.js that itself includes a new version of the Unicode standard.

Owner:

OK, we can add it to the build script as one of the steps at some point.

Contributor:

I'll copy a comment I made on the TypeScript implementation here: microsoft/TypeScript#32096 (comment)

The build script could be simplified and made to not rely on the current Node.js/V8 + Unicode version:

const ID_Start = require('unicode-12.1.0/Binary_Property/ID_Start/code-points.js');
const ID_Continue = require('unicode-12.1.0/Binary_Property/ID_Continue/code-points.js');

// ...then add the other needed characters, and write the two arrays to disk.

Links to how other tooling implements this:

* based on http://www.unicode.org/reports/tr31/ and https://tc39.es/ecma262/#sec-names-and-keywords
* U_ID_START corresponds to the ID_Start property, and U_ID_CONTINUE corresponds to ID_Continue property.
*/`);
console.log('U_ID_START ' + idStartWriter.toString());
console.log('U_ID_CONTINUE ' + idContinueWriter.toString());
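
The surrogate arithmetic in `Writer.push` can be checked in isolation. The following is a minimal sketch; the `toSurrogatePair` name is an assumption introduced here for illustration and does not appear in the script. It applies the same formulas to one supplementary code point and compares the result with what `String.fromCodePoint` produces.

```js
// Illustrative helper (hypothetical name): splits a code point above 0xFFFF
// into its UTF-16 lead and trail surrogates, using the same arithmetic as
// the else-branch of Writer.push.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const lead = Math.floor(offset / 0x400) + 0xd800;
  const trail = (offset % 0x400) + 0xdc00;
  return [lead, trail];
}

const cp = 0x1d49c; // MATHEMATICAL SCRIPT CAPITAL A
const [lead, trail] = toSurrogatePair(cp);
const ch = String.fromCodePoint(cp);

console.log(lead.toString(16), trail.toString(16)); // d835 dc9c
console.log(ch.charCodeAt(0) === lead && ch.charCodeAt(1) === trail); // true
```

Because a lead surrogate is shared by up to 0x400 consecutive code points, grouping trail ranges under one lead (as `LeadSurrogateRange` does) keeps the generated pattern compact.
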
@@ -9,7 +9,6 @@
* A regexp-tree plugin to translate `/(?<name>a)\k<name>/` to `/(a)\1/`.
*/
module.exports = {

// To track the names of the groups, and return them
// in the transform result state.
//
@@ -41,6 +40,7 @@ module.exports = {
this._groupNames[node.name] = node.number;

delete node.name;
delete node.nameRaw;
},

Backreference(path) {
@@ -52,5 +52,6 @@ module.exports = {

node.kind = 'number';
node.reference = node.number;
delete node.referenceRaw;
},
};
};
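
The effect of the two added `delete` statements can be illustrated on hand-built nodes. This is only a sketch: `stripGroupName` and `toNumericBackreference` are hypothetical stand-ins for the plugin's handlers, and they take bare nodes, whereas the real plugin receives regexp-tree path objects.

```js
// Hypothetical stand-in for the Group handler: records the name-to-number
// mapping, then removes both the decoded and the raw name.
function stripGroupName(node, groupNames) {
  if (!node.name) return;
  groupNames[node.name] = node.number;
  delete node.name;
  delete node.nameRaw;
}

// Hypothetical stand-in for the Backreference handler: turns a named
// backreference into a numeric one and drops the raw reference.
function toNumericBackreference(node) {
  if (node.kind !== 'name') return;
  node.kind = 'number';
  node.reference = node.number;
  delete node.referenceRaw;
}

const group = { type: 'Group', capturing: true, name: 'π', nameRaw: '\\u{03C0}', number: 1 };
const backref = { type: 'Backreference', kind: 'name', number: 1, reference: 'π', referenceRaw: '\\u{03C0}' };
const groupNames = {};

stripGroupName(group, groupNames);
toNumericBackreference(backref);

console.log(group); // no name or nameRaw left
console.log(backref); // kind 'number', reference 1, no referenceRaw
console.log(groupNames); // maps 'π' to 1
```

Removing the raw properties together with the decoded ones keeps the transformed nodes consistent, so no stale source text lingers on a node that is no longer named.
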
10 changes: 4 additions & 6 deletions src/generator/__tests__/generator-basic-test.js
@@ -14,7 +14,6 @@ function test(re) {
}

describe('generator-basic', () => {

it('simple char', () => {
test(/a/);
});
@@ -81,11 +80,11 @@ describe('generator-basic', () => {
});

it('named group', () => {
test('/(?<foo\\u003B\\u{003B}>bar)/');
test('/(?<foo\\u003B\\u{003B}>bar)/u');
});

it('empty named group', () => {
test('/(?<foo\\u003B\\u{003B}>)/');
test('/(?<foo\\u003B\\u{003B}>)/u');
});

it('empty non-capturing group', () => {
@@ -97,7 +96,7 @@ describe('generator-basic', () => {
});

it('named backreference', () => {
test('/(?<foo\\u003B\\u{003B}>)\\k<foo\\u003B\\u{003B}>/');
test('/(?<foo\\u003B\\u{003B}>)\\k<foo\\u003B\\u{003B}>/u');
});

it('basic-assertion', () => {
@@ -179,5 +178,4 @@ describe('generator-basic', () => {
test(/a{1,}?/);
test(/a{1,3}?/);
});

});
});
4 changes: 2 additions & 2 deletions src/generator/index.js
@@ -34,7 +34,7 @@ const generator = {
if (node.capturing) {
// A named group.
if (node.name) {
return `(?<${node.name}>${expression})`;
return `(?<${node.nameRaw || node.name}>${expression})`;
}

return `(${expression})`;
@@ -48,7 +48,7 @@ const generator = {
case 'number':
return `\\${node.reference}`;
case 'name':
return `\\k<${node.reference}>`;
return `\\k<${node.referenceRaw || node.reference}>`;
default:
throw new TypeError(`Unknown Backreference kind: ${node.kind}`);
}
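
The `nameRaw || name` fallback in the two changed lines can be exercised on its own. The functions below are a simplified sketch written for illustration (their names are assumptions, and they are not the generator's actual entry points); they show that a parsed node round-trips to its original spelling while a hand-built node without raw properties still generates correctly.

```js
// Simplified sketch of the Group case: prefer the raw spelling when present.
function generateGroup(node, expression) {
  if (node.capturing) {
    if (node.name) {
      return `(?<${node.nameRaw || node.name}>${expression})`;
    }
    return `(${expression})`;
  }
  return `(?:${expression})`;
}

// Simplified sketch of the named Backreference case.
function generateNamedBackreference(node) {
  return `\\k<${node.referenceRaw || node.reference}>`;
}

// A node as the parser would produce it: the escape is preserved.
console.log(generateGroup({ capturing: true, name: 'π', nameRaw: '\\u{03C0}' }, 'x'));
// (?<\u{03C0}>x)

// A hand-built node without nameRaw falls back to the decoded name.
console.log(generateGroup({ capturing: true, name: 'foo' }, 'x'));
// (?<foo>x)

console.log(generateNamedBackreference({ reference: 'π', referenceRaw: '\\u{03C0}' }));
// \k<\u{03C0}>
```
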