Support supplementary (non-BMP) symbols in Unicode plugins #25

mathiasbynens · 2012-06-11T12:40:29Z

I’ve just finished work on a script that generates JavaScript-compatible regular expressions for Unicode categories including supplementary symbols: http://git.io/unicode

As explained on that page, the generated output is fully tested, too. (There’s a link to the test if you want to confirm this yourself.)

I’m afraid it won’t be possible to re-use the same “compressed” format you’re using now, given how the parts of the regular expressions that match surrogate pairs look. For example, here’s a small portion of the regex for the [Ll] category as per Unicode 6.1.0:

…\uD835[\uDC1A-\uDC33\uDC4E-\uDC54\uDC56-\uDC67\uDC82-\uDC9B\uDCB6-\uDCB9\uDCBB\uDCBD-\uDCC3\uDCC5-\uDCCF\uDCEA-\uDD03\uDD1E-\uDD37\uDD52-\uDD6B\uDD86-\uDD9F\uDDBA-\uDDD3\uDDEE-\uDE07\uDE22-\uDE3B\uDE56-\uDE6F\uDE8A-\uDEA5\uDEC2-\uDEDA\uDEDC-\uDEE1\uDEFC-\uDF14\uDF16-\uDF1B\uDF36-\uDF4E\uDF50-\uDF55\uDF70-\uDF88\uDF8A-\uDF8F\uDFAA-\uDFC2\uDFC4-\uDFC9\uDFCB]|\uD801[\uDC28-\uDC4F]…
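For readers unfamiliar with the format of that excerpt: each supplementary code point is stored as a high/low surrogate pair, which is why the regex takes the shape "one high surrogate, then a class of low surrogates". A minimal sketch of the standard UTF-16 arithmetic follows; the helper names are made up for illustration and are not part of the generator script.

```javascript
// Split a supplementary code point (U+10000..U+10FFFF) into its UTF-16 surrogate pair.
// Hypothetical helper for illustration only.
function toSurrogatePair(codePoint) {
    var offset = codePoint - 0x10000;
    var high = 0xD800 + (offset >> 10);   // top 10 bits
    var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
    return [high, low];
}

// Format a code unit as a \uXXXX escape (surrogates always have four hex digits).
function toEscape(unit) {
    return "\\u" + unit.toString(16).toUpperCase();
}

var pair = toSurrogatePair(0x1D7CB);
console.log(pair.map(toEscape).join("")); // \uD835\uDFCB

var pair2 = toSurrogatePair(0x10428);
console.log(pair2.map(toEscape).join("")); // \uD801\uDC28
```

Both results line up with the excerpt: \uDFCB is the last entry in the \uD835 class, and \uDC28 is the start of the \uD801 range.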

If you want this in XRegExp (I know it’s been on the roadmap for a while), let me know which format you decide on; I’ll happily tweak my script and submit a pull request.

slevithan · 2012-06-09T22:07:04Z

@mathiasbynens, thanks for raising the issue and offering to help. I'd love it if astral/supplementary/non-BMP code point support could be added to XRegExp's \p{…} classes, especially if it didn't balloon the post-gzipping size of the library to something unmanageable. If astral support makes the Unicode addons much larger or slower, I might not make the astral versions the default, but I'd still love to offer them as alternatives, as part of the XRegExp package.

I have no plans to work on astral support myself in the near term, but if it came via a pull request, that would be fantastic. However, there are some challenges that I think you might not yet be accounting for, which I describe below. (Since you mentioned the roadmap page, some time ago I removed the bullet about support for astral code points, due to these challenges and because of concerns about file size.)

I probably would prefer it if the Unicode data used by XRegExp continues to use some kind of compression, so long as that compression still offers a reasonably significant benefit after gzipping. It's OK if the decompression for any given Unicode property's data is a bit slow (on the order of double-digit milliseconds), so long as the decompression code is fairly lightweight and as long as Unicode Base is updated to cache decompressed ranges on first use, similar to how it already caches generated inverted ranges on first use. It's also OK to break backward compatibility for the XRegExp.addUnicodePackage function in Unicode Base, since that's undocumented (outside of the source code) and AFAIK only used by XRegExp's official Unicode addons.

Before getting to the challenges, I should say this: If it can be done fully and correctly, IMO this is certainly worth pursuing. Given JavaScript's poor support for Unicode, it is simply not yet a suitable language for applications that need robust Unicode functionality. You can use a normalization library, you can add ES6 code point shims, etc., but regular expressions remain a sore spot that is very difficult to work around. XRegExp with its Unicode addons gives JavaScript's Unicode support a huge leap forward, but XRegExp's current limitation of BMP-only support means that sometimes it's not enough.

The challenge is that there are a variety of different uses of \p{…} that must be accounted for in the code point ranges that are output by XRegExp's Unicode Base addon:

\p{L} becomes […range…]. XRegExp adds the surrounding brackets.
\p{^L} becomes [^…range…]. XRegExp adds the surrounding brackets.
[\p{L}] becomes […range…]. XRegExp does not add surrounding brackets.
[\p{^L}] becomes […generated inverted range…]. XRegExp does not add surrounding brackets. The range must be inverted, in order to play nice with other tokens inside the character class.
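To make "generated inverted range" concrete, here is a minimal sketch of inverting a set of BMP ranges and caching the result on first use. It assumes the data is an array of sorted, non-overlapping [start, end] pairs, which is an illustrative format and not XRegExp's actual internal representation; the function and cache names are hypothetical.

```javascript
// Invert sorted, non-overlapping [start, end] ranges within 0..max (inclusive).
function invertRanges(ranges, max) {
    var inverted = [];
    var next = 0; // first code unit not yet covered by the input
    ranges.forEach(function (range) {
        if (range[0] > next) {
            inverted.push([next, range[0] - 1]);
        }
        next = range[1] + 1;
    });
    if (next <= max) {
        inverted.push([next, max]);
    }
    return inverted;
}

// Hypothetical cache so each property is inverted only once.
var invertedCache = {};
function getInverted(name, ranges) {
    if (!invertedCache.hasOwnProperty(name)) {
        invertedCache[name] = invertRanges(ranges, 0xFFFF);
    }
    return invertedCache[name];
}

// Example with ASCII letters standing in for a real Unicode category.
var asciiLetters = [[0x41, 0x5A], [0x61, 0x7A]];
console.log(JSON.stringify(getInverted("asciiLetters", asciiLetters)));
// [[0,64],[91,96],[123,65535]]
```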

Those are the four basic usages, and all that XRegExp's current implementation needs to account for. But a more complex implementation that supports surrogate-pair-based ranges might also need special handling for things like these:

[^\p{L}] becomes [^…range…].
[^\p{^L}] becomes [^…generated inverted range…].
[\p{L}0-9] becomes […range…0-9].
[\p{^L}0-9] becomes […generated inverted range…0-9].
[^\p{L}0-9] becomes [^…range…0-9].
[^\p{^L}0-9] becomes [^…generated inverted range…0-9].
\p{L}+ becomes […range…]+.
[\p{L}]+ becomes […range…]+.
[\p{L}\p{M}] becomes […range…range…]
[\p{L}\p{^M}] becomes […range…generated inverted range…]

And so on. Note that \p{^…} is the same as uppercase \P{…}.

In other words, full regex syntax would still need to be supported (e.g., something like XRegExp('(?x) (?![c-f]) [\\p{^Ll}a-z] +') should still work as expected).

The code behind http://inimino.org/~inimino/blog/javascript_cset might be relevant.

As an aside, the supporting code needed to solve all of the above might also make it possible to implement 21-bit \u{10FFFF} code points for XRegExp (see https://gist.github.com/2630353) in a way that works intuitively in character classes and character class ranges. (You can make that gist work intuitively for \u{10FFFF} followed by a quantifier by simply wrapping the output in (?:…) if the match scope is not 'class'.)

So yeah, all of the above is why I haven't pursued fancy surrogate-pair-based astral code point support for \p{…} thus far. But IMO it would be an interesting, challenging, and worthwhile project.

mathiasbynens · 2012-06-10T05:44:28Z

Thanks for your detailed answer, Steven. Before I respond, let’s just say that your regex skills are far superior to mine, so it’s very likely that I’m oversimplifying things, or that there are edge cases that I’m not considering in my proposed solutions.

That said, some replies here:

\p{L} becomes […range…]. XRegExp adds the surrounding brackets.

I’m not sure how this would be feasible. Most of the regular expressions to match all symbols in a Unicode category already consist of multiple ranges (e.g. https://mathias.html5.org/data/unicode/6.1.0/Ll-regex.js). Would it be possible to simply not add the surrounding brackets ([]), and instead use something like (?:…)?

\p{^L} becomes [^…range…]. XRegExp adds the surrounding brackets.

It would be possible to write a script that generates separate regexes for negated category ranges, but can’t we use something like (?!…) instead?

[\p{L}0-9] becomes […range…0-9].

Couldn’t this be (…regexL…)|[0-9] instead? (Wrapped in (?:…), if needed.)

[\p{L}\p{M}] becomes […range…range…]

Could this be (regexL)|(regexM)? (Again, wrapped in (?:…) as needed.)

You get the idea for the other examples. It seems to me that most of the issues could be solved this way. What am I missing? :)

slevithan · 2012-06-10T06:51:38Z

\p{L} becomes […range…]. XRegExp adds the surrounding brackets.

I’m not sure how this would be feasible. Most of the regular expressions to match all symbols in a Unicode category already consist of multiple ranges (e.g. https://mathias.html5.org/data/unicode/6.1.0/Ll-regex.js). Would it be possible to simply not add the surrounding brackets ([]), and instead use something like (?:…)?

Yes, that's fine. As you know, you cannot match surrogate-pair-based code points and ranges in a single character class. It will require pairs of character classes and alternation within a group, as you've already shown. I was merely showing what XRegExp currently outputs, and what must be precisely emulated by the new output.

\p{^L} becomes [^…range…]. XRegExp adds the surrounding brackets.

It would be possible to write a script that generates separate regexes for negated category ranges, but can’t we use something like (?!…) instead?

Yes, you can use negative lookahead (at least in a prototype version to prove that astral support works), but it would be less efficient. If you can script the inversion at runtime (like XRegExp already does for BMP code points) and cache results, you don't need to take a filesize hit or any significant performance hit. But you probably can't script the inversion if the data is provided as pre-generated regexes. You'd need some lightweight representation for the underlying Unicode data that is then used to generate both the inverted and noninverted emulated character classes.
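One possible shape for such a lightweight representation is a plain list of code point ranges, from which the surrogate-pair regex source is generated at runtime. The sketch below is only an illustration under stated assumptions: ranges are sorted [start, end] pairs, BMP ranges do not contain lone surrogates, and no range straddles the U+FFFF/U+10000 boundary. All function names are hypothetical.

```javascript
// Format a UTF-16 code unit as a \uXXXX escape for a regex source string.
function esc(unit) {
    return "\\u" + ("0000" + unit.toString(16).toUpperCase()).slice(-4);
}

// Format a range of code units as it appears inside a character class.
function classRange(start, end) {
    return start === end ? esc(start) : esc(start) + "-" + esc(end);
}

// Convert code point ranges into regex source: one class for the BMP part,
// plus one "high surrogate + class of low surrogates" alternative per high surrogate.
function rangesToSource(ranges) {
    var bmp = "";
    var astral = {}; // high surrogate -> class contents for its low surrogates
    var order = [];  // high surrogates in first-seen order
    var parts = [];
    ranges.forEach(function (range) {
        var start = range[0], end = range[1];
        if (end <= 0xFFFF) {
            bmp += classRange(start, end);
            return;
        }
        var hiStart = 0xD800 + ((start - 0x10000) >> 10);
        var hiEnd = 0xD800 + ((end - 0x10000) >> 10);
        for (var hi = hiStart; hi <= hiEnd; hi++) {
            // Only the first and last high surrogate get a partial low range.
            var loStart = hi === hiStart ? 0xDC00 + ((start - 0x10000) & 0x3FF) : 0xDC00;
            var loEnd = hi === hiEnd ? 0xDC00 + ((end - 0x10000) & 0x3FF) : 0xDFFF;
            if (!astral.hasOwnProperty(hi)) {
                astral[hi] = "";
                order.push(hi);
            }
            astral[hi] += classRange(loStart, loEnd);
        }
    });
    if (bmp) {
        parts.push("[" + bmp + "]");
    }
    order.forEach(function (hi) {
        parts.push(esc(hi) + "[" + astral[hi] + "]");
    });
    return parts.join("|");
}

var source = rangesToSource([[0x61, 0x7A], [0x10428, 0x1044F], [0x1D7CB, 0x1D7CB]]);
console.log(source);
// [\u0061-\u007A]|\uD801[\uDC28-\uDC4F]|\uD835[\uDFCB]
```

Because the input is just ranges, the same function could be fed an inverted range list (computed over 0..0x10FFFF) to produce the negated form, which is the point being made above.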

[\p{L}0-9] becomes […range…0-9].

Couldn’t this be (…regexL…)|[0-9] instead?

It could be (?:(?:…regexL…)|[0-9]). That would ensure that the emulated character class can be quantified as a single unit, and that it does not interfere with the surrounding pattern in unexpected ways. The same applies for [\p{L}\p{M}]. It can be (?:(?:regexL)|(?:regexM)).
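As a quick illustration of why the outer group matters, here is a sketch using native RegExp only. The regexLl string is a tiny, hand-picked subset of the Ll data (taken from the excerpt at the top of this thread, plus [a-z]), not the full category.

```javascript
// Tiny illustrative subset of the Ll regex source; NOT the full category.
var regexLl = "[a-z]|\\uD835[\\uDC1A-\\uDC33]|\\uD801[\\uDC28-\\uDC4F]";

// Emulates [\p{Ll}0-9]: the outer (?:...) lets a quantifier apply to the whole unit.
var emulatedClass = "(?:(?:" + regexLl + ")|[0-9])";
var re = new RegExp("^" + emulatedClass + "+$");

console.log(re.test("a1\uD801\uDC28")); // true
console.log(re.test("a1\uD801"));       // false (lone high surrogate is not in the set)
console.log(re.test("A"));              // false
```

Without the outer group, a trailing + would bind only to [0-9], and the alternation would leak into whatever surrounds it in the pattern.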

But there is a bigger issue, in this case, compared to \p{L} with no surrounding character class. XRegExp syntax token handlers operate only on the individual tokens they match, and don't know what comes before or after the match. When a token is matched, only a few basics are known, including what flags are active and whether the token was found in a character class or not. It's possible to peek forward by using a capturing group in lookahead as part of the regex that matches your syntax token, but you can't peek back at previous tokens since JavaScript doesn't support lookbehind. Of course, if you changed the implementation in xregexp.js, it may be possible to look back at previous tokens and even modify earlier parts of the pattern that have already been rewritten by other token handlers. I'm open to ideas about how to change the token matching and addon systems. But since you can't currently rewrite any part of the regex that you've already passed, you can't currently remove the opening square bracket of the containing character class, assuming your new syntax token matches just \p{…}.
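The "peek forward" trick mentioned above can be shown with a plain JavaScript regex. This is a hypothetical token pattern for illustration, not one used by XRegExp: the capturing group sits inside a lookahead, so the following character is captured without becoming part of the match.

```javascript
// Hypothetical token pattern: matches \p{...} and peeks at an optional quantifier
// after it. The lookahead is zero-width, so the quantifier is not consumed.
var token = /\\p\{([^}]*)\}(?=([+*?]?))/;

var m = token.exec("a\\p{L}+b");
console.log(m[0]);    // \p{L}  (the match itself, without the quantifier)
console.log(m[1]);    // L
console.log(m[2]);    // +      (captured by the group inside the lookahead)
console.log(m.index); // 1
```

There is no equivalent for looking backward here, which is the limitation being described.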

One way to deal with this might be to add a new syntax token that matches entire character classes. You could then process the entire contents of the character class yourself, within your token handler. For this to work correctly (and not break other addons), you would need issue #18 to be implemented, so that other syntax token handlers can transform parts of the character class that they're interested in, after you're done with your transformation.

The quickest way to get a feel for the challenges you are up against is probably to just go ahead and try to implement support for 21-bit \p{…} that handles the four main cases (\p{L}, \p{^L}, [0\p{L}1], and [0\p{^L}1]), using XRegExp.addToken. You will quickly hit a wall, and realize that XRegExp's addon support would require major new features in order to support 21-bit \p{L} and \p{^L} that appear within character classes. Like I've said, though, I'm open to smart new features that would give XRegExp more robust addon support. (Speaking of addon-targeted features, this discussion might be relevant.)

Adding support for 21-bit \p{…} is easy if you only want to support it outside of character classes, so long as you already have the data to go with it (which you do). You could already do this with XRegExp v2.0.0, perhaps with BMP-only fallback within character classes.

slevithan · 2012-06-11T05:49:30Z

Actually, supporting 21-bit \p{…} within character classes is probably a bad idea to begin with, because JavaScript character classes (emulated or not) should not be matching more than one code unit, unless they are using ES6's code-point-based matching with the /u flag. Doing this with ES3/5 code-unit-based matching would have some subtle and weird effects. E.g., what should [\p{L}\uD835\uDC9E] match? A code unit, a code point, or sometimes one and sometimes the other? What about [\p{^L}]--code unit or code point? This would all be too screwy and could introduce subtle and latent bugs.
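The code unit versus code point distinction is easy to see with native regexes. A small sketch, assuming an engine that supports the ES6 u flag for the last two checks:

```javascript
var astral = "\uD835\uDC9E"; // U+1D49E: one code point, two UTF-16 code units
console.log(astral.length); // 2

// Without /u, the class contains two separate code units and matches exactly one.
var unitClass = /^[\uD835\uDC9E]$/;
console.log(unitClass.test(astral));   // false
console.log(unitClass.test("\uD835")); // true

// With /u, the escaped pair inside the class is read as a single code point.
var pointClass = new RegExp("^[\\uD835\\uDC9E]$", "u");
console.log(pointClass.test(astral));   // true
console.log(pointClass.test("\uD835")); // false
```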

But there's a bright side: Because supporting 21-bit \p{…} within ES3/5 character classes is a bad idea, the implementation challenges I've been talking about are all moot. 21-bit \p{…} is easy if you limit it to only work outside of character classes. And if users want character-class-like functionality, they can just use something like (?:\p{L}|\p{M}|[0-9]).

What I'd recommend is to create a new addon for 21-bit \p{…}, \P{…}, and \p{^…}, and make it work only in the 'default' scope (i.e., outside of character classes). By limiting the scope like that, XRegExp's existing BMP-only Unicode addons will happily work alongside your new addon (so long as your addon is loaded last), and the BMP-only versions would pick up the slack within character classes. If XRegExp's existing Unicode addons are not loaded alongside your new addon, then use of \p{…} inside character classes will automatically throw a SyntaxError (which is good), because XRegExp makes unrecognized alphanumeric escapes an error.

The implementation could be something as simple as this:

(function (XRegExp) {
    "use strict";

    var unicode = {
        // ...
        // Your 21-bit Unicode data: keys are slugged property names (see slug()),
        // values are regex source strings
        // ...
    };

    function slug(name) {
        return name.replace(/[- _]+/g, "").toLowerCase();
    }

    XRegExp.install("extensibility");

    XRegExp.addToken(
        /\\([pP]){(\^?)([^}]*)}/,
        function (match) {
            var item = slug(match[3]),
                codePoint = "[\\0-\\ud7ff\\udc00-\\uffff]|[\\ud800-\\udbff][\\udc00-\\udfff]|[\\ud800-\\udbff]";
            if (match[1] === "P" && match[2]) {
                throw new SyntaxError("invalid double negation \\P{^");
            }
            if (!unicode.hasOwnProperty(item)) {
                throw new SyntaxError("invalid or unknown Unicode property " + match[0]);
            }
            if (match[1] === "P" || match[2]) { // Negated
                // 21-bit Unicode properties should always match a code point, to avoid
                // any confusion about when they match a code unit vs code point
                return "(?:(?!" + unicode[item] + ")(?:" + codePoint + "))";
            }
            return "(?:" + unicode[item] + ")";
        }
    );

}(XRegExp));

(All code untested. Passing in XRegExp like that follows the style of other addons. It just makes it a bit easier to update the script when XRegExp goes by a different name, and has minor minification benefits.)

If you want, you could explicitly specify the scope as 'default' (see XRegExp.addToken), but that's unnecessary. Because the above addon works only outside of character classes, it can't replace the existing BMP-only Unicode addons. But holy crap, I think this is pretty cool, and it should work nicely with your existing data. It would also play nice with all XRegExp syntax/flags and any user-created addons.
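To show what the negated branch of that handler would produce, here is a sketch that builds the same pattern shape with native RegExp. The codePoint string is copied from the handler above; regexLl is a tiny illustrative subset of the Ll data rather than the real category.

```javascript
// Matches any single code point: a BMP unit or lone low surrogate, a full surrogate
// pair, or a lone high surrogate (copied from the handler above).
var codePoint = "[\\0-\\ud7ff\\udc00-\\uffff]|[\\ud800-\\udbff][\\udc00-\\udfff]|[\\ud800-\\udbff]";

// Tiny illustrative subset of the Ll regex source; NOT the full category.
var regexLl = "[a-z]|\\uD835[\\uDC1A-\\uDC33]|\\uD801[\\uDC28-\\uDC4F]";

// Same shape as the handler's negated output: "not Ll here, then consume one code point".
var notLl = "(?:(?!" + regexLl + ")(?:" + codePoint + "))";
var re = new RegExp("^" + notLl + "$");

console.log(re.test("A"));            // true
console.log(re.test("a"));            // false
console.log(re.test("\uD83D\uDCA9")); // true  (U+1F4A9, consumed as one code point)
console.log(re.test("\uD801\uDC28")); // false (U+10428 is in the Ll subset)
```

The lookahead is what makes the negated form always consume a whole code point instead of half of a surrogate pair.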

I encourage you to submit a pull request with something along these lines. The new addon should not go in the src folder, though, since those files get compiled into xregexp-all.js, and the new script should not. Instead, I'd recommend creating a new root folder for it, named something like misc or experiments (perhaps you have a better name).

Of course, you might have other ideas about how best to proceed...

mathiasbynens · 2012-06-11T08:33:41Z

I encourage you to submit a pull request with something along these lines. The new addon should not go in the src folder, though, since those files get compiled into xregexp-all.js, and the new script should not. Instead, I'd recommend creating a new root folder for it, named something like misc or experiments (perhaps you have a better name).

To be honest, I’d just put it in the src/addons folder. As long as we don’t modify concatenate-source-files.sh, the new file won’t be added to xregexp-all.js anyway. Or do you think that would be confusing? (I personally think it would be more confusing to introduce yet another root folder.)

slevithan · 2012-06-11T08:59:06Z

Yes, I think it would be confusing. I already dislike having to say that xregexp-all.js bundles all addons except backcompat.js (although the fact that that one is in src rather than src/addons helps a bit).

But you're probably right that a new root folder would be even more confusing. Feel free to include it in src/addons/unicode, if you think that's the best fit.

mathiasbynens · 2012-06-11T12:45:39Z

There, I’ve turned this issue into a pull request. :)

With XRegExp and unicode-categories-all.js (please feel free to rename), you can do stuff like this:

XRegExp('^\\p{Ll}+$').test('\uD835\uDFCB'); // true
XRegExp('^\\P{Ll}+$').test('\uD835\uDFCB'); // false

I’ve also created a repository for the JavaScript-compatible Unicode data and the scripts that generate it: http://git.io/unicode

slevithan · 2012-06-11T13:13:15Z

It's...it's beautiful. :-D Thanks for doing this! Really great stuff.

I haven't looked over the code in depth yet, but I'll accept the pull request when I have an hour or so free to deal with this. Note that I'll probably rename the -all files to something like -allplanes or -21bit (other suggestions welcome), to avoid confusion related to xregexp-all.js.

Also, in order to avoid confusion and unintended edge case behavior, I think it might be better to go against my earlier suggestion and let the new addon give \p and \P within character classes its own descriptive error (so that they don't fall back to BMP-only matching when loaded together with unicode-categories.js). I can make this change after accepting, unless you're opposed.

XRegExp 2.1.0 (milestone issues) should be released this month, and I definitely plan for this to be in the new feature list. Make that "the best new feature".

mathiasbynens · 2012-06-11T13:22:02Z

Also, in order to avoid confusion and unintended edge case behavior, I think it might be better to go against my earlier suggestion and let the new addon give \p and \P within character classes its own descriptive error (so that they don't fall back to BMP-only matching when loaded together with unicode-categories.js). I can make this change after accepting, unless you're opposed.

+1 Sounds good. Please feel free to rename the files or change anything in the code — it’s your project! (Sorry, I hadn’t seen your edit until after I submitted the pull request.)

I was wondering about the -all part in the file name. At first I thought the existing BMP-only categories plugin could be renamed into unicode-categories-bmp.js, and the new one could be unicode-categories.js, but that might be confusing for users who update to the latest XRegExp + addons. I’ll leave it up to you to come up with something that makes sense and doesn’t break user expectations :p

walling · 2012-06-11T16:13:56Z

IMHO it should be included in xregexp-all.js, so I get all the functions of the library when installing the npm package. Not that I'm working with non-BMP code points right now, but you never know. :-)

mathiasbynens · 2012-06-11T16:50:22Z

@walling The problem is that unicode-categories.js and unicode-categories-all.js (the one with non-BMP code point support) aren’t 100% compatible (explained in the comments above). It would be possible to add it all into the same file, but we’d probably have to default to the BMP behavior (for backwards compatibility), and then allow users to opt in to the non-BMP support (thereby opting out of other features, e.g. use of \p in character classes) with XRegExp.install('astral') or similar.

I wonder what @slevithan thinks about this.

walling · 2012-06-11T16:57:39Z

I see! Personally I wouldn't mind invoking XRegExp.install('astral') or similar.

mathiasbynens · 2012-06-11T16:59:54Z

I just realized using install to opt-in/out would make it easy to add unit tests for this new behavior. You could simply call install() after all the tests that rely on the BMP-only behavior (with extra functionality), then test for the “new” behavior. That’s definitely a plus :)

Support astral symbols in new Unicode Categories Astral addon

slevithan · 2012-06-11T23:35:12Z

@walling The problem is that unicode-categories.js and unicode-categories-all.js (the one with non-BMP code point support) aren’t 100% compatible (explained in the comments above).

Exactly right. Astral support cannot be the default. Most people don't care about non-BMP code points anyway, and I can't take away support for use within character classes in the default handling. The astral version is also a bit less efficient. Of course, the people who care about full 21-bit Unicode support tend to really care about it. So it's fantastic to have this new (optional) functionality.

It would be possible to add it all into the same file, but we’d probably have to default to the BMP behavior (for backwards compatibility), and then allow users to opt in to the non-BMP support (thereby opting out of other features, e.g. use of \p in character classes) with XRegExp.install('astral') or similar.

I wonder what @slevithan thinks about this.

He thinks it's bloody brilliant, yo. ;-) I've added basic support for this in commit e0cf69d.

I just realized using install to opt-in/out would make it easy to add unit tests for this new behavior. You could simply call install() after all the tests that rely on the BMP-only behavior (with extra functionality), then test for the “new” behavior. That’s definitely a plus :)

Yup. I've added some basic tests in the same commit.

BTW, I've renamed the -all files as -astral. I realize that's confusing (since it covers all planes, not just the astral planes), but I'm using that to match the addon name XRegExp Unicode Categories Astral, which is what I'm calling it for now. As soon as optional astral support is merged into unicode-categories.js (see below), the separate addon file can go away, and the -astral files in tools can be renamed using e.g. -all, -allplanes, or no suffix.

So, your awesome work and ideas have already made this good enough to be included in the v2.1.0 package (though not yet as part of xregexp-all.js, which is used by the npm package). But big improvements are possible....

In order to merge optional astral support into the XRegExp Unicode Categories addon (and thus xregexp-all.js), here's what I think is needed...

Edit: Details moved to the new issue #29.

I'd leave this issue open, but GitHub doesn't allow reopening merged pull requests.

Add Unicode category plugin with support for supplementary symbols

0ed1855

slevithan added a commit that referenced this pull request Jun 11, 2012

Merge pull request #25 from mathiasbynens/0x10FFFF

e63e404

Support astral symbols in new Unicode Categories Astral addon

slevithan merged commit e63e404 into slevithan:master Jun 11, 2012

This was referenced Jun 12, 2012

Add installable feature 'astral' #28

Closed

Add opt-in astral support to Unicode addons, without separate files #29

Closed

slevithan mentioned this pull request Nov 22, 2012

\p and \P and mixed astral/BMP within character classes #44

Closed
