Support supplementary (non-BMP) symbols in Unicode plugins #25
@mathiasbynens , thanks for raising the issue and offering to help. I'd love it if astral/supplementary/non-BMP code point support could be added to XRegExp's I have no plans to work on astral support myself in the near term, but if it came via a pull request, that would be fantastic. However, there are some challenges that I think you might not yet be accounting for, which I describe below. (Since you mentioned the roadmap page, some time ago I removed the bullet about support for astral code points, due to these challenges and because of concerns about file size.) I probably would prefer it if the Unicode data used by XRegExp continues to use some kind of compression, so long as that compression still offers a reasonably significant benefit after gzipping. It's OK if the decompression for any given Unicode property's data is a bit slow (on the order of double-digit milliseconds), so long as the decompression code is fairly lightweight and as long as Unicode Base is updated to cache decompressed ranges on first use, similar to how it already caches generated inverted ranges on first use. It's also OK to break backward compatibility for the Before getting to the challenges, I should say this: If it can be done fully and correctly, IMO this is certainly worth pursuing. Given JavaScript's poor support for Unicode, it is simply not yet a suitable language for applications that need robust Unicode functionality. You can use a normalization library, you can add ES6 code point shims, etc., but regular expressions remain a sore spot that is very difficult to work around. XRegExp with its Unicode addons gives JavaScript's Unicode support a huge leap forward, but XRegExp's current limitation of BMP-only support means that sometimes it's not enough. The challenge is that there are a variety of different uses of
Those are the four basic usages, and all that XRegExp's current implementation needs to account for. But a more complex implementation that supports surrogate-pair-based ranges might also need special handling for things like these:
And so on. Note that In other words, full regex syntax would still need to be supported (e.g., something like The code behind http://inimino.org/~inimino/blog/javascript_cset might be relevant. As an aside, the supporting code needed to solve all of the above might also make it possible to implement 21-bit So yeah, all of the above is why I haven't pursued fancy surrogate-pair-based astral code point support for
Thanks for your detailed answer, Steven. Before I respond, let’s just say that your regex skills are far superior compared to mine — so it’s very likely that I’m oversimplifying things, or that there are edge cases that I’m not considering in my proposed solutions. That said, some replies here:
I’m not sure how this would be feasible. Most of the regular expressions to match all symbols in a Unicode category already consist of multiple ranges (e.g. https://mathias.html5.org/data/unicode/6.1.0/Ll-regex.js). Would it be possible to simply not add the surrounding brackets (
It would be possible to write a script that generates separate regexes for negated category ranges, but can’t we use something like
Couldn’t this be
Could this be You get the idea for the other examples. It seems to me that most of the issues could be solved this way. What am I missing? :)
Yes, that's fine. As you know, you cannot match surrogate-pair-based code points and ranges in a single character class. It will require pairs of character classes and alternation within a group, as you've already shown. I was merely showing what XRegExp currently outputs, and what must be precisely emulated by the new output.
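To make the "pairs of character classes and alternation within a group" point concrete, here is a minimal sketch of how one astral range could be turned into a code-unit pattern. The helper names (`toSurrogates`, `esc`, `astralRangeToPattern`) are made up for illustration and are not part of XRegExp; the sketch assumes both endpoints lie in U+10000..U+10FFFF.

```javascript
// Hypothetical helpers, not part of XRegExp.
// Convert an astral code point (U+10000..U+10FFFF) to its UTF-16 surrogate pair.
function toSurrogates(codePoint) {
  var offset = codePoint - 0x10000;
  return {
    high: 0xD800 + (offset >> 10),
    low: 0xDC00 + (offset & 0x3FF)
  };
}

// Format one code unit as a \uXXXX escape for use in a regex source string.
function esc(unit) {
  return "\\u" + ("0000" + unit.toString(16).toUpperCase()).slice(-4);
}

// Build a regex source matching every code point in [start, end] (both astral).
// A range that spans several high surrogates is split into up to three parts.
function astralRangeToPattern(start, end) {
  var s = toSurrogates(start), e = toSurrogates(end), parts = [];
  if (s.high === e.high) {
    parts.push(esc(s.high) + "[" + esc(s.low) + "-" + esc(e.low) + "]");
  } else {
    // Tail of the first high surrogate
    parts.push(esc(s.high) + "[" + esc(s.low) + "-\\uDFFF]");
    // Any high surrogates fully covered by the range
    if (e.high - s.high > 1) {
      parts.push("[" + esc(s.high + 1) + "-" + esc(e.high - 1) + "][\\uDC00-\\uDFFF]");
    }
    // Head of the last high surrogate
    parts.push(esc(e.high) + "[\\uDC00-" + esc(e.low) + "]");
  }
  return "(?:" + parts.join("|") + ")";
}

var mathBlock = astralRangeToPattern(0x1D400, 0x1D7FF);
var wide = astralRangeToPattern(0x1F600, 0x1FAFF);
```

The result is always a group, never a character class, which is exactly why it cannot be dropped inside an existing `[...]` the way BMP ranges can.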
Yes, you can use negative lookahead (at least in a prototype version to prove that astral support works), but it would be less efficient. If you can script the inversion at runtime (like XRegExp already does for BMP code points) and cache results, you don't need to take a filesize hit or any significant performance hit. But you probably can't script the inversion if the data is provided as pre-generated regexes. You'd need some lightweight representation for the underlying Unicode data that is then used to generate both the inverted and noninverted emulated character classes.
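As a rough illustration of "script the inversion at runtime and cache results", the sketch below inverts a sorted list of code point ranges and memoizes the result. The data layout (`ranges` as sorted, non-overlapping `[start, end]` pairs) and the names `invertRanges` and `getInverted` are assumptions for this example, not XRegExp's actual internal format.

```javascript
// Hypothetical lightweight representation: sorted, non-overlapping
// [start, end] code point ranges keyed by property name.
var ranges = {
  demo: [[0x61, 0x7A], [0x1D400, 0x1D7FF]]
};
var invertedCache = {};

// Return the complement of `list` within U+0000..U+10FFFF.
// Note: the complement is over all code points, so it includes
// the surrogate code points U+D800..U+DFFF.
function invertRanges(list) {
  var out = [], next = 0;
  list.forEach(function (r) {
    if (r[0] > next) {
      out.push([next, r[0] - 1]);
    }
    next = r[1] + 1;
  });
  if (next <= 0x10FFFF) {
    out.push([next, 0x10FFFF]);
  }
  return out;
}

// Invert on first use, then reuse the cached result.
function getInverted(name) {
  if (!Object.prototype.hasOwnProperty.call(invertedCache, name)) {
    invertedCache[name] = invertRanges(ranges[name]);
  }
  return invertedCache[name];
}

var inverted = getInverted("demo");
```

Both the inverted and the noninverted lists could then be fed to whatever generates the emulated character classes, so only one copy of the data needs to ship.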
It could be But there is a bigger issue, in this case, compared to One way to deal with this might be to add a new syntax token that matches entire character classes. You could then process the entire contents of the character class yourself, within your token handler. For this to work correctly (and not break other addons), you would need issue #18 to be implemented, so that other syntax token handlers can transform parts of the character class that they're interested in, after you're done with your transformation. The quickest way to get a feel for the challenges you are up against is probably to just go ahead and try to implement support for 21-bit Adding support for 21-bit
Actually, supporting 21-bit But there's a bright side: Because supporting 21-bit What I'd recommend is to create a new addon for 21-bit The implementation could be something as simple as this:

    (function (XRegExp) {
        "use strict";

        var unicode = {
            // ... your 21-bit Unicode data goes here ...
        };

        // Normalize a property name: drop hyphens, spaces, and underscores; lowercase
        function slug(name) {
            return name.replace(/[- _]+/g, "").toLowerCase();
        }

        XRegExp.install("extensibility");

        XRegExp.addToken(
            /\\([pP]){(\^?)([^}]*)}/,
            function (match) {
                var item = slug(match[3]),
                    // One full code point: a BMP code unit other than a high
                    // surrogate, a surrogate pair, or a lone high surrogate
                    codePoint = "[\\0-\\ud7ff\\udc00-\\uffff]|[\\ud800-\\udbff][\\udc00-\\udfff]|[\\ud800-\\udbff]";
                if (match[1] === "P" && match[2]) {
                    throw new SyntaxError("invalid double negation \\P{^");
                }
                if (!unicode.hasOwnProperty(item)) {
                    throw new SyntaxError("invalid or unknown Unicode property " + match[0]);
                }
                if (match[1] === "P" || match[2]) { // Negated
                    // 21-bit Unicode properties should always match a code point, to avoid
                    // any confusion about when they match a code unit vs code point
                    return "(?:(?!" + unicode[item] + ")(?:" + codePoint + "))";
                }
                return "(?:" + unicode[item] + ")";
            }
        );
    }(XRegExp));

(All code untested. Passing in If you want, you could explicitly specify the scope as I encourage you to submit a pull request with something along these lines. The new addon should not go in the Of course, you might have other ideas about how best to proceed...
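The negated branch of the snippet above (negative lookahead followed by a "match one code point" alternation) can be tried with plain `RegExp`, without XRegExp. In the sketch below, `prop` is a made-up stand-in for a property pattern (ASCII lowercase plus one astral range) and is not real Unicode category data; `codePoint` is copied from the snippet.

```javascript
// Copied from the snippet above: matches exactly one code point
// (or a lone surrogate) in a code-unit-based regex.
var codePoint = "[\\0-\\ud7ff\\udc00-\\uffff]|[\\ud800-\\udbff][\\udc00-\\udfff]|[\\ud800-\\udbff]";

// Hypothetical stand-in for a property pattern; NOT real category data.
var prop = "[a-z]|\\uD835[\\uDC00-\\uDFFF]";

// "Any code point that does not start a match of prop"
var negated = "(?:(?!" + prop + ")(?:" + codePoint + "))";

var anchored = new RegExp("^" + negated + "$");
var results = {
  astralInSet: anchored.test("\uD835\uDFCB"),  // inside prop, so rejected
  astralOutside: anchored.test("\uD83D\uDE00"), // outside prop, whole pair consumed
  bmpInSet: anchored.test("a"),
  bmpOutside: anchored.test("A")
};

// Each match is one full code point, never half of a surrogate pair.
var pieces = "A\uD83D\uDE00".match(new RegExp(negated, "g"));
```

The trade-off mentioned earlier in the thread still applies: the lookahead re-tests the property pattern at every position, so this is simpler than generating inverted ranges but less efficient.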
To be honest, I’d just put it in the
Yes, I think it would be confusing. I already dislike having to say that But you're probably right that a new root folder would be even more confusing. Feel free to include it in
There, I’ve turned this issue into a pull request. :) With XRegExp and

    XRegExp('^\\p{Ll}+$').test('\uD835\uDFCB'); // true
    XRegExp('^\\P{Ll}+$').test('\uD835\uDFCB'); // false

I’ve also created a repository for the JavaScript-compatible Unicode data and the scripts that generate it: http://git.io/unicode
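For readers checking the test string by hand: `'\uD835\uDFCB'` is two UTF-16 code units that together encode a single astral code point. A small sketch of the arithmetic follows (`fromSurrogates` is an illustrative helper, not an XRegExp function):

```javascript
// Illustrative helper: combine a high and a low surrogate into a code point.
function fromSurrogates(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

var s = "\uD835\uDFCB";
var cp = fromSurrogates(s.charCodeAt(0), s.charCodeAt(1));
var hex = "U+" + cp.toString(16).toUpperCase();
```

So the string has length 2 in JavaScript but represents one symbol, which is why a code-unit-based character class alone cannot match it.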
It's...it's beautiful. :-D Thanks for doing this! Really great stuff. I haven't looked over the code in depth yet, but I'll accept the pull request when I have an hour or so free to deal with this. Note that I'll probably rename the Also, in order to avoid confusion and unintended edge case behavior, I think it might be better to go against my earlier suggestion and let the new addon give XRegExp 2.1.0 (milestone issues) should be released this month, and I definitely plan for this to be in the new feature list. Make that "the best new feature".
+1 Sounds good. Please feel free to rename the files or change anything in the code — it’s your project! (Sorry, I hadn’t seen your edit until after I submitted the pull request.) I was wondering about the |
IMHO it should be included in
@walling The problem is that I wonder what @slevithan thinks about this.
I see! Personally I wouldn't mind to invoke
I just realized using
Support astral symbols in new Unicode Categories Astral addon
Exactly right. Astral support cannot be the default. Most people don't care about non-BMP code points anyway, and I can't take away support for use within character classes in the default handling. The astral version is also a bit less efficient. Of course, for the people that care about full 21-bit Unicode support, they tend to really care about it. So it's fantastic to have this new (optional) functionality.
He thinks it's bloody brilliant, yo. ;-) I've added basic support for this in commit e0cf69d.
Yup. I've added some basic tests in the same commit. BTW, I've renamed the So, your awesome work and ideas have already made this good enough to be included in the v2.1.0 package (though not yet as part of In order to merge optional astral support into the XRegExp Unicode Categories addon (and thus Edit: Details moved to the new issue #29. I'd leave this issue open, but GitHub doesn't allow reopening merged pull requests.
I’ve just finished work on a script that generates JavaScript-compatible regular expressions for Unicode categories including supplementary symbols: http://git.io/unicode
As explained on that page, the generated output is fully tested, too. (There’s a link to the test if you want to confirm this yourself.)
I’m afraid it won’t be possible to re-use the same “compressed” format you’re using now, given how the parts of the regular expressions that match surrogate pairs look. For example, here’s a small portion of the regex for the [Ll] category as per Unicode 6.1.0:
If you want this in XRegExp (I know it’s been on the roadmap for a while), let me know which format you decide on; I’ll happily tweak my script and submit a pull request.