Should we parse a codepoint at a time? #15

hildjj · 2021-04-15T10:18:08Z

hildjj
Apr 15, 2021
Maintainer

Right now, we parse one UTF-16 code unit at a time, which makes processing non-BMP text a challenge. Moving to a full codepoint at a time raises several issues:

1. Either drop support for older environments, polyfill these, or rely on babel to fill these for the web only:
   - String.prototype.codePointAt
   - String.fromCodePoint
   - RegExp u flag, or a different implementation for the places we use RegExp (looks like only for character classes to me)
2. The location information is in code units, not codepoints. I think this will more-or-less work if it stays in code units and all of the places we call charAt increment peg$currPos by 2 when a non-BMP codepoint is found. Note that the parameter of String.prototype.codePointAt is in code units, not codepoints.
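
A minimal sketch of the second point, assuming hypothetical helpers named `readCodePoint` and `codePointOffsets` (these are illustrative and not part of peggy's generated code). Positions stay in UTF-16 code units, and the cursor advances by 2 whenever the code point read is outside the BMP:

```javascript
// Hypothetical helper, for illustration only; not peggy's generated code.
// `pos` is an index in UTF-16 code units, which is what codePointAt expects.
function readCodePoint(input, pos) {
  const cp = input.codePointAt(pos);
  if (cp === undefined) {
    return null; // pos is at or past the end of the input
  }
  // Code points above U+FFFF occupy two code units (a surrogate pair).
  const width = cp > 0xffff ? 2 : 1;
  return { codePoint: cp, nextPos: pos + width };
}

// Walk a string one code point at a time, recording code-unit offsets.
function codePointOffsets(input) {
  const out = [];
  let pos = 0;
  while (pos < input.length) {
    const r = readCodePoint(input, pos);
    out.push([pos, r.codePoint]);
    pos = r.nextPos;
  }
  return out;
}
```

For the input "a", U+1F600, "b", the offsets come out as 0, 1, and 3, so existing location reporting in code units would be unaffected.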

Of those, the RegExp u flag is the only interesting one. While we could hand-roll support for [\u{0}-\u{10}], doing the same for [\p{Emoji_Presentation}] would be a pain to keep in sync with future versions of Unicode.

One approach might be to only turn on these features if the call to generate has a unicode: true property, and make it clear in the docs that this limits your browser compatibility.

StoneCypher · 2021-04-18T22:43:10Z

StoneCypher
Apr 18, 2021

There is already a working patch for this which follows the upcoming astral codepoint notation, which is probably the best thing to adopt

Using the regexp u flag is problematic, as it has different results in different engines. You'd see different things happen with the same code in chrome, node, and firefox. This isn't just about defects or support; they're also on different versions of the UCD.

> One approach might be to only turn on these features if the call to generate has a unicode: true property

This is an option, but, I think that it's probably not a good choice, on the following grounds:

- It means that things with that flag off behave differently in modern browsers than in older ones
- It means that things with that flag on behave differently between modern browsers
- When the language moves on, a lingering default will have the parsers subtly behave in an out-of-date fashion in a modern environment, in a way that almost nobody actually tests

3 replies

hildjj Apr 18, 2021
Maintainer Author

> Using the regexp u flag is problematic, as it has different results in different engines. You'd see different things happen with the same code in chrome, node, and firefox. This isn't just about defects or support; they're also on different versions of the UCD.

Do you mean that different browsers are based on different versions of Unicode, or something more subtle?

StoneCypher Apr 19, 2021

Both.

- The Unicode Character Database evolves over time. Mostly it's new characters being added, but not entirely. Errors get fixed on occasion, mostly in the character properties.
  1. Browsers generally have the UCD statically linked, so it'll be relative to when they were released
  2. For example, we're currently on Unicode 13, which is 14 months old. If you attempt to use the Yazidi language in an older browser, it will fail.
  3. Notable exception: Safari dynamically fetches UCD information from the operating system, which seems better to me
- Some of these changes are relevant to what we do, particularly the whitespace and combining literal changes
- Which version of the UCD is in use isn't the entire story
  1. The browser has to actually use the UCD correctly. None of them do
  2. Example: Edge didn't implement case folding until Edge 13. You will get different results in Edge 12.
  3. Example:
- What the u flag does isn't actually well defined
  1. It varies by version. No, babel does not control this for you.
  2. In ES 5.1, it will throw an exception https://262.ecma-international.org/5.1/#sec-15.10.4.1
  3. In ES2019, \u1234 is required to be the unicode encoding of the codepoint.
     1. This is an impossible requirement. ES2019 defines the character storage space to be 16 bits, and no Unicode encoding can store a codepoint in a single 16 bit space, due to things like surrogates and astrals.
  4. In ES2022, it's now a direct hex encoding and explicitly not a codepoint.
     1. This at least has an unambiguous parsing, but it can still express characters that cannot be stored, and behavior in that case is undefined.
  5. Every browser is forced to make its own choices about how to handle this. None of them agree, including the same browser across major versions.
- Regexes are, in general, a surreal nightmare
  1. Unicode is worse
  2. Put them together and ohno.jpg
  3. If you think the population of Earth uses enough Javascript to have searched through regex implementation bugs in browsers on the u flag?
     1. Nope.
- An implementation of parsing unicode which depends on an external unicode implementation that is outside our control is a shouting match waiting to happen
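
To make the point about 16-bit storage concrete, here is a small sketch of the standard UTF-16 surrogate-pair arithmetic. The function name `toUtf16` is made up for this illustration; the arithmetic itself is the usual one from the Unicode standard:

```javascript
// Split a code point into UTF-16 code units. `toUtf16` is an illustrative
// name, not an existing API.
function toUtf16(codePoint) {
  if (codePoint <= 0xffff) {
    return [codePoint]; // BMP: fits in one 16-bit unit
  }
  // Astral: subtract 0x10000, then split the remaining 20 bits in half.
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10); // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

// U+1F600 needs two code units, which is what a JS string actually stores.
const units = toUtf16(0x1f600);
const stored = [
  "\u{1F600}".charCodeAt(0),
  "\u{1F600}".charCodeAt(1),
];
```

Here `units` and `stored` hold the same two values, 0xD83D and 0xDE00.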

StoneCypher Apr 19, 2021

The u flag in regexes is a trash fire in every contemporary browser

This crashes Edge up to 13:

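var foo = "Change or cancel my flight booking";
var match = "a";
// The negative lookahead skips matches that sit inside an HTML tag; the 'u' flag is the relevant part here.
// The replacement is built by concatenation, since '${match}' in a quoted string is not interpolated.
foo.replace(new RegExp(match + "(?!([^<]+)?>)", 'gu'), '<span class="text-highlight">' + match + '</span>');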

That's safe from Edge 14, but, the results are different from 14 .. 46, then from 47 .. 71, then from 71 .. current (which I believe are syncs with chromium)

In 14..46 there's no case folding or lookahead/lookbehind; from 47 it gains case folding; from 71 it gains lookahead/lookbehind in u

I mean, imagine, you can take a working regex, add the unicode flag to it, and the match behavior changes?
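
Independently of any engine differences, the specification itself changes what a pattern means once `u` is added. A short sketch of two well-known cases (the variable names are only for illustration):

```javascript
const astral = "\u{1F600}"; // one code point, two UTF-16 code units

// Without `u`, `.` matches a single code unit, so it cannot span the pair.
const dotWithoutU = /^.$/.test(astral);
// With `u`, `.` matches a whole code point.
const dotWithU = /^.$/u.test(astral);

// Without `u`, `\u{3}` is read as the letter "u" repeated 3 times.
const braceWithoutU = /^\u{3}$/.test("uuu");
// With `u`, `\u{61}` is the code point U+0061, i.e. "a".
const braceWithU = /^\u{61}$/u.test("a");
```

The same source text for a pattern therefore gives different answers depending only on the flag.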

Node's behavior changes significantly after 8.6.0 when they switch underlying regex providers

Then like

HTML Pattern doesn't have u enabled in regexes 5 years later in Mozilla https://bugzilla.mozilla.org/show_bug.cgi?id=1227906
- This causes hilariously obscure bugs

Some of these bugs go so far back that the bug system they were on is terminated

- Regexp U support is flaky
  - https://connect.microsoft.com/IE/feedback/details/1102227/regexp-u-flag-support-is-flaky
  - archive.org doesn't have it
  - If I remember correctly, this was a big list of things where sometimes it was right and sometimes it was wrong and the driving factor wasn't clear
- [RegExp] case-insensitive matching misses characters chakra-core/ChakraCore#517 (comment)
  - No idea what this was, but this is IE6 era, so ... jeez

other things exist

Then god forbid you want this to match the behavior of a different programming language somewhere else (not that you're gonna get that without either, but, still)

lol wut time

On top of that, y'know, the thing this issue is about sort of breaks the idea of using the browser to do this

If the idea is to parse unicode correctly, we can't use the thing that's getting it wrong to get it right, you know?

Javascript itself, before you even consider the browser implementations, already has a severely broken relationship with Unicode. Just swimming through that mess in the mythical compliant browser would be agony compared to doing it directly

When you combine that with the fact that none of the browsers do it quite right, and that doing it through the browser would impose a maintenance burden on us to detect context and adapt accordingly?

I am pretty firmly of the opinion that this is one of those cases where doing it the hard way is doing it the easy way

The hard way isn't actually that hard. Sebastien Beyou @Seb35 already did it and it's like 30-40 lines

Invoking something external and untrustworthy to avoid a PR that small doesn't make sense to me

hildjj · 2021-04-19T21:44:08Z

hildjj
Apr 19, 2021
Maintainer Author

Neither of the patches correctly handles a peggy grammar that looks like:

foo = [\u{10}-\u{20}]

but they both accept it. That rule generates a regexp. Now, we could argue that it should NOT generate a regexp, and instead it should generate a switch statement or something, but that needs to be dealt with.
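
As a rough idea of what "not a regexp" could look like, here is a sketch that matches a class by comparing code points against ranges. `makeClassMatcher` and its return convention are hypothetical, invented for this example, and not what peggy generates today:

```javascript
// Hypothetical sketch; not peggy's generated code. A class such as
// [\u{10}-\u{20}] is represented as inclusive [low, high] code point ranges.
function makeClassMatcher(ranges, inverted = false) {
  return function matchAt(input, pos) {
    const cp = input.codePointAt(pos);
    if (cp === undefined) {
      return null; // end of input
    }
    const inClass = ranges.some(([lo, hi]) => cp >= lo && cp <= hi);
    if (inClass === inverted) {
      return null; // no match
    }
    // Return the number of code units consumed, so positions stay in code units.
    return cp > 0xffff ? 2 : 1;
  };
}

// Equivalent of the grammar rule: foo = [\u{10}-\u{20}]
const foo = makeClassMatcher([[0x10, 0x20]]);
// A class made only of astral code points, which a plain (non-u) regexp cannot express.
const astralClass = makeClassMatcher([[0x1f600, 0x1f64f]]);
```

A linear scan over ranges is the simplest choice; a generated switch or a binary search over sorted ranges would be alternatives if performance matters.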

2 replies

hildjj Apr 19, 2021
Maintainer Author

For reference, here is the regex code generation I mentioned:

peggy/lib/compiler/passes/generate-bytecode.js

Line 572 in 434c4b9

const regexp = "/^["

Mingun Apr 20, 2021

Originally David designed the class AST node to contain only a regexp, but in my opinion we should change that. See pegjs/pegjs#459 (comment) for the related discussion.

reverofevil · 2021-08-04T04:56:58Z

reverofevil
Aug 4, 2021

I think this is a big can of worms, and shouldn't be opened.

- There aren't many users for improved Unicode support;
- It will probably make performance worse for current users;
- It will probably add cognitive load for current users;

An alternative would be to ensure that someone who really needs it (to implement an emoji-aware markdown parser, for example) is able to do so by plugging in some JS-implemented function. The &{} syntax seems fine, but I'd also consider importing clauses from JS files. I think there was a discussion somewhere in issues of pegjs repo about it.

0 replies

hildjj · 2022-06-11T17:04:35Z

hildjj
Jun 11, 2022
Maintainer Author

See #290 for a prototype of some things we can add. I haven't looked at performance yet, but I don't expect this to be too bad.

0 replies

reverofevil · 2022-06-11T19:07:06Z

reverofevil
Jun 11, 2022

Now that I have several more PEG parser generators behind me, I have come up with another approach.

Parsers don't have to process only strings. There are at least:

- the default JS string, which has problems iterating over codepoints;
- its Iterable, which actually iterates over codepoints;
- typed arrays, with much better performance and the ability to parse binary data (mysql would benefit greatly from parsing packets in a less ad-hoc way);
- Buffer in Node.js, which serves the same purpose;
- string templates, where you might need to check that substitutions are done in the correct places in the string (think of safe substitutions into SQL queries);

In short, it might be possible to explicitly represent a set of operations on a string-like type, and provide different implementations of it to codegen.
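
One way to picture this, as a sketch only: every input type exposes the same small interface, and the parsing loop is written once against it. The interface shape (`length` plus `read(pos)` returning `{ value, width }` or `null`) and all names below are assumptions made for this example, not an existing peggy API:

```javascript
// Hypothetical interface: `read(pos)` returns { value, width } or null at
// end of input. All names here are illustrative.
function codeUnitInput(str) {
  return {
    length: str.length,
    read: (pos) =>
      pos < str.length ? { value: str.charCodeAt(pos), width: 1 } : null,
  };
}

function codePointInput(str) {
  return {
    length: str.length,
    read(pos) {
      const cp = str.codePointAt(pos);
      return cp === undefined
        ? null
        : { value: cp, width: cp > 0xffff ? 2 : 1 };
    },
  };
}

function byteInput(bytes) {
  return {
    length: bytes.length,
    read: (pos) =>
      pos < bytes.length ? { value: bytes[pos], width: 1 } : null,
  };
}

// A generic loop that works with any of the implementations above.
function countItems(input) {
  let pos = 0;
  let n = 0;
  for (let item = input.read(pos); item !== null; item = input.read(pos)) {
    pos += item.width;
    n += 1;
  }
  return n;
}
```

With the string "a" followed by U+1F600, `countItems(codeUnitInput(s))` counts three items while `countItems(codePointInput(s))` counts two, and the same loop runs unchanged over a `Uint8Array` via `byteInput`.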

2 replies

hildjj Jun 11, 2022
Maintainer Author

So, a stream of tokens?

stefnotch Dec 19, 2023

@hildjj Ideally yes, though it frequently ends up being "stream of tokens, plus backtracking support or lookahead support".
