UTF-8 is one of several sanctioned ways of encoding Unicode "code
points". But a code point is, at its heart, just a non-negative integer.
The mechanism of UTF-8 can't handle numbers 2**36 and higher. (And
Unicode and other standards artificially limit what numbers are
considered acceptable.)
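
The 2**36 ceiling follows from the bit layout. As a minimal sketch (the function name payload_bits is made up for illustration, and this is not code from the Perl source): a start byte that announces an n-byte sequence has n leading 1 bits and a 0 bit, leaving 7 - n payload bits, and each of the n - 1 continuation bytes carries 6 more. The longest form the scheme allows this way has start byte 0xFE, with 7 bytes in total.

```python
def payload_bits(total_bytes: int) -> int:
    """Payload bits in a multi-byte UTF-8-style sequence of 2..7 bytes."""
    if not 2 <= total_bytes <= 7:
        raise ValueError("multi-byte sequences in this scheme are 2..7 bytes long")
    start_bits = 7 - total_bytes               # bits left in the start byte
    continuation_bits = 6 * (total_bytes - 1)  # 6 bits per continuation byte
    return start_bits + continuation_bits


# Capacity for every sequence length the start-byte scheme can express.
capacities = {n: payload_bits(n) for n in range(2, 8)}

for n, bits in capacities.items():
    print(f"{n} bytes: {bits} bits, values below 2**{bits}")
```

The 7-byte case yields 36 bits, so 2**36 - 1 is the largest value the unextended mechanism can carry.
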
Perl decided to create an extension to UTF-8 for representing higher
values, so it could be used for any 64-bit number.
We now have a DFA that translates UTF-8 for numbers less than 2**36.
For larger numbers, a different mechanism (the older one) is used.
The DFA uses table lookup. To get it to accept larger numbers, the
table would have to be widened from U8 to U16 (and the numbers in it
recalculated).
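
To illustrate the general technique, here is a much simplified table-driven DFA. It is a sketch under stated assumptions, not Perl's actual table: the class numbering, state names, and the function is_well_formed are all hypothetical, and it only checks that start bytes are followed by the right number of continuation bytes (up to 4-byte sequences), without rejecting overlongs or surrogates. States are stored premultiplied by the number of byte classes so that state + class indexes the table directly. Because the table is built with bytes(), every entry must fit in 8 bits, which is the kind of limit that would presumably force a wider element type once more states are added.

```python
# Byte classes (hypothetical numbering for this sketch).
ASCII, CONT, START2, START3, START4, BAD = range(6)
N_CLASSES = 6


def classify(byte: int) -> int:
    """Map a byte value to its class."""
    if byte < 0x80:
        return ASCII
    if byte < 0xC0:
        return CONT
    if byte < 0xE0:
        return START2
    if byte < 0xF0:
        return START3
    if byte < 0xF8:
        return START4
    return BAD


# 256-entry lookup table: byte value -> class.
CLASS_OF = bytes(classify(b) for b in range(256))

# States, premultiplied by N_CLASSES so they double as row offsets.
ACCEPT, REJECT, NEED1, NEED2, NEED3 = (i * N_CLASSES for i in range(5))

# One row per state, one column per class, in the order declared above.
ROWS = [
    # ASCII   CONT    START2  START3  START4  BAD
    [ACCEPT, REJECT, NEED1,  NEED2,  NEED3,  REJECT],  # from ACCEPT
    [REJECT, REJECT, REJECT, REJECT, REJECT, REJECT],  # from REJECT
    [REJECT, ACCEPT, REJECT, REJECT, REJECT, REJECT],  # from NEED1
    [REJECT, NEED1,  REJECT, REJECT, REJECT, REJECT],  # from NEED2
    [REJECT, NEED2,  REJECT, REJECT, REJECT, REJECT],  # from NEED3
]

# bytes() raises ValueError if any entry exceeds 255 (the 8-bit limit).
TRANSITIONS = bytes(entry for row in ROWS for entry in row)


def is_well_formed(data: bytes) -> bool:
    """Structural check only: one table lookup per input byte."""
    state = ACCEPT
    for byte in data:
        state = TRANSITIONS[state + CLASS_OF[byte]]
        if state == REJECT:
            return False
    return state == ACCEPT
```

For example, is_well_formed("h\u00e9\u20ac".encode("utf-8")) is true, while a lone continuation byte such as b"\x80" or a truncated sequence such as b"\xe2\x82" is rejected.
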
The table is about 180 bytes now. Widening it wouldn't consume that
many more bytes in the grand scheme of things, but I don't know of
anyone actually using these extremely large numbers, so I haven't felt
that it is worth it.
But every so often, I get curious about what it would take, so this
commit sketches that out, for possible future reference.