Commit a67a651

perl.h: Add comments regarding UTF-8 conversion table
UTF-8 is one of several sanctioned ways of encoding Unicode "code points". But a code point is, at its heart, just a non-negative integer. The mechanism of UTF-8 can't handle numbers 2**36 and higher. (And Unicode and other standards artificially limit what numbers are considered acceptable.) Perl decided to create an extension to UTF-8 for representing higher values, so it could be used for any 64-bit number.

We now have a DFA that translates UTF-8 for numbers less than 2**36. For larger numbers, a different mechanism (the older one) is used.

The DFA uses table lookup. To get it to accept larger numbers, the table would have to be widened from U8 to U16 (and the numbers in it recalculated). The table is about 180 bytes now. Widening it wouldn't consume that many more bytes in the grand scheme of things, but I don't know of anyone actually using these extremely large numbers, so I haven't felt that it is worth it. But every so often, I get curious about what it would take, so this commit sketches that out, for possible future reference.
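The 2**36 limit mentioned above follows from how many payload bits each sequence length can carry. The following is a minimal sketch of that arithmetic, assuming the classic start-byte layout in which an n-byte sequence has (7 - n) payload bits in its start byte and 6 in each continuation byte. The helper names `payload_bits` and `max_code_point` are hypothetical and are not part of perl.h.

```c
#include <assert.h>
#include <stdint.h>

/* Payload bits carried by a sequence of the given total length (1..7 bytes).
 * Assumes the classic layout: a 1-byte sequence is 0xxxxxxx (7 bits); an
 * n-byte sequence has (7 - n) bits in the start byte and 6 bits in each of
 * its (n - 1) continuation bytes.  Hypothetical helper, not a perl.h API. */
static unsigned payload_bits(unsigned len)
{
    assert(len >= 1 && len <= 7);
    if (len == 1)
        return 7;
    return (7 - len) + 6 * (len - 1);
}

/* Largest code point representable in a sequence of that length. */
static uint64_t max_code_point(unsigned len)
{
    return (UINT64_C(1) << payload_bits(len)) - 1;
}
```

Under these assumptions a 7-byte sequence (start byte FE) has no payload bits in its start byte and 36 in its continuation bytes, so `max_code_point(7)` is 2**36 - 1. Anything larger needs the FF start byte that the commit discusses.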
1 parent 3f9ec1d commit a67a651

perl.h

Lines changed: 49 additions & 10 deletions
@@ -6559,14 +6559,6 @@ static U8 utf8d_C9[] = {
  * arbitrary class number chosen to not conflict with the above
  * classes, and to index into the remaining table
  *
- * It would make the code simpler if start byte FF could also be handled, but
- * doing so would mean adding two more classes (one from splitting 80 from 81,
- * and one for FF), and nodes for each of 6 new continuation bytes. The
- * current table has 436 entries; the new one would require 140 more = 576 (2
- * additional classes for each of the 10 existing nodes, and 20 for each of 6
- * new nodes. The array would have to be made U16 instead of U8, not worth it
- * for this rarely encountered case
- *
  * byte class
  * 00-7F 0 Always legal, single byte sequence
  * 80-81 7 Not legal immediately after start bytes E0 F0 F8 FC
@@ -6588,7 +6580,7 @@ static U8 utf8d_C9[] = {
  * FD 6 Legal start byte for six byte sequences
  * FE 17 Some sequences are overlong; others legal
  * (is 1 on 32-bit machines, since it overflows)
- * FF 1 Need to handle specially
+ * FF 1 Need to handle specially (explained below)
  */

 EXTCONST U8 PL_extended_utf8_dfa_tab[] = {
@@ -6670,7 +6662,54 @@ EXTCONST U8 PL_extended_utf8_dfa_tab[] = {
 /*N10*/ 1, 1, 1, 1, 1, 1, 1, 1,N5,N5,N5,N5,N5, 1, 1, 1, 1, 1,
 };

-/* And below is a version of the above table that accepts only strict UTF-8.
+/* The first portion of the table is 256 bytes. To keep the table declarable
+ * as U8, 256 is added to the index when accessing this portion at runtime.
+ * That addition could be eliminated if we were willing to declare the table
+ * U16 and adjust the numbers accordingly.
+ *
+ * FF is handled specially because otherwise the table would need to contain
+ * elements that occupy more than 8 bits and so the table would have to be
+ * declared as U16, so not worth it for this rarely encountered case. If you
+ * are tempted anyway, here is a sketch of what the nodes would look like:
+ * N0 The initial state, and final accepting one.
+ * N1 Any one continuation byte (80-BF) left. This is transitioned to
+ * immediately when the start byte indicates a two-byte sequence
+ * N2 Any two continuation bytes left.
+ * N3 Any three continuation bytes left.
+ * N4 Any four continuation bytes left.
+ * N5 Any five continuation bytes left.
+ * N6 Any six continuation bytes left.
+ * N7 Any seven continuation bytes left.
+ * N8 Any eight continuation bytes left.
+ * N9 Any nine continuation bytes left.
+ * N10 Any ten continuation bytes left.
+ * N11 Start byte is E0. Continuation bytes 80-9F are illegal (overlong);
+ * the other continuations transition to N1
+ * N12 Start byte is F0. Continuation bytes 80-8F are illegal (overlong);
+ * the other continuations transition to N2
+ * N13 Start byte is F8. Continuation bytes 80-87 are illegal (overlong);
+ * the other continuations transition to N3
+ * N14 Start byte is FC. Continuation bytes 80-83 are illegal (overlong);
+ * the other continuations transition to N4
+ * N15 Start byte is FE. Continuation bytes 80-81 are illegal (overlong);
+ * N16 Start byte is FF. Continuation byte 80 transitions to N17;
+ * the other continuations are illegal (overflow)
+ * N17 sequence so far is FF 80; continuation byte 80 transitions to N18;
+ * 81-9F to N10; the other continuations are illegal (overflow)
+ * N18 sequence so far is FF 80 80; continuation byte 80 transitions to N19;
+ * the other continuations transition to N9
+ * N19 sequence so far is FF 80 80 80; continuation byte 80 transitions to
+ * N20; the other continuations transition to N8
+ * N20 sequence so far is FF 80 80 80 80; continuation byte 80 transitions to
+ * N21; the other continuations transition to N7
+ * N21 sequence so far is FF 80 80 80 80 80; continuation bytes 81-BF
+ * transition to N6; 80 is illegal (overlong)
+ *
+ * A new class, the 19th, would have to be created for FF. Then the nodes
+ * portion of the table would have 21 * 19 = 399 slots. The current table has
+ * 18 classes and 10 nodes = 180 slots for the nodes portion. */
+
+/* Below is a version of the above table that accepts only strict UTF-8.
  * Hence no surrogates nor non-characters, nor non-Unicode. Thus, if the input
  * passes this dfa, it will be for a well-formed, non-problematic code point
  * that can be returned immediately.
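
The slot counts quoted in the new comment, and the reason the wider table would no longer fit in U8, can be checked with a few lines of C. This is only an illustrative sketch: the helper names are hypothetical, and treating each table entry as a row offset (row index times the number of classes) is an assumption inferred from the comment that 256 is added to the index at runtime, not something the diff states outright.

```c
#include <assert.h>
#include <stdint.h>

/* Size of the byte-class portion: one entry per possible byte value. */
enum { CLASS_PORTION = 256 };

/* Slots in the nodes portion: one row per node, one column per class. */
static unsigned node_slots(unsigned classes, unsigned nodes)
{
    return classes * nodes;
}

/* Total table entries: the class portion plus the nodes portion. */
static unsigned total_entries(unsigned classes, unsigned nodes)
{
    return CLASS_PORTION + node_slots(classes, nodes);
}

/* ASSUMPTION: an entry names its target node by that node's row offset
 * within the nodes portion, so the largest stored value is the offset of
 * the last row. */
static unsigned max_row_offset(unsigned classes, unsigned nodes)
{
    assert(nodes >= 1);
    return classes * (nodes - 1);
}

/* True if the value can be stored in an 8-bit table element. */
static int fits_in_u8(unsigned value)
{
    return value <= UINT8_MAX;
}
```

With these helpers, `node_slots(18, 10)` is 180 and `total_entries(18, 10)` is 436, matching the figures in the removed and added comments, while `node_slots(19, 21)` is 399. Under the row-offset assumption, the largest offset in the wider table would be 380, which `fits_in_u8` rejects. That is consistent with the comment's conclusion that handling FF in the DFA would require declaring the table as U16.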
