string_decoder: fix handling of malformed utf8 #7318

gagern · 2016-06-16T14:07:06Z

Checklist

make -j4 test (UNIX) or vcbuild test nosign (Windows) passes
a test and/or benchmark is included
the commit message follows commit guidelines

Affected core subsystem(s)

string_decoder

Description of change

There have been problems with utf8 decoding in cases where the input was invalid. Some cases would give different results depending on chunking, while others even led to exceptions. This commit simplifies the code in several ways, reducing the risk of breakage.

Most importantly, the text method is not only used for bulk conversion of a single chunk, but also for the conversion of the mini buffer lastChar while handling characters spanning chunks. That should make most of the problems due to incorrect handling of special cases disappear.

Secondly, that text method is now independent of encoding. The encoding-dependent method complete now has a single well-defined task: determine the buffer position up to which the input consists of complete characters. The actual conversion is handled in a central location, leading to a cleaner and leaner internal interface.

Thirdly, we no longer try to remember just how many bytes we'll need for the next complete character. We simply try to fill up the nextChar buffer and perform a conversion on that. This reduces the number of internal counter variables from two to one, namely partial which indicates the number of bytes currently stored in nextChar.
A possible drawback of this approach is that there is chance of degraded performance if input arrives one byte at a time and is from a script using long utf8 sequences. As a benefit, though, this allows us
to emit a U+FFFD replacement character sooner in cases where the first byte of an utf8 sequence is not followed by the expected number of continuation bytes.

Fixes: #7308

This is an alternative to #7310; merging this makes that obsolete.

There have been problems with utf8 decoding in cases where the input was invalid. Some cases would give different results depending on chunking, while others even led to exceptions. This commit simplifies the code in several ways, reducing the risk of breakage. Most importantly, the `text` method is not only used for bulk conversion of a single chunk, but also for the conversion of the mini buffer `lastChar` while handling characters spanning chunks. That should make most of the problems due to incorrect handling of special cases disappear. Secondly, that `text` method is now independent of encoding. The encoding-dependent method `complete` now has a single well-defined task: determine the buffer position up to which the input consists of complete characters. The actual conversion is handled in a central location, leading to a cleaner and leaner internal interface. Thirdly, we no longer try to remember just how many bytes we'll need for the next complete character. We simply try to fill up the `nextChar` buffer and perform a conversion on that. This reduces the number of internal counter variables from two to one, namely `partial` which indicates the number of bytes currently stored in `nextChar`. A possible drawback of this approach is that there is chance of degraded performance if input arrives one byte at a time and is from a script using long utf8 sequences. As a benefit, though, this allows us to emit a U+FFFD replacement character sooner in cases where the first byte of an utf8 sequence is not followed by the expected number of continuation bytes. Fixes: nodejs#7308

gagern · 2016-06-16T22:09:00Z

I have to concede that my changes appear to come at some performance cost, particularly for the base64 encodings. Comparing 588864d with its direct parent 1a1ff77 I see this benchmark comparison:

string_decoder/string-decoder-create.js encoding="ascii"    n="25000000": ..................... 588864d/node: 8513500   1a1ff77/node: 7738100 .... 10.02%
string_decoder/string-decoder-create.js encoding="utf8"     n="25000000": ..................... 588864d/node: 3286700   1a1ff77/node: 3590100 .... -8.45%
string_decoder/string-decoder-create.js encoding="utf-8"    n="25000000": ..................... 588864d/node: 3047000   1a1ff77/node: 3234800 .... -5.81%
string_decoder/string-decoder-create.js encoding="base64"   n="25000000": ..................... 588864d/node: 2107300   1a1ff77/node: 2335200 .... -9.76%
string_decoder/string-decoder-create.js encoding="ucs2"     n="25000000": ..................... 588864d/node: 2884700   1a1ff77/node: 3018600 .... -4.44%
string_decoder/string-decoder-create.js encoding="UTF-8"    n="25000000": ..................... 588864d/node: 2066800   1a1ff77/node: 2026100 ..... 2.01%
string_decoder/string-decoder-create.js encoding="AscII"    n="25000000": ..................... 588864d/node: 3038600   1a1ff77/node: 4011300 ... -24.25%
string_decoder/string-decoder-create.js encoding="UTF-16LE" n="25000000": ..................... 588864d/node: 1755500   1a1ff77/node: 1568800 .... 11.91%

string_decoder/string-decoder.js encoding="ascii"        inlen=  "32" chunk=  "16" n="2500000": 588864d/node: 1819400   1a1ff77/node: 2257700 ... -19.41%
string_decoder/string-decoder.js encoding="ascii"        inlen=  "32" chunk=  "64" n="2500000": 588864d/node: 4687000   1a1ff77/node: 4057000 .... 15.53%
string_decoder/string-decoder.js encoding="ascii"        inlen=  "32" chunk= "256" n="2500000": 588864d/node: 3747300   1a1ff77/node: 4322900 ... -13.31%
string_decoder/string-decoder.js encoding="ascii"        inlen=  "32" chunk="1024" n="2500000": 588864d/node: 4219900   1a1ff77/node: 3180500 .... 32.68%
string_decoder/string-decoder.js encoding="ascii"        inlen= "128" chunk=  "16" n="2500000": 588864d/node:  461860   1a1ff77/node:  487710 .... -5.30%
string_decoder/string-decoder.js encoding="ascii"        inlen= "128" chunk=  "64" n="2500000": 588864d/node: 1668000   1a1ff77/node: 1759200 .... -5.18%
string_decoder/string-decoder.js encoding="ascii"        inlen= "128" chunk= "256" n="2500000": 588864d/node: 3470300   1a1ff77/node: 3701400 .... -6.24%
string_decoder/string-decoder.js encoding="ascii"        inlen= "128" chunk="1024" n="2500000": 588864d/node: 3303200   1a1ff77/node: 3674700 ... -10.11%
string_decoder/string-decoder.js encoding="ascii"        inlen="1024" chunk=  "16" n="2500000": 588864d/node:   74869   1a1ff77/node:   59589 .... 25.64%
string_decoder/string-decoder.js encoding="ascii"        inlen="1024" chunk=  "64" n="2500000": 588864d/node:  235220   1a1ff77/node:  292320 ... -19.53%
string_decoder/string-decoder.js encoding="ascii"        inlen="1024" chunk= "256" n="2500000": 588864d/node:  820890   1a1ff77/node:  791060 ..... 3.77%
string_decoder/string-decoder.js encoding="ascii"        inlen="1024" chunk="1024" n="2500000": 588864d/node: 1974500   1a1ff77/node: 2066800 .... -4.47%
string_decoder/string-decoder.js encoding="ascii"        inlen="4096" chunk=  "16" n="2500000": 588864d/node:   18091   1a1ff77/node:   18726 .... -3.39%
string_decoder/string-decoder.js encoding="ascii"        inlen="4096" chunk=  "64" n="2500000": 588864d/node:   58023   1a1ff77/node:   57669 ..... 0.61%
string_decoder/string-decoder.js encoding="ascii"        inlen="4096" chunk= "256" n="2500000": 588864d/node:  226320   1a1ff77/node:  233570 .... -3.10%
string_decoder/string-decoder.js encoding="ascii"        inlen="4096" chunk="1024" n="2500000": 588864d/node:  496170   1a1ff77/node:  370910 .... 33.77%

string_decoder/string-decoder.js encoding="utf8"         inlen=  "32" chunk=  "16" n="2500000": 588864d/node: 1383600   1a1ff77/node: 1307200 ..... 5.85%
string_decoder/string-decoder.js encoding="utf8"         inlen=  "32" chunk=  "64" n="2500000": 588864d/node: 1612300   1a1ff77/node: 2009400 ... -19.76%
string_decoder/string-decoder.js encoding="utf8"         inlen=  "32" chunk= "256" n="2500000": 588864d/node: 2146500   1a1ff77/node: 2132100 ..... 0.68%
string_decoder/string-decoder.js encoding="utf8"         inlen=  "32" chunk="1024" n="2500000": 588864d/node: 2156900   1a1ff77/node: 2146100 ..... 0.50%
string_decoder/string-decoder.js encoding="utf8"         inlen= "128" chunk=  "16" n="2500000": 588864d/node:  282740   1a1ff77/node:  421970 ... -33.00%
string_decoder/string-decoder.js encoding="utf8"         inlen= "128" chunk=  "64" n="2500000": 588864d/node:  624260   1a1ff77/node:  812660 ... -23.18%
string_decoder/string-decoder.js encoding="utf8"         inlen= "128" chunk= "256" n="2500000": 588864d/node:  890720   1a1ff77/node:  927910 .... -4.01%
string_decoder/string-decoder.js encoding="utf8"         inlen= "128" chunk="1024" n="2500000": 588864d/node:  798830   1a1ff77/node: 1116100 ... -28.43%
string_decoder/string-decoder.js encoding="utf8"         inlen="1024" chunk=  "16" n="2500000": 588864d/node:   41316   1a1ff77/node:   43883 .... -5.85%
string_decoder/string-decoder.js encoding="utf8"         inlen="1024" chunk=  "64" n="2500000": 588864d/node:  104220   1a1ff77/node:  115900 ... -10.08%
string_decoder/string-decoder.js encoding="utf8"         inlen="1024" chunk= "256" n="2500000": 588864d/node:  121100   1a1ff77/node:  129600 .... -6.56%
string_decoder/string-decoder.js encoding="utf8"         inlen="1024" chunk="1024" n="2500000": 588864d/node:  130920   1a1ff77/node:  101800 .... 28.60%
string_decoder/string-decoder.js encoding="utf8"         inlen="4096" chunk=  "16" n="2500000": 588864d/node:    9906.4 1a1ff77/node:   10700 .... -7.42%
string_decoder/string-decoder.js encoding="utf8"         inlen="4096" chunk=  "64" n="2500000": 588864d/node:   24386   1a1ff77/node:   23261 ..... 4.84%
string_decoder/string-decoder.js encoding="utf8"         inlen="4096" chunk= "256" n="2500000": 588864d/node:   31953   1a1ff77/node:   32738 .... -2.40%
string_decoder/string-decoder.js encoding="utf8"         inlen="4096" chunk="1024" n="2500000": 588864d/node:   25762   1a1ff77/node:   32769 ... -21.38%

string_decoder/string-decoder.js encoding="base64-utf8"  inlen=  "32" chunk=  "16" n="2500000": 588864d/node:  236220   1a1ff77/node:  388520 ... -39.20%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen=  "32" chunk=  "64" n="2500000": 588864d/node:  718230   1a1ff77/node: 1320800 ... -45.62%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen=  "32" chunk= "256" n="2500000": 588864d/node:  715660   1a1ff77/node: 1671300 ... -57.18%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen=  "32" chunk="1024" n="2500000": 588864d/node:  728550   1a1ff77/node: 1307200 ... -44.27%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen= "128" chunk=  "16" n="2500000": 588864d/node:   67355   1a1ff77/node:  110650 ... -39.13%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen= "128" chunk=  "64" n="2500000": 588864d/node:  192100   1a1ff77/node:  406130 ... -52.70%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen= "128" chunk= "256" n="2500000": 588864d/node:  635020   1a1ff77/node:  991090 ... -35.93%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen= "128" chunk="1024" n="2500000": 588864d/node:  604890   1a1ff77/node: 1169200 ... -48.26%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen="1024" chunk=  "16" n="2500000": 588864d/node:    8373.9 1a1ff77/node:   14872 ... -43.69%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen="1024" chunk=  "64" n="2500000": 588864d/node:   28409   1a1ff77/node:   65263 ... -56.47%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen="1024" chunk= "256" n="2500000": 588864d/node:  114250   1a1ff77/node:  133340 ... -14.32%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen="1024" chunk="1024" n="2500000": 588864d/node:  215090   1a1ff77/node:  230280 .... -6.60%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen="4096" chunk=  "16" n="2500000": 588864d/node:    2635.2 1a1ff77/node:    3725 ... -29.26%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen="4096" chunk=  "64" n="2500000": 588864d/node:    9046.6 1a1ff77/node:   12260 ... -26.21%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen="4096" chunk= "256" n="2500000": 588864d/node:   24803   1a1ff77/node:   35052 ... -29.24%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen="4096" chunk="1024" n="2500000": 588864d/node:   50969   1a1ff77/node:   68712 ... -25.82%

string_decoder/string-decoder.js encoding="base64-ascii" inlen=  "32" chunk=  "16" n="2500000": 588864d/node:  286250   1a1ff77/node:  454890 ... -37.07%
string_decoder/string-decoder.js encoding="base64-ascii" inlen=  "32" chunk=  "64" n="2500000": 588864d/node:  742810   1a1ff77/node: 1451100 ... -48.81%
string_decoder/string-decoder.js encoding="base64-ascii" inlen=  "32" chunk= "256" n="2500000": 588864d/node:  748130   1a1ff77/node: 1311000 ... -42.94%
string_decoder/string-decoder.js encoding="base64-ascii" inlen=  "32" chunk="1024" n="2500000": 588864d/node:  986320   1a1ff77/node: 1376700 ... -28.35%
string_decoder/string-decoder.js encoding="base64-ascii" inlen= "128" chunk=  "16" n="2500000": 588864d/node:   78080   1a1ff77/node:  162880 ... -52.06%
string_decoder/string-decoder.js encoding="base64-ascii" inlen= "128" chunk=  "64" n="2500000": 588864d/node:  244550   1a1ff77/node:  455950 ... -46.36%
string_decoder/string-decoder.js encoding="base64-ascii" inlen= "128" chunk= "256" n="2500000": 588864d/node:  833560   1a1ff77/node: 1261000 ... -33.90%
string_decoder/string-decoder.js encoding="base64-ascii" inlen= "128" chunk="1024" n="2500000": 588864d/node:  662110   1a1ff77/node: 1075600 ... -38.44%
string_decoder/string-decoder.js encoding="base64-ascii" inlen="1024" chunk=  "16" n="2500000": 588864d/node:   10065   1a1ff77/node:   16471 ... -38.90%
string_decoder/string-decoder.js encoding="base64-ascii" inlen="1024" chunk=  "64" n="2500000": 588864d/node:   34553   1a1ff77/node:   59507 ... -41.93%
string_decoder/string-decoder.js encoding="base64-ascii" inlen="1024" chunk= "256" n="2500000": 588864d/node:  126240   1a1ff77/node:  169490 ... -25.52%
string_decoder/string-decoder.js encoding="base64-ascii" inlen="1024" chunk="1024" n="2500000": 588864d/node:  227460   1a1ff77/node:  262340 ... -13.30%
string_decoder/string-decoder.js encoding="base64-ascii" inlen="4096" chunk=  "16" n="2500000": 588864d/node:    2522.3 1a1ff77/node:    4446.2 . -43.27%
string_decoder/string-decoder.js encoding="base64-ascii" inlen="4096" chunk=  "64" n="2500000": 588864d/node:    9335.4 1a1ff77/node:   18852 ... -50.48%
string_decoder/string-decoder.js encoding="base64-ascii" inlen="4096" chunk= "256" n="2500000": 588864d/node:   27761   1a1ff77/node:   44906 ... -38.18%
string_decoder/string-decoder.js encoding="base64-ascii" inlen="4096" chunk="1024" n="2500000": 588864d/node:   61087   1a1ff77/node:   71490 ... -14.55%

string_decoder/string-decoder.js encoding="utf16le"      inlen=  "32" chunk=  "16" n="2500000": 588864d/node: 2034700   1a1ff77/node: 1962700 ..... 3.66%
string_decoder/string-decoder.js encoding="utf16le"      inlen=  "32" chunk=  "64" n="2500000": 588864d/node: 3102000   1a1ff77/node: 3954200 ... -21.55%
string_decoder/string-decoder.js encoding="utf16le"      inlen=  "32" chunk= "256" n="2500000": 588864d/node: 3990800   1a1ff77/node: 3504400 .... 13.88%
string_decoder/string-decoder.js encoding="utf16le"      inlen=  "32" chunk="1024" n="2500000": 588864d/node: 3144100   1a1ff77/node: 4183900 ... -24.85%
string_decoder/string-decoder.js encoding="utf16le"      inlen= "128" chunk=  "16" n="2500000": 588864d/node:  392810   1a1ff77/node:  496990 ... -20.96%
string_decoder/string-decoder.js encoding="utf16le"      inlen= "128" chunk=  "64" n="2500000": 588864d/node: 1709900   1a1ff77/node: 1692800 ..... 1.01%
string_decoder/string-decoder.js encoding="utf16le"      inlen= "128" chunk= "256" n="2500000": 588864d/node: 2617300   1a1ff77/node: 2468200 ..... 6.04%
string_decoder/string-decoder.js encoding="utf16le"      inlen= "128" chunk="1024" n="2500000": 588864d/node: 2183500   1a1ff77/node: 2823700 ... -22.67%
string_decoder/string-decoder.js encoding="utf16le"      inlen="1024" chunk=  "16" n="2500000": 588864d/node:   55287   1a1ff77/node:   63105 ... -12.39%
string_decoder/string-decoder.js encoding="utf16le"      inlen="1024" chunk=  "64" n="2500000": 588864d/node:  226390   1a1ff77/node:  177080 .... 27.85%
string_decoder/string-decoder.js encoding="utf16le"      inlen="1024" chunk= "256" n="2500000": 588864d/node:  458280   1a1ff77/node:  452330 ..... 1.32%
string_decoder/string-decoder.js encoding="utf16le"      inlen="1024" chunk="1024" n="2500000": 588864d/node:  692320   1a1ff77/node:  671270 ..... 3.14%
string_decoder/string-decoder.js encoding="utf16le"      inlen="4096" chunk=  "16" n="2500000": 588864d/node:   13839   1a1ff77/node:   15688 ... -11.79%
string_decoder/string-decoder.js encoding="utf16le"      inlen="4096" chunk=  "64" n="2500000": 588864d/node:   55990   1a1ff77/node:   51500 ..... 8.72%
string_decoder/string-decoder.js encoding="utf16le"      inlen="4096" chunk= "256" n="2500000": 588864d/node:  118300   1a1ff77/node:  113680 ..... 4.06%
string_decoder/string-decoder.js encoding="utf16le"      inlen="4096" chunk="1024" n="2500000": 588864d/node:  176230   1a1ff77/node:  166440 ..... 5.88%

I guess the degraded base64 performance might be due to extra method invocation from write to text to complete. I don't know how to avoid that without sacrificing the improved maintainability I was aiming for. I'm also not sure how Buffer.copy compares to copying these few bytes one at a time, whether changing that would make sense. I'm open to suggestions on how else performance might be improved.

mscdex · 2016-06-24T03:54:38Z

The original issue is fixed in 79ef3b6 while maintaining performance (and actually improving performance in split multi-byte character cases). I've also merged the tests you're provided here in 5e8cbd7.

nodejs-github-bot added the string_decoder Issues and PRs related to the string_decoder subsystem. label Jun 16, 2016

gagern mentioned this pull request Jun 16, 2016

string_decoder: fix bad utf8 character handling #7310

Merged

3 tasks

gagern force-pushed the string-decoder-simplify branch from dc94308 to d11e41c Compare June 16, 2016 14:16

gagern force-pushed the string-decoder-simplify branch from d11e41c to 588864d Compare June 16, 2016 14:23

mscdex closed this Jun 24, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

string_decoder: fix handling of malformed utf8 #7318

string_decoder: fix handling of malformed utf8 #7318

gagern commented Jun 16, 2016

gagern commented Jun 16, 2016

mscdex commented Jun 24, 2016

string_decoder: fix handling of malformed utf8 #7318

string_decoder: fix handling of malformed utf8 #7318

Conversation

gagern commented Jun 16, 2016

Checklist

Affected core subsystem(s)

Description of change

gagern commented Jun 16, 2016

mscdex commented Jun 24, 2016