Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

string_decoder: fix handling of malformed utf8 #7318

Closed
wants to merge 1 commit into from

Conversation

gagern
Copy link
Contributor

@gagern gagern commented Jun 16, 2016

Checklist
  • make -j4 test (UNIX) or vcbuild test nosign (Windows) passes
  • a test and/or benchmark is included
  • the commit message follows commit guidelines
Affected core subsystem(s)
  • string_decoder
Description of change

There have been problems with utf8 decoding in cases where the input was invalid. Some cases would give different results depending on chunking, while others even led to exceptions. This commit simplifies the code in several ways, reducing the risk of breakage.

Most importantly, the text method is not only used for bulk conversion of a single chunk, but also for the conversion of the mini buffer lastChar while handling characters spanning chunks. That should make most of the problems due to incorrect handling of special cases disappear.

Secondly, that text method is now independent of encoding. The encoding-dependent method complete now has a single well-defined task: determine the buffer position up to which the input consists of complete characters. The actual conversion is handled in a central location, leading to a cleaner and leaner internal interface.

Thirdly, we no longer try to remember just how many bytes we'll need for the next complete character. We simply try to fill up the nextChar buffer and perform a conversion on that. This reduces the number of internal counter variables from two to one, namely partial which indicates the number of bytes currently stored in nextChar.
A possible drawback of this approach is that there is chance of degraded performance if input arrives one byte at a time and is from a script using long utf8 sequences. As a benefit, though, this allows us
to emit a U+FFFD replacement character sooner in cases where the first byte of an utf8 sequence is not followed by the expected number of continuation bytes.

Fixes: #7308

This is an alternative to #7310; merging this makes that obsolete.

@nodejs-github-bot nodejs-github-bot added the string_decoder Issues and PRs related to the string_decoder subsystem. label Jun 16, 2016
There have been problems with utf8 decoding in cases where the input
was invalid.  Some cases would give different results depending on
chunking, while others even led to exceptions.  This commit simplifies
the code in several ways, reducing the risk of breakage.

Most importantly, the `text` method is not only used for bulk
conversion of a single chunk, but also for the conversion of the mini
buffer `lastChar` while handling characters spanning chunks.  That
should make most of the problems due to incorrect handling of special
cases disappear.

Secondly, that `text` method is now independent of encoding.  The
encoding-dependent method `complete` now has a single well-defined
task: determine the buffer position up to which the input consists of
complete characters.  The actual conversion is handled in a central
location, leading to a cleaner and leaner internal interface.

Thirdly, we no longer try to remember just how many bytes we'll need
for the next complete character.  We simply try to fill up the
`nextChar` buffer and perform a conversion on that.  This reduces the
number of internal counter variables from two to one, namely `partial`
which indicates the number of bytes currently stored in `nextChar`.
A possible drawback of this approach is that there is chance of
degraded performance if input arrives one byte at a time and is from a
script using long utf8 sequences.  As a benefit, though, this allows us
to emit a U+FFFD replacement character sooner in cases where the first
byte of an utf8 sequence is not followed by the expected number of
continuation bytes.

Fixes: nodejs#7308
@gagern
Copy link
Contributor Author

gagern commented Jun 16, 2016

I have to concede that my changes appear to come at some performance cost, particularly for the base64 encodings. Comparing 588864d with its direct parent 1a1ff77 I see this benchmark comparison:

string_decoder/string-decoder-create.js encoding="ascii"    n="25000000": ..................... 588864d/node: 8513500   1a1ff77/node: 7738100 .... 10.02%
string_decoder/string-decoder-create.js encoding="utf8"     n="25000000": ..................... 588864d/node: 3286700   1a1ff77/node: 3590100 .... -8.45%
string_decoder/string-decoder-create.js encoding="utf-8"    n="25000000": ..................... 588864d/node: 3047000   1a1ff77/node: 3234800 .... -5.81%
string_decoder/string-decoder-create.js encoding="base64"   n="25000000": ..................... 588864d/node: 2107300   1a1ff77/node: 2335200 .... -9.76%
string_decoder/string-decoder-create.js encoding="ucs2"     n="25000000": ..................... 588864d/node: 2884700   1a1ff77/node: 3018600 .... -4.44%
string_decoder/string-decoder-create.js encoding="UTF-8"    n="25000000": ..................... 588864d/node: 2066800   1a1ff77/node: 2026100 ..... 2.01%
string_decoder/string-decoder-create.js encoding="AscII"    n="25000000": ..................... 588864d/node: 3038600   1a1ff77/node: 4011300 ... -24.25%
string_decoder/string-decoder-create.js encoding="UTF-16LE" n="25000000": ..................... 588864d/node: 1755500   1a1ff77/node: 1568800 .... 11.91%

string_decoder/string-decoder.js encoding="ascii"        inlen=  "32" chunk=  "16" n="2500000": 588864d/node: 1819400   1a1ff77/node: 2257700 ... -19.41%
string_decoder/string-decoder.js encoding="ascii"        inlen=  "32" chunk=  "64" n="2500000": 588864d/node: 4687000   1a1ff77/node: 4057000 .... 15.53%
string_decoder/string-decoder.js encoding="ascii"        inlen=  "32" chunk= "256" n="2500000": 588864d/node: 3747300   1a1ff77/node: 4322900 ... -13.31%
string_decoder/string-decoder.js encoding="ascii"        inlen=  "32" chunk="1024" n="2500000": 588864d/node: 4219900   1a1ff77/node: 3180500 .... 32.68%
string_decoder/string-decoder.js encoding="ascii"        inlen= "128" chunk=  "16" n="2500000": 588864d/node:  461860   1a1ff77/node:  487710 .... -5.30%
string_decoder/string-decoder.js encoding="ascii"        inlen= "128" chunk=  "64" n="2500000": 588864d/node: 1668000   1a1ff77/node: 1759200 .... -5.18%
string_decoder/string-decoder.js encoding="ascii"        inlen= "128" chunk= "256" n="2500000": 588864d/node: 3470300   1a1ff77/node: 3701400 .... -6.24%
string_decoder/string-decoder.js encoding="ascii"        inlen= "128" chunk="1024" n="2500000": 588864d/node: 3303200   1a1ff77/node: 3674700 ... -10.11%
string_decoder/string-decoder.js encoding="ascii"        inlen="1024" chunk=  "16" n="2500000": 588864d/node:   74869   1a1ff77/node:   59589 .... 25.64%
string_decoder/string-decoder.js encoding="ascii"        inlen="1024" chunk=  "64" n="2500000": 588864d/node:  235220   1a1ff77/node:  292320 ... -19.53%
string_decoder/string-decoder.js encoding="ascii"        inlen="1024" chunk= "256" n="2500000": 588864d/node:  820890   1a1ff77/node:  791060 ..... 3.77%
string_decoder/string-decoder.js encoding="ascii"        inlen="1024" chunk="1024" n="2500000": 588864d/node: 1974500   1a1ff77/node: 2066800 .... -4.47%
string_decoder/string-decoder.js encoding="ascii"        inlen="4096" chunk=  "16" n="2500000": 588864d/node:   18091   1a1ff77/node:   18726 .... -3.39%
string_decoder/string-decoder.js encoding="ascii"        inlen="4096" chunk=  "64" n="2500000": 588864d/node:   58023   1a1ff77/node:   57669 ..... 0.61%
string_decoder/string-decoder.js encoding="ascii"        inlen="4096" chunk= "256" n="2500000": 588864d/node:  226320   1a1ff77/node:  233570 .... -3.10%
string_decoder/string-decoder.js encoding="ascii"        inlen="4096" chunk="1024" n="2500000": 588864d/node:  496170   1a1ff77/node:  370910 .... 33.77%

string_decoder/string-decoder.js encoding="utf8"         inlen=  "32" chunk=  "16" n="2500000": 588864d/node: 1383600   1a1ff77/node: 1307200 ..... 5.85%
string_decoder/string-decoder.js encoding="utf8"         inlen=  "32" chunk=  "64" n="2500000": 588864d/node: 1612300   1a1ff77/node: 2009400 ... -19.76%
string_decoder/string-decoder.js encoding="utf8"         inlen=  "32" chunk= "256" n="2500000": 588864d/node: 2146500   1a1ff77/node: 2132100 ..... 0.68%
string_decoder/string-decoder.js encoding="utf8"         inlen=  "32" chunk="1024" n="2500000": 588864d/node: 2156900   1a1ff77/node: 2146100 ..... 0.50%
string_decoder/string-decoder.js encoding="utf8"         inlen= "128" chunk=  "16" n="2500000": 588864d/node:  282740   1a1ff77/node:  421970 ... -33.00%
string_decoder/string-decoder.js encoding="utf8"         inlen= "128" chunk=  "64" n="2500000": 588864d/node:  624260   1a1ff77/node:  812660 ... -23.18%
string_decoder/string-decoder.js encoding="utf8"         inlen= "128" chunk= "256" n="2500000": 588864d/node:  890720   1a1ff77/node:  927910 .... -4.01%
string_decoder/string-decoder.js encoding="utf8"         inlen= "128" chunk="1024" n="2500000": 588864d/node:  798830   1a1ff77/node: 1116100 ... -28.43%
string_decoder/string-decoder.js encoding="utf8"         inlen="1024" chunk=  "16" n="2500000": 588864d/node:   41316   1a1ff77/node:   43883 .... -5.85%
string_decoder/string-decoder.js encoding="utf8"         inlen="1024" chunk=  "64" n="2500000": 588864d/node:  104220   1a1ff77/node:  115900 ... -10.08%
string_decoder/string-decoder.js encoding="utf8"         inlen="1024" chunk= "256" n="2500000": 588864d/node:  121100   1a1ff77/node:  129600 .... -6.56%
string_decoder/string-decoder.js encoding="utf8"         inlen="1024" chunk="1024" n="2500000": 588864d/node:  130920   1a1ff77/node:  101800 .... 28.60%
string_decoder/string-decoder.js encoding="utf8"         inlen="4096" chunk=  "16" n="2500000": 588864d/node:    9906.4 1a1ff77/node:   10700 .... -7.42%
string_decoder/string-decoder.js encoding="utf8"         inlen="4096" chunk=  "64" n="2500000": 588864d/node:   24386   1a1ff77/node:   23261 ..... 4.84%
string_decoder/string-decoder.js encoding="utf8"         inlen="4096" chunk= "256" n="2500000": 588864d/node:   31953   1a1ff77/node:   32738 .... -2.40%
string_decoder/string-decoder.js encoding="utf8"         inlen="4096" chunk="1024" n="2500000": 588864d/node:   25762   1a1ff77/node:   32769 ... -21.38%

string_decoder/string-decoder.js encoding="base64-utf8"  inlen=  "32" chunk=  "16" n="2500000": 588864d/node:  236220   1a1ff77/node:  388520 ... -39.20%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen=  "32" chunk=  "64" n="2500000": 588864d/node:  718230   1a1ff77/node: 1320800 ... -45.62%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen=  "32" chunk= "256" n="2500000": 588864d/node:  715660   1a1ff77/node: 1671300 ... -57.18%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen=  "32" chunk="1024" n="2500000": 588864d/node:  728550   1a1ff77/node: 1307200 ... -44.27%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen= "128" chunk=  "16" n="2500000": 588864d/node:   67355   1a1ff77/node:  110650 ... -39.13%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen= "128" chunk=  "64" n="2500000": 588864d/node:  192100   1a1ff77/node:  406130 ... -52.70%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen= "128" chunk= "256" n="2500000": 588864d/node:  635020   1a1ff77/node:  991090 ... -35.93%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen= "128" chunk="1024" n="2500000": 588864d/node:  604890   1a1ff77/node: 1169200 ... -48.26%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen="1024" chunk=  "16" n="2500000": 588864d/node:    8373.9 1a1ff77/node:   14872 ... -43.69%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen="1024" chunk=  "64" n="2500000": 588864d/node:   28409   1a1ff77/node:   65263 ... -56.47%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen="1024" chunk= "256" n="2500000": 588864d/node:  114250   1a1ff77/node:  133340 ... -14.32%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen="1024" chunk="1024" n="2500000": 588864d/node:  215090   1a1ff77/node:  230280 .... -6.60%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen="4096" chunk=  "16" n="2500000": 588864d/node:    2635.2 1a1ff77/node:    3725 ... -29.26%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen="4096" chunk=  "64" n="2500000": 588864d/node:    9046.6 1a1ff77/node:   12260 ... -26.21%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen="4096" chunk= "256" n="2500000": 588864d/node:   24803   1a1ff77/node:   35052 ... -29.24%
string_decoder/string-decoder.js encoding="base64-utf8"  inlen="4096" chunk="1024" n="2500000": 588864d/node:   50969   1a1ff77/node:   68712 ... -25.82%

string_decoder/string-decoder.js encoding="base64-ascii" inlen=  "32" chunk=  "16" n="2500000": 588864d/node:  286250   1a1ff77/node:  454890 ... -37.07%
string_decoder/string-decoder.js encoding="base64-ascii" inlen=  "32" chunk=  "64" n="2500000": 588864d/node:  742810   1a1ff77/node: 1451100 ... -48.81%
string_decoder/string-decoder.js encoding="base64-ascii" inlen=  "32" chunk= "256" n="2500000": 588864d/node:  748130   1a1ff77/node: 1311000 ... -42.94%
string_decoder/string-decoder.js encoding="base64-ascii" inlen=  "32" chunk="1024" n="2500000": 588864d/node:  986320   1a1ff77/node: 1376700 ... -28.35%
string_decoder/string-decoder.js encoding="base64-ascii" inlen= "128" chunk=  "16" n="2500000": 588864d/node:   78080   1a1ff77/node:  162880 ... -52.06%
string_decoder/string-decoder.js encoding="base64-ascii" inlen= "128" chunk=  "64" n="2500000": 588864d/node:  244550   1a1ff77/node:  455950 ... -46.36%
string_decoder/string-decoder.js encoding="base64-ascii" inlen= "128" chunk= "256" n="2500000": 588864d/node:  833560   1a1ff77/node: 1261000 ... -33.90%
string_decoder/string-decoder.js encoding="base64-ascii" inlen= "128" chunk="1024" n="2500000": 588864d/node:  662110   1a1ff77/node: 1075600 ... -38.44%
string_decoder/string-decoder.js encoding="base64-ascii" inlen="1024" chunk=  "16" n="2500000": 588864d/node:   10065   1a1ff77/node:   16471 ... -38.90%
string_decoder/string-decoder.js encoding="base64-ascii" inlen="1024" chunk=  "64" n="2500000": 588864d/node:   34553   1a1ff77/node:   59507 ... -41.93%
string_decoder/string-decoder.js encoding="base64-ascii" inlen="1024" chunk= "256" n="2500000": 588864d/node:  126240   1a1ff77/node:  169490 ... -25.52%
string_decoder/string-decoder.js encoding="base64-ascii" inlen="1024" chunk="1024" n="2500000": 588864d/node:  227460   1a1ff77/node:  262340 ... -13.30%
string_decoder/string-decoder.js encoding="base64-ascii" inlen="4096" chunk=  "16" n="2500000": 588864d/node:    2522.3 1a1ff77/node:    4446.2 . -43.27%
string_decoder/string-decoder.js encoding="base64-ascii" inlen="4096" chunk=  "64" n="2500000": 588864d/node:    9335.4 1a1ff77/node:   18852 ... -50.48%
string_decoder/string-decoder.js encoding="base64-ascii" inlen="4096" chunk= "256" n="2500000": 588864d/node:   27761   1a1ff77/node:   44906 ... -38.18%
string_decoder/string-decoder.js encoding="base64-ascii" inlen="4096" chunk="1024" n="2500000": 588864d/node:   61087   1a1ff77/node:   71490 ... -14.55%

string_decoder/string-decoder.js encoding="utf16le"      inlen=  "32" chunk=  "16" n="2500000": 588864d/node: 2034700   1a1ff77/node: 1962700 ..... 3.66%
string_decoder/string-decoder.js encoding="utf16le"      inlen=  "32" chunk=  "64" n="2500000": 588864d/node: 3102000   1a1ff77/node: 3954200 ... -21.55%
string_decoder/string-decoder.js encoding="utf16le"      inlen=  "32" chunk= "256" n="2500000": 588864d/node: 3990800   1a1ff77/node: 3504400 .... 13.88%
string_decoder/string-decoder.js encoding="utf16le"      inlen=  "32" chunk="1024" n="2500000": 588864d/node: 3144100   1a1ff77/node: 4183900 ... -24.85%
string_decoder/string-decoder.js encoding="utf16le"      inlen= "128" chunk=  "16" n="2500000": 588864d/node:  392810   1a1ff77/node:  496990 ... -20.96%
string_decoder/string-decoder.js encoding="utf16le"      inlen= "128" chunk=  "64" n="2500000": 588864d/node: 1709900   1a1ff77/node: 1692800 ..... 1.01%
string_decoder/string-decoder.js encoding="utf16le"      inlen= "128" chunk= "256" n="2500000": 588864d/node: 2617300   1a1ff77/node: 2468200 ..... 6.04%
string_decoder/string-decoder.js encoding="utf16le"      inlen= "128" chunk="1024" n="2500000": 588864d/node: 2183500   1a1ff77/node: 2823700 ... -22.67%
string_decoder/string-decoder.js encoding="utf16le"      inlen="1024" chunk=  "16" n="2500000": 588864d/node:   55287   1a1ff77/node:   63105 ... -12.39%
string_decoder/string-decoder.js encoding="utf16le"      inlen="1024" chunk=  "64" n="2500000": 588864d/node:  226390   1a1ff77/node:  177080 .... 27.85%
string_decoder/string-decoder.js encoding="utf16le"      inlen="1024" chunk= "256" n="2500000": 588864d/node:  458280   1a1ff77/node:  452330 ..... 1.32%
string_decoder/string-decoder.js encoding="utf16le"      inlen="1024" chunk="1024" n="2500000": 588864d/node:  692320   1a1ff77/node:  671270 ..... 3.14%
string_decoder/string-decoder.js encoding="utf16le"      inlen="4096" chunk=  "16" n="2500000": 588864d/node:   13839   1a1ff77/node:   15688 ... -11.79%
string_decoder/string-decoder.js encoding="utf16le"      inlen="4096" chunk=  "64" n="2500000": 588864d/node:   55990   1a1ff77/node:   51500 ..... 8.72%
string_decoder/string-decoder.js encoding="utf16le"      inlen="4096" chunk= "256" n="2500000": 588864d/node:  118300   1a1ff77/node:  113680 ..... 4.06%
string_decoder/string-decoder.js encoding="utf16le"      inlen="4096" chunk="1024" n="2500000": 588864d/node:  176230   1a1ff77/node:  166440 ..... 5.88%

I guess the degraded base64 performance might be due to extra method invocation from write to text to complete. I don't know how to avoid that without sacrificing the improved maintainability I was aiming for. I'm also not sure how Buffer.copy compares to copying these few bytes one at a time, whether changing that would make sense. I'm open to suggestions on how else performance might be improved.

@mscdex
Copy link
Contributor

mscdex commented Jun 24, 2016

The original issue is fixed in 79ef3b6 while maintaining performance (and actually improving performance in split multi-byte character cases). I've also merged the tests you're provided here in 5e8cbd7.

@mscdex mscdex closed this Jun 24, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
string_decoder Issues and PRs related to the string_decoder subsystem.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

string_decoder: RangeError introduced by 6.2.1
3 participants