string_decoder: fix bad utf8 character handling #7310

mscdex · 2016-06-15T16:48:03Z

Checklist

make -j4 test (UNIX) or vcbuild test nosign (Windows) passes
a test and/or benchmark is included
the commit message follows commit guidelines

Affected core subsystem(s)

string_decoder

Description of change

This commit fixes an issue when extra utf8 continuation bytes appear at the end of a chunk of data, causing miscalculations to be made when checking how many bytes are needed to decode a complete character.

Fixes: #7308
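
To make the kind of input described here concrete, the following is a minimal sketch that uses only the public `StringDecoder` API (`write()` and `end()`). The helper name `decodeChunks` and the sample bytes are illustrative assumptions rather than code from this PR. The malformed case is printed instead of asserted, since the exact replacement output may differ between Node.js versions.

```javascript
const { StringDecoder } = require('string_decoder');

// Hypothetical helper: feed each chunk to one decoder, then flush it.
function decodeChunks(chunks) {
  const decoder = new StringDecoder('utf8');
  let out = '';
  for (const chunk of chunks)
    out += decoder.write(chunk);
  return out + decoder.end();
}

// Well-formed: U+20AC (E2 82 AC) split across two chunks is held back
// until the final continuation byte arrives.
const euro = decodeChunks([Buffer.from('e282', 'hex'),
                           Buffer.from('ac', 'hex')]);
console.log(euro === '\u20ac'); // true

// Malformed: a complete two-byte character (C9 B5) followed by a stray
// continuation byte (A9) at the end of the chunk.
const stray = decodeChunks([Buffer.from('c9b5a9', 'hex')]);
console.log(JSON.stringify(stray));
```

The second call is the shape of input the description talks about: the trailing byte looks like part of a character but does not belong to one.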

mscdex · 2016-06-15T16:51:43Z

/cc @gagern

cjihrig · 2016-06-15T16:51:54Z

lib/string_decoder.js

      self.lastNeed = nb + 1 - (buf.length - j);
+      if (self.lastNeed < 0)
+        self.lastNeed = nb = 0;


cjihrig: Could you make these two assignments on separate lines here and below?

mscdex: I find this way describes the intention more succinctly.

indutny: This is against the code style though, not sure why we don't lint it.

silverwind: @indutny I'm not aware of a fitting rule in ESLint. Filed eslint/eslint#6424 for a rule proposal.

cjihrig · 2016-06-15T16:52:11Z

LGTM pending CI and a style fix.

mscdex · 2016-06-15T17:33:04Z

CI: https://ci.nodejs.org/job/node-test-pull-request/3004/

Trott · 2016-06-15T17:52:56Z

test/parallel/test-string-decoder.js

@@ -55,9 +55,14 @@ assert.strictEqual(decoder.write(Buffer.from('\ufffd\ufffd\ufffd')),
 assert.strictEqual(decoder.end(), '');

 decoder = new StringDecoder('utf8');
-assert.strictEqual(decoder.write(Buffer.from('efbfbde2', 'hex')), '\ufffd');
+assert.strictEqual(decoder.write(Buffer.from('EFBFBDE2', 'hex')), '\ufffd');


Trott: Why this change? And if needed, maybe leave the old test too?

mscdex: It's just making capitalization consistent with all of the other hex buffers.

Trott: I'd be inclined to leave it just to make sure that lower-case is tested. (I'm sure that under the hood, string_decoder delegates to buffer or whatever. So it's probably far-fetched that we'd be in a situation where lower-case was broken but upper-case still worked. But it doesn't cost anything to have that tiny bit of extra coverage and the change seems aesthetic only.)

While I have an opinion, I don't feel strongly enough about this to protest or anything. If you do feel strongly, feel free to leave it as you have it.

mscdex: I don't understand. The hex string is being passed to the Buffer creation function (Buffer.from()), not to the StringDecoder functions. That function has plenty of lowercase and uppercase hex string tests in the appropriate test-buffer* test files.

Trott: @mscdex Argh! You're right, of course. Ignore me.
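
The point settled in this exchange can be checked directly. This short sketch (the variable names are illustrative) shows that the case of the hex digits is resolved by `Buffer.from()` before `StringDecoder` sees any bytes:

```javascript
// Buffer.from(..., 'hex') parses hex digits case-insensitively, so both
// spellings produce the same four bytes.
const lower = Buffer.from('efbfbde2', 'hex');
const upper = Buffer.from('EFBFBDE2', 'hex');

console.log(lower.equals(upper)); // true
console.log(lower.length);        // 4
```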

mscdex · 2016-06-15T18:09:27Z

CI is green. /cc @nodejs/collaborators

indutny · 2016-06-15T20:30:08Z

One comment, otherwise LGTM.

gagern · 2016-06-16T09:04:37Z

I did some more tests in gagern/node@f3d4f2a. Observations:

Input EC41 decodes incorrectly if decoded in two chunks. In that case, the ASCII byte gets turned into a U+FFFD as well. This is a regression compared to 6.2.0, too.
(At some point I had thought that EDA0B5EDB08D, which is CESU-8 for U+1D40D, was behaving incorrectly as well. But I guess I got this wrong, the result is six U+FFFD consistently.)

gagern · 2016-06-16T10:19:49Z

I tried some more systematic testing in gagern/node@2e4bff1: using the bytes 00 41 b8 cc e2 f0 f1 fb as a hopefully representative set of input bytes likely to cause different behavior, I tried all byte sequences of length up to four and checked whether chunking had any effect. For the following byte sequences it had:

(e2|f0|f1)(00|41)
ccccb8
f[01](b8|cc)(00|41)
f[01]ccb8
f[01]fb(00|41)
(cc|e2)e2b8b8
e2(b8|e2)ccb8
e2fbcc(01|b8)

I'm not sure how much of this can be attributed to the same bug. I'm also not sure how much of this is a regression compared to 6.2.0. If you can afford the time, it might make sense to include some variant of this systematic testing code. If not, then at least try out the indicated problematic cases, and perhaps add them to the suite to guard against regressions. Is there some special test suite where lengthy tests can be placed, to be run occasionally without slowing down the main test suite?
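
The kind of systematic check described above could be sketched roughly as follows. This is not the code from gagern/node@2e4bff1. The function names (`decodeChunks`, `sequences`, `findChunkingDifferences`) are hypothetical, and as a simplifying assumption only two-way splits are tried rather than every possible chunking.

```javascript
const { StringDecoder } = require('string_decoder');

// Byte set quoted in the comment above.
const BYTES = [0x00, 0x41, 0xb8, 0xcc, 0xe2, 0xf0, 0xf1, 0xfb];

// Decode a list of chunks with a single decoder instance.
function decodeChunks(chunks) {
  const decoder = new StringDecoder('utf8');
  let out = '';
  for (const chunk of chunks)
    out += decoder.write(chunk);
  return out + decoder.end();
}

// Yield every sequence over BYTES with length 1..maxLen.
function* sequences(maxLen, prefix = []) {
  if (prefix.length > 0)
    yield prefix;
  if (prefix.length < maxLen) {
    for (const b of BYTES)
      yield* sequences(maxLen, prefix.concat(b));
  }
}

// Return hex strings of sequences whose decoded output changes when the
// input is split into two chunks at some position.
function findChunkingDifferences(maxLen) {
  const bad = [];
  for (const seq of sequences(maxLen)) {
    const buf = Buffer.from(seq);
    const whole = decodeChunks([buf]);
    for (let cut = 1; cut < buf.length; cut++) {
      const split = decodeChunks([buf.subarray(0, cut), buf.subarray(cut)]);
      if (split !== whole) {
        bad.push(buf.toString('hex'));
        break;
      }
    }
  }
  return bad;
}

console.log(findChunkingDifferences(4));
```

With eight candidate bytes and a maximum length of four there are 4680 sequences, so a check of this size is cheap enough to run alongside ordinary tests.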

I'm also somewhat concerned about the use of Buffer.allocUnsafe in there. Is there some machinery to check whether some given code reads some buffer content without writing it first, so that it could be affected by the random data potentially included in such a buffer?

gagern · 2016-06-16T14:08:53Z

@mscdex would you care to have a look at #7318? The fact that even after your fix there are so many possible breakages here made me have a closer look at the code, and come up with an alternate solution. I'd value your opinion on that.

mscdex · 2016-06-18T21:47:05Z

I've now added fixes for the other test cases that @gagern had created, and included those test cases as well.

CI: https://ci.nodejs.org/job/node-test-pull-request/3025/

CI is green except for an unrelated test failure on Windows.

jasnell · 2016-06-20T16:06:47Z

LGTM

mscdex · 2016-06-20T16:12:22Z

@cjihrig @indutny Does this still LGTY after the additional changes?

cjihrig · 2016-06-20T16:24:52Z

Yea, LGTM. CI seemed happy.

indutny · 2016-06-21T16:40:55Z

lib/string_decoder.js

  if (nb >= 0) {
    if (nb > 0)
-      self.lastNeed = nb + 1 - (buf.length - j);
+      self.lastNeed = nb - 2;


indutny: Couldn't it still go negative here?

mscdex: No, because if execution has reached here, nb would have to be a 2, 3, or 4-byte character indicator. So self.lastNeed would range from 0 to 2.

indutny: Oh right, nb can't be 1.
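
The arithmetic behind this answer can be illustrated with a small sketch. The helper `leadByteLength` is a hypothetical stand-in written for this illustration, not necessarily the function in lib/string_decoder.js. It assumes a classifier that returns 0 for ASCII, the character length for a multi-byte lead byte, and -1 for anything else, so that 1 is never returned.

```javascript
// Hypothetical helper: classify a byte by its high bits.
// Returns 0 for ASCII, 2/3/4 for a multi-byte lead byte, and -1 otherwise
// (continuation bytes and bytes that cannot start a UTF-8 character).
function leadByteLength(byte) {
  if (byte <= 0x7f) return 0;
  if (byte >> 5 === 0x06) return 2; // 110xxxxx
  if (byte >> 4 === 0x0e) return 3; // 1110xxxx
  if (byte >> 3 === 0x1e) return 4; // 11110xxx
  return -1;
}

// When the lead byte is the second-to-last byte of a chunk, two bytes of
// the character are already present, so nb - 2 more are still needed.
for (const lead of [0xcc, 0xe2, 0xf0]) {
  const nb = leadByteLength(lead);
  console.log(lead.toString(16), nb, nb - 2);
}
// prints "cc 2 0", "e2 3 1", "f0 4 2" (one per line)
```

Because the branch in the diff is guarded by `nb > 0`, only the values 2, 3 and 4 reach the subtraction, which gives the 0 to 2 range stated above.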

mscdex · 2016-06-24T03:00:54Z

One last CI before landing, just in case: https://ci.nodejs.org/job/node-test-pull-request/3064/

EDIT: CI is green, but a few CI nodes are offline at the moment. Still enough coverage IMHO.

This commit fixes an issue when extra utf8 continuation bytes appear at the end of a chunk of data, causing miscalculations to be made when checking how many bytes are needed to decode a complete character.

Fixes: nodejs#7308
PR-URL: nodejs#7310
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Fedor Indutny <fedor.indutny@gmail.com>

MylesBorins · 2016-07-11T23:47:17Z

@mscdex lts?

mscdex · 2016-07-12T00:53:33Z

@thealphanerd This PR is only relevant if #6777 is also landed, however that PR is currently marked as dont-land-on-v4.x to give the rewrite some time in non-LTS. If others are comfortable with the original StringDecoder rewrite now landing on LTS, then this would need to land also.

MylesBorins · 2016-07-12T01:12:36Z

@mscdex I'm going to mark this as dont-land for now as well. Would you like us to keep an issue in @nodejs/lts to keep track of the commits necessary to backport the string_decoder changes?

mscdex · 2016-07-12T02:13:09Z

@thealphanerd Sure.

gagern · 2016-07-14T12:02:20Z

Will the 6.3.1 release be derived from current master, and hence contain this fix here? Or is the fix scheduled for 6.4.0?

targos · 2016-07-14T12:10:10Z

@gagern 6.3.0 already has the fix.

gagern · 2016-07-14T12:37:42Z

Ah thanks, @targos. I apparently only searched the changelog for #7308 and the 5e8cbd7 in master, so I missed the 962ac37 for #7310 that is listed. Sorry.

mscdex added the string_decoder Issues and PRs related to the string_decoder subsystem. label Jun 15, 2016

mscdex force-pushed the string-decoder-fix-regression branch 3 times, most recently from 164ca60 to 541bcc9 Compare June 15, 2016 16:50

cjihrig reviewed Jun 15, 2016

Trott reviewed Jun 15, 2016

mscdex force-pushed the string-decoder-fix-regression branch from 541bcc9 to 52462c6 Compare June 15, 2016 21:54

gagern added a commit to gagern/node that referenced this pull request Jun 16, 2016

string_decoder: Some more tests for invalid input

f3d4f2a

I thought of this while looking at nodejs#7310.

gagern mentioned this pull request Jun 16, 2016

string_decoder: fix handling of malformed utf8 #7318

Closed

3 tasks

mscdex force-pushed the string-decoder-fix-regression branch from 52462c6 to 023bb56 Compare June 18, 2016 19:37

mscdex force-pushed the string-decoder-fix-regression branch from 023bb56 to 985bd75 Compare June 18, 2016 19:38

mscdex force-pushed the string-decoder-fix-regression branch from 985bd75 to bafb77f Compare June 18, 2016 21:50

indutny reviewed Jun 21, 2016

mscdex and others added 2 commits June 23, 2016 23:18

test: add more UTF-8 StringDecoder tests

5e8cbd7

PR-URL: nodejs#7310 Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Fedor Indutny <fedor.indutny@gmail.com>

mscdex force-pushed the string-decoder-fix-regression branch from d3cea4b to 5e8cbd7 Compare June 24, 2016 03:50

mscdex merged commit 5e8cbd7 into nodejs:master Jun 24, 2016

mscdex deleted the string-decoder-fix-regression branch June 24, 2016 03:55

Fishrock123 pushed a commit that referenced this pull request Jun 27, 2016

test: add more UTF-8 StringDecoder tests

bf18e04

PR-URL: #7310 Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Fedor Indutny <fedor.indutny@gmail.com>

Fishrock123 mentioned this pull request Jun 27, 2016

Propose v6.3.0 #7441

Merged

Fishrock123 pushed a commit that referenced this pull request Jul 5, 2016

test: add more UTF-8 StringDecoder tests

8f1733c

PR-URL: #7310 Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Fedor Indutny <fedor.indutny@gmail.com>

Fishrock123 mentioned this pull request Jul 5, 2016

Propose v6.3.0 (v2) #7550

Merged

MylesBorins mentioned this pull request Jul 11, 2016

Audit commits not found on v4.x-staging #7661

Closed

MylesBorins added the lts-watch-v4.x label Jul 11, 2016

MylesBorins added dont-land-on-v4.x and removed lts-watch-v4.x labels Jul 12, 2016
