-
Notifications
You must be signed in to change notification settings - Fork 30.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
string_decoder: fix bad utf8 character handling #7310
Conversation
164ca60
to
541bcc9
Compare
/cc @gagern |
self.lastNeed = nb + 1 - (buf.length - j); | ||
if (self.lastNeed < 0) | ||
self.lastNeed = nb = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you make these two assignments on separate lines here and below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this way describes the intention more succinctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is against the code style though, not sure why we don't lint it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@indutny I'm not aware of a fitting rule in ESLint. Filed eslint/eslint#6424 for a rule proposal.
LGTM pending CI and a style fix. |
@@ -55,9 +55,14 @@ assert.strictEqual(decoder.write(Buffer.from('\ufffd\ufffd\ufffd')), | |||
assert.strictEqual(decoder.end(), ''); | |||
|
|||
decoder = new StringDecoder('utf8'); | |||
assert.strictEqual(decoder.write(Buffer.from('efbfbde2', 'hex')), '\ufffd'); | |||
assert.strictEqual(decoder.write(Buffer.from('EFBFBDE2', 'hex')), '\ufffd'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change? And if needed, maybe leave the old test too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just making capitalization consistent with all of the other hex buffers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be inclined to leave it just to make sure that lower-case is tested. (I'm sure that under the hood, string_decoder
delegates to buffer
or whatever. So it's probably far-fetched that we'd be in a situation where lower-case was broken but upper-case still worked. But it doesn't cost anything to have that tiny bit of extra coverage and the change seems aesthetic only.)
While I have an opinion, I don't feel strongly enough about this to protest or anything. If you do feel strongly, feel free to leave it as you have it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand. The hex string is being passed to the Buffer creation function (Buffer.from()
), not to the StringDecoder functions. That function has plenty of lowercase and uppercase hex string tests in the appropriate test-buffer*
test files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mscdex Argh! You're right, of course. Ignore me.
CI is green. /cc @nodejs/collaborators |
One comment, otherwise LGTM. |
541bcc9
to
52462c6
Compare
I thought of this while looking at nodejs#7310.
I did some more tests in gagern/node@f3d4f2a. Observations:
|
I tried some more systematic testing in gagern/node@2e4bff1: using the bytes
I'm not sure how much of this can be attributed to the same bug. I'm also not sure how much of this is a regression compared to 6.2.0. If you can afford the time, it might make sense to include some variant of this systematic testing code. If not, then at least try out the indicated problematic cases, and perhaps add them to the suite to guard against regressions. Is there some special test suite where lengthy tests can be placed, to be run occasionally without slowing down the main test suite? I'm also somewhat concerned about the use of |
52462c6
to
023bb56
Compare
023bb56
to
985bd75
Compare
I've now included fixes for the other test cases that @gagern had created, which I have now included. CI: https://ci.nodejs.org/job/node-test-pull-request/3025/ CI is green except for an unrelated test failure on Windows. |
985bd75
to
bafb77f
Compare
LGTM |
Yea, LGTM. CI seemed happy. |
if (nb >= 0) { | ||
if (nb > 0) | ||
self.lastNeed = nb + 1 - (buf.length - j); | ||
self.lastNeed = nb - 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't it still go negative here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because if execution has reached here, nb
would have to be a 2, 3, or 4-byte character indicator. So self.lastNeed
would range from 0 to 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right, nb
can't be 1.
One last CI before landing, just in case: https://ci.nodejs.org/job/node-test-pull-request/3064/ EDIT: CI is green, but a few CI nodes are offline at the moment. Still enough coverage IMHO. |
This commit fixes an issue when extra utf8 continuation bytes appear at the end of a chunk of data, causing miscalculations to be made when checking how many bytes are needed to decode a complete character. Fixes: nodejs#7308 PR-URL: nodejs#7310 Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Fedor Indutny <fedor.indutny@gmail.com>
PR-URL: nodejs#7310 Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Fedor Indutny <fedor.indutny@gmail.com>
d3cea4b
to
5e8cbd7
Compare
This commit fixes an issue when extra utf8 continuation bytes appear at the end of a chunk of data, causing miscalculations to be made when checking how many bytes are needed to decode a complete character. Fixes: #7308 PR-URL: #7310 Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Fedor Indutny <fedor.indutny@gmail.com>
PR-URL: #7310 Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Fedor Indutny <fedor.indutny@gmail.com>
This commit fixes an issue when extra utf8 continuation bytes appear at the end of a chunk of data, causing miscalculations to be made when checking how many bytes are needed to decode a complete character. Fixes: #7308 PR-URL: #7310 Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Fedor Indutny <fedor.indutny@gmail.com>
PR-URL: #7310 Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Fedor Indutny <fedor.indutny@gmail.com>
@mscdex lts? |
@thealphanerd This PR is only relevant if #6777 is also landed, however that PR is currently marked as |
@mscdex I'm going to mark this as dont-land for now as well. Would you like us to keep an issue in @nodejs/lts to keep track of the commits neccessary to backport the string_decoder changes? |
@thealphanerd Sure. |
Will the 6.3.1 release be derived from current master, and hence contain this fix here? Or is the fix scheduled for 6.4.0? |
@gagern 6.3.0 already has the fix. |
Checklist
make -j4 test
(UNIX) orvcbuild test nosign
(Windows) passesAffected core subsystem(s)
Description of change
This commit fixes an issue when extra utf8 continuation bytes appear at the end of a chunk of data, causing miscalculations to be made when checking how many bytes are needed to decode a complete
character.
Fixes: #7308