string_decoder: fix handling of malformed utf8 #7318
I have to concede that my changes appear to come at some performance cost, particularly for the base64 encodings. Comparing 588864d with its direct parent 1a1ff77 I see this benchmark comparison:
I guess the degraded base64 performance might be due to extra method invocation from |
Checklist
`make -j4 test` (UNIX) or `vcbuild test nosign` (Windows) passes
Affected core subsystem(s)
Description of change
There have been problems with utf8 decoding in cases where the input was invalid. Some cases would give different results depending on chunking, while others even led to exceptions. This commit simplifies the code in several ways, reducing the risk of breakage.
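The chunking property at stake can be illustrated with the public `string_decoder` API. The sketch below is not taken from this PR: `decodeChunks` is a hypothetical helper, and the input is a valid three-byte character, so the expected result is the same for every chunking. For malformed input the same comparison is what used to fail, but no output for that case is claimed here.

```javascript
const { StringDecoder } = require('string_decoder');

// Hypothetical helper: feed the chunks to one decoder and join the output.
function decodeChunks(chunks) {
  const decoder = new StringDecoder('utf8');
  let out = '';
  for (const chunk of chunks) out += decoder.write(chunk);
  return out + decoder.end();
}

const euro = Buffer.from('€', 'utf8'); // bytes e2 82 ac

// The same bytes, delivered as one chunk and as three one-byte chunks.
const whole = decodeChunks([euro]);
const split = decodeChunks([
  euro.subarray(0, 1),
  euro.subarray(1, 2),
  euro.subarray(2),
]);

// A correct decoder gives identical results regardless of chunking.
console.log(whole === split);
```

Running the same comparison over every split point of a malformed buffer is one way to check that decoding no longer depends on chunk boundaries.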
Most importantly, the `text` method is not only used for bulk conversion of a single chunk, but also for the conversion of the mini buffer `lastChar` while handling characters spanning chunks. That should make most of the problems due to incorrect handling of special cases disappear.
Secondly, that `text` method is now independent of encoding. The encoding-dependent method `complete` now has a single well-defined task: determine the buffer position up to which the input consists of complete characters. The actual conversion is handled in a central location, leading to a cleaner and leaner internal interface.
Thirdly, we no longer try to remember just how many bytes we'll need for the next complete character. We simply try to fill up the `nextChar` buffer and perform a conversion on that. This reduces the number of internal counter variables from two to one, namely `partial`, which indicates the number of bytes currently stored in `nextChar`.
A possible drawback of this approach is that there is a chance of degraded performance if input arrives one byte at a time and is from a script using long utf8 sequences. As a benefit, though, this allows us to emit a U+FFFD replacement character sooner in cases where the first byte of a utf8 sequence is not followed by the expected number of continuation bytes.
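The task assigned to `complete`, finding the buffer position up to which the input consists of complete characters, could look roughly like this for utf8. This is a minimal sketch under my own assumptions, not the code from this PR: the name `completeUtf8Length` is hypothetical, and it only inspects the last three bytes, leaving validation and U+FFFD replacement to the central conversion step.

```javascript
// Hypothetical sketch: return the number of leading bytes of `buf` that
// form complete utf8 characters. Anything after that position is the
// start of a sequence whose remaining bytes have not arrived yet.
function completeUtf8Length(buf) {
  const end = buf.length;
  // An incomplete sequence can hold back at most 3 bytes (lead byte of a
  // 4-byte sequence plus two continuation bytes).
  for (let back = 1; back <= 3 && back <= end; back++) {
    const byte = buf[end - back];
    if ((byte & 0xC0) === 0x80) continue; // continuation byte, look further back

    let need; // total length announced by this lead byte
    if ((byte & 0xE0) === 0xC0) need = 2;
    else if ((byte & 0xF0) === 0xE0) need = 3;
    else if ((byte & 0xF8) === 0xF0) need = 4;
    else need = 1; // ASCII or an invalid lead byte: nothing to wait for

    // If the sequence wants more bytes than we have, cut before its lead byte.
    return need > back ? end - back : end;
  }
  // Empty buffer, or only continuation bytes in the tail: treat all as complete.
  return end;
}

console.log(completeUtf8Length(Buffer.from([0x61, 0xE2, 0x82]))); // 1
console.log(completeUtf8Length(Buffer.from([0xE2, 0x82, 0xAC]))); // 3
```

Bytes past the returned position would be kept for the next chunk, while everything before it can be converted in one go.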
Fixes: #7308
This is an alternative to #7310; merging this makes that obsolete.