
child_process: measure buffer length in bytes #6764

Closed · wants to merge 12 commits

Conversation

Trott (Member) commented May 14, 2016

Checklist

  • tests and code linting pass
  • a test and/or benchmark is included
  • the commit message follows commit guidelines
Affected core subsystem(s)

child_process, test

Description of change

This change fixes a known issue where `maxBuffer` limits output by character count rather than by byte count.

Fixes: #1901

Probably need to benchmark it against the current implementation; it's entirely possible no one has done it this way before because of performance concerns.
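
For context, a minimal sketch of the mismatch being fixed (illustrative values, not from the PR): once setEncoding() is in effect, 'data' chunks are strings, so checking chunk.length against maxBuffer counts characters rather than bytes.

const { StringDecoder } = require('string_decoder');

const decoder = new StringDecoder('utf8');
const chunk = Buffer.from('π is not ASCII'); // 'π' is 2 bytes in UTF-8

const str = decoder.write(chunk);
console.log(chunk.length); // 15 -- bytes received
console.log(str.length);   // 14 -- characters, which is what maxBuffer counted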

Trott added the child_process, test, and known issue labels May 14, 2016
Trott (Member, Author) commented May 14, 2016

I've added a benchmark to check performance and it shows that this is as performant as the current implementation. However, the benchmark could definitely use a review as it's very easy to do benchmarks wrong and I have no particular expertise. @nodejs/benchmarking

Trott force-pushed the bytes branch 2 times, most recently from 5765246 to 6060a64, May 14, 2016 23:32
jasnell (Member) commented May 15, 2016

@trevnorris @mscdex

const dur = +conf.dur;
const len = +conf.len;

const msg = '"' + Array(len).join('.') + '"';
Contributor

You can use '.'.repeat(len) for this kind of thing nowadays.

Contributor

But also run msg.test(/./) to flatten the string. Creates more reliable benchmark results.

Trott (Member, Author)

@mscdex @trevnorris So I should do this?

const msg = `"${'.'.repeat(len)}"`;
msg.match(/./);

Contributor

sure. that looks good. i'm not sure what the internal mechanics of repeat() do, but if it's similar to doing += then should take care of it.

Trott (Member, Author)

OK, done! The benchmark shows frighteningly similar results for Node.js 6.2.0 vs. this PR.

6.2.0:

$ node benchmark/child_process/child-process-exec-stdout.js 
child_process/child-process-exec-stdout.js len=64 dur=5: 40873.45059
child_process/child-process-exec-stdout.js len=256 dur=5: 40853.93567
child_process/child-process-exec-stdout.js len=1024 dur=5: 40878.88575
child_process/child-process-exec-stdout.js len=4096 dur=5: 40851.36544
child_process/child-process-exec-stdout.js len=32768 dur=5: 40876.38241
$

This PR:

$ ./node benchmark/child_process/child-process-exec-stdout.js 
child_process/child-process-exec-stdout.js len=64 dur=5: 40873.08798
child_process/child-process-exec-stdout.js len=256 dur=5: 40860.52815
child_process/child-process-exec-stdout.js len=1024 dur=5: 40878.13205
child_process/child-process-exec-stdout.js len=4096 dur=5: 40869.97254
child_process/child-process-exec-stdout.js len=32768 dur=5: 40863.11006
$

mscdex (Contributor) commented May 15, 2016

I did some benchmarking outside of node recently that was related to this and IIRC concatenating strings was faster (rather than creating the string at the end)? This would need some double checking though.
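
A rough sketch of the kind of micro-benchmark that could double-check this (illustrative only, not the benchmark shipped with this PR; results will vary by V8 version):

const chunk = Buffer.from('.'.repeat(1024));
const iterations = 1e4;

console.time('string +=');
let str = '';
for (let i = 0; i < iterations; i++) str += chunk.toString('utf8');
console.timeEnd('string +=');

console.time('Buffer.concat at end');
const chunks = [];
for (let i = 0; i < iterations; i++) chunks.push(chunk);
const out = Buffer.concat(chunks).toString('utf8');
console.timeEnd('Buffer.concat at end');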

@@ -277,7 +271,7 @@ exports.execFile = function(file /*, args, options, callback*/) {
     if (!encoding)
       _stderr.push(chunk);
     else
-      _stderr += chunk;
+      _stderr += chunk.toString(encoding);
Contributor

Couldn't this yield incorrect results if a variable-width encoded character is split across two chunks?

Contributor

Yes, this is true. That's why setEncoding() was being used initially. This will need to be solved differently I think.

Contributor

I think the solution is to always store the data as Buffers and concat/toString them just before being emitted.

Member

isn't that what StringDecoder is for? (didn't look at this in detail so may be missing something)

mscdex (Contributor) May 16, 2016

One alternative would be to keep setEncoding(), etc. and only change the length increment lines to something like:

stderrLen += (this.encoding ? Buffer.byteLength(chunk, this.encoding) : chunk.byteLength);

I haven't benchmarked the difference between that solution and only converting to string at the end though.

EDIT: or another take on this alternative solution would be to add another data handler first that depends on whether encoding is set and increments the length appropriately, if you want to avoid doing the encoding check on every data event (maybe the performance difference is negligible, I haven't tested):

var onStderrData;
if (encoding) {
  child.stderr.setEncoding(encoding);
  onStderrData = function(chunk) {
    stderrLen += Buffer.byteLength(chunk, this.encoding);
  };
} else {
  onStderrData = function(chunk) {
    stderrLen += chunk.byteLength;
  };
}
child.stderr.addListener('data', onStderrData);
child.stderr.addListener('data', function(chunk) {
  if (stderrLen > options.maxBuffer) {
    // ...
  }
});

EDIT 2: Actually the above solution wouldn't work since the encoding could be changed at any time, so you would need to do what I originally suggested (checking this.encoding on each chunk).

mscdex (Contributor) May 16, 2016

@jasnell The problem this PR is trying to solve is that the units for maxBuffer are in bytes, but once you set up a string decoder, there is no way to know how many bytes went into making the string that gets emitted on data. That means you can no longer rely on chunk.length since you could have multi-byte characters.

Member

@mscdex ... +1 gotcha.

Contributor

On second thought, does it make sense to emit on byte count and not character count if encoding is set? Only because the string passed to 'data' can vary noticeably if reading in utf8 characters.

cjihrig (Contributor) commented May 16, 2016

By the way, the discussion in #1902 is probably relevant. The PR changed over time and some of the commits seem to be lost, but check out the comments by @piscisaureus.

mscdex (Contributor) commented May 16, 2016

@cjihrig I personally don't see a problem with going over by at most 3 bytes if the stream ends on a partial character, but you could always peek into the StringDecoder state to see how many bytes are buffered at the end of the stream to account for those bytes.

trevnorris (Contributor)

@mscdex While I'm not sure whether we can account for this, there are cases where the buffered character is larger. Here's an exaggerated case:

const s = 'ก็็็็็็็็็็็็็็็็็็็็';
console.log(s.length);  // 21
console.log(Buffer.from(s).length);  // 63

mscdex (Contributor) commented May 16, 2016

@trevnorris and Buffer.byteLength(s, 'utf8') === 63, so I still don't see what the problem is with the presented alternative solutions (mine or yours).

If you pass say Buffer.from(s).slice(0, 3) to a string decoder, a single character will be returned. Calling Buffer.byteLength() on that character will return 3 bytes. Concatenating the output from the string decoder still results in the same s value.
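
Concretely, a small self-contained illustration of that point:

const { StringDecoder } = require('string_decoder');

const decoder = new StringDecoder('utf8');
const buf = Buffer.from('ก'); // 3 bytes in UTF-8: <Buffer e0 b8 81>

// Feed the bytes one at a time, as if split across 'data' events.
let out = '';
let bytes = 0;
for (const byte of buf) {
  const piece = decoder.write(Buffer.from([byte]));
  out += piece;
  bytes += Buffer.byteLength(piece, 'utf8');
}

console.log(out === 'ก'); // true -- the character survives the split
console.log(bytes);       // 3 -- the byte count is still recoverable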

trevnorris (Contributor) commented May 16, 2016

@mscdex Sorry, didn't convey my point. It's that while the character code may be complete, it's also possible that the rendered character is incomplete because of missing diacritical marks. But also, this isn't a case we're able to handle.

trevnorris (Contributor)

For example:

Buffer('\u00c1');   // <Buffer c3 81>
Buffer('A\u0301');  // <Buffer 41 cc 81>

Both render as the same character 'Á', but the first is '\u00c1' and the other is 'A\u0301'. So the rendered output on the other end is incomplete.
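
As a side note (not something this PR attempts to handle), the two forms can be compared and related via Unicode normalization:

const precomposed = '\u00c1'; // 'Á' as a single code point
const combining = 'A\u0301';  // 'A' plus a combining acute accent

console.log(precomposed === combining);                  // false
console.log(Buffer.byteLength(precomposed, 'utf8'));     // 2
console.log(Buffer.byteLength(combining, 'utf8'));       // 3
console.log(combining.normalize('NFC') === precomposed); // true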

mscdex (Contributor) commented May 17, 2016

@trevnorris I still don't quite follow. Are you referring to a situation where Buffer.from([0x41,0xcc]) is passed to the string decoder but the needed 0x81 is never received from the actual stderr stream? If so, there is nothing that node or anyone else can ever do about that, but what matters is that we can count the bytes received. So in that particular case we'd see the 'A' emitted by the decoder which would count for one byte and at stderr end we can stderrLen += child.stderr._readableState.decoder.charReceived to account for the received 0xcc byte.

Now that I think about it, maybe yet another (better) solution is to just manually instantiate a string decoder and pass that the data chunks, that way we avoid having to peer into _readableState and the string decoder's state variables?
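
A sketch of that idea (hypothetical helper, not the code that landed): keep the stream in Buffer mode and decode by hand, so byte counts come straight off the raw chunks and no internal state needs inspecting.

const { StringDecoder } = require('string_decoder');

function makeCollector(encoding) {
  const decoder = new StringDecoder(encoding);
  let text = '';
  let bytes = 0;
  return {
    write(chunk) {           // chunk is a raw Buffer
      bytes += chunk.length; // count bytes before decoding
      text += decoder.write(chunk);
    },
    end() {
      text += decoder.end(); // flush any buffered partial character
      return { text, bytes };
    }
  };
}

// Usage with a multi-byte character split across chunks:
const c = makeCollector('utf8');
c.write(Buffer.from([0xe0, 0xb8])); // first 2 bytes of 'ก'
c.write(Buffer.from([0x81]));       // final byte
console.log(c.end());               // { text: 'ก', bytes: 3 }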

jasnell (Member) commented May 17, 2016

maybe yet another (better) solution is to just manually instantiate
a string decoder and pass that the data chunks, that way we avoid
having to peer into _readableState and the string decoder's state variables?

That's where I was leading with #6764 (comment)

mscdex (Contributor) commented May 17, 2016

@jasnell Ah ok. The only difference performance-wise with that last proposed solution is that it may still be possible for streams to internally concatenate buffered Buffers (with Buffer.concat()) vs concatenating plain strings (when using .setEncoding()) before data is emitted? I'm not sure what that cost difference is.

jasnell (Member) commented May 17, 2016

Yeah, not sure either. Would need to benchmark it to be certain.

trevnorris (Contributor)

@mscdex This isn't about the string decoder. It's about the (very unlikely) possibility of data coming in and then, say, being logged. Where it would basically be:

console.log('A');
console.log('\u0301');

The rendered output would be wrong, even though each character is technically correct. I'm not saying we should consider supporting this properly. Just wanted to make the point that there are cases where even having the full utf8 characters won't render properly.

jasnell (Member) commented May 17, 2016

If someone was truly worried about that case, they could buffer and wait for a newline to be received before calling console.log(). Not all that worried about that possibility.

mscdex (Contributor) commented May 17, 2016

@trevnorris That's a separate issue though, and node can't do anything about it anyway. Being able to count the bytes is the issue here. Even that separate issue isn't a problem for the majority of people who use exec() with a callback, since the entire (buffered) string is available for each stream; users can't console.log() multiple times unless they explicitly slice the string or add their own stream event handlers, and at that point it's a user problem.

trevnorris (Contributor)

@mscdex sure thing. there was another comment that prompted me to mention this and clarify that node doesn't have any intention of handling it.

Trott (Member, Author) commented May 19, 2016

I've changed the implementation to only run toString() on the buffer in the exit handler. My interpretation of the discussion above is that this still doesn't account for some unlikely edge cases, but that we probably can't do much about those anyway. PTAL.
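
In outline, a simplified sketch of that approach (assumed shape and hypothetical names such as execSketch, not the exact landed code): accumulate raw Buffers, count bytes as they arrive, and stringify once when the child closes.

const { spawn } = require('child_process');

function execSketch(cmd, args, { encoding = 'utf8', maxBuffer = 200 * 1024 } = {}, callback) {
  const child = spawn(cmd, args);
  const chunks = [];
  let len = 0;
  let err = null;

  child.stdout.on('data', (chunk) => {
    len += chunk.length; // chunk.length is bytes for a Buffer
    if (len > maxBuffer) {
      err = new Error('stdout maxBuffer exceeded');
      child.kill();
    } else {
      chunks.push(chunk);
    }
  });

  child.on('close', () => {
    // One concat and one toString at the very end: decoding happens on the
    // complete byte stream, so split characters are never an issue.
    const stdout = Buffer.concat(chunks);
    callback(err, encoding ? stdout.toString(encoding) : stdout);
  });
}

execSketch('echo', ['hello'], {}, (err, stdout) => console.log(err, stdout));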

mscdex (Contributor) commented May 19, 2016

@Trott You'd need to do it in the close event since data can still be received after exit. Never mind, I see that exithandler is actually used for the close event.

Trott (Member, Author) commented May 24, 2016

CI seems to be experiencing less heartburn now than 2 days ago. Let's try again. CI: https://ci.nodejs.org/job/node-test-pull-request/2754/


child.stdout.addListener('data', function(chunk) {
stdoutLen += chunk.length;
// If `child.stdout.setEncoding('utf8')` happened in userland, convert
Contributor

Do you mean any setEncoding() call, or specifically w/ 'utf8' passed?

Trott (Member, Author) May 24, 2016

I mean any setEncoding(). Will change the comment...

Trott (Member, Author) commented May 24, 2016

OK... CI looks good... https://ci.nodejs.org/job/node-test-pull-request/2754/

Benchmarks are alarmingly on par:

$ ./node-pr-6764 benchmark/child_process/child-process-exec-stdout.js 
child_process/child-process-exec-stdout.js len=64 dur=5: 40871.77387
child_process/child-process-exec-stdout.js len=256 dur=5: 40870.66593
child_process/child-process-exec-stdout.js len=1024 dur=5: 40905.69407
child_process/child-process-exec-stdout.js len=4096 dur=5: 40880.96963
child_process/child-process-exec-stdout.js len=32768 dur=5: 40856.23579
$ node-6.2.0 benchmark/child_process/child-process-exec-stdout.js 
child_process/child-process-exec-stdout.js len=64 dur=5: 40872.87520
child_process/child-process-exec-stdout.js len=256 dur=5: 40868.80411
child_process/child-process-exec-stdout.js len=1024 dur=5: 40878.62745
child_process/child-process-exec-stdout.js len=4096 dur=5: 40876.45268
child_process/child-process-exec-stdout.js len=32768 dur=5: 40866.85730
$

Do we feel good about this as a solution to the problem? If so, can I get an LGTM or two? If not, what are the deficiencies?

mscdex (Contributor) commented May 24, 2016

CI is green, LGTM.

Trott added a commit to Trott/io.js that referenced this pull request May 25, 2016
This change fixes a known issue where `maxBuffer` limits by characters
rather than bytes. Benchmark added to confirm no performance regression
occurs with this change.

PR-URL: nodejs#6764
Fixes: nodejs#1901
Reviewed-By: Brian White <mscdex@mscdex.net>
Trott (Member, Author) commented May 25, 2016

Landed in c9a5990

Trott closed this May 25, 2016
Fishrock123 pushed a commit to Fishrock123/node that referenced this pull request May 30, 2016
rvagg pushed a commit that referenced this pull request Jun 2, 2016
Trott added a commit to Trott/io.js that referenced this pull request Jun 23, 2016
This change fixes a known issue where `maxBuffer` limits by characters
rather than bytes. Benchmark added to confirm no performance regression
occurs with this change.

This necessarily changes default behavior of `stdout` and `stderr`
callbacks such that they receive buffers rather than strings. The
alternative would be a performance hit on the defaults.

Refs: nodejs#6764
Refs: nodejs#1901
MylesBorins (Contributor)

@Trott lts?

Trott (Member, Author) commented Jul 11, 2016

@thealphanerd If it lands cleanly, yes.

MylesBorins pushed a commit that referenced this pull request Jul 11, 2016
Trott mentioned this pull request Jul 12, 2016
MylesBorins pushed a commit that referenced this pull request Jul 12, 2016
MylesBorins pushed a commit that referenced this pull request Jul 12, 2016
MylesBorins mentioned this pull request Jul 12, 2016
MylesBorins pushed a commit that referenced this pull request Jul 14, 2016
MylesBorins pushed a commit that referenced this pull request Jul 14, 2016
Trott deleted the bytes branch January 13, 2022