Very slow child_process.spawn stdout stream performance on macOS #3429

Closed
mhart opened this issue Oct 18, 2015 · 54 comments

Labels
child_process Issues and PRs related to the child_process subsystem.

Comments

@mhart
Contributor

mhart commented Oct 18, 2015

The child_process.spawn stdout stream seems to be taking ~4x longer than other "similar" methods of I/O streaming with similarly sized data streams.

Running on Node.js v4.2.1, Mac OS X 10.11, MacBook Air 1.7GHz

As a baseline, the following program, where file is a 472MB gzip file and stdout is set to ignore, takes 6.5 secs with the gzip process using ~100% CPU:

var spawn = require('child_process').spawn

// `file` is the path to the 472MB gzip file described above
var dataLength = 0
var time = process.hrtime()
var gzip = spawn('gzip', ['-cd', file], {stdio: ['ignore', 'ignore', 'ignore']})

if (gzip.stdout != null) {
  gzip.stdout.on('data', function(data) {
    if (!dataLength) dataLength = data.length
  })
}
gzip.on('close', function(code) {
  var diff = process.hrtime(time)
  console.log('stream took %d seconds', diff[0] + diff[1] / 1e9)
  if (dataLength) console.log('buffer length was %d', dataLength)
})
$ node spawn-bench.js
stream took 6.497701762 seconds

If I set the stdout option from ignore to pipe, it suddenly takes 27 secs, with gzip only using ~35% CPU and node using ~75%:

var spawn = require('child_process').spawn

var dataLength = 0
var time = process.hrtime()
var gzip = spawn('gzip', ['-cd', file], {stdio: ['ignore', 'pipe', 'ignore']})

if (gzip.stdout != null) {
  gzip.stdout.on('data', function(data) {
    if (!dataLength) dataLength = data.length
  })
}
gzip.on('close', function(code) {
  var diff = process.hrtime(time)
  console.log('stream took %d seconds', diff[0] + diff[1] / 1e9)
  if (dataLength) console.log('buffer length was %d', dataLength)
})
$ node spawn-bench.js
stream took 27.406851714 seconds
buffer length was 8192

Fine, so that 4x overhead could just be standard pipe/stream overhead, but if I pipe in from stdin, there's really not much overhead at all and it finishes in 7.2 secs, with gzip using ~95% CPU and node ~30%:

var dataLength = 0
var time = process.hrtime()

process.stdin.on('data', function(data) {
  if (!dataLength) dataLength = data.length
})
process.stdin.on('end', function() {
  var diff = process.hrtime(time)
  console.log('stream took %d seconds', diff[0] + diff[1] / 1e9)
  if (dataLength) console.log('buffer length was %d', dataLength)
})
$ gzip -cd file.gz | node spawn-bench.js
stream took 7.2209929479999995 seconds
buffer length was 65536

Similarly, if I process the entire gzip file in node using zlib, I get relatively ok performance too, taking 9.8 secs:

var fs = require('fs')
var zlib = require('zlib')

var dataLength = 0
var time = process.hrtime()
fs.createReadStream(file, {highWaterMark: 32 * 1024 * 1024})
  .pipe(zlib.createGunzip({chunkSize: 32 * 1024 * 1024}))
  .on('data', function(data) {
    if (!dataLength) dataLength = data.length
  })
  .on('end', function() {
    var diff = process.hrtime(time)
    console.log('stream took %d seconds', diff[0] + diff[1] / 1e9)
    if (dataLength) console.log('buffer length was %d', dataLength)
  })
$ node spawn-bench.js
stream took 9.836914587 seconds
buffer length was 33554432

So one thing I suspected was the stream buffer size – you can see when spawning it's only 8k, whereas it's 64k when piping from stdin, and I had better performance upping it even further (to 32MB) when reading from the file and gzipping in node.

However, I couldn't see any way to change this with spawn, so I can't really determine if this is the cause.

Is there something else I'm missing? A 4x slowdown seems to be far greater than expected.
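
As an aside, one way to check the chunk sizes directly (a minimal sketch, not one of the benchmarks above – call it chunk-sizes.js, assuming the same file.gz) is to histogram the 'data' chunk lengths from both sources:

var spawn = require('child_process').spawn

var sizes = {}
var source = process.argv[2] === 'stdin'
  ? process.stdin
  : spawn('gzip', ['-cd', 'file.gz'], {stdio: ['ignore', 'pipe', 'ignore']}).stdout

source.on('data', function(data) {
  sizes[data.length] = (sizes[data.length] || 0) + 1   // count chunks per size
})
source.on('end', function() {
  console.log(sizes)   // mostly 8192 for spawn on OS X, 65536 when piped in via stdin
})

Run it as node chunk-sizes.js for the spawn case, or gzip -cd file.gz | node chunk-sizes.js stdin for the stdin case.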

@mhart
Contributor Author

mhart commented Oct 18, 2015

I ran with --prof and analyzed... can't really tell what's going on here.

Is this mostly IPC overhead? _openat, _mach_port_extract_member, etc?

Statistical profiling result from isolate-0x101804600-v8.log, (147497 ticks, 602 unaccounted, 0 excluded).

 [Unknown]:
   ticks  total  nonlib   name
    602    0.4%

 [Shared libraries]:
   ticks  total  nonlib   name
    438    0.3%    0.0%  /usr/lib/system/libsystem_malloc.dylib
    373    0.3%    0.0%  /usr/lib/system/libsystem_pthread.dylib
      1    0.0%    0.0%  /usr/lib/system/libsystem_platform.dylib
      1    0.0%    0.0%  /usr/lib/system/libsystem_c.dylib

 [JavaScript]:
   ticks  total  nonlib   name
    569    0.4%    0.4%  LazyCompile: ~readableAddChunk _stream_readable.js:123:26
    487    0.3%    0.3%  LazyCompile: *emit events.js:117:44
    467    0.3%    0.3%  LazyCompile: *nextTick node.js:475:22
    464    0.3%    0.3%  Stub: LoadICTrampolineStub
    405    0.3%    0.3%  LazyCompile: *onread net.js:501:16
    322    0.2%    0.2%  LazyCompile: *_tickCallback node.js:333:27
...
 [C++]:
   ticks  total  nonlib   name
  56380   38.2%   38.4%  _openat
  44780   30.4%   30.5%  _mach_port_extract_member
  10645    7.2%    7.3%  ___rename_ext
   8456    5.7%    5.8%  _szone_free_definite_size
    907    0.6%    0.6%  __ZN2v88internal4Heap22RegisterNewArrayBufferEbPvm
...
 [GC]:
   ticks  total  nonlib   name
   6978    4.7%

 [Bottom up (heavy) profile]:
  Note: percentage shows a share of a particular caller in the total
  amount of its parent calls.
  Callers occupying less than 2.0% are not shown.

   ticks parent  name
  56380   38.2%  _openat
  44780   30.4%  _mach_port_extract_member
  10645    7.2%  ___rename_ext
   8456    5.7%  _szone_free_definite_size

 [Top down (heavy) profile]:
  Note: callees occupying less than 0.1% are not shown.

  inclusive      self           name
  ticks   total  ticks   total
  56380   38.2%  56380   38.2%  _openat
  44780   30.4%  44780   30.4%  _mach_port_extract_member
  10645    7.2%  10645    7.2%  ___rename_ext
   8455    5.7%   8455    5.7%  _szone_free_definite_size
   4699    3.2%    405    0.3%  LazyCompile: *onread net.js:501:16
   3879    2.6%    566    0.4%    LazyCompile: ~readableAddChunk _stream_readable.js:123:26
    812    0.6%     92    0.1%      LazyCompile: *maybeReadMore _stream_readable.js:422:23
    582    0.4%    405    0.3%        LazyCompile: *nextTick node.js:475:22
    515    0.3%    436    0.3%      LazyCompile: *emit events.js:117:44
    452    0.3%    452    0.3%      Stub: LoadICTrampolineStub
    186    0.1%    186    0.1%      Stub: InstanceofStub
    152    0.1%     99    0.1%      LazyCompile: *howMuchToRead _stream_readable.js:211:23
    332    0.2%     69    0.0%    LazyCompile: *readableAddChunk _stream_readable.js:123:26
...

@mscdex added the child_process label Oct 18, 2015
@bnoordhuis
Member

FWIW, I can't reproduce on OS X 10.8. The fact that the process spends nearly 40% of its time in openat() is suspicious.

@mhart
Contributor Author

mhart commented Oct 19, 2015

@bnoordhuis interesting – is there something I can do to narrow this down then?

Perhaps this is an OS X-only issue – maybe even specific to OS X 10.11

I'll test the same setup on a Linux box in a sec

@bnoordhuis
Member

Maybe you can put a dtrace probe on openat() that prints the filename and the time spent inside the system call? As root:

$ dtrace -q \
-n 'syscall::openat_nocancel:entry /pid == $target/ { self->timestamp = timestamp; self->path = copyinstr(arg1); }' \
-n 'syscall::openat_nocancel:return /pid == $target/ { printf("%d %s\n", timestamp - self->timestamp, self->path) }' \
-c 'node test.js'

I'm just guessing at the name of the openat probe, 10.8 doesn't have that system call. dtrace -l | grep -w syscall prints the available probes.

@mhart
Contributor Author

mhart commented Oct 19, 2015

Well, that appears to be the correct syscall afaict:

$ sudo dtrace -l | grep -w syscall | grep openat
 1076    syscall                                              openat entry
 1077    syscall                                              openat return
 1078    syscall                                     openat_nocancel entry
 1079    syscall                                     openat_nocancel return

But running the command you gave gives me no (extra) output. If I remove the -q I get:

$ sudo dtrace -n 'syscall::openat_nocancel:entry /pid == $target/ { self->timestamp = timestamp; self->path = copyinstr(arg1); }' -n 'syscall::openat_nocancel:return /pid == $target/ { printf("%d %s\n", timestamp - self->timestamp, self->path) }' -c 'node spawn-bench.js'

dtrace: description 'syscall::openat_nocancel:entry ' matched 1 probe
dtrace: description 'syscall::openat_nocancel:return ' matched 1 probe
stream took 25.471454835 seconds
buffer length was 8192
dtrace: pid 56635 has exited

(should I be looking somewhere else for the dtrace output?)

I was using https://github.com/sidorares/node-tick to process the results of --prof – running it using node-tick-processor --mac – just in case there's the possibility that it's resolving the wrong syscalls? There were no errors/warnings though.

@mhart
Contributor Author

mhart commented Oct 19, 2015

Oh, and replacing openat_nocancel with openat in the dtrace command had no effect either.

@bnoordhuis
Member

Can you use tools/v8-prof/tick-processor.js? The format of the log file is pretty closely tied to the version of V8. If the tick processor and the log file don't match up, you can get bogus results.

@sidorares Don't know if node-tick already does that but newer versions of V8 start the log with a v8-version marker that you can use to print a warning when the tick processor doesn't match. Feel free to steal the logic from here.
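
For illustration, a rough sketch of the kind of check being suggested (not node-tick's actual code – the log name is taken from the output above, and the exact marker format depends on the V8 version):

var fs = require('fs')

var firstLine = fs.readFileSync('isolate-0x101804600-v8.log', 'utf8').split('\n', 1)[0]
if (firstLine.indexOf('v8-version') === 0) {
  // e.g. "v8-version,4,5,103,35,0" – warn if this doesn't match the tick processor's V8
  console.log('log written by V8 ' + firstLine.split(',').slice(1).join('.'))
} else {
  console.log('no v8-version marker; match the tick processor to the node build by hand')
}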

@mhart
Contributor Author

mhart commented Oct 19, 2015

@bnoordhuis I kinda remember a similar discussion happening here 😄 sidorares/node-tick#10 (comment)

@mhart
Contributor Author

mhart commented Oct 19, 2015

Same (similar) result with tools/v8-prof/tick-processor.js. I had deleted the old profile, but a newly generated one looks much the same, largely made up of _mach_port_extract_member, _openat, ___rename_ext and _free:

$ node tools/v8-prof/tick-processor.js ../isolate-0x102800000-v8.log
Statistical profiling result from ../isolate-0x102800000-v8.log, (19119 ticks, 43 unaccounted, 0 excluded).

 [Shared libraries]:
   ticks  total  nonlib   name
     79    0.4%          /usr/lib/system/libsystem_malloc.dylib
     50    0.3%          /usr/lib/system/libsystem_pthread.dylib

 [JavaScript]:
   ticks  total  nonlib   name
     82    0.4%    0.4%  Stub: LoadICTrampolineStub
     66    0.3%    0.3%  LazyCompile: *nextTick node.js:475:22
     61    0.3%    0.3%  LazyCompile: *onread net.js:501:16
     55    0.3%    0.3%  LazyCompile: *emit events.js:117:44
     45    0.2%    0.2%  LazyCompile: *Readable.read _stream_readable.js:250:35
     44    0.2%    0.2%  LazyCompile: *_tickCallback node.js:333:27
     27    0.1%    0.1%  Stub: InstanceofStub
...

 [C++]:
   ticks  total  nonlib   name
   6310   33.0%   33.2%  _mach_port_extract_member
   5364   28.1%   28.2%  _openat
   2030   10.6%   10.7%  ___rename_ext
   1308    6.8%    6.9%  _free
    203    1.1%    1.1%  _malloc_zone_valloc
    187    1.0%    1.0%  _malloc_zone_malloc
    146    0.8%    0.8%  v8::internal::Heap::RegisterNewArrayBuffer(bool, void*, unsigned long)
     86    0.4%    0.5%  v8::Context::Enter()
     86    0.4%    0.5%  _uv__io_poll
...

 [Summary]:
   ticks  total  nonlib   name
    800    4.2%    4.2%  JavaScript
  18147   94.9%   95.6%  C++
   1109    5.8%    5.8%  GC
    129    0.7%          Shared libraries
     43    0.2%          Unaccounted

 [C++ entry points]:
   ticks    cpp   total   name
     27   25.0%    0.1%  v8::internal::Builtins::~Builtins()
     18   16.7%    0.1%  v8::internal::Runtime_CompileLazy(int, v8::internal::Object**, v8::internal::Isolate*)
     10    9.3%    0.1%  v8::internal::Runtime_StackGuard(int, v8::internal::Object**, v8::internal::Isolate*)
      9    8.3%    0.0%  v8::internal::Runtime_Interrupt(int, v8::internal::Object**, v8::internal::Isolate*)
      8    7.4%    0.0%  v8::internal::StoreIC_Miss(int, v8::internal::Object**, v8::internal::Isolate*)
      8    7.4%    0.0%  v8::internal::LoadIC_Miss(int, v8::internal::Object**, v8::internal::Isolate*)
      7    6.5%    0.0%  v8::internal::Runtime_CompileOptimized(int, v8::internal::Object**, v8::internal::Isolate*)
      4    3.7%    0.0%  v8::internal::KeyedStoreIC_Miss(int, v8::internal::Object**, v8::internal::Isolate*)
      3    2.8%    0.0%  v8::Isolate::RunMicrotasks()
      2    1.9%    0.0%  v8::internal::ToBooleanIC_Miss(int, v8::internal::Object**, v8::internal::Isolate*)
      2    1.9%    0.0%  v8::internal::KeyedLoadIC_Miss(int, v8::internal::Object**, v8::internal::Isolate*)
...

 [Bottom up (heavy) profile]:
  Note: percentage shows a share of a particular caller in the total
  amount of its parent calls.
  Callers occupying less than 2.0% are not shown.

   ticks parent  name
   6310   33.0%  _mach_port_extract_member
   5364   28.1%  _openat
   2030   10.6%  ___rename_ext
   1308    6.8%  _free

@mhart
Contributor Author

mhart commented Oct 19, 2015

@bnoordhuis is there a way you know of to increase the buffer size for spawn stdio streams/pipes?

@evanlucas
Contributor

In an attempt to see all of the syscalls made here, I ran dtruss -c -f node <file> and it seems like when stdio is set to ['pipe', 'pipe', 'pipe'], madvise is called significantly more.

With pipe, it was called 24418 times, compared to 66 times with ignore.

@bnoordhuis
Member

Working hypothesis: the openat and __rename_ext calls (maybe mach_port_extract_member too) are caused by V8 logging to disk.

@mhart Can you try the dtrace command again while passing --prof to node this time?

@mhart
Contributor Author

mhart commented Oct 19, 2015

FWIW, I tested with 0.12.7 and 0.10.40 on the same machine – I do get slightly better results – still at least 2x slower than reading from stdin though. Would love to up the buffer sizes to see if that helps.

$ nave use 0.12.7 node spawn-bench.js
Already installed: 0.12.7
using 0.12.7
stream took 19.746245263 seconds
buffer length was 8192
$ nave use 0.10.40 node spawn-bench.js
Already installed: 0.10.40
using 0.10.40
stream took 15.013791288 seconds
buffer length was 8192

@bnoordhuis will try that dtrace command now

@mhart
Contributor Author

mhart commented Oct 19, 2015

No difference in the dtrace output using --prof – FWIW there's no difference in execution time either – bizarrely it seems that passing --prof even reduces the time by a few seconds...!? Small sample size I know, but it seems pretty consistent.

$ node spawn-bench.js
stream took 26.928994887 seconds
buffer length was 8192
$ node --prof spawn-bench.js
stream took 22.790958391 seconds
buffer length was 8192
$ node spawn-bench.js
stream took 25.437598346 seconds
buffer length was 8192
$ node --prof spawn-bench.js
stream took 22.1429554 seconds
buffer length was 8192
$ node spawn-bench.js
stream took 25.366320914 seconds
buffer length was 8192
$ node --prof spawn-bench.js
stream took 22.147164474 seconds
buffer length was 8192

@mhart
Contributor Author

mhart commented Oct 19, 2015

(ok, maybe forget the --prof-being-faster observation, it's a little inconsistent – in any case, it certainly doesn't appear to be contributing negatively to performance)

Interestingly on an EC2 Ubuntu Linux instance, using node+zlib with large buffers is actually markedly faster than the default gzip command line (which is kinda cool):

$ uname -a
Linux 3.19.0-30-generic #34-Ubuntu SMP Fri Oct 2 22:08:41 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

$ node --version
v4.2.1

# using zlib and 32MB buffers:
$ node spawn-bench.js
stream took 9.823702408 seconds
bytes read: 2674732313
buffer length: 33554432

$ time gzip -cd file.gz > /dev/null
real    0m14.706s
user    0m14.584s
sys 0m0.120s

And there appears to be no difference between spawn and stdin times – the spawn buffer is larger on Linux it seems, 64k:

# using spawn with `ignore`:
$ node spawn-bench.js
stream took 14.649728433 seconds

# using spawn with `pipe`:
$ node spawn-bench.js
stream took 15.488220565 seconds
bytes read: 2674732313
buffer length: 65536

# changing to read from stdin:
$ gzip -cd file.gz | node spawn-bench.js
stream took 15.568545432 seconds
bytes read: 2674732313
buffer length: 65536

(these results are consistent after I run them a few times)

So:

  • There's no noticeable performance difference between spawn and stdin on Linux, but there is on Mac OS X 10.11
  • There's a much smaller difference between ignore and pipe on Linux (~1 sec)
  • The spawn buffer size on Linux is 64k, on Mac OS X 10.11 it's 8k

Again, I'd really love to play around with the buffer sizes here to see if it's stream/pipe overhead that's causing the issue, so any pointers on how to set them for spawn would be great.
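
One possible workaround (a sketch, nothing verified in this thread – the /tmp/gunzip.fifo path is made up): have the child write to a named pipe created with mkfifo and read that with fs.createReadStream, which does accept highWaterMark. The kernel's pipe buffer still caps how much each read() actually returns, so this mainly swaps the socketpair for a FIFO rather than guaranteeing large chunks.

var fs = require('fs')
var spawn = require('child_process').spawn
var execSync = require('child_process').execSync

var fifo = '/tmp/gunzip.fifo'
execSync('mkfifo ' + fifo)   // a named pipe instead of the socketpair spawn would create

// The child writes straight into the FIFO, so node never touches its stdout.
spawn('sh', ['-c', 'gzip -cd file.gz > ' + fifo], {stdio: 'ignore'})

var bytes = 0
fs.createReadStream(fifo, {highWaterMark: 1024 * 1024})
  .on('data', function(data) { bytes += data.length })
  .on('end', function() {
    console.log('read %d bytes', bytes)
    fs.unlinkSync(fifo)
  })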

@mhart changed the title from "Very slow child_process.spawn stdout stream performance" to "Very slow child_process.spawn stdout stream performance on Mac OS X 10.11" on Oct 19, 2015
@mhart
Contributor Author

mhart commented Oct 19, 2015

So 10.11 may be partly to blame. I just tested on a machine w/ Mac OS X 10.10.5 – a 2GHz mid-2012 MB Air, and I saw more like a 2-3x slowdown instead of 4x.

spawn w/ ignore (~6.7 secs):

$ node spawn-bench.js 
stream took 6.734095193 seconds
$ node spawn-bench.js 
stream took 6.760609026 seconds
$ node spawn-bench.js 
stream took 6.768254683 seconds

spawn w/ pipe (~18-19 secs):

$ node spawn-bench.js 
stream took 18.098372379 seconds
bytes read: 2674732313
buffer length: 8192
$ node spawn-bench.js 
stream took 18.301364698 seconds
bytes read: 2674732313
buffer length: 8192
$ node spawn-bench.js 
stream took 19.535337659 seconds
bytes read: 2674732313
buffer length: 8192

piping to stdin (~7.7 secs):

$ gzip -cd file.gz | node spawn-bench.js 
stream took 7.753027612 seconds
bytes read: 2674732313
buffer length: 65536
$ gzip -cd file.gz | node spawn-bench.js 
stream took 7.710982234 seconds
bytes read: 2674732313
buffer length: 65536
$ gzip -cd file.gz | node spawn-bench.js 
stream took 7.70945793 seconds
bytes read: 2674732313
buffer length: 65536

And the profile for the spawn w/ pipe case seems largely taken up by ___bsdthread_ctl, ___thread_selfusage and _connectx:

$ node-tick-processor --mac isolate-0x101804c00-v8.log 
Statistical profiling result from isolate-0x101804c00-v8.log, (15488 ticks, 17 unaccounted, 0 excluded).

 [Unknown]:
   ticks  total  nonlib   name
     17    0.1%

 [Shared libraries]:
   ticks  total  nonlib   name
     47    0.3%    0.0%  /usr/lib/system/libsystem_pthread.dylib
     20    0.1%    0.0%  /usr/lib/system/libsystem_malloc.dylib
      1    0.0%    0.0%  /usr/lib/system/libsystem_platform.dylib

 [JavaScript]:
   ticks  total  nonlib   name
     55    0.4%    0.4%  LazyCompile: *emit events.js:117:44
     53    0.3%    0.3%  LazyCompile: *_tickCallback node.js:333:27
     49    0.3%    0.3%  Stub: LoadICTrampolineStub
     46    0.3%    0.3%  LazyCompile: *nextTick node.js:475:22
     42    0.3%    0.3%  LazyCompile: *Readable.read _stream_readable.js:250:35
     32    0.2%    0.2%  LazyCompile: *onread net.js:501:16
     24    0.2%    0.2%  Stub: InstanceofStub
     16    0.1%    0.1%  LazyCompile: ~readableAddChunk _stream_readable.js:123:26
     16    0.1%    0.1%  LazyCompile: *howMuchToRead _stream_readable.js:211:23
     15    0.1%    0.1%  Stub: GrowArrayElementsStub
     15    0.1%    0.1%  Stub: CallICStub(args(2), FUNCTION, 
     12    0.1%    0.1%  Stub: LoadICStub
     11    0.1%    0.1%  Stub: CallFunctionStub_Args3
     11    0.1%    0.1%  LazyCompile: *readableAddChunk _stream_readable.js:123:26
     11    0.1%    0.1%  Builtin: ArgumentsAdaptorTrampoline
     10    0.1%    0.1%  Stub: CompareICStub {1}
...
 [C++]:
   ticks  total  nonlib   name
   5694   36.8%   36.9%  ___bsdthread_ctl
   3420   22.1%   22.2%  ___thread_selfusage
   2711   17.5%   17.6%  _connectx
    741    4.8%    4.8%  _szone_free
    434    2.8%    2.8%  _szone_pressure_relief
    115    0.7%    0.7%  _szone_free_definite_size
     89    0.6%    0.6%  _szone_malloc_should_clear
     74    0.5%    0.5%  _mach_msg_destroy
     50    0.3%    0.3%  _uv__stream_io
     47    0.3%    0.3%  __ZN2v88internal4Heap22RegisterNewArrayBufferEbPvm
     32    0.2%    0.2%  _uv__io_poll
     31    0.2%    0.2%  __ZN2v88internal14LookupIterator14LookupInHolderEPNS0_3MapEPNS0_10JSReceiverE
     29    0.2%    0.2%  _vm_region_recurse_64
     27    0.2%    0.2%  __ZN2v88internal8JSObject12SetPrototypeENS0_6HandleIS1_EENS2_INS0_6ObjectEEEb
     27    0.2%    0.2%  __ZN2v87Context4ExitEv
     26    0.2%    0.2%  _tiny_malloc_from_free_list
     25    0.2%    0.2%  _platform_task_update_threads
     25    0.2%    0.2%  __simple_asl_msg_set
     24    0.2%    0.2%  __ZN2v88internal7Context14native_contextEv
     24    0.2%    0.2%  __ZN2v87Context5EnterEv
     23    0.1%    0.1%  __ZN2v88internalL6InvokeEbNS0_6HandleINS0_10JSFunctionEEENS1_INS0_6ObjectEEEiPS5_
     21    0.1%    0.1%  __ZN4node9AsyncWrap12MakeCallbackEN2v85LocalINS1_8FunctionEEEiPNS2_INS1_5ValueEEE
     21    0.1%    0.1%  __ZN4node16ContextifyScript3NewERKN2v820FunctionCallbackInfoINS1_5ValueEEE
     21    0.1%    0.1%  __ZN2v88internal15TimerEventScopeINS0_17TimerEventExecuteEE13LogTimerEventENS0_6Logger8StartEndE
     20    0.1%    0.1%  __simple_asl_send
     20    0.1%    0.1%  __malloc_initialize
...
 [GC]:
   ticks  total  nonlib   name
   1097    7.1%

 [Bottom up (heavy) profile]:
  Note: percentage shows a share of a particular caller in the total
  amount of its parent calls.
  Callers occupying less than 2.0% are not shown.

   ticks parent  name
   5694   36.8%  ___bsdthread_ctl
   3420   22.1%  ___thread_selfusage
   2711   17.5%  _connectx
    741    4.8%  _szone_free
    434    2.8%  _szone_pressure_relief

 [Top down (heavy) profile]:
  Note: callees occupying less than 0.1% are not shown.

  inclusive      self           name
  ticks   total  ticks   total
   5694   36.8%   5694   36.8%  ___bsdthread_ctl
   3419   22.1%   3419   22.1%  ___thread_selfusage
   2711   17.5%   2711   17.5%  _connectx
    740    4.8%    740    4.8%  _szone_free
    482    3.1%     32    0.2%  LazyCompile: *onread net.js:501:16
    295    1.9%     14    0.1%    LazyCompile: ~readableAddChunk _stream_readable.js:123:26
     59    0.4%      5    0.0%      LazyCompile: *maybeReadMore _stream_readable.js:422:23
     40    0.3%     30    0.2%        LazyCompile: *nextTick node.js:475:22
     55    0.4%     32    0.2%      LazyCompile: *emit events.js:117:44
     20    0.1%      2    0.0%        LazyCompile: ~<anonymous> /Users/sofijastefanovic/spawn-bench.js:37:36
     30    0.2%     30    0.2%      Stub: LoadICTrampolineStub
     28    0.2%     20    0.1%      LazyCompile: *Readable.read _stream_readable.js:250:35
    135    0.9%     10    0.1%    LazyCompile: *readableAddChunk _stream_readable.js:123:26
     34    0.2%     19    0.1%      LazyCompile: *Readable.read _stream_readable.js:250:35
     24    0.2%     23    0.1%      LazyCompile: *emit events.js:117:44
     21    0.1%      5    0.0%      LazyCompile: ~emit events.js:117:44
     19    0.1%     12    0.1%      LazyCompile: *nextTick node.js:475:22
    434    2.8%    434    2.8%  _szone_pressure_relief
    113    0.7%    113    0.7%  _szone_free_definite_size
    107    0.7%     53    0.3%  LazyCompile: *_tickCallback node.js:333:27
     30    0.2%      7    0.0%    LazyCompile: doNTCallback2 node.js:436:27
     87    0.6%     87    0.6%  _szone_malloc_should_clear
     74    0.5%     74    0.5%  _mach_msg_destroy
     55    0.4%      0    0.0%  Function: ~<anonymous> node.js:10:10
     53    0.3%      0    0.0%    LazyCompile: ~startup node.js:13:19
     23    0.1%      0    0.0%      LazyCompile: ~Module.runMain module.js:465:26
     22    0.1%      0    0.0%        LazyCompile: Module._load module.js:271:24
     19    0.1%      0    0.0%          LazyCompile: ~Module.load module.js:346:33
     18    0.1%      0    0.0%            LazyCompile: ~Module._extensions..js module.js:440:37
     16    0.1%      0    0.0%              LazyCompile: ~Module._compile module.js:379:37
     50    0.3%     50    0.3%  _uv__stream_io
     47    0.3%     47    0.3%  __ZN2v88internal4Heap22RegisterNewArrayBufferEbPvm
     43    0.3%     43    0.3%  /usr/lib/system/libsystem_pthread.dylib
     32    0.2%     32    0.2%  _uv__io_poll
     31    0.2%     31    0.2%  __ZN2v88internal14LookupIterator14LookupInHolderEPNS0_3MapEPNS0_10JSReceiverE
     29    0.2%     29    0.2%  _vm_region_recurse_64
     27    0.2%     27    0.2%  __ZN2v88internal8JSObject12SetPrototypeENS0_6HandleIS1_EENS2_INS0_6ObjectEEEb
     27    0.2%     27    0.2%  __ZN2v87Context4ExitEv
...

@mhart
Contributor Author

mhart commented Oct 19, 2015

Might have to scratch the 10.11 theory, seems to exist on 10.9 too. I spun up a macincloud instance running OS X 10.9.5 on a 2.5GHz i5.

The issues are even more pronounced on this setup: seeing a 5-7x slowdown compared w/ stdin here.

Here I generated the gzip file using cat /dev/random | head -c 2674732313 | gzip > file.gz – and the time to ungzip is quite quick:

Using spawn w/ ignore (~2 secs):

$ node spawn-bench.js 
stream took 2.036773606 seconds
$ node spawn-bench.js 
stream took 2.049106005 seconds
$ node spawn-bench.js 
stream took 2.00025435 seconds

Using stdin and streaming (~3.5 secs):

$ gzip -cd file.gz | node spawn-bench.js 
stream took 3.46003934 seconds
bytes read: 2674732313
buffer length: 65536
$ gzip -cd file.gz | node spawn-bench.js 
stream took 3.537773093 seconds
bytes read: 2674732313
buffer length: 65536
$ gzip -cd file.gz | node spawn-bench.js 
stream took 3.562273549 seconds
bytes read: 2674732313
buffer length: 65536

Using spawn w/ pipe (~19 secs):

$ node spawn-bench.js 
stream took 18.548235443 seconds
bytes read: 2674732313
buffer length: 8192
$ node spawn-bench.js 
stream took 20.300110516 seconds
bytes read: 2674732313
buffer length: 8192
$ node spawn-bench.js 
stream took 19.733407683 seconds
bytes read: 2674732313
buffer length: 8192

And here's the profile of spawn w/ pipe:

$ node-tick-processor --mac isolate-0x101804c00-v8.log 
Statistical profiling result from isolate-0x101804c00-v8.log, (15787 ticks, 29 unaccounted, 0 excluded).

 [Unknown]:
   ticks  total  nonlib   name
     29    0.2%

 [Shared libraries]:
   ticks  total  nonlib   name
     17    0.1%    0.0%  /usr/lib/system/libsystem_pthread.dylib
      1    0.0%    0.0%  /usr/lib/system/libsystem_c.dylib

 [JavaScript]:
   ticks  total  nonlib   name
     71    0.4%    0.5%  LazyCompile: *emit events.js:117:44
     67    0.4%    0.4%  LazyCompile: *_tickCallback node.js:333:27
     52    0.3%    0.3%  Stub: LoadICTrampolineStub
     52    0.3%    0.3%  LazyCompile: *Readable.read _stream_readable.js:250:35
     43    0.3%    0.3%  LazyCompile: *onread net.js:501:16
     43    0.3%    0.3%  LazyCompile: *nextTick node.js:475:22
     40    0.3%    0.3%  Stub: InstanceofStub
     28    0.2%    0.2%  Stub: JSEntryStub
     28    0.2%    0.2%  LazyCompile: ~readableAddChunk _stream_readable.js:123:26
     27    0.2%    0.2%  Stub: GrowArrayElementsStub
     26    0.2%    0.2%  LazyCompile: *isNaN native v8natives.js:29:21
...
 [C++]:
   ticks  total  nonlib   name
   6488   41.1%   41.1%  ___initgroups
   2701   17.1%   17.1%  _proc_listpidspath
   1992   12.6%   12.6%  _mprotect
    295    1.9%    1.9%  _purgeable_ptr_in_use_enumerator
    257    1.6%    1.6%  _purgeable_print
    252    1.6%    1.6%  _szone_error
    205    1.3%    1.3%  _szone_good_size
     91    0.6%    0.6%  __ZN2v88internal4Heap22RegisterNewArrayBufferEbPvm
...
 [GC]:
   ticks  total  nonlib   name
   1361    8.6%

 [Bottom up (heavy) profile]:
  Note: percentage shows a share of a particular caller in the total
  amount of its parent calls.
  Callers occupying less than 2.0% are not shown.

   ticks parent  name
   6488   41.1%  ___initgroups
   2701   17.1%  _proc_listpidspath
   1992   12.6%  _mprotect

 [Top down (heavy) profile]:
  Note: callees occupying less than 0.1% are not shown.

  inclusive      self           name
  ticks   total  ticks   total
   6487   41.1%   6487   41.1%  ___initgroups
   2701   17.1%   2701   17.1%  _proc_listpidspath
   1992   12.6%   1992   12.6%  _mprotect
    664    4.2%     43    0.3%  LazyCompile: *onread net.js:501:16
    407    2.6%     27    0.2%    LazyCompile: ~readableAddChunk _stream_readable.js:123:26
     80    0.5%     40    0.3%      LazyCompile: *emit events.js:117:44
     35    0.2%      6    0.0%        LazyCompile: ~<anonymous> /Users/user113524/spawn-bench.js:37:36
     69    0.4%      8    0.1%      LazyCompile: *maybeReadMore _stream_readable.js:422:23
     42    0.3%     22    0.1%        LazyCompile: *nextTick node.js:475:22
     20    0.1%     20    0.1%          Stub: GrowArrayElementsStub
     52    0.3%     28    0.2%      LazyCompile: *Readable.read _stream_readable.js:250:35
     22    0.1%      7    0.0%        LazyCompile: *howMuchToRead _stream_readable.js:211:23
     37    0.2%     37    0.2%      Stub: LoadICTrampolineStub
    180    1.1%     16    0.1%    LazyCompile: *readableAddChunk _stream_readable.js:123:26
     37    0.2%     17    0.1%      LazyCompile: *Readable.read _stream_readable.js:250:35
     36    0.2%     31    0.2%      LazyCompile: *emit events.js:117:44
     26    0.2%     17    0.1%      LazyCompile: *nextTick node.js:475:22
     20    0.1%      6    0.0%      LazyCompile: ~emit events.js:117:44
     19    0.1%     19    0.1%      Stub: InstanceofStub
    295    1.9%    295    1.9%  _purgeable_ptr_in_use_enumerator
    257    1.6%    257    1.6%  _purgeable_print
    252    1.6%    252    1.6%  _szone_error
    205    1.3%    205    1.3%  _szone_good_size
    158    1.0%     67    0.4%  LazyCompile: *_tickCallback node.js:333:27
     40    0.3%     12    0.1%    LazyCompile: doNTCallback2 node.js:436:27
     16    0.1%      4    0.0%      LazyCompile: ~maybeReadMore_ _stream_readable.js:429:24
     91    0.6%     91    0.6%  __ZN2v88internal4Heap22RegisterNewArrayBufferEbPvm
     88    0.6%     88    0.6%  _bitarray_zap_first_set
     87    0.6%     87    0.6%  _large_entry_free_no_lock
     83    0.5%     83    0.5%  _hex
     81    0.5%     81    0.5%  _szone_malloc_should_clear
...

Also seems that regressions from 0.10.40 to 0.12.7 to 4.2.1 are partly to blame. On this 10.9.5 setup I get the following for spawn w/ pipe:

  • 0.10.40: ~8.2 secs
  • 0.12.7: ~13.2 secs
  • 4.2.1: ~19 secs

So, it definitely exists in OS X 10.9, and at least half of the slowdown appears to be from regressions from 0.10 where it performs "only" 2-3x slower than an equivalent stdin pipe.

That said, that factor of 2-3x for 0.10.40 is still much greater than any overhead 4.2.1 causes on Linux, so something weird is happening in OS X land.

@mhart changed the title from "Very slow child_process.spawn stdout stream performance on Mac OS X 10.11" to "Very slow child_process.spawn stdout stream performance on Mac OS X 10.9 – 10.11" on Oct 19, 2015
@mhart
Contributor Author

mhart commented Oct 20, 2015

So I removed gzip from the equation and ran some experiments with different node versions on my OS X 10.11 machine comparing spawn with stdin, streaming 1GB. Here are the results.

(higher values = worse performance = 😿)

[chart: average execution time per Node.js version, spawn vs stdin – higher is worse]

Read further below for how I generated this data, but here are some things to note:

  • Execution times for both spawn and stdin increased markedly from v0.11.3 to v0.11.4, and stdin performance hasn't really changed much since then (perhaps has increased slightly recently, but still not near v0.11.3 levels)
  • spawn had a dire increase in execution time from v2.5.0 to v3.0.0 and hasn't been near v0.x/v1.x/v2.x levels since.

To create this, I generated a file using head -c 1000000000 /dev/random > file.dat and then for each version of node (inc v0.8.28, then v0.10.40 and upwards) I ran the following programs five times each and averaged the results in the chart:

spawn.js:

var spawn = require('child_process').spawn

spawnPipe(function(err, time) {
  if (err) return console.error(err.stack || err)
  console.log([process.version, 'spawn', time].join(','))
})

function spawnPipe(cb) {
  var bufferLength = 0, bytesRead = 0
  var time = process.hrtime()
  var gzip = spawn('cat', ['file.dat'], {stdio: ['ignore', 'pipe', 'ignore']})

  if (gzip.stdout != null) {
    gzip.stdout.on('data', function(data) {
      if (!bufferLength) bufferLength = data.length
      bytesRead += data.length
    })
  }
  gzip.on('error', cb)
  gzip.on('close', function(code) {
    if (code !== 0) return cb(new Error('gzip exited with: ' + code))
    if (bytesRead != 1000000000) return cb(new Error('Incorrect bytes read: ' + bytesRead))
    var diff = process.hrtime(time)
    cb(null, diff[0] + diff[1] / 1e9)
  })
}

stdin.js (using cat file.dat | node stdin.js):

stdinPipe(function(err, time) {
  if (err) return console.error(err.stack || err)
  console.log([process.version, 'stdin', time].join(','))
})

function stdinPipe(cb) {
  var bufferLength = 0, bytesRead = 0
  var time = process.hrtime()

  process.stdin.resume()
  process.stdin.on('data', function(data) {
    if (!bufferLength) bufferLength = data.length
    bytesRead += data.length
  })
  process.stdin.on('error', cb)
  process.stdin.on('end', function() {
    if (bytesRead != 1000000000) return cb(new Error('Incorrect bytes read: ' + bytesRead))
    var diff = process.hrtime(time)
    cb(null, diff[0] + diff[1] / 1e9)
  })
}

I invoked these using the following bash script to generate the CSV data:

#!/bin/bash

VERSIONS="4.2.1 0.12.7 0.10.40 0.8.28 0.11.16 3.3.1 2.5.0 1.8.4 \
  0.11.0    0.11.1  0.11.2 \
  0.11.3    0.11.4  0.11.5  0.11.6  0.11.7  0.11.8  0.11.9  0.11.10 0.11.11 \
  0.11.12   0.11.13 0.11.14 0.11.15 0.12.0  0.12.1  0.12.2  0.12.3 \
  0.12.4    0.12.5  0.12.6  4.0.0   4.1.0   4.1.1   4.1.2   4.2.0 \
  1.0.0 1.0.1   1.0.2   1.0.3   1.0.4   1.1.0   1.2.0   1.3.0   1.4.1 \
  1.4.2 1.4.3   1.5.0   1.5.1   1.6.0   1.6.1   1.6.2   1.6.3   1.6.4 \
  1.7.1 1.8.1   1.8.2   1.8.3   2.0.0   2.0.1   2.0.2   2.1.0 \
  2.2.0 2.2.1   2.3.0   2.3.1   2.3.2   2.3.3   2.3.4   2.4.0 \
  3.0.0 3.1.0   3.2.0   3.3.0"

for v in $VERSIONS; do
  for i in 0 1 2 3 4 5; do
    nave use $v node spawn.js
    nave use $v sh -c 'cat file.dat | node stdin.js'
  done
done

This doesn't shed much light on why Mac OS X would be any different from Linux (or if something happened from OS X 10.8 to 10.9) – but it does highlight where some pretty serious regressions occurred for spawn on OS X itself.

@sidorares

@bnoordhuis @mhart I added version marker support but haven't published yet – I need to properly extract all historical tickprocessor.js versions and map which one to use (and add a heuristic to detect which to use if there's no marker).

@mhart
Contributor Author

mhart commented Oct 20, 2015

Here's another version of the same data, showing processing speed this time (so higher is better).

That drop from v0.11.3 to v0.11.4 is just... ouch.

[chart: processing speed per Node.js version, spawn vs stdin – higher is better]

(Updated: there's less variance if I use the max instead of the mean)

@mhart
Contributor Author

mhart commented Oct 20, 2015

(as a hunch, any chance that massive drop in performance from 0.11.3 to 0.11.4 is due to the removal of the SlabAllocator in ec90e6e ?)

@bnoordhuis
Member

It's possible. The OS X implementation of malloc is pretty naive.

@mhart
Contributor Author

mhart commented Oct 20, 2015

A graph of Linux performance for comparison. Ran on an EC2 c3.large instance, generated file using fallocate -l 1000000000 file.dat

[chart: Linux (EC2 c3.large) processing speed per Node.js version, spawn vs stdin]

No disparity really exists here (indeed spawn is usually faster than stdin), and the v0.11.3 to v0.11.4 leap doesn't exist. Rather precipitous drop entering v3.x though.

Still think the buffer size difference must be a factor on OS X here. Smaller buffer for spawn = more allocations = crappy performance.
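
Back-of-the-envelope (a quick sanity check, not a measurement): streaming 1GB in 8k chunks means roughly 122k 'data' events and Buffer allocations, versus roughly 15k at 64k – an 8x difference in per-chunk overhead.

var GB = 1000 * 1000 * 1000
console.log(Math.round(GB / 8192))    // ~122070 chunks at 8k
console.log(Math.round(GB / 65536))   // ~15259 chunks at 64k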

@mhart
Contributor Author

mhart commented Oct 21, 2015

Extracted the v0.11.3 to v0.11.4 regression into a separate issue

@mhart
Contributor Author

mhart commented Oct 21, 2015

Extracted the regression from v3.0.0 onwards into a separate issue

@mhart
Contributor Author

mhart commented Jul 8, 2017

@Trott I'm guessing it is too – instructions to repro are here: #3429 (comment)

@mhart
Contributor Author

mhart commented Jul 8, 2017

Yep, still similar on latest macOS/Node.js – reading a spawn stream still much slower than reading a stdin stream:

[chart: spawn vs stdin stream read performance on latest macOS/Node.js]

@Trott
Member

Trott commented Jul 8, 2017

reading a spawn stream still much slower than reading a stdin stream:

Guess we can't turn it around and say "Oh, it's a feature! stdin is much faster on OS X than spawn!" Just kidding. I think.

/cc @nodejs/platform-macos @nodejs/benchmarking

@mhart changed the title from "Very slow child_process.spawn stdout stream performance on Mac OS X 10.9 – 10.11" to "Very slow child_process.spawn stdout stream performance on macOS" on Jul 8, 2017
@mhart
Contributor Author

mhart commented Jul 8, 2017

So @Trott what do you think about my buffer size theory then?

one thing I suspected was the stream buffer size – you can see when spawning it's only 8k, whereas it's 64k when piping from stdin, and I had better performance upping it even further (to 32MB)

any pointers on how to set them for spawn would be great

@Trott
Member

Trott commented Jul 8, 2017

/cc @cjihrig

@MylesBorins
Contributor

@mhart was this affected by the TF+I landing?

@mhart
Contributor Author

mhart commented Aug 16, 2017

@MylesBorins I mean... I doubt it. But you can always double check! #3429 (comment)

@mhart
Contributor Author

mhart commented Aug 17, 2017

Nah, no difference from v8.1.3 – spawn is still only 60% of the performance of stdin:

[chart: spawn vs stdin throughput on v8.x – spawn at ~60% of stdin]

(again, on Linux there's no difference)

@mhart
Contributor Author

mhart commented Aug 17, 2017

And just to confirm: manual gc'ing doesn't change the spawn/stdin gap, it just gets rid of the performance regression from v0.10.48:

[chart: spawn vs stdin with manual gc across Node.js versions]

@gireeshpunathil
Member

easily recreatable with this:

$ cat 3429.js

var spawn = require('child_process').spawn
var chunk = 'g'.repeat(65535)
var cp = spawn('echo', [chunk])
var count = 0
cp.stdout.on('data', (d) => {
  count++
})
cp.on('close', () => {
  console.log(`${require('os').platform()}: ${count}`)
})

Linux: 1
darwin: 8

The key difference here is that Linux reads in 64KB (or sometimes even 1MB) chunks in one shot – probably limited only by the socket buffer limits – while on macOS it is always 8KB, so many more round trips are needed to complete the task.

To check whether echo or cat is rate-limiting this from within, eliminate them and bring node in as the child too:

var spawn = require('child_process').spawn
var fs = require('fs')
var chunk = 'g'.repeat(65535)

if(process.argv[2] === 'child') {
  console.log(chunk)
}
else {
  var cp = spawn(process.execPath, [process.argv[1], 'child'])
  var count = 0
  cp.stdout.on('data', (d) => {
    count++
  })
  cp.on('close', () => {
    console.log(`${require('os').platform()}: ${count}`)
  })
}

But again, the same pattern is observed - the data is chunked at the descriptor level.

30358/0x938fb:  write(0x1, "gggggggggggggggg", 0xC000)           = 8192 0
30356/0x938e0:  kevent(0x3, 0x7FFF5FBF6C00, 0x0)                 = 1 0
30356/0x938e0:  read(0xD, "gggggggggggggggg", 0x10000)           = 8192 0

Comparing with the good case:

With command | node, where stdin is used as opposed to stdout, bash performs the piping, which results in large chunks flowing through:

30572/0x94ea0: read(0x0, "ggggggggggggggggg", 0x10000) = 65536 0

In the slow case, fd 1 is a socket.
In the good case, fd 0 is a FIFO (a named pipe).

This is the same on macOS and Linux, but their underlying resource utilization characteristics seem to differ.
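
One quick way to see this (a sketch, not from the traces above – the check-fd.js name is hypothetical): fstat descriptor 0 from inside the receiving process.

var fs = require('fs')

// Run as `cat file.dat | node check-fd.js` (bash does the piping, so stdin is a FIFO),
// or spawn this script from another node process with stdio 'pipe' (libuv creates a
// socketpair, so stdin is a UNIX domain socket) and compare the output.
var st = fs.fstatSync(0)
console.log(st.isFIFO() ? 'stdin is a FIFO' : st.isSocket() ? 'stdin is a socket' : 'stdin is something else')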

@gireeshpunathil
Member

diff --git a/deps/uv/src/unix/process.c b/deps/uv/src/unix/process.c
index 3a3cfd6f09..93d83b48e1 100644
--- a/deps/uv/src/unix/process.c
+++ b/deps/uv/src/unix/process.c
@@ -200,7 +200,8 @@ static int uv__process_init_stdio(uv_stdio_container_t* container, int fds[2]) {
     if (container->data.stream->type != UV_NAMED_PIPE)
       return UV_EINVAL;
     else
-      return uv__make_socketpair(fds, 0);
+      return uv__make_pipe(fds, 0);

this patch brings the darwin case on par with Linux. However, I did not run any other tests and haven't thought about the implications of doing so; I just wanted to complete the investigation and ratify my findings above.

I added these values in /etc/sysctl.conf, but they do not seem to help; probably the relevant variables are different for UNIX domain sockets:

kern.ipc.maxsockbuf=8000000
net.inet.tcp.sendspace=1048576
net.inet.tcp.recvspace=2097152

So the bottom line is that socket buffers are smaller and slower than pipes on darwin, at least in their default configurations.

/cc @nodejs/child_process

@mhart
Contributor Author

mhart commented Apr 17, 2018

@gireeshpunathil yeah, I figured the buffer size had a large part to play in the slowdown.

The only problem is I can't see how to adjust it for child_process.spawn like I can with other streaming methods.

@gireeshpunathil
Member

Let us wait to hear from the team. I will also try to understand the differences between FIFOs and sockets in their buffering behavior, and which OS tunables control them.

@santigimeno
Member

santigimeno commented Apr 17, 2018

@gireeshpunathil I don't think you can pass ancillary data via pipes. Maybe that's the reason for using sockets instead of pipes.

@Trott
Member

Trott commented Apr 17, 2018

@nodejs/libuv (for @gireeshpunathil's questions/observations above)

@bnoordhuis
Member

I don't think you can pass ancillary data via pipes. Maybe that's the reason of using sockets instead of pipes.

That's right. Sending file descriptors only works over UNIX domain sockets.

I'm not aware of socket options that let you tweak the size of the send and receive buffers; I'm reasonably sure there aren't any.

@Trott
Member

Trott commented Apr 17, 2018

I'm not aware of socket options that let you tweak the size of the send and receive buffers; I'm reasonably sure there aren't any.

@bnoordhuis Does that mean this should likely be closed with something along the lines of "Buffer size is necessarily OS dependent, so this is just the way it is."? Or am I reading too much into what you're saying?

@bnoordhuis
Member

No, I think that sums it up. I'll close this out.

Libuv could tickle the kernel in ways that might eke out a few percent more throughput but not in any way that's documented or stable. The next Apple patch could undo that optimization again.

FWIW, Node.js on my 2012 MacBook Air can push almost 200M/sec to wc -c, which seems plenty fast to me:

let { WriteWrap } = process.binding('stream_wrap')
let { spawn } = require('child_process')
let proc = spawn('sh', ['-c', 'wc -c'], {stdio: ['ipc', 'inherit']})  // 'ipc' or 'pipe'
let data = Buffer.from('.'.repeat(1 << 20))

let i = 192, next = () => {
  if (--i < 0) return proc.channel ? proc.disconnect() : proc.stdin.end()
  let req = new WriteWrap()
  req.async = false
  req.handle = proc.channel || proc.stdin._handle
  req.oncomplete = next
  req.handle.writeBuffer(req, data)
}
next()
$ time ./out/Release/node 3429.js
 201326592

real    0m1.024s
user    0m0.964s
sys     0m0.315s

We should look into reducing that user CPU time, but that isn't specific to macOS.
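
For comparison, a plain-streams version of the same push (a sketch under the same assumptions – sh, wc -c, 192 x 1MB of dots – using the public stream API rather than the WriteWrap internals above):

let { spawn } = require('child_process')

let proc = spawn('sh', ['-c', 'wc -c'], {stdio: ['pipe', 'inherit', 'inherit']})
let data = Buffer.from('.'.repeat(1 << 20))

let i = 192
let next = () => {
  while (i > 0) {
    i--
    // write() returning false means the internal buffer is full; wait for 'drain'
    if (!proc.stdin.write(data)) return proc.stdin.once('drain', next)
  }
  proc.stdin.end()
}
next()

Comparing its wall-clock time against the benchmarklet gives a rough idea of how much the streams machinery itself costs.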

@santigimeno
Member

I'm not aware of socket options that let you tweak the size of the send and receive buffers; I'm reasonably sure there aren't any.

So SO_SNDBUF and SO_RCVBUF options wouldn't work for this, right? (Just asking, I haven't used them)

@bnoordhuis
Member

No, with qualifications. It's not that you can't set them, but they don't have any real effect.

I initially figured it might help with blocking reads and writes (because UNIX domain sockets in XNU do have a concept of high and low water marks) but there's no measurable impact.

@gireeshpunathil
Member

FWIW, I was able to apply setsockopt of 64KB on the socketpair successfully, and improve things:

with existing logic:

v9.11.1,spawn,3.435861797
v9.11.1,spawn,2.351160667
v9.11.1,spawn,2.348283289
v9.11.1,spawn,2.534070136
v9.11.1,spawn,2.424818775

with 64KB SNDBUF and RCVBUF on the socketpair:

v10.0.0-pre,spawn,0.590655946
v10.0.0-pre,spawn,0.568137925
v10.0.0-pre,spawn,0.563020876
v10.0.0-pre,spawn,0.564211097
v10.0.0-pre,spawn,0.594333673
29985/0x133545:  writev(0x1, 0x7FCA69C02430, 0x2)                = 65536 0
29983/0x13352d:  read(0xD, "ggggggggggggggggg", 0x10000)          = 8192 0
29983/0x13352d:  kevent(0x3, 0x7FFF5FBF6C00, 0x0)                = 1 0
29983/0x13352d:  read(0xD, "ggggggggggggggggg", 0x10000)          = 8192 0

Reading still happens in 8KB chunks only, but it looks like the kernel buffer holds more data, so Node was able to read 8 times back-to-back before making the callback, and the performance is not lost.

In the absence of clear documentation on the OS tunables for UNIX domain sockets, I agree this change is not reasonable.

@bnoordhuis
Member

That's interesting. Does my benchmarklet from #3429 (comment) give different numbers on your system? I tried what you tried but it made no difference.

Diff so we're talking about the same thing:

diff --git a/deps/uv/src/unix/process.c b/deps/uv/src/unix/process.c
index 3a3cfd6f09..767a8cd940 100644
--- a/deps/uv/src/unix/process.c
+++ b/deps/uv/src/unix/process.c
@@ -144,6 +144,20 @@ skip:
     uv__nonblock(fds[1], 1);
   }
 
+  {
+    int fd, err, opt;
+
+    for (fd = 0; fd < 2; ++fd) {
+      opt = 65535;
+      if ((err = setsockopt(fds[fd], SOL_SOCKET, SO_SNDBUF, &opt, sizeof(opt))))
+        return err;
+
+      opt = 65535;
+      if ((err = setsockopt(fds[fd], SOL_SOCKET, SO_RCVBUF, &opt, sizeof(opt))))
+        return err;
+    }
+  }
+
   return 0;
 }
 

@gireeshpunathil
Member

mine is this:

diff --git a/deps/uv/src/unix/process.c b/deps/uv/src/unix/process.c
index 3a3cfd6f09..724fc7b660 100644
--- a/deps/uv/src/unix/process.c
+++ b/deps/uv/src/unix/process.c
@@ -188,6 +188,8 @@ skip:
 static int uv__process_init_stdio(uv_stdio_container_t* container, int fds[2]) {
   int mask;
   int fd;
+  int ret;
+  int sz = 64 * 1024;
 
   mask = UV_IGNORE | UV_CREATE_PIPE | UV_INHERIT_FD | UV_INHERIT_STREAM;
 
@@ -199,8 +201,20 @@ static int uv__process_init_stdio(uv_stdio_container_t* container, int fds[2]) {
     assert(container->data.stream != NULL);
     if (container->data.stream->type != UV_NAMED_PIPE)
       return UV_EINVAL;
-    else
-      return uv__make_socketpair(fds, 0);
+    else {
+      ret = uv__make_socketpair(fds, 0);
+      if(0 == ret) {
+        ret = setsockopt(fds[0], SOL_SOCKET, SO_RCVBUF, &sz, 4);
+        if(0 == ret)
+          ret = setsockopt(fds[0], SOL_SOCKET, SO_SNDBUF, &sz, 4);
+          if(0 == ret)
+            ret = setsockopt(fds[1], SOL_SOCKET, SO_RCVBUF, &sz, 4);
+              if(0 == ret) {
+                ret = setsockopt(fds[1], SOL_SOCKET, SO_SNDBUF, &sz, 4);
+              }
+      }
+      return ret;
+    }

Plus, I have this – really unsure whether it had any impact or not:
$ cat /etc/sysctl.conf

kern.ipc.maxsockbuf=8000000
net.inet.tcp.sendspace=1048576
net.inet.tcp.recvspace=2097152
net.inet.tcp.autorcvbufmax=33554432
net.inet.tcp.autosndbufmax=33554432
net.local.stream.recvspace=65536
net.local.stream.sendspace=65536
net.inet.tcp.slowlink_wsize=65536
net.inet.tcp.autosndbufinc=65536
net.inet.raw.maxdgram=65536
net.inet.raw.recvspace=65536
net.inet6.ip6.maxfrags=65536
net.stats.recvspace=65536

will try your benchmark now.

@gireeshpunathil
Member

$ time ./out/Release/node 3429.js
201326592

real	0m0.753s
user	0m0.696s
sys	0m0.185s
$

pretty much the same, with or without the patch.

@bnoordhuis
Member

Right, so at least that matches up. Tweaking net.local.stream.recvspace and net.local.stream.sendspace had no discernible impact for me either.

I think we've tried enough that we can safely say there's no easy fix. :-)
