Very slow child_process.spawn stdout stream performance on macOS #3429
Comments
I ran with `--prof`. Is this mostly IPC overhead?
FWIW, I can't reproduce on OS X 10.8. The fact that the process spends nearly 40% of its time in openat() is suspicious.
@bnoordhuis interesting – is there something I can do to narrow this down then? Perhaps this is an OS X-only issue, or even OS X 10.11 only. I'll test the same setup on a Linux box in a sec.
Maybe you can put a dtrace probe on openat() that prints the filename and the time spent inside the system call? As root:
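Something like this (a sketch only; the `openat_nocancel` variant actually run appears in the next comment):
$ sudo dtrace -n 'syscall::openat:entry /pid == $target/ { self->timestamp = timestamp; self->path = copyinstr(arg1); }' \
              -n 'syscall::openat:return /pid == $target/ { printf("%d %s\n", timestamp - self->timestamp, self->path) }' \
              -c 'node spawn-bench.js'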
I'm just guessing at the name of the openat probe, 10.8 doesn't have that system call.
Well, that appears to be the correct syscall afaict:
$ sudo dtrace -l | grep -w syscall | grep openat
1076 syscall openat entry
1077 syscall openat return
1078 syscall openat_nocancel entry
1079 syscall openat_nocancel return
But running the command you gave gives me no (extra) output, even if I swap the `openat` probe for `openat_nocancel`:
$ sudo dtrace -n 'syscall::openat_nocancel:entry /pid == $target/ { self->timestamp = timestamp; self->path = copyinstr(arg1); }' -n 'syscall::openat_nocancel:return /pid == $target/ { printf("%d %s\n", timestamp - self->timestamp, self->path) }' -c 'node spawn-bench.js'
dtrace: description 'syscall::openat_nocancel:entry ' matched 1 probe
dtrace: description 'syscall::openat_nocancel:return ' matched 1 probe
stream took 25.471454835 seconds
buffer length was 8192
dtrace: pid 56635 has exited
(should I be looking somewhere else for the dtrace output?)
I was using https://github.com/sidorares/node-tick to process the results of `--prof`.
Oh, and replacing …
Can you use the tick processor bundled with node (tools/v8-prof/tick-processor.js)?
@sidorares Don't know if node-tick already does that but newer versions of V8 start the log with a v8-version marker that you can use to print a warning when the tick processor doesn't match. Feel free to steal the logic from here.
@bnoordhuis I kinda remember a similar discussion happening here 😄 sidorares/node-tick#10 (comment)
Same (similar) result with the bundled tick processor:
$ node tools/v8-prof/tick-processor.js ../isolate-0x102800000-v8.log
Statistical profiling result from ../isolate-0x102800000-v8.log, (19119 ticks, 43 unaccounted, 0 excluded).
[Shared libraries]:
ticks total nonlib name
79 0.4% /usr/lib/system/libsystem_malloc.dylib
50 0.3% /usr/lib/system/libsystem_pthread.dylib
[JavaScript]:
ticks total nonlib name
82 0.4% 0.4% Stub: LoadICTrampolineStub
66 0.3% 0.3% LazyCompile: *nextTick node.js:475:22
61 0.3% 0.3% LazyCompile: *onread net.js:501:16
55 0.3% 0.3% LazyCompile: *emit events.js:117:44
45 0.2% 0.2% LazyCompile: *Readable.read _stream_readable.js:250:35
44 0.2% 0.2% LazyCompile: *_tickCallback node.js:333:27
27 0.1% 0.1% Stub: InstanceofStub
...
[C++]:
ticks total nonlib name
6310 33.0% 33.2% _mach_port_extract_member
5364 28.1% 28.2% _openat
2030 10.6% 10.7% ___rename_ext
1308 6.8% 6.9% _free
203 1.1% 1.1% _malloc_zone_valloc
187 1.0% 1.0% _malloc_zone_malloc
146 0.8% 0.8% v8::internal::Heap::RegisterNewArrayBuffer(bool, void*, unsigned long)
86 0.4% 0.5% v8::Context::Enter()
86 0.4% 0.5% _uv__io_poll
...
[Summary]:
ticks total nonlib name
800 4.2% 4.2% JavaScript
18147 94.9% 95.6% C++
1109 5.8% 5.8% GC
129 0.7% Shared libraries
43 0.2% Unaccounted
[C++ entry points]:
ticks cpp total name
27 25.0% 0.1% v8::internal::Builtins::~Builtins()
18 16.7% 0.1% v8::internal::Runtime_CompileLazy(int, v8::internal::Object**, v8::internal::Isolate*)
10 9.3% 0.1% v8::internal::Runtime_StackGuard(int, v8::internal::Object**, v8::internal::Isolate*)
9 8.3% 0.0% v8::internal::Runtime_Interrupt(int, v8::internal::Object**, v8::internal::Isolate*)
8 7.4% 0.0% v8::internal::StoreIC_Miss(int, v8::internal::Object**, v8::internal::Isolate*)
8 7.4% 0.0% v8::internal::LoadIC_Miss(int, v8::internal::Object**, v8::internal::Isolate*)
7 6.5% 0.0% v8::internal::Runtime_CompileOptimized(int, v8::internal::Object**, v8::internal::Isolate*)
4 3.7% 0.0% v8::internal::KeyedStoreIC_Miss(int, v8::internal::Object**, v8::internal::Isolate*)
3 2.8% 0.0% v8::Isolate::RunMicrotasks()
2 1.9% 0.0% v8::internal::ToBooleanIC_Miss(int, v8::internal::Object**, v8::internal::Isolate*)
2 1.9% 0.0% v8::internal::KeyedLoadIC_Miss(int, v8::internal::Object**, v8::internal::Isolate*)
...
[Bottom up (heavy) profile]:
Note: percentage shows a share of a particular caller in the total
amount of its parent calls.
Callers occupying less than 2.0% are not shown.
ticks parent name
6310 33.0% _mach_port_extract_member
5364 28.1% _openat
2030 10.6% ___rename_ext
1308 6.8% _free
@bnoordhuis is there a way you know of to increase the buffer size for spawn stdio streams/pipes?
In an attempt to see all of the syscalls made here, I ran … With `pipe`, it was called 24418 times, compared to 66 times with `ignore`.
Working hypothesis: the openat and __rename_ext calls (maybe mach_port_extract_member too) are caused by V8 logging to disk. @mhart Can you try the dtrace command again while passing `--prof`?
FWIW, I tested with 0.12.7 and 0.10.40 on the same machine – I do get slightly better results – still at least 2x slower than reading from stdin though. Would love to up the buffer sizes to see if that helps.
$ nave use 0.12.7 node spawn-bench.js
Already installed: 0.12.7
using 0.12.7
stream took 19.746245263 seconds
buffer length was 8192
$ nave use 0.10.40 node spawn-bench.js
Already installed: 0.10.40
using 0.10.40
stream took 15.013791288 seconds
buffer length was 8192
@bnoordhuis will try that dtrace command now
No difference in the dtrace output using `--prof`. For comparison, timings with and without it:
$ node spawn-bench.js
stream took 26.928994887 seconds
buffer length was 8192
$ node --prof spawn-bench.js
stream took 22.790958391 seconds
buffer length was 8192
$ node spawn-bench.js
stream took 25.437598346 seconds
buffer length was 8192
$ node --prof spawn-bench.js
stream took 22.1429554 seconds
buffer length was 8192
$ node spawn-bench.js
stream took 25.366320914 seconds
buffer length was 8192
$ node --prof spawn-bench.js
stream took 22.147164474 seconds
buffer length was 8192
(ok, maybe forget the `--prof` theory)
Interestingly, on an EC2 Ubuntu Linux instance, using node+zlib with large buffers is actually markedly faster than the default gzip command line (which is kinda cool):
$ uname -a
Linux 3.19.0-30-generic #34-Ubuntu SMP Fri Oct 2 22:08:41 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
$ node --version
v4.2.1
# using zlib and 32MB buffers:
$ node spawn-bench.js
stream took 9.823702408 seconds
bytes read: 2674732313
buffer length: 33554432
$ time gzip -cd file.gz > /dev/null
real 0m14.706s
user 0m14.584s
sys 0m0.120s
And there appears to be no difference between spawn and stdin times – the spawn buffer is larger on Linux it seems, 64k:
# using spawn with `ignore`:
$ node spawn-bench.js
stream took 14.649728433 seconds
# using spawn with `pipe`:
$ node spawn-bench.js
stream took 15.488220565 seconds
bytes read: 2674732313
buffer length: 65536
# changing to read from stdin:
$ gzip -cd file.gz | node spawn-bench.js
stream took 15.568545432 seconds
bytes read: 2674732313
buffer length: 65536
(these results are consistent after I run them a few times)
So, again: I'd really love to play around with the buffer sizes here to see if it's stream/pipe overhead that's causing the issue, so any pointers on how to set them for spawn would be great.
So 10.11 may be partly to blame. I just tested on a machine w/ Mac OS X 10.10.5 – a 2GHz mid-2012 MB Air, and I saw more like a 2-3x slowdown instead of 4x.
Using `ignore` (~6.7 secs):
$ node spawn-bench.js
stream took 6.734095193 seconds
$ node spawn-bench.js
stream took 6.760609026 seconds
$ node spawn-bench.js
stream took 6.768254683 seconds
Using `pipe` (~18-19 secs):
$ node spawn-bench.js
stream took 18.098372379 seconds
bytes read: 2674732313
buffer length: 8192
$ node spawn-bench.js
stream took 18.301364698 seconds
bytes read: 2674732313
buffer length: 8192
$ node spawn-bench.js
stream took 19.535337659 seconds
bytes read: 2674732313
buffer length: 8192
piping to stdin (~7.7 secs):
$ gzip -cd file.gz | node spawn-bench.js
stream took 7.753027612 seconds
bytes read: 2674732313
buffer length: 65536
$ gzip -cd file.gz | node spawn-bench.js
stream took 7.710982234 seconds
bytes read: 2674732313
buffer length: 65536
$ gzip -cd file.gz | node spawn-bench.js
stream took 7.70945793 seconds
bytes read: 2674732313
buffer length: 65536
And the profile for the `pipe` run:
$ node-tick-processor --mac isolate-0x101804c00-v8.log
Statistical profiling result from isolate-0x101804c00-v8.log, (15488 ticks, 17 unaccounted, 0 excluded).
[Unknown]:
ticks total nonlib name
17 0.1%
[Shared libraries]:
ticks total nonlib name
47 0.3% 0.0% /usr/lib/system/libsystem_pthread.dylib
20 0.1% 0.0% /usr/lib/system/libsystem_malloc.dylib
1 0.0% 0.0% /usr/lib/system/libsystem_platform.dylib
[JavaScript]:
ticks total nonlib name
55 0.4% 0.4% LazyCompile: *emit events.js:117:44
53 0.3% 0.3% LazyCompile: *_tickCallback node.js:333:27
49 0.3% 0.3% Stub: LoadICTrampolineStub
46 0.3% 0.3% LazyCompile: *nextTick node.js:475:22
42 0.3% 0.3% LazyCompile: *Readable.read _stream_readable.js:250:35
32 0.2% 0.2% LazyCompile: *onread net.js:501:16
24 0.2% 0.2% Stub: InstanceofStub
16 0.1% 0.1% LazyCompile: ~readableAddChunk _stream_readable.js:123:26
16 0.1% 0.1% LazyCompile: *howMuchToRead _stream_readable.js:211:23
15 0.1% 0.1% Stub: GrowArrayElementsStub
15 0.1% 0.1% Stub: CallICStub(args(2), FUNCTION,
12 0.1% 0.1% Stub: LoadICStub
11 0.1% 0.1% Stub: CallFunctionStub_Args3
11 0.1% 0.1% LazyCompile: *readableAddChunk _stream_readable.js:123:26
11 0.1% 0.1% Builtin: ArgumentsAdaptorTrampoline
10 0.1% 0.1% Stub: CompareICStub {1}
...
[C++]:
ticks total nonlib name
5694 36.8% 36.9% ___bsdthread_ctl
3420 22.1% 22.2% ___thread_selfusage
2711 17.5% 17.6% _connectx
741 4.8% 4.8% _szone_free
434 2.8% 2.8% _szone_pressure_relief
115 0.7% 0.7% _szone_free_definite_size
89 0.6% 0.6% _szone_malloc_should_clear
74 0.5% 0.5% _mach_msg_destroy
50 0.3% 0.3% _uv__stream_io
47 0.3% 0.3% __ZN2v88internal4Heap22RegisterNewArrayBufferEbPvm
32 0.2% 0.2% _uv__io_poll
31 0.2% 0.2% __ZN2v88internal14LookupIterator14LookupInHolderEPNS0_3MapEPNS0_10JSReceiverE
29 0.2% 0.2% _vm_region_recurse_64
27 0.2% 0.2% __ZN2v88internal8JSObject12SetPrototypeENS0_6HandleIS1_EENS2_INS0_6ObjectEEEb
27 0.2% 0.2% __ZN2v87Context4ExitEv
26 0.2% 0.2% _tiny_malloc_from_free_list
25 0.2% 0.2% _platform_task_update_threads
25 0.2% 0.2% __simple_asl_msg_set
24 0.2% 0.2% __ZN2v88internal7Context14native_contextEv
24 0.2% 0.2% __ZN2v87Context5EnterEv
23 0.1% 0.1% __ZN2v88internalL6InvokeEbNS0_6HandleINS0_10JSFunctionEEENS1_INS0_6ObjectEEEiPS5_
21 0.1% 0.1% __ZN4node9AsyncWrap12MakeCallbackEN2v85LocalINS1_8FunctionEEEiPNS2_INS1_5ValueEEE
21 0.1% 0.1% __ZN4node16ContextifyScript3NewERKN2v820FunctionCallbackInfoINS1_5ValueEEE
21 0.1% 0.1% __ZN2v88internal15TimerEventScopeINS0_17TimerEventExecuteEE13LogTimerEventENS0_6Logger8StartEndE
20 0.1% 0.1% __simple_asl_send
20 0.1% 0.1% __malloc_initialize
...
[GC]:
ticks total nonlib name
1097 7.1%
[Bottom up (heavy) profile]:
Note: percentage shows a share of a particular caller in the total
amount of its parent calls.
Callers occupying less than 2.0% are not shown.
ticks parent name
5694 36.8% ___bsdthread_ctl
3420 22.1% ___thread_selfusage
2711 17.5% _connectx
741 4.8% _szone_free
434 2.8% _szone_pressure_relief
[Top down (heavy) profile]:
Note: callees occupying less than 0.1% are not shown.
inclusive self name
ticks total ticks total
5694 36.8% 5694 36.8% ___bsdthread_ctl
3419 22.1% 3419 22.1% ___thread_selfusage
2711 17.5% 2711 17.5% _connectx
740 4.8% 740 4.8% _szone_free
482 3.1% 32 0.2% LazyCompile: *onread net.js:501:16
295 1.9% 14 0.1% LazyCompile: ~readableAddChunk _stream_readable.js:123:26
59 0.4% 5 0.0% LazyCompile: *maybeReadMore _stream_readable.js:422:23
40 0.3% 30 0.2% LazyCompile: *nextTick node.js:475:22
55 0.4% 32 0.2% LazyCompile: *emit events.js:117:44
20 0.1% 2 0.0% LazyCompile: ~<anonymous> /Users/sofijastefanovic/spawn-bench.js:37:36
30 0.2% 30 0.2% Stub: LoadICTrampolineStub
28 0.2% 20 0.1% LazyCompile: *Readable.read _stream_readable.js:250:35
135 0.9% 10 0.1% LazyCompile: *readableAddChunk _stream_readable.js:123:26
34 0.2% 19 0.1% LazyCompile: *Readable.read _stream_readable.js:250:35
24 0.2% 23 0.1% LazyCompile: *emit events.js:117:44
21 0.1% 5 0.0% LazyCompile: ~emit events.js:117:44
19 0.1% 12 0.1% LazyCompile: *nextTick node.js:475:22
434 2.8% 434 2.8% _szone_pressure_relief
113 0.7% 113 0.7% _szone_free_definite_size
107 0.7% 53 0.3% LazyCompile: *_tickCallback node.js:333:27
30 0.2% 7 0.0% LazyCompile: doNTCallback2 node.js:436:27
87 0.6% 87 0.6% _szone_malloc_should_clear
74 0.5% 74 0.5% _mach_msg_destroy
55 0.4% 0 0.0% Function: ~<anonymous> node.js:10:10
53 0.3% 0 0.0% LazyCompile: ~startup node.js:13:19
23 0.1% 0 0.0% LazyCompile: ~Module.runMain module.js:465:26
22 0.1% 0 0.0% LazyCompile: Module._load module.js:271:24
19 0.1% 0 0.0% LazyCompile: ~Module.load module.js:346:33
18 0.1% 0 0.0% LazyCompile: ~Module._extensions..js module.js:440:37
16 0.1% 0 0.0% LazyCompile: ~Module._compile module.js:379:37
50 0.3% 50 0.3% _uv__stream_io
47 0.3% 47 0.3% __ZN2v88internal4Heap22RegisterNewArrayBufferEbPvm
43 0.3% 43 0.3% /usr/lib/system/libsystem_pthread.dylib
32 0.2% 32 0.2% _uv__io_poll
31 0.2% 31 0.2% __ZN2v88internal14LookupIterator14LookupInHolderEPNS0_3MapEPNS0_10JSReceiverE
29 0.2% 29 0.2% _vm_region_recurse_64
27 0.2% 27 0.2% __ZN2v88internal8JSObject12SetPrototypeENS0_6HandleIS1_EENS2_INS0_6ObjectEEEb
27 0.2% 27 0.2% __ZN2v87Context4ExitEv
...
Might have to scratch the 10.11 theory, it seems to exist on 10.9 too. I spun up a macincloud instance running OS X 10.9.5 on a 2.5GHz i5. The issues are even more pronounced on this setup: seeing a 5-7x slowdown compared w/ stdin here. Here I generated the gzip file using …
Using `ignore` (~2 secs):
$ node spawn-bench.js
stream took 2.036773606 seconds
$ node spawn-bench.js
stream took 2.049106005 seconds
$ node spawn-bench.js
stream took 2.00025435 seconds
Using stdin and streaming (~3.5 secs):
$ gzip -cd file.gz | node spawn-bench.js
stream took 3.46003934 seconds
bytes read: 2674732313
buffer length: 65536
$ gzip -cd file.gz | node spawn-bench.js
stream took 3.537773093 seconds
bytes read: 2674732313
buffer length: 65536
$ gzip -cd file.gz | node spawn-bench.js
stream took 3.562273549 seconds
bytes read: 2674732313
buffer length: 65536
Using `pipe` (~18-20 secs):
$ node spawn-bench.js
stream took 18.548235443 seconds
bytes read: 2674732313
buffer length: 8192
$ node spawn-bench.js
stream took 20.300110516 seconds
bytes read: 2674732313
buffer length: 8192
$ node spawn-bench.js
stream took 19.733407683 seconds
bytes read: 2674732313
buffer length: 8192
And here's the profile of the `pipe` run:
$ node-tick-processor --mac isolate-0x101804c00-v8.log
Statistical profiling result from isolate-0x101804c00-v8.log, (15787 ticks, 29 unaccounted, 0 excluded).
[Unknown]:
ticks total nonlib name
29 0.2%
[Shared libraries]:
ticks total nonlib name
17 0.1% 0.0% /usr/lib/system/libsystem_pthread.dylib
1 0.0% 0.0% /usr/lib/system/libsystem_c.dylib
[JavaScript]:
ticks total nonlib name
71 0.4% 0.5% LazyCompile: *emit events.js:117:44
67 0.4% 0.4% LazyCompile: *_tickCallback node.js:333:27
52 0.3% 0.3% Stub: LoadICTrampolineStub
52 0.3% 0.3% LazyCompile: *Readable.read _stream_readable.js:250:35
43 0.3% 0.3% LazyCompile: *onread net.js:501:16
43 0.3% 0.3% LazyCompile: *nextTick node.js:475:22
40 0.3% 0.3% Stub: InstanceofStub
28 0.2% 0.2% Stub: JSEntryStub
28 0.2% 0.2% LazyCompile: ~readableAddChunk _stream_readable.js:123:26
27 0.2% 0.2% Stub: GrowArrayElementsStub
26 0.2% 0.2% LazyCompile: *isNaN native v8natives.js:29:21
...
[C++]:
ticks total nonlib name
6488 41.1% 41.1% ___initgroups
2701 17.1% 17.1% _proc_listpidspath
1992 12.6% 12.6% _mprotect
295 1.9% 1.9% _purgeable_ptr_in_use_enumerator
257 1.6% 1.6% _purgeable_print
252 1.6% 1.6% _szone_error
205 1.3% 1.3% _szone_good_size
91 0.6% 0.6% __ZN2v88internal4Heap22RegisterNewArrayBufferEbPvm
...
[GC]:
ticks total nonlib name
1361 8.6%
[Bottom up (heavy) profile]:
Note: percentage shows a share of a particular caller in the total
amount of its parent calls.
Callers occupying less than 2.0% are not shown.
ticks parent name
6488 41.1% ___initgroups
2701 17.1% _proc_listpidspath
1992 12.6% _mprotect
[Top down (heavy) profile]:
Note: callees occupying less than 0.1% are not shown.
inclusive self name
ticks total ticks total
6487 41.1% 6487 41.1% ___initgroups
2701 17.1% 2701 17.1% _proc_listpidspath
1992 12.6% 1992 12.6% _mprotect
664 4.2% 43 0.3% LazyCompile: *onread net.js:501:16
407 2.6% 27 0.2% LazyCompile: ~readableAddChunk _stream_readable.js:123:26
80 0.5% 40 0.3% LazyCompile: *emit events.js:117:44
35 0.2% 6 0.0% LazyCompile: ~<anonymous> /Users/user113524/spawn-bench.js:37:36
69 0.4% 8 0.1% LazyCompile: *maybeReadMore _stream_readable.js:422:23
42 0.3% 22 0.1% LazyCompile: *nextTick node.js:475:22
20 0.1% 20 0.1% Stub: GrowArrayElementsStub
52 0.3% 28 0.2% LazyCompile: *Readable.read _stream_readable.js:250:35
22 0.1% 7 0.0% LazyCompile: *howMuchToRead _stream_readable.js:211:23
37 0.2% 37 0.2% Stub: LoadICTrampolineStub
180 1.1% 16 0.1% LazyCompile: *readableAddChunk _stream_readable.js:123:26
37 0.2% 17 0.1% LazyCompile: *Readable.read _stream_readable.js:250:35
36 0.2% 31 0.2% LazyCompile: *emit events.js:117:44
26 0.2% 17 0.1% LazyCompile: *nextTick node.js:475:22
20 0.1% 6 0.0% LazyCompile: ~emit events.js:117:44
19 0.1% 19 0.1% Stub: InstanceofStub
295 1.9% 295 1.9% _purgeable_ptr_in_use_enumerator
257 1.6% 257 1.6% _purgeable_print
252 1.6% 252 1.6% _szone_error
205 1.3% 205 1.3% _szone_good_size
158 1.0% 67 0.4% LazyCompile: *_tickCallback node.js:333:27
40 0.3% 12 0.1% LazyCompile: doNTCallback2 node.js:436:27
16 0.1% 4 0.0% LazyCompile: ~maybeReadMore_ _stream_readable.js:429:24
91 0.6% 91 0.6% __ZN2v88internal4Heap22RegisterNewArrayBufferEbPvm
88 0.6% 88 0.6% _bitarray_zap_first_set
87 0.6% 87 0.6% _large_entry_free_no_lock
83 0.5% 83 0.5% _hex
81 0.5% 81 0.5% _szone_malloc_should_clear
...
Also, it seems that regressions from 0.10.40 to 0.12.7 to 4.2.1 are partly to blame. On this 10.9.5 setup I get the following for each version:
So, it definitely exists in OS X 10.9, and at least half of the slowdown appears to be from regressions since 0.10, where it performs "only" 2-3x slower than an equivalent stdin pipe. That said, that factor of 2-3x for 0.10.40 is still much greater than any overhead 4.2.1 causes on Linux, so something weird is happening in OS X land.
So I removed gzip from the equation and ran some experiments with different node versions on my OS X 10.11 machine comparing spawn with stdin, streaming 1GB. Here are the results. (higher values = worse performance = 😿) Read further below for how I generated this data, but here are some things to note:
To create this, I generated a 1GB file.dat and used the two scripts below.
spawn.js:
var spawn = require('child_process').spawn
spawnPipe(function(err, time) {
if (err) return console.error(err.stack || err)
console.log([process.version, 'spawn', time].join(','))
})
function spawnPipe(cb) {
var bufferLength = 0, bytesRead = 0
var time = process.hrtime()
var gzip = spawn('cat', ['file.dat'], {stdio: ['ignore', 'pipe', 'ignore']})
if (gzip.stdout != null) {
gzip.stdout.on('data', function(data) {
if (!bufferLength) bufferLength = data.length
bytesRead += data.length
})
}
gzip.on('error', cb)
gzip.on('close', function(code) {
if (code !== 0) return cb(new Error('gzip exited with: ' + code))
if (bytesRead != 1000000000) return cb(new Error('Incorrect bytes read: ' + bytesRead))
var diff = process.hrtime(time)
cb(null, diff[0] + diff[1] / 1e9)
})
}
stdin.js (using `process.stdin`):
stdinPipe(function(err, time) {
if (err) return console.error(err.stack || err)
console.log([process.version, 'stdin', time].join(','))
})
function stdinPipe(cb) {
var bufferLength = 0, bytesRead = 0
var time = process.hrtime()
process.stdin.resume()
process.stdin.on('data', function(data) {
if (!bufferLength) bufferLength = data.length
bytesRead += data.length
})
process.stdin.on('error', cb)
process.stdin.on('end', function() {
if (bytesRead != 1000000000) return cb(new Error('Incorrect bytes read: ' + bytesRead))
var diff = process.hrtime(time)
cb(null, diff[0] + diff[1] / 1e9)
})
}
I invoked these using the following bash script to generate the CSV data:
#!/bin/bash
VERSIONS="4.2.1 0.12.7 0.10.40 0.8.28 0.11.16 3.3.1 2.5.0 1.8.4 \
0.11.0 0.11.1 0.11.2 \
0.11.3 0.11.4 0.11.5 0.11.6 0.11.7 0.11.8 0.11.9 0.11.10 0.11.11 \
0.11.12 0.11.13 0.11.14 0.11.15 0.12.0 0.12.1 0.12.2 0.12.3 \
0.12.4 0.12.5 0.12.6 4.0.0 4.1.0 4.1.1 4.1.2 4.2.0 \
1.0.0 1.0.1 1.0.2 1.0.3 1.0.4 1.1.0 1.2.0 1.3.0 1.4.1 \
1.4.2 1.4.3 1.5.0 1.5.1 1.6.0 1.6.1 1.6.2 1.6.3 1.6.4 \
1.7.1 1.8.1 1.8.2 1.8.3 2.0.0 2.0.1 2.0.2 2.1.0 \
2.2.0 2.2.1 2.3.0 2.3.1 2.3.2 2.3.3 2.3.4 2.4.0 \
3.0.0 3.1.0 3.2.0 3.3.0"
for v in $VERSIONS; do
for i in 0 1 2 3 4 5; do
nave use $v node spawn.js
nave use $v sh -c 'cat file.dat | node stdin.js'
done
done
This doesn't shed much light on why Mac OS X would be any different from Linux (or if something happened from OS X 10.8 to 10.9) – but it does highlight where some pretty serious regressions occurred for spawn on OS X itself.
@bnoordhuis @mhart I added version marker support but haven't published yet - need to properly extract all historical tickprocessor.js versions and map which one to use (and add a heuristic to detect which to use if no marker).
(as a hunch, any chance that massive drop in performance from 0.11.3 to 0.11.4 is due to the removal of the SlabAllocator in ec90e6e?)
It's possible. The OS X implementation of malloc is pretty naive.
A graph of Linux performance for comparison. Ran on an EC2 c3.large instance, generated the file using …
No disparity really exists here (indeed spawn is usually faster than stdin), and the v0.11.3 to v0.11.4 leap doesn't exist. Rather precipitous drop entering v3.x though.
Still think the buffer size difference must be a factor on OS X here. Smaller buffer for spawn = more allocations = crappy performance.
Extracted the v0.11.3 to v0.11.4 regression into a separate issue
Extracted the regression from v3.0.0 onwards into a separate issue
@Trott I'm guessing it is too – instructions to repro are here: #3429 (comment)
Guess we can't turn it around and say "Oh, it's a feature! stdin is much faster on OS X than spawn!" Just kidding. I think. /cc @nodejs/platform-macos @nodejs/benchmarking
So @Trott what do you think about my buffer size theory then?
/cc @cjihrig
@mhart was this affected by the TF+I landing?
@MylesBorins I mean... I doubt it. But you can always double check! #3429 (comment)
Easily recreatable with this:
$ cat 3429.js
var spawn = require('child_process').spawn
var chunk = 'g'.repeat(65535)
var cp = spawn('echo', [chunk])
var count = 0
cp.stdout.on('data', (d) => {
count++
})
cp.on('close', () => {
console.log(`${require('os').platform()}: ${count}`)
})
Linux: 1
The key difference here is that Linux reads in 64KB (or sometimes even 1MB) chunks in one shot - probably limited by the socket buffer limits only, while on mac it is always 8KB, so you need more round trips to complete the task (for the 64KB chunk above, that's one data event on Linux versus about eight on macOS). To check if the same happens when the child is another node process:
var spawn = require('child_process').spawn
var fs = require('fs')
var chunk = 'g'.repeat(65535)
if(process.argv[2] === 'child') {
console.log(chunk)
}
else {
var cp = spawn(process.execPath, [process.argv[1], 'child'])
var count = 0
cp.stdout.on('data', (d) => {
count++
})
cp.on('close', () => {
console.log(`${require('os').platform()}: ${count}`)
})
}
But again, the same pattern is observed - the data is chunked at the descriptor level.
Comparing with the good case: in the slow case, fd 1 is a socket. This is the same on macOS and Linux, but their underlying resource utilization characteristics seem to be different.
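A quick way to confirm what fd 1 actually is from inside the spawned child (a sketch using only the documented fs.fstatSync/Stats APIs; fdtype.js is a hypothetical name):
// fdtype.js - run as the spawned child; reports whether its stdout is a
// FIFO (what a shell pipeline gives you) or a socket (what libuv's socketpair gives you)
var fs = require('fs')
var st = fs.fstatSync(1) // fd 1 = stdout
var kind = st.isFIFO() ? 'pipe (FIFO)' : st.isSocket() ? 'socket' : 'something else'
console.error('stdout is a ' + kind) // written to stderr so the stdout stream stays clean
Spawned with stdio 'pipe' it should report a socket on both Linux and macOS; run under a shell pipeline it should report a FIFO.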
diff --git a/deps/uv/src/unix/process.c b/deps/uv/src/unix/process.c
index 3a3cfd6f09..93d83b48e1 100644
--- a/deps/uv/src/unix/process.c
+++ b/deps/uv/src/unix/process.c
@@ -200,7 +200,8 @@ static int uv__process_init_stdio(uv_stdio_container_t* container, int fds[2]) {
if (container->data.stream->type != UV_NAMED_PIPE)
return UV_EINVAL;
else
- return uv__make_socketpair(fds, 0);
+ return uv__make_pipe(fds, 0);
This patch makes the darwin case on par with linux. However, I did not run any other tests and haven't thought about any implications of doing so; I just wanted to complete the investigation and ratify my above findings.
I added these values in /etc/sysctl.conf, but it does not seem to help; probably the variables are different for UNIX domain sockets:
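For what it's worth, the tunables that look relevant for UNIX domain stream sockets on Darwin would be the net.local.stream ones; a sketch with assumed names and purely illustrative values, not the exact entries tried above:
# inspect the UNIX domain stream socket buffer sizes (assumed Darwin sysctl names)
sysctl net.local.stream.sendspace net.local.stream.recvspace
# bump them for an experiment (as root); values are illustrative only
sudo sysctl -w net.local.stream.sendspace=65536
sudo sysctl -w net.local.stream.recvspace=65536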
So the bottom line is that socket buffers are relatively smaller and slower than the pipes in darwin, at least in their default configurations.
/cc @nodejs/child_process
@gireeshpunathil yeah I figured the buffer size had a large part to play in the slow-down. The only problem is I can't see how to adjust it for spawn.
Let us wait to hear from the team. I will also try to understand the differences between FIFOs and sockets in their buffering aspects, and what OS tunables control them.
@gireeshpunathil I don't think you can pass ancillary data via pipes. Maybe that's the reason for using sockets instead of pipes.
@nodejs/libuv (for @gireeshpunathil's questions/observations above)
That's right. Sending file descriptors only works over UNIX domain sockets. I'm not aware of socket options that let you tweak the size of the send and receive buffers; I'm reasonably sure there aren't any.
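For illustration, this is the sort of thing that depends on it: the documented subprocess.send(message, sendHandle) API ships an actual descriptor to the child over the IPC channel, which only works because that channel is a socketpair. A minimal sketch (handle-pass.js is a hypothetical name):
// handle-pass.js - pass a listening socket to a forked child over the ipc channel
var child_process = require('child_process')
var net = require('net')

if (process.argv[2] === 'child') {
  process.on('message', function (msg, server) {
    if (msg !== 'server') return
    server.on('connection', function (sock) { sock.end('handled by child\n') })
  })
} else {
  var child = child_process.fork(__filename, ['child'])
  var server = net.createServer(function (sock) { sock.end('handled by parent\n') })
  server.listen(0, function () {
    child.send('server', server) // the listening descriptor itself crosses the process boundary
  })
}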
@bnoordhuis Does that mean this should likely be closed with something along the lines of "Buffer size is necessarily OS dependent, so this is just the way it is."? Or am I reading too much into what you're saying?
No, I think that sums it up. I'll close this out. Libuv could tickle the kernel in ways that might eke out a few percent more throughput but not in any way that's documented or stable. The next Apple patch could undo that optimization again.
FWIW, Node.js on my 2012 MacBook Air can push almost 200M/sec to `wc -c` with this:
let { WriteWrap } = process.binding('stream_wrap')
let { spawn } = require('child_process')
let proc = spawn('sh', ['-c', 'wc -c'], {stdio: ['ipc', 'inherit']}) // 'ipc' or 'pipe'
let data = Buffer.from('.'.repeat(1 << 20))
let i = 192, next = () => {
if (--i < 0) return proc.channel ? proc.disconnect() : proc.stdin.end()
let req = new WriteWrap()
req.async = false
req.handle = proc.channel || proc.stdin._handle
req.oncomplete = next
req.handle.writeBuffer(req, data)
}
next()
We should look into reducing that user CPU time but that isn't specific to MacOS.
So there's no way to tweak those buffer sizes then?
No, with qualifications. It's not that you can't set them, but they don't have any real effect. I initially figured it might help with blocking reads and writes (because UNIX domain sockets in XNU do have a concept of high and low water marks) but there's no measurable impact.
FWIW, I was able to apply this with the existing logic, with 64KB SNDBUF and RCVBUF on the socketpair. Reading is still in terms of 8KB only, but it looks like the kernel buffer has more data, so Node was able to read 8 times back-to-back before making the callback, and the performance is not lost. In the absence of clear documentation on OS tunables for UNIX domain sockets, I agree this change is not reasonable.
That's interesting. Does my benchmarklet from #3429 (comment) give different numbers on your system? I tried what you tried but it made no difference. Diff so we're talking about the same thing:
diff --git a/deps/uv/src/unix/process.c b/deps/uv/src/unix/process.c
index 3a3cfd6f09..767a8cd940 100644
--- a/deps/uv/src/unix/process.c
+++ b/deps/uv/src/unix/process.c
@@ -144,6 +144,20 @@ skip:
uv__nonblock(fds[1], 1);
}
+ {
+ int fd, err, opt;
+
+ for (fd = 0; fd < 2; ++fd) {
+ opt = 65535;
+ if ((err = setsockopt(fds[fd], SOL_SOCKET, SO_SNDBUF, &opt, sizeof(opt))))
+ return err;
+
+ opt = 65535;
+ if ((err = setsockopt(fds[fd], SOL_SOCKET, SO_RCVBUF, &opt, sizeof(opt))))
+ return err;
+ }
+ }
+
return 0;
}
Mine is this:
diff --git a/deps/uv/src/unix/process.c b/deps/uv/src/unix/process.c
index 3a3cfd6f09..724fc7b660 100644
--- a/deps/uv/src/unix/process.c
+++ b/deps/uv/src/unix/process.c
@@ -188,6 +188,8 @@ skip:
static int uv__process_init_stdio(uv_stdio_container_t* container, int fds[2]) {
int mask;
int fd;
+ int ret;
+ int sz = 64 * 1024;
mask = UV_IGNORE | UV_CREATE_PIPE | UV_INHERIT_FD | UV_INHERIT_STREAM;
@@ -199,8 +201,20 @@ static int uv__process_init_stdio(uv_stdio_container_t* container, int fds[2]) {
assert(container->data.stream != NULL);
if (container->data.stream->type != UV_NAMED_PIPE)
return UV_EINVAL;
- else
- return uv__make_socketpair(fds, 0);
+ else {
+ ret = uv__make_socketpair(fds, 0);
+ if(0 == ret) {
+ ret = setsockopt(fds[0], SOL_SOCKET, SO_RCVBUF, &sz, 4);
+ if(0 == ret)
+ ret = setsockopt(fds[0], SOL_SOCKET, SO_SNDBUF, &sz, 4);
+ if(0 == ret)
+ ret = setsockopt(fds[1], SOL_SOCKET, SO_RCVBUF, &sz, 4);
+ if(0 == ret) {
+ ret = setsockopt(fds[1], SOL_SOCKET, SO_SNDBUF, &sz, 4);
+ }
+ }
+ return ret;
+ }
Plus, I have this (really unsure whether it had any impact or not):
Will try your benchmark now.
$ time ./out/Release/node 3429.js
Pretty much the same, with or without the patch.
Right, so at least that matches up. Tweaking the socket buffers doesn't seem to buy us anything. I think we've tried enough that we can safely say there's no easy fix. :-)
The `child_process.spawn` stdout stream seems to be taking ~4x longer than other "similar" methods of I/O streaming with similarly sized data streams.
Running on Node.js v4.2.1, Mac OS X 10.11, MacBook Air 1.7GHz.
As a baseline, the following program, where `file` is a 472MB gzip file and `stdout` is set to `ignore`, takes 6.5 secs with the gzip process using ~100% CPU.
If I set the `stdout` option from `ignore` to `pipe`, it suddenly takes 27 secs, with gzip only using ~35% CPU and node using ~75%.
Fine, so that 4x overhead could just be standard pipe/stream overhead, but if I pipe in from stdin, there's really not much overhead at all and it finishes in 7.2 secs, with gzip using ~95% CPU and node ~30%.
Similarly, if I process the entire gzip file in node using zlib, I get relatively ok performance too, taking 9.8 secs.
So one thing I suspected was the stream buffer size – you can see when spawning it's only 8k, whereas it's 64k when piping from stdin, and I had better performance upping it even further (to 32MB) when reading from the file and gzipping in node.
However, I couldn't see any way to change this with spawn, so I can't really determine if this is the cause.
Is there something else I'm missing? A 4x slowdown seems to be far greater than expected.
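A sketch of what spawn-bench.js presumably looked like for the `pipe` runs, modelled on the spawn.js posted in the comments above, with `gzip -cd file.gz` in place of `cat` (the stdio setting is what was flipped between the `ignore` and `pipe` runs):
// spawn-bench.js (sketch) - time streaming a gunzipped file out of a spawned gzip process
var spawn = require('child_process').spawn

var bufferLength = 0, bytesRead = 0
var time = process.hrtime()
var gzip = spawn('gzip', ['-cd', 'file.gz'], {stdio: ['ignore', 'pipe', 'ignore']}) // or 'ignore' for the baseline

if (gzip.stdout) {
  gzip.stdout.on('data', function (data) {
    if (!bufferLength) bufferLength = data.length
    bytesRead += data.length
  })
}

gzip.on('close', function (code) {
  if (code !== 0) return console.error('gzip exited with: ' + code)
  var diff = process.hrtime(time)
  console.log('stream took', diff[0] + diff[1] / 1e9, 'seconds')
  console.log('bytes read:', bytesRead)
  console.log('buffer length:', bufferLength)
})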
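And a sketch of the in-node zlib variant mentioned above, with the buffers bumped to 32MB; the `highWaterMark`/`chunkSize` option names here are assumptions, since the original script isn't shown:
// gunzip file.gz entirely inside node with 32MB buffers (sketch)
var fs = require('fs')
var zlib = require('zlib')

var bufferLength = 0, bytesRead = 0
var time = process.hrtime()

fs.createReadStream('file.gz', { highWaterMark: 32 * 1024 * 1024 })
  .pipe(zlib.createGunzip({ chunkSize: 32 * 1024 * 1024 })) // chunkSize controls the gunzip output buffer
  .on('data', function (data) {
    if (!bufferLength) bufferLength = data.length
    bytesRead += data.length
  })
  .on('end', function () {
    var diff = process.hrtime(time)
    console.log('stream took', diff[0] + diff[1] / 1e9, 'seconds')
    console.log('bytes read:', bytesRead)
    console.log('buffer length:', bufferLength)
  })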