# Documenting most recent bottlenecks detected #127
Also should mention that the above is a braindump because I'm stepping away from this work after pushing on #126. I'll add that I did get as far as confirming that performance improves if the JS conversion bottleneck is removed by trying the approach in the branch linked below. My comparison is not quite fair since I only implemented the object passing and the actual access to the object properties is not yet implemented, so all the unit tests are broken. Here is the unfinished branch that starts passing the C++ features without copying into JS land: https://github.com/mapbox/carmen-cache/compare/pass-by-object-rather-than-json. The next step here, if someone were to take this on, would be adding methods to that wrapped object so its properties can be read from JS.
Update since some time has passed: I think this still applies, namely that the overhead of passing objects between JS and C++ is non-trivial.
Agreed. There are probably some intermediate improvements that can be made (for example: not allocating new strings for the keys of all the JS objects we return -- they all have an "id" key, so we should just make one "id" string rather than one per object returned), but ultimately I think the solution will be to move the API boundary so we have to pass less stuff back and forth.
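As an illustration of that key-reuse idea, here is a minimal sketch, not the actual carmen-cache code: it assumes the NAN-based bindings the project uses, and the `coversToArray` helper is a hypothetical stand-in for the real conversion functions.

```cpp
#include <nan.h>
#include <vector>
#include <cstdint>

// Build the per-feature JS objects while reusing a single v8::String handle
// for the shared "id" key, instead of allocating a new key string for every
// object returned.
v8::Local<v8::Array> coversToArray(const std::vector<uint64_t>& ids) {
    Nan::EscapableHandleScope scope;

    // Allocate the key once per call; it could also be cached in a
    // Nan::Persistent for the lifetime of the addon.
    v8::Local<v8::String> idKey = Nan::New("id").ToLocalChecked();

    v8::Local<v8::Array> out = Nan::New<v8::Array>(static_cast<int>(ids.size()));
    for (size_t i = 0; i < ids.size(); ++i) {
        v8::Local<v8::Object> obj = Nan::New<v8::Object>();
        // The same key handle is passed to every Set() call.
        Nan::Set(obj, idKey, Nan::New<v8::Number>(static_cast<double>(ids[i])));
        Nan::Set(out, static_cast<uint32_t>(i), obj);
    }
    return scope.Escape(out);
}
```

This only trims string allocations; every `Nan::Set` call still crosses into V8, which is why moving the API boundary is the larger win.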
@apendleton provided a `cache_bench` benchmark which exercises `cxxcache.RocksDBCache` and `cxxcache.coalesce`.

Using latest master at fec8605 (using node v8.11.3), here are the things I notice when running the benchmark locally on my OS X machine:
- the benchmark drives `cxxcache.coalesce` using `queue(1)` and therefore is not able to utilize the threadpool much
- the main event loop spends a large share of its time in `uv_work_done`
- part of the `uv_work_done` time is garbage collection of v8 objects (initiated by `node::NodePlatform::FlushForegroundTasksInternal()`)
- the rest of the `uv_work_done` time is due to the task of returning data from the threadpool to the main event loop, which is done in the `carmen::coalesceAfter` function
- `carmen::coalesceAfter` looks to be bottlenecked on `v8::Object::Set`, which is called inside `carmen::contextToArray` and `carmen::coverToObject` to convert the carmen data into JS objects
- the `worker` threads in the C++ threadpool (where `carmen::coalesceMulti` does its critical work) appear to be not very busy, which is surprising. Ideally the main event loop is mostly idle (the profiler mostly sees the CPU executing in `uv__io_poll`, which signifies waiting for work) and the threads are nearly 100% busy. But in this case the opposite is true: the main event loop is busier than the threads.

### Why are the threads so idle?
One hypothesis is that the threads are idle because of the `queue(1);` used in the bench.js script: so little work is sent to the threadpool at a time that each job gets fanned out to a different thread and returns quickly. This appears not to be the case. If I run the benchmark with a threadpool size of 1, the profiling output (and the script timing) is similar.

So, rather, it seems the problem is that the main event loop is doing too much work. Normally you want the hard work to be done in the threads to avoid blocking the main loop, but it appears that is not happening here (on balance). The fact that this balance is so off in the benchmark is concerning because in real-world usage a lot of other functions are going to need to execute on the main event loop, so carmen-cache taking up so much time is of great concern.
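For context on why this conversion work lands on the main event loop at all: in a NAN-style async worker (the pattern the coalesce binding follows), only `Execute()` runs on the libuv threadpool, and anything touching V8 handles has to happen back on the main loop. Here is a minimal sketch with hypothetical names (`CoalesceWorker`, `results_`), not the actual carmen-cache classes:

```cpp
#include <nan.h>
#include <vector>
#include <cstdint>

// Hypothetical worker showing the split between threadpool work and
// main-loop work in a Nan::AsyncWorker.
class CoalesceWorker : public Nan::AsyncWorker {
public:
    explicit CoalesceWorker(Nan::Callback* callback)
        : Nan::AsyncWorker(callback) {}

    // Runs on a libuv threadpool thread: pure C++ only, no V8 access allowed.
    void Execute() override {
        // ... do the expensive coalesce work and fill results_ ...
        results_ = {1, 2, 3};
    }

    // Runs back on the main event loop (the coalesceAfter-equivalent step):
    // every v8 object allocation and Set() call here blocks the loop.
    void HandleOKCallback() override {
        Nan::HandleScope scope;
        v8::Local<v8::Array> out =
            Nan::New<v8::Array>(static_cast<int>(results_.size()));
        for (size_t i = 0; i < results_.size(); ++i) {
            Nan::Set(out, static_cast<uint32_t>(i),
                     Nan::New<v8::Number>(static_cast<double>(results_[i])));
        }
        v8::Local<v8::Value> argv[] = { Nan::Null(), out };
        callback->Call(2, argv, async_resource);
    }

private:
    std::vector<uint64_t> results_;
};
```

The worker would be queued with `Nan::AsyncQueueWorker(new CoalesceWorker(callback))`; the point is that everything inside `HandleOKCallback()` competes with the rest of the application for main-loop time, no matter how large the threadpool is.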
Profiling in Activity Monitor shows the significant cost of the JS object conversion in `carmen::coalesceAfter`.

If I change the `queue(1);` to `queue();` in the benchmark (to allow more async usage of the threadpool), this seems to further confirm that we've got a bad bottleneck on the JS conversion in the main event loop. Running `UV_THREADPOOL_SIZE=16 node cache_bench/bench.js` finishes faster (in 14s) but at the cost of nearly entirely jamming up the main event loop with JS object conversion, with > 93% in `carmen::coalesceAfter`. In all my profiling of async C++ code I've never seen this much of the main event loop taken up by object conversion from C++ to JS (even in a micro benchmark like this). At the time this profile was taken, the threads were completely idle without any work to do, likely because the main event loop could not dispatch it, given that it was blocked on object conversion. This is bad.
### How likely is this to be impacting production?
I don't know. Surely not as severely as this micro benchmark, but profiling in production would show the level of severity of this bottleneck, so I think that is worth doing.
### Potential solutions
A solution to this problem would involve:

- 🍒 returning fewer features across the C++/JS boundary, and/or
- 🌸 making the conversion of each returned feature into JS cheaper (or avoiding it).
As far as 🍒 - we currently limit to 40 features to reduce this cost. Could we possibly limit further? To 20 features? Looks like the 40 cutoff was added in #65 perhaps?
As far as 🌸 goes, two ideas come to mind. Instead of converting C++ objects to JS JSON-like objects (which is clearly crazy expensive) we could either:

- pass wrapped C++ objects and call `context.getRelev()` instead of `context.relev` in JS-land (see the sketch below)
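A minimal sketch of what that accessor-method idea could look like, assuming NAN's `ObjectWrap`; the class name `JSContext` and the single `relev` field are hypothetical stand-ins for whatever the real carmen context carries:

```cpp
#include <nan.h>

// Hypothetical wrapper: instead of copying a C++ context into a plain JS
// object up front, hand JS a thin wrapper and convert individual fields
// lazily, only when a getter is actually called.
class JSContext : public Nan::ObjectWrap {
public:
    static void Init(v8::Local<v8::Object> exports) {
        v8::Local<v8::FunctionTemplate> tpl = Nan::New<v8::FunctionTemplate>(New);
        tpl->SetClassName(Nan::New("Context").ToLocalChecked());
        tpl->InstanceTemplate()->SetInternalFieldCount(1);
        Nan::SetPrototypeMethod(tpl, "getRelev", GetRelev);
        Nan::Set(exports, Nan::New("Context").ToLocalChecked(),
                 Nan::GetFunction(tpl).ToLocalChecked());
    }

private:
    explicit JSContext(double relev) : relev_(relev) {}

    static NAN_METHOD(New) {
        double relev = Nan::To<double>(info[0]).FromMaybe(0.0);
        JSContext* obj = new JSContext(relev);
        obj->Wrap(info.This());
        info.GetReturnValue().Set(info.This());
    }

    // Called as context.getRelev() from JS; only now does a value cross the
    // C++/JS boundary.
    static NAN_METHOD(GetRelev) {
        JSContext* self = Nan::ObjectWrap::Unwrap<JSContext>(info.Holder());
        info.GetReturnValue().Set(Nan::New<v8::Number>(self->relev_));
    }

    double relev_;
};
```

Whether this wins in practice depends on how many properties JS-land actually reads per feature; if carmen touches every property of every result anyway, the per-call overhead of crossing the boundary could eat the savings, which is what profiling the pass-by-object-rather-than-json branch mentioned earlier should answer.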