process.nextTick & node::MakeCallback are very bloated #10101

Closed
ghost opened this issue Dec 3, 2016 · 34 comments
Labels
c++ Issues and PRs that require attention from people who are familiar with C++. process Issues and PRs related to the process subsystem.

Comments

@ghost

ghost commented Dec 3, 2016

I'm doing some optimizations to an addon and I have found node::MakeCallback and the accompanying process.nextTick to be a major bottleneck for calling into JS-land. I have looked through the code all the way down to process._nextTickCallback, or whatever it was called, and it is very obvious that the whole idea of "register this function for the next tick" is implemented very inefficiently in Node.js.

I have found that if I call from C++ into JS using a pure V8 function call I get almost 2x the throughput, but this brings the problem of not correctly firing the functions enqueued with process.nextTick. So I'm doing a workaround where I only call MakeCallback on the last function call for that iteration (or at least that is somewhat close to what I'm trying to do), and this properly fires the enqueued functions.
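
A minimal sketch of that workaround (illustrative only, not the actual addon code; the function and parameter names are made up): a plain v8::Function::Call() for every event in the batch except the last one, then node::MakeCallback() for the final call so the nextTick queue still gets drained once per iteration.

#include <node.h>
#include <v8.h>

// Illustrative only: dispatch a batch of calls into JS, draining
// process.nextTick() once at the end via node::MakeCallback() instead
// of after every single call.
void DispatchBatch(v8::Isolate* isolate,
                   v8::Local<v8::Object> recv,
                   v8::Local<v8::Function> cb,
                   int argc,
                   v8::Local<v8::Value>* argv,
                   size_t events_in_batch) {
  v8::HandleScope scope(isolate);
  for (size_t i = 0; i < events_in_batch; i++) {
    if (i + 1 < events_in_batch) {
      cb->Call(recv, argc, argv);                          // plain V8 call
    } else {
      node::MakeCallback(isolate, recv, cb, argc, argv);   // also runs nextTick
    }
  }
}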

It would be nice to have better control of process.nextTick and when to fire it from native land, since the current code is a "solve everything with dynamic checks for every possible scenario" kind of solution. The best idea I have come up with is to MakeCallback into a noop function just to get the process.nextTick functions to trigger when I'm done with my iteration.

If addons could include env.h they could use TickInfo to check whether there are any enqueued functions and then call env->get_tick_callback()->Call(yada, yada... to take care of that more efficiently. Or you could just add some new functions to node.h so that addons can check the current number of functions to call, plus a function to actually call them.

Or, you could just remove the entire thing and only call the enqueued functions from uv_check_t so that you only do it once per iteration and not once per every single function call.
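
A rough sketch combining the two ideas above (the no-op MakeCallback and the uv_check_t), purely illustrative and not a proposed patch; the persistent handles are assumed to be set up elsewhere in the addon:

#include <node.h>
#include <uv.h>
#include <v8.h>

static uv_check_t drain_check;
static v8::Persistent<v8::Context> context_p;   // initialized elsewhere
static v8::Persistent<v8::Object> ctx_p;        // initialized elsewhere
static v8::Persistent<v8::Function> noop_fn_p;  // a JS no-op, initialized elsewhere

// Runs once per event loop iteration, after the I/O callbacks.
static void DrainTicks(uv_check_t* handle) {
  v8::Isolate* isolate = static_cast<v8::Isolate*>(handle->data);
  v8::HandleScope scope(isolate);
  v8::Local<v8::Context> context =
      v8::Local<v8::Context>::New(isolate, context_p);
  v8::Context::Scope context_scope(context);
  v8::Local<v8::Object> ctx = v8::Local<v8::Object>::New(isolate, ctx_p);
  v8::Local<v8::Function> noop =
      v8::Local<v8::Function>::New(isolate, noop_fn_p);
  // The only purpose of this call is to run microtasks and the nextTick queue.
  node::MakeCallback(isolate, ctx, noop, 0, nullptr);
}

void StartDraining(v8::Isolate* isolate) {
  drain_check.data = isolate;
  uv_check_init(uv_default_loop(), &drain_check);
  uv_check_start(&drain_check, DrainTicks);
}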

@Fishrock123
Contributor

cc @trevnorris

@Fishrock123 Fishrock123 added process Issues and PRs related to the process subsystem. c++ Issues and PRs that require attention from people who are familiar with C++. labels Dec 3, 2016
@trevnorris
Contributor

Would you mind sharing node version, platform and an example implementation?

@ghost
Author

ghost commented Dec 6, 2016

This is Node.js 5 (which I have found is faster than 7 and 6) on Linux. I don't have any minimal test right now but you should be able to just write a test that calls into JavaScript and returns many times and compare it with the native V8 function calls.

@sam-github
Contributor

I only call MakeCallback on the last function call for that iteration

What is an iteration? MakeCallback() should only be used when you want to run user JS code, and when you do, you need to clear the ticks afterwards - it's defined behaviour, and not doing it breaks the API expectations. Efficiently doing the wrong thing isn't helpful. Perhaps you can provide example code, or a more concrete example of when you think you can skip the extra work in MakeCallback()? And why, in those examples, you can't simply do what you suggest and use ->Call() or MakeCallback() as appropriate?

@ghost
Author

ghost commented Dec 7, 2016

The problem is that MakeCallback is slow, very slow, while V8 alone is much faster. So I cannot use MakeCallback.

One iteration is one epoll_wait return and handling of its ready fds. By avoiding MakeCallback on each call into JS I get much better performance, so the conclusion is that Node.js adds tons of overhead to V8 in this case, which needs to be improved to allow better performance of addons and of Node.js itself.

@ghost
Author

ghost commented Dec 7, 2016

The current behavior is not even as described in the docs: https://nodejs.org/api/process.html#process_process_nexttick_callback_args

The behavior I implement is more in line with the docs.

@sam-github
Contributor

@alexhultman whether the docs for next tick are sufficiently detailed is not relevant.

One iteration is one epoll_wait return and handling of its ready fds.

That is not my understanding. An iteration is actually the processing of a single struct epoll_event (and yes, this is not documented; the challenge is describing it without referring to epoll, which isn't used on non-Linux platforms). If it weren't like that, there would be observable-in-JS differences in when next tick callbacks are called, depending on the size of the epoll_wait output array and the timing of the events.

Intended behaviour is

  • event 1
  • next ticks
  • event 2
  • next ticks

Whether epoll_wait returned event 1 and event 2 in a single call or in two consecutive calls. Your suggestion would result in the next ticks all being bunched at the end, after the two events were both handled.

Or so is my understanding, maybe @bnoordhuis or @saghul will contradict me.

@sam-github
Contributor

You may find the conversation in nodejs/nan#284 interesting.

@bnoordhuis
Member

process.nextTick() callbacks bear almost no relation to libuv I/O. They fire whenever node makes a call into the VM but that doesn't need to be due to I/O. Example: timers.

@alexhultman Try batching your calls. That's a good idea whether you use node::MakeCallback() or v8::Function::Call() - the latter isn't exactly cheap either, compared to a regular function call.
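
A minimal sketch of the batching suggestion, under the assumption that the JS callback is willing to receive an array of payloads (plain strings here) rather than one payload per call; one MakeCallback then covers the whole batch:

#include <node.h>
#include <v8.h>
#include <string>
#include <vector>

// Illustrative only: hand a whole batch of events to JS in a single call.
void DispatchBatched(v8::Isolate* isolate,
                     v8::Local<v8::Object> recv,
                     v8::Local<v8::Function> cb,
                     const std::vector<std::string>& payloads) {
  v8::HandleScope scope(isolate);
  v8::Local<v8::Array> batch =
      v8::Array::New(isolate, static_cast<int>(payloads.size()));
  for (uint32_t i = 0; i < payloads.size(); i++)
    batch->Set(i, v8::String::NewFromUtf8(isolate, payloads[i].c_str()));
  v8::Local<v8::Value> arg = batch;
  // One call (and one nextTick drain) per batch instead of one per event.
  node::MakeCallback(isolate, recv, cb, 1, &arg);
}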

remove the entire thing and only call the enqueued functions from uv_check_t so that you only do it once per iteration and not once per every single function call

Too risky. I expect the change in timing would cause considerable ecosystem fallout.

@ghost
Author

ghost commented Dec 8, 2016

I get the difference and I have read the linked issue (and many others). I know that whenever there is a call into JS the tick queue is emptied; I have seen this in testing. But the thing is, this is slow, and there is nothing in the documentation explicitly stating what counts as an iteration. If user code heavily depends on the exact order of process.nextTick then that code is wrong; you shouldn't make such absurd assumptions about order, and code like that goes against the whole "async" programming idea where order is irrelevant.

I have thought about batching up the calls but that would disrupt my zero-copy data flow (and I don't like having different behavior in the addon compared to the core library). But it is not just about me and my use case - the idea is to speed up all calls into JS-land by relaxing the rules of process.nextTick. I know Node.js itself is sprinkled with tons of MakeCallback and those could be replaced with pure V8 calls and significantly speed up lots of things. There shouldn't be such a major tax on every call into JS, especially when V8 itself adds even more tax.

Also, I already shipped this change and users haven't noticed anything. Not that I have billions of users, but the last time there was something wrong with process.nextTick I got people reporting that issue (it was way back in April when I still didn't use MakeCallback at all, only pure V8, which of course breaks process.nextTick completely).

@bnoordhuis
Member

I know Node.js itself is sprinkled with tons of MakeCallback and those could be replaced with pure V8 calls and significantly speed up lots of things.

That's not possible because of domains (i.e., the functionality from the domain module.) The right domain needs to have been entered when calling into JS land.

Little used feature of questionable practicality, you say? I don't disagree but that doesn't mean it can change.

@ghost
Author

ghost commented Dec 8, 2016

I haven't looked into domains so I cannot make any kind of enlightened statement regarding them. However, my view is still that this whole process.nextTick machinery seems very inefficient and bloated for what it is trying to achieve. The way you splice the process.nextTick vector from the beginning for every single function is something I have never seen anywhere else (just to name one thing).

Node.js is of course built on some performance-limiting decisions (JavaScript/V8) that are also part of why it is popular. But when Node.js adds even more overhead to something already very performance-limited [V8], that seems like something worth taking a look at. If one decides to have V8 as the center of the product, you need to take extra care to really get 100% of the performance and not 40%.

I mean, when I do things I often write a prototype first to see what kind of performance I can expect, then I add abstractions to that code and track my performance so that I do not deviate too far from my expectations. With Node.js I feel there is really no benchmarking done for things like these; I mean, the difference between using pure V8 and using Node.js's addon features is massive.

I actually recently (maybe one or two months ago) also found that V8's ArrayBuffer was WAY faster than Node.js's Buffer type. It's the same thing happening again and again: V8 is a lot slower than native, which is acceptable given that it is a scripting VM, but then Node.js (in C++) adds tons of overhead on top of that (MakeCallback, Buffer, NaN).

The point is, I think Node.js needs to take more care when wrapping V8, because it seems you don't benchmark the features you expose to addons.

Swapping from Buffer to ArrayBuffer gave me a 100% throughput gain; swapping from MakeCallback to a V8 Call gave me a 24% throughput gain. These numbers are very high for things that are meant to achieve the exact same thing, or at least very similar things.

@bnoordhuis
Member

It all makes perfect sense when you know the history. Node started in bazaar fashion: much enthusiasm, many contributors, little cohesion. Then node got big and it became increasingly difficult to revisit previous mistakes. That was already becoming more and more true five years ago, let alone now.

I'm well aware, probably more so than anyone else, that it's not the epitome of efficiency in places. At the end of the day though, most users don't really care about performance, even if they say they do; what they care about is that their stuff keeps working. Node is at a point in its life where backwards compatibility is more important than performance at all costs.

I'm fairly confident I could write a node knockoff that's at least 5x faster but it would be different enough that no one is going to bother because their code wouldn't work out of the box.

@ghost
Author

ghost commented Dec 8, 2016

At the end of the day though, most users don't really care about performance, even if they say they do; what they care about is that their stuff keeps working.

100% truth right there. "Scalability" among Nodesters™ seems to be more about the ability to throw more hardware at the problem than a measure of software efficiency.

I am still a bit unhappy with MakeCallback since that is the only thing I cannot get away from.

@ghost
Author

ghost commented Dec 9, 2016

Do you think MakeCallback can be improved once domains get fully removed? Then you should at least be able to replace everything with a simple if (hasPendingProcessNextTickFunction) {} else {pure V8 call}?

@sam-github
Contributor

and async hooks and the V8 microtask queue

@ghost
Author

ghost commented Dec 9, 2016

So why aren't you nuking the current process.nextTick mountain and just using the microtask features:

void v8::Isolate::EnqueueMicrotask(Handle<Function> microtask)
void v8::Isolate::RunMicrotasks()

It seems V8 can even automatically run microtasks when you return from all Calls. I bet you can improve performance majorly by ditching all that JS code doing all those splices and whatnot for just using the V8 native counterparts.

From looking at the V8 docs regarding this it sounds like you can still get away with only one single Call per event, and then just run the tasks at uv_check_t. To me it just looks extremely inefficient as it is right now.
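
For reference, this is roughly what the raw V8 microtask API looks like from C++ (a sketch of the calls being referred to, not how node core actually schedules nextTick):

#include <v8.h>

void QueueAndRun(v8::Isolate* isolate, v8::Local<v8::Function> microtask) {
  // Queue a JS function on V8's internal microtask queue...
  isolate->EnqueueMicrotask(microtask);
  // ...and later, e.g. once per loop iteration, run everything queued so far.
  isolate->RunMicrotasks();
}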

@trevnorris
Contributor

@alexhultman before you continue going on about how bloated things are, let's take a concrete example. Quick benchmark script:

C++ code for benchmark
#include <v8.h>
#include <uv.h>
#include <node.h>

namespace bmod {

using namespace v8;

uv_idle_t my_idle;
size_t iter_;
Persistent<Context> context_p;
Persistent<Function> fn_p;
Persistent<Object> obj_p;

template <class TypeName>
v8::Local<TypeName> StrongPersistentToLocal(
    const v8::Persistent<TypeName>& persistent) {
  return *reinterpret_cast<v8::Local<TypeName>*>(
      const_cast<v8::Persistent<TypeName>*>(&persistent));
}


void MakeCallback(const FunctionCallbackInfo<Value>& args) {
  Isolate* isolate = args.GetIsolate();
  Local<Object> ctx = args[0].As<Object>();
  Local<Function> fn = args[1].As<Function>();
  const size_t iter = args[2]->IntegerValue();

  for (size_t i = 0; i < iter; i++) {
    HandleScope scope(isolate);
    node::MakeCallback(isolate, ctx, fn, 0, nullptr);
  }
}

void FnCall(const FunctionCallbackInfo<Value>& args) {
  Isolate* isolate = args.GetIsolate();
  Local<Context> context = isolate->GetCurrentContext();
  Local<Object> ctx = args[0].As<Object>();
  Local<Function> fn = args[1].As<Function>();
  const size_t iter = args[2]->IntegerValue();

  for (size_t i = 0; i < iter; i++) {
    HandleScope scope(isolate);
    fn->Call(context, ctx, 0, nullptr).ToLocalChecked();
  }
}

void run_makecallback(uv_idle_t* handle) {
  uv_idle_stop(handle);

  Isolate* isolate = static_cast<Isolate*>(handle->data);

  HandleScope scope(isolate);
  Local<Context> context = StrongPersistentToLocal(context_p);
  Context::Scope context_scope(context);
  Local<Object> ctx = StrongPersistentToLocal(obj_p);
  Local<Function> fn = StrongPersistentToLocal(fn_p);

  for (size_t i = 0; i < iter_; i++) {
    HandleScope scope(isolate);
    node::MakeCallback(isolate, ctx, fn, 0, nullptr);
    //fn->Call(context, ctx, 0, nullptr).ToLocalChecked();
  }

  uv_close(reinterpret_cast<uv_handle_t*>(handle), nullptr);
}

void FromIdle(const FunctionCallbackInfo<Value>& args) {
  Isolate* isolate = args.GetIsolate();
  context_p.Reset(isolate, isolate->GetCurrentContext());
  obj_p.Reset(isolate, args[0].As<Object>());
  fn_p.Reset(isolate, args[1].As<Function>());
  iter_ = args[2]->IntegerValue();

  my_idle.data = isolate;
  uv_idle_start(&my_idle, run_makecallback);
}

void Init(Local<Object> exports) {
  NODE_SET_METHOD(exports, "makeCallback", MakeCallback);
  NODE_SET_METHOD(exports, "fnCall", FnCall);
  NODE_SET_METHOD(exports, "fromIdle", FromIdle);

  uv_idle_init(uv_default_loop(), &my_idle);
}

}  // namespace bmod
NODE_MODULE(addon, bmod::Init)

and the JS file to run it:

JS file for benchmark
'use strict';

// uncomment this to see difference w/ domains
//require('domain');
const addon = require('./build/Release/addon');
const print = process._rawDebug;
const ITER = 1e6;
var cntr = 0;
var t = process.hrtime();

function noop() {
  if (++cntr < ITER) return;
  t = process.hrtime(t);
  print(((t[0] * 1e9 + t[1]) / ITER).toFixed(1) + ' ns/op');
}

// This probably more appropriately measures what you are observing.
addon.fromIdle({}, noop, ITER);
return;

setImmediate(() => {
  var t = process.hrtime();

  //addon.makeCallback({}, noop, ITER);
  //addon.fnCall({}, noop, ITER);

  t = process.hrtime(t);
  print(((t[0] * 1e9 + t[1]) / ITER).toFixed(1) + ' ns/op');
});

Results:

fnCall():        72 ns/op
makeCallback(): 196 ns/op
fromIdle():     330 ns/op

makeCallback() w/ domain: 270 ns/op
fromIdle() w/ domain:     410 ns/op

First notice that we have already done quite a bit to improve performance in regards to domains. Second, the time difference between makeCallback() and fromIdle() is partially from an optimization where node::MakeCallback() automatically detects if async calls are recursive, and if so it returns early instead of attempting to process the MicrotaskQueue and nextTickQueue.

Now, in regard to the difference between fromIdle() and makeCallback(), I found an issue while investigating this. The fix is:

diff --git a/src/node.cc b/src/node.cc
index ce39cb4..24ea3bb 100644
--- a/src/node.cc
+++ b/src/node.cc
@@ -1316,6 +1316,7 @@ Local<Value> MakeCallback(Environment* env,
 
   if (tick_info->length() == 0) {
     tick_info->set_index(0);
+    return ret;
   }
 
   if (env->tick_callback_function()->Call(process, 0, nullptr).IsEmpty()) {

The short of it is that if there is nothing in the nextTickQueue then node::MakeCallback() can return early. This should have been happening all along. With this change the execution time for fromIdle() drops to ~190 ns/op, the same as makeCallback().

More of that time can be removed by the following change (which will be coming in a future PR):

diff --git a/src/node.cc b/src/node.cc
index ce39cb4..e4a987e 100644
--- a/src/node.cc
+++ b/src/node.cc
@@ -1227,13 +1227,8 @@ Local<Value> MakeCallback(Environment* env,
 
   Environment::AsyncCallbackScope callback_scope(env);
 
-  // TODO(trevnorris): Adding "_asyncQueue" to the "this" in the init callback
-  // is a horrible way to detect usage. Rethink how detection should happen.
   if (recv->IsObject()) {
     object = recv.As<Object>();
-    Local<Value> async_queue_v = object->Get(env->async_queue_string());
-    if (async_queue_v->IsObject())
-      ran_init_callback = true;
   }
 
   if (env->using_domains()) {

Which brings execution time down to ~140 ns/op.

As for the remaining ~70 ns/op difference, I'll investigate that more after the other two listed issues are fixed. But more to the point, a lot has gone into making node::MakeCallback() fast. Simply spouting that it's "bloated" without anything technical to back up your claim only makes it tedious for those of us who are actually investigating.

So why aren't you nuking the current process.nextTick mountain and just using the microtask features:

Because it doesn't support wrapping calls in try/finally so we can make sure to correctly execute 'uncaughtException'. Otherwise we'd have to wrap every call that's passed into the MicrotaskQueue. Also because there's no JS API for it, so every call would have to traverse down to C++ first, which would be slower. Also because it doesn't support the current API of passing arguments to the call, requiring us to either .bind() or wrap the function.

From looking at the V8 docs regarding this it sounds like you can still get away with only one single Call per event, and then just run the tasks at uv_check_t.

If that were the case then we'd ditch nextTick() for setImmediate() entirely. The reason for nextTick()'s existence is to run callbacks immediately after the current stack has drained, without waiting for the event loop to continue.

@trevnorris
Contributor

I'm going to leave this open until at least the first issue described above is addressed. The latter, about removing the _asyncQueue check, will take a little longer and is part of another set of changes.

@ghost
Author

ghost commented Dec 12, 2016

I don't get your point; didn't your benchmark just validate my report? That is, MakeCallback is extremely slow compared to pure V8 function calls. Even with those future fixes you mentioned, you are still looking at a 2x overhead per function call even though the process.nextTick queue is empty (which is the case most of the time). I already bypass most of Node.js with my addon, but I'm still reliant on MakeCallback which, like I reported, plays a majorly negative role in my overall performance.

Also I wouldn't agree that calling into C++ is slow, that is done for most things in Node.js already. Buffer.toString() is a C++ call and nobody is reporting issues about that being slow. Getting a small overhead when calling process.nextTick is far better than getting a constant overhead for every event trigger. Something optimized would be maybe 10% slower than V8 - not 100%.

This recursive check is an example of how it is bloated - why would I need a check like that if I already know that my call is not going to be recursive? Those are the things I mean when I say it is bloated and those are the things that add up to the 2x overhead.

@trevnorris
Contributor

@alexhultman

MakeCallback is extremely slow compared to pure V8 function calls. Even with those future fixes you mentioned, you are still looking at a 2x overhead per function call even though the process.nextTick queue is empty [...]

Let's clarify "extremely slow". In the land of JS 70 ns is not "extremely" slow. Let's be reasonable about the use of our adjectives here.

Also I wouldn't agree that calling into C++ is slow, that is done for most things in Node.js already. Buffer.toString() is a C++ call and nobody is reporting issues about that being slow.

You're conflating calling into C++ and from C++. The performance overhead is not the same. Calling into C++ is 10-15 ns. Calling out of C++ is ~70 ns.
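
To make the distinction concrete, a tiny sketch (not from node core): intoCpp is the cheap direction, a JS call landing in a native binding, while outOfCpp is the expensive direction, native code calling back into a JS function.

#include <node.h>
#include <v8.h>

// JS -> C++: roughly the 10-15 ns direction.
void IntoCpp(const v8::FunctionCallbackInfo<v8::Value>& args) {
  args.GetReturnValue().Set(42);
}

// C++ -> JS: each fn->Call() is roughly the ~70 ns direction.
void OutOfCpp(const v8::FunctionCallbackInfo<v8::Value>& args) {
  v8::Local<v8::Function> fn = args[0].As<v8::Function>();
  fn->Call(args.Holder(), 0, nullptr);
}

void Init(v8::Local<v8::Object> exports) {
  NODE_SET_METHOD(exports, "intoCpp", IntoCpp);
  NODE_SET_METHOD(exports, "outOfCpp", OutOfCpp);
}

NODE_MODULE(addon, Init)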

This recursive check is an example of how it is bloated [...] and those are the things that add up to the 2x overhead.

Please, I beg of you to stop arguing things you haven't tested or validated. It's honestly a waste of developer's time. But to appease your insistence on something you're so sure of, let's simplify this.

Diff to completely trim node::MakeCallback()
diff --git a/src/node.cc b/src/node.cc
index ce39cb4..c55db1e 100644
--- a/src/node.cc
+++ b/src/node.cc
@@ -1219,72 +1219,10 @@ Local<Value> MakeCallback(Environment* env,
   // If you hit this assertion, you forgot to enter the v8::Context first.
   CHECK_EQ(env->context(), env->isolate()->GetCurrentContext());
 
-  Local<Function> pre_fn = env->async_hooks_pre_function();
-  Local<Function> post_fn = env->async_hooks_post_function();
-  Local<Object> object, domain;
-  bool ran_init_callback = false;
-  bool has_domain = false;
-
   Environment::AsyncCallbackScope callback_scope(env);
 
-  // TODO(trevnorris): Adding "_asyncQueue" to the "this" in the init callback
-  // is a horrible way to detect usage. Rethink how detection should happen.
-  if (recv->IsObject()) {
-    object = recv.As<Object>();
-    Local<Value> async_queue_v = object->Get(env->async_queue_string());
-    if (async_queue_v->IsObject())
-      ran_init_callback = true;
-  }
-
-  if (env->using_domains()) {
-    CHECK(recv->IsObject());
-    Local<Value> domain_v = object->Get(env->domain_string());
-    has_domain = domain_v->IsObject();
-    if (has_domain) {
-      domain = domain_v.As<Object>();
-      if (domain->Get(env->disposed_string())->IsTrue())
-        return Undefined(env->isolate());
-    }
-  }
-
-  if (has_domain) {
-    Local<Value> enter_v = domain->Get(env->enter_string());
-    if (enter_v->IsFunction()) {
-      if (enter_v.As<Function>()->Call(domain, 0, nullptr).IsEmpty()) {
-        FatalError("node::MakeCallback",
-                   "domain enter callback threw, please report this");
-      }
-    }
-  }
-
-  if (ran_init_callback && !pre_fn.IsEmpty()) {
-    TryCatch try_catch(env->isolate());
-    MaybeLocal<Value> ar = pre_fn->Call(env->context(), object, 0, nullptr);
-    if (ar.IsEmpty()) {
-      ClearFatalExceptionHandlers(env);
-      FatalException(env->isolate(), try_catch);
-      return Local<Value>();
-    }
-  }
-
   Local<Value> ret = callback->Call(recv, argc, argv);
 
-  if (ran_init_callback && !post_fn.IsEmpty()) {
-    Local<Value> did_throw = Boolean::New(env->isolate(), ret.IsEmpty());
-    // Currently there's no way to retrieve an uid from node::MakeCallback().
-    // This needs to be fixed.
-    Local<Value> vals[] =
-        { Undefined(env->isolate()).As<Value>(), did_throw };
-    TryCatch try_catch(env->isolate());
-    MaybeLocal<Value> ar =
-        post_fn->Call(env->context(), object, arraysize(vals), vals);
-    if (ar.IsEmpty()) {
-      ClearFatalExceptionHandlers(env);
-      FatalException(env->isolate(), try_catch);
-      return Local<Value>();
-    }
-  }
-
   if (ret.IsEmpty()) {
     // NOTE: For backwards compatibility with public API we return Undefined()
     // if the top level call threw.
@@ -1292,36 +1230,6 @@ Local<Value> MakeCallback(Environment* env,
         ret : Undefined(env->isolate()).As<Value>();
   }
 
-  if (has_domain) {
-    Local<Value> exit_v = domain->Get(env->exit_string());
-    if (exit_v->IsFunction()) {
-      if (exit_v.As<Function>()->Call(domain, 0, nullptr).IsEmpty()) {
-        FatalError("node::MakeCallback",
-                   "domain exit callback threw, please report this");
-      }
-    }
-  }
-
-  if (callback_scope.in_makecallback()) {
-    return ret;
-  }
-
-  Environment::TickInfo* tick_info = env->tick_info();
-
-  if (tick_info->length() == 0) {
-    env->isolate()->RunMicrotasks();
-  }
-
-  Local<Object> process = env->process_object();
-
-  if (tick_info->length() == 0) {
-    tick_info->set_index(0);
-  }
-
-  if (env->tick_callback_function()->Call(process, 0, nullptr).IsEmpty()) {
-    return Undefined(env->isolate());
-  }
-
   return ret;
 }
 
Native add-on benchmark script for diff
'use strict';
const addon = require('./build/Release/addon');
const print = process._rawDebug;
const ITER = 1e6;
var cntr = 0;
var t = process.hrtime();

function noop() {
  if (++cntr < ITER) return;
  t = process.hrtime(t);
  print(((t[0] * 1e9 + t[1]) / ITER).toFixed(1) + ' ns/op');
}

addon.fromIdle({}, noop, ITER);

Here I have eliminated absolutely everything you are complaining about and the execution speed is still ~140 ns/op. Please do some investigation for yourself and figure out where the overhead is actually coming from before complaining about core being "bloated" and "extremely slow" again.

@trevnorris
Contributor

To completely put this to rest, here is the fully trimmed node::MakeCallback():

Local<Value> MakeCallback(Environment* env,
                          Local<Value> recv,
                          const Local<Function> callback,
                          int argc,
                          Local<Value> argv[]) {
  return callback->Call(recv, argc, argv);
}

And here's the test script:

'use strict';
const addon = require('./build/Release/addon');
const print = process._rawDebug;
const ITER = 1e6;
var cntr = 0;
var t = process.hrtime();

addon.makeCallback({}, () => {}, ITER);

t = process.hrtime(t);
print(((t[0] * 1e9 + t[1]) / ITER).toFixed(1) + ' ns/op');

Run time is ~115 ns/op. We can drop this to ~105 ns/op if we change the native add-on code to:

void MakeCallback(const FunctionCallbackInfo<Value>& args) {
  Isolate* isolate = args.GetIsolate();
  Local<Object> ctx = args[0].As<Object>();
  Local<Function> fn = args[1].As<Function>();
  const size_t iter = args[2]->IntegerValue();

  for (size_t i = 0; i < iter; i++) {
    node::MakeCallback(isolate, ctx, fn, 0, nullptr);
  }
}

but you can't do this directly when returning from an async call or else the SealHandleScope will cause the following failure:

FATAL ERROR: v8::HandleScope::CreateHandle() Cannot create a handle without a HandleScope
 1: node::Abort() [/var/projects/node_v6/node]
 2: 0xf3d143 [/var/projects/node_v6/node]
 3: v8::Utils::ReportApiFailure(char const*, char const*) [/var/projects/node_v6/node]
 4: v8::internal::HandleScope::Extend(v8::internal::Isolate*) [/var/projects/node_v6/node]
 5: v8::EscapableHandleScope::EscapableHandleScope(v8::Isolate*) [/var/projects/node_v6/node]
 6: node::MakeCallback(v8::Isolate*, v8::Local<v8::Object>, v8::Local<v8::Function>, int, v8::Local<v8::Value>*) [/var/projects/node_v6/node]
 7: bmod::run_makecallback(uv_idle_s*) [/var/projects/tmp/node-basic-module/build/Release/addon.node]

So even if you're only calling Function::Call() you still require a v8::HandleScope.
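
Concretely, the fromIdle() path needs something like the following (reusing the persistent handles and the StrongPersistentToLocal() helper from the benchmark addon above); the per-call HandleScope is exactly what the ~105 ns/op variant drops and what the SealHandleScope abort is complaining about:

void run_makecallback_safe(uv_idle_t* handle) {
  Isolate* isolate = static_cast<Isolate*>(handle->data);

  HandleScope scope(isolate);  // required: no JS frame above us provides one
  Local<Context> context = StrongPersistentToLocal(context_p);
  Context::Scope context_scope(context);
  Local<Object> ctx = StrongPersistentToLocal(obj_p);
  Local<Function> fn = StrongPersistentToLocal(fn_p);

  for (size_t i = 0; i < iter_; i++) {
    HandleScope scope(isolate);  // one scope per call keeps handle growth bounded
    node::MakeCallback(isolate, ctx, fn, 0, nullptr);
  }
}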

@MylesBorins
Contributor

@alexhultman I have modified your above comment to remove a single word which was less than ideal to use. As someone who has just poked their head into the thread, I urge you to remember that we abide by a code of conduct in this repo.

@benjamingr
Member

@alexhultman so far @trevnorris has responded with a lot of technical detail on why MakeCallback takes the time it does and has shown how to trim the current number to get to 140 ns; he has been very polite and to-the-point about everything, and a thank-you would have been nice for that.

As for the async queue, I suspect using a circular buffer structure with zero (well, O(1) amortized) allocations could improve the speed of draining - this is what bluebird does. Currently using V8's microtask queue is very problematic for the reason Trevor mentioned.

You want to reduce MakeCallback further? It would be really useful if you provided the following.

  • What you're doing and why that extra 70 ns is prohibitive for you.
  • A demonstration of how avoiding MakeCallback and using the V8 call directly solves that.

That would go a long way, and please - enter this discussion in an honest attempt to have a dialogue with the people trying to collaborate with you. No one is shoving you into boxes or not taking your feedback seriously.

@trevnorris
Contributor

@alexhultman First, it'd be a lot easier to communicate w/ you, and to care about what you're saying, if the entire issue were just posed as "I'm seeing an X ns/op discrepancy between using node::MakeCallback() and using v8::Function::Call() directly. What can we do to improve that?" The excessive use of inflammatory adjectives stating how awful a job we've done really does nothing to help us want to help you.

Second, you haven't put any effort into actually diagnosing the problem you're reporting. Instead of using demeaning remarks it would be helpful to provide something like a perf trace, patches to core that you've tried in order to speed things up, or even a test case. Right now the attitude here only exhibits a desire to vent frustration, not to improve code.

Third, yes I found one optimization for one use-case of node::MakeCallback(), but you haven't made any comment on the fact that after reducing all of node::MakeCallback() to return fn->Call(...) it still has 30-40ns overhead. I'd enjoy discussing where this additional overhead comes from, but despite being "pretty much the only Node.js developer who actually tries to improve the overall project" you haven't shown any inclination to make an effort at solving the problem.

Fourth, you're ignoring the fact that the shortcuts you're taking make your code not "node". Meaning, it doesn't obey the domain module, check for the proper v8::Context for appropriate vm support, etc. Instead you're basically using node as a build system for your own embedded V8 code. Will it be slower? Yes, of course. If you'd like to have an argument about what operations we should/shouldn't be doing, that's fine. But every operation comes with CPU time, and as a project there are general guards we need to have in place.

I've attempted to be helpful. I have three posts with an ample amount of code changes and benchmarks, with the hope that they would spur productive discussion as to what else we can do. Below is the disassembled function call that I've walked through, with the additional steps taken to execute the function in a minimal case highlighted:

Disassembly of user facing `node::MakeCallback()`
->  0xf3ba00 <+0>:   pushq  %rbp
    0xf3ba01 <+1>:   movq   %rsp, %rbp
    0xf3ba04 <+4>:   pushq  %r15
    0xf3ba06 <+6>:   pushq  %r14
    0xf3ba08 <+8>:   pushq  %r13
    0xf3ba0a <+10>:  pushq  %r12
    0xf3ba0c <+12>:  pushq  %rbx
    0xf3ba0d <+13>:  subq   $0x38, %rsp
    0xf3ba11 <+17>:  movq   %r8, -0x50(%rbp)
    0xf3ba15 <+21>:  movl   %ecx, -0x54(%rbp)
    0xf3ba18 <+24>:  movq   %rdx, -0x60(%rbp)
    0xf3ba1c <+28>:  movq   %rsi, %rbx
    0xf3ba1f <+31>:  movq   %rdi, %r14
    0xf3ba22 <+34>:  leaq   -0x48(%rbp), %rdi
    0xf3ba26 <+38>:  movq   %r14, %rsi
    0xf3ba29 <+41>:  callq  0x8ce770                  ; v8::EscapableHandleScope::EscapableHandleScope(v8::Isolate*)
    0xf3ba2e <+46>:  movq   %rbx, %rdi
    0xf3ba31 <+49>:  callq  0x8e2ba0                  ; v8::Object::CreationContext()
+   0xf3ba36 <+54>:  movq   %rax, %r13
+   0xf3ba39 <+57>:  movq   (%r13), %rax
+   0xf3ba3d <+61>:  movq   0x37(%rax), %rax
+   0xf3ba41 <+65>:  movq   0x10f(%rax), %r12
+   0xf3ba48 <+72>:  movq   %r13, %rdi
    0xf3ba4b <+75>:  callq  0x8ce8d0                  ; v8::Context::Enter()
    0xf3ba50 <+80>:  movq   0x878(%r12), %r15
    0xf3ba58 <+88>:  movq   (%r12), %rdi
+   0xf3ba5c <+92>:  callq  0x8d47f0                  ; v8::Isolate::GetCurrentContext()
+   0xf3ba61 <+97>:  testq  %r15, %r15
+   0xf3ba64 <+100>: je     0xf3ba78                  ; <+120>
+   0xf3ba66 <+102>: testq  %rax, %rax
+   0xf3ba69 <+105>: je     0xf3ba78                  ; <+120>
+   0xf3ba6b <+107>: movq   (%r15), %rcx
+   0xf3ba6e <+110>: cmpq   (%rax), %rcx
+   0xf3ba71 <+113>: je     0xf3ba8a                  ; <+138>
+   0xf3ba73 <+115>: jmp    0xf3bb19                  ; <+281>
+   0xf3ba78 <+120>: testq  %r15, %r15
+   0xf3ba7b <+123>: jne    0xf3bb19                  ; <+281>
+   0xf3ba81 <+129>: testq  %rax, %rax
+   0xf3ba84 <+132>: jne    0xf3bb19                  ; <+281>
+   0xf3ba8a <+138>: incq   0x338(%r12)
+   0xf3ba92 <+146>: movq   -0x60(%rbp), %rdi
+   0xf3ba96 <+150>: movq   %rbx, %rsi
+   0xf3ba99 <+153>: movl   -0x54(%rbp), %edx
+   0xf3ba9c <+156>: movq   -0x50(%rbp), %rcx
    0xf3baa0 <+160>: callq  0x8e43a0                  ; v8::Function::Call(v8::Local<v8::Value>, int, v8::Local<v8::Value>*)
+   0xf3baa5 <+165>: movq   0x338(%r12), %rcx
+   0xf3baad <+173>: testq  %rax, %rax
+   0xf3bab0 <+176>: jne    0xf3bacf                  ; <+207>
+   0xf3bab2 <+178>: cmpq   $0x2, %rcx
+   0xf3bab6 <+182>: jb     0xf3bac7                  ; <+199>
+   0xf3bab8 <+184>: decq   %rcx
+   0xf3babb <+187>: movq   %rcx, 0x338(%r12)
+   0xf3bac3 <+195>: xorl   %esi, %esi
+   0xf3bac5 <+197>: jmp    0xf3bae8                  ; <+232>
+   0xf3bac7 <+199>: movq   (%r12), %rax
+   0xf3bacb <+203>: addq   $0x60, %rax
+   0xf3bacf <+207>: decq   %rcx
+   0xf3bad2 <+210>: movq   %rcx, 0x338(%r12)
+   0xf3bada <+218>: movq   (%rax), %rsi
+   0xf3badd <+221>: movq   %r14, %rdi
    0xf3bae0 <+224>: callq  0x8ce6e0                  ; v8::HandleScope::CreateHandle(v8::internal::Isolate*, v8::internal::Object*)
    0xf3bae5 <+229>: movq   %rax, %rsi
    0xf3bae8 <+232>: leaq   -0x48(%rbp), %r14
    0xf3baec <+236>: movq   %r14, %rdi
    0xf3baef <+239>: callq  0x8ce7d0                  ; v8::EscapableHandleScope::Escape(v8::internal::Object**)
    0xf3baf4 <+244>: movq   %rax, %rbx
    0xf3baf7 <+247>: movq   %r13, %rdi
    0xf3bafa <+250>: callq  0x8cea20                  ; v8::Context::Exit()
    0xf3baff <+255>: movq   %r14, %rdi
    0xf3bb02 <+258>: callq  0x8d4060                  ; v8::HandleScope::~HandleScope()

I started doing this in the hope that we could find some improvements, but at this point I'm just going to step back. When you've come back with some actual effort at finding how things could improve, I'll engage in further discussion with you.

my prototype does 260k

I'd love to see that. Have a link?

Did you know Node.js does 30k requests/second while NGINX does 170k?

Ah, mine only does 120k/sec. Probably because we have a different number of threads resolving requests. Unlike node and all, that only uses one thread. :P

@trevnorris
Contributor

@benjamingr

As for the async queue I suspect using a cirtular buffer structure with zero (well, O(1) amortized) allocations could improve the speed of draining

The nextTickQueue rarely goes over 5 items. My benchmarks have shown that the added complexity of a circular buffer costs more than it saves, given how well V8 handles such small arrays. Though I did that back in the v0.12 days, so perhaps it's worth revisiting.

But overall the overhead would still be negligible. We have the extra overhead of needing to check things like whether more microtasks have been queued during the execution of a nextTick() callback, wrapping everything in try/finally to properly propagate an 'uncaughtException', etc. Though even with that, execution time sits around 50 ns/op, which is faster than calling v8::Function::Call().

@benjamingr
Member

The nextTickQueue rarely goes over 5 items. My benchmarks have shown that the added complexity of a circular buffer costs more than it saves, given how well V8 handles such small arrays. Though I did that back in the v0.12 days, so perhaps it's worth revisiting.

Yes, that makes sense. Promises can push a thousand things onto the queue, and it needs to be zero allocations so it's almost as fast as synchronous calls and the performance loss is minimal - so a fast promise implementation has to use a better data structure. Things deferred with nextTick are not nearly as common.
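
A generic sketch of the kind of circular buffer being described (the concept only, not bluebird's or node's actual implementation): push/shift are O(1), and the backing storage only reallocates when the queue outgrows its capacity, so allocations are O(1) amortized.

#include <cstddef>
#include <utility>
#include <vector>

template <typename T>
class RingQueue {
 public:
  explicit RingQueue(size_t capacity = 16) : buf_(capacity) {}

  void push(T value) {
    if (size_ == buf_.size()) Grow();  // amortized O(1)
    buf_[(head_ + size_) % buf_.size()] = std::move(value);
    size_++;
  }

  T shift() {
    T value = std::move(buf_[head_]);
    head_ = (head_ + 1) % buf_.size();
    size_--;
    return value;
  }

  size_t size() const { return size_; }

 private:
  void Grow() {
    std::vector<T> bigger(buf_.size() * 2);
    for (size_t i = 0; i < size_; i++)
      bigger[i] = std::move(buf_[(head_ + i) % buf_.size()]);
    buf_.swap(bigger);
    head_ = 0;
  }

  std::vector<T> buf_;
  size_t head_ = 0;
  size_t size_ = 0;
};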

@ghost
Author

ghost commented Dec 12, 2016

Anyway, you now have a report which you have validated yourself, so you know where you can improve. If you can get those 330 ns -> 190 ns -> 140 ns fixes committed then that would definitely be a nice initial step. Clearly this is piranha water I don't want to stay in for longer than necessary. Very toxic environment indeed.

@trevnorris
Contributor

trevnorris commented Dec 13, 2016

@alexhultman

I report MakeCallback is a lot slower than V8 function call

We need numbers! Tangible, actionable pieces of information that we can do something about. The entirety of your contribution to this issue could be paraphrased as "MakeCallback extremely slow and bloated. Fix it!" I provided patches, benchmarks and numbers hoping you'd jump in at some point and do the same.

which was shown by my benchmark

For the love, where? You state a mythical benchmark exists but haven't posted a single link or line of code. I have exhaustively written benchmarks for 5 variations of a scenario that you say exists, but refuse to directly demonstrate.

including me "not having anything to back up my claims"

I'm only bringing this up b/c I'm afraid that someone in the future will stumble on this thread and think I actually said that. Please, unless you explicitly state you're paraphrasing, only quote things that have actually been said.

Right, because I was completely unaware that Node.js was single threaded

Sorry. That was just a friendly poke at the phrasing. Not a jab at your understanding of how NGINX works. Thought that'd be conveyed by how blatant the comment was and by the :P at the end, but text does suck for expressing humor. Or, I suck at using text to communicate humor.

@benjamingr

so it's almost as fast as synchronous calls and the performance loss is minimal

Queueing may be freaky fast, but don't forget about execution setup. Each microtask has ~60 ns of overhead as well.

@Trott
Member

Trott commented Dec 13, 2016

Hey, everyone!

Just a reminder that we all want the same thing: For Node.js to be as awesome as it can be.

In the interest of working together to achieve that common goal, perhaps we can try to avoid loaded words/phrases like bloated or spouting. That's not to say you're incorrect to feel the frustration you feel that makes you want to reach for those types of words. Just, maybe, resist the temptation to reach for them anyway.

On the upside, it looks like there's agreement on at least some of the technical path forward. Maybe we can focus on that and try to put the rancor in the rear-view mirror.

And if I'm coming across a bit too preachy and just making things worse, please ignore me and I'll probably go away.

@ghost
Author

ghost commented Dec 13, 2016

I get a major boost in performance with the added "return":

Echo performance: 548.076 echoes/ms
Echo performance: 564.108 echoes/ms
Echo performance: 569.993 echoes/ms
Echo performance: 572.249 echoes/ms
Echo performance: 573.693 echoes/ms
Echo performance: 574.935 echoes/ms
Echo performance: 575.624 echoes/ms
Echo performance: 573.393 echoes/ms
Echo performance: 572.917 echoes/ms
Echo performance: 573.756 echoes/ms
Echo performance: 574.125 echoes/ms
Echo performance: 574.401 echoes/ms
Echo performance: 575.03 echoes/ms
Echo performance: 575.586 echoes/ms
Echo performance: 576.696 echoes/ms
Echo performance: 577.558 echoes/ms
Echo performance: 578.03 echoes/ms
Echo performance: 578.846 echoes/ms
Echo performance: 579.383 echoes/ms

is now:

Echo performance: 642.918 echoes/ms
Echo performance: 643.367 echoes/ms
Echo performance: 643.61 echoes/ms
Echo performance: 643.515 echoes/ms
Echo performance: 643.977 echoes/ms
Echo performance: 643.986 echoes/ms
Echo performance: 644.031 echoes/ms
Echo performance: 643.999 echoes/ms
Echo performance: 644.04 echoes/ms
Echo performance: 644.087 echoes/ms
Echo performance: 644.422 echoes/ms
Echo performance: 644.429 echoes/ms
Echo performance: 644.489 echoes/ms

With the second "fix" of removing those three lines below the TODO comment I get this:

Echo performance: 700.181 echoes/ms
Echo performance: 699.595 echoes/ms
Echo performance: 699.881 echoes/ms
Echo performance: 699.922 echoes/ms
Echo performance: 700.458 echoes/ms
Echo performance: 701.075 echoes/ms
Echo performance: 701.179 echoes/ms
Echo performance: 701.031 echoes/ms
Echo performance: 701.268 echoes/ms
Echo performance: 701.311 echoes/ms
Echo performance: 701.659 echoes/ms
Echo performance: 701.408 echoes/ms
Echo performance: 701.219 echoes/ms
Echo performance: 701.103 echoes/ms
Echo performance: 701.369 echoes/ms
Echo performance: 700.882 echoes/ms
Echo performance: 700.567 echoes/ms
Echo performance: 700.622 echoes/ms
Echo performance: 700.644 echoes/ms

I think the above benchmark results show a very clear picture of MakeCallback having some serious performance problems, just like my very first post said. Despite having been called out multiple times for not having anything technical to back it up, and despite nice posts like this one:

Please, I beg of you to stop arguing things you haven't tested or validated. It's honestly a waste of developer's time. But to appease your insistence on something you're so sure of, let's simplify this.

Simply spouting that it's "bloated" without anything technical to back up your claim only makes it tedious for those of us who are actually investigating.

It's not like I have done this for 8+ months now...

@trevnorris
Contributor

@alexhultman

Despite having been called out multiple times for not having anything technical to back it up

You have yet to post any code, stack trace, benchmark graph, etc. to technically back up your claim that it was slow. Were there improvements? Yes, but they were only found after I had written a number of benchmarks and test patches to find what could be improved.

we are already far past the point of no return socially here, our relation is already runied for the future

This is unfortunate. If I retained animosity for every developer I've had a strong disagreement with I'd be one very lonely developer.

@ghost
Author

ghost commented Dec 13, 2016

I don't have anything to back up my claims. Everything I do is based on feelings and imagination. Those numbers I posted, those are all made up. I have no idea what I'm doing. I just woke up and thought it would be fun to imagine that MakeCallback was slow.

What a coincidence! Who would have guessed there actually were performance problems with MakeCallback. Those are some nice odds right there; I should probably go buy a lottery ticket straight away. Since I clearly have no idea what I'm doing, it must have all been thanks to your infinite wisdom of technicality and benchmarking grandness. Thanks, O great one, for we shall no longer fumble in the dark.

Clearly nothing of what I have to say means anything since I obviously only imagine numbers and guess what to do next. Maybe this whole thing was a cover-up by the government? Maybe this all was just a big hallucination, oh wow who knows?

Thanks for your infinite imperfection and your guidance in this hard, hard time. I have so much to learn from thou grand grand one.

Btw, everything I do is open source and available on my github, including the benchmarks.

@trevnorris
Contributor

@alexhultman I'm sorry, but after ample attempts to be civil your attitude has constantly been hostile towards healthy development. I'll open another issue with the technical points discussed here for further action.

@nodejs nodejs locked and limited conversation to collaborators Dec 13, 2016