RFC: seeking feedback on async_hooks performance in production #14794
Comments
@nodejs/diagnostics
I'm guessing the instructions for a non-synthetic benchmark would be:
If so, I might get some data in the coming weeks (wanted to give 8.3 a spin anyhow in one of our services).
@jkrems You actually need to add empty hook functions, otherwise nothing changes.
Here is the example trace snippet from nodejs.org/api/async_hooks.html (slightly simplified):

const async_hooks = require('async_hooks');
const fs = require('fs');

let indent = '';
const traceFile = 1; // fd 1 is stdout

async_hooks.createHook({
  init(asyncId, type, triggerAsyncId) {
    const eid = async_hooks.executionAsyncId();
    fs.writeSync(
      traceFile,
      `${indent}${type}(${asyncId}): trigger: ${triggerAsyncId} execution: ${eid}\n`);
  },
  before(asyncId) {
    fs.writeSync(traceFile, `${indent}before: ${asyncId}\n`);
    indent += '  ';
  },
  after(asyncId) {
    indent = indent.slice(2);
    fs.writeSync(traceFile, `${indent}after: ${asyncId}\n`);
  },
  destroy(asyncId) { fs.writeSync(traceFile, `${indent}destroy: ${asyncId}\n`); },
}).enable();
Been running it in production on several high-traffic components since 4.2.1 and have had no major memory leaks. No performance degradation compared to the old async-listeners implementation. One thing we did notice was that the asynchronous hooks generated by promises didn't get "destroy"ed until the promise was garbage collected. This caused a bit of memory runaway under very high (synthetic) traffic, but we haven't noticed any problems with real-world traffic.
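To see the behavior described above, here is a minimal sketch (my own illustration, not code from that comment) that counts init/destroy events for PROMISE resources; on the Node versions discussed here, destroy callbacks for promises typically only arrive after the promises are garbage collected, so running with --expose-gc makes the effect visible:

// Run with: node --expose-gc promise-destroy.js
const async_hooks = require('async_hooks');

const promiseIds = new Set();
let inits = 0;
let destroyed = 0;

async_hooks.createHook({
  init(asyncId, type) {
    if (type === 'PROMISE') { inits++; promiseIds.add(asyncId); }
  },
  destroy(asyncId) {
    if (promiseIds.has(asyncId)) { destroyed++; promiseIds.delete(asyncId); }
  },
}).enable();

// Create a burst of promises and let them resolve.
Promise.all(Array.from({ length: 10000 }, (_, i) => Promise.resolve(i)))
  .then(() => {
    setTimeout(() => {
      console.log(`before gc: init=${inits} destroy=${destroyed}`);
      global.gc(); // force a collection (requires --expose-gc)
      setTimeout(() => {
        console.log(`after gc:  init=${inits} destroy=${destroyed}`);
      }, 100);
    }, 100);
  });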
Hello, I believe https://github.com/angular/zone.js/issues/889 might interest you!
I'm under the impression Google started using it in production in their APM. googleapis/cloud-trace-nodejs@0e15b6c
We are currently experimenting with async hooks but it is still behind a flag while we evaluate it. We found it to be slightly slower than the CLS-based approach we currently use.
Awesome! Thanks @matthewloring. When running with the CLS, is the monkeypatching on all async methods still applied? There might be a performance boost there.
The monkeypatching done by our agent is the same under both approaches (since other modules have not yet started using the async hooks embedder API). Async hooks tracks more lifecycle events per asynchronous task than CLS does, especially for promises, many of which are redundant for the context tracking done by our agent. I suspect this to be part of the slowdown, but we need to do more performance analysis to verify.
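As a rough illustration of the event volume mentioned above (my own sketch, not code from the agent), a no-op hook that simply tallies its callbacks shows how many lifecycle events even a short awaited promise chain generates:

const async_hooks = require('async_hooks');

const counts = { init: 0, before: 0, after: 0, destroy: 0, promiseResolve: 0 };

async_hooks.createHook({
  init() { counts.init++; },
  before() { counts.before++; },
  after() { counts.after++; },
  destroy() { counts.destroy++; },
  promiseResolve() { counts.promiseResolve++; }, // only fires on newer Node versions
}).enable();

async function chain() {
  // Each await step creates several promise resources under the hood.
  await Promise.resolve(1);
  await Promise.resolve(2);
  await Promise.resolve(3);
}

chain().then(() => {
  // Log on a later tick so the final after/promiseResolve events are included.
  setImmediate(() => console.log(counts));
});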
Is it even possible to implement a robust cls/zone (or a lower-level abstraction that would allow zone/cls to be built upon) as a native Node module? I'm under the impression that some features are very hard or impossible to implement in a robust way using only JS (even using monkey-patching), mostly related to hacks in third-party libraries that also attempt to do low-level async stuff and implement some sort of userland scheduling, e.g. bluebird. What are your use cases for this family of features? Example[1]:

function* counterServer({ counter }) {
  yield effect("saveCounter", counter);
  try {
    const message = yield receive(10000); // pause until a message is received or a timeout has elapsed
    if (message.type === "add") {
      // "loop" by recursively yielding itself (or another generator function)
      return yield* counterServer({ counter: counter + 1 });
    }
    if (message.type === "query") {
      // send a message back to the sender
      yield send(message.source, { type: "counter", counter });
      return yield* counterServer({ counter });
    }
  }
  catch (error) {
    if (error instanceof ReceiveTimeout) {
      // it's okay to time out, we can just keep receiving
      return yield* counterServer({ counter });
    }
    // other errors crash the process so that its monitor can restart it in a safe state
    console.error("Error thrown in counterServer");
    throw error;
  }
}

const loop = async (initialCounter) => {
  let savedCounter = initialCounter;
  try {
    await spawn(counterServer, { counter: savedCounter }, {
      // algebraic effect handler: produce a "global" side effect (bubbled and resumable)
      saveCounter: (currentCounter) => savedCounter = currentCounter,
    });
  }
  catch (error) {
    console.error(error);
    // restart in a safe state
    return await loop(savedCounter);
  }
};

loop(0);

For this feature to be exploitable, I need to monitor asynchronous side effects, so that calculations can't escape their context (unless they really, really want to, in which case the invariants would not be guaranteed). If zone or cls were reliable at scale, I could build this upon them. For this I think I only need to ensure that I can keep a readable context, and that I can run a function in an extended context. Example[1] (could be expressed in terms of zones/cls):

Context.current.fork((context) => {
  // extend the context passed by the parent
  return { ...context, foo: "bar" };
}, (context) => {
  // do stuff in a new "hardly escapable" context (not strictly impossible to escape, but hard to do by mistake)
  ...
}).catch((error) => ...).then((result) => ...);

I think allowing "just" that would enable many useful paradigms, even with a more restricted feature set than Zones/cls. Do you think this can be implemented using async_hooks only? I have a prototype, but I'm sure you guys spent a lot of time thinking about this and I'd be glad to read your thoughts.
[1]: Syntax for illustrative purposes only, not really important at this stage
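For what it's worth, the "readable context" part of the question above can be sketched with async_hooks alone. The following is a minimal sketch of my own (the `Context`/`fork` names loosely mirror the hypothetical example above and are not an existing API): contexts live in a Map keyed by asyncId, init copies the parent's context to the child, and destroy cleans up.

const async_hooks = require('async_hooks');

// asyncId -> context object
const contexts = new Map();

async_hooks.createHook({
  init(asyncId, type, triggerAsyncId) {
    // The child resource inherits the context of the execution that created it.
    const parent = contexts.get(triggerAsyncId) || contexts.get(async_hooks.executionAsyncId());
    if (parent !== undefined) contexts.set(asyncId, parent);
  },
  destroy(asyncId) {
    // Note: for promises this only fires at GC, as discussed earlier in the thread.
    contexts.delete(asyncId);
  },
}).enable();

const Context = {
  get current() {
    return contexts.get(async_hooks.executionAsyncId()) || {};
  },
  // Run `body` with a context derived from the current one by `extend`.
  fork(extend, body) {
    return new Promise((resolve) => {
      // setImmediate gives us a fresh async resource to attach the new context to.
      setImmediate(() => {
        contexts.set(async_hooks.executionAsyncId(), extend(Context.current));
        resolve(body(Context.current));
      });
    });
  },
};

// Usage: the extended context stays visible across awaits inside the forked body.
Context.fork(
  (ctx) => ({ ...ctx, requestId: 42 }),
  async () => {
    await new Promise((r) => setTimeout(r, 10));
    console.log(Context.current.requestId); // 42
  }
);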
It seems that NR is running a beta of a new version of their agent based on async_hooks. See: https://discuss.newrelic.com/t/feature-idea-native-async-await-support/49141/7 @martinkuba can you share some results regarding the usage of the API here? ref: #14717
@elierotenberg there is a plan to re-implement domains over async_hooks, you might want to keep an eye on this.
I will. Thanks.
@vdeturckheim I work at NR (Martin actually just moved teams). We've had mostly positive results thus far. No real-world issues have come up and it seems the performance is generally better than our existing monkey-patch based instrumentation. Our customers who are testing it have responded positively without any complaints of performance degradation after turning it on. In our worst-case scenario benchmark (a 300-link no-op promise chain), our async-hook instrumentation is about 5x faster than our older monkey-patch based instrumentation. A no-op async hook was about 1.5x faster than our async-hook instrumentation. No instrumentation at all was about 8x faster than the async-hook instrumentation. This test is not reflective of real-world performance, but does show it is better than what we had and there is room for improvement. For memory-leak situations, we've run some servers under extremely heavy load (~6 million promises per minute, 6000 requests per minute) for several days and noticed no leaking issues. We have, however, encountered a strange behavior around the 6-hour mark where GC scavenge events suddenly jump and CPU jumps with it. This is a one-time situation and the memory usage doesn't change after that. We're currently working on narrowing down the cause of that.
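For reference, a worst-case micro-benchmark along the lines described above could look roughly like this (my own sketch; only the 300-link figure comes from the comment, everything else is assumed). It compares a no-op promise chain with no hooks against the same chain with a no-op async hook enabled:

const async_hooks = require('async_hooks');

// Build a 300-link no-op promise chain.
function chain() {
  let p = Promise.resolve();
  for (let i = 0; i < 300; i++) {
    p = p.then(() => {});
  }
  return p;
}

function nowMs() {
  const [s, ns] = process.hrtime();
  return s * 1e3 + ns / 1e6;
}

async function measure(label, iterations) {
  const start = nowMs();
  for (let i = 0; i < iterations; i++) {
    await chain();
  }
  console.log(`${label}: ${(nowMs() - start).toFixed(1)} ms`);
}

async function main() {
  await measure('no hooks', 1000);

  // Enable a no-op hook; init/before/after/destroy now fire for every promise link.
  const hook = async_hooks.createHook({
    init() {}, before() {}, after() {}, destroy() {},
  }).enable();

  await measure('no-op async hook', 1000);
  hook.disable();
}

main();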
Glad to learn that NR is working on a visualization/debugging tool based on async_hooks. Monitoring and debugging async work seems very painful and hard to me, and I'm convinced we need better tools with the proper abstractions. As a side note, I've continued my little research and am now working on implementing a (minimal) actor system for easier work distribution within a single Node process or spanning multiple Node processes (it should even be able to support having actors actually run in the browser and communicating with the cluster using http/websockets). Think Erlang or Akka (but with far fewer features initially, of course).
@NatalieWolfe thanks a lot! Looks pretty promising! Out of curiosity, what applications are you testing the agent with? Did you build an internal app or do you rely on open source projects?
Realized my comment might not be germane to this thread, so I moved it here: #14717 (comment)
This commit fixes a bug where a request can end without finishing its timings. See: MiniProfiler#4

Bug cause: The mechanism used to register Miniprofiler timing providers (Postgres, HTTP, Redis) overrides the original method globally and uses the `request` object to access the Miniprofiler extensions that build the timings. This doesn't work when simultaneous requests use an async provider. Here is an example using [`pg`](https://github.com/goenning/miniprofiler-pg/blob/master/index.js) to illustrate the failing scenario (check out the `tests/concurrent-async-test.js` test to see it running).

request A start:
* `pg.Client.prototype.query` holds a `req` object of request A.
* It calls `.query` on a pg instance.
* A miniprofiler timing starts with the call to `req.miniprofiler.timeQuery(...)`.
* The original method is called (async).

request B start:
* `pg.Client.prototype.query` holds a `req` object of request B.
* It calls `.query` on a pg instance.
* A timing starts with `req.miniprofiler.timeQuery(...)`.
* The original method is called (async).

request A resume:
* The result of `.query` is returned.
* A new call to `.query` is made.
* This time `req` points to request B, which means `req.miniprofiler.timeQuery(...)` will start a timing on request B.
* The original method is called (async).

request B resume:
* The result of `.query` is returned.
* All data was fetched and the request is ready to finish, so internally Miniprofiler calls [`stopProfilling`](https://github.com/MiniProfiler/node/blob/1a98e40698b1126ac8de728a33406656361f8870/lib/miniprofiler.js#L80).
* This fails because there is a timing started (by request A) but not finished, so calculating the [diffs](https://github.com/MiniProfiler/node/blob/1a98e40698b1126ac8de728a33406656361f8870/lib/miniprofiler.js#L409) fails because `stop` is undefined.

Solution
--------

Using Node.js "async_hooks" we can track the reference to the correct extension for each request, so calls to `req.miniprofiler.timeQuery()` will point to the correct miniprofiler extension. For some performance analysis see: nodejs/node#14794 (comment)

The goal of the current commit isn't to introduce breaking changes, so the miniprofiler reference is injected into the request using JS getters. Another solution is changing the API for providers, so that instead of receiving a `req` reference they receive a function that returns the correct miniprofiler instance. But this would break the API for all existing providers.

References
----------

- https://medium.com/the-node-js-collection/async-hooks-in-node-js-illustrated-b7ce1344111f
- https://medium.com/@guysegev/async-hooks-a-whole-new-world-of-opportunities-a1a6daf1990a
- nodejs/node#14794 (comment)
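A hedged sketch of what that solution might look like (my own illustration, not the actual MiniProfiler code; `profilers` and `bindProfiler` are hypothetical names): an asyncId-keyed map tracks which request's profiler is active in the current async context, and a getter on `req` resolves through it, so even a provider holding a stale `req` reaches the right profiler.

const async_hooks = require('async_hooks');

// asyncId -> profiler of the request being handled in that async context.
const profilers = new Map();

async_hooks.createHook({
  init(asyncId, type, triggerAsyncId) {
    // Child async operations inherit the profiler of the context that created them.
    if (profilers.has(triggerAsyncId)) {
      profilers.set(asyncId, profilers.get(triggerAsyncId));
    }
  },
  destroy(asyncId) {
    profilers.delete(asyncId);
  },
}).enable();

// Called once per request (e.g. from middleware) with that request's profiler instance.
function bindProfiler(req, profiler) {
  profilers.set(async_hooks.executionAsyncId(), profiler);
  // Even if a monkey-patched provider captured a stale `req`, this getter
  // resolves the profiler for the async context that is actually executing.
  Object.defineProperty(req, 'miniprofiler', {
    configurable: true,
    get() { return profilers.get(async_hooks.executionAsyncId()); },
  });
}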
Closing as discussion seems to have run its course. Feel free to re-open if this is still something that needs more discussion.
node@8.0.0 saw the introduction of async_hooks, the latest and greatest incarnation of core's tracing API (#11883, based on https://github.com/nodejs/node-eps/blob/master/006-asynchooks-api.md). The @nodejs/async_hooks team would love to hear feedback, especially regarding performance in production apps (CPU/memory usage), but also synthetic benchmarking results.
Any other feedback is also welcome, as well as issues and PRs.
Thanks.