Skip to content
This repository has been archived by the owner on Jul 31, 2018. It is now read-only.

proposal: JS tracing APIs #48

Closed
wants to merge 2 commits into from
Closed

Conversation

jasongin
Copy link
Member

@nodejs/diagnostics and anyone else interested, please review this proposal draft.

@AndreasMadsen
Copy link
Member

AndreasMadsen commented Jan 17, 2017

Is this a follow up on "trace events" as introduced in nodejs/node#9304?


I'm generally against using the term tracing formally, it can mean so many things. For example we already have:

  • LTTng (Linux), SystemTap (Linux), DTrace (OSX), ETW (Windows)
  • async_hooks
  • trace_events

all of which does some kind of tracing. If this API integrates with trace_events I think we should just call the module trace_events.

@jasongin
Copy link
Member Author

Yes, the Background section of the proposal doc links to that "trace events" PR.

As also noted in the doc, the module name "tracing" was already reserved for this purpose long ago. I don't feel strongly about it personally, certainly we can go with whatever name is agreed by consensus.

const scopeId = tracing.emit(category, { eventType: 'begin', name: 'myTask' });
tracing.emit(category, { eventType: 'end', name: 'myTask', id: scopeId });

tracing.emit(category, { eventType: 'count', name: 'myCounter', value: 3 });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This emit function is different enough from EventEmitter.emit that I'd prefer it was named differently so as to not confuse new users.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should support only the alternate API? All the conditional validation needed for this unified function is a bit worrying.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this should not be called emit at all. Maybe record?

00X-tracing.md Outdated
emit(
category: string | string[],
name: string,
args?: { [name: string]: any } | () => { [name: string]: any }): boolean;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should call this instant rather than emit. Multi-signature methods risk unforeseen problems, as the buffer issue has proven.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition, the polymorphism of multi-signature methods has implications for optimization, which raises performance concerns. Historically, this has led to problems for APM and other diagnostic solutions that are intended to be embedded in production applications. In general, the simpler / closer to monomorphic the APIs can be made, the better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this method would be better named instant. (Or perhaps timeStamp?)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for instant because it matches the terms used in the trace format descriptions. It makes it easier to find it elsewhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to instant in a new commit.


To support listeners, the `Tracing` class (from the previous section) extends a new
`CategoryEventEmitter` base class. That class is similar to `EventEmitter`, but does not extend
`EventEmitter` because it has different behavior related to the multi-category support.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, similar yet different enough from a typical event emitter that I worry it'd be confusing to new users.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered trying to make it more different from EventEmitter, but it felt silly to use different names for operations that are really the same as the normal EventEmitter methods.

I think the differences probably should not really be a problem. You can in fact use a CategoryEventEmitter (which the Tracing module extends) just like a normal EventEmitter if you don't care about the multi-category support.

Would it help to think of CategoryEventEmitter as a subclass of EventEmitter? (I considered literally extending the class, but it didn't provide any benefit since all the methods needed to be overridden to use different data structures from the base class.)

00X-tracing.md Outdated
tracing.emit(['example2', 'example3'], 'info', 'Hello tracing!');
```

### 3.2 Listern API specification
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Listen*

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a new commit.


The existing Node.js implementation of those Console tracing methods simply logs tracing events to
the console. This proposal extends those methods with a new optional category parameter to enable
emitting categorized tracing events in a way that is backward-compatible with existing console APIs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the idea of deviating from the behaviour of the console API spec. If they decide to add their own extra argument in the future, suddenly we have problems.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood; I haven't entirely convinced myself that it is a good idea either. That's why I made it an optional part of the proposal. It might be nice for some scenarios, so I'd like to at least get some feedback about the idea of integrating with the console tracing APIs. 

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm 👎 in adding any complexity to console.

@othiym23
Copy link

While I agree that treating those three APIs as separate concerns that can be implemented separately, I think it's helpful to treat emitting and listening of trace data as duals for a single abstraction.

As someone who, like @Qard, has worked on APM products in the past, the challenge for me was that there have been a variety of strategies for emitting trace events, but no reliable way to observe that data across platforms and deployment environments (in particular, PaaS and containerization). In practical terms, it feels like it's a lot easier to hit performance and overhead targets for the emitting rather than the listening side, and coming up with an efficient in-process interface for capturing that data would be the biggest win for a whole class of use cases.

@trevnorris
Copy link
Contributor

I'd like to see documentation about functionality this introduces that would be impossible to implement on top of async hooks (the async hooks EP is up to date).

@jasongin
Copy link
Member Author

I will update the doc to explain how this proposal relates to async hooks. I expect tracing instrumentation of node core can leverage async hooks for tracing of async operations, so in that regard the features are complementary.

Async hooks provide a way to hook into the beginning and end of async operations built on libuv primitives. Of course those async operations are a key subset of performance-related events.

This proposal is about a way to capture and/or listen to all kinds of performance-related events, including non-async begin/end, point-in-time, and counters. Additionally and most importantly it provides a very high-performance way to write the events out to a trace log file.

@targos
Copy link
Member

targos commented Jan 27, 2017

Is this somehow related to the browser's User Timing API? See https://developer.mozilla.org/en-US/docs/Web/API/User_Timing_API

@jasongin
Copy link
Member Author

@targos It is not directly related, but I'm aware of it. I think it would not be a good fit to implement the User Timing API directly in node, because it only has a subset of the functionality (does not support counters, categories, or any mechanism for turning off the tracing) and it does not have the right event-listener model.

If desired, it should be straightforward to create a polyfill for the User Timing API that would use this proposed Node.js tracing API underneath.

@trevnorris
Copy link
Contributor

@jasongin I should have clarified. I specifically meant in regards to the JS API. I'm fine placing the trace event hooks in native source. I already have for some testing and they work great. My concern is the JS API. Even if the call does nothing but return immediately there's still enough overhead to be noticeable. Especially when the function call is cross module (V8 has a harder time optimizing those for some reason).

For async hooks to incur the least amount of overhead possible in similar situations I had to do some things I'm rather not proud of (see https://git.io/vDJq9 as an example). So I'd like to avoid placing anything directly in JS. I'd like to see if the JS trace events could be optional async hooks that are enabled when those specific trace events are requested.

@jasongin
Copy link
Member Author

jasongin commented Jan 27, 2017

The main purpose of a JS API for tracing is to enable people to instrument their JavaScript code. While instrumenting internal async handles and other internal constructs may also provide useful data, that is only a piece of the performance picture and is not the focus of this proposal.

Even if the call does nothing but return immediately there's still enough overhead to be noticeable.

Only if your definition of "noticeable" is extremely generous. When tracing is completely off, any of the tracing event-emitter functions will literally be a no-op function, the equivalent of:

tracing.emit = function () {}

Even if the JIT cannot completely optimize-out the function call, it is still going to be extremely low-cost, I think small enough to not be noticeable in any reasonable sense.

Then of course when tracing gets enabled, either by turning on recording or adding a listener for any category, the emit functions will get replaced with functions that need to do a little bit of work, but are still fast enough not to have a significant impact on most applications.

@trevnorris
Copy link
Contributor

I spent around 1.5 week figuring out the best way to optimize for calls to timers and nextTick in my async hooks PR. The reason it was so important to improve is because node doesn't code for the average performance case. We code it for the fast case. Which means measuring against a high volume scenario.

Take a simple http server and add hooks when resources are created and also add timing around each resource's callback. In a high volume scenario, let's say a proxy (and these numbers are small compared to what F5 pushes through node), and do 45k req/sec. Just tracking the mentioned locations you're going to add ~1.5 million calls/sec. That's ~1.5 million noops in JS that have now been added. Which aren't actually noops because V8 can't completely optimize them out. At that volume I'm not being "extremely generous". I'm being practical.

Looking at the code from the link I posted you'd see that all the emitBefore()/emitAfter() calls are conditionally called. Because simply calling the noop function cost too much when I was benchmarking. Short of this is, I didn't spend a month tweaking every corner of the async hooks API to make sure it had no noticeable affect on core's performance just so we could throw in another similar API that would.

@jasongin
Copy link
Member Author

I appreciate the passion for performance. I am working on some benchmarks involving tracing in realistic scenarios. I will follow up when I have data ready to share.

const category = 'example';
tracing.enableRecording(category);

tracing.emit(category, { eventType: 'instant', name: 'myEvent' });
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could tracing expose constants to pass in as eventType instead of expecting particular strings?

Copy link
Member

@mcollina mcollina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm waiting for this, but I think there are still some nits to be fixed in this EP.

const scopeId = tracing.emit(category, { eventType: 'begin', name: 'myTask' });
tracing.emit(category, { eventType: 'end', name: 'myTask', id: scopeId });

tracing.emit(category, { eventType: 'count', name: 'myCounter', value: 3 });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this should not be called emit at all. Maybe record?

* @returns Nonzero event id (either provided or generated) if the event was emitted, or null if
* the event was not emitted because tracing was not enabled for any of the event categories.
*/
emit(category: string | string[], e: TracingEvent): number | null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is redundant, maybe we can try to avoid overloading as much as possible as it is harder to optimize, and this API should try to be as fast as possible.

* This interface has methods that are similar to EventEmitter, with an important difference:
* events may be emitted under multiple categories simultaneously instead of a single event name,
* and listeners may listen to multiple categories while still only being called once per event.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is similar but different, it should not be called EventEmitter, but something else.


The existing Node.js implementation of those Console tracing methods simply logs tracing events to
the console. This proposal extends those methods with a new optional category parameter to enable
emitting categorized tracing events in a way that is backward-compatible with existing console APIs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm 👎 in adding any complexity to console.

which way may want to adapt to Node.js eventually. And there has been discussion of a more compact
binary format. This is an area that needs some further development, however for performance reasons
the tracing output probably would not be exposed to JavaScript anyway, beyond the listener APIs
that are proposed here.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some time should be taken to design a processing pipeline for these. I think some tools and documentation to explain how to process this data format. Without interactive and easy-to-understand visualisations, tracing data is extremely hard to understand.

I should be straightforward to plot this data into a flamegraph for example.

@rvagg
Copy link
Member

rvagg commented Jun 21, 2017

https://github.com/brycebaril/transaction-tracer is a proposal that @brycebaril put forward a few years ago coming out the Tracing Summit held a few years ago. Might be good to connect to some of that work. @othiym23 and @trevnorris you were involved in that too weren't you? Was there any other documentation or notes that came out of that which might be useful?

@othiym23
Copy link

othiym23 commented Jun 21, 2017

I'm pretty sure @brycebaril's notes were the most complete; @groundwater acted as our facilitator, and may have recorded some of the outcomes, but I don't know how extensive his notes were, given that he was facilitating. @tjfontaine and @misterdjules were our hosts at the Joyent Vancouver offices and had extensive contributions to make around our efforts to come up with something that merged both APM and dynamic tracing along the lines of DTrace and ETW. I believe @Qard was also at that meeting as well, representing his APM-vendor employers of the time (sorry, Stephen, I'm blanking on what company that was temporarily).

I don't think @trevnorris made it to that particular tracing meeting, although he was certainly integral to the discussions that @groundwater, @tjfontaine, and I were having earlier that year. Trevor was at the error-handling "summit" convened by @dshaw earlier that year, at which @davepacheco was also present. As Trevor's involvement in those discussions eventually produced async_hooks and AsyncWrap, those APIs are far more concrete than any meeting notes are likely to be. ;)

I'll double-check my notebooks from the time, but my recollection is that I didn't keep detailed notes, in part because the discussion was largely focused on establishing the problem space more than solutions, in part because I'd moved on from APM at the time and was more just trying to act as a kind of expert witness.

@misterdjules
Copy link

FWIW, my notes from the meeting were:

General observations

  • People don't want to have to monkey patch node to implement tracing.

  • People seem to be more keen to keep the tracing facilities in node itself
    rather than use platform tracing facilities like dtrace/ETW/{System,K}tap.

What some APM providers do/need

New Relic:

  • Problems they try to solve:

    • help developers make their applications faster.
    • Identify individual transactions that are slow.
    • Don't generate very broad info (no big call graph), but more targetted
      info.
    • As lightweight as possible. Needs to not have performance impact on the
      running app.
  • Solution they use:

    • Uses AsyncListener polyfills or core implementation + continuous local
      storage to tie things together.

Strongloop:

  • Performance tracing: find performance problems and/or bugs (exceptions,
    leaks of descriptors, etc.).
    • Needs to know start and stop of async events.
  • Solution they use: monkey patch everything.
  • Other problems they try to solve: heap/cpu profiling.
  • Use or would like to use zones.
  • Mentioned being interested in other discrete events like GC or Node about to
    exit/not being able to exit, etc.

Appneta:

  • Goal: time profiling of the whole async flow.
  • Mentioned needs to trace async calls in native modules.

Concurix:

  • Also instrument synchronous code.
  • Light monkey patching.
  • Needs to not have performance impact on the running app.

General comments:

  • Tools to allow native modules to emit tracing/performance events when they
    perform async operations using plain libuv.

Conclusion of the "problem discussion"

  • Priorities:
    1. Unavoidable requirements:
    • Production ready.
    • Do not monkey patch node.
    • Do not alter behavior.
    1. Async tracing.
    2. Track async lifecycle.
    3. native module API.

Implementation

Concerns/Questions:

  • Keep a global state of async trees/transactions but how do we know when we
    can cleanup some of these long lasting state structures?

Observations:

  • Have a simpler solution that just bubbles up simple information and not
    complete async trees. AL is too high level.

Proposed API by TJ/Jacob (for a simple solution):

  • Terminology:
    • Namespace, provider, proble. Example:
      node:http:{client-request,client-response,etc.}
  • Advantages: similar to already defined probes.
  • API: global someProbe = createProbe('probeName', description of info that is
    going to be passed). The description of data is needed to support platforms
    providers like dtrace.
    someProbe.enable()
    someProbe.fire(function () { return [ data that is passed to the callback of the consumer]})
    .list(): returns all the probes/namespaces/etc defined in the system.
    .on(namespace, provider, probe, callback) with support for wildcards, etc.
  • Use ring buffers and other similar techniques on the consumer side to avoid
    performance hit.
  • Implementation of user side in https://github.com/joyent/node-tracing .
  • Namespace is useful to make sure that there is no conflict between probes,
    and can also allow some optimizations like filtering out what the consumer
    doesn't care about.

Amendments by Jacob:

API: do not need to create the probe explicitely. This is done by the API
the first time a probe is emitted.
The concern with that (from TJ) is that the consumer of these events often
needs to know in advance about the probes that can be fired. Not knowing about
existing probes in advance can make using dtrace and other platform providers
more difficult (but also other consuming platforms), and as Forest pointed out
it's a valuable asset for node.

Concerns

  • This solution is not generic enough to provide a solution for the number one
    priority: Async tracing of any function/callback. Answer to this concern: this
    API can be used as a tool to provide information about more generic stuff like
    handles creations, etc.

  • How to trace what happens in third party modules? Adding static probes to
    third party modules doesn't seem to be realistic for some people (e.g
    concurix).

  • How to attach the metadata that consumers are interested in? How to make
    sure that consumers have enough information? Monkey patching is currently the
    easiest way to do that.

  • Brute force solution for the problem of getting the metadata: emit all the
    data of the context at the firing site with the event. Slightly better
    proposition from Jacob: emit the associated data only if requested by the
    consumer.

  • Idealistic solution: being able to inspect any context within v8 when the
    consumer handles a probe.

@Qard
Copy link
Member

Qard commented Jun 21, 2017

The challenge that APM vendors faced with coming up with a generic API was that we all had slightly different needs for data collection. That's a big part of why I've been pursuing #56 as a way to apply instrumentation more abstractly.

@othiym23
Copy link

Thanks for the detailed notes, @misterdjules, and (belated) apologies for tagging you into the discussion!

@Trott
Copy link
Member

Trott commented Jun 12, 2018

Closing, but by all means re-open or comment if that isn't right.

@Trott Trott closed this Jun 12, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.