Supporting multiple trace providers: Trace Events, DTrace, Lttng, ETW #163
Comments
Some feedback received so far:
(will update this list as we go) |
One key question that needs to be answered: does a single Node.js binary need to support multiple trace mechanisms? For instance, current builds on OSX support both DTrace and v8 Trace Events independently. If we include a single instrumentation mechanism, does that need to support multiple trace backends at runtime, or is it ok to pick one backend at compile time? |
As long as DTrace probes are included in the OS X and SmartOS (and FreeBSD?) binaries provided by the project, I would definitely be OK with leaving out the v8 Trace Events probes. |
I have a distinct feeling that current users of the v8 trace event probes will say the exact same thing. I think we need to be resigned to the fact that the default node.js binary is simply going to be required to support multiple trace targets for the foreseeable future. The trick will be finding a way of efficiently instrumenting once for those multiple targets. |
Trace events are a lot more flexible, maintainable and consumable than static probes so if we had to pick one or the other, it's trace events all the way. I don't think it has to be either/or, though. That said, the static probes do suffer from technical debt; see e.g. nodejs/node#18074. Get involved if you want to ensure their bright future. |
I also thought that we agreed to think about deprecating DTrace in favor of trace events. If there is no feature gap to be expected, maybe we can initiate that process and think of EOLing DTrace with ~ v12? |
Given feedback I've received I don't believe we're going to be able to deprecate any time soon |
I'm curious, feedback from whom? I can't imagine many people caring about our static probes; it might be a literal fingers-on-one-hand deal. Data point: I quietly removed them from libuv a few years back and no one complained. |
Instead of supporting LTTng, could we look into using the Linux native trace events: https://lwn.net/Articles/379903/ If I remember correctly, they are also consumed by LTTng, so hopefully it will make little difference for the LTTng users. (edit: I remember correctly.) |
new LTTng PR: nodejs/node#18945 - it suggests there is some use. |
Probably very little though. The build was broken for close to two years without anyone noticing. |
I opened the PR mentioned above, and was asked to provide some feedback here. My problem is that we have a node process with an apparent memory leak (we let it run, the memory usage goes up endlessly). Inspecting using the devtools didn't show anything obvious, there wasn't a big difference before/after in the number or total size of allocated objects. But there was a difference of tens of gigabytes of mapped virtual memory (original issue: eclipse-theia/theia#1284). I happen to know LTTng and saw that node had some LTTng instrumentation, so I wanted to try it, that's when I hit the issues addressed by my PR. I don't know if it will actually help, but I thought I'd give it a shot. I didn't look yet, but I assume there are some tracepoints in the memory allocation/deallocation functions. If there isn't, I thought I could add some. I am not familiar with the node code, so I'm in exploration mode :) I think that properly instrumenting the binary with tracepoints placed at strategic places is useful. You won't use it often, but the day you need it you're happy it's there. In the end I don't really mind if it's LTTng or another tracer. But one thing to keep in mind is that some tricky performance problems or rare crashes can only be reproduced in production, so it's sometimes necessary to enable tracing in production. LTTng is very efficient and designed for such use cases. Tracing needs to have the smallest possible overhead. If it's possible to support multiple trace backends dynamically without any performance cost, then why not. But otherwise, if having to choose at compile time results in a more efficient solution, I would prefer that. |
@simark lttng based tracing would probably not help find the memory leak if it happens to be in the JS heap. V8 JIT generates allocation code inline at each allocation site – something that would be hard to intercept using a tracing mechanism. You might want to check out the experimental sampling heap profiler in V8 that is designed for in-production memory leak debugging. |
And that JITed code doesn't call a common allocation function?
Thanks, I'll give it a try. |
Sorry, I had both issues opened, and posted in the wrong tab. You can delete all history that this ever happened :) |
I feel it would be helpful (at least for me) if we had a matrix kinda' like the following, so we could look at different trace libraries we're discussing, and understand differences in capabilities. The rows aren't complete yet, so don't get too hung up on that. Do others find this useful? If so, should we move into top of issue & iterate on filling out in more detail? Or move into a markdown doc somewhere so ppl can PR it?
edit: filled out by @AndreasMadsen |
I would add another line that assesses how easy it is to use each tool. I would say that using ETW is extremely hard (and requires a lot of knowledge of Windows internals). dtrace needs Solaris (but maybe not in the future). LTTng is being removed. |
@mcollina - feel free to edit table above to add rows you think are useful. "Easy to use" risks being a subjective thing, but it's valid, and would be great to see if there's consensus around different tools being "easy"/"hard". The other angle here (not captured in table above), is what are the expectations for a runtime on the target OSs? E.g., for windows, expectation is probably ETW. Not saying that's a strong argument to support one library over another, but it warrants consideration IMO. |
@mike-kaufman I've added the row |
Please remove the "easiness of usage" row. It's completely subjective, where at least the other rows seem like they could be definitively measured one way or another. Given that you've put "hard" against everything that isn't Chromium Trace Events, it feels fairly politically motivated as well. You should also add FreeBSD to the list of operating systems providing DTrace support. |
@jclulow - I agree the "easy"/"hard" classification is subjective. However, usability is a key criterion. I'd like to get to a place where we have more detailed specifics about what's contributing to the easy/hard perception. Matteo's classification at least starts that conversation. If you have specifics or details that you think contribute to usability of any of these libraries, it would be really helpful if you could chime in with them. |
I can't stress enough how frustrating it is to have an entirely fictional criterion added, in which the features that I use every day are immediately on the back foot with no explanation. If you want to add some concrete criteria in the future, that's fine -- we can talk about that. This is not that, though, and it needs to go immediately. Alternatively, you can change the entry for DTrace from "hard" to "easy", at which point obviously I agree! (See the problem?) |
@jclulow I used both dtrace and perf in the past. Setting up and using those tools is not something a typical Node.js user is able to do. I think developer experience is a key criterion for these tools, as most of those options are either a) poorly documented, b) not documented at all, or c) not working at all (lttng). dtrace, lttng, and Linux TRACE_EVENT are not documented at all within our organization. We document Chromium trace events: https://nodejs.org/api/tracing.html. Taking documentation as one of the key metrics for usability, all of them except Chromium trace events are hard to use. @jclulow I kindly ask you to use calmer language. No one is offending you here. |
I am completely calm. Like I said, if you want to talk about something more concrete where we can measure alternatives by some objective criteria, that's fine. But, again, that's not what's up there right now! By way of example, if you wanted to put a row for "described in the Node documentation", that would be fine -- it's an objectively measurable fact. It also has the benefit of being something we could take action on to fix, if needed. The criterion "easiness of usage" is, by contrast, simply playing favourites with no justification. It should be taken down, first, then (if it makes sense) replaced with objective criteria afterwards. As an aside, I would note that you can hardly tell somebody else whether they are offended or not. |
@jclulow - the point here (at least from my pov) is to have a conversation about the different solutions & try to come to some objective, shared understanding. The easy/hard thing is simply a starting point for that conversation, which I think we're having. I don't foresee that column lasting too long. |
RE developer experience, there are two different facets here:
Complexity with the consumption can generally be addressed with additional tools if that is a concern. |
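To make the "additional tools" point concrete, here is a minimal sketch of a consumer for trace event output. It assumes the log is JSON with a top-level `traceEvents` array whose entries use the Chromium Trace Event Format fields (`name`, `cat`, `ph`, `ts`, `pid`, `tid`). The sample log, the event names in it, and the `countByCategory` helper are all invented for illustration; they are not taken from a real Node.js trace.

```typescript
// Sketch only: the sample log below is invented for illustration.
// Field names follow the Chromium Trace Event Format.
interface TraceEvent {
  name: string;
  cat: string; // comma-separated category list
  ph: string;  // phase, e.g. "B" (begin), "E" (end), "X" (complete)
  ts: number;  // timestamp in microseconds
  pid: number;
  tid: number;
}

// Hypothetical log contents standing in for a real trace file.
const sampleLog = JSON.stringify({
  traceEvents: [
    { name: "demoRead", cat: "node,node.fs", ph: "B", ts: 100, pid: 1, tid: 1 },
    { name: "demoRead", cat: "node,node.fs", ph: "E", ts: 350, pid: 1, tid: 1 },
    { name: "demoGC", cat: "v8", ph: "X", ts: 400, pid: 1, tid: 1 },
  ],
});

// Count how many events were recorded under each category.
function countByCategory(logText: string): { [category: string]: number } {
  const parsed = JSON.parse(logText) as { traceEvents: TraceEvent[] };
  const counts: { [category: string]: number } = {};
  for (const event of parsed.traceEvents) {
    // An event listed under several categories counts once for each.
    for (const category of event.cat.split(",")) {
      counts[category] = (counts[category] || 0) + 1;
    }
  }
  return counts;
}

const categoryCounts = countByCategory(sampleLog);
```

A summary like this is only a starting point, but it suggests that consumption-side complexity can be handled outside of core, independently of how the probes themselves are emitted.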
I've put the "easiness of usage" as TBD, and added a documentation line. I strongly argue that we need a measure for this, not of the tools themselves but of their usage with Node.js. perf as a tool can be simple or hard to use, but are we making it harder to use Node.js with perf? How can we improve the developer experience on the various platforms? |
One thing that I'd like to clarify... the "hard" / "easy" bit is not a criterion for removing or favoring one mechanism over another. We will be continuing to support Dtrace/eBPF/ETW style tracing along with v8 style tracing. Each has its own complexities, limitations, strengths, and relative difficulty of use. For instance, v8 trace events are theoretically "easier" but suffer from a lack of comprehensive tooling. |
@jasnell please expand with specifics on your claim
Is this really true, are tracing providers getting in our way? What problems aren't we able to solve because of our various trace providers? Are developers getting stuck or slowed in some way because of our trace providers? |
@jasnell The premise of this discussion seems to be:
Can we be a bit more specific about (and maybe quantify somehow?) the "maintenance burden" of specific tracing frameworks that are currently implemented in node core? Otherwise I'm not sure why we need to have this discussion. I think I'm familiar with what is and has been needed to maintain the DTrace support in node core, but I may have missed some things:
Is there anything else? If so, it doesn't seem to be such a challenge to me, but again, I may be missing things. @bnoordhuis mentioned nodejs/node#18074 as an example of maintenance burden for static probes in general (not just for DTrace), but unless I'm missing something those probes are not supported by DTrace, so I don't believe this example applies to it (not saying Ben was suggesting that). Now, I can also provide my feedback as a user of the DTrace support in node: in general, I haven't had the need to have static probes in libuv, which is maybe a sign of why removing them a while ago hasn't generated any significant pushback. On the question of:
I would say that I use the existing DTrace static probes very often to understand the behavior of node HTTP clients and servers (latency, distribution of requests methods/paths/etc., etc.) while:
@davepacheco mentioned specific use cases in a previous Twitter thread at https://twitter.com/dapsays/status/963921954309730304 that are similar to mine. I can definitely help anyone understand those use cases better. If people are interested, just let me know. However, just to make sure this was not lost in the length of this comment: I would really like us to be a bit more precise about what the "maintenance" burden is. If there's any way users of the DTrace support can help with that, let's talk about that too. |
@geek ... look at the lttng probes, for instance... apparently using For the dtrace probes, I'm evaluating now the best way to instrument the new http2 code paths and I'm trying to avoid having to instrument for dtrace and v8 trace events separately. I also want to instrument more for the performance API. Part of this entire discussion is to see if it's possible to develop a single instrumentation approach that allows us to support multiple providers and ensures that we have appropriate test coverage across those. |
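As a rough illustration of the "instrument once, support multiple providers" idea, here is a minimal sketch of a single instrumentation point that fans out to whichever backends are enabled at runtime. Every name in it (`TraceBackend`, `RecordingBackend`, `Tracer`, the category strings) is hypothetical and not a Node.js core API; real backends would fire DTrace/LTTng/ETW probes or emit trace events rather than record strings in memory.

```typescript
// Sketch only: every name here is hypothetical, not a Node.js core API.

// One backend per trace provider (e.g. DTrace, LTTng, ETW, trace events).
interface TraceBackend {
  name: string;
  // Cheap check so that disabled probes cost almost nothing.
  isEnabled(category: string): boolean;
  emit(category: string, event: string, args: { [key: string]: unknown }): void;
}

// Stand-in backend that records events in memory instead of firing a real probe.
class RecordingBackend implements TraceBackend {
  events: string[] = [];
  constructor(public name: string, private categories: string[]) {}
  isEnabled(category: string): boolean {
    return this.categories.indexOf(category) !== -1;
  }
  emit(category: string, event: string, args: { [key: string]: unknown }): void {
    this.events.push(category + "." + event + " " + JSON.stringify(args));
  }
}

// The single instrumentation point that call sites use.
class Tracer {
  constructor(private backends: TraceBackend[]) {}
  trace(category: string, event: string, buildArgs: () => { [key: string]: unknown }): void {
    let args: { [key: string]: unknown } | undefined;
    for (const backend of this.backends) {
      if (!backend.isEnabled(category)) continue;
      // Build the payload lazily, at most once, and only if a backend listens.
      if (args === undefined) args = buildArgs();
      backend.emit(category, event, args);
    }
  }
}

const dtraceLike = new RecordingBackend("dtrace-like", ["http"]);
const traceEventsLike = new RecordingBackend("trace-events-like", ["http", "fs"]);
const tracer = new Tracer([dtraceLike, traceEventsLike]);

let payloadsBuilt = 0;
// Both backends listen to "http": the payload is built once and emitted twice.
tracer.trace("http", "request", () => { payloadsBuilt++; return { method: "GET", url: "/" }; });
// No backend listens to "net": the payload is never built.
tracer.trace("net", "connect", () => { payloadsBuilt++; return {}; });
```

The per-call loop and `isEnabled` checks are the runtime cost of supporting several backends in one binary. Picking a single backend at compile time would remove that dispatch, which is the trade-off raised earlier in the thread.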
To put folks at ease a bit: no one is considering removing the DTrace support :-) ... I floated the idea as a hypothetical test balloon to solicit feedback. As soon as I did, enough folks came back with panicked screams that the very notion could be quickly discarded (as I suspected it would). The key point is: v8 trace events are not an alternative to dtrace probes and it would be fantastic to have an approach that supports both ... preferably an approach that makes things more useful for everyone. |
FWIW, @davepacheco had submitted a PR to document tracing at https://github.com/nodejs/node-v0.x-archive/pull/9206/files. That never landed, but after taking another brief look now, it looks like it would require at most minor changes to port that to the current documentation. |
@misterdjules |
@mcollina I'll see if I can move that forward. |
Trace events integrate seamlessly with DevTools. That's the default debugging experience these days so I think it's a fair comparison, nothing political about it. |
I think at the summit we had similar disagreements about "easy/hard" and it was brought up that "complex/simple to use" might be more to the point (how many moving pieces are there to setting it up and evaluating the data). Worth noting that this is always a reflection of the current state. A tool that requires a lot of setup today might become one-click tomorrow. But with that restriction it seems relatively objective (and useful). |
RE motivation here, in addition to reducing maintenance, we want a consistent set of trace data across supported OS platforms. Bonus points if this can be done in a way that is idiomatic with the tools/practices of each platform. |
This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made. |
Keeping the stale label for the same reason as #10 (this likely falls under the new tracing strategic initiative) |
Node.js core supports multiple trace providers... we currently have to instrument for each individually, which increases maintenance burden. Before we consider making any changes here, however, we need to understand the critical use cases and requirements for the existing probes. Let's use this thread to discuss.