lib: add span API to diagnostics_channel #35534

Qard · 2020-10-07T03:36:20Z

I've split the span api out of #34895 as the exact design seems to still be a bit controversial, so I'd like to iterate on it here to allow the core of diagnostics_channel to land separately.

Depends on #34895

Checklist

make -j4 test (UNIX), or vcbuild test (Windows) passes
tests and/or benchmarks are included
documentation is changed or added
commit message follows commit guidelines

nodejs-github-bot · 2020-10-07T03:36:24Z

Review requested:

@nodejs/http
@nodejs/net

Qard · 2020-10-31T22:49:09Z

cc @nodejs/diagnostics Now that diagnostics_channel has landed, I'd like to get a fresh look at this to figure out how we can evolve the spans concept to make it less controversial. The general intent is to somehow provide a channel-unique correlation identifier to indicate that a set of events are related, along with some method of marking the start and end of a sequence of correlated events. I'm not tied to the current API design, it is just one possibility.

nodejs-github-bot · 2020-11-09T19:47:46Z

CI: https://ci.nodejs.org/job/node-test-pull-request/34255/

Flarna · 2020-11-12T11:39:33Z

In general I agree that something to correlate spans/transactions is helpful for diagnostics data.
But I'm not sure if we should push this into core as publishers can do something similar already now by adding some id into the data (which needs to be documented anyway by the publishers).

There are some drawbacks in optinion:

It's an optional API so the decission if something is published as span is still in the responsibility of the publisher and needs doc there
It disallows the publisher to select/reuse an alreay existing correlation ID
publisher can't access the correlation ID generated by core so they can't use it at other places to e.g. link spans
Core has to wrap span messages which is an unneeded overhead
A consumer using Span.aggregate gets the data twice (once aggregated and once directly via the channel callback) therefore filtering duplicates is needed in each consumre which is unneeded overhead
Span.aggregate leaks if publisher never ends a span. Clearly this is a publisher fault but the memory is held by core. Publishers don't see this leak until a consumer uses Span.aggregate

Qard · 2020-11-12T19:58:35Z

There are some drawbacks in optinion:

* It's an optional API so the decission if something is published as span is still in the responsibility of the publisher and needs doc there

The included docs already suggest when a span would make sense. Do you think there's something more needed than what I've already included?

* It disallows the publisher to select/reuse an alreay existing correlation ID

* publisher can't access the correlation ID generated by core so they can't use it at other places to e.g. link spans

Reuse of a correlation id is achieved by reusing the span object. If you want to use the id for something like linking one span to another, you can just get the id from span.id and do whatever you want with that data. Also, I don't consider span linking the responsibility of this API. It's up to the user to keep track of the "current" span through AsyncLocalStorage and apply linking themselves if they so desire. This API is meant to impose the least opinions possible on how the data is structured, providing little more than a channel-unique id only intended to be used to match events together. It's the responsibility of the user to create their own more meaningful ids for their own system, to build any span linkage, and to pull whatever data they need out of the span messages.

* Core has to wrap span messages which is an unneeded overhead

The overhead is extremely minimal. All it does it create an object with a numeric id, a name to identify the event type, and the message data itself. Do you have any suggestions for a simpler way to achieve the id and sequencing needs here?

* A consumer using `Span.aggregate` gets the data twice (once aggregated and once directly via the channel callback) therefore filtering duplicates is needed in each consumre which is unneeded overhead

That's assuming a channel would be used for multiple types of message data, which is not intended. Each channel should have a single purpose. I would strongly discourage sending spans on the same channel as non-span events. If it's something that doesn't fit a span.annotate(...) it should be on a different channel.

* `Span.aggregate` leaks if publisher never ends a span. Clearly this is a publisher fault but the memory is held by core. Publishers don't see this leak until a consumer uses `Span.aggregate`

Yep, I need to figure out a better way to handle that. That specific API is intended more just as a helper and I'm fine omitting that until we can figure out a good solution.

Flarna · 2020-11-13T06:58:27Z

Regarding docs: I meant publishers have to document what data they publish. So they could also include some correlationId in their data without dedicated support in core.
Once a publisher specifies that something is a span it can specify where/how the the messages can be correlated in a channel specific way, not via the core generated numeric id.

I have overlooked span.id. I agree that this can be used by publishers. I agree that context passing should not be part of this API. But some publishers may have usecases to "group"/link spans. It's not the task of core to do this and AsyncLocalStore may be used by consumers to correlate but I can imagine that there are "loose coupled" spans where publisher gives hints via related span-ids.

If you recommend to not mix spans and single messages on a channel I would recommend that API should reflect this. Maybe we should add a SpanChannel class which enforces spans to avoid the double publishing and checks like if (message instanceof SpanMessage || message.type === 'span') in onMessage instead we could check once in aggregate that the passed channel is a SpanChannel.

Qard · 2020-11-13T18:51:03Z

I'm not really sure there's a good way to add a new channel type parallel to the existing one though, due to how the publisher and subscriber sides basically form a race to construct the named channel instance. The subscriber-side API would all need to change to make sure it's constructing the appropriate type when not already found, which would probably also require duplicating most if not all of the other module internals...

Flarna · 2020-11-13T21:04:42Z

Another limitation of aggregate is that timings are lost for consumers.

Qard · 2020-11-14T06:43:22Z

The timings are what the map function was intended for, though I'm not convinced that's a great solution. 😕

Flarna · 2020-11-14T13:09:00Z

It's not just storing the timing. consider e.g. OpenTelemetry as consumer - they will call tracer.startSpan() and maybe propagte the new span via ALS. Getting only complete spans - without correlation between spans - will be not that useful for such usecases.

Qard · 2020-11-14T18:23:51Z

You can do the correlation in the map function too. Also, Span.aggregate is intended only as a helper to simplify the process of reducing a sequence of events to a single object. If the user prefers, they can just do that themselves and augment it with whichever extra behaviour they feel they need. 🤷

mmarchini · 2020-11-25T19:23:43Z

Haven't read the code yet, only the docs and discussion on the PR. So far I agree with @Flarna though, this can easily be achieved with the current API today by producers. Furthermore, general consensus in the diagnostics meeting today was that we need to start seeing adoption (especially by producers) of the existing API before adding new things (unless adding those new things is essential for adoption). @Qard do you know any frameworks using the current API, or with plans to use it in the near future?

Qard · 2020-11-25T20:19:32Z

I'm working on adding it to express, and I believe @RafaelGSS was working on making a fastify pugin. I would say that the spans API is somewhat essential to adoption though. There's definitely some value to what's already there, but there are certain common patterns such as expressing a span which would greatly benefit from a unified/standard way to express it, otherwise the ecosystem will gradually get fragmented. It would not be ideal if an APM wanted to instrument ten different database libraries and needed ten different span representation systems to handle all of them. 😬

RafaelGSS · 2020-11-25T21:32:33Z

I believe @RafaelGSS was working on making a fastify pugin

Yes, here fastify/fastify#2697 is the tracking issue. I haven't read the code though but seems reasonable to have a unique way to deal with Spans.

Qard · 2020-12-15T23:21:42Z

cc @nodejs/diagnostics I'm still seeking further feedback on this. I want to figure out some way to move this forward.

Flarna · 2020-12-18T15:48:13Z

Sorry, this got out of my focus...

What about publishing a ConsumerSpan once channel.span() is called (or on first annotated). The consumer registers a onMessage handler on this span and gets the annotate and end data synchronous.

user can decide to remove it's reference to this span to avoid memory leaks if publisher "forgets" about end
events are published only once even if the same channel has spans and normal messages
user can decide if data is aggregated before export, directly forwarded, after adding timestamps,...
publish timing/async context is available to consumer
no map lookup/instanceof checks needed

We could add a basic implementation of the span onMessage function which does aggregation like the current proposed static aggregate method.

Qard · 2020-12-18T18:35:12Z

Seems like all your arguments are against the Span.aggregate function specifically, and nothing really against any of the other parts of the API, so not sure why we would need an entirely different API. As already explained, the Span.aggregate largely exists as an example of how correlated events could be aggregated, but a consumer could just as easily do their own aggregation. I'm happy to just remove the Span.aggregate function rather than trying to solve the possible map leak, if that satisfies you.

Flarna · 2020-12-18T21:11:05Z

Yes, it's the consumer side I don't like that much. But I have to admit that I haven't tried yet to use your proposal in a bigger sample.

By proposing a Span object on consumer side I have a symmetry with publisher in mind. A published span arrives as span at consumer side, piece by piece.

I don't propose a completly different API. The publisher side is most likely uneffected. The consumer gets a SpanMessage with aggregation functionality.

Most likely it would be best that I just write down in code what I mean. Will try but I can't promise a date.

Qard · 2020-12-18T22:25:34Z

I'd be happy to get on a call sometime too, if you think that would help. We're getting into the holidays now though, so I don't expect to see a lot of movement on this until after the holidays are over. 😅

simon-id · 2021-07-19T11:09:49Z

To me, it looks like Span.aggregate is just an example of what could be achieved with this Span API. Not sure it belongs in the API directly though, it could just be a written example in the documentation.

Otherwise I think this Span API is a good idea, to at least try to give a general direction for the de facto (non-)standardization process that will happen in diagnostics_channel adoption. DC will still be usable without it, but at least, the use-cases that share this common "span" behavior will have a paved road to work with (and that also helps with adoption as it's easier to implement for producers AND consumers in the ecosystem).

Qard requested a review from a team as a code owner October 7, 2020 03:36

Qard removed the request for review from a team October 7, 2020 03:36

Qard force-pushed the diagnostics-channel-span-api branch from 254bf26 to 07305ce Compare October 7, 2020 03:41

Qard changed the title ~~Diagnostics channel span api~~ lib: add span API to diagnostics_channel Oct 7, 2020

Qard mentioned this pull request Oct 7, 2020

lib: create diagnostics_channel module #34895

Closed

4 tasks

mhdawson mentioned this pull request Oct 12, 2020

Node.js Diagnostics WorkGroup Meeting 2020-10-14 nodejs/diagnostics#435

Closed

mhdawson mentioned this pull request Oct 26, 2020

Node.js Diagnostics WorkGroup Meeting 2020-10-28 nodejs/diagnostics#438

Closed

Qard force-pushed the diagnostics-channel-span-api branch from 07305ce to d3bea0b Compare October 31, 2020 22:43

Qard force-pushed the diagnostics-channel-span-api branch from d3bea0b to de91871 Compare November 1, 2020 04:45

Qard added the request-ci Add this label to start a Jenkins CI on a PR. label Nov 1, 2020

github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Nov 1, 2020

This comment has been minimized.

Sign in to view

Qard added the request-ci Add this label to start a Jenkins CI on a PR. label Nov 1, 2020

github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Nov 1, 2020

This comment has been minimized.

Sign in to view

Qard added the request-ci Add this label to start a Jenkins CI on a PR. label Nov 9, 2020

github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Nov 9, 2020

This comment has been minimized.

Sign in to view

lib: add span API

f2d16bb

Qard force-pushed the diagnostics-channel-span-api branch from de91871 to f2d16bb Compare November 9, 2020 19:41

Qard added the request-ci Add this label to start a Jenkins CI on a PR. label Nov 9, 2020

github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Nov 9, 2020

Qard mentioned this pull request Nov 11, 2020

Node.js Diagnostics WorkGroup Meeting 2020-11-11 nodejs/diagnostics#441

Closed

Trott requested a review from bengl November 13, 2020 00:17

mhdawson mentioned this pull request Nov 23, 2020

Node.js Diagnostics WorkGroup Meeting 2020-11-25 nodejs/diagnostics#445

Closed

mmarchini removed the diag-agenda Issues and PRs to discuss during the meetings of the diagnostics working group. label Nov 25, 2020

guybedford force-pushed the master branch from dc5a5da to 8e46568 Compare March 29, 2021 21:33

Qard closed this Aug 11, 2021

Mesteery added the diagnostics_channel Issues and PRs related to diagnostics channel label Oct 13, 2021

Qard deleted the diagnostics-channel-span-api branch June 18, 2022 20:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

lib: add span API to diagnostics_channel #35534

lib: add span API to diagnostics_channel #35534

Qard commented Oct 7, 2020

nodejs-github-bot commented Oct 7, 2020

Qard commented Oct 31, 2020

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

nodejs-github-bot commented Nov 9, 2020

Flarna commented Nov 12, 2020

Qard commented Nov 12, 2020

Flarna commented Nov 13, 2020

Qard commented Nov 13, 2020

Flarna commented Nov 13, 2020

Qard commented Nov 14, 2020

Flarna commented Nov 14, 2020

Qard commented Nov 14, 2020

mmarchini commented Nov 25, 2020

Qard commented Nov 25, 2020

RafaelGSS commented Nov 25, 2020

Qard commented Dec 15, 2020

Flarna commented Dec 18, 2020

Qard commented Dec 18, 2020

Flarna commented Dec 18, 2020

Qard commented Dec 18, 2020

simon-id commented Jul 19, 2021

lib: add span API to diagnostics_channel #35534

lib: add span API to diagnostics_channel #35534

Conversation

Qard commented Oct 7, 2020

Checklist

nodejs-github-bot commented Oct 7, 2020

Qard commented Oct 31, 2020

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

nodejs-github-bot commented Nov 9, 2020

Flarna commented Nov 12, 2020

Qard commented Nov 12, 2020

Flarna commented Nov 13, 2020

Qard commented Nov 13, 2020

Flarna commented Nov 13, 2020

Qard commented Nov 14, 2020

Flarna commented Nov 14, 2020

Qard commented Nov 14, 2020

mmarchini commented Nov 25, 2020

Qard commented Nov 25, 2020

RafaelGSS commented Nov 25, 2020

Qard commented Dec 15, 2020

Flarna commented Dec 18, 2020

Qard commented Dec 18, 2020

Flarna commented Dec 18, 2020

Qard commented Dec 18, 2020

simon-id commented Jul 19, 2021