Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lib: add span API to diagnostics_channel #35534

Closed
wants to merge 1 commit into from

Conversation

Qard
Copy link
Member

@Qard Qard commented Oct 7, 2020

I've split the span api out of #34895 as the exact design seems to still be a bit controversial, so I'd like to iterate on it here to allow the core of diagnostics_channel to land separately.

Depends on #34895

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • documentation is changed or added
  • commit message follows commit guidelines

@Qard Qard added lib / src Issues and PRs related to general changes in the lib or src directory. blocked PRs that are blocked by other issues or PRs. diag-agenda Issues and PRs to discuss during the meetings of the diagnostics working group. experimental Issues and PRs related to experimental features. labels Oct 7, 2020
@Qard Qard requested a review from a team as a code owner October 7, 2020 03:36
@nodejs-github-bot
Copy link
Collaborator

Review requested:

  • @nodejs/http
  • @nodejs/net

@Qard Qard removed the request for review from a team October 7, 2020 03:36
@Qard Qard force-pushed the diagnostics-channel-span-api branch from 254bf26 to 07305ce Compare October 7, 2020 03:41
@Qard Qard changed the title Diagnostics channel span api lib: add span API to diagnostics_channel Oct 7, 2020
@Qard Qard mentioned this pull request Oct 7, 2020
4 tasks
@Qard Qard force-pushed the diagnostics-channel-span-api branch from 07305ce to d3bea0b Compare October 31, 2020 22:43
@Qard
Copy link
Member Author

Qard commented Oct 31, 2020

cc @nodejs/diagnostics Now that diagnostics_channel has landed, I'd like to get a fresh look at this to figure out how we can evolve the spans concept to make it less controversial. The general intent is to somehow provide a channel-unique correlation identifier to indicate that a set of events are related, along with some method of marking the start and end of a sequence of correlated events. I'm not tied to the current API design, it is just one possibility.

@Qard Qard force-pushed the diagnostics-channel-span-api branch from d3bea0b to de91871 Compare November 1, 2020 04:45
@Qard Qard added the request-ci Add this label to start a Jenkins CI on a PR. label Nov 1, 2020
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Nov 1, 2020
@nodejs-github-bot

This comment has been minimized.

@Qard Qard added the request-ci Add this label to start a Jenkins CI on a PR. label Nov 1, 2020
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Nov 1, 2020
@nodejs-github-bot

This comment has been minimized.

@Qard Qard added the request-ci Add this label to start a Jenkins CI on a PR. label Nov 9, 2020
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Nov 9, 2020
@nodejs-github-bot

This comment has been minimized.

@Qard Qard force-pushed the diagnostics-channel-span-api branch from de91871 to f2d16bb Compare November 9, 2020 19:41
@Qard Qard added the request-ci Add this label to start a Jenkins CI on a PR. label Nov 9, 2020
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Nov 9, 2020
@nodejs-github-bot
Copy link
Collaborator

@Flarna
Copy link
Member

Flarna commented Nov 12, 2020

In general I agree that something to correlate spans/transactions is helpful for diagnostics data.
But I'm not sure if we should push this into core as publishers can do something similar already now by adding some id into the data (which needs to be documented anyway by the publishers).

There are some drawbacks in optinion:

  • It's an optional API so the decission if something is published as span is still in the responsibility of the publisher and needs doc there
  • It disallows the publisher to select/reuse an alreay existing correlation ID
  • publisher can't access the correlation ID generated by core so they can't use it at other places to e.g. link spans
  • Core has to wrap span messages which is an unneeded overhead
  • A consumer using Span.aggregate gets the data twice (once aggregated and once directly via the channel callback) therefore filtering duplicates is needed in each consumre which is unneeded overhead
  • Span.aggregate leaks if publisher never ends a span. Clearly this is a publisher fault but the memory is held by core. Publishers don't see this leak until a consumer uses Span.aggregate

@Qard
Copy link
Member Author

Qard commented Nov 12, 2020

There are some drawbacks in optinion:

* It's an optional API so the decission if something is published as span is still in the responsibility of the publisher and needs doc there

The included docs already suggest when a span would make sense. Do you think there's something more needed than what I've already included?

* It disallows the publisher to select/reuse an alreay existing correlation ID

* publisher can't access the correlation ID generated by core so they can't use it at other places to e.g. link spans

Reuse of a correlation id is achieved by reusing the span object. If you want to use the id for something like linking one span to another, you can just get the id from span.id and do whatever you want with that data. Also, I don't consider span linking the responsibility of this API. It's up to the user to keep track of the "current" span through AsyncLocalStorage and apply linking themselves if they so desire. This API is meant to impose the least opinions possible on how the data is structured, providing little more than a channel-unique id only intended to be used to match events together. It's the responsibility of the user to create their own more meaningful ids for their own system, to build any span linkage, and to pull whatever data they need out of the span messages.

* Core has to wrap span messages which is an unneeded overhead

The overhead is extremely minimal. All it does it create an object with a numeric id, a name to identify the event type, and the message data itself. Do you have any suggestions for a simpler way to achieve the id and sequencing needs here?

* A consumer using `Span.aggregate` gets the data twice (once aggregated and once directly via the channel callback) therefore filtering duplicates is needed in each consumre which is unneeded overhead

That's assuming a channel would be used for multiple types of message data, which is not intended. Each channel should have a single purpose. I would strongly discourage sending spans on the same channel as non-span events. If it's something that doesn't fit a span.annotate(...) it should be on a different channel.

* `Span.aggregate` leaks if publisher never ends a span. Clearly this is a publisher fault but the memory is held by core. Publishers don't see this leak until a consumer uses `Span.aggregate`

Yep, I need to figure out a better way to handle that. That specific API is intended more just as a helper and I'm fine omitting that until we can figure out a good solution.

@Trott Trott requested a review from bengl November 13, 2020 00:17
@Flarna
Copy link
Member

Flarna commented Nov 13, 2020

Regarding docs: I meant publishers have to document what data they publish. So they could also include some correlationId in their data without dedicated support in core.
Once a publisher specifies that something is a span it can specify where/how the the messages can be correlated in a channel specific way, not via the core generated numeric id.

I have overlooked span.id. I agree that this can be used by publishers. I agree that context passing should not be part of this API. But some publishers may have usecases to "group"/link spans. It's not the task of core to do this and AsyncLocalStore may be used by consumers to correlate but I can imagine that there are "loose coupled" spans where publisher gives hints via related span-ids.

If you recommend to not mix spans and single messages on a channel I would recommend that API should reflect this. Maybe we should add a SpanChannel class which enforces spans to avoid the double publishing and checks like if (message instanceof SpanMessage || message.type === 'span') in onMessage instead we could check once in aggregate that the passed channel is a SpanChannel.

@Qard
Copy link
Member Author

Qard commented Nov 13, 2020

I'm not really sure there's a good way to add a new channel type parallel to the existing one though, due to how the publisher and subscriber sides basically form a race to construct the named channel instance. The subscriber-side API would all need to change to make sure it's constructing the appropriate type when not already found, which would probably also require duplicating most if not all of the other module internals...

@Flarna
Copy link
Member

Flarna commented Nov 13, 2020

Another limitation of aggregate is that timings are lost for consumers.

@Qard
Copy link
Member Author

Qard commented Nov 14, 2020

The timings are what the map function was intended for, though I'm not convinced that's a great solution. 😕

@Flarna
Copy link
Member

Flarna commented Nov 14, 2020

It's not just storing the timing. consider e.g. OpenTelemetry as consumer - they will call tracer.startSpan() and maybe propagte the new span via ALS. Getting only complete spans - without correlation between spans - will be not that useful for such usecases.

@Qard
Copy link
Member Author

Qard commented Nov 14, 2020

You can do the correlation in the map function too. Also, Span.aggregate is intended only as a helper to simplify the process of reducing a sequence of events to a single object. If the user prefers, they can just do that themselves and augment it with whichever extra behaviour they feel they need. 🤷

@mmarchini
Copy link
Contributor

Haven't read the code yet, only the docs and discussion on the PR. So far I agree with @Flarna though, this can easily be achieved with the current API today by producers. Furthermore, general consensus in the diagnostics meeting today was that we need to start seeing adoption (especially by producers) of the existing API before adding new things (unless adding those new things is essential for adoption). @Qard do you know any frameworks using the current API, or with plans to use it in the near future?

@mmarchini mmarchini removed the diag-agenda Issues and PRs to discuss during the meetings of the diagnostics working group. label Nov 25, 2020
@Qard
Copy link
Member Author

Qard commented Nov 25, 2020

I'm working on adding it to express, and I believe @RafaelGSS was working on making a fastify pugin. I would say that the spans API is somewhat essential to adoption though. There's definitely some value to what's already there, but there are certain common patterns such as expressing a span which would greatly benefit from a unified/standard way to express it, otherwise the ecosystem will gradually get fragmented. It would not be ideal if an APM wanted to instrument ten different database libraries and needed ten different span representation systems to handle all of them. 😬

@RafaelGSS
Copy link
Member

I believe @RafaelGSS was working on making a fastify pugin

Yes, here fastify/fastify#2697 is the tracking issue. I haven't read the code though but seems reasonable to have a unique way to deal with Spans.

@Qard
Copy link
Member Author

Qard commented Dec 15, 2020

cc @nodejs/diagnostics I'm still seeking further feedback on this. I want to figure out some way to move this forward.

@Flarna
Copy link
Member

Flarna commented Dec 18, 2020

Sorry, this got out of my focus...

What about publishing a ConsumerSpan once channel.span() is called (or on first annotated). The consumer registers a onMessage handler on this span and gets the annotate and end data synchronous.

  • user can decide to remove it's reference to this span to avoid memory leaks if publisher "forgets" about end
  • events are published only once even if the same channel has spans and normal messages
  • user can decide if data is aggregated before export, directly forwarded, after adding timestamps,...
  • publish timing/async context is available to consumer
  • no map lookup/instanceof checks needed

We could add a basic implementation of the span onMessage function which does aggregation like the current proposed static aggregate method.

@Qard
Copy link
Member Author

Qard commented Dec 18, 2020

Seems like all your arguments are against the Span.aggregate function specifically, and nothing really against any of the other parts of the API, so not sure why we would need an entirely different API. As already explained, the Span.aggregate largely exists as an example of how correlated events could be aggregated, but a consumer could just as easily do their own aggregation. I'm happy to just remove the Span.aggregate function rather than trying to solve the possible map leak, if that satisfies you.

@Flarna
Copy link
Member

Flarna commented Dec 18, 2020

Yes, it's the consumer side I don't like that much. But I have to admit that I haven't tried yet to use your proposal in a bigger sample.

By proposing a Span object on consumer side I have a symmetry with publisher in mind. A published span arrives as span at consumer side, piece by piece.

I don't propose a completly different API. The publisher side is most likely uneffected. The consumer gets a SpanMessage with aggregation functionality.

Most likely it would be best that I just write down in code what I mean. Will try but I can't promise a date.

@Qard
Copy link
Member Author

Qard commented Dec 18, 2020

I'd be happy to get on a call sometime too, if you think that would help. We're getting into the holidays now though, so I don't expect to see a lot of movement on this until after the holidays are over. 😅

@simon-id
Copy link
Contributor

To me, it looks like Span.aggregate is just an example of what could be achieved with this Span API. Not sure it belongs in the API directly though, it could just be a written example in the documentation.

Otherwise I think this Span API is a good idea, to at least try to give a general direction for the de facto (non-)standardization process that will happen in diagnostics_channel adoption. DC will still be usable without it, but at least, the use-cases that share this common "span" behavior will have a paved road to work with (and that also helps with adoption as it's easier to implement for producers AND consumers in the ecosystem).

@Qard Qard closed this Aug 11, 2021
@Mesteery Mesteery added the diagnostics_channel Issues and PRs related to diagnostics channel label Oct 13, 2021
@Qard Qard deleted the diagnostics-channel-span-api branch June 18, 2022 20:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocked PRs that are blocked by other issues or PRs. diagnostics_channel Issues and PRs related to diagnostics channel experimental Issues and PRs related to experimental features. lib / src Issues and PRs related to general changes in the lib or src directory.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants