Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thousand Island instrumentation #136

Closed
bryannaegele opened this issue Nov 25, 2022 · 17 comments · Fixed by #357
Closed

Thousand Island instrumentation #136

bryannaegele opened this issue Nov 25, 2022 · 17 comments · Fixed by #357
Labels
enhancement New feature or request

Comments

@bryannaegele
Copy link
Collaborator

With Bandit now having full support for being a drop-in cowboy replacement for Phoenix, we'll need an instrumentation library for Thousand Island from its telemetry.

@bryannaegele bryannaegele added the enhancement New feature or request label Nov 25, 2022
@mtrudel
Copy link

mtrudel commented Dec 6, 2022

Note that thousand island's telemetry model is being overhauled to better accommodate exactly this scenario. See provisional work at mtrudel/thousand_island#27 (which itself is largely blocked by knowing what the golden path is for exactly this integration).

@tsloughter
Copy link
Member

I'm not sure how instrumentation of the socket level can actually be correlated with the application level.

Any span created in ThousandIsland will be ignored when the HTTP request's propagated context is extracted.

We'll want to create metrics from these events, of course, but I don't think they can be used for spans.

@mtrudel
Copy link

mtrudel commented Dec 6, 2022

My really early take on this at the HTTP layer is up at https://github.com/mtrudel/bandit/tree/telemetry; generally I was planning to

  1. Have Bandit reach into the underlying ThousandIsland.Socket to get the connection span to use as a parent:
  2. Have Bandit create spans / events as appropriate for each HTTP request & the various phases of a request (header & body reading, response writing, etc).
  3. Put the relevant span into the Plug.Conn struct for a request (likely as a private) and somehow get phoenix to nest its spans within this if present.

Truthfully, I'm not at all confident I'm not overbuilding this. The time lookups alone are pretty expensive to be doing so many of them. Happy to take guidance / advice from folks more versed in telemetry than I.

@tsloughter
Copy link
Member

Have Bandit reach into the underlying ThousandIsland.Socket to get the connection span to use as a parent:

Yea, that is an issue since Bandit, or Plug/Phoenix, needs to create a span from a parent that is received in request headers. It'll have its own trace id, so can't have a parent/child relationship with anything from ThousandIsland.

The only option I know of for such a relationship would be that Bandit could add the ThousandIsland span as a linked span if it is what starts the span from the propagated headers.

@mtrudel
Copy link

mtrudel commented Dec 8, 2022

Interesting! Alright, so if I understand correctly:

  • Since all the trade id header work is in open telemetry's Phoenix support, there's nothing for Bandit / Thousand Island to do in order for phoenix's telemetry support to 'just work', including support for passed in parent traces

  • Thousand Island / Bandit's telemetry / span hierarchy is entirely distinct from that within Phoenix, except for the possibility of linking them in more of an informational manner than a hierarchical manner, so whatever decisions are made here are largely separate from anything currently in opentelemetry's Phoenix support

If that's the case, then I think the only open questions that remain for me are those at mtrudel/thousand_island#27 (comment). Pending clarity on them, I should be able to land sensible telemetry support in Bandit / TI in short order, which would unblock this. In light of that, do you have any advice for those questions, specifically how I can make your life here as easy as possible?

@bryannaegele
Copy link
Collaborator Author

bryannaegele commented Dec 8, 2022

Hey, Mat. If you have a look at how cowboy/phoenix telemetry/otel integration works, we need something that kicks off and initiates at the socket level and then end at that same level so we get accurate total times. For cowboy + phoenix (unreleased) we're just continuing to add info the a span started in otel cowboy. The same would be true with bandit/thousand island.

I only briefly looked at what was available telemetry wise several weeks ago and only saw thousand island events which is why I started this issue. If it belongs in bandit, totally cool. The key thing we'd all want here is to start and stop a span as closely to or in the socket request handler as possible. For cowboy, there's a connection handler pool where the span gets started and then we're reaching into that process to get pull that trace context into the phoenix process and just extend it.

Basically need the equivalent of otel cowboy for thousand island/bandit

@tsloughter
Copy link
Member

Important to note it is started somewhere that has the headers, https://github.com/open-telemetry/opentelemetry-erlang-contrib/blob/main/instrumentation/opentelemetry_cowboy/src/opentelemetry_cowboy.erl#L32 which is why i assume it needs to be in Bandit.

@SteffenDE
Copy link

I‘m currently in the process of moving from cowboy to bandit for my phoenix setup. Is there anything that’s blocking this?
https://github.com/mtrudel/bandit/blob/main/lib/bandit/telemetry.ex seems like it should include the necessary events.

@mtrudel
Copy link

mtrudel commented Jan 29, 2024

Not on my side. Bandit's telemetry payloads have been solid since 1.0. There are gonna be a few minor changes to metadata soon but nothing substantial

@bryannaegele
Copy link
Collaborator Author

There are some developments here but I have not been able to get a response. I'd like to get things moved forward and build on the work they did since they've used the namespace now rather than having to start from scratch to have a more complete lib.

@bryannaegele
Copy link
Collaborator Author

bryannaegele commented Mar 13, 2024

@mtrudel when looking through the bandit code and the instrumenter here that was written, there's a critical change that we need to get into bandit. Currently, the start event on a request happens prior to the request being read. The full request is not available until the end which makes it too late to extract trace propagation headers and make sampling decisions which must happen at the start of the trace, leaving incomplete context for the rest of the request.

There are a couple of things that we need to change to get parity with the cowboy instrumentation.

  1. The :start telemetry event must happen after the request is read and must provide the entire request as metadata.
  2. You can assign the start time of the request to a variable where the start event currently is and add that to the metadata.
  3. If you want to keep an event firing there, I'd suggest adding an event with :init instead of start.

@mtrudel
Copy link

mtrudel commented Mar 13, 2024

That should be doable! Any other metadata you want in the start event?

@bryannaegele
Copy link
Collaborator Author

I think that should be all that's needed. I feel like that object looked pretty thorough when I perused it.

As a reference point, cowboy shows the bare minimum we'd like to see. https://github.com/open-telemetry/opentelemetry-erlang-contrib/blob/main/instrumentation/opentelemetry_cowboy/src/opentelemetry_cowboy.erl#L31-L50

@mtrudel
Copy link

mtrudel commented Mar 14, 2024

This is actually proving to be a bit more of a plumbing change than I was expecting. I'm planning on landing the refactors in mtrudel/bandit#297 et al during my company hack week later on this month; I'll make sure these changes get landed as part of that work

@bryannaegele
Copy link
Collaborator Author

Thanks for the update. I had seen you started that PR and the refactor looked like it was going to make it more difficult than just sliding down to here

@mtrudel
Copy link

mtrudel commented Mar 14, 2024

The main problem is that the conn struct is what I want to expose; I quite specifically don't want to expose any of Bandit's internal req structures in the metadata. But... the conn only gets built inside the Pipeline module so I can't easily get to it until / unless I tear apart Pipeline, which is part of what I'll be doing in 297 et al.

By the end of that work there's going to be a single consolidated pipeline for HTTP request/responses, and HTTP1/2 are going to be thin behaviour implementations that only know about the lowest levels of transport. All the shared HTTP semantics (including telemetry!) will be shared between them, so there will only be one place to change this sort of thing.

Expect an update early in the week of Mar 25

@mtrudel
Copy link

mtrudel commented Mar 27, 2024

Bandit 1.4.1 just went out the door with support for conns in the metadata of HTTP start events. I think that's all you need from me to get this rolling?

Let me know if anything else is needed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants