Thanks @AbhiPrasad!
I left a few minor comments.
> ## Approach
>
> TODO: Talk about the approach we are using, based on Matt's hackweek project - https://github.com/getsentry/sentry-ruby/pull/1876
It would be good if @sl0thentr0py and @mjq-sentry could chime in with a few words about the approach @mjq-sentry took. Can you guys do this?
> ## Transaction Protocol
>
> There is no concept of a transaction within OpenTelemetry, so we rely on promoting spans to become transactions. The span `description` becomes the transaction `name`, and the span `op` becomes the transaction `op`. Therefore, OpenTelemetry spans must be mapped to Sentry spans before they can be promoted to become a transaction.
When talking about this, I would make it clearer which spans are OTel spans and which spans are Sentry spans.
I can kinda figure it out, but it feels like I am guessing.
+1 - maybe an illustration. I can help if needed.
I'll work on clarifying the wording, let me see what I can do about the illustration.
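In the meantime, here's a minimal sketch of the two-step mapping that might help make the OTel-vs-Sentry distinction concrete. All Sentry-side type and field names are hypothetical; only `ReadableSpan` is real OTel JS API:

```typescript
import type { ReadableSpan } from "@opentelemetry/sdk-trace-base";

// Hypothetical Sentry-side shapes, purely for illustration.
interface SentrySpan {
  description?: string; // <- OTel span name
  op?: string;          // <- would be derived from OTel span attributes
  traceId: string;
  spanId: string;
}

interface SentryTransaction extends SentrySpan {
  name: string; // <- the promoted span's description
}

// Step 1: every OTel span is first mapped to a Sentry span.
function toSentrySpan(otelSpan: ReadableSpan): SentrySpan {
  const ctx = otelSpan.spanContext();
  return {
    description: otelSpan.name,
    traceId: ctx.traceId,
    spanId: ctx.spanId,
  };
}

// Step 2: a root OTel span (one with no parent) is promoted to a
// transaction; its Sentry `description` becomes the transaction `name`.
function promote(otelSpan: ReadableSpan): SentryTransaction {
  const span = toSentrySpan(otelSpan);
  return { ...span, name: span.description ?? "<unknown>" };
}
```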
> Aside from information from Spans and Transactions, OpenTelemetry has meta-level information about the SDK, resource, and service that generated spans. To track this information, we generate a new OpenTelemetry Event Context.
>
> The existence of this context on an event (transaction or error) is how the Sentry backend will know that the incoming event is an OpenTelemetry event.
You could link to the event payload here and move the OTel context to the event payload page.
+1
I'm actually going to leave this in here so we don't confuse others reading the develop docs (new hires, other teams, etc.). Once we are comfortable with this new OTel context schema, let's move it out into the main event payload page.
> ```
> }
> ```
>
> The reason `sdk` and `service` are split is so they can be indexed as top-level fields in the future for easier usage within Sentry.
It would be good to capture how we are planning to fill in the `service` fields.
This is done by grabbing them off the OpenTelemetry resource, but I can capture this!
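For example (a sketch; the attribute keys come from the OTel resource semantic conventions, while the context shape itself is still under discussion):

```typescript
import type { Attributes } from "@opentelemetry/api";

// Sketch: fill the proposed `sdk` and `service` sections straight from
// OTel resource attributes (keys per the OTel resource semantic conventions).
function otelContextFromResource(resourceAttrs: Attributes) {
  return {
    sdk: {
      name: resourceAttrs["telemetry.sdk.name"],
      version: resourceAttrs["telemetry.sdk.version"],
      language: resourceAttrs["telemetry.sdk.language"],
    },
    service: {
      name: resourceAttrs["service.name"],
      namespace: resourceAttrs["service.namespace"],
      version: resourceAttrs["service.version"],
    },
  };
}
```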
Can you outline a scenario that would cause two ongoing traces within one execution?
Such operations often cause memory leaks, don't they?
Yes to avoid redundancies by centralizing on Relay if needed. Generally, the copy is excellent but sometimes reads a bit like a dev spec to me. Are the docs really the right place for all of it?
> When Sentry performance monitoring was initially introduced, OpenTelemetry was in its early stages. This led us to adopt a slightly different model from OpenTelemetry; notably, we have a concept of transactions that OpenTelemetry does not have. We've described this, and some more historical background, in our <Link to="/sdk/research/performance/">performance monitoring research document</Link>.
>
> TODO: Add history about OpenTelemetry Sentry Exporter: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/sentryexporter
Really needed in the docs? I would leave that to a README.
> ## Background
>
> When Sentry performance monitoring was initially introduced, OpenTelemetry was in its early stages. This led us to adopt a slightly different model from OpenTelemetry; notably, we have a concept of transactions that OpenTelemetry does not have. We've described this, and some more historical background, in our <Link to="/sdk/research/performance/">performance monitoring research document</Link>.
I would just mention that Sentry and OpenTelemetry have conceptual differences. This is normal and true for each vendor with an SDK that predates OTel. It doesn't matter why it is like that today. It's more important to a reader what the differences are.
> ## Approach
>
> TODO: Talk about the approach we are using, based on Matt's hackweek project - https://github.com/getsentry/sentry-ruby/pull/1876
Does this really need to go into the docs?
Should this document our way to a solution or the solution?
It should document the solution, I'll work with the team to get this in there.
> ```
> };
>
> // https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/semantic_conventions/README.md#telemetry-sdk
> sdk?: {
> ```
Maybe call this "otel_sdk" (or something similar) so we do not confuse it with Sentry SDK versions, making everything just more explicit?
I think this spec has some potentially big implications for storage so I would like somebody from S&S to review this as well.
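For reviewers skimming this thread, here is roughly the shape under discussion (a sketch; field names are assumed from the telemetry SDK and service resource conventions, nothing here is final):

```typescript
// Sketch of the proposed otel event context; not a final schema.
interface OpenTelemetryContext {
  // Any remaining OTel resource attributes not split out below.
  resource?: Record<string, unknown>;
  // https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/semantic_conventions/README.md#telemetry-sdk
  sdk?: {
    name?: string;     // e.g. "opentelemetry"
    version?: string;
    language?: string; // e.g. "ruby"
  };
  // Split out so it can be indexed as a top-level field later.
  service?: {
    name?: string;
    namespace?: string;
    version?: string;
  };
}
```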
> ## Transaction Protocol
>
> There is no concept of a transaction within OpenTelemetry, so we rely on promoting spans to become transactions. The span `description` becomes the transaction `name`, and the span `op` becomes the transaction `op`. Therefore, OpenTelemetry spans must be mapped to Sentry spans before they can be promoted to become a transaction.
> The span `description` becomes the transaction `name`

The span description is in some cases extremely high cardinality. Are we going to introduce more transaction name sources to mitigate this?
> In Sentry, we have two options for how to treat span events. First, we can add them as breadcrumbs to the transaction the span belongs to. Second, we can create an artificial "point-in-time" span (a span with 0 duration), and add it to the span tree. TODO on what approach we take here.
>
> In the special case that the span event is an exception span, [where the `name` of the span event is `exception`](https://opentelemetry.io/docs/reference/specification/trace/semantic_conventions/exceptions/), we also have the possibility of generating a Sentry error from an exception. In this case, we can create this [exception based on the attributes of an event](https://opentelemetry.io/docs/reference/specification/trace/semantic_conventions/exceptions/#attributes), which include the error message and stacktrace. This exception can also inherit all other attributes of the span event + span as tags on the event.
Which and how many attributes do we expect there? If there are a ton of default attributes attached to an exception span, can we define which of them actually end up in event tags?
There is a storage concern about adding too many default tags to the event payload, and a product one as well, considering user-defined tags and tags inherited from OTel count against the same size limit.
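To make the minimal end of the spectrum concrete, here's a sketch that maps only the three `exception.*` attributes from the semantic conventions and leaves everything else out of tags (the Sentry-side type is hypothetical):

```typescript
import type { Attributes } from "@opentelemetry/api";

// Hypothetical Sentry-side shape, for illustration.
interface SentryException {
  type?: string;
  value?: string;
  stacktrace?: string; // kept as the raw string the OTel SDK recorded
}

// Map only the three exception.* attributes from the OTel semantic
// conventions; all remaining attributes would be subject to whatever
// tag limits we settle on.
function exceptionFromSpanEvent(eventAttrs: Attributes): SentryException {
  return {
    type: eventAttrs["exception.type"] as string | undefined,
    value: eventAttrs["exception.message"] as string | undefined,
    stacktrace: eventAttrs["exception.stacktrace"] as string | undefined,
  };
}
```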
If I get this correctly, in this system, a sequence of spans at the root level of a trace would create one new transaction per span.
Since this could have very large cross-system implications, have you considered proposing an RFC?
@fpacifici @untitaker Thanks for taking a look! I think we need to do some more testing before we can better understand storage and ingestion ramifications. For example, theoretically this change should send transactions at the same rate for both an OpenTelemetry and Sentry SDK (otel sdk behaves === sentry sdk), but until we test various scenarios we cannot confirm this. The primary purpose of this document was to outline a high-level approach so we could start experimenting with the SDKs. Once we get to a comfortable spot and have an SDK prototype, we will open a more formal RFC so we can discuss infrastructure ramifications. There we can also bring up changes we'll need to make to Relay/ingestion to support this.
What if SDK changes are needed? Are those then still possible? If the answer is no, this is IMO not a sustainable approach to prototyping. For example I see a 50:50 chance that we cannot send and store this data as tags because there are too many. Would your proposal be to later detect those payloads in Relay and selectively move tags elsewhere? I'm sure that's convenient for SDK authors but painful to maintain for ingest.
Yet, if we need to do a translation of values and tags, it would be redundant to do that in the individual SDKs. Instead, I would rather - if possible - opt for a dedicated mapping component that does this. This component could maybe also be maintained by the SDK team.
I don't have an issue with putting more things in Relay. I mainly take issue with the idea that questions concerning infrastructure and software architecture are entirely left unanswered until we have shipped a prototype potentially already used by customers. I don't have a clear suggestion for what we should do (because we don't have an answer to those questions), and I suspect the concern about tags is not the biggest concern with this proposal anyway. If our answer is that SDK changes are still possible after prototyping (incl. breaking changes), that's sufficient to me. Regarding tags specifically: I think this is going too much into detail, but some concerns with tags off the top of my head:
Then there are combinations of those problems. For example, let's say the UI is built to interpret
You can fix all of those problems by making Relay convert tags into some other structure in the payload (which is something we currently don't do). There are two downsides to that:
One other option would be to dump OTel attributes into a new custom context, and then let Relay figure out which of those attributes can be indexed and which limits to apply. More time would be needed (and an answer to the above questions) to determine whether that's a good idea. Or, again, you're saying upfront it's fine to change the schema in breaking ways post-prototyping.
I think what you're suggesting here is that we should maintain two separate schemas: an ingestion schema maintained by SDKs, and a storage schema maintained by infra teams. I think that idea has potential, but it's not where we're at right now, and it would break some assumptions both customers (and tooling built by customers) and our internals have. One needs to be a superset of the other; I don't think it's feasible to have them be maintained by two separate teams.
We won't release a prototype to customers, @untitaker.
That's fine by me. It was just an idea, but if we aren't there yet, it's just that.
So for now, we are going to go with this approach - and then we can evaluate what to do as we see data coming in. So nothing stored in tags for now for transactions. I'm going to revamp this based on some convos we had, and merge it in as a starting point. The next step we need to take here is to document the SDK API so that we have a baseline to work off of. Once we're comfortable with that, we can start RFCs / wider convos re: logic in Relay and product implications.
Sounds good, thanks @AbhiPrasad!
https://develop-git-abhi-otel.sentry.dev/sdk/performance/opentelemetry/
This PR outlines an initial spec for OpenTelemetry usage in Sentry. It focuses on outlining the Span protocol (how to map from an OTel span to a Sentry span), as well as mapping out some high-level sections for us to work on here.
Open Questions
There are still some open questions we need to answer together at this stage.
If there are two traces ongoing, do we create two transactions? If we do, which one becomes the active transaction? Do we store state about all ongoing transactions?
My opinion: We track all transactions in a set, and remove them when they finish. So instead of assigning to the active transaction on the scope, we assign to the transactions the span processor is aware of. Pseudo-algo:
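A sketch of that pseudo-algo in TypeScript (the transaction API names here are made up; only `ReadableSpan` is real OTel JS API):

```typescript
import type { ReadableSpan } from "@opentelemetry/sdk-trace-base";

// Hypothetical transaction handle; `startTransaction`, `addChildSpan`,
// and `finish` stand in for whatever the SDK API ends up being.
interface OngoingTransaction {
  addChildSpan(span: ReadableSpan): void;
  finish(): void;
}
declare function startTransaction(root: ReadableSpan): OngoingTransaction;

// All in-flight transactions, keyed by trace id, instead of a single
// "active transaction" on the scope.
const ongoing = new Map<string, OngoingTransaction>();

function onSpanStart(span: ReadableSpan): void {
  const { traceId } = span.spanContext();
  if (!span.parentSpanId) {
    // Root span: start a transaction for this trace.
    ongoing.set(traceId, startTransaction(span));
  } else {
    // Child span: attach to whichever transaction owns this trace.
    ongoing.get(traceId)?.addChildSpan(span);
  }
}

function onSpanEnd(span: ReadableSpan): void {
  if (!span.parentSpanId) {
    const { traceId } = span.spanContext();
    // Finish and *remove* the entry so completed traces don't leak.
    ongoing.get(traceId)?.finish();
    ongoing.delete(traceId);
  }
}
```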
OpenTelemetry has a list of semantic conventions that identify the type of span based on span attributes, but this is annoying to keep duplicating in every single SDK. The easiest solution here is just to push all of that work onto Relay.
So, with dynamic sampling this is also not a problem: we can just use the OpenTelemetry Span name as the transaction name, and everything should be consistent.
We did some of this mapping work in the Sentry Exporter. For the purposes of the MVP in Ruby/Python, we can probably do this mapping in the SDK, though.
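As a rough illustration of what that SDK-side mapping could look like (attribute keys are from the OTel semantic conventions; the resulting `op` values are just examples, not a decided taxonomy):

```typescript
import { SpanKind } from "@opentelemetry/api";
import type { Attributes } from "@opentelemetry/api";

// Sketch: derive a Sentry span `op` from OTel span kind + semantic
// convention attributes, instead of duplicating the full convention
// tables in every SDK.
function spanOpFromAttributes(kind: SpanKind, attrs: Attributes): string {
  if (attrs["http.method"]) {
    return kind === SpanKind.SERVER ? "http.server" : "http.client";
  }
  if (attrs["db.system"]) {
    return "db";
  }
  if (attrs["messaging.system"]) {
    return "message";
  }
  return "default";
}
```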
Final Thoughts
My goal here is to get this merged in after the protocol gets approved. We can then split up the work around each of the different sections and iterate on this together.