-
Notifications
You must be signed in to change notification settings - Fork 359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Distributed tracing support #5078
base: main
Are you sure you want to change the base?
Conversation
src/Microsoft.Diagnostics.Monitoring.EventPipe/Configuration/ActivitySourceConfiguration.cs
Outdated
Show resolved
Hide resolved
_ActivitySourceNames = activitySourceNames?.ToArray() ?? Array.Empty<string>(); | ||
|
||
if (_SamplingRatio < 1D) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should be doing something this heavyweight and async in the ctor. If we want to validate this only works in version 9, we can fail the pipeline later or have the validation done upfront in a helper async method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved it to a spot that felt more natural. Check it out, LMK.
src/Microsoft.Diagnostics.Monitoring.EventPipe/Configuration/ActivitySourceConfiguration.cs
Show resolved
Hide resolved
|
||
namespace Microsoft.Diagnostics.Monitoring.EventPipe | ||
{ | ||
internal readonly struct ActivityData |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very large struct object. Perhaps a class or record is more suitable? I have not looked at usage yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a super big struct! But that doesn't concern me. It is handled correctly, passed by reference (out
or in
in our cases). What this code is trying to do is not create unneeded GC pressure in the monitoring app. It doesn't have to do that, but that was my thinking 😄
internal static partial class TraceEventExtensions | ||
{ | ||
[ThreadStatic] | ||
private static KeyValuePair<string, object?>[]? s_TagStorage; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be string/string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be. Today everything pumped through DiagnosticSourceEventSource
is in fact a string. But I went this direction because that is sort of an issue. When it comes to OTel distributed tracing, there are semantic conventions. Something like http.response.status_code
should be an int to be compliant with the spec. I went with object
so it could be fixed in the future if we add some typing info to the events or perhaps some conversion in the pipeline 🤷
status, | ||
statusDescription); | ||
|
||
payload.Tags = new(s_TagStorage, 0, tagCount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I am understanding correctly, the tags are cumulative? So payload n contains the tags from 0 to n? Perhaps it would be easier to just associate the tags with the event and then aggregate them as needed. The other consideration here is multiple sessions. If you start reading distributed tracing, fail and then start again the tag list will never reset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no aggregation of tags. Each span/Activity will just have whatever tags were added to it. What is going on here is I have some [ThreadStatic]
storage for those tags. They are read off the EventSource and added to that storage. Then we hand out a ReadOnlySpan
to that storage. So the consumer can get them, but can't move it off the stack. Same kind of idea as the big ActivityData
struct. The goal here is to avoid having to allocate a list or array to store the tags for each span/Activity we receive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds reasonable. I guess my only concern would be that this is unbounded but in practice I can't imagine that there are so many different tags in metrics that this would begin to be an issue.
|
||
namespace Microsoft.Diagnostics.Monitoring.EventPipe | ||
{ | ||
internal class TracesPipeline : EventSourcePipeline<TracesPipelineSettings> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think DistributedTracePipeline would work here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed it "DistributedTracesPipeline". Plural "Traces" because it felt better and we had plural in "Logs" pipeline. But happy to drop the "s" if you feel strongly about it.
src/Microsoft.Diagnostics.Monitoring.EventPipe/Traces/TraceEventExtensions.cs
Outdated
Show resolved
Hide resolved
|
||
if (!s_Sources.TryGetValue(sourceName, out source)) | ||
{ | ||
source = new(sourceName, sourceVersion); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what happens if there are two events that have the same source name but different versions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll just report the version of the first one we saw. It is kind of an edge case thing. I doubt multiple sources with different versions really happens in practice (it could) but even if it did I don't think there would be big impact. We could make the key a tuple if you are concerned. Or we could do it later if anyone raises an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can great a bug/issue to track for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few suggestions/comments inline but for the most part if the dotnet-monitor crew is happy then I'm happy.
eventSource.Dynamic.All += traceEvent => { | ||
try | ||
{ | ||
if ("Version".Equals(traceEvent.EventName)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to check the provider name as well so you don't get confused if more than one provider has a Version event.
src/Microsoft.Diagnostics.Monitoring.EventPipe/DistributedTraces/DistributedTracesPipeline.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Diagnostics.Monitoring.EventPipe/EventSourcePipeline.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Diagnostics.Monitoring.EventPipe/DistributedTraces/TraceEventExtensions.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Diagnostics.Monitoring.EventPipe/DistributedTraces/TraceEventExtensions.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Diagnostics.Monitoring.EventPipe/EventSourcePipeline.cs
Outdated
Show resolved
Hide resolved
@@ -86,7 +88,29 @@ public async Task<Task> StartAsync(CancellationToken token) | |||
// started task. Logically, the run task will not successfully complete before the session | |||
// started task. Thus, the combined task completes either when the session started task is | |||
// completed OR the run task has cancelled/failed. | |||
await Task.WhenAny(_processor.Value.SessionStarted, runTask).Unwrap().ConfigureAwait(false); | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to incorporate the feedback here and the result was the code got more complicated 😄
The issue here is runTask
if it fails will cancel the SessionStarted
task which will trigger the WhenAny
to complete. There is a race then to figure out if runTask.IsFaulted
and get the exception.
I made it work by awaiting without the automatic throw but that is non-trivial on older targets.
Check it out, LMK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started investigating the code around this and mostly it just raised more questions. Partly I wasn't sure what motivated your initial change to this part of the code and without that I'm not sure whether I should be recommending you leave it alone, bandaid it, change higher level code to avoid calling it, or try fixing the root causes.
Looking around I saw a couple suspicious things:
- The comment at https://github.com/dotnet/diagnostics/blob/main/src/Microsoft.Diagnostics.Monitoring.EventPipe/DiagnosticsEventPipeProcessor.cs#L26 suggests that the SessionStarted Task is purely for testing but it looks like dotnet-monitor calls StartAsync() in situations that weren't obviously test related. If the functionality was just for testing there are probably other strategies that wouldn't require this StartAsync() functionality to exist.
- The comment text describes the WhenAny as a workaround for RunTask failing without signaling the session completed task. That underlying issue seems like it could be addressed directly such that any failure in RunTask also sets SessionStarted to a faulted state.
- Its not clear to me why StartAsync() needs to distinguish between a cancelled/faulted state so maybe all this code to prioritize one over the other is unneeded? Typically if you have code that calls StartXYZ() you also have code later that does StopXYZZ() or EndXYZ() and it is the end code which is expected to report failures. If the failure occurs early that would usually mean that the call to StopXYZ() still occurs, it returns immediately with an Exception and at that point the caller can do some unified error handling for whole operation rather than separate error handling for start and stop.
filterAndPayloadSpecs.AppendLine($"[AS]{activitySource}/Stop{sampler}:-TraceId;SpanId;ParentSpanId;ActivityTraceFlags;TraceStateString;Kind;DisplayName;StartTimeTicks=StartTimeUtc.Ticks;DurationTicks=Duration.Ticks;Status;StatusDescription;Tags=TagObjects.*Enumerate;ActivitySourceVersion=Source.Version"); | ||
} | ||
|
||
// Note: Microsoft-Diagnostics-DiagnosticSource only supports a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We solve this problem btw with a shared string across our configurations. Since we are not planning on operating triggers and OTLP at the same time right now, it's not a concern but if we want to do all of these, we'll have to aggregate the otlp configuration with the other ones. Let's save that for the future.
eventSource.Dynamic.All += traceEvent => { | ||
try | ||
{ | ||
if (traceEvent.TryGetActivityPayload(out ActivityPayload activity)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't we move the validation code here and get away with just one session? We are not validating prior to running anyway.
|
||
private static void AddToArrayGrowingAsNeeded<T>(ref T[] destination, T item, ref int index) | ||
{ | ||
if (index >= destination.Length) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider making a separate helper method or extension method for growing the ref array. You have similar logic in the LogRecord PR.
This PR adds a new pipeline which will listen to DiagnosticSourceEventSource in order to receive
Activity
instances (aka spans) from a target process.The goal is to use this in dotnet-monitor to export distributed traces (along with logs & metrics) using OpenTelemetry Protocol (OTLP).
/cc @noahfalk @samsp-msft @rajkumar-rangaraj @wiktork