Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distributed tracing and Orleans #4992

Closed
NickDarvey opened this issue Sep 17, 2018 · 25 comments · Fixed by #7647
Closed

Distributed tracing and Orleans #4992

NickDarvey opened this issue Sep 17, 2018 · 25 comments · Fixed by #7647

Comments

@NickDarvey
Copy link

NickDarvey commented Sep 17, 2018

Goal

Distributed tracing that works out-of-the-box with other Microsoft libraries like System.Net (HttpClient), EntityFramework, Event Hubs SDK and Service Bus SDK.

Value

I want to instrument my Orleans application in a way I can correlate events so I can trace the effect of any one request across an entire system, across service boundaries. As an example, by adding a trace identifier to a HTTP request, I want to see that it reaches my gateway (in ASP.NET Core), each grain it hits (in Orleans), the messages it creates (on an Event Hub) and the database transaction it executes (via Entity Framework).

Background

I understand Orleans provides an ActivityId however it doesn't cover distributed tracing out-of-the-box (i.e. flowing that ActivityId to other systems).

I looked at how other Microsoft libraries did this (e.g. System.Net, Event Hubs SDK) and they seem to be converging on System.Diagnostics.Activity which is available as an AsyncLocal container (Activity.Current) for carrying diagnostic context. (More information about Activity is in the user guide.)
Unfortunately, it's not something that is used with Microsoft.Extension.Logging but rather introduces a (yet another) new diagnostic API: System.Diagnostics.DiagnosticSource. DiagnosticSource is distinct from other logging systems in that it's to be consumed in-process and sources normally provide rich payloads.
(More information about DiagnosticSource is in the user guide.)

Libraries using Activity which make or receive external calls often implement some variation of this correlation protocol. This means that the Activity is emitted on outbound requests and resumed on inbound requests.
It looks like there is work underway to standardize (the header names) based on the W3C spec for distributed tracing. However, Orleans should be able to be instrumented now without being affected by these changes. (Client-to-Orleans calls are via an Orleans-provided library so I don't think a change in header names would be a breaking change.)
(There (hopefully?) won't be any change to Activity itself, depending on this discussion.)
Edit: See @lmolkova's comment for details on the impact of W3C distributed tracing.

Implementation

There is official guidance for Activity and DiagnosticSource but newer documentation is work-in-progress and available here.

Based on the experience of people who have consumed DiagnosticSources before, it looks like a implementation similar to Entity Framework, where the events are typed, would be preferable.

My insight only goes as deep as poking around over the last week. I've come across people who seem very involved in the .NET and distributed tracing space who might be able to chime in and enlighten us about the right way of approaching this, such as @SergeyKanzhelev, @lmolkova and @cwe1ss.

Minimal

A minimal implementation would allow the Activity to flow across service boundaries with a DiagnosticSource for grain requests. Log providers (ILoggerProvider) could optionally listen to this DiagnosticSource (or all DiagnosticSources) and enrich the logs with the Activity context.

  • Add an interceptor to start an Activity and add it to the RequestContext.
    @rikbosch has a gist with an implementation of this part here.
  • Add a client-side interceptor (if that's a thing) which starts an Activity and adds it to the RequestContext.
  • Add documentation for log providers wanting to enrich logs with the Activity context and/or document which providers work like this out of the box.
    @SergeyKanzhelev does App Insight's (ASP.NET Core?) ILoggerProvider implementation already make use of the Activity.Current for correlation? I saw this, but does that mean it'll work for any DiagnosticSource?

    Application Insights SDK starting version 2.4.0-beta1 uses DiagnosticSource and Activity to collect telemetry and associate it with the current activity.

Extended

An extended implementation might go so far as to replace ILogger use in Orleans with DiagnosticSources and provide adapters back to ILogger (and to ITelemetryProvider for performance counters). I think the value of going this far would be to offer richer diagnostics to people who want it (e.g. someone who wants to build a profiler) and a single diagnostics API for Orleans. (The minimal implementation would mean Orleans would offer both ILogger and DiagnosticSource which is ugh.)

@sergeybykov sergeybykov added this to the Triage milestone Sep 18, 2018
@lmolkova
Copy link

Great write-up!

few comments:

  • there will be changes in Activity to support W3C distributed tracing, however, they will be backward-compatible and does not block implementation.

  • Using W3C headers right away does not make sense as they require very specific format and Activity does not generate this format yet.

  • AppInsights leverages Activity and DiagnosticSource for correlation and telemetry collection.

  • DiagnosticSource provides two different levels of details and instrumentation should enable them:

    1. Start/Stop events and shared context in the Activity, suitable for tracing tools
    2. payload parsing with reflection: suitable for profilers -provides detailed info, but requires build-time reference on the Orleans (other instrumented lib).

I can help with any details and provide examples, just let me know.

@sergeybykov
Copy link
Contributor

sergeybykov commented Sep 18, 2018

We are definitely interested in providing rich diagnostics to Orleans users, especially when they cross the boundaries to other systems. I think we should tentatively consider this for Orleans 3.0.0, and better understand the developer experience and performance impact of this.

@ReubenBond
Copy link
Member

From reading various threads/docs on DiagnosticSource & EventSource, it seems that we should be using both within Orleans:

  • Support flowing DiagnosticSource along a request
  • Add a small number of DiagnosticSource events (RequestStart, etc) for telemetry and to ensure a good distributed tracing story
  • Add a larger number of EventSource events for perf investigations & diagnostic tracing
  • Use EventCounter to replace our current counters

Does that sound right to you, @NickDarvey & @lmolkova?

This is something which we wont be able to address in the immediate future, but as Sergey said above, we should consider this for 3.0.

@NickDarvey
Copy link
Author

I'd defer to @lmolkova, but that sounds sensible to me.

You might have already looked in to this but I saw that there is a bridge from DiagnosticSource to EventSource which might mean you can do all the instrumentation with DiagnosticSource.

@lmolkova
Copy link

@ReubenBond
I agree with your plan. DiagnosticSource as an instrument for distributed tracing is not the best choice for verbose internal diagnostics.

You may still use DiagnosticSource for verbose perf and diagnostics logging by using two different DiagnosticSources for distributed tracing and internal debugging, but EventSoiurce is more optimal for the latter. It also provides out-of-process monitoring with ETW and LTTng.

DiagnosticSource to EventSource bridge may be hard to use as it requires some special configuration and even defines a language to serialize DS payloads in EventSource events.

@skyflyer
Copy link

I'm still trying to wrap my head around DiagnosticSource and DiagnosticListener, but my thoughts were along a very simple practical implementation:

Web API project --> Orleans Silo (grain) --> external service via http

I noticed that by default, ILogger contains a plethora of request-related information, including ConnectionId, RequestId, etc. I want to flow this information through the system towards the http client. I hope I haven't got it all backwards (?), but this is how I implemented it until better guidance is available. I understand you have this planned for later Orleans release, but I need something I can use now.

  1. Since HttpContext.TraceIdentifier is available in the Web API pipeline, I created a simple middleware to set the RequestContext with specific key to that value:
            app.Use(async (ctx, task) => {
                RequestContext.Set(RequestContextKey.RequestId, ctx.TraceIdentifier);
                await task.Invoke();
            });
  2. Since I'm using Serilog as a logging provider with an Application Insights sink (but otherwise use MEL ILogger<T> abstractions throughout the code), I thought about supplying this information in as general way as possible. I got inspiration from AspNetCore, where they use ILogger.BeginScope to add information to a logging scope. Since I'm already using an incoming grain call filter to catch and log exceptions, I added it there:
                var requestId = RequestContext.Get(RequestContextKey.RequestId) as string;
                using (var scope = _logger.BeginScope(new Dictionary<string, object> { { "RequestId", requestId}})) {
                    await context.Invoke();
                }
  3. The "httpclient" part is still missing - I'd like to pass this RequestId so that it gets forwarded to other endpoints automatically.

This approach gets me the RequestId from the web api, to the Orleans silo in "as ambient" way as I could find, without passing the RequestId all over the place.

Perhaps I got it backwards? Thoughts?

@suraciii
Copy link
Contributor

suraciii commented Jan 30, 2019

@skyflyer

  1. The "httpclient" part

There're 2 ways for HttpClient header injecting I know:
a) The DiagnosticsHandler in HttpClient produces several diagnostic events, so you can fetch the HttpRequestMessage when you subscribe related events, such as "System.Net.Http.HttpRequestOut.Start", and inject RequestId to message headers.

b) HttpClient provides a "middleware-like" way to configure Http request-response handling (much like Orleans' OutgoingGrainCallFilter) - you can add a custom handler to the client, and change request's header in that handler.
like this:

public class TracingHandler: DelegatingHandler
{
    public override Task<HttpResponseMessage> SendAsync(HttpRequestMessage message, CancellationToken token)
    {
        // change headers
        base.SendAsync(message, token);
    }
}

a is the more universal option - it uses the .net core's diagnostic pattern, but it's sort of complex, you need to start a task/thread to subscribe and handle events, and it's not much configurable.
I prefer and current using option b, with the help of the library HttpClientFactory, it's easy to configure different tracing policy for different HttpClients

@skyflyer
Copy link

skyflyer commented Feb 1, 2019

@csyszf, thanks for the input. I haven't work with HttpClientFactory (yet), looks interesting.

With my previous post, I was trying to access the same "trace identifier" that would be used "across different microservices". Given that ctx.TraceIdentifier was the only such parameter available to me at the time, I figured that would be it, but I later noticed that Application Insights does not use this value, but rather computes its own Operation Id in the form |61e9be9932679f4d8c3f775164d629d1.5173a375_ where TraceIdentifier value is in the form 0HLK6O4A3D8CC:00000001.

What is (or should be) the correct way to "correlate" all "operations" under the same (parent) id? The (simple) goal is to have one (parent) operation id available across different services -- and since we're logging everything into Application Insights (integration with ASP.NET Core + Serilog AI sink for other logging requirements), that would be really nice.

@suraciii
Copy link
Contributor

@csyszf, thanks for the input. I haven't work with HttpClientFactory (yet), looks interesting.

With my previous post, I was trying to access the same "trace identifier" that would be used "across different microservices". Given that ctx.TraceIdentifier was the only such parameter available to me at the time, I figured that would be it, but I later noticed that Application Insights does not use this value, but rather computes its own Operation Id in the form |61e9be9932679f4d8c3f775164d629d1.5173a375_ where TraceIdentifier value is in the form 0HLK6O4A3D8CC:00000001.

What is (or should be) the correct way to "correlate" all "operations" under the same (parent) id? The (simple) goal is to have one (parent) operation id available across different services -- and since we're logging everything into Application Insights (integration with ASP.NET Core + Serilog AI sink for other logging requirements), that would be really nice.

I'm not familiar with Application Insights, but |61e9be9932679f4d8c3f775164d629d1.5173a375_ looks like an Activity Id, an it should be avaliable in most areas which adapt System.Diagnostics.
AspNetCore start several activities like RequestStart, so you can get this activity id if you subscribe it - and it will auto set the new started activity's parent to the current activity(with stored by AsyncLocal)

@RainingNight
Copy link

Any news?

@ReubenBond
Copy link
Member

Yes, we are actively working on this

@veikkoeeva
Copy link
Contributor

veikkoeeva commented Jun 22, 2019

I'll cross-reference also dotnet/aspnetcore#5914 as those following this might be interested to see where the core framework is going and have potentially similar scenarios to the branched issues.

<Edit: Links of interest by following from ASP.NET Core to https://w3c.github.io/trace-context/ and from there to https://github.com/open-telemetry/opentelemetry-dotnet/ (and dotnet/aspnetcore#8924).

@jfritchman
Copy link

Is this issue going to make the 3.0.0 release?

@zh6335901
Copy link

zh6335901 commented Nov 5, 2019

What's the progress of this issue?

@RehanSaeed
Copy link

The ASP.NET Core team posted a very enlightening blog post showing how the new Activity API works:

https://devblogs.microsoft.com/aspnet/improvements-in-net-core-3-0-for-troubleshooting-and-monitoring-distributed-apps

@veikkoeeva
Copy link
Contributor

Also potentially interesting example from Firefox DevTools: Server-Timing HTTP header (also https://twitter.com/FirefoxDevTools/status/1201914691863244800).

@veikkoeeva
Copy link
Contributor

Maybe interesting resource: https://opentelemetry.io/. Also cross-referencing to #6178 that asks guidance for production runs.

@ElderJames
Copy link
Contributor

ElderJames commented Jan 7, 2020

I am working for the .NET agent of Apache Skywalking APM and want to support orleans. Like #6213 said.
Can I get any help?

@ReubenBond
Copy link
Member

@ElderJames I responded on that thread

@ReubenBond
Copy link
Member

I have a branch where I was testing integration with Jaeger via OpenTracing and DiagnosticSource. Longer term, OpenTelemetry is probably the way to go but I'm not sure if they had any nuget packages at the time. I plan to complete the integration to have a better end-to-end, pay-for-what-you-use solution here.

@NickDarvey
Copy link
Author

FYI, saw some related work in dotnet/runtime#31373

The proposal here is mainly targetting closing the gap between the .NET Libraries and OpenTelemetry (OT) tracing APIs (mainly the OT Span class).

and its linked issue dotnet/runtime#31372

This issue is tracking the work need to be done for supporting the distributed context scenarios for ASP.NET and OpenTelemetry.

@oising
Copy link
Contributor

oising commented Sep 6, 2020

@ReubenBond Given this didn't make 3.0 -- is this on track for 4.0 ?

@veikkoeeva
Copy link
Contributor

I link https://petabridge.com/blog/state-opentelemetry-dotnet/ and also dotnet/runtime#31862 (in spirit of @NickDarvey) to provide context.

@OracPrime
Copy link
Contributor

+1 for interest in seeing how opentelemetry integration pans out.

@Ilchert
Copy link
Contributor

Ilchert commented Dec 10, 2020

@ReubenBond, is it still actual to implement this feature?
I added a pull request with Activity and Diagnostic listener support and it does not depend on any internal things, so everyone can use the idea.
Thanks.

@ghost ghost added the Status: Fixed label Apr 13, 2022
@ghost ghost locked as resolved and limited conversation to collaborators May 13, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.