Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding FunctionActivitySource to Worker; refactoring AppInsights support #1307

Merged
merged 6 commits into from
Mar 15, 2023

Conversation

brettsam
Copy link
Member

@brettsam brettsam commented Feb 1, 2023

Fixes open-telemetry/opentelemetry-specification#1359
Fixes open-telemetry/opentelemetry-specification#1358
Fixes open-telemetry/opentelemetry-specification#1264

This change:

  • moves the FunctionActivitySource into the core worker project and provides OTel versioning capabillities for the future.
  • updates the handling of this Activity in the App Insights package
  • makes the App Insights handling richer (now sends Exceptions, sets Dependency status, for example)
  • removes a bunch of other custom App Insights code we had to simplify some of our behavior
  • sets "WorkerApplicationInsightsLoggingEnabled" capability for when functions host can handle this (soon)

Note that after this change you can start wiring up OpenTelemetry exporters in the worker as well and bypass our ApplicationInsights package completely. For example:

funcBuilder.Services.AddOpenTelemetry()
    .WithTracing(traceBuilder =>
    {
        traceBuilder.AddSource("Microsoft.Azure.Functions.Worker")
            .AddHttpClientInstrumentation()
            .AddAzureMonitorTraceExporter(o => o.ConnectionString = aiConnStr)
            .SetSampler<AlwaysOnSampler>();
    })
    .StartWithHost();

funcBuilder.Services.AddLogging(loggingBuilder =>
{
    loggingBuilder.AddOpenTelemetry(options =>
    {
        options.AddAzureMonitorLogExporter(o => o.ConnectionString = aiConnStr);
    });
});

@brettsam brettsam requested review from jviau, liliankasem and fabiocav and removed request for liliankasem February 1, 2023 22:51
Copy link
Member

@lmolkova lmolkova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few minor(-ish) comments, looks great overall. Super-excited to see it coming! 🚀

Copy link
Contributor

@jviau jviau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great to see!

A question that I have with this and with DurableTask tracing: I would like to ship the activity source as "preview" to start. This is so we can iterate on schema. But how? How do we ship the activity source within this already GA package? Do we need to split this out into its own package Microsoft.Azure.Functions.Worker.Instrumentation and have some internal hook point for that separate package to start/stop its activity at the right point?


public Activity? StartInvoke(FunctionContext context)
{
var activity = _activitySource.StartActivity(TraceConstants.FunctionsInvokeActivityName, ActivityKind.Internal, context.TraceContext.TraceParent,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be ActivityKind.Server as we are ultimately serving a function invocation request?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the comment here -- #1307 (comment). It seems that this is currently not really defined by the otel spec, but that App Insights and others treat this as an internal call with the host log being the "Server" piece. If they were both "Server", rendering of the Span would not be correct.

I think this is something we can handle with the OTel schema versioning we're doing -- if things change in the future, we can use that to change the Kind as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to have a discussion about the spans we will all have. I don't think we should be trying to hide the gRPC hop from Host to Worker. Our Host/Worker design is a distributed system and it should be represented as such in spans. Customers should see the following at minimum:

Incoming request (Host Server Span) --> Host Calling Worker (Host Client Span) --> Worker handling invocation (Worker Server Span)

And then we also have to see how gRPC spans factor into all of this. Given that N gRPC spans may relate to a single invocation, I think they should be links on the Functions spans and not parent/child

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So -- the spec does say it should be Server: https://opentelemetry.io/docs/reference/specification/trace/semantic_conventions/faas/#incoming-invocations. But I also see really no notion of a "host". It seems to only care about the instance handling the invocation itself.

I think in most other clouds, the "host" -- the thing polling/sending you events -- is part of infrastructure and not even present in the distributed trace? anyone have any idea?

@lmolkova -- if we do indeed make this a Server (which is what it seems like it should be) -- that would turn this span into a Request in App Insights, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Server becomes request in application insights. This is also beneficial because app insights has built in aggregation for request telemetry - failure rate, performance, etc. I see this as a great benefit for customers.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we're saying that Host => Worker is an Activity, that can be listened to by users (and therefore output to their backend observability platform), then we add more spans in the Worker under the same source (like a Http instrumentation span) then listening to the ActivitySource will emit to spans, one of which is arguably not that useful to them? and if they don't control the performance of it, I see little reason they should see it, and pay for it (in storage/backend costs).

Having a separate ActivitySource with .Host on the end might be a better way to look at it, so it become an opt-in for Host metrics?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes. Any other ActivitySource will be differentiated. That's why this one is named with ".Worker" -- it's only meant to emit Activities directly related to the Worker.

Any Activity related to the host will be appropriately named and, won't exist in the worker process -- those will be emitted via the Host process.

Does that address the concern?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think option 1 is the most "truthful" and flexible option. As @brettsam said, the Host spans will be identifiable as a different source. In the App Insights scenario, we should also ensure the emitted telemetry items are appropriately distinguishable from equivalent spans from the Worker. So that in the performance or failure view pages the Host and Worker spans are aggregated separately (I think this means ensuring they are different app names).

My justification is I believe that we should give Server spans for the Worker (and specifically not Internal). This is for similar reasons as to what @brettsam mentioned: we want to keep it flexible so that if Host and Worker separate further in the future, this telemetry remains accurate without any changes.

Copy link
Member

@lmolkova lmolkova Mar 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My justification is I believe that we should give Server spans for the Worker (and specifically not Internal)

Having two nested server spans will break the application map in application insights and could be confusing to other backends. It also goes against otel spec that says server spans should have client parents - this creates expectations for backends that would be broken. So if the decision is to use server span, it'd be best to add client span that tracks call to worker right away

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, option 1 does include the client span from Host -> Worker.

@brettsam
Copy link
Member Author

brettsam commented Feb 3, 2023

@jviau re: versioning

I was planning to get this in as GA inside the Worker assembly without any preview. (The ActivitySource; not the App Insights stuff). I was intentionally keeping this initial Activity small -- with really only a single tag added plus the schema. I figured the only real changes I see coming are:

  • add new tags to make things richer -- which I wouldn't see as "breaking"
  • some tag needs to change due to the otel schema changing. This is factored in (i.e. "faas.execution" -> "faas.invocation" mapping would already happen and customer would choose the schema they want to run with in WorkerOptions).
  • we'd need to change the Activity name or kind? We can hide this behind the schema selection as well.

We're getting lots of requests for this to be "official" -- so I didn't want to sit in Preview for a long time. I know changes will come which is why I chatted with @lmolkova about how everyone else is handling schema changes.

We can certainly move to an external diagnostics assembly -- it'd be more flexible. But I'm sure even after we GA that we'd get more changes coming in the future. Maybe it'd be easier to bump that one's major we have to do a major overhaul?

Curious how others feel -- @fabiocav / @lmolkova / @RohitRanjanMS

@brettsam brettsam merged commit b963a28 into main Mar 15, 2023
@brettsam brettsam deleted the brettsam/activity branch March 15, 2023 14:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants