Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracing is now very broken in Azure Functions #2733

Closed
martinjt opened this issue Sep 24, 2024 · 10 comments · Fixed by #2765
Closed

Tracing is now very broken in Azure Functions #2733

martinjt opened this issue Sep 24, 2024 · 10 comments · Fixed by #2765
Assignees
Labels
potential-bug Items opened using the bug report template, not yet triaged and confirmed as a bug

Comments

@martinjt
Copy link

Description

I'm not entirely sure what's changed, but I suspect this may have been a host change, but I'll list out what I'm saying based on multiple customer issues we've had.

There is now an empty Activity present in the execution path. This means that whenever a user adds tracing, this new Activity is the parent, however, it's not something that you can listen to since it does not have a source name.

This leads to the new ActivitySource named Microsoft.Azure.Functions.Worker, this does take over the Blank Activity when you listen to the source, however, it uses the TraceContext from the FunctionContext as it's parent, which doesn't use the inbound traceparent from the HttpRequest headers as it's parent. This means that your span has a parent which you're unable to access.

image
That is what a trace looks like right now with the OOTB setup.

        services.AddOpenTelemetry()
            .ConfigureResource(resource => resource.AddService("practical-otel-azure-functions"))
            .UseFunctionsWorkerDefaults()
            .WithTracing(tracingBuilder =>
            {
                tracingBuilder
                    .SetSampler<AlwaysOnSampler>()
                    .AddSource("Microsoft.Azure.Functions.Worker")
                    .AddHttpClientInstrumentation()
                    .AddAspNetCoreInstrumentation()
                    .AddOtlpExporter();
            });

Aspire is rendering it in a semi-ok way, that's compensating for the fact that the data itself on those spans isn't right (missing parents, and missing root spans).

I think the big change is that, previously, there was not TraceContext on the function context, and therefore there was no active Activity. Since the upgrade, this has changed and now, there is an Activity which has a remote parent that there is no way to get access to (there's supposed to be a way to access it in Flex Consumption).

To be clear, this is a breaking change. Adding an Activity that is not trackable, and overriding the inbound traceparent causes functions to be untraceable due to the context propagation issues.

The AspNetCore span completely baffles me, since it's got a different parent to the FunctionsWorker span, and also somehow is outside of the hierarchy.

I'd love to know what caused this, and whether there is a plan to fix functions tracing for users who aren't upgrading to Flex Consumption.

Steps to reproduce

Create a new HTTP Trigger function
Add OpenTelemetry
Send the data to somewhere you can see the raw data.
Run it locally
Hit the endpoint

Look at the telemetry and you'll see that it doesn't follow a linear hierarchy, and there is no span that doesn't have a parent (aka a root span)

@martinjt martinjt added the potential-bug Items opened using the bug report template, not yet triaged and confirmed as a bug label Sep 24, 2024
@RohitRanjanMS RohitRanjanMS self-assigned this Sep 24, 2024
@RohitRanjanMS
Copy link
Member

Thanks @martinjt for reaching out, I will provide an update by tomorrow.

@jkessenich
Copy link

jkessenich commented Sep 26, 2024

Apart from this, when you have a HTTP triggered function and send a request to it, a new unparented span (a.k.a. new trace) is also being emitted from the AspNetCore instrumentation, so you have this blank Invoke span and this new trace. Starting a new active span in does not make this the parent.

ScopeSpans #1
ScopeSpans SchemaURL:
InstrumentationScope Microsoft.Azure.Functions.Worker 1.0.0.0
Span #0
Trace ID : 63b5dd200c01e184afdf5bafe61c90eb
Parent ID : 055abe33f3e3a70e
ID : ff98f9123b333119
Name : Invoke
Kind : Server
Start time : 2024-09-26 16:56:46.7240646 +0000 UTC
End time : 2024-09-26 16:56:46.9083439 +0000 UTC
Status code : Unset
Status message :

ScopeSpans #2
ScopeSpans SchemaURL:
InstrumentationScope Microsoft.AspNetCore
Span #0
Trace ID : 1e5d60beeafd93c433bb594460c600ff
Parent ID :
ID : 58a419ff1ceb4d7a
Name : POST api/MyApi
Kind : Server
Start time : 2024-09-26 16:56:46.7395207 +0000 UTC
End time : 2024-09-26 16:56:46.9209312 +0000 UTC
Status code : Unset
Status message :

@RohitRanjanMS
Copy link
Member

@martinjt , you are right. We created this internal activity to provide context since the current activity was null. We intentionally did not use an ActivitySource so it cannot be tracked. I'll look into it and update this thread tomorrow.

image

@martinjt
Copy link
Author

This has caused a breaking change @RohitRanjanMS, I can understand following rhe pattern than ASP.NET Core uses here (the legacy activity without source), however, there is no check for an existing activity here, which if there is, it makes no sense to create a new one.

Further, the host now creates an activity (even in none flex consumption), which means we have no ability to know if this was an entrypoint, and therefore can't override to make a root span.

I get that Aspire isn't flagging that there are missing spans in the trace, that's a separate issue since it should make developers aware there is an issue in their tracing.

The reason this is important is that the lineage of spans is important for searching and also display. Where you're at right now with it, there is no way to render a trace view with the right lineage where there is a function app involved.

In my opinion, there needs to be 2 high priority fixes.

  1. Check for an existing activity and don't create the invisible one if that's the case.
  2. Change the host such that it doesn't create spans by default.

Those 2 will fix all the issues this change has caused.

@RohitRanjanMS
Copy link
Member

Hi @martinjt , there are two parallel pipelines: ASP.NET Core and Functions. While the ASP.NET Core pipeline is awaiting, the Functions pipeline runs the worker code. The correct approach is to integrate both pipelines, allowing the worker code to run within the ASP.NET Core pipeline and ensuring that telemetry is assigned the correct parent. This issue is also impacted for the same reason. Although I don’t have a specific date for the fix, it’s something we’ve been discussing recently.

I plan to add a null check for Activity.Current and, if it's not null, I won’t create the internal activity. This will be useful once the ASP.NET Core pipeline is fixed, allowing consumers to utilize the server span emitted by AspNetCoreInstrumentation.

However, the Functions runtime/host process will emit a server/consumer span, as it has the necessary context, especially in batch execution scenarios with messaging. The internal activity represents the remote activity on the host process, ensuring that all logs and spans are correctly parented by this remote activity.

This runtime version is available in the Flex consumption plan and will also be part of the Aspire release. We're actively working to improve OpenTelemetry support, with more features rolling out over the next few months.

Additionally, I don't recommend listening to Microsoft.Azure.Functions.Worker source, as it was designed to support the classic AppInsights SDK, which doesn’t add value in an OpenTelemetry context.

@martinjt
Copy link
Author

martinjt commented Oct 2, 2024

The null check, or correct parentage on the null activity will solve the issue with ASP.NET Core instrumentation appearing out of band (I tried this).

The issue right now is that even if you're not asking the runtime/host to use OpenTelemetry (and therefore generate all the data from the host), the host adds it's own span and therefore breaks the parentage with the calling service. This is most present when there is no traceparent coming from the calling http service, since this generates a root span in the host process that you don't get access to.

There is the further issue that you can't filter the Activity's generated by the host process, so it's incredibly noisy, with little signal data, but that's separate and related only to flex consumption (It's the reason that the collector approach in lambda is done that way).

I can understand that the host runtime has important context, and that context might only live there, but right now, it's generating spans and breaking the trace hierarchy even when you're not requesting that it does so. There are a lot of people who've required tracing to be present before this, and have used the inbound propagation from the traceparent in the headers, but now, that's broken due to the intermediary process generating a span that they can't see, and therefore can't do propagation properly.

This was a breaking change in the host runtime, that has broken existing functionality. One of them can be worked around, the other is in the host and therefore cannot.

@martinjt
Copy link
Author

martinjt commented Nov 13, 2024

I've created a minimal reproduction of the issue now.

martinjt/root-span-issue-functions

Deploy that to a basic azure function, then add an environment variable for OTEL_EXPORTER_OTLP_ENDPOINT then point that at anything which doesn't process the telemetry to normalise it (I used an ngrok endpoint forwarding to Aspire locally).

When you hit the endpoint for the function, you'll generate a trace with 1 span, and that span will have a parent. That shouldn't happen, as now we have a missing root span. Since this is not in flex consumption, I have no ability to get access to that span.

This is how broken it is, since this used to work.

Image

@martinjt
Copy link
Author

@RohitRanjanMS can we get this reopened to fix the remaining issue?

@jviau
Copy link
Contributor

jviau commented Nov 14, 2024

@martinjt we are tracking the 'fake' activity design here: #2810

@martinjt
Copy link
Author

martinjt commented Nov 14, 2024

@jviau This is slightly different.

The point (relating it to that other issue) is that context.TraceContext is set at all when I haven't passed one in the traceparent in my http call.

Ultimately, in my example scenario above, the .SetId() would be setting null anyway, which should be fine and work as expected.

That's what has broke all our customers, since now there is a trace parent set that they can't get access to.

stevejgordon added a commit to elastic/apm-agent-dotnet that referenced this issue Nov 29, 2024
This PR fixes several things in Azure Functions (isolated worker) which
occur when using the latest templates for Azure Functions.

- Newer versions of the Functions library prefer the gRPC-based
implementation, which throws when `Url` is accessed. Instead, we read
these from other data on the `FunctionContext`.
- Address Functions library changes that break distributed tracing. We
now parse the original headers from the `BindingContext` instead of the
request, which may contain a `traceparent` with the sampling flag set to
false when the user request does not include a specific `traceparent`.
- We explicitly don't record activities from the Azure functions library
as these are pretty broken
(Azure/azure-functions-dotnet-worker#2733) and
are redundant when using our middleware.
- Downgrade several packages as the newer ones are now deprecated
(thanks for the confusion, Microsoft!).
- Update some outdated compiler pre-processor directives.
- The final few commits focus on CI integration test hangs on Linux. We
don't have a perfect solution for those, but after reviewing the hang
dumps, I've avoided some of the potential causes of the hangs. We'll
monitor subsequent PRs, and if they remain stable, we will readdress the
original causes.

A follow-up PR will update our documentation.

Closes #2407 
Closes #2311
Closes #2218
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
potential-bug Items opened using the bug report template, not yet triaged and confirmed as a bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants