-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tracing is now very broken in Azure Functions #2733
Comments
Thanks @martinjt for reaching out, I will provide an update by tomorrow. |
Apart from this, when you have a HTTP triggered function and send a request to it, a new unparented span (a.k.a. new trace) is also being emitted from the AspNetCore instrumentation, so you have this blank Invoke span and this new trace. Starting a new active span in does not make this the parent. ScopeSpans #1 ScopeSpans #2 |
@martinjt , you are right. We created this internal activity to provide context since the current activity was null. We intentionally did not use an ActivitySource so it cannot be tracked. I'll look into it and update this thread tomorrow. |
This has caused a breaking change @RohitRanjanMS, I can understand following rhe pattern than ASP.NET Core uses here (the legacy activity without source), however, there is no check for an existing activity here, which if there is, it makes no sense to create a new one. Further, the host now creates an activity (even in none flex consumption), which means we have no ability to know if this was an entrypoint, and therefore can't override to make a root span. I get that Aspire isn't flagging that there are missing spans in the trace, that's a separate issue since it should make developers aware there is an issue in their tracing. The reason this is important is that the lineage of spans is important for searching and also display. Where you're at right now with it, there is no way to render a trace view with the right lineage where there is a function app involved. In my opinion, there needs to be 2 high priority fixes.
Those 2 will fix all the issues this change has caused. |
Hi @martinjt , there are two parallel pipelines: ASP.NET Core and Functions. While the ASP.NET Core pipeline is awaiting, the Functions pipeline runs the worker code. The correct approach is to integrate both pipelines, allowing the worker code to run within the ASP.NET Core pipeline and ensuring that telemetry is assigned the correct parent. This issue is also impacted for the same reason. Although I don’t have a specific date for the fix, it’s something we’ve been discussing recently. I plan to add a null check for Activity.Current and, if it's not null, I won’t create the internal activity. This will be useful once the ASP.NET Core pipeline is fixed, allowing consumers to utilize the server span emitted by AspNetCoreInstrumentation. However, the Functions runtime/host process will emit a server/consumer span, as it has the necessary context, especially in batch execution scenarios with messaging. The internal activity represents the remote activity on the host process, ensuring that all logs and spans are correctly parented by this remote activity. This runtime version is available in the Flex consumption plan and will also be part of the Aspire release. We're actively working to improve OpenTelemetry support, with more features rolling out over the next few months. Additionally, I don't recommend listening to Microsoft.Azure.Functions.Worker source, as it was designed to support the classic AppInsights SDK, which doesn’t add value in an OpenTelemetry context. |
The null check, or correct parentage on the null activity will solve the issue with ASP.NET Core instrumentation appearing out of band (I tried this). The issue right now is that even if you're not asking the runtime/host to use OpenTelemetry (and therefore generate all the data from the host), the host adds it's own span and therefore breaks the parentage with the calling service. This is most present when there is no There is the further issue that you can't filter the Activity's generated by the host process, so it's incredibly noisy, with little signal data, but that's separate and related only to flex consumption (It's the reason that the collector approach in lambda is done that way). I can understand that the host runtime has important context, and that context might only live there, but right now, it's generating spans and breaking the trace hierarchy even when you're not requesting that it does so. There are a lot of people who've required tracing to be present before this, and have used the inbound propagation from the This was a breaking change in the host runtime, that has broken existing functionality. One of them can be worked around, the other is in the host and therefore cannot. |
I've created a minimal reproduction of the issue now. martinjt/root-span-issue-functions Deploy that to a basic azure function, then add an environment variable for When you hit the endpoint for the function, you'll generate a trace with 1 span, and that span will have a parent. That shouldn't happen, as now we have a missing root span. Since this is not in flex consumption, I have no ability to get access to that span. This is how broken it is, since this used to work. |
@RohitRanjanMS can we get this reopened to fix the remaining issue? |
@jviau This is slightly different. The point (relating it to that other issue) is that Ultimately, in my example scenario above, the That's what has broke all our customers, since now there is a trace parent set that they can't get access to. |
This PR fixes several things in Azure Functions (isolated worker) which occur when using the latest templates for Azure Functions. - Newer versions of the Functions library prefer the gRPC-based implementation, which throws when `Url` is accessed. Instead, we read these from other data on the `FunctionContext`. - Address Functions library changes that break distributed tracing. We now parse the original headers from the `BindingContext` instead of the request, which may contain a `traceparent` with the sampling flag set to false when the user request does not include a specific `traceparent`. - We explicitly don't record activities from the Azure functions library as these are pretty broken (Azure/azure-functions-dotnet-worker#2733) and are redundant when using our middleware. - Downgrade several packages as the newer ones are now deprecated (thanks for the confusion, Microsoft!). - Update some outdated compiler pre-processor directives. - The final few commits focus on CI integration test hangs on Linux. We don't have a perfect solution for those, but after reviewing the hang dumps, I've avoided some of the potential causes of the hangs. We'll monitor subsequent PRs, and if they remain stable, we will readdress the original causes. A follow-up PR will update our documentation. Closes #2407 Closes #2311 Closes #2218
Description
I'm not entirely sure what's changed, but I suspect this may have been a host change, but I'll list out what I'm saying based on multiple customer issues we've had.
There is now an empty Activity present in the execution path. This means that whenever a user adds tracing, this new Activity is the parent, however, it's not something that you can listen to since it does not have a source name.
This leads to the new ActivitySource named
Microsoft.Azure.Functions.Worker
, this does take over the Blank Activity when you listen to the source, however, it uses the TraceContext from the FunctionContext as it's parent, which doesn't use the inboundtraceparent
from the HttpRequest headers as it's parent. This means that your span has a parent which you're unable to access.That is what a trace looks like right now with the OOTB setup.
Aspire is rendering it in a semi-ok way, that's compensating for the fact that the data itself on those spans isn't right (missing parents, and missing root spans).
I think the big change is that, previously, there was not TraceContext on the function context, and therefore there was no active Activity. Since the upgrade, this has changed and now, there is an Activity which has a remote parent that there is no way to get access to (there's supposed to be a way to access it in Flex Consumption).
To be clear, this is a breaking change. Adding an Activity that is not trackable, and overriding the inbound
traceparent
causes functions to be untraceable due to the context propagation issues.The AspNetCore span completely baffles me, since it's got a different parent to the FunctionsWorker span, and also somehow is outside of the hierarchy.
I'd love to know what caused this, and whether there is a plan to fix functions tracing for users who aren't upgrading to Flex Consumption.
Steps to reproduce
Create a new HTTP Trigger function
Add OpenTelemetry
Send the data to somewhere you can see the raw data.
Run it locally
Hit the endpoint
Look at the telemetry and you'll see that it doesn't follow a linear hierarchy, and there is no span that doesn't have a parent (aka a root span)
The text was updated successfully, but these errors were encountered: