-
Notifications
You must be signed in to change notification settings - Fork 293
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OpenTelemetry .NET Tracing AspNetCore instrumentation is inefficient #1745
Comments
To add:
|
The benchmarks above show a totally different story. The proper way to test overall web app throughput reduction is likely by using bombardier similar tools which simulates real scenarios. Maybe add new perf tests using new approach first, and keep benchmark.net for micro optimizations/measurements only. |
@dpk83 there's a couple of point here that I think are interesting, I've done very specific analysis on this along side the benchmarks in the repo. Your analysis of the middleware you added seems to suggest that it's a like-for-like replacement, which it is not. There is much more done to create a full activity, including things like trace propagation and also filling out additional properties too. That's not to say that it isn't valid that there should be a smaller instrumentation for those that want it, just that it's not a like-for-like comparison. Your middleware approach does nothing with the Activity if I'm looking at it correctly. This means that, ultimately, there is no "work" being done, and therefore all you're testing is that you can add a tag and that it doesn't affect the load. I would highly recommend showing this as a full implementation. Specifically, use Baggage will cause some additional load as it has to be copied and put onto the outbound requests. This is more to do with Propagation than it is to do with AspNetCore instrumentation. Exporters, as long as they're using Batch, shouldn't affect inline performance, but will increase memory usage. I think "extremely" in the title here is incendiary and not required. This is quite an aggressive post, I'm hoping that it's coming from a place of frustration and not a criticism. If there are some optimizations you can see to the way it works, then I'd recommend working with the team to get those optimizations made for the benefit of everyone. Telemetry is always additional processing, that's something that everyone needs to understand, and therefore will always cause some degree of degradation in throughput/latency and/or increase in memory. There's a trade off here, do you fly blind without tracing so you can have a smaller amount of servers running your application? or take the hit of adding an additional server and/or scaling up so you can have the telemetry. |
In the current benchmarking none of the additional functionality beyond what's already provided by AspNetCore is used, so it is actually a like-for-like comparison for the required functionality and I don't expect OpenTelemetry to have additional overhead on top of it. I intentionally kept is simple because that's going to be the case for majority of the cases where baggage isn't modified or the tracing isn't going through multiple different vendor implementations. Also, leave the middleware aside, just comparing the throughput for binary scenarios i.e. tracing disabled vs tracing enabled itself is sufficient to the message I am trying to put forth.
Yeah, I intentionally left out baggage in the benchmarking because I wanted to make a point here for the very basic functionality where I don't expect any additional overhead on top of AspNetCore. Regarding baggage it will cause further reduction in throughput. The way currently baggage part is designed in the library can definitely be improved further. For our cases >95% of the cases will not modify the baggage as the request is passing through the system, so transforming the baggage header into the object and then retransforming it back to it's way out is clearly an unnecessary overhead. Such processing should be moved to more on-demand basis (i.e. on the first such action where it's required). I also believe there are additional opportunities to extract more performance out of it even for cases where baggage needs to be transformed and used.
That's fair, I will remove the
Yes it's additional processing but that additional processing should be minimal so my services can utilize the system's resources efficiently for the actual business logic. Why would one use OpenTelemetry .NET library if they are able to achieve the same result with way lower processing and cost implications by writing their own solution or some other solution? For smaller services or adhoc usage it's fine but for large complex services which are processing millions and billions of requests a day telemetry costs isn't insignificant and matters a lot. |
Adding this issue to the Instrumentation-1.0.0 milestone - we cannot release 1.0 until the perf is acceptable (and prove that with actual numbers). Few things:
@dpk83 would you have time to send a PR to this repo with perf test instructions using bombardier? Its okay to be manual at first, we can see how to automate it to orchestrate and run and show results, just like benchmark tool. |
@dpk83 There is a lot of work that has to be done in the AspNetCoreInstrumention to support OTel baggage because .NET doesn't do it.
If you aren't using baggage, try doing this... // When the default propagator is just w3c we respect the work AspNetCore does instead of calling the SDK logic.
// See: https://github.com/open-telemetry/opentelemetry-dotnet/blob/e9226cfadf3eb4b58fb8c087c61291a3569ccea3/src/OpenTelemetry.Instrumentation.AspNetCore/Implementation/HttpInListener.cs#L133
Sdk.SetDefaultTextMapPropagator(new TraceContextPropagator()); That should make the OTel perf more in line with your middleware. It is still doing more work, so it will be more of a hit, but it should help a lot. If you do want to use OTel baggage in your services, fix your middleware because it makes your tests invalid 😉 |
This is good to know but I think there are opportunities to improve the baggage handling as well.
|
Great ideas to improve baggage! Mind spinning up an issue to track that? For the case here where baggage isn't being used (probably very common case) what might be more immediately beneficial is for OTel .NET to use the .NET propagation API (somehow) to avoid the duplicate extraction/injection and activity creation. IIRC that is one thing @vishweshbankwar was looking into. |
@CodeBlanch I tried this and it's actually causing a different activity everytime. All the exported chained activities have different traceId. |
Please open a new issue with this one, so we dont add noise to this particular issue to track performance improvements. |
This is an important issue so checking for the status here. It is on track for getting fixed in the upcoming stable OpenTelemetry release, right? |
If you are referring to the 1.5 Stable Release (that is the next stable), then no. This issue is not planned for that milestone. This is something to be addressed definitely for the Stable release milestone. No firm ETA on that yet. |
@reyang This is an important ask and I was expecting this to be addressed in 1.5 stable release. We have been pushing services across our org to adopt distributed tracing and the perf penalty here is unacceptable. We definitely want to avoid creating our own solution here because OpenTelemetry didn't provide the performant solution. |
Sorry I'm confused, I think we all agree that this is part of the stable release. OpenTelemetry .NET 1.5 milestone is not planning to release a stable version of |
Oh so you are saying this will only be available when AspNetCore instrumentation becomes stable. What are the timelines for that? Based on the timelines we will have to evaluate whether we can wait or we need to switch gears |
Where is this coming from? I didn't say that. |
Okay looks like there is some problem on my side and I am not able to understand the statements properly. Let’s simplify things: |
@cijothomas already replied #1745 that we don't have the ETA yet. |
@dpk83 : Sharing an update here based on my investigation Regarding the issue about custom Middleware being more efficient than instrumentation library, I think this is not accurate. The sample shared is not 1:1 comparison with the instrumentation library. I ran some tests and as per my investigation, custom Middleware has slightly slower throughput compared to instrumentation library. Please see: open-telemetry/opentelemetry-dotnet#5163 That being said there are few areas of improvements we can further consider, some of them shared in this thread above as well
We plan to continue the discussions/efforts on improving perf and also work with .NET team for bringing support for trace instrumentation natively in |
This is an ongoing task to optimize and therefore this isn't a tracking issue. If you have specific areas and suggestions to improve them, I'd suggest they're split out into separate issues for you to work. |
dotnet/runtime#107333 Should fix the contention from runtime itself. |
Closing as dotnet/runtime#107333 is merged, and will be part of .NET 10. |
Feature Request
It's more of an issue than a feature request. We care a lot about performance for our services and in our local benchmarking we found that AspNetCoreInstrumentation for tracing is extremely inefficient. In a simple AspNetCore application with an extremely simple rest endpoint enabling tracing reduces the service throughput by around 15% which is huge for serving telemetry. (Drops from 10K RPS to 8.6K RPS)
Is your feature request related to a problem?
Yes, telemetry collection should be high performance and shouldn't impact service throughput.
Describe the solution you'd like:
Improve the performance of OTel instrumentation so it's not impacting service throughput.
Describe alternatives you've considered.
A custom middleware which is only using resources for processing what's essential on top of what AspNetCore out of box tracing instrumentation already provides.
Additional Context
The code for application used to measure the throughput and the results can be found here https://github.com/dpk83/otelperfimpactanalysis/tree/main/TelemetryPerfImpactAnalysis
The text was updated successfully, but these errors were encountered: