[RFC] Add metrics and tracing framework in Opensearch #1061
Comments
Sounds a bit overengineered to me. I would implement OpenTelemetry, which can handle the metrics but also help with tracing (logging support is coming soon). The problem with "metrics" in OpenSearch today is that they are exposed over JMX, which is highly inefficient compared to modern approaches of exposing data as an endpoint (as Prometheus does) and allowing metrics to be scraped and stored easily. Similarly, I am sure there are good cases in OpenSearch where you might want to do some tracing too.
+1 on OpenTelemetry and tracing! OpenSearch does not provide hooks to collect detailed request-level metrics/tracing easily. Once we have those hooks, we can integrate OpenTelemetry for metric collection, tracing, etc. I don't think it is possible to add fine-grained metrics in the existing code with JMX, but I will check it out.
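As a rough sketch of what such an integration could look like once hooks exist, using the public OpenTelemetry Java API; the instrumentation scope, metric names, and attributes below are illustrative assumptions, not anything agreed for OpenSearch:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongHistogram;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class OtelHookExample {
    // Instrumentation scope name is illustrative only.
    private static final Meter METER = GlobalOpenTelemetry.getMeter("org.opensearch.example");
    private static final Tracer TRACER = GlobalOpenTelemetry.getTracer("org.opensearch.example");

    private static final LongHistogram REQUEST_LATENCY = METER
        .histogramBuilder("request.latency")
        .ofLongs()
        .setUnit("ms")
        .build();

    void onRequestCompleted(String operation, long tookMillis) {
        // Record a per-request latency sample; the OTel SDK exporter configured
        // at runtime (e.g. a Prometheus endpoint) decides how it is aggregated and exposed.
        REQUEST_LATENCY.record(tookMillis,
            Attributes.of(AttributeKey.stringKey("operation"), operation));
    }

    void tracedOperation(Runnable work) {
        // A manually created span around one unit of work in the request chain.
        Span span = TRACER.spanBuilder("bulk-shard-operation").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            work.run();
        } finally {
            span.end();
        }
    }
}
```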
Breaking down the problem
If you have ever tried to instrument ElasticSearch, you will learn this is a really bad idea, especially per request. It might be useful for debugging, but generally the data will make no sense. I have done this in the past.
@jkowall Why do you think request-based latencies are bad? Is that bad in general, just for OpenSearch, or just for cases where throughput is really high? The way we collect stats in OpenSearch gives us averages, which end up hiding a lot of issues. Request-based latencies help us track outliers easily. Aggregating metrics early on leaves me with very little information to troubleshoot issues after they have already occurred. For the kind of clusters I deal with on a daily basis, this information is really important, and I often end up enabling trace logging after the fact and then waiting for the issue to re-occur, which sometimes never happens, or the trace logs don't have enough information. Reading the trace logs is pretty tedious at this point, and doing it across requests is a nightmare.
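To make the averaging point concrete, a tiny synthetic illustration (the numbers are made up, purely for demonstration):

```java
import java.util.Arrays;

public class AverageVsPercentile {
    public static void main(String[] args) {
        // Synthetic latencies: 980 requests at 10 ms, 20 requests at 2000 ms.
        double[] latenciesMs = new double[1000];
        Arrays.fill(latenciesMs, 10.0);
        Arrays.fill(latenciesMs, 980, 1000, 2000.0);

        double avg = Arrays.stream(latenciesMs).average().orElse(0);
        Arrays.sort(latenciesMs);
        double p99 = latenciesMs[(int) Math.ceil(0.99 * latenciesMs.length) - 1];

        // Prints roughly: avg=49.8 ms, p99=2000.0 ms -- the average looks fine
        // while 2% of requests are actually taking two seconds.
        System.out.printf("avg=%.1f ms, p99=%.1f ms%n", avg, p99);
    }
}
```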
Heya @itiyamas, what are you thinking the next steps will be for this?
My challenge to your suggestion is that tracing on ElasticSearch is very difficult. When you install autoinstrumentation and collect traces, they will make no sense at all. I have done this with several tools and the data was useless. Additionally, the overhead of instrumentation had a noticeable performance impact. If you want to collect metrics or response data, that would be more reasonable. We actually already have something similar that @AmiStrn worked on, around the profiler that already exists.
@jkowall @itiyamas @Bukhtawar trying to summarize the discussion points and possible future developments on the subject of metrics / tracing / stats:
Does it make sense to create RFCs for metrics / tracing / JFRs and at least run some experiments to understand (a) how useful that would be and (b) how difficult that would be? Thoughts?
@reta OpenTracing is deprecated; it should use OpenTelemetry if anything. But yes, I agree that autoinstrumentation is not a good idea, and manually adding the code could add overhead depending on where in the code you instrument. I agree that focusing on metrics and stats is a better approach. @AmiStrn was working on this earlier in the project, but we switched to other work when we realized that the governance for OpenSearch was not going to include other companies outside of AWS. When this changes, we might contribute core features to make the project better in general.
We have been looking into instrumenting the OpenSearch code recently. Even though stats provide a good mechanism, they lose a lot of detail, like percentiles, which makes it much harder to debug issues in production. Wouldn't it be great to add a metrics framework in OpenSearch that allows a developer to add metrics easily at any part of the code, without having to know the exact stats object the metric belongs to?
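A rough sketch of what such an API could feel like; every name here (MetricsRegistry, Counter, Timer) is hypothetical and not existing OpenSearch code:

```java
// Hypothetical API sketch: a developer records a metric from anywhere in the
// code without knowing which stats object it should roll up into.
interface MetricsRegistry {
    Counter counter(String name);
    Timer timer(String name);
}

interface Counter {
    void increment(long delta);
}

interface Timer extends AutoCloseable {
    Timer start();
    @Override void close();   // stops the timer and records the elapsed time
}

// Usage inside any component (class and metric names are illustrative):
class ShardBulkHandler {
    private final MetricsRegistry metrics;

    ShardBulkHandler(MetricsRegistry metrics) {
        this.metrics = metrics;
    }

    void execute(Runnable bulkWork) {
        metrics.counter("bulk.requests").increment(1);
        try (Timer ignored = metrics.timer("bulk.shard.latency").start()) {
            bulkWork.run();
        }
    }
}
```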
The framework can, for example, be integrated with RestActions and emit timing and error metrics per operation by default. Similarly, we could pass this metrics object around via ThreadContext down the executor chain and correlate timing metrics together in a single block per request. The metrics can have different levels, allowing us to skip or add metric calculation on the fly, similar to what we have in the logger.
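One way this could look, continuing the hypothetical sketch above; the ThreadContext key, MetricLevel enum, and RequestMetrics class are assumptions for illustration, not existing interfaces:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: carry a per-request metrics block through ThreadContext
// so work done on other threads in the executor chain lands in the same block.
enum MetricLevel { OFF, COARSE, FINE }   // analogous to logger levels

class RequestMetrics {
    private final Map<String, Long> timingsMillis = new ConcurrentHashMap<>();
    private volatile MetricLevel level = MetricLevel.COARSE;

    void setLevel(MetricLevel level) {
        this.level = level;   // can be flipped at runtime, like changing a log level
    }

    void recordTiming(MetricLevel required, String name, long millis) {
        // Skip the measurement entirely if the current level is too low,
        // the same way a logger skips a disabled debug statement.
        if (required.ordinal() <= level.ordinal()) {
            timingsMillis.merge(name, millis, Long::sum);
        }
    }

    Map<String, Long> timings() {
        return timingsMillis;
    }
}

// In a RestAction (illustrative): create the block, stash it in the thread
// context, and let downstream transport/executor code add to the same block.
// threadContext.putTransient("request.metrics", new RequestMetrics());
// ...later, on another thread in the chain...
// RequestMetrics metrics = threadContext.getTransient("request.metrics");
// metrics.recordTiming(MetricLevel.FINE, "bulk.new_action", tookMillis);
```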
Imagine that you add a new action within the bulk API chain, and suddenly a few more requests start taking longer. One way to track this is to add a stat for the new operation within the bulk stats. But because stats are always averaged or use precomputed percentiles, it is really tricky to confirm whether the new operation is the culprit. If there were a single metrics block that allowed us to correlate these metrics, it would be really simple to determine causation.
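Continuing the hypothetical per-request block from the previous sketch, a slow request would then carry all of its sub-operation timings together, so the new action's contribution is visible directly rather than being averaged away (names and numbers are purely illustrative):

```java
// Hypothetical illustration: sub-operations of one bulk request recorded into
// one correlated block instead of separate fleet-wide averages.
RequestMetrics metrics = new RequestMetrics();
metrics.recordTiming(MetricLevel.COARSE, "bulk.rest_parse", 2);
metrics.recordTiming(MetricLevel.COARSE, "bulk.new_action", 850);  // the suspect step
metrics.recordTiming(MetricLevel.COARSE, "bulk.shard_write", 40);

// Emitting metrics.timings() as one block for this request makes it obvious
// which step accounts for the slow request, with no averaging involved.
System.out.println(metrics.timings());
```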
Now that I have talked about the metric generation framework: publishing can be implemented in a pluggable fashion to different sinks. We can provide a default implementation that writes a metric log file format, and other sinks can be plugged in via metrics plugins.
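A sketch of what the pluggable publishing side could look like; the MetricsSink interface and the log-file sink below are assumptions for illustration, not a proposed final API:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Map;

// Hypothetical publishing SPI: plugins implement MetricsSink, and the framework
// hands each completed per-request block to every registered sink.
interface MetricsSink {
    void publish(String requestId, Map<String, Long> timingsMillis) throws IOException;
}

// Default implementation: append one line per request to a metrics log file.
class LogFileMetricsSink implements MetricsSink {
    private final Path logFile;

    LogFileMetricsSink(Path logFile) {
        this.logFile = logFile;
    }

    @Override
    public void publish(String requestId, Map<String, Long> timingsMillis) throws IOException {
        StringBuilder line = new StringBuilder(requestId);
        timingsMillis.forEach((name, millis) -> line.append(' ').append(name).append('=').append(millis));
        line.append(System.lineSeparator());
        Files.write(logFile, line.toString().getBytes(StandardCharsets.UTF_8),
            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```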