[RFC] Metrics Framework #10141
@reta @Bukhtawar @shwetathareja @backslasht @suranjay @rishabh6788 @nknize @dblock Please provide your inputs.
Does this give us the opportunity to collect a bunch of counters related to a single end-to-end execution of some "operation"? As a start, will we be able to get an implementation that logs a bunch of counters for every HTTP request, so we'll be able to do out-of-process aggregation / computation of percentiles? The current stats-based approach that uses global (or per-node, per-index, per-shard) counters obscures a lot of useful information. For example, maybe I see stats showing that we're spending a lot of time doing thing A and a lot of time doing thing B. Are we spending a lot of time doing A and B in the same requests? Are some requests spending time on thing A, while other requests are spending time on thing B? Was there only one request that spent a really long time doing thing B? Request-level collection of counters would be extremely useful.
@Gaganjuneja thanks for putting this one up
I believe we should rename it to …
Yes, we can generate the request-level counters and connect the traces with metrics to see both the aggregated and request-level views.
@msfroh could you give an example of what kind of counters for every HTTP request you are looking for?
I believe with traces you should naturally get that, since A and B should be denoted by span or span hierarchy.
@Gaganjuneja To be fair, I really doubt that we could (and even should) do that, however we could add events + attributes to the spans and then query / aggregate over them as needed.
Yes, we can do that; the only challenge here comes with sampling.
I totally agree with you @reta here. We can start small and keep adding instruments based on the requirements. For now we can just start with a counter and keep the facade simple, as shown in the interface. I understand the above interface is motivated by the OTel schema, but we should be able to implement it for most metering solutions.
You bet! If my every per-request metrics-record wish could be granted, it would include things like (in priority order):
And of course, I would love to be able to correlate all of this back to the original query (not necessarily from a time series system, since I would expect that to store aggregate histograms of counters, but I should be able to fetch the raw logs that were aggregated for ingestion into the time series data store).
This is the goal of tracing - you get that out of the box.
Those could be attached to each span at the relevant subflow.
This could be extremely difficult and expensive to track, but we do try to track the CPU cost of each search task per shard; that could be attached to task spans as well.
@reta -- so all my wishes can potentially be granted? Can I have a pony too? 😁
haha @msfroh one more
Trace ids could be (and should be) in logs, so it should be possible to correlate all logs for a specific trace and request.
This information is already available as part of the TaskResourceTracking framework, which can be reused. Not sure if this should be attached as an event or an attribute to the span. The only use case where we need the metric and trace integration is: let's say we have a metric graph for HttpStatusCode and it shows a peak for statusCode=500, and now we want to see all the requests that failed during this peak.
So, I agree we should handle this on the storage side instead of integrating it in the server, because the sampling issue still persists there as well.
Hi @reta
a. Increasing counters - there's also a variation involving asynchronous counters that require a callback and operate at a specific cadence, which we can incorporate later based on our use cases.
Given the clarity around counters, I propose that we commence by implementing support for counters first and then gradually expand the framework. I would appreciate your thoughts on this.
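To make the proposal concrete, here is a minimal sketch of what such a counter facade might look like, mirroring the Histogram interface shown later in this thread (the Counter and Tags names are assumptions, not a final API):

public interface Counter {
    /**
     * Increments the counter by the given amount.
     * @param value amount to add; non-negative for an increasing counter.
     */
    void add(double value);

    /**
     * Increments the counter, associating the increment with tags.
     * @param value amount to add.
     * @param tags attributes/dimensions of the metric.
     */
    void add(double value, Tags tags);
}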
Thanks @Gaganjuneja
Correct for search tasks, which are probably the ones that deserve such tracking.
Adding …
Totally agreed.
Sure, it makes sense, thanks @Gaganjuneja. Counters are the most basic metric of all; it would be great to unify the API to support labels / tags / ... and other meta concepts that are applicable to all metrics.
@reta
Automatic buckets - a generic API where the histogram buckets are automatically provided by the implementation. Most likely we can go for the exponential histogram in the case of OpenTelemetry.
Explicit buckets - createHistogram(String name, String description, String unit, List buckets); - here, users can provide their own list of explicit buckets. This gives more control to the user in case they want to track specific boundary ranges.
Thanks @Gaganjuneja, I am unsure this is sufficient to understand what you are suggesting - what is …
Understanding the distribution of response times for an OpenSearch operation is crucial for assessing performance. Unlike average or median values, percentiles shed light on the tail end of the latency distribution, offering insights into user experience during extreme scenarios. For example, the 95th percentile represents the response time below which 95% of requests fall, providing a more realistic reflection of the typical user experience than average latency alone. This information is invaluable for system administrators and developers in identifying and addressing performance bottlenecks.
Various methods exist for calculating percentiles. One straightforward approach involves computing several percentiles from all data points, but this can be resource-intensive, since it requires access to every data point. Alternatively, a histogram can be used to approximate the distribution without retaining every data point. Constructing a histogram typically involves the following steps:
1. Define the bucket boundaries that partition the value range.
2. Assign each observed value to the bucket it falls into.
3. Keep a count of observations per bucket.
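To make these steps concrete, here is a small self-contained sketch (purely illustrative, not the proposed API) that buckets latency samples into explicit boundaries and counts the observations per bucket:

import java.util.Arrays;

public class HistogramConstructionExample {
    public static void main(String[] args) {
        double[] boundaries = {10, 50, 100, 500};        // bucket upper bounds in ms
        long[] counts = new long[boundaries.length + 1]; // last slot is the overflow bucket
        double[] samples = {3.2, 42.0, 97.5, 120.0, 880.0};

        for (double value : samples) {
            // find the first bucket whose upper bound covers the value
            int bucket = 0;
            while (bucket < boundaries.length && value > boundaries[bucket]) {
                bucket++;
            }
            counts[bucket]++;
        }
        System.out.println(Arrays.toString(counts)); // [1, 1, 1, 1, 1]
    }
}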
Histograms are particularly useful for visualizing the distribution of continuous data, providing insights into central tendency, spread, skewness, and potential outliers. Patterns such as a normal, skewed, or bimodal distribution can be easily identified, aiding in data interpretation. In OpenTelemetry, a histogram is a metric type that allows observation of the statistical distribution of a set of values over time, offering insights into spread, central tendency, and data shape. Other telemetry tools like Dropwizard and Micrometer also support histograms; the major difference lies in how they define buckets.
To unify the histogram APIs across implementations, the following two APIs are proposed:
Automatic buckets approach - a generic API where histogram buckets are automatically provided by the implementation, likely following the exponential histogram option from OTel.
Explicit bucket approach - users can provide their own list of explicit buckets, offering more control for tracking specific boundary ranges.
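Reading the two proposals together, the factory methods might look something like this (the MetricsRegistry name and exact signatures are assumptions; the concrete API is discussed below):

import java.util.List;

public interface MetricsRegistry {
    /**
     * Automatic buckets: the implementation chooses the bucket layout,
     * e.g. OTel's exponential histogram aggregation.
     */
    Histogram createHistogram(String name, String description, String unit);

    /**
     * Explicit buckets: the caller supplies the bucket boundaries to
     * track specific ranges of interest.
     */
    Histogram createHistogram(String name, String description, String unit, List<Double> buckets);
}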
Thanks for the details, but I still don't see the API - what is …
As mentioned above:
public interface Histogram {
/**
* record the value.
* @param value value to be recorded.
*/
void record(double value);
/**
* record value along with the tags.
*
* @param value value to be recorded.
* @param tags attributes/dimensions of the metric.
*/
void record(double value, Tags tags);
}
@reta, your thoughts here?
I am trying to figure out 2 things here:
Yes, most of the tools provide dynamic bucketing or percentile calculations. We can also keep the API simple for now and provide a metric-provider-level config. We can live with one single API for now.
Yes, it's more generalised in OTel, but we need this feature to create the dimensions based on data like index name or shardId etc. There are ways to achieve this in other tools by overriding the metrics registry and storage, where we can explicitly keep the metrics at the dimension level. OTel provides this out of the box.
The purpose of tags is clear, but this is usually done at the per-meter level (histogram, counter, ...). I am trying to understand how it works at the value level - would it be represented by multiple meters at the end? (We would need to add the javadocs to explain the behaviour anyway.)
Sure, I will add the javadocs. It works like this: it creates a Map<Meter, Map<UniqueTagsList, Value>>. There will be a distinct value per meter and unique tags combination, so yes, we can say that it's represented by multiple meters at the end. This is the sample exported output of OTel metrics.
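As an illustration of that behaviour (a hypothetical usage snippet, not the sample output referenced above, which did not survive formatting): recording the same histogram with different tag combinations produces a distinct series per unique combination. The fluent Tags builder shown here is assumed:

// assumes the Histogram interface above and a MetricsRegistry facade
Histogram latency = metricsRegistry.createHistogram("http.request.duration", "HTTP request latency", "ms");

// two recordings with different tag sets -> two separate exported series
latency.record(12.5, Tags.create().addTag("index", "logs").addTag("status_code", "200"));
latency.record(480.0, Tags.create().addTag("index", "logs").addTag("status_code", "500"));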
Got it, thank you, I think we are good to go with Histograms :)
Extending it further to add support for gauges - a gauge is the current value at the time it is read. Gauges can be synchronous as well as asynchronous. A synchronous gauge is normally used when the measurements are exposed via a subscription to change events. I think an asynchronous gauge makes more sense for our use cases, to record resource utilisation in a periodic manner; it can also be used to capture the queue size at any point in time, etc. Anyway, the synchronous gauge is experimental in OpenTelemetry (refer to the OTel discussion - open-telemetry/opentelemetry-java#6272). Proposed API -
ObservableInstrument
ObservableGauge
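Since the code for the proposed API did not survive formatting, here is a sketch of what the asynchronous gauge method might look like on the metrics facade (the Supplier-based callback, Closeable handle, and creation-time Tags are assumptions consistent with the discussion):

import java.io.Closeable;
import java.util.function.Supplier;

public interface MetricsRegistry {
    /**
     * Registers an observable gauge whose value is pulled from the callback
     * each time the backend collects metrics.
     *
     * @param name gauge name.
     * @param description human-readable description.
     * @param unit unit of measurement.
     * @param valueProvider callback invoked at collection time.
     * @param tags attributes/dimensions of the metric.
     * @return a handle that stops the observation when closed.
     */
    Closeable createGauge(String name, String description, String unit, Supplier<Double> valueProvider, Tags tags);
}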
@reta, your thoughts here?
@Gaganjuneja hm ... the gauge part is looking like an option, but I believe we should not be driven by the OTel APIs but keep our APIs consistent instead:
The other question I have is that it is inconsistent with the other meters: those accept tags per value; I think we should stick to this for gauges as well.
Thanks @reta for reviewing.
I think we can go ahead with this for now.
Thanks @Gaganjuneja
Using …
Greetings @Gaganjuneja. Kindly, is there a way currently to export metrics/spans to SigNoz? I mean, the configuration currently in the docs states that gRPC only exports to localhost. Is there a way to configure OpenSearch to export metrics/spans to a specific endpoint, i.e. a configurable URL for a remote collector? Thank you.
For an external SigNoz, we can go through a local SigNoz/OTel collector for now. We can externalise this property as a config.
@Gaganjuneja we followed the documentation step by step and configured everything related - any suggestions? OpenSearch is deployed with a Helm chart; we tested it on versions 2.12 and 2.13. Thank you.
@zalseryani strange. Could you please share the exact steps you followed and the values for the following settings?
Hi, I am trying to understand the integration with Prometheus.
What is the plan to enable integration with Prometheus?
What is the role of the Collector box in the schema above? Is that a "user managed" component? Is it the component that would be scraped by Prometheus, or the one that would eventually push metrics to Prometheus?
What is the function of the periodic exporter? Does it periodically push the in-memory cache of collected metrics to the Collector?
When a metric store/sink requires a specific format of the data, who is responsible for the conversion, and when?
Thank you for your time, we were missing the second config you shared with us.
Appreciating your time and efforts :)
@zalseryani, Thanks for your note. We can do it in the upcoming release. Would you mind creating an issue for this?
Is your feature request related to a problem? Please describe.
Feature Request #1061
Feature Request #6533
PR - #8816
Describe the solution you'd like
Problem Statement
The current OpenSearch stats APIs offer valuable insights into the inner workings of each node and the cluster as a whole. However, they lack certain details such as percentiles and do not provide the semantics of richer metric types like histograms. Consequently, identifying outliers becomes challenging. OpenSearch needs comprehensive metric support to effectively monitor the cluster. Recent issues and RFCs have attempted to address this in a piecemeal fashion, and we are currently engaged in a significant effort to instrument OpenSearch code paths. This presents an opportune moment to introduce comprehensive metrics support.
Tenets
Metrics Framework
It is widely recognized that observability components like tracing and metrics introduce overhead. Therefore, designing the Metrics Framework for OpenSearch requires careful consideration. This framework will provide abstractions, governance, and utilities to enable developers and users to easily emit metrics. Let's delve deeper into these aspects:
HLD
Implementation
OpenTelemetry offers decent support for metrics, and we can leverage the existing telemetry-otel plugin to provide the implementation for metrics as well.
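As an illustration of how server code might consume the facade once the telemetry-otel plugin supplies the implementation (the class, metric name, and createCounter method here are hypothetical):

public class SearchRejectionMetrics {
    private final Counter rejections;

    public SearchRejectionMetrics(MetricsRegistry registry) {
        // the registry implementation is expected to be provided by the telemetry plugin
        this.rejections = registry.createCounter("search.rejections", "Rejected search requests", "1");
    }

    public void onRejection(String index) {
        // tag the increment with the index so it can be aggregated per dimension
        rejections.add(1.0, Tags.create().addTag("index", index));
    }
}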