Implementation of Error Metric for Rest Actions #4401
Comments
This is an interesting proposal. In the past in the managed service we have seen cases where a customer's cluster was experiencing a high error rate. They want to know which APIs are impacted, and having this exposed as a stat would help in plotting it over time (by fetching it periodically via the stats APIs, like other stats) and correlating it with other metrics such as incoming request traffic, latency, changes in the number of nodes, etc. A few things you could think about further: …
This looks like a "performance counter" proposal, and I like the spirit of where this is coming from. As proposed, though, I would likely not use this to monitor my system. The main pieces missing in this proposal are time and the ability to correlate errors to the volume of activity. In active systems we typically emit an event every time there is a success or an error, then slice/dice/aggregate those events in a variety of ways in external systems, so this proposal should be some aggregation subset of that. For example, if we're talking about REST actions, I prefer something to the alternative proposed, but one that also counts successes:
The API would take …
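As a rough sketch only (the action names, field names, and numbers here are illustrative, not a committed format), a per-REST-action counter that records successes alongside 4xx/5xx errors might look like:

```json
{
  "rest_actions": {
    "document_index_action": { "success": 14320, "error_4xx": 12, "error_5xx": 3 },
    "search_action": { "success": 98211, "error_4xx": 245, "error_5xx": 17 }
  }
}
```

Counting successes as well as errors lets external tooling derive an error rate, rather than only an absolute error count.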
Thanks for the proposal @vzez. This should prove helpful for monitoring stats around node performance for REST actions. @dblock I like the idea of counting successes as well, and treating this as an overall performance counter as suggested. It will provide a more holistic view of how the system is actually behaving.
@dblock @mgodwan @vzez it sounds like we should look at the larger picture that was discussed here [1]. I largely agree with @dblock here: the idea of performance counters over time could (and probably should) be exposed as a metric, sliced at least by API. Who is the audience for these metrics? Certainly not operations, because it is not possible to extract sufficient context about what is failing and for how long. What these counters tell me is that nodes are (or were?) at least handling some requests and are (were?) available. Should it perhaps be part of the extended health checks [2]?
@dblock @mgodwan @shwetathareja I think we can offer both options: stats for each API and stats for a logical set of APIs. This helps reduce the response size as well. Users can pass a query param to choose whether stats are logically grouped or not (?rest_actions_grouped=true).
GET /_nodes/stats/rest_actions
GET /_nodes/stats/rest_actions?rest_actions_grouped=true
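As an illustration of the grouped form, assuming hypothetical logical groups such as "indexing", "search", and "admin" (neither the group names nor the numbers are part of the proposal), the rolled-up response could look roughly like:

```json
{
  "rest_actions": {
    "indexing": { "success": 14320, "error_4xx": 12, "error_5xx": 3 },
    "search": { "success": 98211, "error_4xx": 245, "error_5xx": 17 },
    "admin": { "success": 1045, "error_4xx": 2, "error_5xx": 0 }
  }
}
```

Grouping trades per-action detail for a smaller response, which matches the motivation stated above.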
@dblock Providing a duration option wouldn't be feasible, as we could only support a limited number of duration options, and that would also put unnecessary load on memory. Users can still derive duration-based stats themselves as per their requirements.
So what is the duration of …?
@dblock I have seen OpenSearch users consume these numbers by periodically calling the stats API at a fixed interval and then taking a diff across successive intervals. The diff is then plotted in a dashboard over time. This helps compute averages and get a sense of how the system is performing, e.g. the indexing rate is calculated from the difference in the indexed doc count stat over time. I can see value in doing this, given that polling is the only way OpenSearch users get stats about their cluster today. However, in the long term, I would think more about automating the above step for our users, or taking it one step further by pushing stat events to a pluggable sink. This was discussed in an earlier thread as well. More details here.
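For instance (the numbers are purely illustrative), two successive polls of a cumulative counter might return:

```json
[
  { "timestamp": "10:00:00", "indexed_doc_count": 120000 },
  { "timestamp": "10:01:00", "indexed_doc_count": 126000 }
]
```

The diff over that 60-second interval, (126000 - 120000) / 60 = 100 docs/sec, is what gets plotted as the indexing rate.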
Thanks @itiyama for clarifying. @dblock: yes, for the long term we should build a metrics framework that publishes timestamped events to a pluggable sink, so that users can write their own consumer plugins and build dashboards on top of it. To answer @reta's question as well:
Today OpenSearch publishes a lot of cumulative stats around indexing, search, API usage, etc. As @itiyama mentioned, plotting the diff over time is useful to understand how the system is being used and how it is performing over time.
+1 on the suggestion to provide both success and failure metrics. @vzez: You should also evaluate the …
What is the benefit you see in keeping the whole history at the node level? Keeping an absolute total count could be an issue, because it should always stay the same or increase, yet it can drop when a node goes down. Say at a certain point in time there were 100K failures and then one node goes down; the total becomes 90K failures, and that can't be corrected. So reading the graph would always require context, or perhaps node-level analysis, which you may not always need. Can we keep some sort of sliding window, maybe just for 5/10 minutes, so that the user can query and get the delta for just the last X minutes/seconds? I still feel that PA + RCA can do a better job here for dashboard building; otherwise it looks like heavy lifting on the client end.
@Gaganjuneja @shwetathareja …
@vzez Thanks for the clarification. We can build it incrementally. It isn't clear how much these PA limitations would actually matter; for example, on storage requirements, PA only keeps 5-second aggregated samples over 7 minutes, so how much storage would really be needed? Also consider the strengths of PA: these tools are very powerful for aggregations, windowing, etc., and may help customers build analytics on top of them. We can also think about how your proposed/preferred design can be integrated with PA, maybe in a next phase, to ease consumption of this data on the customer end.
@CaptainDredge explore the _nodes/usage API to expose these API-level success and failure counts, as the usage API already provides per-API call counts. It would make sense to keep all these counts in one place. Instead of changing the existing rest_actions, you could think about introducing a new map with all the counts and then deprecating the existing rest_actions in the next major version.
Is your feature request related to a problem? Please describe.
Whenever 4xx or 5xx errors occur for any API call, we don't have a way to know how many errors were caused by which API.
This feature is to provide customers with per-API metrics for 4xx and 5xx errors. It also helps to monitor system behaviour for the different APIs.
Describe the solution you'd like
Whenever an API returns a 4xx or 5xx, we'll keep the 4xx and 5xx error counts in memory for that API. We are going to use an existing API (/_nodes/stats) to return these count metrics; this API is already used to retrieve statistics. We will add a new metric in the response named “rest_actions”, which will contain the number of times 4xx and 5xx errors occurred per API. For now we'll capture error counts for the indexing and search APIs, and all other APIs' error counts will be combined.
New metric to be introduced: “rest_actions”
For API call: GET /_nodes/stats/rest_actions
The response will look as follows:
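A minimal sketch of the proposed shape, assuming a hypothetical node id and illustrative counts (per the proposal, indexing and search are tracked individually and all other APIs are combined into one bucket):

```json
{
  "nodes": {
    "node_id_1": {
      "rest_actions": {
        "document_index_action": { "error_4xx": 12, "error_5xx": 3 },
        "search_action": { "error_4xx": 245, "error_5xx": 17 },
        "other_actions": { "error_4xx": 31, "error_5xx": 5 }
      }
    }
  }
}
```

The exact action keys and nesting are placeholders, not a finalized schema.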
This approach is considered a good option for the reasons below:
Easy and quick to implement
Can easily be extended further if more info is required
Independent of node type
Describe alternatives you've considered
Another approach was to use the PA (Performance Analyzer) plugin. We would keep the error counts the same way as in the above approach, but the PA plugin would collect those metrics instead of exposing them via an API. PA collects the error counts every 5 seconds and stores them in metricsdb files.
This approach was discarded for the following reasons:
PA is a separate component that runs on every node, so extra work is needed to collect the metrics every 5 seconds
It stores the metrics on disk, so additional storage is required
It has some limitations, as it doesn't run on T2 nodes
One more alternative approach was to use /_nodes/usage to return the response, but this API is specific to usage; for example, it returns the number of times a REST action has been called. The error count doesn't fall under this category, so we chose the stats API instead.