[PROPOSAL] Performance counters for rest actions #5623

CaptainDredge · 2022-12-23T09:04:30Z

What are you proposing?

This feature provides per-API success and failure counters for users. It extends the _nodes/stats API to expose response code counters for every API that has been called at least once on the cluster. The response code stats in the _nodes/stats response look like this:

"rest_actions": {
                "all": {
                    "<response code 1>": <count>,
                    "<response code 2>": <count>
                },
                "<rest action name>": {
                    "<response code 1>": <count>,
                    "<response code 2>": <count>
                }
            }

Today, when 4xx or 5xx errors occur on an API call, there is no way to know how many errors were caused by which API. These counters make it possible to monitor system behaviour per API.
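To make the proposed response shape concrete, here is a minimal sketch of a counter structure that produces it. This is an illustration in Python, not the OpenSearch implementation: the class name `RestActionCounters` and its methods are hypothetical, and it assumes the "all" entry is simply the sum of the per-action counters.

```python
from collections import defaultdict
from threading import Lock


class RestActionCounters:
    """Hypothetical in-memory counters keyed by REST action name and status code."""

    def __init__(self):
        self._lock = Lock()
        # action name -> status code (as a string) -> count
        self._counts = defaultdict(lambda: defaultdict(int))

    def record(self, action_name, status_code):
        """Increment the counter for one completed REST request."""
        with self._lock:
            self._counts[action_name][str(status_code)] += 1

    def to_dict(self):
        """Render the counters in the shape of the proposed rest_actions section."""
        with self._lock:
            per_action = {name: dict(codes) for name, codes in self._counts.items()}
        # Assumption: "all" is the sum of every per-action counter.
        totals = defaultdict(int)
        for codes in per_action.values():
            for code, count in codes.items():
                totals[code] += count
        return {"rest_actions": {"all": dict(totals), **per_action}}


counters = RestActionCounters()
counters.record("main_action", 200)
counters.record("create_index_action", 400)
print(counters.to_dict())
```

Only actions that were called at least once appear in the output, which matches the behaviour described above. An action literally named "all" would collide with the roll-up key in this sketch.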

What users have asked for this feature?

#4401

What is the developer experience going to be?

The OpenSearch _nodes/stats API will have an additional rest_actions section for each node in the JSON response:

{
    "_nodes": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "cluster_name": "runTask",
    "nodes": {
        "0HKtbbpBQWKHBeKEygDuFQ": {
            "timestamp": 1665351289994,
            "name": "runTask-0",
            "transport_address": "127.0.0.1:9300",
            "host": "127.0.0.1",
            "ip": "127.0.0.1:9300",
            "roles": [
                "cluster_manager",
                "data",
                "ingest",
                "remote_cluster_client"
            ],
            "attributes": {
                "testattr": "test",
                "shard_indexing_pressure_enabled": "true"
            },
            "rest_actions": {
                "all": {
                    "200": 6,
                    "400": 3
                },
                "main_action": {
                    "200": 2
                },
                "nodes_usage_action": {
                    "200": 1
                },
                "get_indices_action": {
                    "200": 2
                },
                "create_index_action": {
                    "200": 1,
                    "400": 3
                }
            }
        }
    }
}
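As a rough sketch of how a client might consume this section, the snippet below computes error counts per action from the rest_actions values in the example response above. The helper name `error_counts` is hypothetical, and treating every status code of 400 or higher as an error is an assumption.

```python
def error_counts(rest_actions):
    """Return {action: (errors, total)} for each action, skipping the "all" roll-up."""
    result = {}
    for action, codes in rest_actions.items():
        if action == "all":
            continue
        total = sum(codes.values())
        # Assumption: any 4xx or 5xx status code counts as an error.
        errors = sum(n for code, n in codes.items() if int(code) >= 400)
        result[action] = (errors, total)
    return result


# The rest_actions section copied from the example response.
sample = {
    "all": {"200": 6, "400": 3},
    "main_action": {"200": 2},
    "nodes_usage_action": {"200": 1},
    "get_indices_action": {"200": 2},
    "create_index_action": {"200": 1, "400": 3},
}

for action, (errors, total) in error_counts(sample).items():
    print(f"{action}: {errors} of {total} requests failed")
```

In the example data, create_index_action is the only action with failures, which is the kind of per-API attribution the proposal is after.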

Are there any security considerations?

No

Are there any breaking changes to the API?

No, there are no breaking changes. All changes will be backward compatible.

What is the user experience going to be?

Today OpenSearch publishes many cumulative stats around indexing, search, API usage, and so on. Users poll the stats API periodically and plot the difference between successive readings to understand how the system is being used and how it performs over time.
Let's look at three user stories to understand the usefulness of this feature.

A user sees failures and wants to understand which APIs are impacted: is only search failing, only indexing, or all APIs? The user can also configure alarms on these counters if needed.
A user is upgrading their cluster to a different version (using a rolling restart) and starts seeing errors, but only a few requests are failing. The user sends requests through a load balancer that distributes traffic across all nodes. These API-level stats are reported per node, so they show which nodes are failing requests and make debugging faster than analysing logs, which usually takes a lot of time and effort.
An operator sees API failures and wants to know which APIs are failing frequently, so that subject matter experts for specific APIs such as _search and _bulk can be involved in further debugging. A more advanced scenario is a system that monitors REST action stats and automatically cuts a ticket to the team responsible for operating a specific part of the cluster, such as indexing.
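The polling workflow in these stories can be sketched as follows: take two snapshots of the "nodes" section of the _nodes/stats response, subtract the cumulative counters, and report which nodes gained new 4xx or 5xx responses. The function names are hypothetical, and the sketch assumes that counters restart from zero when a node restarts, which the proposal does not state.

```python
def counter_delta(previous, current):
    """Subtract two cumulative rest_actions snapshots taken from the same node."""
    delta = {}
    for action, codes in current.items():
        old_codes = previous.get(action, {})
        changed = {}
        for code, count in codes.items():
            old = old_codes.get(code, 0)
            # Assumption: a smaller value means the node restarted and its
            # counters began again from zero, so the new value is the delta.
            diff = count - old if count >= old else count
            if diff > 0:
                changed[code] = diff
        if changed:
            delta[action] = changed
    return delta


def nodes_with_new_errors(prev_nodes, curr_nodes):
    """Return {node_id: {action: {code: new_count}}} for new 4xx/5xx responses."""
    failing = {}
    for node_id, node in curr_nodes.items():
        prev_actions = prev_nodes.get(node_id, {}).get("rest_actions", {})
        delta = counter_delta(prev_actions, node.get("rest_actions", {}))
        errors = {}
        for action, codes in delta.items():
            if action == "all":
                continue  # skip the roll-up to avoid double counting
            bad = {code: n for code, n in codes.items() if int(code) >= 400}
            if bad:
                errors[action] = bad
        if errors:
            failing[node_id] = errors
    return failing
```

Passing the "nodes" object from two successive responses to `nodes_with_new_errors` gives a per-node, per-action view of new failures, which is what an alarm or ticketing system would act on.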

Are there breaking changes to the User Experience?

No

Why should it be built? Any reason not to?

It should be built to give users a better view of failures across APIs in OpenSearch, which points further debugging in the right direction and makes the overall process faster. It also allows users to build better monitoring solutions using REST action stats.

What will it take to execute?

It will involve changes to the code behind the stats API.

Any remaining open questions?

NA

anasalkouz · 2022-12-27T21:24:16Z

Hi @CaptainDredge, thanks for spending the time to put this proposal together. Is there a reason this was added as a separate issue? I would prefer to keep our discussion in the original issue #4401 for better tracking.

CaptainDredge · 2022-12-28T05:16:59Z

@anasalkouz I'm open to keeping the discussion on the original issue, but I assumed the contribution process is to have separate issues: one for the feature request, which explains the problem, and another for the proposal, which explains the solution. What do you suggest? Should we close this one and add all the details to the original issue description?

anasalkouz · 2023-01-13T18:21:54Z

@CaptainDredge
Sorry for the late response. It seems the original issue also has a proposal. I would move this to the original issue to get more attention from the community.

CaptainDredge added the enhancement and untriaged labels Dec 23, 2022

tlfeng added the distributed framework label Dec 27, 2022

CaptainDredge closed this as completed Jan 4, 2023

anasalkouz removed the untriaged label Oct 17, 2023
