[PROPOSAL] Performance counters for rest actions #5623

Closed
CaptainDredge opened this issue Dec 23, 2022 · 3 comments
Labels
distributed framework, enhancement (Enhancement or improvement to existing feature or request)

Comments

@CaptainDredge
Contributor

What are you proposing?

This feature provides per-API success and failure counters. It extends the _nodes/stats API to expose response code counters for every API that has been called at least once on the cluster. The response code stats in the _nodes/stats response look like this:

"rest_actions": {
    "all": {
        "<response code 1>": <count>,
        "<response code 2>": <count>
    },
    "<rest action name>": {
        "<response code 1>": <count>,
        "<response code 2>": <count>
    }
}

Today, when 4xx or 5xx errors occur, there is no way to know how many errors were returned by which API. These counters make it possible to monitor system behaviour per API.
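To make the shape of the rest_actions section concrete, here is a minimal sketch of how such counters could be accumulated and rolled up into the "all" entry. It is only an illustration, not the OpenSearch implementation: the class `RestActionStats` and its methods `record` and `to_dict` are hypothetical names introduced for this example.

```python
import threading
from collections import Counter, defaultdict


class RestActionStats:
    """Hypothetical in-memory tracker (assumption, not OpenSearch code)."""

    def __init__(self):
        self._lock = threading.Lock()
        # REST action name -> Counter mapping status code (as string) -> count
        self._per_action = defaultdict(Counter)

    def record(self, action, status_code):
        """Count one response for the given REST action."""
        with self._lock:
            self._per_action[action][str(status_code)] += 1

    def to_dict(self):
        """Build the "rest_actions" section, including the "all" roll-up."""
        with self._lock:
            total = Counter()
            for counts in self._per_action.values():
                total.update(counts)
            section = {"all": dict(total)}
            for action, counts in self._per_action.items():
                section[action] = dict(counts)
            return section


stats = RestActionStats()
for action, code in [
    ("main_action", 200),
    ("create_index_action", 200),
    ("create_index_action", 400),
    ("create_index_action", 400),
]:
    stats.record(action, code)

snapshot = stats.to_dict()
print(snapshot)
```

Only actions that have been recorded at least once appear in the output, which matches the "called at least once" behaviour described above.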

What users have asked for this feature?

#4401

What is the developer experience going to be?

The OpenSearch _nodes/stats API will have an additional rest_actions section for each node in the JSON response:

{
    "_nodes": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "cluster_name": "runTask",
    "nodes": {
        "0HKtbbpBQWKHBeKEygDuFQ": {
            "timestamp": 1665351289994,
            "name": "runTask-0",
            "transport_address": "127.0.0.1:9300",
            "host": "127.0.0.1",
            "ip": "127.0.0.1:9300",
            "roles": [
                "cluster_manager",
                "data",
                "ingest",
                "remote_cluster_client"
            ],
            "attributes": {
                "testattr": "test",
                "shard_indexing_pressure_enabled": "true"
            },
            "rest_actions": {
                "all": {
                    "200": 6,
                    "400": 3
                },
                "main_action": {
                    "200": 2
                },
                "nodes_usage_action": {
                    "200": 1
                },
                "get_indices_action": {
                    "200": 2
                },
                "create_index_action": {
                    "200": 1,
                    "400": 3
                }
            }
        }
    }
}
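Given a response shaped like the one above, a monitoring script could pull out the failing actions for each node. The sketch below is illustrative only: `error_counts` is a hypothetical helper, and `sample` is a trimmed copy of the response above that keeps just the fields the helper reads.

```python
def error_counts(nodes_stats):
    """Return {node name: {action: number of 4xx/5xx responses}}."""
    result = {}
    for node in nodes_stats.get("nodes", {}).values():
        per_action = {}
        for action, codes in node.get("rest_actions", {}).items():
            if action == "all":
                continue  # skip the roll-up so errors are not counted twice
            errors = sum(
                count for code, count in codes.items() if code[:1] in ("4", "5")
            )
            if errors:
                per_action[action] = errors
        result[node.get("name", "unknown")] = per_action
    return result


# Trimmed copy of the example response from the proposal.
sample = {
    "nodes": {
        "0HKtbbpBQWKHBeKEygDuFQ": {
            "name": "runTask-0",
            "rest_actions": {
                "all": {"200": 6, "400": 3},
                "main_action": {"200": 2},
                "nodes_usage_action": {"200": 1},
                "get_indices_action": {"200": 2},
                "create_index_action": {"200": 1, "400": 3},
            },
        }
    }
}

failing = error_counts(sample)
print(failing)
```

For the sample data, the only action with errors is create_index_action, with three 400 responses on node runTask-0.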

Are there any security considerations?

No

Are there any breaking changes to the API?

No, there are no breaking changes. All changes will be backward compatible.

What is the user experience going to be?

Today OpenSearch publishes many cumulative stats covering indexing, search, API usage and more. Users poll the stats API periodically and plot the difference between successive readings to understand how the system is being used and how it performs over time.
Let's look at three user stories to understand the usefulness of the feature:

  1. A user sees failures and wants to understand which APIs are impacted: only search, only indexing, or all APIs. The user can also configure alarms on these counters if needed.
  2. A user is upgrading their cluster to a different version (using a rolling restart) and starts seeing errors, but only a few requests fail. The user sends requests through a load balancer that distributes traffic across all nodes. Because these API-level stats are reported per node, they show which nodes are failing requests. That makes debugging faster than analysing the logs, which usually takes a lot of time and effort.
  3. An operator sees API failures and wants to know which APIs fail frequently, in order to involve subject matter experts for specific APIs such as _search and _bulk in further debugging. A more advanced scenario is a system that monitors REST action stats and automatically cuts a ticket to the team responsible for operating a specific part of the cluster, such as indexing.
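Because the counters are cumulative, a monitoring job would typically compare two successive readings of a node's rest_actions section. The following is a rough sketch of that step under a few assumptions: `rest_action_deltas` and `actions_over_threshold` are hypothetical helpers, the threshold is arbitrary, and a counter that decreases is assumed to have been reset (for example by a node restart).

```python
def rest_action_deltas(previous, current):
    """Per-action, per-status increase between two cumulative snapshots."""
    deltas = {}
    for action, codes in current.items():
        before = previous.get(action, {})
        changed = {}
        for code, count in codes.items():
            diff = count - before.get(code, 0)
            if diff < 0:
                # Assumption: a drop means the counter was reset,
                # so the whole current value is new.
                diff = count
            if diff:
                changed[code] = diff
        if changed:
            deltas[action] = changed
    return deltas


def actions_over_threshold(deltas, max_errors):
    """Actions (excluding "all") with more than max_errors 4xx/5xx responses."""
    noisy = []
    for action, codes in deltas.items():
        if action == "all":
            continue
        errors = sum(n for code, n in codes.items() if code[:1] in ("4", "5"))
        if errors > max_errors:
            noisy.append(action)
    return sorted(noisy)


previous = {
    "all": {"200": 3, "400": 3},
    "get_indices_action": {"200": 2},
    "create_index_action": {"200": 1, "400": 3},
}
current = {
    "all": {"200": 7, "400": 8},
    "get_indices_action": {"200": 5},
    "create_index_action": {"200": 2, "400": 8},
}

deltas = rest_action_deltas(previous, current)
noisy = actions_over_threshold(deltas, max_errors=4)
print(deltas)
print(noisy)
```

Running the same comparison per node would cover the second story, and mapping the flagged action names to owning teams would cover the third.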

Are there breaking changes to the User Experience?

No

Why should it be built? Any reason not to?

It should be built to give users a better view of failures across APIs in OpenSearch, which points further debugging in the right direction and makes the overall process faster. It also allows users to build better monitoring solutions using the REST action stats.

What will it take to execute?

It will involve changes to the code behind the stats API.

Any remaining open questions?

NA

@CaptainDredge added the enhancement and untriaged labels Dec 23, 2022
@anasalkouz
Member

Hi @CaptainDredge, thanks for taking the time to put this proposal together. Is there a reason to track this as a separate issue? I would prefer to keep our discussion in the original issue #4401 for better tracking.

@CaptainDredge
Contributor Author

@anasalkouz I'm open to keeping the discussion on the original issue, but I assumed the contribution process is to have separate issues: one for the feature request, which explains the problem, and another for the proposal, which explains the solution. What do you suggest? Should we close this one and add all the details to the original issue description?

@anasalkouz
Member

@CaptainDredge
Sorry for the late response. It seems the original issue also contains a proposal. I would move this to the original issue to get more attention from the community.
