Update the Metrics section on the Observability overview page #79093

Closed
sorantis opened this issue Oct 1, 2020 · 15 comments · Fixed by #90879
Assignees
Labels
Feature:Observability Landing Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services

Comments

@sorantis

sorantis commented Oct 1, 2020

Today the Observability overview page contains the following Metrics section:
Screen Shot 2020-10-01 at 12 07 01
While the information is relevant, it did not prove to be actionable. Instead, based on the feedback we got, users would rather see the top 5 or 10 hosts with the highest CPU or RAM usage.

The proposal is to reuse the table view used for services and show the following host level information:

Hosts: count(host.name)

| Uptime | Provider & OS icon | Hostname | CPU % | Load 15 | IOWait | Disk Used % |
| --- | --- | --- | --- | --- | --- | --- |
| system.uptime.duration.ms | host.os.platform, cloud.provider | host.name | host.cpu.pct | system.load.15 | system.core.iowait.pct | system.filesystem.used.pct |

The host.name value is a hyperlink to the node details page for that host.

The resulting table should look similar to this:

79093v3

@sorantis sorantis added Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services Feature:Observability Landing labels Oct 1, 2020
@elasticmachine
Contributor

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@sorantis sorantis assigned sorantis and hbharding and unassigned hbharding and sorantis Oct 1, 2020
@sorantis
Author

sorantis commented Oct 1, 2020

@hbharding FYI

@simianhacker
Member

@sorantis For IOWait (system.core.iowait.pct), the average will be across all the cores, since Metricbeat reports an event per core (system.core.id), which I think will be fine. I'm concerned about Disk Usage: the total will be the average across ALL devices (system.filesystem.device_name), and on a lot of systems there will be quite a few system devices that report 100% all the time and will skew the numbers. I just want you to be aware of the caveats.
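To make the Disk Usage caveat concrete, here is a minimal sketch of how a plain average across devices gets skewed. The device names and readings are made up for illustration, and `FilesystemSample` and `averageUsedPct` are hypothetical names, not anything from Kibana or Metricbeat.

```typescript
// Hypothetical per-device readings for a single host (values are invented).
interface FilesystemSample {
  deviceName: string; // stands in for system.filesystem.device_name
  usedPct: number; // stands in for system.filesystem.used.pct, in the range 0..1
}

// Plain arithmetic mean across all devices, which is what an `avg`
// aggregation over every filesystem event of a host would compute.
function averageUsedPct(samples: FilesystemSample[]): number {
  if (samples.length === 0) {
    return 0;
  }
  let total = 0;
  for (const sample of samples) {
    total += sample.usedPct;
  }
  return total / samples.length;
}

const diskSamples: FilesystemSample[] = [
  { deviceName: "data-disk", usedPct: 0.2 }, // the disk a user actually cares about
  { deviceName: "system-device-a", usedPct: 1.0 }, // always reports 100%
  { deviceName: "system-device-b", usedPct: 1.0 }, // always reports 100%
];

// The data disk is only 20% full, yet the host-level average is about 73%.
const skewedAverage = averageUsedPct(diskSamples);
```

Filtering out the always-full system devices before averaging would be one possible mitigation, but which devices to exclude is not something this thread settles.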

Request with Time Series

POST metric*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-1h",
              "lte": "now"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "nodes": {
      "terms": {
        "field": "host.id",
        "size": 10
      },
      "aggs": {
        "metadata": {
          "top_metrics": {
            "metrics": [
              { "field": "host.os.platform" },
              { "field": "host.name" },
              { "field": "cloud.provider" }
              
            ],
            "sort": { "@timestamp": "desc" },
            "size": 1
          }
        },
        "uptime": {
          "max": {
            "field": "system.uptime.duration.ms"
          }
        },
        "cpu": {
          "avg": {
            "field": "host.cpu.pct"
          }
        },
        "iowait": {
          "avg": {
            "field": "system.core.iowait.pct"
          }
        },
        "load": {
          "avg": {
            "field": "system.load.15"
          }
        },
        "disk_usage": {
          "avg": {
            "field": "system.filesystem.used.pct"
          }
        },
        "timeseries": {
          "date_histogram": {
            "field": "@timestamp",
            "fixed_interval": "1m",
            "extended_bounds": {
              "min": "now-1h",
              "max": "now"
            }
          },
          "aggs": {
            "cpu": {
              "avg": {
                "field": "host.cpu.pct"
              }
            },
            "iowait": {
              "avg": {
                "field": "system.core.iowait.pct"
              }
            },
            "load": {
              "avg": {
                "field": "system.load.15"
              }
            },
            "disk_usage": {
              "avg": {
                "field": "system.filesystem.used.pct"
              }
            }
          }
        }
      }
    }
  }
}
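
As a rough sketch of how one bucket from the `nodes` terms aggregation above might be flattened into a table row: the interfaces below assume the usual response shapes (`{ value }` for `avg`/`max`, and `top[0].metrics` for `top_metrics`), and `HostBucket`, `HostRow`, and `bucketToRow` are hypothetical names chosen for illustration, not the actual Kibana implementation.

```typescript
// Assumed shape of a single-value metric aggregation result (avg, max).
interface MetricValue {
  value: number | null;
}

// Assumed shape of one bucket of the "nodes" terms aggregation.
interface HostBucket {
  key: string;
  doc_count: number;
  metadata: {
    top: Array<{ metrics: { [field: string]: string | number | null } }>;
  };
  uptime: MetricValue;
  cpu: MetricValue;
  iowait: MetricValue;
  load: MetricValue;
  disk_usage: MetricValue;
}

// Hypothetical row model for the proposed hosts table.
interface HostRow {
  id: string;
  name: string | null;
  platform: string | null;
  provider: string | null;
  uptimeMs: number | null;
  cpu: number | null;
  iowait: number | null;
  load: number | null;
  diskUsage: number | null;
}

// Returns the value only when it is a string, otherwise null.
function asStringOrNull(value: string | number | null | undefined): string | null {
  return typeof value === "string" ? value : null;
}

function bucketToRow(bucket: HostBucket): HostRow {
  // top_metrics was requested with size 1, so at most one entry is expected.
  const metrics: { [field: string]: string | number | null } =
    bucket.metadata.top.length > 0 ? bucket.metadata.top[0].metrics : {};
  return {
    id: bucket.key,
    name: asStringOrNull(metrics["host.name"]),
    platform: asStringOrNull(metrics["host.os.platform"]),
    provider: asStringOrNull(metrics["cloud.provider"]),
    uptimeMs: bucket.uptime.value,
    cpu: bucket.cpu.value,
    iowait: bucket.iowait.value,
    load: bucket.load.value,
    diskUsage: bucket.disk_usage.value,
  };
}
```

Missing metadata or metrics come through as `null`, which leaves it to the table to decide how to render an empty cell.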

@sorantis
Author

@simianhacker thanks for the update. Looks like it doesn't make much sense to show aggregated Disk IO information. I think we can go back to showing inbound and outbound traffic (system.network.in.bytes | system.network.out.bytes), like we did for the initial version.

@simianhacker
Member

@sorantis I think I will use host.network.in.bytes & host.network.out.bytes since they are now gauges and I think we can use the new rate aggregation with it (just need to check on sorting).
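
A minimal sketch of what the sub-aggregations could look like with the rate aggregation, assuming it is nested inside a date_histogram as the aggregation requires. The helper name `buildNetworkRateAggs` and the aggregation keys `rx` and `tx` are hypothetical, and whether the resulting figures are meaningful depends on how Metricbeat reports these fields; sorting by the rate is left open here, as in the comment above.

```typescript
// Builds a date_histogram with two `rate` sub-aggregations, one per direction.
// The field names are the ones mentioned in this thread; everything else is
// an illustrative assumption.
function buildNetworkRateAggs(interval: string) {
  return {
    timeseries: {
      date_histogram: {
        field: "@timestamp",
        fixed_interval: interval,
      },
      aggs: {
        // Inbound bytes per second within each histogram bucket.
        rx: {
          rate: { field: "host.network.in.bytes", unit: "second" },
        },
        // Outbound bytes per second within each histogram bucket.
        tx: {
          rate: { field: "host.network.out.bytes", unit: "second" },
        },
      },
    },
  };
}

const networkRateAggs = buildNetworkRateAggs("1m");
```

The returned object could be placed under the `aggs` of each host bucket, next to the other per-host metrics.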

@simianhacker
Member

simianhacker commented Feb 18, 2021

FYI... I'm slowly making progress on this:

image

I should have a PR ready to review by tomorrow or Monday

@simianhacker
Member

@kaiyan-sheng Do you know all the different values that could be recorded in host.os.platform?

@katefarrar We are going to need logos for the different platforms (from @kaiyan-sheng). EUI has a windows logo and I have logos for all the different providers (aws, gcp, azure). Where there isn't a provider I just used the compute icon from EUI OR do you want to just leave it empty?

@kaiyan-sheng
Contributor

@simianhacker For host.os.platform, I pinged @fearful-symmetry and the answer is:

aix
android
darwin
dragonfly
freebsd
illumos
js
linux
netbsd
openbsd
plan9
solaris
windows

@sorantis @simianhacker Regarding the new host fields host.network.in.bytes & host.network.out.bytes, I wonder if we should hold off on using them. The new host fields were added to ECS 3 days ago (elastic/ecs#1248) and will be released in 1.9.0. But during the RFC process, these host field names got changed, so Metricbeat now needs to be adjusted again to match the new ECS field names. Should this wait until that change is made in Metricbeat?

@fearful-symmetry

A brief note: the platform values across Beats are mostly taken from GOOS, and you can get them with:

go tool dist list | cut -f1 -d'/' | sort | uniq

The precise values will change with whatever platforms a given Go release does or does not support. Depending on what this is being used for, you might want to dynamically generate the list using the version of Go that Beats is compiled with, which is in .go-version in the root beats directory.

@katefarrar
Contributor

> @kaiyan-sheng Do you know all the different values that could be recorded in host.os.platform?
>
> @katefarrar We are going to need logos for the different platforms (from @kaiyan-sheng). EUI has a windows logo and I have logos for all the different providers (aws, gcp, azure). Where there isn't a provider I just used the compute icon from EUI OR do you want to just leave it empty?

I think using the compute logo works if we don't have a specific provider logo. That way we keep things consistent. Thanks!
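
The fallback agreed on here could be sketched roughly as follows. The icon type strings (`logoAWS`, `logoGCP`, `logoAzure`, `compute`) are assumed to match EUI icon names and should be checked against the EUI version in use; `PROVIDER_ICONS` and `providerIconType` are hypothetical names for illustration.

```typescript
// Assumed mapping from cloud.provider values to EUI icon type names.
const PROVIDER_ICONS: { [provider: string]: string } = {
  aws: "logoAWS",
  gcp: "logoGCP",
  azure: "logoAzure",
};

// Generic icon used when there is no provider or no matching logo.
const FALLBACK_PROVIDER_ICON = "compute";

function providerIconType(provider: string | null | undefined): string {
  if (!provider) {
    return FALLBACK_PROVIDER_ICON;
  }
  const key = provider.toLowerCase();
  // hasOwnProperty guards against inherited keys such as "constructor".
  if (Object.prototype.hasOwnProperty.call(PROVIDER_ICONS, key)) {
    return PROVIDER_ICONS[key];
  }
  return FALLBACK_PROVIDER_ICON;
}
```

Keeping the fallback in one place means every row shows some icon, which matches the consistency goal stated above.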

@simianhacker
Member

@katefarrar Is there an issue on the design side for the logos for the other platforms? It looks like EuiIcon works with SVGs.

@simianhacker
Member

@kaiyan-sheng Do you have the new names?

@kaiyan-sheng
Contributor

kaiyan-sheng commented Feb 25, 2021

@simianhacker Yes, I have the new names but these names are not used by Metricbeat yet.
New names are in ECS 1.9.0 and the main changes are:

host.cpu.pct -> host.cpu.usage
host.network.in.bytes -> host.network.ingress.bytes
host.network.out.bytes -> host.network.egress.bytes

I'm working on a PR to change the names to match the new names that got into ECS.
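
For code that has to cope with both naming schemes during the transition, the three renames listed above could be captured in a small lookup. This is only a sketch: `ECS_1_9_RENAMES` and `toEcsFieldName` are hypothetical names, and the table contains just the renames mentioned in this comment.

```typescript
// Old Metricbeat host field names mapped to the ECS 1.9.0 names listed above.
const ECS_1_9_RENAMES: { [oldName: string]: string } = {
  "host.cpu.pct": "host.cpu.usage",
  "host.network.in.bytes": "host.network.ingress.bytes",
  "host.network.out.bytes": "host.network.egress.bytes",
};

// Returns the ECS 1.9.0 name for a renamed field, or the input unchanged.
function toEcsFieldName(field: string): string {
  if (Object.prototype.hasOwnProperty.call(ECS_1_9_RENAMES, field)) {
    return ECS_1_9_RENAMES[field];
  }
  return field;
}
```

Fields that were not renamed, such as system.load.15, pass through unchanged.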

@simianhacker
Member

simianhacker commented Mar 17, 2021

I found logos for everything except plan9, openbsd, and js.

image

@simianhacker
Member

FYI... the system.uptime.duration.ms metric only ships every 15 minutes. There is a scenario where uptime will display N/A when the time range is less than 15 minutes.
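
One way the N/A case might be handled when rendering the cell, assuming the `max` aggregation on system.uptime.duration.ms returns a null value when no uptime event falls inside the selected time range. `formatUptime` and its output format are hypothetical.

```typescript
// Formats an uptime in milliseconds as days, hours, and minutes, and falls
// back to "N/A" when no uptime value was found in the time range.
function formatUptime(uptimeMs: number | null | undefined): string {
  if (uptimeMs === null || uptimeMs === undefined || !isFinite(uptimeMs)) {
    return "N/A";
  }
  const totalMinutes = Math.floor(uptimeMs / 60000);
  const days = Math.floor(totalMinutes / 1440);
  const hours = Math.floor((totalMinutes % 1440) / 60);
  const minutes = totalMinutes % 60;
  return `${days}d ${hours}h ${minutes}m`;
}
```

An alternative would be to query uptime over a window of at least 15 minutes regardless of the selected range, but that trade-off is not discussed in this thread.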

8 participants