Dashboard Spec - Cluster Dashboard #222

Closed
julienlim opened this issue Aug 7, 2017 · 7 comments

Comments

julienlim (Member) commented Aug 7, 2017

Dashboard Spec - Cluster Dashboard

Display a default dashboard for a single Gluster cluster present in Tendrl that provides at-a-glance information about a single Gluster trusted storage pool, including health and status information, key performance indicators (e.g. IOPS, throughput), and alerts that can draw the Tendrl user's (e.g. Gluster Administrator's) attention to potential issues in the cluster, hosts, volumes, and bricks.

Problem description

A Gluster Administrator wants to be able to answer the following questions by looking at the cluster dashboard:

  • Is my cluster running well, is it healthy?
  • What’s actually wrong with my cluster, and why is it slow?
  • Is my cluster filling up too fast?
  • When will my cluster run out of capacity?
  • If something is down / broken / failed, where and what is the issue, and when did it happen?
  • If there is a split-brain issue, tell me there is an issue and what I need to do to fix it.
  • Has the number of clients (indicated via connections) increased, which may be the reason for the performance degradation that the clients/applications are observing?

Use Cases

Use cases in the form of user stories:

  • As a Gluster Administrator, I want to view at-a-glance information about my Gluster trusted storage pool that includes health and status information, key performance indicators (e.g. IOPS, throughput), and alerts that can draw my attention to potential issues in the cluster, hosts, volumes, and bricks.

  • As a Gluster Administrator, I want to compare a metric (e.g. IOPS, CPU, memory, network load) across hosts within the cluster.

  • As a Gluster Administrator, I want to compare utilization (e.g. IOPS, capacity) across bricks within a volume.

Proposed change

Provide a pre-canned, default cluster dashboard in Grafana (initially launchable from the Tendrl UI, and eventually embedded into the Tendrl UI) that shows the following metrics, rendered either as text or as a chart/graph depending on the type of metric being displayed.

The dashboard is composed of individual Panels (dashboard widgets) arranged in a number of Rows.

Note: The cluster name/ID should be visible at all times, and the user should be able to switch to another cluster.
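
To make the layout concrete, here is a minimal sketch (in Python, emitting Grafana dashboard JSON) of how such a pre-canned skeleton of rows and panels could be generated. The panel ids, panel types, and the cluster template variable are illustrative placeholders, not the final implementation.

    import json

    # Minimal sketch of the pre-canned dashboard skeleton (rows of panels).
    # Panel ids/types and the cluster template variable are placeholders.
    def build_cluster_dashboard(cluster_id):
        return {
            "title": "Gluster Cluster Dashboard - %s" % cluster_id,
            "rows": [
                {"title": "Row 1", "panels": [
                    {"id": 1, "title": "Cluster Health", "type": "singlestat"},
                    {"id": 2, "title": "Hosts", "type": "singlestat"},
                    {"id": 3, "title": "Volumes", "type": "singlestat"},
                    {"id": 4, "title": "Bricks", "type": "singlestat"},
                ]},
                {"title": "Row 2", "panels": [
                    {"id": 9, "title": "Capacity Utilization", "type": "singlestat"},
                    {"id": 14, "title": "IOPS Trend", "type": "graph"},
                ]},
            ],
            # A template variable so the user can switch to another cluster.
            "templating": {"list": [{"name": "cluster_id", "type": "constant",
                                     "query": cluster_id}]},
        }

    print(json.dumps(build_cluster_dashboard("my-cluster"), indent=2))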

Row 1

Panel (Dashboard Widget) 1: Health - Cluster Health

Panel 2: Hosts

  • n total - total number (n) of hosts in the cluster
  • n up - count (n) of hosts in the cluster that are up
  • n down - count (n) of hosts in the cluster that are down
  • chart type: Stacked Card
  • Example:
    +--------------------------+
    |                          |
    |  Hosts                   |
    |                          |
    |  6 total                 |
    |  5 up                    |
    |  1 down                  |
    +--------------------------+

Panel 3: Volumes

  • n total - total number (n) of volumes in the cluster

  • n Up - count (n) of volumes in the cluster that are started and active; see https://github.com/gluster/gstatus for details about the various statuses

  • n Up (Partial) - count (n) of volumes in the cluster that are up(partial)

  • n Up (Degraded) - count (n) of volumes in the cluster that are up(degraded)

  • n Down - count (n) of volumes in the cluster that are down

  • the color of the panel should be green when all volumes are Up, red when one or more volumes are Down or quorum is lost, and yellow when one or more volumes are Up (Degraded) or Up (Partial); see the sketch after this list

  • chart type: Singlestat (see http://docs.grafana.org/features/panels/singlestat/ for further information)

  • chart type: Stacked Card
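
A minimal sketch of the color rule above, assuming per-state volume counts are already available; the function and argument names are illustrative, not a Tendrl or Grafana API.

    # Color rule for the Volumes panel, given per-state volume counts.
    def volume_panel_color(up, up_degraded, up_partial, down):
        """Return the panel color for the Volumes panel."""
        if down > 0:
            return "red"     # one or more volumes down, or quorum lost
        if up_degraded > 0 or up_partial > 0:
            return "yellow"  # degraded or partial volumes need attention
        return "green"       # all volumes up

    assert volume_panel_color(up=4, up_degraded=0, up_partial=0, down=0) == "green"
    assert volume_panel_color(up=3, up_degraded=1, up_partial=0, down=0) == "yellow"
    assert volume_panel_color(up=3, up_degraded=0, up_partial=0, down=1) == "red"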

Panel 4: Bricks

  • n total - total number (n) of bricks in the cluster
  • n up - count (n) of bricks in the cluster that are up
  • n down - count (n) of bricks in the cluster that are down
  • chart type: Stacked Card

[FUTURE] Panel 5: Disks

  • n total - total number (n) of disks in the cluster
  • n up - count (n) of disks in the cluster that are up
  • n down - count (n) of disks in the cluster that are down
  • chart type: Stacked Card

Panel 6: Snapshots

  • n total - count (n) of active snapshots in the cluster
  • chart type: Singlestat

Panel 7: Geo-replication Sessions

  • n total - total number (n) of geo-replication sessions for the cluster

  • n Created - count (n) of geo-replication sessions that are CREATED or established

  • n Up - count (n) of geo-replication sessions in which all bricks are ONLINE and UP

  • n Up (Partial) - count (n) of geo-replication sessions for the cluster that are up (partial), i.e. some bricks are online and some bricks are offline

  • n Stopped - count (n) of geo-replication sessions that are STOPPED

  • n Down - count (n) of geo-replication sessions in which all bricks are offline/down

  • n Paused - count (n) of geo-replication sessions that are in paused state

  • chart type: Stacked Card

Panel 8: Connections Trend

  • count (n) of client connections to the bricks in the volume over a period of time
  • chart type: Line Chart / Spark

Row 2

Panel 9: Capacity Utilization

  • Disk space used
  • chart type: Gauge

Panel 10: Capacity Available

  • Disk space free
  • chart type: Singlestat

Panel 11: Growth Rate

  • growth rate, estimated from the first and last data points in the selected time range
  • chart type: Singlestat

Panel 12: Time Remaining (Weeks)

  • based on the projected growth rate from Panel 11 (Growth Rate), provide the estimated number of weeks of capacity remaining (see the sketch below)
  • chart type: Singlestat
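
A worked sketch of the calculations behind Panels 11 and 12, assuming capacity samples are available as (timestamp, bytes used) pairs; the sample figures below are illustrative only.

    # Panels 11 and 12: estimate the growth rate from the first and last
    # capacity samples, then project the number of weeks remaining until
    # the cluster is full. Sample values are illustrative.
    SECONDS_PER_WEEK = 7 * 24 * 3600

    def growth_rate_per_week(first_ts, first_used, last_ts, last_used):
        """Growth in bytes per week, based on the first and last data points."""
        elapsed = last_ts - first_ts
        if elapsed <= 0:
            return 0.0
        return (last_used - first_used) / float(elapsed) * SECONDS_PER_WEEK

    def weeks_remaining(total_capacity, last_used, rate_per_week):
        """Estimated weeks until capacity is exhausted at the current rate."""
        if rate_per_week <= 0:
            return float("inf")  # not growing, so no projected exhaustion
        return (total_capacity - last_used) / rate_per_week

    # Example: 2 TB of growth over 14 days on a 100 TB cluster with 42 TB used.
    rate = growth_rate_per_week(first_ts=0, first_used=40e12,
                                last_ts=14 * 24 * 3600, last_used=42e12)
    print(weeks_remaining(total_capacity=100e12, last_used=42e12,
                          rate_per_week=rate))  # ~58 weeks

In practice the dashboard would likely do this arithmetic in the data source query or panel expression; the Python above is only to pin down the calculation.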

[FUTURE] Panel 13: Services Trend

  • based on Gluster svc events (connected, disconnected, failed) over a period of time
  • similar to what was available with the previous Gluster Console with the Nagios plug-in
  • chart type: Line Chart / Spark

Panel 14: IOPS Trend

  • show the IOPS for the cluster over a period of time
  • chart type: Line Chart / Spark

Panel 15: IO Size

  • show IO Size
  • chart type: Singlestat

Panel 16: Network Throughput Trend

  • show network throughput for the cluster network over a period of time
  • chart type: Line Chart / Spark

Row 3

Panel 17: Top volumes by capacity utilization

  • show the top 5 volumes with the highest disk utilization
  • chart type: Bar Chart / Histogram

Panel 18: Top bricks by capacity utilization

  • show the top 5 bricks with the highest disk utilization
    • Note: User should be able to discern which host the brick is mounted on
  • chart type: Bar Chart / Histogram

Row 4

Panel 19: CPU used by Host

  • show CPU utilization for individual hosts within 4 different utilization ranges/buckets: > 90%, 80-90%, 70-80%, and < 70%
  • chart type: Heat Map

Panel 20: Memory used by Host

  • show memory utilization for individual hosts within 4 different utilization ranges/buckets: > 90%, 80-90%, 70-80%, and < 70% (see the bucketing sketch below)
  • chart type: Heat Map
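
A minimal sketch of the bucketing described for Panels 19 and 20, assuming per-host utilization percentages are already available; the host names and values are illustrative.

    # Group hosts into the four utilization buckets used by the heat map
    # panels (> 90%, 80-90%, 70-80%, < 70%). Input values are illustrative.
    BUCKETS = [
        ("> 90%",  lambda v: v > 90),
        ("80-90%", lambda v: 80 <= v <= 90),
        ("70-80%", lambda v: 70 <= v < 80),
        ("< 70%",  lambda v: v < 70),
    ]

    def bucket_hosts(utilization_by_host):
        """Map each bucket label to the list of hosts that fall into it."""
        grouped = {label: [] for label, _ in BUCKETS}
        for host, value in utilization_by_host.items():
            for label, matches in BUCKETS:
                if matches(value):
                    grouped[label].append(host)
                    break
        return grouped

    print(bucket_hosts({"host-1": 95.0, "host-2": 83.5, "host-3": 42.0}))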

Panel 21: Ping Latency by Host Trend

  • show Ping latency over a period of time
  • x-axis: time
  • y-axis: ping latency for each host in the cluster
  • chart type: Line Chart / Spark

Note: The dashboard layout for the rows and the panels within them may need to change based on the implementation and the actual visualization, especially when certain metrics need to be aligned together, whether vertically or horizontally.

Alternatives

Create a similar dashboard using PatternFly (www.patternfly.org) or d3.js components to show the same information within the Tendrl UI.

Data model impact:

TBD

Impacted Modules:

TBD

Tendrl API impact:

TBD

Notifications/Monitoring impact:

TBD

Tendrl/common impact:

TBD

Tendrl/node_agent impact:

TBD

Sds integration impact:

TBD

Security impact:

TBD

Other end user impact:

Users will mostly interact with this feature via the Grafana UI. Access via the Grafana API and the Tendrl API is also possible, but would require API calls that provide similar information.

Performance impact:

TBD

Other deployer impact:

  • Plug-ins required by Grafana will need to be packaged and installed with tendrl-ansible.

  • This (default) cluster dashboard will need to be automatically generated whenever a cluster is imported to be managed by Tendrl (see the sketch below).
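
A hedged sketch of how the default dashboard could be pushed to Grafana at cluster-import time via Grafana's dashboard HTTP API; the URL, API-key handling, and the build_cluster_dashboard() helper (sketched earlier) are assumptions for illustration, not the actual Tendrl implementation.

    import requests

    def provision_cluster_dashboard(grafana_url, api_key, cluster_id):
        """Generate the default dashboard for a newly imported cluster and
        upload it to Grafana (assumed endpoint and auth scheme)."""
        dashboard = build_cluster_dashboard(cluster_id)  # skeleton sketched earlier
        payload = {"dashboard": dashboard, "overwrite": True}
        response = requests.post(
            grafana_url + "/api/dashboards/db",
            json=payload,
            headers={"Authorization": "Bearer " + api_key},
        )
        response.raise_for_status()
        return response.json()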

Developer impact:

TBD

Implementation:

TBD

Assignee(s):

Primary assignee: @cloudbehl

Other contributors: @anmolbabu, @anivargi, @julienlim, @japplewhite

Work Items:

TBD

Estimate:

TBD

Dependencies:

TBD

Testing:

Test whether the health, status, and metrics displayed for a given cluster are correct, and that the information stays up to date as failures or other cluster changes occur.

Documentation impact:

Documentation should explain what is being displayed wherever it is not immediately obvious from looking at the dashboard. This may include, but is not limited to, what each metric refers to, its unit of measurement, and how to use it when troubleshooting problems such as healing / split-brain issues or loss of quorum.

References and Related GitHub Links:

julienlim (Member, Author) commented Aug 7, 2017

@sankarshanmukhopadhyay @brainfunked @r0h4n @nthomas-redhat @Tendrl/qe @Tendrl/tendrl_frontend @japplewhite @rghatvis@redhat.com @mcarrano

This dashboard proposal is ready for review. Note: API impact, module impact, etc. have to be filled out by someone else -- maybe @cloudbehl, @anmolbabu, or @anivargi.

Suggested Labels (for folks who have permissions to label the spec):

  • FEATURE:Monitoring
  • INTERFACE:Dashboard
  • INTERFACE:GUI

nthomas-redhat (Contributor) commented Aug 10, 2017

Row-1:
Panel (Dashboard Widgit) 1: Health - Cluster Health
Cluster status will have only two values, Healthy or Unhealthy. This is in line with what gstatus is doing and we would like to stick with the same.

Panel 3: Volumes
Volume has states up(partial) and up(degraded) as well

Panel 5: Disks
No platform support for disk status as such. This won't be supported now

Panel 7: Geo-replication Sessions
What's the difference between active and up?
What we are planning to support now is: up, down, up (partial)

Row 2
Panel 13: Services Trend
Can we get some clarity around this? Is it part of MVP?

julienlim (Member, Author) commented

@nthomas-redhat @japplewhite @jjkabrown1 @Tendrl/qe @Tendrl/tendrl-core @anmolbabu @cloudbehl @anivargi

I've addressed and updated Panels 1, 3, 5, and 7 per @nthomas-redhat's comments. They should align with https://github.com/gluster/gstatus.

For geo-rep, I was following the Gluster metrics document we had previously, but I have updated it per the current support plan.

For Panel 13 (services trend), I raised this a few times in BLR, and I'm suggesting it to have parity with the old Console. This was the only one we didn't address. The usage scenario is that there is no easy way today for Admins to know whether their services/daemons have died or are still OK, and this is a means for monitoring their health. I will defer to @japplewhite on whether this is part of the MVP.

julienlim (Member, Author) commented Aug 10, 2017

@nthomas-redhat @japplewhite @jjkabrown1 @Tendrl/qe @Tendrl/tendrl-core @anmolbabu @cloudbehl @anivargi @mcarrano asrivast@redhat.com

I've put a very rough mockup together to show what the cluster dashboard might look like:

[Mockup image: grafana dashboard - cluster]

julienlim (Member, Author) commented

Noting that geo-rep session status changes are planned per Tendrl/gluster-integration#459.

julienlim (Member, Author) commented

Updated the Geo-replication Session Panel per georep session status changes.

@shtripat @nthomas-redhat @cloudbehl @Tendrl/tendrl-qe @mcarrano

r0h4n (Contributor) commented Jan 30, 2018

Closing this one; please file a new issue with relevant context if anything is missing.

r0h4n closed this as completed Jan 30, 2018