Dashboard Spec - Cluster Dashboard #222

Closed
julienlim opened this issue Aug 7, 2017 · 7 comments

Comments

julienlim (Member) commented Aug 7, 2017

Dashboard Spec - Cluster Dashboard

Display a default dashboard for a single Gluster cluster present in Tendrl that provides at-a-glance information about a single Gluster trusted storage pool, including health and status information, key performance indicators (e.g. IOPS, throughput), and alerts that can draw the Tendrl user's (e.g. Gluster Administrator's) attention to potential issues in the cluster, hosts, volumes, and bricks.

Problem description

A Gluster Administrator wants to be able to answer the following questions by looking at the cluster dashboard:

  • Is my cluster running well, is it healthy?
  • What’s actually wrong with my cluster, and why is it slow?
  • Is my cluster filling up too fast?
  • When will my cluster run out of capacity?
  • If something is down / broken / failed, where and what is the issue, and when did it happen?
  • If there is a split-brain issue, tell me there is an issue and what I need to do to fix it.
  • Has the number of clients (indicated via connections) increased, which may be the reason for the performance degradation that the clients/applications are observing?

Use Cases

Use cases in the form of user stories:

  • As a Gluster Administrator, I want to view at-a-glance information about my Gluster trusted storage pool that includes health and status information, key performance indicators (e.g. IOPS, throughput), and alerts that can draw my attention to potential issues in the cluster, hosts, volumes, and bricks.

  • As a Gluster Administrator, I want to compare a metric (e.g. IOPS, CPU, memory, network load) across hosts within the cluster.

  • As a Gluster Administrator, I want to compare utilization (e.g. IOPS, capacity) across bricks within a volume.

Proposed change

Provide a pre-canned, default cluster dashboard in Grafana (initially launchable from the Tendrl UI, and eventually embedded into the Tendrl UI) that shows the following metrics, rendered either as text or as a chart/graph depending on the type of metric being displayed.

The dashboard is composed of individual Panels (dashboard widgets) arranged in a number of Rows.

Note: The cluster name/ID should be visible at all times, and the user should be able to switch to another cluster.
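
To make the layout concrete, here is a minimal sketch (in Python, emitting Grafana dashboard JSON) of how such a pre-canned skeleton of rows and panels could be generated. The panel ids, panel types, and the cluster template variable are illustrative placeholders, not the final implementation.

    import json

    # Minimal sketch of the pre-canned dashboard skeleton (rows of panels).
    # Panel ids/types and the cluster template variable are placeholders.
    def build_cluster_dashboard(cluster_id):
        return {
            "title": "Gluster Cluster Dashboard - %s" % cluster_id,
            "rows": [
                {"title": "Row 1", "panels": [
                    {"id": 1, "title": "Cluster Health", "type": "singlestat"},
                    {"id": 2, "title": "Hosts", "type": "singlestat"},
                    {"id": 3, "title": "Volumes", "type": "singlestat"},
                    {"id": 4, "title": "Bricks", "type": "singlestat"},
                ]},
                {"title": "Row 2", "panels": [
                    {"id": 9, "title": "Capacity Utilization", "type": "singlestat"},
                    {"id": 14, "title": "IOPS Trend", "type": "graph"},
                ]},
            ],
            # A template variable so the user can switch to another cluster.
            "templating": {"list": [{"name": "cluster_id", "type": "constant",
                                     "query": cluster_id}]},
        }

    print(json.dumps(build_cluster_dashboard("my-cluster"), indent=2))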

Row 1

Panel (Dashboard Widget) 1: Health - Cluster Health

Panel 2: Hosts

  • n total - total number (n) of hosts in the cluster
  • n up - count (n) of hosts in the cluster that are up
  • n down - count (n) of hosts in the cluster that are down
  • chart type: Stacked Card
  • Example:
    +--------------------------+
    |                          |
    |  Hosts                   |
    |                          |
    |  6 total                 |
    |  5 up                    |
    |  1 down                  |
    +--------------------------+

Panel 3: Volumes

  • n total - total number (n) of volumes in the cluster

  • n Up - count (n) of volumes in the cluster that are started and active; see https://github.com/gluster/gstatus for details about the various statuses

  • n Up (Partial) - count (n) of volumes in the cluster that are up(partial)

  • n Up (Degraded) - count (n) of volumes in the cluster that are up(degraded)

  • n Down - count (n) of volumes in the cluster that are down

  • the color of the panel should be green when all volumes are Up, red when one or more volumes are Down or quorum is lost, and yellow when one or more volumes are Up (Degraded) or Up (Partial); see the sketch after this list

  • chart type: Singlestat (see http://docs.grafana.org/features/panels/singlestat/ for further information)

  • chart type: Stacked Card
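
A minimal sketch of the color rule above, assuming per-state volume counts are already available; the function and argument names are illustrative, not a Tendrl or Grafana API.

    # Color rule for the Volumes panel, given per-state volume counts.
    def volume_panel_color(up, up_degraded, up_partial, down):
        """Return the panel color for the Volumes panel."""
        if down > 0:
            return "red"     # one or more volumes down, or quorum lost
        if up_degraded > 0 or up_partial > 0:
            return "yellow"  # degraded or partial volumes need attention
        return "green"       # all volumes up

    assert volume_panel_color(up=4, up_degraded=0, up_partial=0, down=0) == "green"
    assert volume_panel_color(up=3, up_degraded=1, up_partial=0, down=0) == "yellow"
    assert volume_panel_color(up=3, up_degraded=0, up_partial=0, down=1) == "red"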

Panel 4: Bricks

  • n total - total number (n) of bricks in the cluster
  • n up - count (n) of bricks in the cluster that are up
  • n down - count (n) of bricks in the cluster that are down
  • chart type: Stacked Card

[FUTURE] Panel 5: Disks

  • n total - total number (n) of disks in the cluster
  • n up - count (n) of disks in the cluster that are up
  • n down - count (n) of disks in the cluster that are down
  • chart type: Stacked Card

Panel 6: Snapshots

  • n total - count (n) of active snapshots in the cluster
  • chart type: Singlestat

Panel 7: Geo-replication Sessions

  • n total - total number (n) of geo-replication sessions for the cluster

  • n Created - count (n) of geo-replication sessions that are CREATED or established

  • n Up - count (n) of geo-replication sessions in which all bricks are ONLINE and UP

  • n Up (Partial) - count (n) of geo-replication sessions for the cluster that are up (partial), i.e. some bricks are online and some bricks are offline

  • n Stopped - count (n) of geo-replication sessions that are STOPPED

  • n Down - count (n) of geo-replication sessions in which all bricks are offline/down

  • n Paused - count (n) of geo-replication sessions that are in paused state

  • chart type: Stacked Card

Panel 8: Connections Trend

  • count (n) of client connections to the bricks in the volume over a period of time
  • chart type: Line Chart / Spark

Row 2

Panel 9: Capacity Utilization

  • Disk space used
  • chart type: Gauge

Panel 10: Capacity Available

  • Disk space free
  • chart type: Singlestat

Panel 11: Growth Rate

  • growth rate, estimated from the first and last data points in the selected time range
  • chart type: Singlestat

Panel 12: Time Remaining (Weeks)

  • based on the projected growth rate from Panel 11 (Growth Rate), provide the estimated number of weeks of capacity remaining (see the sketch below)
  • chart type: Singlestat
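
A worked sketch of the calculations behind Panels 11 and 12, assuming capacity samples are available as (timestamp, bytes used) pairs; the sample figures below are illustrative only.

    # Panels 11 and 12: estimate the growth rate from the first and last
    # capacity samples, then project the number of weeks remaining until
    # the cluster is full. Sample values are illustrative.
    SECONDS_PER_WEEK = 7 * 24 * 3600

    def growth_rate_per_week(first_ts, first_used, last_ts, last_used):
        """Growth in bytes per week, based on the first and last data points."""
        elapsed = last_ts - first_ts
        if elapsed <= 0:
            return 0.0
        return (last_used - first_used) / float(elapsed) * SECONDS_PER_WEEK

    def weeks_remaining(total_capacity, last_used, rate_per_week):
        """Estimated weeks until capacity is exhausted at the current rate."""
        if rate_per_week <= 0:
            return float("inf")  # not growing, so no projected exhaustion
        return (total_capacity - last_used) / rate_per_week

    # Example: 2 TB of growth over 14 days on a 100 TB cluster with 42 TB used.
    rate = growth_rate_per_week(first_ts=0, first_used=40e12,
                                last_ts=14 * 24 * 3600, last_used=42e12)
    print(weeks_remaining(total_capacity=100e12, last_used=42e12,
                          rate_per_week=rate))  # ~58 weeks

In practice the dashboard would likely do this arithmetic in the data source query or panel expression; the Python above is only to pin down the calculation.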

[FUTURE] Panel 13: Services Trend

  • based on Gluster svc events (connected, disconnected, failed) over a period of time
  • similar to what was available with the previous Gluster Console with the Nagios plug-in
  • chart type: Line Chart / Spark

Panel 14: IOPS Trend

  • show the IOPS for the cluster over a period of time
  • chart type: Line Chart / Spark

Panel 15: IO Size

  • show IO Size
  • chart type: Singlestat

Panel 16: Network Throughput Trend

  • show network throughput for the cluster network over a period of time
  • chart type: Line Chart / Spark

Row 3

Panel 17: Top volumes by capacity utilization

  • show the top 5 volumes with the highest disk utilization
  • chart type: Bar Chart / Histogram

Panel 18: Top bricks by capacity utilization

  • show the top 5 bricks with the highest disk utilization
    • Note: User should be able to discern which host the brick is mounted on
  • chart type: Bar Chart / Histogram

Row 4

Panel 19: CPU used by Host

  • show CPU utilization for individual hosts within 4 different utilization ranges/buckets: > 90%, 80-90%, 70-80%, and < 70%
  • chart type: Heat Map

Panel 20: Memory used by Host

  • show memory utilization for individual hosts within 4 different utilization ranges/buckets: > 90%, 80-90%, 70-80%, and < 70% (see the bucketing sketch below)
  • chart type: Heat Map
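
A minimal sketch of the bucketing described for Panels 19 and 20, assuming per-host utilization percentages are already available; the host names and values are illustrative.

    # Group hosts into the four utilization buckets used by the heat map
    # panels (> 90%, 80-90%, 70-80%, < 70%). Input values are illustrative.
    BUCKETS = [
        ("> 90%",  lambda v: v > 90),
        ("80-90%", lambda v: 80 <= v <= 90),
        ("70-80%", lambda v: 70 <= v < 80),
        ("< 70%",  lambda v: v < 70),
    ]

    def bucket_hosts(utilization_by_host):
        """Map each bucket label to the list of hosts that fall into it."""
        grouped = {label: [] for label, _ in BUCKETS}
        for host, value in utilization_by_host.items():
            for label, matches in BUCKETS:
                if matches(value):
                    grouped[label].append(host)
                    break
        return grouped

    print(bucket_hosts({"host-1": 95.0, "host-2": 83.5, "host-3": 42.0}))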

Panel 21: Ping Latency by Host Trend

  • show Ping latency over a period of time
  • x-axis: time
  • y-axis: ping latency for each host in the cluster
  • chart type: Line Chart / Spark

Note: The dashboard layout for the rows and the panels within them may need to change based on the implementation and the actual visualization, especially when certain metrics need to be aligned together, whether vertically or horizontally.

Alternatives

Create a similar dashboard using PatternFly (www.patternfly.org) or d3.js components to show the same information within the Tendrl UI.

Data model impact:

TBD

Impacted Modules:

TBD

Tendrl API impact:

TBD

Notifications/Monitoring impact:

TBD

Tendrl/common impact:

TBD

Tendrl/node_agent impact:

TBD

Sds integration impact:

TBD

Security impact:

TBD

Other end user impact:

Users will mostly interact with this feature via the Grafana UI. Access via the Grafana API and the Tendrl API is also possible, but would require API calls that provide similar information.

Performance impact:

TBD

Other deployer impact:

  • Plug-ins required by Grafana will need to be packaged and installed with tendrl-ansible.

  • This (default) cluster dashboard will need to be automatically generated whenever a cluster is imported to be managed by Tendrl (see the sketch below).
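
A hedged sketch of how the default dashboard could be pushed to Grafana at cluster-import time via Grafana's dashboard HTTP API; the URL, API-key handling, and the build_cluster_dashboard() helper (sketched earlier) are assumptions for illustration, not the actual Tendrl implementation.

    import requests

    def provision_cluster_dashboard(grafana_url, api_key, cluster_id):
        """Generate the default dashboard for a newly imported cluster and
        upload it to Grafana (assumed endpoint and auth scheme)."""
        dashboard = build_cluster_dashboard(cluster_id)  # skeleton sketched earlier
        payload = {"dashboard": dashboard, "overwrite": True}
        response = requests.post(
            grafana_url + "/api/dashboards/db",
            json=payload,
            headers={"Authorization": "Bearer " + api_key},
        )
        response.raise_for_status()
        return response.json()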

Developer impact:

TBD

Implementation:

TBD

Assignee(s):

Primary assignee: @cloudbehl

Other contributors: @anmolbabu, @anivargi, @julienlim, @japplewhite

Work Items:

TBD

Estimate:

TBD

Dependencies:

TBD

Testing:

Test whether the health, status, and metrics displayed for a given cluster are correct, and that the information stays up to date as failures or other cluster changes occur.

Documentation impact:

Documentation should explain what is being displayed wherever it is not immediately obvious from looking at the dashboard. This may include, but is not limited to, what each metric refers to, its unit of measurement, and how to use it when troubleshooting problems such as healing / split-brain issues or loss of quorum.

References and Related GitHub Links:

julienlim (Member, Author) commented Aug 7, 2017

@sankarshanmukhopadhyay @brainfunked @r0h4n @nthomas-redhat @Tendrl/qe @Tendrl/tendrl_frontend @japplewhite @rghatvis@redhat.com @mcarrano

This dashboard proposal is ready for review. Note: API impact, module impact, etc. have to be filled out by someone else -- maybe @cloudbehl, @anmolbabu, or @anivargi.

Suggested Labels (for folks who have permissions to label the spec):

  • FEATURE:Monitoring
  • INTERFACE:Dashboard
  • INTERFACE:GUI

nthomas-redhat (Contributor) commented Aug 10, 2017

Row-1:
Panel (Dashboard Widgit) 1: Health - Cluster Health
Cluster status will have only two values, Healthy or Unhealthy. This is in line with what gstatus is doing and we would like to stick with the same.

Panel 3: Volumes
Volume has states up(partial) and up(degraded) as well

Panel 5: Disks
No platform support for disk status as such. This won't be supported now

Panel 7: Geo-replication Sessions
What's the difference between active and up?
What we are planning to support now is: up, down, up (partial)

Row 2
Panel 13: Services Trend
Can we get some clarity around this? Is it part of MVP?

julienlim (Member, Author) commented

@nthomas-redhat @japplewhite @jjkabrown1 @Tendrl/qe @Tendrl/tendrl-core @anmolbabu @cloudbehl @anivargi

I've addressed and updated Panels 1, 3, 5, and 7 per @nthomas-redhat's comments. They should align with https://github.com/gluster/gstatus.

For geo-rep, I was following the Gluster metrics document we had previously, but I have updated it per the current support plan.

For Panel 13 (services trend), I raised this a few times in BLR, and I'm suggesting it to have parity with the old Console. This was the only one we didn't address. The usage scenario is that there is no easy way today for Admins to know whether their services/daemons have died or are still OK, and this is a means for monitoring their health. I will defer to @japplewhite on whether this is part of the MVP.

julienlim (Member, Author) commented Aug 10, 2017

@nthomas-redhat @japplewhite @jjkabrown1 @Tendrl/qe @Tendrl/tendrl-core @anmolbabu @cloudbehl @anivargi @mcarrano asrivast@redhat.com

I've put a very rough mockup together to show what the cluster dashboard might look like:

[Mockup image: grafana dashboard - cluster]

julienlim (Member, Author) commented

Noting that geo-rep session status changes are planned per Tendrl/gluster-integration#459.

julienlim (Member, Author) commented

Updated the Geo-replication Session Panel per georep session status changes.

@shtripat @nthomas-redhat @cloudbehl @Tendrl/tendrl-qe @mcarrano

r0h4n (Contributor) commented Jan 30, 2018

Closing this one; please file a new issue with relevant context if anything is missing.

r0h4n closed this as completed Jan 30, 2018