Dashboard Spec - Cluster Dashboard #222
Comments
@sankarshanmukhopadhyay @brainfunked @r0h4n @nthomas-redhat @Tendrl/qe @Tendrl/tendrl_frontend @japplewhite @rghatvis@redhat.com @mcarrano This dashboard proposal is ready for review. Note: API impact, module impact, etc. have to be filled out by someone else, maybe @cloudbehl, @anmolbabu, or @anivargi. Suggested Labels (for folks who have permissions to label the spec):
Row-1: Panel 3: Volumes Panel 5: Disks Panel 7: Geo-replication Sessions Row 2
@nthomas-redhat @japplewhite @jjkabrown1 @Tendrl/qe @Tendrl/tendrl-core @anmolbabu @cloudbehl @anivargi I've addressed and updated Panels 1, 3, 5, and 7 per @nthomas-redhat's comments. They should align with https://github.com/gluster/gstatus. For geo-rep, I was following the Gluster metrics document we had previously, but have updated it to match the current plan for support. For Panel 13 (services trend), I raised this a few times in BLR, and I'm suggesting it to have parity with the old Console. This was the only one we didn't address. The use scenario is that there is no easy way today for admins to know whether their services/daemons have died or are still OK, and this panel is a means of monitoring their health. I will defer to @japplewhite on whether this is part of the MVP.
@nthomas-redhat @japplewhite @jjkabrown1 @Tendrl/qe @Tendrl/tendrl-core @anmolbabu @cloudbehl @anivargi @mcarrano asrivast@redhat.com I've put a very rough mockup together to show what the cluster dashboard might look like:
Noting that geo-rep session status changes are planned per Tendrl/gluster-integration#459.
Updated the Geo-replication Sessions panel per the geo-rep session status changes. @shtripat @nthomas-redhat @cloudbehl @Tendrl/tendrl-qe @mcarrano
Closing this one; please file a new issue with relevant context if anything is missing.
Dashboard Spec - Cluster Dashboard
Display a default dashboard for a single Gluster cluster (trusted storage pool) present in Tendrl. The dashboard provides at-a-glance health and status information, key performance indicators (e.g. IOPS, throughput, etc.), and alerts that draw the Tendrl user's (e.g. Gluster Administrator's) attention to potential issues in the cluster, hosts, volumes, and bricks.
Problem description
A Gluster Administrator wants to be able to answer the following questions by looking at the cluster dashboard:
Use Cases
Use cases in the form of user stories:
As a Gluster Administrator, I want to view at-a-glance information about my Gluster trusted storage pool that includes health and status information, key performance indicators (e.g. IOPS, throughput, etc.), and alerts that draw my attention to potential issues in the cluster, hosts, volumes, and bricks.
As a Gluster Administrator, I want to compare a metric (e.g. IOPS, CPU, Memory, Network Load) across hosts within the cluster.
As a Gluster Administrator, I want to compare utilization (e.g. IOPS, capacity, etc.) across bricks within a volume.
Proposed change
Provide a pre-canned, default cluster dashboard in Grafana (initially launchable from the Tendrl UI and eventually embedded in it) that shows the following metrics, each rendered as text or as a chart/graph depending on the type of metric:
The dashboard is composed of individual panels (dashboard widgets) arranged on a number of rows.
Note: The cluster name/ID should be visible at all times, and the user should be able to switch to another cluster.
Row 1
Panel (Dashboard Widget) 1: Health - Cluster Health
Panel 2: Hosts
+--------------------------+
| |
| Hosts |
| |
| 6 total |
| 5 up |
| 1 down |
+--------------------------+
Panel 3: Volumes
n total - total number (n) of volumes in the cluster
n Up - count (n) of volumes in the cluster that are started and active; see https://github.com/gluster/gstatus for details about the various statuses
n Up (Partial) - count (n) of volumes in the cluster that are up(partial)
n Up (Degraded) - count (n) of volumes in the cluster that are up(degraded)
n Down - count (n) of volumes in the cluster that are down
panel color: green when all volumes are Up; yellow when one or more volumes are up(degraded) or up(partial); red when one or more volumes are down or quorum is lost
chart type: Singlestat (see http://docs.grafana.org/features/panels/singlestat/ for further information)
chart type: Stacked Card
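The counts and color rule for Panel 3 could be derived along the following lines. This is only a minimal sketch: the status strings, the `quorum_lost` flag, and the function name `summarize_volumes` are hypothetical and not part of Tendrl or gstatus. It also assumes that red takes precedence over yellow when both conditions hold, which the spec does not state explicitly.

```python
from collections import Counter

# Hypothetical status strings, modelled on the labels used in the panel description.
UP, PARTIAL, DEGRADED, DOWN = "up", "up(partial)", "up(degraded)", "down"


def summarize_volumes(statuses, quorum_lost=False):
    """Return the Panel 3 counts and panel color for a list of volume statuses."""
    counts = Counter(statuses)
    if counts[DOWN] > 0 or quorum_lost:
        # Assumption: red wins over yellow when both conditions apply.
        color = "red"
    elif counts[PARTIAL] > 0 or counts[DEGRADED] > 0:
        color = "yellow"
    else:
        color = "green"
    return {
        "total": len(statuses),
        "up": counts[UP],
        "up(partial)": counts[PARTIAL],
        "up(degraded)": counts[DEGRADED],
        "down": counts[DOWN],
        "color": color,
    }


if __name__ == "__main__":
    print(summarize_volumes([UP, UP, DEGRADED, DOWN]))
```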
Panel 4: Bricks
[FUTURE] Panel 5: Disks
Panel 6: Snapshots
Panel 7: Geo-replication Sessions
n total - total number (n) of geo-replication sessions for the cluster
n Created - count (n) of geo-replication sessions that are CREATED or established
n Up - count (n) of geo-replication sessions in which all bricks are ONLINE and UP
n Up (Partial) - count (n) of geo-replication sessions for the cluster that are up(partial), i.e. some bricks are online and some bricks are offline
n Stopped - count (n) of geo-replication sessions that are STOPPED
n Down - count (n) of geo-replication sessions in which all bricks are Offline/Down
n Paused - count (n) of geo-replication sessions that are in the paused state
chart type: Stacked Card
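One way to read the Panel 7 definitions is that Created, Stopped, and Paused are session-level states, while Up, Up (Partial), and Down are derived from per-brick status. The sketch below follows that reading. It is an illustration only: the input shape (a session-level state plus a list of per-brick online flags), the label strings, and the function names are assumptions, and the actual status model is governed by Tendrl/gluster-integration#459.

```python
from collections import Counter

# Hypothetical labels, one per counter shown on the panel.
LABELS = ("created", "up", "up(partial)", "stopped", "down", "paused")


def classify_session(state, bricks_online):
    """Map one session to a panel label.

    state: session-level state ("created", "stopped", "paused") or None.
    bricks_online: list of booleans, one per brick in the session.
    """
    # Assumption: an explicit session-level state overrides brick status.
    if state in ("created", "stopped", "paused"):
        return state
    if bricks_online and all(bricks_online):
        return "up"
    if any(bricks_online):
        return "up(partial)"
    return "down"


def summarize_sessions(sessions):
    """sessions: iterable of (state, bricks_online) pairs."""
    sessions = list(sessions)
    counts = Counter(classify_session(state, bricks) for state, bricks in sessions)
    summary = {label: counts[label] for label in LABELS}
    summary["total"] = len(sessions)
    return summary


if __name__ == "__main__":
    print(summarize_sessions([(None, [True, True]), (None, [True, False]), ("paused", [True])]))
```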
Panel 8: Connections Trend
Row 2
Panel 9: Capacity Utilization
Panel 10: Capacity Available
Panel 11: Growth Rate
Panel 12: Time Remaining (Weeks)
[FUTURE] Panel 13: Services Trend
Panel 14: IOPS Trend
Panel 15: IO Size
Panel 16: Network Throughput Trend
Row 3
Panel 17: Top volumes by capacity utilization
Panel 18: Top bricks by capacity utilization
Row 4
Panel 19: CPU used by Host
Panel 20: Memory used by Host
Panel 21: Ping Latency by Host Trend
Note: The layout of the rows, and of the panels within them, may need to change based on implementation and actual visualization, especially when certain metrics need to be aligned together vertically or horizontally.
Alternatives
Create a similar dashboard using PatternFly (www.patternfly.org) or d3.js components to show similar information within the Tendrl UI.
Data model impact:
TBD
Impacted Modules:
TBD
Tendrl API impact:
TBD
Notifications/Monitoring impact:
TBD
Tendrl/common impact:
TBD
Tendrl/node_agent impact:
TBD
Sds integration impact:
TBD
Security impact:
TBD
Other end user impact:
Users will mostly interact with this feature via the Grafana UI. Similar information is also reachable through the Grafana API and Tendrl API, but only by making explicit API calls.
Performance impact:
TBD
Other deployer impact:
Plug-ins required by Grafana will need to be packaged and installed with tendrl-ansible.
This (default) cluster dashboard will need to be automatically generated whenever a cluster is imported to be managed by Tendrl.
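As a rough illustration of generating the default dashboard at import time, the sketch below builds a dashboard skeleton from the row and panel layout listed in this spec. The function name `build_dashboard`, the title format, and the output shape are hypothetical. The output is a simplified dict, not a validated Grafana dashboard definition, so field names would need to be checked against the dashboard JSON model of the Grafana version actually deployed. Panels marked [FUTURE] are skipped by default.

```python
import json

# Row layout copied from the spec: (panel number, title, is_future).
LAYOUT = {
    "Row 1": [
        (1, "Health", False), (2, "Hosts", False), (3, "Volumes", False),
        (4, "Bricks", False), (5, "Disks", True), (6, "Snapshots", False),
        (7, "Geo-replication Sessions", False), (8, "Connections Trend", False),
    ],
    "Row 2": [
        (9, "Capacity Utilization", False), (10, "Capacity Available", False),
        (11, "Growth Rate", False), (12, "Time Remaining (Weeks)", False),
        (13, "Services Trend", True), (14, "IOPS Trend", False),
        (15, "IO Size", False), (16, "Network Throughput Trend", False),
    ],
    "Row 3": [
        (17, "Top volumes by capacity utilization", False),
        (18, "Top bricks by capacity utilization", False),
    ],
    "Row 4": [
        (19, "CPU used by Host", False), (20, "Memory used by Host", False),
        (21, "Ping Latency by Host Trend", False),
    ],
}


def build_dashboard(cluster_name, include_future=False):
    """Return a simplified dashboard skeleton for one imported cluster."""
    rows = []
    for row_title, panels in LAYOUT.items():
        rows.append({
            "title": row_title,
            "panels": [
                {"id": number, "title": title}
                for number, title, future in panels
                if include_future or not future
            ],
        })
    # The cluster name is part of the title so it stays visible at all times.
    return {"title": f"Cluster Dashboard - {cluster_name}", "rows": rows}


if __name__ == "__main__":
    print(json.dumps(build_dashboard("example-cluster"), indent=2))
```

Keeping the layout as data rather than code makes it easier to rearrange panels later, which the layout note in this spec anticipates.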
Developer impact:
TBD
Implementation:
TBD
Assignee(s):
Primary assignee: @cloudbehl
Other contributors: @anmolbabu, @anivargi, @julienlim, @japplewhite
Work Items:
TBD
Estimate:
TBD
Dependencies:
TBD
Testing:
Test whether the health, status, and metrics displayed for a given cluster are correct, and whether the information stays up-to-date as failures or other cluster changes occur.
Documentation impact:
Documentation should explain what is being displayed wherever that is not immediately obvious from looking at the dashboard. This may include, but is not limited to, what each metric refers to, its measurement unit, and how to apply it when troubleshooting problems, e.g. healing / split-brain issues, loss of quorum, etc.
References and Related GitHub Links:
(Drill-down navigation in grafana dashboard #189)