Skip to content

Commit

Permalink
op-guide, media: update overview dashboard
Browse files Browse the repository at this point in the history
  • Loading branch information
lilin90 committed Jun 13, 2018
1 parent f9ed1a5 commit cbee1b7
Show file tree
Hide file tree
Showing 2 changed files with 61 additions and 26 deletions.
Binary file added media/overview.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
87 changes: 61 additions & 26 deletions op-guide/dashboard-overview-info.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,37 +5,72 @@ category: operations

# Key Metrics

If you use Ansible to deploy TiDB cluster, you can deploy the monitoring system at the same time. See [Overview of the Monitoring Framework](monitor-overview.md) for more information.
If you use Ansible to deploy the TiDB cluster, the monitoring system is deployed at the same time. For more information, see [Overview of the Monitoring Framework](monitor-overview.md) .

The Grafana dashboard is divided into four sub dashboards: node_export, PD, TiKV, and TiDB. There are a lot of metics there to help you diagnose. For routine operations, some of the key metrics are displayed on the Overview dashboard so that you can get the overview of the status of the components and the entire cluster. See the following section for their descriptions:
The Grafana dashboard are divided into a series of sub dashboards which include Overview, PD, TiDB, TiKV, Node\_exporter, Disk Performance, and so on. A lot of metrics are there to help you diagnose.

For routine operations, you can get an overview of the component (PD, TiDB, TiKV) status and the entire cluster from the Overview dashboard, where the key metrics are displayed. This document provides a detailed description of these key metrics.

## Key metrics description

To understand the key metrics displayed on the Overview dashboard, check the following table:

Service | Panel Name | Description | Normal Range
---- | ---------------- | ---------------------------------- | --------------
Services Port Status | Services Online | the online nodes number of each service |
Services Port Status | Services Offline | the offline nodes number of each service |
PD | Storage Capacity | the total storage capacity of the TiDB cluster |
PD | Current Storage Size | the occupied storage capacity of the TiDB cluster |
PD | Store Status -- up store | the number of TiKV nodes that are up |
PD | Store Status -- down store | the number of TiKV nodes that are down | `0`. If the number is bigger than `0`, it means some node(s) are not down.
PD | Store Status -- offline store | the number of TiKV nodes that are manually offline|
PD | Store Status -- Tombstone store | the number of TiKV nodes that are Tombstone|
PD | Current storage usage | the storage occupancy rate of the TiKV cluster | If it exceeds 80%, you need to consider adding more TiKV nodes.
PD | 99% completed cmds duration seconds | the 99th percentile duration to complete a pd-server request| less than 5ms
PD | average completed cmds duration seconds | the average duration to complete a pd-server request | less than 50ms
PD | leader balance ratio | the leader ratio difference of the nodes with the biggest leader ratio and the smallest leader ratio | It is less than 5% for a balanced situation. It becomes bigger when a node is restarting.
PD | region balance ratio | the region ratio difference of the nodes with the biggest region ratio and the smallest region ratio | It is less than 5% for a balanced situation. It becomes bigger when adding or removing a node.
TiDB | handle requests duration seconds | the response time to get TSO from PD| less than 100ms
TiDB | tidb server QPS | the QPS of the cluster | application specific
TiDB | connection count | the number of connections from application servers to the database | Application specific. If the number of connections hops, you need to find out the reasons. If it drops to 0, you can check if the network is broken; if it surges, you need to check the application.
TiDB | statement count | the number of different types of statement within a given time | application specific
TiDB | Query Duration 99th percentile | the 99th percentile query time |
TiKV | 99% & 99.99% scheduler command duration | the 99th percentile and 99.99th percentile scheduler command duration| For 99%, it is less than 50ms; for 99.99%, it is less than 100ms.
TiKV | 95% & 99.99% storage async_request duration | the 95th percentile and 99.99th percentile Raft command duration | For 95%, it is less than 50ms; for 99.99%, it is less than 100ms.
TiKV | server report failure message | There might be an issue with the network or the message might not come from this cluster. | If there are large amount of messages which contains `unreachable`, there might be an issue with the network. If the message contains `store not match`, the message does not come from this cluster.
TiKV | Vote |the frequency of the Raft vote | Usually, the value only changes when there is a split. If the value of Vote remains high for a long time, the system might have a severe issue and some nodes are not working.
TiKV | 95% and 99% coprocessor request duration | the 95th percentile and the 99th percentile coprocessor request duration | Application specific. Usually, the value does not remain high.
TiKV | Pending task | the number of pending tasks | Except for PD worker, it is not normal if the value is too high.
TiKV | stall | RocksDB stall time | If the value is bigger than 0, it means that RocksDB is too busy, and you need to pay attention to IO and CPU usage.
TiKV | channel full | The channel is full and the threads are too busy. | If the value is bigger than 0, the threads are too busy.
TiKV | 95% send message duration seconds | the 95th percentile message sending time | less than 50ms
TiKV | leader/region | the number of leader/region per TiKV server| application specific
PD | Number of Regions | the total number of Regions of the current cluster |
PD | Leader Balance Ratio | the leader ratio difference of the nodes with the biggest leader ratio and the smallest leader ratio | It is less than 5% for a balanced situation and becomes bigger when you restart a node.
PD | Region Balance Ratio | the region ratio difference of the nodes with the biggest Region ratio and the smallest Region ratio | It is less than 5% for a balanced situation and becomes bigger when you add or remove a node.
PD | Store Status -- Up Stores | the number of TiKV nodes that are up |
PD | Store Status -- Disconnect Stores | the number of TiKV nodes that encounter abnormal communication within a short time |
PD | Store Status -- LowSpace Stores | the number of TiKV nodes with an available space of less than 80% |
PD | Store Status -- Down Stores | the number of TiKV nodes that are down | The normal value is `0`. If the number is bigger than `0`, it means some node(s) are abnormal.
PD | Store Status -- Offline Stores | the number of TiKV nodes (still providing service) that are being made offline |
PD | Store Status -- Tombstone Stores | the number of TiKV nodes that are successfully offline |
PD | 99% completed_cmds_duration_seconds | the 99th percentile duration to complete a pd-server request | less than 5ms
PD | handle_requests_duration_seconds | the request duration of a PD request |
TiDB | Statement OPS | the total number of executed SQL statements, including `SELECT`, `INSERT`, `UPDATE` and so on |
TiDB | Duration | the execution time of a SQL statement |
TiDB | QPS By Instance | the QPS on each TiDB instance |
TiDB | Failed Query OPM | the number of failed SQL statements, including syntax error and key conflicts and so on |
TiDB | Connection Count | the connection number of each TiDB instance |
TiDB | Heap Memory Usage | the size of heap memory used by each TiDB instance |
TiDB | Transaction OPS | the number of executed transactions per second |
TiDB | Transaction Duration | the execution time of a transaction |
TiDB | KV Cmd OPS | the number of executed KV commands |
TiDB | KV Cmd Duration 99 | the execution time of the KV command |
TiDB | PD TSO OPS | the number of TSO that TiDB obtains from PD |
TiDB | PD TSO Wait Duration | the time consumed when TiDB obtains TSO from PD |
TiDB | TiClient Region Error OPS | the number of Region related errors returned by TiKV |
TiDB | Lock Resolve OPS | the number of transaction related conflicts |
TiDB | Load Schema Duration | the time consumed when TiDB obtains Schema from TiKV |
TiDB | KV Backoff OPS | the number of errors returned by TiKV (such as transaction conflicts )
TiKV | leader | the number of leaders on each TiKV node |
TiKV | region | the number of Regions on each TiKV node |
TiKV | CPU | the CPU usage ratio on each TiKV node |
TiKV | Memory | the memory usage on each TiKV node |
TiKV | store size | the data amount on each TiKV node |
TiKV | cf size | the data amount on different CFs in the cluster |
TiKV | channel full | `No data points` is displayed in normal conditions. If a monitoring value displays, it means the corresponding TiKV node fails to handle the messages |
TiKV | server report failures | `No data points` is displayed in normal conditions. If `Unreachable` is displayed, it means TiKV encounters a communication issue. |
TiKV | scheduler pending commands | the number of commits on queue | Occasional value peaks are normal.
TiKV | coprocessor pending requests | the number of requests on queue | `0` or very small
TiKV | coprocessor executor count | the number of various query operations |
TiKV | coprocessor request duration | the time consumed by TiKV queries |
TiKV | raft store CPU | the CPU usage ratio of the raftstore thread | Currently, it is a single thread. A value of over 80% indicates that the CPU usage ratio is very high.
TiKV | Coprocessor CPU | the CPU usage ratio of the TiKV query thread, related to the application; complex queries consume a great deal of CPU |
System Info | Vcores | the number of CPU cores |
System Info | Memory | the total memory |
System Info | CPU Usage | the CPU usage ratio, 100% at a maximum |
System Info | Load [1m] | the overload within 1 minute |
System Info | Memory Available | the size of the available memory |
System Info | Network Traffic | the statistics of the network traffic |
System Info | TCP Retrans | the statistics about network monitoring and TCP |
System Info | IO Util | the disk usage ratio, 100% at a maximum; generally you need to consider adding a new node when the usage ratio is up to 80% ~ 90% |

## Interface of the Overview dashboard

![Overview Dashboard](../media/overview.png)

0 comments on commit cbee1b7

Please sign in to comment.