Commit 11c7ad9: Add cpu and traffic into performance overview (pingcap#18698)

dbsid authored Sep 10, 2024
1 parent 2a4c91b
Showing 6 changed files with 138 additions and 60 deletions.
108 changes: 64 additions & 44 deletions dashboard/dashboard-monitoring.md
@@ -21,31 +21,31 @@ If the TiDB cluster is deployed using TiUP, you can also view the Performance Overview dashboard.

The Performance Overview dashboard orchestrates the metrics of TiDB, PD, and TiKV, and presents each of them in the following sections:

- Overview: Database time and SQL execution time summary. By checking different colors in the overview, you can quickly identify the database workload profile and the performance bottleneck.
- **Overview**: Database time and SQL execution time summary. By checking different colors in the overview, you can quickly identify the database workload profile and the performance bottleneck.

- Load profile: Key metrics and resource usage, including database QPS, connection information, the MySQL command types the application interacts with TiDB, database internal TSO and KV request OPS, and resource usage of the TiKV and TiDB.
- **Load profile**: Key metrics and resource usage, including database QPS, connection information, the MySQL command types the application interacts with TiDB, database internal TSO and KV request OPS, and resource usage of the TiKV and TiDB.

- Top-down latency breakdown: Query latency versus connection idle time ratio, query latency breakdown, TSO/KV request latency during execution, breakdown of write latency within TiKV.
- **Top-down latency breakdown**: Query latency versus connection idle time ratio, query latency breakdown, TSO/KV request latency during execution, breakdown of write latency within TiKV.

The following sections illustrate the metrics on the Performance Overview dashboard.

### Database Time by SQL Type

- database time: Total database time per second
- sql_type: Database time consumed by each type of SQL statements per second
- `database time`: Total database time per second
- `sql_type`: Database time consumed by each type of SQL statement per second

### Database Time by SQL Phase

- database time: Total database time per second
- get token/parse/compile/execute: Database time consumed in four SQL processing phases
- `database time`: Total database time per second
- `get token/parse/compile/execute`: Database time consumed in four SQL processing phases

In general, the SQL execution phase is shown in green and other phases in red. If the non-green areas are large, much database time is consumed in phases other than execution, and further cause analysis is required.

### SQL Execute Time Overview

- execute time: Database time consumed during SQL execution per second
- tso_wait: Concurrent TSO waiting time per second during SQL execution
- kv request type: Time waiting for each KV request type per second during SQL execution. The total KV request wait time might exceed SQL execution time, because KV requests are concurrent.
- `execute time`: Database time consumed during SQL execution per second
- `tso_wait`: Concurrent TSO waiting time per second during SQL execution
- `kv request type`: Time waiting for each KV request type per second during SQL execution. The total KV request wait time might exceed SQL execution time, because KV requests are concurrent.

Green metrics stand for common KV write requests (such as prewrite and commit), blue metrics stand for common read requests, and metrics in other colors stand for unexpected situations that need your attention. For example, pessimistic lock KV requests are marked red and TSO waiting is marked dark brown.

@@ -77,50 +77,70 @@ Generally, `tso - request` divided by `tso - cmd` is the average size of TSO request batches.

### Connection Count

- total: Number of connections to all TiDB instances
- active connections: Number of active connections to all TiDB instances
- `total`: Number of connections to all TiDB instances
- `active connections`: Number of active connections to all TiDB instances
- Number of connections to each TiDB instance

### TiDB CPU
### TiDB CPU/Memory

- avg: Average CPU utilization across all TiDB instances
- delta: Maximum CPU utilization of all TiDB instances minus minimum CPU utilization of all TiDB instances
- max: Maximum CPU utilization across all TiDB instances
- `CPU-Avg`: Average CPU utilization across all TiDB instances
- `CPU-Delta`: Maximum CPU utilization of all TiDB instances minus minimum CPU utilization of all TiDB instances
- `CPU-Max`: Maximum CPU utilization across all TiDB instances
- `CPU-Quota`: Number of CPU cores that can be used by TiDB
- `Mem-Max`: Maximum memory utilization across all TiDB instances
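
To make these definitions concrete, the following minimal sketch derives the panel values from per-instance samples. The instance names and utilization figures are illustrative, not taken from a real cluster:

```python
# Hypothetical per-instance CPU utilization samples in percent,
# where 100% equals one fully used core.
tidb_cpu = {"tidb-0": 545.0, "tidb-1": 602.0, "tidb-2": 651.0}
quota_cores = 8  # CPU-Quota: cores available to each TiDB instance

cpu_avg = sum(tidb_cpu.values()) / len(tidb_cpu)  # CPU-Avg
cpu_max = max(tidb_cpu.values())                  # CPU-Max
cpu_delta = cpu_max - min(tidb_cpu.values())      # CPU-Delta

print(f"CPU-Avg: {cpu_avg:.0f}%  CPU-Max: {cpu_max:.0f}%  "
      f"CPU-Delta: {cpu_delta:.0f}%  CPU-Quota: {quota_cores * 100}%")
```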

### TiKV CPU/IO MBps
### TiKV CPU/Memory

- CPU-Avg: Average CPU utilization of all TiKV instances
- CPU-Delta: Maximum CPU utilization of all TiKV instances minus minimum CPU utilization of all TiKV instances
- CPU-MAX: Maximum CPU utilization among all TiKV instances
- IO-Avg: Average MBps of all TiKV instances
- IO-Delt: Maximum MBps of all TiKV instances minus minimum MBps of all TiKV instances
- IO-MAX: Maximum MBps of all TiKV instances
- `CPU-Avg`: Average CPU utilization across all TiKV instances
- `CPU-Delta`: Maximum CPU utilization of all TiKV instances minus minimum CPU utilization of all TiKV instances
- `CPU-Max`: Maximum CPU utilization across all TiKV instances
- `CPU-Quota`: Number of CPU cores that can be used by TiKV
- `Mem-Max`: Maximum memory utilization across all TiKV instances

### PD CPU/Memory

- `CPU-Max`: Maximum CPU utilization across all PD instances
- `CPU-Quota`: Number of CPU cores that can be used by PD
- `Mem-Max`: Maximum memory utilization across all PD instances

### Read Traffic

- `TiDB -> Client`: The outbound traffic statistics from TiDB to the client
- `Rocksdb -> TiKV`: The rate at which TiKV retrieves data from RocksDB during read operations within the storage layer

### Write Traffic

- `Client -> TiDB`: The inbound traffic statistics from the client to TiDB
- `TiDB -> TiKV: general`: The rate at which foreground transactions are written from TiDB to TiKV
- `TiDB -> TiKV: internal`: The rate at which internal transactions are written from TiDB to TiKV
- `TiKV -> Rocksdb`: The flow of write operations from TiKV to RocksDB
- `RocksDB Compaction`: The total read and write I/O flow generated by RocksDB compaction operations
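
Read together, these panels let you estimate write amplification inside the storage layer. The following sketch is illustrative only; the sample readings mirror the TPC-C example in performance-tuning-methods.md:

```python
# Sample panel readings in MB/s, mirroring the TPC-C example
# in performance-tuning-methods.md.
tidb_to_tikv = 13.1      # TiDB -> TiKV: general
tikv_to_rocksdb = 109.0  # TiKV -> Rocksdb
compaction = 567.0       # RocksDB Compaction (read + write I/O)

# Replication across replicas and write-ahead logging amplify the
# logical write flow before it reaches RocksDB.
storage_amp = tikv_to_rocksdb / tidb_to_tikv   # ~8.3x
# Compaction further multiplies the I/O that actually hits disk.
compaction_amp = compaction / tikv_to_rocksdb  # ~5.2x

print(f"TiDB->TiKV to RocksDB amplification: {storage_amp:.1f}x")
print(f"RocksDB compaction amplification: {compaction_amp:.1f}x")
```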

### Duration

- Duration: Execution time
- `Duration`: Execution time

- The duration from when TiDB receives a request from the client until TiDB finishes executing the request and returns the result to the client. In general, client requests are sent in the form of SQL statements; however, this duration can include the execution time of commands such as `COM_PING`, `COM_SLEEP`, `COM_STMT_FETCH`, and `COM_SEND_LONG_DATA`.
- TiDB supports Multi-Query, which means the client can send multiple SQL statements at one time, such as `select 1; select 1; select 1;`. In this case, the total execution time of this query includes the execution time of all SQL statements.

- avg: Average time to execute all requests
- 99: P99 duration to execute all requests
- avg by type: Average time to execute all requests in all TiDB instances, collected by type: `SELECT`, `INSERT`, and `UPDATE`
- `avg`: Average time to execute all requests
- `99`: P99 duration to execute all requests
- `avg by type`: Average time to execute all requests in all TiDB instances, collected by type: `SELECT`, `INSERT`, and `UPDATE`

### Connection Idle Duration

Connection Idle Duration indicates how long a connection stays idle.

- avg-in-txn: Average connection idle duration when the connection is within a transaction
- avg-not-in-txn: Average connection idle duration when the connection is not within a transaction
- 99-in-txn: P99 connection idle duration when the connection is within a transaction
- 99-not-in-txn: P99 connection idle duration when the connection is not within a transaction
- `avg-in-txn`: Average connection idle duration when the connection is within a transaction
- `avg-not-in-txn`: Average connection idle duration when the connection is not within a transaction
- `99-in-txn`: P99 connection idle duration when the connection is within a transaction
- `99-not-in-txn`: P99 connection idle duration when the connection is not within a transaction

### Parse Duration, Compile Duration, and Execute Duration

- Parse Duration: Time consumed in parsing SQL statements
- Compile Duration: Time consumed in compiling the parsed SQL AST to execution plans
- Execution Duration: Time consumed in executing execution plans of SQL statements
- `Parse Duration`: Time consumed in parsing SQL statements
- `Compile Duration`: Time consumed in compiling the parsed SQL AST to execution plans
- `Execute Duration`: Time consumed in executing execution plans of SQL statements

All three of these metrics include the average duration and the 99th percentile duration across all TiDB instances.

@@ -134,25 +154,25 @@ Average time consumed in executing gRPC requests in all TiKV instances based on request type.

### PD TSO Wait/RPC Duration

- wait - avg: Average time in waiting for PD to return TSO in all TiDB instances
- rpc - avg: Average time from sending TSO requests to PD to receiving TSO in all TiDB instances
- wait - 99: P99 time in waiting for PD to return TSO in all TiDB instances
- rpc - 99: P99 time from sending TSO requests to PD to receiving TSO in all TiDB instances
- `wait - avg`: Average time in waiting for PD to return TSO in all TiDB instances
- `rpc - avg`: Average time from sending TSO requests to PD to receiving TSO in all TiDB instances
- `wait - 99`: P99 time in waiting for PD to return TSO in all TiDB instances
- `rpc - 99`: P99 time from sending TSO requests to PD to receiving TSO in all TiDB instances

### Storage Async Write Duration, Store Duration, and Apply Duration

- Storage Async Write Duration: Time consumed in asynchronous write
- Store Duration: Time consumed in store loop during asynchronously write
- Apply Duration: Time consumed in apply loop during asynchronously write
- `Storage Async Write Duration`: Time consumed in asynchronous write
- `Store Duration`: Time consumed in the store loop during asynchronous write
- `Apply Duration`: Time consumed in the apply loop during asynchronous write

All three of these metrics include the average duration and P99 duration across all TiKV instances.

Average storage async write duration = Average store duration + Average apply duration

### Append Log Duration, Commit Log Duration, and Apply Log Duration

- Append Log Duration: Time consumed by Raft to append logs
- Commit Log Duration: Time consumed by Raft to commit logs
- Apply Log Duration: Time consumed by Raft to apply logs
- `Append Log Duration`: Time consumed by Raft to append logs
- `Commit Log Duration`: Time consumed by Raft to commit logs
- `Apply Log Duration`: Time consumed by Raft to apply logs

All three of these metrics include the average duration and P99 duration across all TiKV instances.
Binary file added media/performance/titan_disable.png
Binary file added media/performance/titan_enable.png
Binary file added media/performance/tpcc_cpu_memory.png
Binary file added media/performance/tpcc_read_write_traffic.png
90 changes: 74 additions & 16 deletions performance-tuning-methods.md
@@ -216,34 +216,92 @@ In this workload, only `ANALYZE` statements are running in the cluster:
- The total number of KV requests per second is 35.5 and the number of Cop requests per second is 9.3.
- Most of the KV processing time is spent on `Cop-internal_stats`, which indicates that the most time-consuming KV request is `Cop` from internal `ANALYZE` operations.

#### TiDB CPU, TiKV CPU, and IO usage
#### CPU and memory usage

In the TiDB CPU and TiKV CPU/IO MBps panels, you can observe the logical CPU usage and IO throughput of TiDB and TiKV, including average, maximum, and delta (maximum CPU usage minus minimum CPU usage), based on which you can determine the overall CPU usage of TiDB and TiKV.
In the CPU/Memory panels for TiDB, TiKV, and PD, you can monitor the logical CPU usage and memory consumption of each component, including the average CPU, maximum CPU, delta CPU (maximum CPU usage minus minimum CPU usage), CPU quota, and maximum memory usage. Based on these metrics, you can determine the overall resource usage of TiDB, TiKV, and PD.

- Based on the `delta` value, you can determine if CPU usage in TiDB is unbalanced (usually accompanied by unbalanced application connections) and if there are read/write hot spots among the cluster.
- With an overview of TiDB and TiKV resource usage, you can quickly determine if there are resource bottlenecks in your cluster and whether TiKV or TiDB needs scale-out.
- Based on the `delta` value, you can determine if CPU usage in TiDB or TiKV is unbalanced. For TiDB, a high `delta` usually means that application connections are unbalanced among the TiDB instances; for TiKV, a high `delta` usually means that there are read/write hot spots in the cluster.
- With an overview of TiDB, TiKV, and PD resource usage, you can quickly determine if there are resource bottlenecks in your cluster and whether TiDB, TiKV, or PD needs scale-out or scale-up, as the sketch after this list illustrates.
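
The following sketch turns this reasoning into a rough decision rule. The 80% and 30% thresholds are illustrative assumptions, not official tuning guidance; the sample readings are taken from Example 1 below:

```python
def diagnose_cpu(component: str, cpu_max: float, cpu_delta: float,
                 quota_cores: int) -> str:
    """Interpret the CPU-Max and CPU-Delta panel readings (in percent)
    against CPU-Quota. The thresholds are illustrative only."""
    quota_pct = quota_cores * 100
    if cpu_max > 0.8 * quota_pct:    # close to the quota: likely a bottleneck
        return f"{component}: near CPU quota, consider scale-out or scale-up"
    if cpu_delta > 0.3 * quota_pct:  # instances diverge: unbalanced load or hot spots
        return f"{component}: unbalanced CPU, check connections or hot spots"
    return f"{component}: CPU usage looks healthy"

# Panel readings from Example 1 below (16-core TiDB and TiKV instances).
print(diagnose_cpu("TiDB", cpu_max=934, cpu_delta=322, quota_cores=16))
print(diagnose_cpu("TiKV", cpu_max=1505, cpu_delta=283, quota_cores=16))
```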

**Example 1: High TiDB resource usage**
**Example 1: High TiKV resource usage**

In this workload, each TiDB and TiKV is configured with 8 CPUs.
In the following TPC-C workload, each TiDB and TiKV instance is configured with 16 CPUs, and each PD instance with 4 CPUs.

![TPC-C](/media/performance/tidb_high_cpu.png)
![TPC-C](/media/performance/tpcc_cpu_memory.png)

- The average, maximum, and delta CPU usage of TiDB are 575%, 643%, and 136%, respectively.
- The average, maximum, and delta CPU usage of TiKV are 146%, 215%, and 118%, respectively. The average, maximum, and delta I/O throughput of TiKV are 9.06 MB/s, 19.7 MB/s, and 17.1 MB/s, respectively.
- The average, maximum, and delta CPU usage of TiDB are 761%, 934%, and 322%, respectively. The maximum memory usage is 6.86 GiB.
- The average, maximum, and delta CPU usage of TiKV are 1343%, 1505%, and 283%, respectively. The maximum memory usage is 27.1 GiB.
- The maximum CPU usage of PD is 59.1%. The maximum memory usage is 221 MiB.

Obviously, TiDB consumes more CPU, which is near the bottleneck threshold of 8 CPUs. It is recommended that you scale out the TiDB.
Obviously, TiKV consumes more CPU, which is expected because TPC-C is a write-heavy scenario. To improve performance, it is recommended to scale out TiKV.

**Example 2: High TiKV resource usage**
#### Data traffic

In the TPC-C workload below, each TiDB and TiKV is configured with 16 CPUs.
The read and write traffic panels offer insight into traffic patterns within your TiDB cluster, allowing you to comprehensively monitor the data flow from clients to the database and between internal components.

![TPC-C](/media/performance/tpcc_cpu_io.png)
- Read traffic

- The average, maximum, and delta CPU usage of TiDB are 883%, 962%, and 153%, respectively.
- The average, maximum, and delta CPU usage of TiKV are 1288%, 1360%, and 126%, respectively. The average, maximum, and delta I/O throughput of TiKV are 130 MB/s, 153 MB/s, and 53.7 MB/s, respectively.
- `TiDB -> Client`: the outbound traffic statistics from TiDB to the client
- `Rocksdb -> TiKV`: the rate at which TiKV retrieves data from RocksDB during read operations within the storage layer

Obviously, TiKV consumes more CPU, which is expected because TPC-C is a write-heavy scenario. It is recommended that you scale out the TiKV to improve performance.
- Write traffic

- `Client -> TiDB`: the inbound traffic statistics from the client to TiDB
- `TiDB -> TiKV: general`: the rate at which foreground transactions are written from TiDB to TiKV
- `TiDB -> TiKV: internal`: the rate at which internal transactions are written from TiDB to TiKV
- `TiKV -> Rocksdb`: the flow of write operations from TiKV to RocksDB
- `RocksDB Compaction`: the total read and write I/O flow generated by RocksDB compaction operations. If `RocksDB Compaction` is significantly higher than `TiKV -> Rocksdb` and your average row size is larger than 512 bytes, you can enable Titan to reduce the compaction I/O flow by setting `min-blob-size` to `"512B"` or `"1KB"` and `blob-file-compression` to `"zstd"`, as in the following configuration:

```toml
[rocksdb.titan]
enabled = true
[rocksdb.defaultcf.titan]
min-blob-size = "1KB"
blob-file-compression = "zstd"
```
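
As a sketch of this rule of thumb, the following hypothetical helper combines the two conditions. The 3x ratio standing in for "significantly higher" is an illustrative assumption:

```python
def should_enable_titan(compaction_mb_s: float, tikv_to_rocksdb_mb_s: float,
                        avg_row_size_bytes: float) -> bool:
    """Heuristic from the text above: consider enabling Titan when RocksDB
    compaction I/O dominates the TiKV -> Rocksdb write flow and the
    average row size exceeds 512 bytes."""
    # "Significantly higher" is approximated here as a 3x ratio (assumption).
    compaction_dominates = compaction_mb_s > 3 * tikv_to_rocksdb_mb_s
    return compaction_dominates and avg_row_size_bytes > 512

# Panel readings from Example 2 below (6 KB records, before Titan is enabled).
print(should_enable_titan(compaction_mb_s=10_600,
                          tikv_to_rocksdb_mb_s=753,
                          avg_row_size_bytes=6 * 1024))  # True
```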

**Example 1: Read and write traffic in the TPC-C workload**

The following is an example of read and write traffic in the TPC-C workload.

- Read traffic

- `TiDB -> Client`: 14.2 MB/s
- `Rocksdb -> TiKV`: 469 MB/s. Note that both read operations (`SELECT` statements) and write operations (`INSERT`, `UPDATE`, and `DELETE` statements) require reading data from RocksDB into TiKV before committing a transaction.

- Write traffic

- `Client -> TiDB`: 5.05 MB/s
- `TiDB -> TiKV: general`: 13.1 MB/s
- `TiDB -> TiKV: internal`: 5.07 KB/s
- `TiKV -> Rocksdb`: 109 MB/s
- `RocksDB Compaction`: 567 MB/s

![TPC-C](/media/performance/tpcc_read_write_traffic.png)

**Example 2: Write traffic before and after Titan is enabled**

The following example shows the performance changes before and after Titan is enabled. For an insert workload with 6 KB records, Titan significantly reduces write traffic and compaction I/O, enhancing overall performance and resource utilization of TiKV.

- Write traffic before Titan is enabled

- `Client -> TiDB`: 510 MB/s
- `TiDB -> TiKV: general`: 187 MB/s
- `TiDB -> TiKV: internal`: 3.2 KB/s
- `TiKV -> Rocksdb`: 753 MB/s
- `RocksDB Compaction`: 10.6 GB/s

![Titan Disable](/media/performance/titan_disable.png)

- Write traffic after Titan is enabled

- `Client -> TiDB`: 586 MB/s
- `TiDB -> TiKV: general`: 295 MB/s
- `TiDB -> TiKV: internal`: 3.66 KB/s
- `TiKV -> Rocksdb`: 1.21 GB/s
- `RocksDB Compaction`: 4.68 GB/s

![Titan Enable](/media/performance/titan_enable.png)

### Query latency breakdown and key latency metrics
