Performance Diagnosis Enhancements #34106

Open · 65 tasks
you06 opened this issue Apr 19, 2022 · 5 comments
Labels: type/enhancement (The issue or PR belongs to an enhancement.)

Comments

@you06
Contributor

you06 commented Apr 19, 2022

Enhancement

TiDB users often struggle with performance diagnosis when trying to tune a cluster to fit their workloads. Missing metrics in the system can block the diagnosis and make tuning hard. To solve this problem, more query details and more metrics are required.

Data to be collected

Common Path

The common path includes the processes before a query is executed: connection handling, parsing, and optimization. A sketch of how this breakdown could be recorded follows the list below.

  • Read packet duration
  • Parse duration
  • TSO duration
  • Stmt retry count and duration (the duration of failed stmt executions)
  • Write response duration
  • Fetch wait duration
  • Execution duration = query duration - all of the above
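As a rough sketch (not TiDB's actual session types; all names below are hypothetical), the breakdown could be captured per statement and the execution duration derived by subtraction:

```go
package main

import "time"

// CommonPathDetails is a hypothetical per-statement breakdown of the
// common path; the struct and field names are illustrative only.
type CommonPathDetails struct {
	QueryDuration time.Duration
	ReadPacket    time.Duration
	Parse         time.Duration
	TSOWait       time.Duration
	StmtRetry     time.Duration // total duration of failed stmt executions
	WriteResponse time.Duration
	FetchWait     time.Duration
}

// ExecutionDuration is the query duration minus all other recorded phases,
// matching the formula above.
func (d *CommonPathDetails) ExecutionDuration() time.Duration {
	return d.QueryDuration - d.ReadPacket - d.Parse - d.TSOWait -
		d.StmtRetry - d.WriteResponse - d.FetchWait
}
```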

Read

Read requests are mainly processed in executors, which can be divided into calculation executors and data source executors. TiDB reads data by get, batch-get, scan (iter, only used by internal txns), and coprocessor (unary requests only). When TiDB hits an error during a read, such as a lock error or a network error, it automatically retries; such retry time should be recorded, and all requests except the last successful one should be counted as retry duration (see the sketch below).
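As a sketch of this accounting rule (sendWithRetry and sendOnce are hypothetical stand-ins, not client-go APIs):

```go
package main

import (
	"errors"
	"time"
)

// sendWithRetry illustrates the rule: the elapsed time of every attempt
// except the last, successful one is accumulated as retry duration.
// sendOnce stands in for an actual get/batch-get/scan/coprocessor request.
func sendWithRetry(sendOnce func() error, maxRetry int) (retryDur time.Duration, err error) {
	for i := 0; i < maxRetry; i++ {
		start := time.Now()
		if err = sendOnce(); err == nil {
			// The successful attempt counts as request duration, not retry duration.
			return retryDur, nil
		}
		// Failed attempts (lock error, network error, ...) add to retry duration.
		retryDur += time.Since(start)
	}
	return retryDur, errors.New("retries exhausted")
}
```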

Get

  • Get KV size
  • Get request duration
  • Get request retry duration

BatchGet

  • BatchGet KVs size
  • BatchGet regions duration
  • BatchGet single region duration
  • BatchGet single region retry duration

Coprocessor

  • Copr Response Size
  • Fetch response duration
  • Copr task wait duration
  • Copr task duration
  • Copr task retry duration

Scan

  • Scan Size
  • Scan request duration
  • Scan request retry duration

UnionScan

  • Read mem rows duration

Resolve lock For Read

  • Resolve lock keys count
  • Resolve locks duration
  • Resolve lock wait duration
  • Get txn status duration
  • Get txn status retry duration
  • Resolve pessimistic lock duration

Write

Write requests are buffered in TiDB's memory until the transaction commits. Pessimistic transactions additionally acquire pessimistic locks during execution, and some write operations depend on read results.

Pessimistic lock

  • Lock keys count
  • Lock keys duration
  • Token wait duration per batch
  • Lock single batch duration
  • Lock single batch retry duration

Prewrite

  • Prewrite mutation size
  • Async commit get minCommitTS duration
  • Prewrite mutations duration
  • Token wait duration per batch
  • Prewrite single batch duration
  • Prewrite single batch retry duration

Non Async Commit

  • Commit keys count
  • Get commitTS duration
  • Commit primary key batch duration
  • Commit mutations duration
  • Token wait duration per batch
  • Commit single batch duration
  • Commit single batch retry duration

Async Commit

  • Commit keys count
  • Commit mutations duration
  • Token wait duration per batch
  • Commit single batch duration
  • Commit single batch retry duration

Batch KV Client

Batch System

  • RPC duration
  • Batch size
  • Batch wait duration
  • Batch fetch duration
  • Batch send latency
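A sketch of how the batch-system metrics could be declared with client_golang; the metric names, namespaces, and buckets are placeholders rather than the ones TiDB actually registers:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Placeholder metric definitions for the batch system.
var (
	batchSize = prometheus.NewHistogram(prometheus.HistogramOpts{
		Namespace: "tidb", Subsystem: "tikvclient", Name: "batch_size",
		Help:    "Number of requests packed into one batch.",
		Buckets: prometheus.ExponentialBuckets(1, 2, 10),
	})
	batchWaitDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Namespace: "tidb", Subsystem: "tikvclient", Name: "batch_wait_seconds",
		Help:    "Time a request waits in the batch queue before being sent.",
		Buckets: prometheus.ExponentialBuckets(0.0005, 2, 18),
	})
)

func init() {
	prometheus.MustRegister(batchSize, batchWaitDuration)
}
```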

gRPC

  • gRPC queue length
  • gRPC queue wait duration
  • gRPC reconnect count and duration

Region Cache

  • Region cache miss event
  • Region error count
  • Load region(on cache miss) duration

Data to be verified or enhanced

  • Transaction panel: the transaction stmt number, transaction size, and transaction region size. We need to verify whether these metrics are correct and necessary.
  • The lock resolve ops panel: it is difficult to understand the types and to classify their impact on cluster performance.

Reference

  • The tikv part
  • The internal resource usage and impact insight.
  • The incorrect txn region metric
you06 added the type/enhancement label on Apr 19, 2022
@you06
Contributor Author

you06 commented Apr 21, 2022

Generally, getting a TS can be divided into two types: async and sync.

The async way is only used at txn or stmt start.

The sync way is used in the following cases:

  • Get ts for RC read or forUpdateRead
  • Get ts as minCommitTS or commitTS

So we may just tag the TSO duration with these two types and display them accordingly, as in the sketch below.
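One possible shape for this, as a sketch: a histogram with a type label whose values are "async" and "sync" (the metric name and labels are assumptions, not existing TiDB metrics):

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical TSO wait histogram with a "type" label distinguishing the
// async path (txn/stmt start) from the sync path (RC read, forUpdateRead,
// minCommitTS/commitTS). Register it with prometheus.MustRegister before
// exposing /metrics.
var tsoWaitDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Namespace: "tidb", Subsystem: "pdclient", Name: "tso_wait_seconds",
	Help:    "TSO wait duration, tagged by how the TS is fetched.",
	Buckets: prometheus.ExponentialBuckets(0.0005, 2, 18),
}, []string{"type"})

func observeTSOWait(async bool, d time.Duration) {
	typ := "sync"
	if async {
		typ = "async"
	}
	tsoWaitDuration.WithLabelValues(typ).Observe(d.Seconds())
}
```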

@zyguan
Contributor

zyguan commented Apr 26, 2022

We may also need to add some metrics for tracking memory usage. Currently, it's hard to tell users why the server gets OOM-killed.
[Screenshots: 2022-04-26_123558, 2022-04-26_151830]
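For basic visibility, process-level heap usage could be sampled from the Go runtime and exported as gauges, as a sketch (metric names are placeholders; TiDB's per-session memory trackers would still be needed to explain which queries consumed the memory):

```go
package main

import (
	"runtime"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// heapInuse is a placeholder gauge; register it with prometheus.MustRegister
// before exposing /metrics.
var heapInuse = prometheus.NewGauge(prometheus.GaugeOpts{
	Namespace: "tidb", Subsystem: "server", Name: "heap_inuse_bytes",
	Help: "Heap bytes currently in use, sampled from runtime.MemStats.",
})

// sampleMemStats periodically reads runtime memory stats into the gauge.
func sampleMemStats(interval time.Duration) {
	var ms runtime.MemStats
	for range time.Tick(interval) {
		runtime.ReadMemStats(&ms)
		heapInuse.Set(float64(ms.HeapInuse))
	}
}
```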

@cfzjywxk
Contributor

Lock contention may have a significant impact on cluster performance and query latency. Currently, there's no convenient way to do the diagnosis work. Usually, performance insight is requested to answer:

  • What is the quantified impact on cluster performance, such as throughput and query latency?
  • What are the specific causes of the contention, or which queries or transaction usages lead to it?

As lock-view is not integrated into the dashboard yet and does not support checking historical data, we still need to enhance the existing diagnosis means such as the slow log and monitoring metrics.

@you06
Contributor Author

you06 commented Apr 28, 2022

Since #33963 introduces the request source for metrics, we may attach this information to more metrics, including memory tracking, acquired locks, etc. (a sketch follows).
The causes of contention are hard to discover from aggregated metrics alone.
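A sketch of how the request source could be propagated and attached as a metric label; the context helpers and the label/metric names below are hypothetical, not the actual #33963 implementation:

```go
package main

import (
	"context"

	"github.com/prometheus/client_golang/prometheus"
)

type requestSourceKey struct{}

// WithRequestSource / requestSourceFromCtx are hypothetical helpers for
// carrying a request-source tag through the call chain.
func WithRequestSource(ctx context.Context, source string) context.Context {
	return context.WithValue(ctx, requestSourceKey{}, source)
}

func requestSourceFromCtx(ctx context.Context) string {
	if s, ok := ctx.Value(requestSourceKey{}).(string); ok {
		return s
	}
	return "unknown"
}

// Example: a lock-keys counter broken down by request source. Register it
// with prometheus.MustRegister before exposing /metrics.
var lockKeysCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
	Namespace: "tidb", Subsystem: "tikvclient", Name: "lock_keys_total",
	Help: "Number of keys locked, labeled by request source.",
}, []string{"source"})

func recordLockKeys(ctx context.Context, n int) {
	lockKeysCounter.WithLabelValues(requestSourceFromCtx(ctx)).Add(float64(n))
}
```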

@cfzjywxk
Contributor

@zyguan @you06
The kv request duration currently represents the whole path:

sendReqToRegion -> tidb batch client -> tidb grpc client -> env -> tikv grpc server -> |tikv grpc process -> tikv grpc client ->| env -> tidb grpc server -> tidb batch client.

The tikv internal part is recorded as tikv grpc duration; the other parts are still missing and difficult to diagnose. For example, we may see slow queries with just one point-get executor, yet find no corresponding slowness in the tikv grpc duration. We need a way to verify whether the slowness comes from the tidb side or from the environment:

  1. From the kv client side, we could record request-related durations in the tidb batch client.
  2. For the tidb grpc part, there does not seem to be much we can record unless we hack into grpc; the same goes for the tikv grpc-rs wrapper.

So maybe we could start by improving the first step (a rough sketch follows the path below), and then verify more parts:
sendReqToRegion -> tidb batch client -> tidb grpc client -> env -> tikv grpc server -> |tikv grpc process -> tikv grpc client ->| env -> tidb grpc server -> tidb batch client.
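A sketch of that first step: stamp each entry when it enters the batch client and observe the elapsed time when its response comes back, so the tidb-side portion can be compared against the overall kv request duration and the tikv grpc duration (all names below are hypothetical):

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// batchClientDuration is a placeholder histogram for the time a request
// spends inside the tidb batch client (enqueue -> response decoded).
// Register it with prometheus.MustRegister before exposing /metrics.
var batchClientDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Namespace: "tidb", Subsystem: "tikvclient", Name: "batch_client_seconds",
	Help:    "Duration from entering the batch client to receiving the response.",
	Buckets: prometheus.ExponentialBuckets(0.0005, 2, 20),
})

// trackedEntry is a hypothetical wrapper around a batched request entry.
type trackedEntry struct {
	enqueuedAt time.Time
}

func newTrackedEntry() *trackedEntry {
	return &trackedEntry{enqueuedAt: time.Now()}
}

// onResponse would be called when the batch client hands the response back
// to sendReqToRegion; the elapsed time covers queueing, sending, and waiting.
func (e *trackedEntry) onResponse() {
	batchClientDuration.Observe(time.Since(e.enqueuedAt).Seconds())
}
```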
