Performance Diagnosis Enhancements #34106

Open · 65 tasks
you06 opened this issue Apr 19, 2022 · 5 comments
Labels: type/enhancement (The issue or PR belongs to an enhancement.)

Comments

@you06
Contributor

you06 commented Apr 19, 2022

Enhancement

TiDB users often struggle with performance diagnosis when trying to tune a cluster to fit their workloads. Missing metrics in the system can block the diagnosis and make tuning hard. To solve this problem, more query details and more metrics are required.

Data to be collected

Common Path

The common path includes the processes before a query is executed: connection handling, parsing, and optimization. A sketch of how this breakdown could be recorded follows the list below.

  • Read packet duration
  • Parse duration
  • TSO duration
  • Stmt retry count and duration (the duration of failed stmt executions)
  • Write response duration
  • Fetch wait duration
  • Execution duration = query duration - all of the above
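As a rough sketch (not TiDB's actual session types; all names below are hypothetical), the breakdown could be captured per statement and the execution duration derived by subtraction:

```go
package main

import "time"

// CommonPathDetails is a hypothetical per-statement breakdown of the
// common path; the struct and field names are illustrative only.
type CommonPathDetails struct {
	QueryDuration time.Duration
	ReadPacket    time.Duration
	Parse         time.Duration
	TSOWait       time.Duration
	StmtRetry     time.Duration // total duration of failed stmt executions
	WriteResponse time.Duration
	FetchWait     time.Duration
}

// ExecutionDuration is the query duration minus all other recorded phases,
// matching the formula above.
func (d *CommonPathDetails) ExecutionDuration() time.Duration {
	return d.QueryDuration - d.ReadPacket - d.Parse - d.TSOWait -
		d.StmtRetry - d.WriteResponse - d.FetchWait
}
```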

Read

Read requests are mainly processed in executors, which can be divided into calculation executors and data source executors. TiDB reads data by get, batch-get, scan (iter, only used by internal txns), and coprocessor (unary requests only). When TiDB hits an error during a read, such as a lock error or a network error, it automatically retries; such retry time should be recorded, and all requests except the last successful one should be counted as retry duration (see the sketch below).
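As a sketch of this accounting rule (sendWithRetry and sendOnce are hypothetical stand-ins, not client-go APIs):

```go
package main

import (
	"errors"
	"time"
)

// sendWithRetry illustrates the rule: the elapsed time of every attempt
// except the last, successful one is accumulated as retry duration.
// sendOnce stands in for an actual get/batch-get/scan/coprocessor request.
func sendWithRetry(sendOnce func() error, maxRetry int) (retryDur time.Duration, err error) {
	for i := 0; i < maxRetry; i++ {
		start := time.Now()
		if err = sendOnce(); err == nil {
			// The successful attempt counts as request duration, not retry duration.
			return retryDur, nil
		}
		// Failed attempts (lock error, network error, ...) add to retry duration.
		retryDur += time.Since(start)
	}
	return retryDur, errors.New("retries exhausted")
}
```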

Get

  • Get KV size
  • Get request duration
  • Get request retry duration

BatchGet

  • BatchGet KVs size
  • BatchGet regions duration
  • BatchGet single region duration
  • BatchGet single region retry duration

Coprocessor

  • Copr Response Size
  • Fetch response duration
  • Copr task wait duration
  • Copr task duration
  • Copr task retry duration

Scan

  • Scan Size
  • Scan request duration
  • Scan request retry duration

UnionScan

  • Read mem rows duration

Resolve lock For Read

  • Resolve lock keys count
  • Resolve locks duration
  • Resolve lock wait duration
  • Get txn status duration
  • Get txn status retry duration
  • Resolve pessimistic lock duration

Write

Write requests are buffered in TiDB's memory until the transaction commits. Pessimistic transactions additionally acquire pessimistic locks during execution, and some write operations depend on read results.

Pessimistic lock

  • Lock keys count
  • Lock keys duration
  • Token wait duration per batch
  • Lock single batch duration
  • Lock single batch retry duration

Prewrite

  • Prewrite mutation size
  • Async commit get minCommitTS duration
  • Prewrite mutations duration
  • Token wait duration per batch
  • Prewrite single batch duration
  • Prewrite single batch retry duration

Non Async Commit

  • Commit keys count
  • Get commitTS duration
  • Commit primary key batch duration
  • Commit mutations duration
  • Token wait duration per batch
  • Commit single batch duration
  • Commit single batch retry duration

Async Commit

  • Commit keys count
  • Commit mutations duration
  • Token wait duration per batch
  • Commit single batch duration
  • Commit single batch retry duration

Batch KV Client

Batch System

  • RPC duration
  • Batch size
  • Batch wait duration
  • Batch fetch duration
  • Batch send latency
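A sketch of how the batch-system metrics could be declared with client_golang; the metric names, namespaces, and buckets are placeholders rather than the ones TiDB actually registers:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Placeholder metric definitions for the batch system.
var (
	batchSize = prometheus.NewHistogram(prometheus.HistogramOpts{
		Namespace: "tidb", Subsystem: "tikvclient", Name: "batch_size",
		Help:    "Number of requests packed into one batch.",
		Buckets: prometheus.ExponentialBuckets(1, 2, 10),
	})
	batchWaitDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Namespace: "tidb", Subsystem: "tikvclient", Name: "batch_wait_seconds",
		Help:    "Time a request waits in the batch queue before being sent.",
		Buckets: prometheus.ExponentialBuckets(0.0005, 2, 18),
	})
)

func init() {
	prometheus.MustRegister(batchSize, batchWaitDuration)
}
```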

gRPC

  • gRPC queue length
  • gRPC queue wait duration
  • gRPC reconnect count and duration

Region Cache

  • Region cache miss event
  • Region error count
  • Load region(on cache miss) duration

Data to be verified or enhanced

  • Transaction panel: the transaction stmt number, transaction size, and transaction region size. We need to verify whether these metrics are correct and necessary.
  • The lock resolve ops panel: it is difficult to understand the types and to classify their impact on cluster performance.

Reference

  • The tikv part
  • The internal resource usage and impact insight.
  • The incorrect txn region metric
you06 added the type/enhancement label on Apr 19, 2022
@you06
Contributor Author

you06 commented Apr 21, 2022

Generally, getting a TS can be divided into two types: async and sync.

The async way is only used at txn or stmt start.

The sync way is used in the following cases:

  • Get ts for RC read or forUpdateRead
  • Get ts as minCommitTS or commitTS

So we may just tag the TSO duration with these two types and display them accordingly, as in the sketch below.
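One possible shape for this, as a sketch: a histogram with a type label whose values are "async" and "sync" (the metric name and labels are assumptions, not existing TiDB metrics):

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical TSO wait histogram with a "type" label distinguishing the
// async path (txn/stmt start) from the sync path (RC read, forUpdateRead,
// minCommitTS/commitTS). Register it with prometheus.MustRegister before
// exposing /metrics.
var tsoWaitDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Namespace: "tidb", Subsystem: "pdclient", Name: "tso_wait_seconds",
	Help:    "TSO wait duration, tagged by how the TS is fetched.",
	Buckets: prometheus.ExponentialBuckets(0.0005, 2, 18),
}, []string{"type"})

func observeTSOWait(async bool, d time.Duration) {
	typ := "sync"
	if async {
		typ = "async"
	}
	tsoWaitDuration.WithLabelValues(typ).Observe(d.Seconds())
}
```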

@zyguan
Contributor

zyguan commented Apr 26, 2022

We may also need to add some metrics for tracking memory usage. Currently, it's hard to tell users why the server gets OOM-killed.
[Screenshots: 2022-04-26_123558, 2022-04-26_151830]
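For basic visibility, process-level heap usage could be sampled from the Go runtime and exported as gauges, as a sketch (metric names are placeholders; TiDB's per-session memory trackers would still be needed to explain which queries consumed the memory):

```go
package main

import (
	"runtime"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// heapInuse is a placeholder gauge; register it with prometheus.MustRegister
// before exposing /metrics.
var heapInuse = prometheus.NewGauge(prometheus.GaugeOpts{
	Namespace: "tidb", Subsystem: "server", Name: "heap_inuse_bytes",
	Help: "Heap bytes currently in use, sampled from runtime.MemStats.",
})

// sampleMemStats periodically reads runtime memory stats into the gauge.
func sampleMemStats(interval time.Duration) {
	var ms runtime.MemStats
	for range time.Tick(interval) {
		runtime.ReadMemStats(&ms)
		heapInuse.Set(float64(ms.HeapInuse))
	}
}
```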

@cfzjywxk
Contributor

Lock contention may have a significant impact on cluster performance and query latency. Currently, there's no convenient way to do the diagnosis work. Usually, performance insight is requested to answer:

  • What is the quantified impact on cluster performance, such as throughput and query latency?
  • What are the specific causes of the contention, or which queries or transaction usages lead to it?

As lock-view is not integrated into the dashboard yet and does not support checking historical data, we still need to enhance the existing diagnosis means such as the slow log and monitoring metrics.

@you06
Contributor Author

you06 commented Apr 28, 2022

Since #33963 introduces the request source for metrics, we may attach this information to more metrics, including memory tracking, acquired locks, etc. (a sketch follows).
The causes of contention are hard to discover from aggregated metrics alone.
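A sketch of how the request source could be propagated and attached as a metric label; the context helpers and the label/metric names below are hypothetical, not the actual #33963 implementation:

```go
package main

import (
	"context"

	"github.com/prometheus/client_golang/prometheus"
)

type requestSourceKey struct{}

// WithRequestSource / requestSourceFromCtx are hypothetical helpers for
// carrying a request-source tag through the call chain.
func WithRequestSource(ctx context.Context, source string) context.Context {
	return context.WithValue(ctx, requestSourceKey{}, source)
}

func requestSourceFromCtx(ctx context.Context) string {
	if s, ok := ctx.Value(requestSourceKey{}).(string); ok {
		return s
	}
	return "unknown"
}

// Example: a lock-keys counter broken down by request source. Register it
// with prometheus.MustRegister before exposing /metrics.
var lockKeysCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
	Namespace: "tidb", Subsystem: "tikvclient", Name: "lock_keys_total",
	Help: "Number of keys locked, labeled by request source.",
}, []string{"source"})

func recordLockKeys(ctx context.Context, n int) {
	lockKeysCounter.WithLabelValues(requestSourceFromCtx(ctx)).Add(float64(n))
}
```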

@cfzjywxk
Contributor

@zyguan @you06
The kv request duration currently represents the whole path:

sendReqToRegion -> tidb batch client -> tidb grpc client -> env -> tikv grpc server -> |tikv grpc process -> tikv grpc client ->| env -> tidb grpc server -> tidb batch client.

The tikv internal part is recorded as tikv grpc duration; the other parts are still missing and difficult to diagnose. For example, we may see slow queries with just one point-get executor, yet find no corresponding slowness in the tikv grpc duration. We need a way to verify whether the slowness comes from the tidb side or from the environment:

  1. From the kv client side, we could record request-related durations in the tidb batch client.
  2. For the tidb grpc part, there does not seem to be much we can record unless we hack into grpc; the same goes for the tikv grpc-rs wrapper.

So maybe we could start by improving the first step (a rough sketch follows the path below), and then verify more parts:
sendReqToRegion -> tidb batch client -> tidb grpc client -> env -> tikv grpc server -> |tikv grpc process -> tikv grpc client ->| env -> tidb grpc server -> tidb batch client.
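A sketch of that first step: stamp each entry when it enters the batch client and observe the elapsed time when its response comes back, so the tidb-side portion can be compared against the overall kv request duration and the tikv grpc duration (all names below are hypothetical):

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// batchClientDuration is a placeholder histogram for the time a request
// spends inside the tidb batch client (enqueue -> response decoded).
// Register it with prometheus.MustRegister before exposing /metrics.
var batchClientDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Namespace: "tidb", Subsystem: "tikvclient", Name: "batch_client_seconds",
	Help:    "Duration from entering the batch client to receiving the response.",
	Buckets: prometheus.ExponentialBuckets(0.0005, 2, 20),
})

// trackedEntry is a hypothetical wrapper around a batched request entry.
type trackedEntry struct {
	enqueuedAt time.Time
}

func newTrackedEntry() *trackedEntry {
	return &trackedEntry{enqueuedAt: time.Now()}
}

// onResponse would be called when the batch client hands the response back
// to sendReqToRegion; the elapsed time covers queueing, sending, and waiting.
func (e *trackedEntry) onResponse() {
	batchClientDuration.Observe(time.Since(e.enqueuedAt).Seconds())
}
```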
