Enhancement for tidb query diagnose #28937
Related: tikv/tikv#8942

TL;DR: We are still working on delivering the Top SQL feature, which solves some more urgent diagnostics issues. However, there is a very detailed implementation instruction in tikv/tikv#8942 (comment), and I would appreciate it if someone interested could help bring it into our code base :)

Note: after tikv/tikv#8942 is resolved, we still have more work to do on single-query diagnostics. From my current understanding, a complete solution should at least cover the following information:
@breeswish:
@cfzjywxk Thanks for the recap! Maybe we can separate what you describe into two things that we can work on independently:

a) The troubleshooting of a single SQL execution is not well supported. Currently this missing information can only be inferred from the metrics, which are aggregations that may not be helpful for a single execution. This is planned to be improved by @SunRunAway in the next several releases.

b) Components other than TiDB do not process the SQL request in the same way as TiDB does.

I guess there is no architectural difficulty for (b), as some behavior has already been successfully kept identical. For example, the handling of different SQL modes is well implemented on both the TiKV side and the TiDB side: they follow the same behavior for whatever SQL mode the user sets. This indicates that we can do it right. However, I admit that for now we need to implement these behaviors one by one, which is not a good way.
Feature Request
Is your feature request related to a problem? Please describe:
In the architecture of TiDB, tidb-server manages the user connections and processes incoming queries; these queries are converted into different kv requests and sent to tikv-server through the tidb batch client and gRPC components. The problem is that after turning a query into kv requests, much of the execution context and query context is lost, which makes it difficult to diagnose "query-dimension" issues such as slow queries. For example, we often see slow queries like:
```
4f12266030e202b41c1f0531d03ba799a458f11a88635a3f424506a12e3ee543,SELECT `event_time` FROM `followers` WHERE `uid` = 59322005 AND `target_uid` = 75161335 LIMIT 1;,10.160.32.142:10080,blued,32214282,1,1632294495.143163,1.004865,0.000028,0.000029,0.000053,0.000000,0.000091,0.000406,0.000000,0.000000,0.000000,0.000000,0,0,427896207863981621,,"
id           task  estRows  operator info                                    actRows  execution info                                    memory  disk
Point_Get_1  root  1        table:followers, index:PRIMARY(uid, target_uid)  0        time:1s, loops:1, Get:{num_rpc:1, total_time:1s}  N/A     N/A
",0,,,,blued,10.10.96.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0,0,0,0,0,0,0,,,0,0,0,0,0
```
In the slow log above, we can only tell that the kv Get RPC is slow (1s); there is no useful information about the root cause or how to solve it. Usually we have to check the Grafana metrics for more information, but that is the wrong way to diagnose: troubleshooting query-dimension issues with server-dimension information is inefficient and inappropriate.
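To make the limitation concrete, here is a minimal Go sketch (the name `sendGetRPC` is hypothetical, standing in for the real batch-client/gRPC path) of what tidb-server can observe today: the caller only sees the total RPC latency, so a 1s Get stays a black box.

```go
package main

import (
	"fmt"
	"time"
)

// sendGetRPC stands in for the real batch-client/gRPC call; from the
// caller's point of view it is an opaque box.
func sendGetRPC(key []byte) ([]byte, error) {
	// Where does this second go? Scheduling? Raft? Disk? The caller cannot tell.
	time.Sleep(time.Second)
	return []byte("value"), nil
}

func main() {
	start := time.Now()
	if _, err := sendGetRPC([]byte("followers:59322005:75161335")); err != nil {
		panic(err)
	}
	// This one opaque duration is essentially all the slow log can report today.
	fmt.Printf("Get:{num_rpc:1, total_time:%s}\n", time.Since(start).Round(time.Second))
}
```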
Describe the feature you'd like:
Pass the needed query context and execution context down to tikv-server, and record the duration of each stage there. Then we can see exactly what is slowing down a specific query, without needing to check the aggregated data in Grafana. This is the key point for enhancing the diagnosability of TiDB. A rough sketch of the idea follows.
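As a rough illustration only, here is a minimal Go sketch. All of the types, fields, and stage names (`QueryContext`, `StageTiming`, `KvResponse`, `handleGet`, "sched_wait", etc.) are hypothetical assumptions for this sketch, not actual kvproto or TiKV definitions. The flow it shows: query-level identity travels down with the kv request, the server records a duration per stage, and the breakdown travels back up so tidb-server can print it into the slow log of that one statement.

```go
package main

import (
	"fmt"
	"time"
)

// QueryContext is the query-level context the issue proposes to propagate:
// enough identity to tie a kv request back to one SQL execution.
type QueryContext struct {
	ConnID     uint64
	StmtDigest string // e.g. "4f12266030e2..."
	StartTS    uint64
}

// StageTiming is a hypothetical per-stage breakdown that tikv-server could
// fill in and return with each kv response.
type StageTiming struct {
	Stage    string
	Duration time.Duration
}

// KvResponse carries the recorded stages back to tidb-server.
type KvResponse struct {
	Value  []byte
	Stages []StageTiming
}

// handleGet sketches the server side: each stage records its own duration
// instead of only contributing to an aggregated server-level metric.
func handleGet(qc QueryContext, key []byte) KvResponse {
	var stages []StageTiming
	record := func(stage string, fn func()) {
		start := time.Now()
		fn()
		stages = append(stages, StageTiming{stage, time.Since(start)})
	}
	// Simulated stages; in reality these would wrap scheduler wait,
	// snapshot acquisition, storage reads, and so on.
	record("sched_wait", func() { time.Sleep(5 * time.Millisecond) })
	record("snapshot", func() { time.Sleep(2 * time.Millisecond) })
	record("storage_get", func() { time.Sleep(990 * time.Millisecond) })
	return KvResponse{Value: []byte("v"), Stages: stages}
}

func main() {
	qc := QueryContext{ConnID: 32214282, StmtDigest: "4f12266030e2...", StartTS: 427896207863981621}
	resp := handleGet(qc, []byte("followers:59322005:75161335"))
	// tidb-server can now emit a per-stage breakdown for exactly this
	// statement; no Grafana aggregation needed.
	for _, s := range resp.Stages {
		fmt.Printf("%s: %s\n", s.Stage, s.Duration.Round(time.Millisecond))
	}
}
```

In practice the context and timings would live in the gRPC request/response protos rather than in-process structs, but the flow is the same: attach identity on the way down, attach per-stage durations on the way up.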
Describe alternatives you've considered:
Teachability, Documentation, Adoption, Migration Strategy: