
Trace client-side and server-side latency to diagnose RPC latency #11378

Open
sticnarf opened this issue Nov 16, 2021 · 8 comments
Labels
sig/diagnosis SIG: Diagnosis type/enhancement The issue or PR belongs to an enhancement.

Comments

@sticnarf
Contributor

Feature Request

Time spent on RPC has always been the missing part of our metrics. There have been many cases where TiKV reports that its process duration is short while the end user experiences high latency.

For example, TiDB has slow query logs, but it is common for us to fail to find where the latency comes from. Sometimes we attribute the latency to the network or the RPC framework without evidence.

If we record the client-side and server-side latency for each request, the client can easily know the time spent on the network and in the RPC framework. This will help tease out the possible sources of latency and guide us on where to optimize. It is also vital for diagnosing tail latency that shows up in slow query logs but is not discoverable through metrics.

This could be implemented as part of full process tracing, but #8981 seems to have been inactive for months. For this feature, a simpler implementation and protocol is enough.
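
A minimal sketch of the idea from the client's point of view, assuming the server attaches its own processing duration to the response. The field and function names below are hypothetical, not the actual kvproto/TiKV API:

```rust
use std::time::{Duration, Instant};

// Hypothetical response type: the server fills in how long it spent
// processing the request (the field name is illustrative only).
struct RpcResponse {
    server_process_duration: Duration,
}

// Placeholder for the real gRPC call.
fn send_request(_req: &[u8]) -> RpcResponse {
    RpcResponse {
        server_process_duration: Duration::from_millis(3),
    }
}

fn main() {
    let start = Instant::now();
    let resp = send_request(b"get key");
    let total = start.elapsed();

    // Whatever is left after subtracting the server-side processing time
    // was spent on the network and in the RPC framework (queueing,
    // serialization, scheduling, ...).
    let overhead = total.saturating_sub(resp.server_process_duration);
    println!(
        "total={:?}, server={:?}, network+framework={:?}",
        total, resp.server_process_duration, overhead
    );
}
```

With such a field, whatever remains after subtracting the server-reported time can only be network or RPC-framework overhead, which is exactly the part that is currently invisible.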

@sticnarf sticnarf added type/enhancement The issue or PR belongs to an enhancement. sig/diagnosis SIG: Diagnosis labels Nov 16, 2021
@zhongzc
Contributor

zhongzc commented Nov 18, 2021

For this feature, a simpler implementation and protocol is enough.

Agreed. We can achieve this by recording more timestamps in the RPC process.

@sticnarf
Contributor Author

sticnarf commented Nov 19, 2021

I implemented it and tested it on a TiDB cluster deployed on a single machine, and the difference is larger than I expected...

The 99th percentile of the difference can reach tens of milliseconds.

[screenshot: measured difference between client-side and server-side latency]

@cfzjywxk
Collaborator

There are some related discussions in pingcap/tidb#28937.

@BusyJay
Member

BusyJay commented Nov 24, 2021

A similar protocol can also be implemented for Raft replication.

The latency should be split into send latency and receive latency. The sum is an RTT, and monitoring them separately may reveal potential pitfalls.
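
A sketch of how the split could be computed, assuming the protocol records one wall-clock timestamp per direction on each side (the struct and field names are hypothetical):

```rust
use std::time::{Duration, SystemTime};

// Hypothetical wall-clock timestamps recorded along one RPC.
struct Timestamps {
    client_send: SystemTime, // client clock: request handed to the transport
    server_recv: SystemTime, // server clock: request picked up by the server
    server_send: SystemTime, // server clock: response handed back to the transport
    client_recv: SystemTime, // client clock: response received by the client
}

// send + recv is roughly the RTT minus the server processing time, but each
// leg on its own is only as accurate as the clock sync between the machines.
fn split(ts: &Timestamps) -> (Duration, Duration) {
    let send = ts.server_recv.duration_since(ts.client_send).unwrap_or_default();
    let recv = ts.client_recv.duration_since(ts.server_send).unwrap_or_default();
    (send, recv)
}
```

Each leg compares instants taken on two different machines, so its accuracy depends on how well their clocks are synchronized.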

@sticnarf
Contributor Author

The latency should be split into send latency and receive latency. The sum is an RTT, and monitoring them separately may reveal potential pitfalls.

Single-trip latency is more difficult because we would be comparing two instants taken on different machines. Its accuracy may depend on the precision of NTP.

@BusyJay
Member

BusyJay commented Nov 25, 2021

Time is usually synced. If there are gaps, we can detect the difference.

@sticnarf
Contributor Author

Time is usually synced. If there are gaps, we can detect the difference.

I worry that the error may lead us to wrong conclusions.

For example, if the receiver's clock is 5 ms behind the sender's (possible in a multi-region deployment), the send latency will appear shorter than the receive latency even though the actual latency is the same in both directions.
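
A toy calculation of that scenario (all numbers made up): a fixed clock offset skews the per-direction split but cancels out in the round-trip sum.

```rust
fn main() {
    let true_one_way_ms = 10.0;    // assume symmetric 10 ms links
    let receiver_offset_ms = -5.0; // receiver clock runs 5 ms behind the sender

    // Send leg: stamped on the sender's clock, received on the receiver's clock.
    let observed_send = true_one_way_ms + receiver_offset_ms; // looks like 5 ms
    // Receive leg: stamped on the receiver's clock, received on the sender's clock.
    let observed_recv = true_one_way_ms - receiver_offset_ms; // looks like 15 ms

    // The split is skewed, but the round-trip sum is unaffected by the offset.
    assert_eq!(observed_send + observed_recv, 2.0 * true_one_way_ms);
    println!("send={observed_send} ms, recv={observed_recv} ms");
}
```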

@BusyJay
Member

BusyJay commented Nov 25, 2021

It seems clock sync monitoring is supported by node_exporter, which can be used directly: https://github.com/prometheus/node_exporter/blob/master/docs/TIME.md
