Trace client-side and server-side latency to diagnose RPC latency #11378
Comments
Agreed. We can achieve this by recording more timestamps in the RPC process.
There are some related discussions in pingcap/tidb#28937.
Similar protocols can also be implemented for Raft replication. The latency should be split into send latency and receive latency: their sum is an RTT, and monitoring them separately may reveal potential pitfalls.
Single-trip latency is more difficult because we would be comparing two instants on different machines. It may depend on the precision of NTP.
Clocks are usually synchronized. If there are gaps, we can detect the difference.
I worry the error may lead us to wrong conclusions. For example, if the receiver's clock is 5 ms slower than the sender's (possible in a multi-region deployment), the send latency will appear shorter than the receive latency even when the actual latencies are the same.
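To make that concern concrete, here is a minimal sketch (plain Rust with made-up numbers, not TiKV code) showing that a clock offset biases each one-way estimate but cancels in their sum:

```rust
// Hypothetical timestamps; the server clock runs `offset_ms` relative to the client.
fn main() {
    let offset_ms: i64 = -5; // server clock is 5 ms slower than the client clock

    // True one-way network latencies and server work (unknown in practice).
    let true_send_ms: i64 = 10;
    let true_recv_ms: i64 = 10;
    let server_process_ms: i64 = 2;

    // Timestamps as each side would record them with its own clock.
    let client_send = 0i64;                                    // client clock
    let server_recv = client_send + true_send_ms + offset_ms;  // server clock
    let server_send = server_recv + server_process_ms;         // server clock
    let client_recv = client_send + true_send_ms + server_process_ms + true_recv_ms; // client clock

    // Naive one-way estimates are biased by the clock offset...
    let est_send = server_recv - client_send; // 5 ms, looks shorter than the true 10 ms
    let est_recv = client_recv - server_send; // 15 ms, looks longer than the true 10 ms
    // ...but the offset cancels in their sum, which equals RTT minus server processing.
    let rtt_minus_server = est_send + est_recv; // 20 ms

    println!("estimated send = {est_send} ms");
    println!("estimated recv = {est_recv} ms");
    println!("send + recv    = {rtt_minus_server} ms (offset-free)");
}
```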
It seems time-sync monitoring is supported by node_exporter, which can be used directly: https://github.com/prometheus/node_exporter/blob/master/docs/TIME.md
Feature Request
Time spent on RPC has always been the missing part of our metrics. There have been many cases where TiKV reports a short process duration while the end user experiences high latency.
For example, TiDB has slow query logs, and it is common for us to fail to find where the latency comes from. Sometimes we blame the latency on the network and the RPC framework, without evidence.
If we record the client-side and server-side latency for each request, the client can easily know the time spent on the network and in the RPC framework. This will help narrow down the possible sources of latency and guide us where to optimize. It is also vital for diagnosing tail latency that shows up in slow query logs but is not discoverable through metrics.
This can be implemented as part of full-process tracing, but #8981 seems to have been inactive for months. For this feature, a simpler implementation and protocol are enough.
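As a rough illustration of such a simpler protocol, here is a minimal sketch (hypothetical `Response` struct and `server_process_ms` field, not the actual TiKV/kvproto messages) of how a client could derive network-plus-framework overhead from a server-reported process duration, with no clock synchronization needed because each side only reads its own clock:

```rust
use std::time::Instant;

// Hypothetical response type: the server attaches the duration it spent handling
// the request, measured entirely on the server's own clock.
struct Response {
    server_process_ms: u64,
}

fn send_request() -> Response {
    // Stand-in for a real RPC call; pretend the server reported 3 ms of work.
    Response { server_process_ms: 3 }
}

fn main() {
    let start = Instant::now();
    let resp = send_request();
    let client_total_ms = start.elapsed().as_millis() as u64;

    // Whatever the server did not account for: network, RPC framework, queueing.
    let overhead_ms = client_total_ms.saturating_sub(resp.server_process_ms);
    println!(
        "client total = {client_total_ms} ms, server process = {} ms, network/framework = {overhead_ms} ms",
        resp.server_process_ms
    );
}
```

Since both durations are intervals on a single clock, this difference is immune to clock skew; splitting it further into send and receive halves is where the NTP concerns discussed above come in.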