Optimize the performance of GetTS #1847
@nolouch
On a different terminal:
I tried with two different Go versions, i.e. 1.13.3 and 1.12.1. Can you specify the RAM and number of CPUs used for the setup? I will try to reproduce it in a container with resource limits.
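(For reference, resource limits for such a container can be set roughly like this; the image and limit values below are only placeholders, not the actual setup:)

```sh
# Placeholder values; use the image and limits that match the original setup.
docker run --cpus=4 --memory=8g -it golang:1.13 bash
```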
Hi @mgkanani, thank you for your attention.
@nolouch, I tried a few options. The issue stopped occurring when the automatic GC is turned off and a collection is invoked only once per second. The reasoning behind this:
Let's assume the steps below:
So it requires two GC() invocations to clear unused pool memory. Changes I made to disable the GC and trigger it only once per second:
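(The actual diff is not preserved in this thread. Below is a minimal sketch of the change being described, assuming debug.SetGCPercent(-1) to disable the automatic collector and a one-second ticker driving runtime.GC. Note that since Go 1.13, sync.Pool keeps evicted objects in a victim cache, so pooled memory is only freed on the second GC after its last use, which matches the two-invocation observation above.)

```go
// Sketch only: disable automatic GC and trigger a collection once per second,
// so sync.Pool contents survive long enough to be reused between requests.
package main

import (
	"runtime"
	"runtime/debug"
	"time"
)

func main() {
	// Disable the pacer-driven garbage collector entirely.
	debug.SetGCPercent(-1)

	// Trigger a collection explicitly once per second instead.
	go func() {
		ticker := time.NewTicker(time.Second)
		defer ticker.Stop()
		for range ticker.C {
			runtime.GC()
		}
	}()

	select {} // stand-in for the real server loop
}
```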
Command to run:
PS:
Amazing! @mgkanani, that's awesome work and your point is fantastic. PTAL @disksing @tiancaiamao
The numbers in the last few lines of those details expose the problem. The main problem we want to solve: when
It's very easy to reproduce:
Do you have any ideas?
@nolouch @tiancaiamao, in my changes, I have chosen the default array size based on metrics data. During my investigation, I found that the gRPC server that handles TSO requests never reported latency concerns (partially due to batching on the client side before requesting from the pd-server).
Based on this data, we can scale/shard the pd-cluster and the tidb-cluster, whichever turns out to be the bottleneck.
Thanks @mgkanani.
TL;DR
As per the current implementation:
Long-running routines:
Bottlenecks:
Proposal: tsLoop, being a consumer, should never wait for a lock as long as data is available.
I am thinking of trying the above approach and comparing it with channels via benchmarks (a rough sketch of the channel-based variant is at the end of this comment). Tradeoffs:
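(To make the comparison concrete, here is a rough sketch of the channel-based batching consumer being proposed; the tsoRequest type and all names here are hypothetical and do not match the actual PD client code:)

```go
// Illustrative sketch of a batching consumer; all names are hypothetical.
package main

import "time"

type tsoRequest struct {
	done chan error
}

// tsLoop blocks only when the queue is empty; as long as requests are
// available, it drains them into one batch without waiting on a lock.
func tsLoop(requests <-chan *tsoRequest) {
	for {
		first := <-requests // block until at least one request arrives
		batch := []*tsoRequest{first}
	drain:
		for {
			select {
			case req := <-requests:
				batch = append(batch, req)
			default:
				break drain // queue empty, stop draining
			}
		}
		// One RPC serves the whole batch (stubbed out here).
		err := requestTSOBatch(len(batch))
		for _, req := range batch {
			req.done <- err
		}
	}
}

// requestTSOBatch stands in for the real gRPC Tso call.
func requestTSOBatch(n int) error {
	time.Sleep(time.Millisecond) // simulate one RPC round trip
	return nil
}

func main() {
	reqs := make(chan *tsoRequest, 1024)
	go tsLoop(reqs)
	r := &tsoRequest{done: make(chan error, 1)}
	reqs <- r
	<-r.done
}
```

The key property is that the consumer never sleeps or takes a lock while requests are queued: the non-blocking default case in the select lets it sweep everything already available into a single batch.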
A kind reminder: the benchmark program may not reflect the real case. In the benchmark, the batch size can be quite large, and producers call
And another factor is the goroutine count on the scheduler.
I agree with your remarks. With sysbench, I didn't see data showing that Go follows cooperative and fair scheduling principles. Even if we were able to schedule tsLoop whenever we desired, it would not solve the latency problem. It is better to create a separate issue in the TiDB repo and close this one. In TiDB, we should look at reducing unnecessary CPU-intensive goroutines.
Hi @mgkanani |
Description
Obtaining the TSO is a key step in a transaction, so improving the performance of GetTS can improve the performance of the entire cluster. One of the more obvious problems is that when the client is busy, getting the TSO has a significant long-tail problem. We have a benchmark client bench-tso in the PD project, and you can run it with 10000 concurrency. Example output:
Requests greater than 30ms occur periodically.
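(A hedged example of how the benchmark might be invoked; the tool path and flag names below are assumptions, so check the tool's help output in your PD checkout:)

```sh
# Path and flag names are assumptions; verify with the tool's -h output.
go run ./tools/pd-tso-bench -pd 127.0.0.1:2379 -concurrency 10000
```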
Goals:
Difficulty
Score
Mentor(s)
Recommended Skills