ticlient: Add keep alive #7099
Conversation
LGTM
@@ -389,6 +389,8 @@ func setGlobalVars() {
 	if cfg.TiKVClient.GrpcConnectionCount > 0 {
 		tikv.MaxConnectionCount = cfg.TiKVClient.GrpcConnectionCount
 	}
+	tikv.GrpcKeepAliveTime = time.Duration(cfg.TiKVClient.GrpcKeepAliveTime) * time.Second
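For reference, the corresponding options in TiDB's config.toml would look roughly like this (the key names are assumed from the `TiKVClient` config section and should be checked against the config file in this PR; the values match the defaults described below, in seconds):

```toml
[tikv-client]
# Max gRPC connections per TiKV instance (configurable before this PR).
grpc-connection-count = 16
# After this interval of inactivity, ping the server to keep the connection alive.
grpc-keepalive-time = 10
# If the keep-alive ping gets no response within this timeout, close the connection.
grpc-keepalive-timeout = 3
```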
I think we should check whether the configuration is valid. For example, the configured time duration should be greater than zero.
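A minimal sketch of the kind of validation suggested here, assuming a hypothetical mirror of the `TiKVClient` config section (field names are illustrative, not the actual TiDB struct):

```go
package main

import (
	"fmt"
	"time"
)

// TiKVClientConfig is a hypothetical stand-in for the [tikv-client]
// config section; field names are assumed for illustration.
type TiKVClientConfig struct {
	GrpcConnectionCount  uint
	GrpcKeepAliveTime    uint // seconds
	GrpcKeepAliveTimeout uint // seconds
}

// Validate rejects non-positive keep-alive durations, as suggested in the review.
func (c TiKVClientConfig) Validate() error {
	if c.GrpcKeepAliveTime == 0 {
		return fmt.Errorf("tikv-client: grpc-keepalive-time must be greater than 0")
	}
	if c.GrpcKeepAliveTimeout == 0 {
		return fmt.Errorf("tikv-client: grpc-keepalive-timeout must be greater than 0")
	}
	return nil
}

func main() {
	cfg := TiKVClientConfig{GrpcConnectionCount: 16, GrpcKeepAliveTime: 10, GrpcKeepAliveTimeout: 3}
	if err := cfg.Validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	// Convert the validated seconds into a time.Duration, as setGlobalVars does.
	fmt.Println("keep-alive time:", time.Duration(cfg.GrpcKeepAliveTime)*time.Second)
}
```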
It seems that the other configurations are not checked either; the only one checked is GrpcConnectionCount above. I think it would be better to leave this to another PR.
OK, could you file a GitHub issue about this?
Well done.
/run-all-tests tidb-test=release-2.0 tikv=release-2.0 pd=release-2.0
LGTM
/run-all-tests tidb-test=release-2.0 tikv=release-2.0 pd=release-2.0
1 similar comment
/run-all-tests tidb-test=release-2.0 tikv=release-2.0 pd=release-2.0
/run-integration-common-test tidb-test=release-2.0 tikv=release-2.0 pd=release-2.0
1 similar comment
/run-integration-common-test tidb-test=release-2.0 tikv=release-2.0 pd=release-2.0
LGTM
What have you changed? (mandatory)
This PR adds keep-alive settings for ticlient, using the same configuration as TiKV's (time = 10s, timeout = 3s).
With keep-alive enabled, we can prevent firewalls from dropping our idle connections, which would otherwise cause SQL queries to fail.
Since the issue was hit on our release-2.0 branch, this fix is proposed against release-2.0 instead of master. It will be cherry-picked to master later.
What is the type of the changes? (mandatory)
How has this PR been tested? (mandatory)
To test whether this fix is effective, we first need to reproduce the issue in our own environment.
Since I don't have such a firewall, I tried to simulate one with the following script:
The script captures the source ports of all established connections between this host's TiDB and the TiKV instances, then adds an iptables rule (drop packet) for each of them. From then on, all packets on these ports (connections) are dropped, just like the firewall would do.
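The original script is not included above; the following is a hypothetical sketch of the idea (TiKV's port 20160 and the canned netstat sample are assumptions). For demonstration it parses fixed `netstat -tn` output and only prints the iptables commands, since actually adding the rules requires root:

```shell
#!/bin/sh
# Canned `netstat -tn` output (assumed): two connections to a TiKV on
# port 20160 and one unrelated client connection.
netstat_output='tcp 0 0 10.0.0.1:40358 10.0.0.2:20160 ESTABLISHED
tcp 0 0 10.0.0.1:40360 10.0.0.2:20160 ESTABLISHED
tcp 0 0 10.0.0.1:52210 10.0.0.3:4000 ESTABLISHED'

# Extract the local source port of every ESTABLISHED connection whose peer
# is a TiKV (port 20160), then emit one DROP rule per port. In a real run,
# replace `echo "$netstat_output"` with `netstat -tn` and execute the
# printed iptables commands as root.
echo "$netstat_output" |
  awk '$6 == "ESTABLISHED" && $5 ~ /:20160$/ { split($4, a, ":"); print a[2] }' |
  sort -u |
  while read -r port; do
    echo "iptables -A OUTPUT -p tcp --sport $port -j DROP"
  done
```

Once such rules are installed, outgoing packets on those source ports are silently discarded, mimicking a firewall that drops idle connections.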
For the current TiDB master as well as release-2.0 branch
For TiDB, after the drop rules took effect, the existing gRPC connections were still used to send requests (which would never receive a response), so all queries from this TiDB took a very long time and its QPS dropped to 0:
Sysbench failed:
According to netstat, these dead connections were kept for more than 15 minutes after we started another sysbench run following the drop. Only then were they destroyed and new connections established, after which everything went back to normal.
For this fixed version (Test 1)
I started a sysbench run immediately after these connections were dropped by iptables:
We can see that QPS was affected initially (note that we deployed multiple TiDBs and only one was affected). It recovered after about 30 seconds, which is far better than the 15-minute recovery before. Sysbench did not fail either.
For this fixed version (Test 2)
I started a sysbench run 1 minute after these connections were dropped by iptables:
We can see that QPS was not affected at all.
Does this PR affect documentation (docs/docs-cn) update? (mandatory)
No.
Does this PR affect tidb-ansible update? (mandatory)
pingcap/tidb-ansible#469
Does this PR need to be added to the release notes? (mandatory)
No.
Refer to a related PR or issue link (optional)
Benchmark result if necessary (optional)
Add a few positive/negative examples (optional)