ticlient: Add keep alive #7099

Merged: 3 commits, Jul 19, 2018

Member

@breezewish breezewish commented Jul 18, 2018

What have you changed? (mandatory)

This PR adds keep alive settings for ticlient, using the same configuration as TiKV's (time = 10s, timeout = 3s).

By enabling keep alive, we keep firewalls from silently dropping our inactive connections, which would cause SQL queries to fail.

Since the client hit this issue on our release-2.0 branch, this fix targets release-2.0 instead of master. It will be cherry-picked to master later.

What is the type of the changes? (mandatory)

  • Bug fix (non-breaking change which fixes an issue)

How has this PR been tested? (mandatory)

To test whether this fix is effective, we first need to reproduce the issue in our own environment.

Since I don't have such a firewall, I simulated one with the following script:

echo "*filter" > rule
echo ":INPUT ACCEPT [549:776388]" >> rule
echo ":FORWARD ACCEPT [0:0]" >> rule
echo ":OUTPUT ACCEPT [596:577866]" >> rule
netstat -anp | grep tidb | grep ESTABLISHED | grep tcp | grep 20160 | awk '{ print $4 }' | awk -F':' '{ print $2 }' | sort | uniq | awk '{ print "-A INPUT -p tcp --dport " $1 " -j DROP"; print "-A OUTPUT -p tcp --sport " $1 " -j DROP" }' >> rule
echo "COMMIT" >> rule
cat rule | tee /etc/sysconfig/iptables
service iptables restart

This script collects the local source ports of all established connections between the current host's TiDB and the TiKV nodes, and adds an iptables DROP rule for each of them. After it runs, all further packets on these ports (connections) are dropped, just as the firewall would do.

For the current TiDB master as well as release-2.0 branch

Once the drop rules took effect, TiDB kept using the existing gRPC connections to send requests, which never received a response, so all queries from this TiDB took a very long time and its QPS dropped to 0:

image

image

Sysbench failed:

image

According to netstat, these dead connections lingered for more than 15 minutes, counted from when we started another sysbench run after dropping them. After that, they were destroyed and new connections were established, so everything went back to normal.

For this fixed version (Test 1)

I started sysbench immediately after these connections were dropped by iptables:

image

image

We can see that QPS was affected initially (note that we deployed multiple TiDBs and only one was affected). It recovered after about 30 seconds, which is far better than the previous 15-minute recovery. Sysbench did not fail either.

For this fixed version (Test 2)

I started sysbench 1 minute after these connections were dropped by iptables:

image

We can see that QPS was not affected at all.

Does this PR affect documentation (docs/docs-cn) update? (mandatory)

No.

Does this PR affect tidb-ansible update? (mandatory)

pingcap/tidb-ansible#469

Does this PR need to be added to the release notes? (mandatory)

No.

Refer to a related PR or issue link (optional)

Benchmark result if necessary (optional)

Add a few positive/negative examples (optional)

Contributor

@zhexuany zhexuany left a comment

LGTM

@@ -389,6 +389,8 @@ func setGlobalVars() {
	if cfg.TiKVClient.GrpcConnectionCount > 0 {
		tikv.MaxConnectionCount = cfg.TiKVClient.GrpcConnectionCount
	}
	tikv.GrpcKeepAliveTime = time.Duration(cfg.TiKVClient.GrpcKeepAliveTime) * time.Second
Member

I think we should check whether the configuration is valid. For example, the configured time duration should be greater than zero.

Member Author

It seems that the other configurations are not checked either, except for the one above (GrpcConnectionCount). I think it would be better to leave this to another PR.

Member

OK, could you file a GitHub issue about this?

Member Author

@zz-jason Yes! I just created one: #7103

@ngaut
Member

ngaut commented Jul 19, 2018

Well done.

@coocood
Member

coocood commented Jul 19, 2018

/run-all-tests tidb-test=release-2.0 tikv=release-2.0 pd=release-2.0

@coocood
Member

coocood commented Jul 19, 2018

LGTM

@breezewish
Member Author

/run-all-tests tidb-test=release-2.0 tikv=release-2.0 pd=release-2.0

@zhexuany
Contributor

/run-all-tests tidb-test=release-2.0 tikv=release-2.0 pd=release-2.0

@breezewish
Member Author

/run-integration-common-test tidb-test=release-2.0 tikv=release-2.0 pd=release-2.0

@breezewish
Member Author

/run-integration-common-test tidb-test=release-2.0 tikv=release-2.0 pd=release-2.0

@shenli shenli added the status/all tests passed and status/LGT2 labels Jul 19, 2018
Contributor

@zhexuany zhexuany left a comment

LGTM

@coocood coocood merged commit 5c61f4c into pingcap:release-2.0 Jul 19, 2018
@breezewish breezewish deleted the wenxuan/keepalive_2.0 branch July 19, 2018 15:13
Labels: component/coprocessor, status/LGT2, type/bugfix
6 participants