ticlient: Add keep alive #7099
Conversation
LGTM
@@ -389,6 +389,8 @@ func setGlobalVars() {
 	if cfg.TiKVClient.GrpcConnectionCount > 0 {
 		tikv.MaxConnectionCount = cfg.TiKVClient.GrpcConnectionCount
 	}
+	tikv.GrpcKeepAliveTime = time.Duration(cfg.TiKVClient.GrpcKeepAliveTime) * time.Second
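For reference, the corresponding options in TiDB's config.toml would look roughly like this (the key names are assumed from the `TiKVClient` config section and should be checked against the config file in this PR; the values match the defaults described below, in seconds):

```toml
[tikv-client]
# Max gRPC connections per TiKV instance (configurable before this PR).
grpc-connection-count = 16
# After this interval of inactivity, ping the server to keep the connection alive.
grpc-keepalive-time = 10
# If the keep-alive ping gets no response within this timeout, close the connection.
grpc-keepalive-timeout = 3
```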
I think we should check whether the configuration is valid. For example, the configured time duration should be greater than zero.
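A minimal sketch of the kind of validation suggested here, assuming a hypothetical mirror of the `TiKVClient` config section (field names are illustrative, not the actual TiDB struct):

```go
package main

import (
	"fmt"
	"time"
)

// TiKVClientConfig is a hypothetical stand-in for the [tikv-client]
// config section; field names are assumed for illustration.
type TiKVClientConfig struct {
	GrpcConnectionCount  uint
	GrpcKeepAliveTime    uint // seconds
	GrpcKeepAliveTimeout uint // seconds
}

// Validate rejects non-positive keep-alive durations, as suggested in the review.
func (c TiKVClientConfig) Validate() error {
	if c.GrpcKeepAliveTime == 0 {
		return fmt.Errorf("tikv-client: grpc-keepalive-time must be greater than 0")
	}
	if c.GrpcKeepAliveTimeout == 0 {
		return fmt.Errorf("tikv-client: grpc-keepalive-timeout must be greater than 0")
	}
	return nil
}

func main() {
	cfg := TiKVClientConfig{GrpcConnectionCount: 16, GrpcKeepAliveTime: 10, GrpcKeepAliveTimeout: 3}
	if err := cfg.Validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	// Convert the validated seconds into a time.Duration, as setGlobalVars does.
	fmt.Println("keep-alive time:", time.Duration(cfg.GrpcKeepAliveTime)*time.Second)
}
```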
It seems that the other configurations are not checked either; the only one checked is GrpcConnectionCount above. I think it would be better to leave this to another PR.
OK, could you file a GitHub issue about this?
Well done.
/run-all-tests tidb-test=release-2.0 tikv=release-2.0 pd=release-2.0
LGTM
/run-all-tests tidb-test=release-2.0 tikv=release-2.0 pd=release-2.0
1 similar comment
/run-all-tests tidb-test=release-2.0 tikv=release-2.0 pd=release-2.0
/run-integration-common-test tidb-test=release-2.0 tikv=release-2.0 pd=release-2.0
1 similar comment
/run-integration-common-test tidb-test=release-2.0 tikv=release-2.0 pd=release-2.0
LGTM
What have you changed? (mandatory)
This PR adds keep-alive settings for ticlient, using the same configuration as TiKV's (time = 10s, timeout = 3s).
With keep-alive enabled, we can prevent firewalls from dropping our idle connections, which would otherwise cause SQL queries to fail.
Since the issue was hit on our release-2.0 branch, this fix is proposed against release-2.0 instead of master. It will be cherry-picked to master later.
What is the type of the changes? (mandatory)
How has this PR been tested? (mandatory)
To test whether this fix is effective, we first need to reproduce the issue in our own environment.
Since I don't have such a firewall, I tried to simulate one with the following script:
The script captures the source ports of all established connections between this host's TiDB and the TiKV instances, then adds an iptables rule (drop packet) for each of them. From then on, all packets on these ports (connections) are dropped, just like the firewall would do.
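The original script is not included above; the following is a hypothetical sketch of the idea (TiKV's port 20160 and the canned netstat sample are assumptions). For demonstration it parses fixed `netstat -tn` output and only prints the iptables commands, since actually adding the rules requires root:

```shell
#!/bin/sh
# Canned `netstat -tn` output (assumed): two connections to a TiKV on
# port 20160 and one unrelated client connection.
netstat_output='tcp 0 0 10.0.0.1:40358 10.0.0.2:20160 ESTABLISHED
tcp 0 0 10.0.0.1:40360 10.0.0.2:20160 ESTABLISHED
tcp 0 0 10.0.0.1:52210 10.0.0.3:4000 ESTABLISHED'

# Extract the local source port of every ESTABLISHED connection whose peer
# is a TiKV (port 20160), then emit one DROP rule per port. In a real run,
# replace `echo "$netstat_output"` with `netstat -tn` and execute the
# printed iptables commands as root.
echo "$netstat_output" |
  awk '$6 == "ESTABLISHED" && $5 ~ /:20160$/ { split($4, a, ":"); print a[2] }' |
  sort -u |
  while read -r port; do
    echo "iptables -A OUTPUT -p tcp --sport $port -j DROP"
  done
```

Once such rules are installed, outgoing packets on those source ports are silently discarded, mimicking a firewall that drops idle connections.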
For the current TiDB master as well as release-2.0 branch
For TiDB, after the drop rules took effect, the existing gRPC connections were still used to send requests (which would never receive a response), so all queries from this TiDB took a very long time and its QPS dropped to 0:
Sysbench failed:
According to netstat, these dead connections were kept for more than 15 minutes after we started another sysbench run following the drop. Only then were they destroyed and new connections established, after which everything went back to normal.
For this fixed version (Test 1)
I started a sysbench run immediately after these connections were dropped by iptables:
We can see that QPS was affected initially (note that we deployed multiple TiDBs and only one was affected). It recovered after about 30 seconds, which is far better than the 15-minute recovery before. Sysbench did not fail either.
For this fixed version (Test 2)
I started a sysbench run 1 minute after these connections were dropped by iptables:
We can see that QPS was not affected at all.
Does this PR affect documentation (docs/docs-cn) update? (mandatory)
No.
Does this PR affect tidb-ansible update? (mandatory)
pingcap/tidb-ansible#469
Does this PR need to be added to the release notes? (mandatory)
No.
Refer to a related PR or issue link (optional)
Benchmark result if necessary (optional)
Add a few positive/negative examples (optional)