Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Killing connections needs cooperation from the to-be-killed connection #24031

Closed
dveeden opened this issue Apr 14, 2021 · 23 comments · Fixed by #29212 or #32809
Closed

Killing connections needs cooperation from the to-be-killed connection #24031

dveeden opened this issue Apr 14, 2021 · 23 comments · Fixed by #29212 or #32809
Assignees
Labels
affects-4.0 This bug affects 4.0.x versions. affects-5.0 This bug affects 5.0.x versions. affects-5.1 This bug affects 5.1.x versions. affects-5.2 This bug affects 5.2.x versions. affects-5.3 This bug affects 5.3.x versions. affects-5.4 This bug affects 5.4.x versions. severity/minor sig/sql-infra SIG: SQL Infra type/bug The issue is confirmed as a bug.

Comments

@dveeden
Copy link
Contributor

dveeden commented Apr 14, 2021

Bug Report

1. Minimal reproduce step (Required)

  • Run TiDB playground.
  • Connect two sessions with mysql --host 127.0.0.1 --port 4000 -u root -p
  • From connection A: run SHOW PROCESSLIST and get the connection ID of the other connection (mysql reports its own connection id when starting up)
  • Now from connection A issue: KILL TIDB CONNECTION <id> where ID is the connection ID of the other connection.
  • Now run SHOW PROCESSLIST again.

Note that this is with a single tidb node.

2. What did you expect to see? (Required)

The killed connection gone from the processlist.

3. What did you see instead (Required)

The other connection still being in the processlist until it tries to run a query.

Once a query like do 1 is attempted on connection B the connection is gone from the processlist and connection B shows ERROR 2013 (HY000): Lost connection to MySQL server during query

4. What is your TiDB version? (Required)

mysql> select tidb_version()\G
*************************** 1. row ***************************
tidb_version(): Release Version: v5.0.0
Edition: Community
Git Commit Hash: bdac0885cd11bdf571aad9353bfc24e13554b91c
Git Branch: heads/refs/tags/v5.0.0
UTC Build Time: 2021-04-06 16:36:29
GoVersion: go1.13
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
1 row in set (0.00 sec)
@dveeden dveeden added the type/bug The issue is confirmed as a bug. label Apr 14, 2021
@morgo morgo added the sig/sql-infra SIG: SQL Infra label Apr 14, 2021
@morgo
Copy link
Contributor

morgo commented Apr 14, 2021

This is important in the case of max_connections, since idle connections can consume slots and denial of service new connections.

I assume the fix is to have some idle loop to check if the connection is killed.

@cosven
Copy link
Contributor

cosven commented Apr 15, 2021

/severity major

@dveeden
Copy link
Contributor Author

dveeden commented Apr 15, 2021

I tried 4.0.12 and that version behaves the same.

@dveeden
Copy link
Contributor Author

dveeden commented Apr 15, 2021

Besides what @morgo says this also makes the kill command seem less reliable reducing the trust users may have in the product.

@xinwu5
Copy link

xinwu5 commented Apr 19, 2021

Running version 5.7.25-TiDB-v5.0.0, kill query just does not work on alter statement.

|  113 | root        | xx.xx.xx.xx.xx  | test | Query   | 1248 | autocommit | alter table xxxxx

mysql> KILL TIDB CONNECTION 113;
Query OK, 0 rows affected (0.11 sec)

|  113 | root        | xx.xx.xx.xx.xx  | test | Query   | 1494 | autocommit | alter table xxxxxx

@dveeden
Copy link
Contributor Author

dveeden commented Apr 19, 2021

This looks like it is killing the connection after the first interaction after the running query and not directly stop the query as expected.

@morgo
Copy link
Contributor

morgo commented Apr 19, 2021

Running version 5.7.25-TiDB-v5.0.0, kill query just does not work on alter statement.

|  113 | root        | xx.xx.xx.xx.xx  | test | Query   | 1248 | autocommit | alter table xxxxx

mysql> KILL TIDB CONNECTION 113;
Query OK, 0 rows affected (0.11 sec)

|  113 | root        | xx.xx.xx.xx.xx  | test | Query   | 1494 | autocommit | alter table xxxxxx

I think this is a different issue, which is that the query being killed is DDL (which runs on the DDL owner, the state in the processlist here is just polling the progress of DDL). Try killing SELECT SLEEP(60) for example.

For killing DDL, there is ADMIN CANCEL DDL. I am not sure if it is a feature request or a bug, but DDL should be killable with the kill statement.

@xinwu5
Copy link

xinwu5 commented Apr 19, 2021

For killing DDL, there is ADMIN CANCEL DDL. I am not sure if it is a feature request or a bug, but DDL should be killable with the kill statement.

ADMIN CANCEL DDL manages to kill the query. Agree that kill command should be able to kill the DDL.

@morgo
Copy link
Contributor

morgo commented Apr 19, 2021

I've forked the KILL DDL issue to #24144

@morgo morgo self-assigned this Aug 2, 2021
@morgo
Copy link
Contributor

morgo commented Aug 5, 2021

I took a look at this today. The issue is as described; killing a connection sets cc.status to connStatusWaitShutdown, which is read on the next interaction in that session, and then the connection is killed.

My suggestion to fix it is the following:

  1. SHOW PROCESSLIST (and infoschema) is modified to show the State as Killed.
  2. Instead of setting the read timeout to waitTimeout, the code is instead modified to have a hard coded 2s timeout, but loops for up to waitTimeout retrying a read:

tidb/server/conn.go

Lines 933 to 951 in 0c72834

// Usually, client connection status changes between [dispatching] <=> [reading].
// When some event happens, server may notify this client connection by setting
// the status to special values, for example: kill or graceful shutdown.
// The client connection would detect the events when it fails to change status
// by CAS operation, it would then take some actions accordingly.
for {
if !atomic.CompareAndSwapInt32(&cc.status, connStatusDispatching, connStatusReading) ||
// The judge below will not be hit by all means,
// But keep it stayed as a reminder and for the code reference for connStatusWaitShutdown.
atomic.LoadInt32(&cc.status) == connStatusWaitShutdown {
return
}
cc.alloc.Reset()
// close connection when idle time is more than wait_timeout
waitTimeout := cc.getSessionVarsWaitTimeout(ctx)
cc.pkt.setReadTimeout(time.Duration(waitTimeout) * time.Second)
start := time.Now()
data, err := cc.readPacket()

We have a similar problem with the sleep function, which handles interuption by looping:

func doSleep(secs float64, sessVars *variable.SessionVars) (isKilled bool) {
if secs <= 0.0 {
return false
}
dur := time.Duration(secs * float64(time.Second.Nanoseconds()))
ticker := time.NewTicker(10 * time.Millisecond)
defer ticker.Stop()
timer := time.NewTimer(dur)
for {
select {
case <-ticker.C:
if atomic.CompareAndSwapUint32(&sessVars.Killed, 1, 0) {
timer.Stop()
return true
}
case <-timer.C:
return false
}
}
}

@dveeden
Copy link
Contributor Author

dveeden commented Aug 5, 2021

@morgo minor remark: MySQL uses Killed instead of killed. Would be good to keep this identical if possible.

@bb7133
Copy link
Member

bb7133 commented Sep 26, 2021

Hi @morgo

Instead of setting the read timeout to waitTimeout, the code is instead modified to have a hard coded 2s timeout, but loops for up to waitTimeout retrying a read:

Is it better to set the timeout to `NOW'? I saw some codes with a similar purpose as a workaround to interrupt the blocking read from the connection:

https://github.com/google/mtail/pull/497/files#diff-5c07b612b388d4ea0b64ab060e63777999598c659b7deb73ca44a7f2564ad4cdR17

@djshow832
Copy link
Contributor

Modifying waitTimeout sounds similar to cancelling the context directly. So here's another solution:

  1. Every session/connection has a context. Killing the session/connection cancels the context.
  2. Every query has a context, which inherits from the context belonging to the session/connection. Killing the query cancels the context.

@bb7133
Copy link
Member

bb7133 commented Sep 26, 2021

@djshow832 context the straightforward(and 'gopher') way. I've checked the code and looks 'cancels the context' is applicable:

if atomic.LoadInt32(&cc.status) != connStatusShutdown {

@bb7133 bb7133 assigned yiwen92 and unassigned morgo Sep 26, 2021
@tiancaiamao
Copy link
Contributor

Modifying waitTimeout sounds similar to cancelling the context directly. So here's another solution:

  1. Every session/connection has a context. Killing the session/connection cancels the context.
  2. Every query has a context, which inherits from the context belonging to the session/connection. Killing the query cancels the context.

Cancel is problematic, because there are many concurrent executor and they may not handle the cancel welll, result in resource leak or even query block.

@tiancaiamao
Copy link
Contributor

I have a internal document why cancel doesn't work

https://docs.google.com/document/d/1vnc2fsUkE_j0755-247lZL_KtD0FkZLEMe1ULX4p7iI/edit#

And that's why we check the killed flag

@bb7133
Copy link
Member

bb7133 commented Nov 5, 2021

I have a internal document why cancel doesn't work

https://docs.google.com/document/d/1vnc2fsUkE_j0755-247lZL_KtD0FkZLEMe1ULX4p7iI/edit#

And that's why we check the killed flag

I see, thanks for the explaination.

@bb7133
Copy link
Member

bb7133 commented Nov 5, 2021

@tiancaiamao So I think setting waitTimeout is a better way?

@yiwen92
Copy link
Contributor

yiwen92 commented Nov 8, 2021

After an internal discussion, finally we decide to filter out killed connections when doing show processlist. This will keep the same behaviour as v4.0.10 and before.

@github-actions
Copy link

Please check whether the issue should be labeled with 'affects-x.y' or 'fixes-x.y.z', and then remove 'needs-more-info' label.

@bb7133
Copy link
Member

bb7133 commented Nov 25, 2021

@morgo I will make it as 'enhancement' since the current behavior is expected. We can still keep this issue open as a backlog.

@yiwen92
Copy link
Contributor

yiwen92 commented Apr 1, 2022

need cherry pick to all historical versions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-4.0 This bug affects 4.0.x versions. affects-5.0 This bug affects 5.0.x versions. affects-5.1 This bug affects 5.1.x versions. affects-5.2 This bug affects 5.2.x versions. affects-5.3 This bug affects 5.3.x versions. affects-5.4 This bug affects 5.4.x versions. severity/minor sig/sql-infra SIG: SQL Infra type/bug The issue is confirmed as a bug.
Projects
None yet
9 participants