
Bug Report: vttablet replica is considered healthy (serving) even when not connected to primary #9788

Closed
vkovacik opened this issue Feb 25, 2022 · 4 comments · Fixed by #9853


@vkovacik

Overview of the Issue

Short description:

A replica tablet keeps serving reads even when its replication IO thread is disconnected and its data has been stale for longer than unhealthy_threshold.

Long description:

We were seeing inconsistency between data returned from the replica and data returned from the primary tablet. After investigation, I found that replication on the replica is not connected to the primary, yet vtgate shows the replica as actively serving reads.

Affected shard:

las-0832024891 marketplace -80 replica 10.77.24.84:15000 10.77.24.84:3306 [] <null>
las-4121544931 marketplace -80 primary 10.77.18.26:15000 10.77.18.26:3306 [] 2022-02-25T20:27:16Z

Replication status on replica (las-0832024891):
Note that the IO thread is not connected, the SQL thread is running, and the reported replication lag is 0.

vitess@vitess-scratch-dev-db-vttablet-las-0832024891-50c8c7e3:/$ mysql -S /vt/socket/mysql.sock -uroot -e 'show slave status \G' | grep -e "Slave_IO_Running\|Slave_SQL_Running\|SQL_Delay"
             Slave_IO_Running: Connecting
            Slave_SQL_Running: Yes
                    SQL_Delay: 0
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates

Serving state:

mysql> show vitess_tablets;
+------+-------------+-------+------------+-------------+----------------+--------------+----------------------+
| Cell | Keyspace    | Shard | TabletType | State       | Alias          | Hostname     | PrimaryTermStartTime |
+------+-------------+-------+------------+-------------+----------------+--------------+----------------------+
| las  | marketplace | -80   | PRIMARY    | SERVING     | las-4121544931 | 10.77.18.26  | 2022-02-25T20:27:16Z |
| las  | marketplace | -80   | REPLICA    | SERVING     | las-0832024891 | 10.77.24.84  |                      |
...

Screenshot from vtgate status (image omitted).

Full replication status output:

vitess@vitess-scratch-dev-db-vttablet-las-0832024891-50c8c7e3:/$ mysql -S /vt/socket/mysql.sock -uroot -e 'show slave status \G'              
*************************** 1. row ***************************
               Slave_IO_State: Connecting to master
                  Master_Host: 10.77.18.26
                  Master_User: vt_repl
                  Master_Port: 3306
                Connect_Retry: 10
              Master_Log_File: 
          Read_Master_Log_Pos: 4
               Relay_Log_File: vt-0832024891-relay-bin.000001
                Relay_Log_Pos: 4
        Relay_Master_Log_File: 
             Slave_IO_Running: Connecting
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 0
              Relay_Log_Space: 154
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1045
                Last_IO_Error: error connecting to master 'vt_repl@10.77.18.26:3306' - retry-time: 10  retries: 134
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1402982156
                  Master_UUID: 45a0fae8-8558-11ec-aab4-4a3c894fe127
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 220225 20:49:46
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 
            Executed_Gtid_Set: 34f5be60-8558-11ec-9338-3289113f7527:1-79,
45a0fae8-8558-11ec-aab4-4a3c894fe127:1-9
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 

Reproduction Steps

On the replica, simulate a replication connectivity issue to the primary. This can be done either by filtering replication traffic from the replica to the primary or by changing the password for the replication user on the primary.
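
For the password option, a minimal sketch of the idea (the DSN, the 'vt_repl'@'%' host pattern, and the password below are placeholders, not values taken from this setup):

package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL driver
)

func main() {
	// Placeholder DSN for the primary (10.77.18.26 in this report); adjust credentials as needed.
	db, err := sql.Open("mysql", "root@tcp(10.77.18.26:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Changing the replication user's password on the primary makes the replica's
	// IO thread loop in "Connecting" with Last_IO_Errno 1045, as seen above.
	if _, err := db.Exec("ALTER USER 'vt_repl'@'%' IDENTIFIED BY 'wrong-password'"); err != nil {
		log.Fatal(err)
	}
}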

Binary Version

/vt/bin/vtgate --version
Version: 13.0.0 (Git revision bc4a9606e1 branch 'heads/v13.0.0') built on Tue Feb 22 14:23:16 UTC 2022 by vitess@buildkitsandbox using go1.17.6 linux/amd64

Operating System and Environment details

N/A

Log Fragments

No response

@vkovacik vkovacik added Needs Triage This issue needs to be correctly labelled and triaged Type: Bug labels Feb 25, 2022
@mattlord
Contributor

mattlord commented Feb 25, 2022

Hi @vkovacik! What version of MySQL are you using here? It feels like a bug in MySQL to show 0 for seconds behind master in this scenario rather than NULL. We now take a NULL value into account with this work: #9308

In this case, assuming this is still an issue in the most recent MySQL versions, we would have to consider the lag to also be "unknown" if the IO_Thread is not running and the Last_IO_Errno is set.

Thank you for the great issue!

@mattlord mattlord self-assigned this Feb 25, 2022
@vkovacik
Author

vkovacik commented Feb 25, 2022

@mattlord It's a recent Percona MySQL, 5.7.36-39:

vitess@vitess-scratch-dev-db-vttablet-las-0832024891-50c8c7e3:/$ mysql --version
mysql  Ver 14.14 Distrib 5.7.36-39, for debian-linux-gnu (x86_64) using  7.0

Yeah, I think when replication is checked by polling "SHOW SLAVE STATUS", besides checking SQL_Delay we also need to check that both the IO and SQL threads are running.

I'm not sure about vanilla MySQL, but Percona 5.7 reports 0 slave lag when the IO thread is not running.

Links:
stateManager.refreshReplHealthLocked()
poller.Status()
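
A rough sketch of that check (the replStatus type and replLag function are hypothetical names for illustration, not the actual code behind the links above):

package main

import (
	"fmt"
	"time"
)

// replStatus holds the SHOW SLAVE STATUS fields relevant here.
type replStatus struct {
	IOThreadRunning  bool
	SQLThreadRunning bool
	SecondsBehind    uint // Seconds_Behind_Master as reported by the server
}

// replLag returns the observed lag and whether it can be trusted. When either
// thread is stopped, the reported value (0 on this Percona 5.7 replica even
// though the IO thread was stuck reconnecting) is treated as unknown, so the
// tablet can be marked unhealthy once unhealthy_threshold is exceeded.
func replLag(s replStatus) (lag time.Duration, known bool) {
	if !s.IOThreadRunning || !s.SQLThreadRunning {
		return 0, false
	}
	return time.Duration(s.SecondsBehind) * time.Second, true
}

func main() {
	// The state from this issue: IO thread disconnected, SQL thread idle, lag reported as 0.
	lag, known := replLag(replStatus{IOThreadRunning: false, SQLThreadRunning: true})
	fmt.Println(lag, known) // prints: 0s false
}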

@mattlord
Contributor

mattlord commented Feb 25, 2022

@vkovacik ok, thanks! MySQL and Percona are generally the same (MariaDB is not, after 5.5).

Given that, I think the fix may be as simple as this:

diff --git a/go/mysql/flavor.go b/go/mysql/flavor.go
index 72cd460d7d..d948874851 100644
--- a/go/mysql/flavor.go
+++ b/go/mysql/flavor.go
@@ -316,7 +316,8 @@ func parseReplicationStatus(fields map[string]string) ReplicationStatus {
        status := ReplicationStatus{
                SourceHost: fields["Master_Host"],
                // These fields are returned from the underlying DB and cannot be renamed
-               IOThreadRunning:  fields["Slave_IO_Running"] == "Yes" || fields["Slave_IO_Running"] == "Connecting",
+               IOThreadRunning: (fields["Slave_IO_Running"] == "Yes") ||
+                       (fields["Slave_IO_Running"] == "Connecting" && fields["Last_IO_Error"] == ""),
                SQLThreadRunning: fields["Slave_SQL_Running"] == "Yes",
        }
        parseInt, _ := strconv.ParseInt(fields["Master_Port"], 10, 0)

We treat "Connecting" as equivalent to "Running" to avoid flapping on low traffic systems due to slave_net_timeout reconnects. I'll open a PR. Thanks again for the great issue!
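
To make the intended behavior concrete, here is a small standalone sketch of that condition (ioThreadRunning is a hypothetical helper mirroring the diff, not the actual parseReplicationStatus):

package main

import "fmt"

// ioThreadRunning mirrors the condition from the diff above: "Connecting"
// only counts as running while no IO error is pending.
func ioThreadRunning(fields map[string]string) bool {
	return fields["Slave_IO_Running"] == "Yes" ||
		(fields["Slave_IO_Running"] == "Connecting" && fields["Last_IO_Error"] == "")
}

func main() {
	cases := []map[string]string{
		// Healthy replica: IO thread connected.
		{"Slave_IO_Running": "Yes", "Last_IO_Error": ""},
		// Brief reconnect after slave_net_timeout on a low-traffic system: still treated as running.
		{"Slave_IO_Running": "Connecting", "Last_IO_Error": ""},
		// The case in this issue: stuck reconnecting with auth error 1045, now treated as not running.
		{"Slave_IO_Running": "Connecting", "Last_IO_Error": "error connecting to master 'vt_repl@10.77.18.26:3306'"},
	}
	for _, c := range cases {
		fmt.Printf("%-10s lastErr=%q -> running=%v\n", c["Slave_IO_Running"], c["Last_IO_Error"], ioThreadRunning(c))
	}
}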

@mattlord mattlord removed the Needs Triage This issue needs to be correctly labelled and triaged label Feb 25, 2022
@vkovacik
Author

Yeah, I agree: (fields["Slave_IO_Running"] == "Connecting" && fields["Last_IO_Error"] == "") should prevent low-traffic systems from flapping and still cover this issue when the IO thread is really disconnected from the primary.
