vtgate serves queries from replicas when its source is unavailable and replication is unhealthy #9307

mattlord · 2021-12-01T01:22:59Z

Overview of the Issue

You can use Vitess for read scale-out by explicitly routing queries to shard replicas using <keyspace>/<shard>@replica shard targeting. To bound how stale these reads can be, you configure the tablet to report itself as unhealthy once replication lag exceeds -unhealthy_threshold.

The problem is that when the replica cannot talk to its source (because the source mysqld is down or otherwise unreachable), mysqld reports seconds_behind_master as NULL because the lag is unknown. We then silently convert that NULL to 0, so the replica tablet passes its healthcheck with flying colors and is never marked as unhealthy even though it cannot talk to the source. This can lead to replicas serving VERY stale reads, which can in turn cause a cascade of downstream issues.
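To make the failure mode concrete, here is a minimal Go sketch of the NULL-to-0 conversion described above. It is not the actual Vitess code: the function names lagIgnoringError and lagChecked are hypothetical, and the only assumption taken from this report is that the value is parsed with strconv and the parse error is discarded.

```go
package main

import (
	"fmt"
	"strconv"
)

// lagIgnoringError is a hypothetical stand-in for the buggy behavior:
// the parse error is discarded, so a non-numeric value such as "NULL"
// silently becomes 0 and looks like "fully caught up".
func lagIgnoringError(secondsBehind string) uint32 {
	v, _ := strconv.ParseUint(secondsBehind, 10, 32)
	return uint32(v)
}

// lagChecked is a hypothetical alternative that surfaces the unknown case:
// known is false whenever the value cannot be parsed as an unsigned integer.
func lagChecked(secondsBehind string) (lag uint32, known bool) {
	v, err := strconv.ParseUint(secondsBehind, 10, 32)
	if err != nil {
		return 0, false
	}
	return uint32(v), true
}

func main() {
	// "3" is a normal lag value; "NULL" is what mysqld reports when the
	// replica cannot reach its source.
	for _, raw := range []string{"3", "NULL"} {
		lag, known := lagChecked(raw)
		fmt.Printf("raw=%q ignoring-error=%d checked=%d known=%t\n",
			raw, lagIgnoringError(raw), lag, known)
	}
}
```

With the error ignored, "NULL" is indistinguishable from a lag of 0, so a comparison against -unhealthy_threshold can never fail. Returning a separate known flag is just one way to keep the two cases apart; how the health check should treat the unknown case is a separate decision.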

Reproduction Steps

Steps to reproduce this issue:

$ make docker_local && ./docker/local/run.sh

# create 1 record to make it easier to see when we're serving queries
mysql commerce -e "insert into customer values (1, 'you@planetscale.com')"

# kill the current replica tablet
kill $(ps auxww | grep vttablet | grep 101 | awk '{print $2}'); sleep 2;

# restart it with a 10 second replication lag limit for being considered healthy
vttablet -topo_implementation etcd2 -topo_global_server_address localhost:2379 -topo_global_root /vitess/global -log_dir /vt/vtdataroot/tmp -log_queries_to_file /vt/vtdataroot/tmp/vttablet_0000000101_querylog.txt -tablet-path zone1-0000000101 -init_keyspace commerce -init_shard 0 -init_tablet_type replica -health_check_interval 5s -enable_semi_sync -enable_replication_reporter -backup_storage_implementation file -file_backup_storage_root /vt/vtdataroot/backups -restore_from_backup -port 15101 -grpc_port 16101 -service_map grpc-queryservice,grpc-tabletmanager,grpc-updatestream -pid_file /vt/vtdataroot/vt_0000000101/vttablet.pid -vtctld_addr http://2a35755b0ef2:15000/ -unhealthy_threshold=10s &

sleep 5

# kill its replication source mysqld_safe and mysqld processes so that replication is now broken
kill -9 $(ps auxww | grep mysqld_safe | grep 100 | awk '{print $2}')
kill -9 $(ps auxww | grep mysqld | grep 100 | awk '{print $2}')

# mysqld reports seconds_behind_master as 'NULL' which we convert to a uint32 using strconv
# but we ignore the error and the non-integer value gets converted to 0
# so mysqld tells us it doesn't know how far behind it is because it cannot talk to its source, but the
# tablet tells vtgate that everything is healthy and we're fully caught up
for i in {1..5}; do
  echo -n "mysqld says:"
  command mysql -u root --socket=/vt/vtdataroot/vt_0000000101/mysql.sock -e "show slave status\G" | grep Seconds | cut -d: -f2
  echo -n "vttablet says: "
  curl -s localhost:15101/debug/status_details | jq -r '.[1].Value'
  echo
  sleep 5
done

# You'll see that the vttablet and thus vtgate report everything as being fine and will still serve queries:
mysql commerce/0@replica -e "select * from customer"
curl -s localhost:15101/debug/status_details
mysql -e "show vitess_tablets"

Binary version

$ vtgate --version
Version: 13.0.0-SNAPSHOT (Git revision 6d8de8e8c1 branch 'HEAD') built on Wed Dec  1 00:49:19 UTC 2021 by vitess@6363583678d2 using go1.17 linux/amd64


mattlord self-assigned this Dec 1, 2021

mattlord added Type: Bug Component: Query Serving labels Dec 1, 2021

mattlord mentioned this issue Dec 1, 2021: Estimate replica lag when seconds behind from mysqld is unknown #9308 (Merged)

deepthi closed this as completed in #9308 Dec 7, 2021
