In a situation where we start MySQLd under vttablet control (e.g. with -restore_from_backup), as we do in the Vitess Operator, we found the following issue.
If you start up vttablet while running a /healthz check in a tight loop, you will observe something like the following (after the HTTP port becomes live, of course):
14:56:21.803939901 ok
14:56:22.070264141 ok
14:56:22.337504430 ok
14:56:22.603866197 ok
14:56:22.869842826 500 internal server error: vttablet is not serving
14:56:23.133949851 ok
If you correlate this with the vttablet logs, you will see that the first four ok checks come from before the tablet transitioned to SERVING, while the state was still "NotConnected" (i.e. it was waiting for MySQL to come up).
Normally this would not matter. However, if either:
- MySQL does not come up at all (corruption), or
- MySQL takes a long time to come up (extended recovery, upgrade activity),
then /healthz reporting ok can cause the Vitess Operator to declare the vttablet pod as running. This can affect the operator's rollout of tablet changes: it may tear down the next vttablet pod in a shard before the previous one has become truly ready.
The problem seems to be in the code here:
https://github.com/vitessio/vitess/blob/main/go/vt/vttablet/tabletserver/tabletserver.go#L1542
where we will only report /healthz as unhealthy if wantState == StateServing, but not if wantState is StateNotConnected.
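To make the condition concrete, here is a minimal, self-contained Go sketch of the behavior described above. It is not the actual Vitess code: the type `servingState`, its constants, and both functions are hypothetical simplifications. `healthzAsDescribed` models the check as reported in this issue, and `healthzStricter` shows one possible adjustment that also treats a not-connected target state as unhealthy while the tablet is not serving.

```go
package main

import "fmt"

// servingState is a simplified, hypothetical stand-in for the tablet's
// target ("want") state; it is not the real Vitess type.
type servingState int

const (
	stateNotConnected servingState = iota
	stateNotServing
	stateServing
)

const notServingMsg = "500 internal server error: vttablet is not serving"

// healthzAsDescribed models the reported behavior: the check fails only when
// the target state is serving and the tablet is not currently serving.
func healthzAsDescribed(want servingState, serving bool) (int, string) {
	if want == stateServing && !serving {
		return 500, notServingMsg
	}
	return 200, "ok"
}

// healthzStricter is one possible adjustment (an assumption, not a confirmed
// fix): a not-connected target state is also reported as unhealthy.
func healthzStricter(want servingState, serving bool) (int, string) {
	if (want == stateServing || want == stateNotConnected) && !serving {
		return 500, notServingMsg
	}
	return 200, "ok"
}

func main() {
	// Tablet is still waiting for MySQL: target state is not-connected.
	code, body := healthzAsDescribed(stateNotConnected, false)
	fmt.Println("as described:", code, body)

	code, body = healthzStricter(stateNotConnected, false)
	fmt.Println("stricter:    ", code, body)
}
```

In this sketch, a tablet waiting for MySQL passes the first check and fails the second, which is the difference that matters for a readiness probe.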