vttablet /healthz reports 200 / "ok" when not connected to MySQL yet #8237

Closed
aquarapid opened this issue Jun 2, 2021 · 0 comments · Fixed by #8238

In a situation where we start MySQLd under vttablet control (e.g. with -restore_from_backup), as we do in the Vitess Operator, we found the following issue.

If you start up vttablet while running a /healthz check in a tight loop, e.g.:

while /bin/true ; do echo -n `date +"%T.%N"` ; echo -n " " ; curl -m 0.25 -s http://localhost:15100/healthz ; sleep 0.25 ; done

You will observe something like (after the HTTP port becomes live, of course):

14:56:21.803939901 ok
14:56:22.070264141 ok
14:56:22.337504430 ok
14:56:22.603866197 ok
14:56:22.869842826 500 internal server error: vttablet is not serving
14:56:23.133949851 ok

If you correlate this with the vttablet logs, you will see that the first four ok responses come from before the tablet transitioned to SERVING, while its state was "NotConnected" (i.e. it was waiting for MySQL to come up).

Normally this would not matter. However, if either:

  • MySQL does not come up at all (corruption), or
  • MySQL takes a long time to come up (extended recovery, upgrade activity)

then /healthz returning ok can cause the Vitess Operator to declare the vttablet pod as running. This can have consequences for the operator's rollout of tablet changes: the next vttablet pod in a shard may be torn down before the previous one has become truly ready.

The problem seems to be in the code here:

https://github.com/vitessio/vitess/blob/main/go/vt/vttablet/tabletserver/tabletserver.go#L1542

There, /healthz is reported as unhealthy only if wantState == StateServing, but not if wantState is StateNotConnected.
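To make the condition concrete, here is a minimal, self-contained Go sketch of the behavior described above. It is not the Vitess code: the `servingState` type, both function names, and the two-argument signature are hypothetical stand-ins, and `stricterCheck` is only one possible way to close the gap, not necessarily what the linked fix does.

```go
package main

import (
	"fmt"
	"net/http"
)

// servingState is a hypothetical stand-in for the tablet server's state type.
type servingState int

const (
	StateNotConnected servingState = iota
	StateNotServing
	StateServing
)

// Body text taken from the curl output shown in this issue.
const notServingMsg = "500 internal server error: vttablet is not serving"

// reportedCheck models the behavior described in the issue: the check fails
// only when the tablet wants to be serving but is not serving yet.
func reportedCheck(wantState, state servingState) (int, string) {
	if wantState == StateServing && state != StateServing {
		return http.StatusInternalServerError, notServingMsg
	}
	return http.StatusOK, "ok"
}

// stricterCheck is an assumed variant that also fails while the tablet is
// still waiting for MySQL (wantState is StateNotConnected).
func stricterCheck(wantState, state servingState) (int, string) {
	if wantState == StateNotConnected {
		return http.StatusInternalServerError, notServingMsg
	}
	return reportedCheck(wantState, state)
}

func main() {
	// Simulate a tablet that has started but is still waiting for MySQL.
	code, body := reportedCheck(StateNotConnected, StateNotConnected)
	fmt.Println("reported:", code, body)

	code, body = stricterCheck(StateNotConnected, StateNotConnected)
	fmt.Println("stricter:", code, body)
}
```

In this sketch, a tablet that is deliberately not serving (wantState is StateNotServing) still reports ok under both checks; whether that is the desired behavior for a readiness probe is a separate design question.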
