Remove tablet healthcheck cache record on error #9106
Hey Matt! I have reproduced your test case with and without your patch. After following the instructions using your patch, the primary tablet of
After executing
Hmm, are you sure the patch was applied? I can try to repeat it some more times too...
I'm going to work on an e2e test for this that fails on main but passes on this branch, to be sure that we've fixed it here and to ensure we don't have similar issues crop up later.
I have retried the same procedure a few times in a row and the outputs were always correct.
Looks pretty safe to me. Nice tests. I'll wait for @deepthi to do the final ack.
this looks safe to me as well
Signed-off-by: Matt Lord <mattalord@gmail.com>
Specifically to confirm zombie tablets do not occur Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
💯 on the tests. They give us confidence in this change 😃
Signed-off-by: Matt Lord <mattalord@gmail.com>
The new changes look good. Thank you for fixing WaitForStatus.
Signed-off-by: Priya Bibra <pbibra@slack-corp.com>
Description
If a tablet's health check runs concurrently with its deletion, you can hit a timing issue: the goroutine getting the status over the vtgate->vttablet RPC call gets a connection-related error (e.g. context cancelled or EOF), but a copy of the tablet's healthcheck cache record is made and that copy's updated state is stored in the healthcheck cache after the tablet has gone away. Thus you can end up with a zombie tablet record in the healthcheck cache.

The shorter your -health_check_interval tablet flag is (default is 30s), the more likely you are to encounter this. I was able to repeat it quite easily and regularly in the docker_local container, in part because the interval is set to 5s for those tablets and the DeleteTablet calls made in the test case (./307_delete_shard_0.sh) are somewhat slow given the execution context.

I was no longer able to repeat the bug using my test case with this patch.
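To make the race concrete, here is a minimal sketch of the idea, not the actual Vitess code. All names (HealthCache, TabletHealth, ApplyCheckResult, ErrConnClosed) are hypothetical and exist only for illustration. It assumes the cache is a map guarded by a mutex, and that a health check result is only written back if the tablet is still present and the check did not fail; on error the record is removed rather than updated.

```go
package main

import (
	"errors"
	"sync"
)

// ErrConnClosed is a hypothetical stand-in for a connection-related
// error such as "context cancelled" or EOF.
var ErrConnClosed = errors.New("connection closed")

// TabletHealth is a hypothetical, simplified healthcheck cache record.
type TabletHealth struct {
	Alias   string
	Serving bool
}

// HealthCache is a hypothetical stand-in for a healthcheck cache
// keyed by tablet alias.
type HealthCache struct {
	mu      sync.Mutex
	records map[string]*TabletHealth
}

// NewHealthCache returns an empty cache.
func NewHealthCache() *HealthCache {
	return &HealthCache{records: make(map[string]*TabletHealth)}
}

// AddTablet registers a tablet with an initial non-serving record.
func (c *HealthCache) AddTablet(alias string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.records[alias] = &TabletHealth{Alias: alias}
}

// DeleteTablet removes the tablet's record from the cache.
func (c *HealthCache) DeleteTablet(alias string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.records, alias)
}

// Has reports whether a record exists for the alias.
func (c *HealthCache) Has(alias string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.records[alias]
	return ok
}

// ApplyCheckResult stores the outcome of one health check and reports
// whether a record was written. It never recreates a record for a
// tablet that was deleted while the check was in flight, and it drops
// the record when the check itself failed.
func (c *HealthCache) ApplyCheckResult(alias string, serving bool, checkErr error) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.records[alias]; !ok {
		// Tablet was deleted concurrently: do not create a zombie record.
		return false
	}
	if checkErr != nil {
		// Check failed: remove the record instead of storing stale state.
		delete(c.records, alias)
		return false
	}
	c.records[alias] = &TabletHealth{Alias: alias, Serving: serving}
	return true
}
```

In this sketch, calling DeleteTablet and then ApplyCheckResult for the same alias leaves the cache without that record, which is the property the zombie tablet test is meant to confirm. Doing the existence check and the write under the same lock is what closes the window; checking first and writing a copy later would reintroduce the race.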
Related Issue(s)
#8465
Checklist