Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove tablet healthcheck cache record on error #9106

Merged
merged 4 commits into from
Nov 1, 2021

Conversation

mattlord
Copy link
Contributor

@mattlord mattlord commented Oct 28, 2021

Description

If a tablet's health check runs concurrently with its deletion then you can end up with a timing issue where the goroutine getting the status from vtgate->vttablet via an RPC call has a connection related error (e.g. context cancelled or EOF) but a copy of the tablet's healthcheck cache record is made and that copy's updated state is stored in the healthcheck cache after the tablet has gone away. Thus you can end up with a zombie tablet record in the healthcheck cache.

The shorter your -health_check_interval tablet flag is (default is 30s), the more likely you will encounter this. I was able to repeat it quite easily and regularly in the docker_local container in part because the interval is set to 5s for those tablets and the DeleteTablet calls made in the test case (./307_delete_shard_0.sh) are somewhat slow given the execution context.

I was no longer able to repeat the bug using my test case with this patch.

Note: w/o this patch, the only way to remove these zombie tablet healthcheck records is to restart the vtgate

Related Issue(s)

#8465

Checklist

@mattlord mattlord requested a review from deepthi October 28, 2021 01:10
@mattlord mattlord marked this pull request as ready for review October 28, 2021 01:15
@mattlord mattlord force-pushed the ZombieTabletHeathCheckEntries branch from 6d61534 to 7c310fa Compare October 28, 2021 02:20
@mattlord mattlord changed the title Remove tablet healthcheck record when context cancelled Remove tablet healthcheck cache record when context cancelled Oct 28, 2021
@mattlord mattlord force-pushed the ZombieTabletHeathCheckEntries branch from 7c310fa to 8f0b646 Compare October 28, 2021 06:11
@frouioui
Copy link
Member

Hey Matt!

I have reproduced your test case with and without your patch. After following the instructions using your patch, the primary tablet of customer/0 is still hanging:

mysql> show vitess_tablets;
+-------+----------+-------+------------+-------------+------------------+--------------+----------------------+
| Cell  | Keyspace | Shard | TabletType | State       | Alias            | Hostname     | PrimaryTermStartTime |
+-------+----------+-------+------------+-------------+------------------+--------------+----------------------+
| zone1 | commerce | 0     | PRIMARY    | SERVING     | zone1-0000000100 | e4979245177d | 2021-10-28T05:59:59Z |
| zone1 | commerce | 0     | REPLICA    | SERVING     | zone1-0000000101 | e4979245177d |                      |
| zone1 | commerce | 0     | RDONLY     | SERVING     | zone1-0000000102 | e4979245177d |                      |
| zone1 | customer | -80   | PRIMARY    | SERVING     | zone1-0000000300 | e4979245177d | 2021-10-28T06:01:52Z |
| zone1 | customer | -80   | REPLICA    | SERVING     | zone1-0000000301 | e4979245177d |                      |
| zone1 | customer | -80   | RDONLY     | SERVING     | zone1-0000000302 | e4979245177d |                      |
| zone1 | customer | 0     | PRIMARY    | NOT_SERVING | zone1-0000000200 | e4979245177d | 2021-10-28T06:01:06Z |
| zone1 | customer | 80-   | PRIMARY    | SERVING     | zone1-0000000400 | e4979245177d | 2021-10-28T06:01:52Z |
| zone1 | customer | 80-   | REPLICA    | SERVING     | zone1-0000000401 | e4979245177d |                      |
| zone1 | customer | 80-   | RDONLY     | SERVING     | zone1-0000000402 | e4979245177d |                      |
+-------+----------+-------+------------+-------------+------------------+--------------+----------------------+
10 rows in set (0.00 sec)

After executing 306_down_shard_0.sh and 307_delete_shard_0.sh, that shard should not be around anymore, though it is still listed in show vitess_tablets. Am I missing something?

@mattlord
Copy link
Contributor Author

I have reproduced your test case with and without your patch. After following the instructions using your patch, the primary tablet of customer/0 is still hanging:
...
After executing 306_down_shard_0.sh and 307_delete_shard_0.sh, that shard should not be around anymore, though it is still listed in show vitess_tablets. Am I missing something?

Hmm, are you sure the patch was applied? I can try to repeat it some more times too...

@mattlord
Copy link
Contributor Author

mattlord commented Oct 28, 2021

I'm going to work on an e2e test for this that fails on main but passes on this branch to be sure that we've fixed it here and ensure we don't have similar issues crop up later.

@frouioui
Copy link
Member

Hmm, are you sure the patch was applied? I can try to repeat it some more times too...

make docker_local was made on this pull request's head:

Version: 12.0.0-SNAPSHOT (Git revision 7c310fa6a0 branch 'ZombieTabletHeathCheckEntries') built on Thu Oct 28 05:57:39 UTC 2021 by vitess@buildkitsandbox using go1.17 linux/amd64

I have retried the same procedure a few times in a row and the outputs were always correct.

Copy link
Contributor

@sougou sougou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty safe to me. Nice tests. I'll wait for @deepthi to do the final ack.

@mattlord
Copy link
Contributor Author

mattlord commented Oct 28, 2021

Looks pretty safe to me. Nice tests. I'll wait for @deepthi to do the final ack.

Thanks! Once I get the new e2e test passing on this branch and failing here (against main): #9109 then I'll ping @deepthi for a review.

@mattlord mattlord force-pushed the ZombieTabletHeathCheckEntries branch 6 times, most recently from ff40e0a to da592d4 Compare October 29, 2021 05:08
Copy link
Member

@frouioui frouioui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks safe to me as well

@mattlord mattlord force-pushed the ZombieTabletHeathCheckEntries branch from 995d713 to e0b66aa Compare October 29, 2021 08:29
@mattlord mattlord changed the title Remove tablet healthcheck cache record when context cancelled Remove tablet healthcheck cache record on error Oct 29, 2021
@mattlord mattlord force-pushed the ZombieTabletHeathCheckEntries branch from cfc5a54 to 4a7a3cb Compare October 29, 2021 15:55
Signed-off-by: Matt Lord <mattalord@gmail.com>
Specifically to confirm zombie tablets do not occur

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord force-pushed the ZombieTabletHeathCheckEntries branch 2 times, most recently from bd68e0c to c84819d Compare October 29, 2021 20:39
Copy link
Member

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 on the tests. They give us confidence in this change 😃

@mattlord mattlord force-pushed the ZombieTabletHeathCheckEntries branch 6 times, most recently from 64c52cc to f7ed5a0 Compare October 30, 2021 03:08
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord force-pushed the ZombieTabletHeathCheckEntries branch from f7ed5a0 to 240f3c8 Compare October 30, 2021 03:33
@deepthi
Copy link
Member

deepthi commented Nov 1, 2021

The new changes look good. Thank you for fixing WaitForStatus.

@deepthi deepthi merged commit 4f068c5 into vitessio:main Nov 1, 2021
@deepthi deepthi deleted the ZombieTabletHeathCheckEntries branch November 1, 2021 16:58
pbibra added a commit to slackhq/vitess that referenced this pull request Feb 13, 2023
Signed-off-by: Priya Bibra <pbibra@slack-corp.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants