Remove tablet healthcheck cache record on error #9106

mattlord · 2021-10-28T00:49:08Z

Description

If a tablet's health check runs concurrently with its deletion, you can hit a timing issue. The goroutine that gets the tablet's status (vtgate->vttablet, via an RPC call) receives a connection-related error (e.g. context cancelled or EOF), but a copy of the tablet's healthcheck cache record is made and that copy's updated state is stored in the healthcheck cache after the tablet has gone away. The result is a zombie tablet record in the healthcheck cache.
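The timing issue described above can be illustrated with a small, deterministic simulation. This is only a sketch under stated assumptions, not the actual vtgate code: `tabletHealth`, `healthCache`, `checkOnce`, and the `removeOnError` switch are hypothetical names invented for the illustration, and the concurrent DeleteTablet call is modeled as a callback that runs while the simulated RPC is in flight.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// tabletHealth is a hypothetical, simplified healthcheck cache record.
type tabletHealth struct {
	Alias   string
	Serving bool
	LastErr error
}

// healthCache is a hypothetical cache keyed by tablet alias.
type healthCache struct {
	mu      sync.Mutex
	records map[string]*tabletHealth
}

func newHealthCache() *healthCache {
	return &healthCache{records: make(map[string]*tabletHealth)}
}

func (c *healthCache) put(th *tabletHealth) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.records[th.Alias] = th
}

func (c *healthCache) remove(alias string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.records, alias)
}

func (c *healthCache) has(alias string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.records[alias]
	return ok
}

// checkOnce simulates a single health check. duringRPC stands in for work
// that happens concurrently with the RPC (here: the tablet being deleted).
func checkOnce(c *healthCache, alias string, duringRPC func(), rpcErr error, removeOnError bool) {
	c.mu.Lock()
	cur, ok := c.records[alias]
	c.mu.Unlock()
	if !ok {
		return
	}
	snapshot := *cur // copy of the cache record

	duringRPC() // the tablet goes away while the RPC is in flight

	if rpcErr == nil {
		return // success path is not relevant to this sketch
	}
	if removeOnError {
		// Behavior in the spirit of this PR's title: drop the record on error.
		c.remove(alias)
		return
	}
	// Problematic path: the updated copy is stored after the deletion.
	snapshot.Serving = false
	snapshot.LastErr = rpcErr
	c.put(&snapshot)
}

// simulate reports whether a zombie record is left behind.
func simulate(removeOnError bool) bool {
	c := newHealthCache()
	alias := "zone1-0000000200"
	c.put(&tabletHealth{Alias: alias, Serving: true})
	checkOnce(c, alias, func() { c.remove(alias) }, errors.New("EOF"), removeOnError)
	return c.has(alias)
}

func main() {
	fmt.Println("zombie without fix:", simulate(false)) // prints: zombie without fix: true
	fmt.Println("zombie with fix:", simulate(true))     // prints: zombie with fix: false
}
```

In the sketch, the stale copy is written back only on the error path, which is why removing the record on error is enough to avoid the zombie entry here. The real health check code may differ in structure.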

The shorter the tablet's -health_check_interval flag (default is 30s), the more likely you are to encounter this. I was able to reproduce it easily and regularly in the docker_local container, in part because the interval is set to 5s for those tablets and the DeleteTablet calls made in the test case (./307_delete_shard_0.sh) are somewhat slow given the execution context.

With this patch, I was no longer able to reproduce the bug using my test case.

Note: without this patch, the only way to remove these zombie tablet healthcheck records is to restart the vtgate.

Related Issue(s)

#8465

Checklist

Should this PR be backported? Thoughts?
Tests were added
- This demonstrates the test failing (as expected) on a branch that only has the test: Add e2e test to check for tablet healthcheck cache correctness #9109
Documentation was added or is not required

frouioui · 2021-10-28T06:11:40Z

Hey Matt!

I have reproduced your test case with and without your patch. After following the instructions using your patch, the primary tablet of customer/0 is still hanging:

mysql> show vitess_tablets;
+-------+----------+-------+------------+-------------+------------------+--------------+----------------------+
| Cell  | Keyspace | Shard | TabletType | State       | Alias            | Hostname     | PrimaryTermStartTime |
+-------+----------+-------+------------+-------------+------------------+--------------+----------------------+
| zone1 | commerce | 0     | PRIMARY    | SERVING     | zone1-0000000100 | e4979245177d | 2021-10-28T05:59:59Z |
| zone1 | commerce | 0     | REPLICA    | SERVING     | zone1-0000000101 | e4979245177d |                      |
| zone1 | commerce | 0     | RDONLY     | SERVING     | zone1-0000000102 | e4979245177d |                      |
| zone1 | customer | -80   | PRIMARY    | SERVING     | zone1-0000000300 | e4979245177d | 2021-10-28T06:01:52Z |
| zone1 | customer | -80   | REPLICA    | SERVING     | zone1-0000000301 | e4979245177d |                      |
| zone1 | customer | -80   | RDONLY     | SERVING     | zone1-0000000302 | e4979245177d |                      |
| zone1 | customer | 0     | PRIMARY    | NOT_SERVING | zone1-0000000200 | e4979245177d | 2021-10-28T06:01:06Z |
| zone1 | customer | 80-   | PRIMARY    | SERVING     | zone1-0000000400 | e4979245177d | 2021-10-28T06:01:52Z |
| zone1 | customer | 80-   | REPLICA    | SERVING     | zone1-0000000401 | e4979245177d |                      |
| zone1 | customer | 80-   | RDONLY     | SERVING     | zone1-0000000402 | e4979245177d |                      |
+-------+----------+-------+------------+-------------+------------------+--------------+----------------------+
10 rows in set (0.00 sec)

After executing 306_down_shard_0.sh and 307_delete_shard_0.sh, that shard should not be around anymore, though it is still listed in show vitess_tablets. Am I missing something?

mattlord · 2021-10-28T06:23:34Z

> I have reproduced your test case with and without your patch. After following the instructions using your patch, the primary tablet of customer/0 is still hanging:
> ...
> After executing 306_down_shard_0.sh and 307_delete_shard_0.sh, that shard should not be around anymore, though it is still listed in show vitess_tablets. Am I missing something?

Hmm, are you sure the patch was applied? I can try to repeat it some more times too...

mattlord · 2021-10-28T06:37:39Z

I'm going to work on an e2e test for this that fails on main but passes on this branch to be sure that we've fixed it here and ensure we don't have similar issues crop up later.

frouioui · 2021-10-28T06:42:38Z

> Hmm, are you sure the patch was applied? I can try to repeat it some more times too...

I ran make docker_local on this pull request's head:

Version: 12.0.0-SNAPSHOT (Git revision 7c310fa6a0 branch 'ZombieTabletHeathCheckEntries') built on Thu Oct 28 05:57:39 UTC 2021 by vitess@buildkitsandbox using go1.17 linux/amd64

I have retried the same procedure a few times in a row and the outputs were always correct.

sougou

Looks pretty safe to me. Nice tests. I'll wait for @deepthi to do the final ack.

mattlord · 2021-10-28T18:32:40Z

> Looks pretty safe to me. Nice tests. I'll wait for @deepthi to do the final ack.

Thanks! Once I get the new e2e test passing on this branch and failing against main in #9109, I'll ping @deepthi for a review.

frouioui

this looks safe to me as well

deepthi

💯 on the tests. They give us confidence in this change 😃

deepthi · 2021-11-01T16:58:22Z

The new changes look good. Thank you for fixing WaitForStatus.

mattlord added the Component: Query Serving, Type: Bug, and release notes labels Oct 28, 2021

mattlord requested a review from deepthi October 28, 2021 01:10

mattlord marked this pull request as ready for review October 28, 2021 01:15

mattlord requested a review from frouioui October 28, 2021 01:45

mattlord force-pushed the ZombieTabletHeathCheckEntries branch from 6d61534 to 7c310fa Compare October 28, 2021 02:20

mattlord changed the title ~~Remove tablet healthcheck record when context cancelled~~ Remove tablet healthcheck cache record when context cancelled Oct 28, 2021

mattlord force-pushed the ZombieTabletHeathCheckEntries branch from 7c310fa to 8f0b646 Compare October 28, 2021 06:11

mattlord requested review from harshit-gangal and systay as code owners October 28, 2021 16:26

mattlord force-pushed the ZombieTabletHeathCheckEntries branch 3 times, most recently from ab85ea3 to 91c800a Compare October 28, 2021 17:07

mattlord removed request for systay and harshit-gangal October 28, 2021 17:12

mattlord mentioned this pull request Oct 28, 2021

Add e2e test to check for tablet healthcheck cache correctness #9109

Closed

mattlord force-pushed the ZombieTabletHeathCheckEntries branch from 91c800a to eceb452 Compare October 28, 2021 17:30

sougou reviewed Oct 28, 2021

mattlord force-pushed the ZombieTabletHeathCheckEntries branch 6 times, most recently from ff40e0a to da592d4 Compare October 29, 2021 05:08

frouioui reviewed Oct 29, 2021

mattlord force-pushed the ZombieTabletHeathCheckEntries branch from 995d713 to e0b66aa Compare October 29, 2021 08:29

mattlord changed the title ~~Remove tablet healthcheck cache record when context cancelled~~ Remove tablet healthcheck cache record on error Oct 29, 2021

mattlord force-pushed the ZombieTabletHeathCheckEntries branch from cfc5a54 to 4a7a3cb Compare October 29, 2021 15:55

mattlord requested review from frouioui and sougou October 29, 2021 17:40

mattlord added 3 commits October 29, 2021 15:46

Remove tablet healthcheck record when context cancelled

d509ae0

Signed-off-by: Matt Lord <mattalord@gmail.com>

Add e2e test to check for tablet healthcheck cache correctness

a926bb5

Specifically to confirm zombie tablets do not occur

Signed-off-by: Matt Lord <mattalord@gmail.com>

Clear tablet healthcheck cache record on EOF RPC error too

b1ccca3

Signed-off-by: Matt Lord <mattalord@gmail.com>

mattlord force-pushed the ZombieTabletHeathCheckEntries branch 2 times, most recently from bd68e0c to c84819d Compare October 29, 2021 20:39

deepthi approved these changes Oct 29, 2021

mattlord force-pushed the ZombieTabletHeathCheckEntries branch 6 times, most recently from 64c52cc to f7ed5a0 Compare October 30, 2021 03:08

Improvements to tablet healthcheck tests

240f3c8

Signed-off-by: Matt Lord <mattalord@gmail.com>

mattlord force-pushed the ZombieTabletHeathCheckEntries branch from f7ed5a0 to 240f3c8 Compare October 30, 2021 03:33

sougou approved these changes Nov 1, 2021

deepthi merged commit 4f068c5 into vitessio:main Nov 1, 2021

deepthi deleted the ZombieTabletHeathCheckEntries branch November 1, 2021 16:58

mattlord mentioned this pull request Nov 1, 2021

vtgate HealthCheck Tablet Cache incorrect information #8465

Closed

This was referenced Nov 15, 2021

VTGate Healthcheck Cache Inconsistencies #9238

Closed

Prevent Race Conditions Between Tablet Deletes and Updates #9237

Merged

mattlord mentioned this pull request Mar 22, 2022

Properly generate vtgate_tablet_healthcheck_cache workflow #9950

Merged

3 tasks

pbibra added a commit to slackhq/vitess that referenced this pull request Feb 13, 2023

apply patch from vitessio#9106

8c0bdd0

Signed-off-by: Priya Bibra <pbibra@slack-corp.com>

pbibra mentioned this pull request Feb 13, 2023

backport upstream patch 12178 and 9106 slackhq/vitess#48

Merged

3 tasks
