Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent Race Conditions Between Tablet Deletes and Updates #9237

Merged
merged 5 commits into from
Nov 21, 2021

Conversation

mattlord
Copy link
Contributor

@mattlord mattlord commented Nov 15, 2021

Description

There was a race condition between deleting a tablet's healthcheck record from the authoritative map (hc.healthByAlias) and updating the same tablet's health data record (hc.healthData).

This could cause us to effectively re-add an updated copy of the tablet healthcheck record after it's been deleted due to the open stream with the tablet encounting an error -- often a context being cancelled or a gRPC goaway error seen due to the tablet shutting down. This then leads to "zombie" tablet records in the SHOW VITESS_TABLETS output as it is based on what is in the hc.healthData map and it had records for tablets that were no longer in hc.healthByAlias (so the authoritative map was no longer authoritative):

status := e.scatterConn.GetHealthCheckCacheStatus()

...
func (hc *HealthCheckImpl) cacheStatusMap() map[string]*TabletsCacheStatus {
tcsMap := make(map[string]*TabletsCacheStatus)
hc.mu.Lock()
defer hc.mu.Unlock()
for _, ths := range hc.healthData {
for _, th := range ths {
key := fmt.Sprintf("%v.%v.%v.%v", th.Tablet.Alias.Cell, th.Target.Keyspace, th.Target.Shard, th.Target.TabletType.String())
var tcs *TabletsCacheStatus
var ok bool
if tcs, ok = tcsMap[key]; !ok {
tcs = &TabletsCacheStatus{
Cell: th.Tablet.Alias.Cell,
Target: th.Target,
}
tcsMap[key] = tcs
}
tcs.TabletsStats = append(tcs.TabletsStats, th)
}
}
return tcsMap
}

Changes

  • KEY: The healthcheck will ONLY delete from its own lower level cache directly -- via hc.deleteTablet() -- if there's mismatching endpoints for the tablet indicating that the healthcheck record is no longer valid (this was the case before Remove tablet healthcheck cache record on error #9106)
    • This is the only safe exception to the rule that the healthcheck cache be updated (add,replace,delete) via topo server -> topology watcher -> healthcheck here, as in this case the record is simply not valid and will get corrected in the noted path via the topo server tablet records in the next refresh
  • KEY: We avoid the race condition betweenhc.deleteTablet() and hc.updateHealth() by confirming that the tablet record still exists in the authoritative hc.healthByAlias map before updating the health of the tablet in the sibling hc.healthData map because the mutex does not enforce order of operations. If it does NOT exist there then the update is a no-op (the old tablet health data record copy made here or here will go out of scope and be GC'd).
  • The vtgate/tablet_healthcheck_cache/correctness end2end test was updated to verify the correct behavior (unfortunately the test originally added in Remove tablet healthcheck cache record on error #9106 was simply verifying bad behavior)
  • We create the record in the sibling hc.healthData map -- if it doesn't exist -- when updating the health record for a tablet that we have confirmed exists in the authoritative hc.healthByAlias map

With these changes, I could no longer repeat the zombie tablet record using the steps shown here, and we see the expected log messages indicating that we've avoided the race condition:

/vt/vtdataroot/tmp/vtgate.INFO:I1115 23:31:18.311915    2021 healthcheck.go:424] Tablet zone1-0000000201 has been deleted, skipping health update
/vt/vtdataroot/tmp/vtgate.INFO:I1115 23:31:18.311976    2021 healthcheck.go:424] Tablet zone1-0000000200 has been deleted, skipping health update
/vt/vtdataroot/tmp/vtgate.INFO:I1115 23:31:18.311994    2021 healthcheck.go:424] Tablet zone1-0000000202 has been deleted, skipping health update

Related Issue(s)

Fixes: #9238
Properly Fixes: #8465

Checklist

  • Should this PR be backported? NO
  • Tests were added/updated
  • Documentation is not required

There was a race condition between deleting a tablet's healthcheck
record from the authoritative map (hc.healthByAlias) and updating
the same tablet's health data (hc.healthData) record. This could
cause us to effectively re-add a an updated copy of the tablet
healthcheck record after it's been deleted. This then leads to
"zombie" tablet records in the SHOW VITESS_TABLETS output as it
is based on what is in the hc.healthData map:
https://github.com/vitessio/vitess/blob/693c5dbdeacdd7a705b46ebce6776a5256c8cfef/go/vt/discovery/healthcheck.go#L537-L557

And purge all potential healthcheck records for a tablet alias by
type on delete.

Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord changed the title Purge all potential healthcheck records for a tablet alias on delete Prevent Race Conditions Between Tablet Healthcheck Deletes and Updates Nov 15, 2021
@mattlord mattlord changed the title Prevent Race Conditions Between Tablet Healthcheck Deletes and Updates Prevent Race Condition Between Tablet Deletes and Updates Nov 15, 2021
go/vt/discovery/healthcheck.go Outdated Show resolved Hide resolved
go/vt/discovery/tablet_health_check.go Outdated Show resolved Hide resolved
go/vt/discovery/tablet_health_check.go Show resolved Hide resolved
@mattlord mattlord force-pushed the TabletHealthcheckCorrectness branch 2 times, most recently from f5ce43a to 2ff2eeb Compare November 16, 2021 00:38
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord changed the title Prevent Race Condition Between Tablet Deletes and Updates Prevent Race Conditions Between Tablet Deletes and Updates Nov 19, 2021
Copy link
Member

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found a tiny typo in comments, otherwise LGTM.
Nice work!

This ensures we are not assuming any default value and the tests
will run in less time.

Signed-off-by: Matt Lord <mattalord@gmail.com>
@deepthi deepthi merged commit 645bc9d into vitessio:main Nov 21, 2021
@deepthi deepthi deleted the TabletHealthcheckCorrectness branch November 21, 2021 14:17
pbibra added a commit to slackhq/vitess that referenced this pull request Feb 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

VTGate Healthcheck Cache Inconsistencies vtgate HealthCheck Tablet Cache incorrect information
3 participants