topology_watcher: Allow tablets to reuse old tablet addresses. #5244

Merged
merged 1 commit into from
Sep 27, 2019

enisoc
Member

enisoc commented Sep 27, 2019

Fixes #5229.

The unit test failed until I added a check in RemoveTablet that verifies we're deleting the tablet we think we are.
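
For illustration, here is a minimal sketch of that kind of check, assuming a map keyed by address where a new tablet may take over an address an old tablet used to hold. The names `Tablet`, `registry`, `add`, and `remove` are hypothetical stand-ins and not the actual healthcheck code; the point is only that removal compares the stored alias with the alias being removed before deleting.

```go
package main

import "fmt"

// Tablet is a simplified, hypothetical stand-in for a tablet record.
type Tablet struct {
	Alias string // unique tablet identity
	Addr  string // host:port, which a different tablet may later reuse
}

// registry maps an address to the tablet currently believed to own it.
type registry struct {
	byAddr map[string]Tablet
}

func newRegistry() *registry {
	return &registry{byAddr: make(map[string]Tablet)}
}

// add records t as the owner of its address, replacing any previous owner.
func (r *registry) add(t Tablet) {
	r.byAddr[t.Addr] = t
}

// remove deletes the entry at t.Addr only if that entry still belongs to t.
// It reports whether an entry was deleted.
func (r *registry) remove(t Tablet) bool {
	cur, ok := r.byAddr[t.Addr]
	if !ok || cur.Alias != t.Alias {
		// The address is unknown or has been taken over by another tablet.
		return false
	}
	delete(r.byAddr, t.Addr)
	return true
}

func main() {
	r := newRegistry()
	oldT := Tablet{Alias: "cell1-0000000101", Addr: "10.0.0.5:15999"}
	newT := Tablet{Alias: "cell1-0000000102", Addr: "10.0.0.5:15999"}

	r.add(oldT)
	r.add(newT) // the new tablet reuses the old tablet's address

	fmt.Println(r.remove(oldT)) // false: the entry now belongs to newT
	fmt.Println(len(r.byAddr))  // 1: newT is still registered
}
```

Without the alias comparison, a late removal of the old tablet would delete the new tablet's entry, which would look like the tablet being lost.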

I also removed the goroutines from RemoveTablet and ReplaceTablet because it's ludicrous that there was no guarantee whatsoever of what state you'll be in at the end of loadTablets(). Let me know if there was some reason we had tried to make those asynchronous.

Signed-off-by: Anthony Yeh <enisoc@planetscale.com>
-		hc.AddTablet(new, name)
-	}()
+	hc.deleteConn(old)
+	hc.AddTablet(new, name)
Collaborator

Nice, yeah, this should hopefully eliminate a huge source of threading issues.

Collaborator

tirsen commented Sep 27, 2019

I'll patch this in and will test it in our environment. It was very easy to reproduce the issue there.

Collaborator

tirsen commented Sep 27, 2019

I can confirm this solves the issue in our environment but I'll let someone else more familiar with the code do a deeper review.

Member Author

enisoc commented Sep 27, 2019

@demmer explained that RemoveTablet and ReplaceTablet were likely put in goroutines because they had seen the attempt to close the connection inside deleteConn() sometimes hang forever. Since the healthchecker held the mutex during that call, the hang deadlocked it.

However, it seems that now we don't close connections synchronously anyway. We just cancel the Context and move on. So I believe it should be safe to run RemoveTablet synchronously now.
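
A rough sketch of that reasoning, assuming teardown only cancels a Context and never waits for the connection to finish closing. All names here (`conn`, `checker`, `add`, `remove`, `replace`, `has`) are hypothetical and simplified; this is not the healthcheck implementation.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// conn is a hypothetical per-tablet connection handle.
type conn struct {
	cancel context.CancelFunc
	done   chan struct{} // closed when the background worker exits
}

// checker is a hypothetical, simplified health checker.
type checker struct {
	mu    sync.Mutex
	conns map[string]*conn
}

func newChecker() *checker {
	return &checker{conns: make(map[string]*conn)}
}

// add registers key and starts a worker that runs until its Context is cancelled.
func (c *checker) add(key string) *conn {
	ctx, cancel := context.WithCancel(context.Background())
	cn := &conn{cancel: cancel, done: make(chan struct{})}
	go func() {
		defer close(cn.done)
		<-ctx.Done() // stand-in for a long-running health check loop
	}()
	c.mu.Lock()
	c.conns[key] = cn
	c.mu.Unlock()
	return cn
}

// remove cancels the worker and drops the entry. It never waits for the
// worker to exit, so holding the mutex here cannot block on a slow close.
func (c *checker) remove(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cn, ok := c.conns[key]; ok {
		cn.cancel()
		delete(c.conns, key)
	}
}

// replace runs both steps synchronously, so the map is in a known state
// by the time it returns.
func (c *checker) replace(oldKey, newKey string) {
	c.remove(oldKey)
	c.add(newKey)
}

// has reports whether key is currently registered.
func (c *checker) has(key string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.conns[key]
	return ok
}

func main() {
	c := newChecker()
	old := c.add("10.0.0.5:15999")
	c.replace("10.0.0.5:15999", "10.0.0.6:15999")
	<-old.done // the old worker exits once its Context is cancelled

	fmt.Println(c.has("10.0.0.5:15999"), c.has("10.0.0.6:15999")) // false true
}
```

Because `remove` returns as soon as the Context is cancelled, the caller can rely on the state of the map immediately after `replace` returns, which is the guarantee the goroutine version could not give.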

Member

derekperkins left a comment

LGTM

Contributor

sougou left a comment

This may not be watertight. But it's a big improvement over what was there before.

	// If it's the same tablet, something is wrong.
	if topoproto.TabletAliasEqual(th.latestTabletStats.Tablet.Alias, tablet.Alias) {
		hc.mu.Unlock()
		log.Warningf("refusing to add duplicate tablet %v for %v: %+v", name, tablet.Alias.Cell, tablet)
Contributor

This should eventually become an acceptable state, which will allow us to "update if needed" kind of calls. But let the warning remain for now. It will allow us to find out if this really happens in the wild.

The other future improvement on this will be to add a timestamp that tracks when a tablet acquired the address to declare the true winner. Similar to how we're solving the mastership story.
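
As a sketch of what that timestamp idea could look like (nothing in this PR implements it, and `claim`, `addrTable`, and `offer` are hypothetical names), each address could remember when its current owner acquired it, and a competing claim would win only if it is strictly newer:

```go
package main

import (
	"fmt"
	"time"
)

// claim is a hypothetical record of a tablet's ownership of an address.
type claim struct {
	Alias      string
	AcquiredAt time.Time // when the tablet acquired the address
}

// addrTable maps an address to the claim that currently wins it.
type addrTable map[string]claim

// offer applies c to addr. A claim replaces the current owner only if the
// address is unclaimed or c was acquired strictly later. It reports whether
// the tablet named in c owns addr after the call.
func (t addrTable) offer(addr string, c claim) bool {
	cur, ok := t[addr]
	if ok && !c.AcquiredAt.After(cur.AcquiredAt) {
		// Stale or repeated claim: keep the current owner. A repeated claim
		// by the same tablet is harmless, in the "update if needed" spirit.
		return cur.Alias == c.Alias
	}
	t[addr] = c
	return true
}

func main() {
	t := addrTable{}
	t0 := time.Date(2019, 9, 27, 12, 0, 0, 0, time.UTC)
	addr := "10.0.0.5:15999"

	newer := claim{Alias: "cell1-0000000102", AcquiredAt: t0.Add(time.Minute)}
	older := claim{Alias: "cell1-0000000101", AcquiredAt: t0}

	fmt.Println(t.offer(addr, newer)) // true
	fmt.Println(t.offer(addr, older)) // false: the stale claim loses
	fmt.Println(t[addr].Alias)        // cell1-0000000102
}
```

With this rule the result does not depend on the order in which the two claims are observed, which is the property the current "trust whoever showed up last" behavior lacks.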

sougou merged commit e8b05e6 into vitessio:master on Sep 27, 2019
enisoc deleted the topo-watcher-race branch on September 27, 2019 17:30
systay pushed a commit that referenced this pull request Jul 22, 2024
Development

Successfully merging this pull request may close these issues.

vtgate intermittently loses a tablet