
Add locking between replicaGCQueue and multiraft.state.createGroup. #2868

Merged (1 commit, Oct 21, 2015)

Conversation

bdarnell (Contributor)

This partially addresses the race seen in #2815. A similar race still
occurs but much less frequently.
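
A minimal sketch of the locking pattern this change is about, assuming hypothetical names (groupRegistry, createGroup, removeGroup) rather than the actual multiraft API: the creation path (driven by incoming raft messages) and the GC-driven removal path take the same mutex, so a group cannot be recreated in the middle of its removal.

    // Illustrative only: not CockroachDB code, just the shape of the
    // "serialize create and remove on one lock" idea from this PR.
    package main

    import (
        "fmt"
        "sync"
    )

    type groupRegistry struct {
        mu     sync.Mutex
        groups map[int64]struct{} // rangeID -> live raft group
    }

    // createGroup models multiraft creating a group for an incoming message.
    func (r *groupRegistry) createGroup(rangeID int64) {
        r.mu.Lock()
        defer r.mu.Unlock()
        r.groups[rangeID] = struct{}{}
    }

    // removeGroup models the replicaGCQueue tearing a group down. Holding the
    // same mutex for the entire check-and-delete is what prevents createGroup
    // from interleaving with it.
    func (r *groupRegistry) removeGroup(rangeID int64) {
        r.mu.Lock()
        defer r.mu.Unlock()
        delete(r.groups, rangeID)
    }

    func main() {
        r := &groupRegistry{groups: map[int64]struct{}{}}
        var wg sync.WaitGroup
        for i := 0; i < 100; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                r.createGroup(1)
                r.removeGroup(1)
            }()
        }
        wg.Wait()
        fmt.Println("create/remove serialized under a single lock")
    }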

bdarnell (Contributor, Author)

I'm not sure how to test this. I've been testing it locally by increasing the iteration count in TestRaftRemoveRace, although checking in that change would noticeably increase total test runtime. I don't see a good way to trigger the race in a more direct/controlled way.
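
One low-cost way to keep a higher iteration count available without slowing the default suite is to gate it behind an opt-in switch. A test-file sketch, with the environment variable name, counts, and test body purely illustrative rather than the real TestRaftRemoveRace:

    // Sketch only: placeholders, not the actual storage tests.
    package storage

    import (
        "os"
        "testing"
    )

    func raftRemoveIterations() int {
        if os.Getenv("COCKROACH_STRESS_RAFT_REMOVE") != "" {
            return 1000 // opt-in long run for chasing the race
        }
        return 10 // cheap default so normal test runtime is unaffected
    }

    func TestRaftRemoveRaceSketch(t *testing.T) {
        for i := 0; i < raftRemoveIterations(); i++ {
            // ... add and remove the replica, as the real test does ...
        }
    }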

Code under review:

    if _, err := rng.rm.GetReplica(desc.RangeID); err == nil {
        log.Infof("replica recreated during deletion; aborting deletion")
    }

    // TODO(bdarnell): add some sort of locking to prevent the range
Contributor (commenting on the TODO line above):

remove?

bdarnell (Contributor, Author):

Done.

tamird (Contributor) commented Oct 20, 2015

What is the similar race? Can you document it?

bdarnell (Contributor, Author)

I'm still working on identifying the similar race. All I know so far is that it produces the same error message as #2815, and it takes over a minute for the test to reproduce.

bdarnell (Contributor, Author)

One "similar race" is that I forgot to return nil after the "aborting deletion" log line. But even with that fixed I'm seeing other rare failures. I suspect that what may be happening is that the node is sometimes falling far enough behind that it is learning about multiple iterations of the add/remove loop at once.

tamird (Contributor) commented Oct 21, 2015

LGTM
