-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: only validate new regions when adding/dropping #113956
sql: only validate new regions when adding/dropping #113956
Conversation
currentZone := zonepb.NewZoneConfig() | ||
if currentZoneConfigWithRaw, err := params.p.Descriptors().GetZoneConfig( | ||
params.ctx, params.p.Txn(), n.desc.ID, | ||
); err != nil { | ||
return err | ||
} else if currentZoneConfigWithRaw != nil { | ||
currentZone = currentZoneConfigWithRaw.ZoneConfigProto() | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw this snippet of code appearing multiple times. Do we want to make it a function (to avoid some code repetitions)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are talking about only 5 lines of code, used in 4 places, my preference would be to keep the API simple. IMO the added maintenance overhead of making the API handle this usage, which is very similar to an existing API function, is not worth it. This article elaborates on some reasons why it can be better to add code duplication rather than introducing the wrong (or too many) API abstractions: https://sandimetz.com/blog/2016/1/20/the-wrong-abstraction?duplication
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preemptive commenting/reviewing :) All this look good to me; thanks for doing this. Do you plan to add two more tests for the node-restart and restore case?
yeah, these tests are on the way. I created this draft PR to see if any existing tests were affected. |
0f51b22
to
d5336be
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only closely reviewed the c2c test. Thank you for doing this!
|
||
// As a sanity check, drop a region on the source cluster. | ||
c.SrcTenantSQL.ExecSucceedsSoon(c.T, `ALTER DATABASE many DROP REGION "venus"`) | ||
// THIS TIMES OUT. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the last two lines here should be removed. But also, is it unexpected that a user could run this twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aren't these being executed on different clusters?
will remove the comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doh. you're right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i have the memory and reading ability of a goldfish.
}() | ||
|
||
// Check how MR primitives have replicated to non-mr stand by cluster | ||
t.Run("mr db only with primary region", func(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out of curiosity: were you surprised that this worked before this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that checks out with my understanding. the validation logic previously was only validating the final state of the zone config, and this test was ending up with a zone config with no regions to validate.
d5336be
to
ec2e42f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this. I left a comment and another question is "do you think we should also have a test case for the backup/restore case?" Did I miss it somehow?
// This test tests that we can add and drop regions even if the locality flags | ||
// of a node no longer match the regions that already were added to the | ||
// database. | ||
func runMismatchedLocalityTest(ctx context.Context, t test.Test, c cluster.Cluster) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I don't have any objection to how you implemented this function. I'm just curious what you think of a test implementation like the following:
create 3 nodes with locality, "east", "central", "west", respectively
add primary region "east", add region "central", add region "west"
restart all three nodes with a completely different set of localities, "mars", "jupiter", "venus"
drop region "central", drop region "west", drop (primary) region "east"
add primary region "mars", add region "jupiter", add region "venus"
Do you think a test like this is a bit more "stressful"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that test seems fine with me. i can change it to that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"do you think we should also have a test case for the backup/restore case?"
I didn't think this was important to include since we have the mismatched locality test, which is a more direct way to test what we wanted
// This test tests that we can add and drop regions even if the locality flags | ||
// of a node no longer match the regions that already were added to the | ||
// database. | ||
func runMismatchedLocalityTest(ctx context.Context, t test.Test, c cluster.Cluster) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that test seems fine with me. i can change it to that
ec2e42f
to
21eec89
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
21eec89
to
5a877aa
Compare
When we added validation logic to make sure every region corresponded to a known node locality, we were a little too aggressive. The validation made it possible to end up in a state where any ALTER..REGION operation could hang. This could happen in a few situations; for example: - node is restarted with a different locality flag. - MR cluster is restored into a non-MR cluster. - c2c streaming is used with a MR source and non-MR destination. In all these cases, the problem was that the zone configuration could reference a region that no longer has any nodes with the corresponding locality. The validation was too aggressive, since it would validate those regions which already existed in the zone configuration. Now, only the newly added region is validated. Release note (bug fix): Fixed a bug that could cause ALTER DATABASE ... ADD/DROP REGION to hang if node localities were changed after regions were added.
5a877aa
to
4283d2e
Compare
tftr! bors r+ |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 4283d2e to blathers/backport-release-22.2-113956: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.2.x failed. See errors above. error creating merge commit from 4283d2e to blathers/backport-release-23.1-113956: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 23.1.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
When we added validation logic to make sure every region corresponded to a known node locality, we were a little too aggressive. The validation made it possible to end up in a state where any ALTER..REGION operation could hang. This could happen in a few situations; for example:
In all these cases, the problem was that the zone configuration could reference a region that no longer has any nodes with the corresponding locality. The validation was too aggressive, since it would validate those regions which already existed in the zone configuration.
Now, only the newly added region is validated.
fixes #113324
fixes #113871
Release note (bug fix): Fixed a bug that could cause ALTER DATABASE ... ADD/DROP REGION to hang if node localities were changed after regions were added.