sql: only validate new regions when adding/dropping #113956

Merged
merged 1 commit into from
Nov 8, 2023

Conversation

rafiss
Collaborator

@rafiss rafiss commented Nov 7, 2023

When we added validation logic to make sure every region corresponded to a known node locality, we were a little too aggressive. The validation made it possible to end up in a state where any ALTER..REGION operation could hang. This could happen in a few situations; for example:

  • node is restarted with a different locality flag.
  • MR cluster is restored into a non-MR cluster.
  • c2c streaming is used with a MR source and non-MR destination.

In all these cases, the problem was that the zone configuration could reference a region that no longer has any nodes with the corresponding locality. The validation was too aggressive, since it would validate those regions which already existed in the zone configuration.

Now, only the newly added region is validated.
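The difference between the old and new behavior can be sketched as follows. This is a minimal illustration, not the code from this PR: the `Region` type, both function names, and the `live` map (standing in for the set of regions derived from node localities) are assumptions made for the example, and the sketch returns an error where the real operation would hang.

```go
package main

import "fmt"

// Region is a hypothetical stand-in for a database region name.
type Region string

// validateAllRegions mimics the overly strict behavior described above:
// every region in the zone configuration must match a live node locality,
// so one stale region blocks any ADD/DROP REGION.
func validateAllRegions(zoneRegions []Region, live map[Region]bool) error {
	for _, r := range zoneRegions {
		if !live[r] {
			return fmt.Errorf("region %q has no nodes with a matching locality", r)
		}
	}
	return nil
}

// validateNewRegion mimics the relaxed behavior: only the region being
// added is checked, so stale regions already in the zone configuration
// no longer block the operation.
func validateNewRegion(added Region, live map[Region]bool) error {
	if !live[added] {
		return fmt.Errorf("region %q has no nodes with a matching locality", added)
	}
	return nil
}
```

With `live` containing only "mars" and a zone configuration that still lists "east", `validateAllRegions` rejects adding "mars", while `validateNewRegion("mars", live)` succeeds.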

fixes #113324
fixes #113871

Release note (bug fix): Fixed a bug that could cause ALTER DATABASE ... ADD/DROP REGION to hang if node localities were changed after regions were added.

@rafiss rafiss added backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.2.x Flags PRs that need to be backported to 23.2. labels Nov 7, 2023

Comment on lines +2329 to +2337
currentZone := zonepb.NewZoneConfig()
if currentZoneConfigWithRaw, err := params.p.Descriptors().GetZoneConfig(
	params.ctx, params.p.Txn(), n.desc.ID,
); err != nil {
	return err
} else if currentZoneConfigWithRaw != nil {
	currentZone = currentZoneConfigWithRaw.ZoneConfigProto()
}

Contributor

I saw this snippet of code appearing multiple times. Do we want to make it a function (to avoid some code repetitions)?
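For reference, the kind of helper being asked about might look roughly like the sketch below. Everything here is hypothetical: `ZoneConfig`, `ZoneConfigWithRaw`, `zoneConfigGetter`, and `currentZoneOrDefault` are simplified stand-ins invented for the example, not the real `zonepb` or descriptor-collection APIs. The author's reply explains why no such helper was added.

```go
package main

// ZoneConfig is a hypothetical, simplified stand-in for zonepb.ZoneConfig.
type ZoneConfig struct {
	NumReplicas int32
}

// ZoneConfigWithRaw is a hypothetical wrapper, loosely modeled on the
// value returned by GetZoneConfig in the quoted snippet.
type ZoneConfigWithRaw struct {
	zc *ZoneConfig
}

// ZoneConfigProto returns the wrapped zone configuration.
func (z *ZoneConfigWithRaw) ZoneConfigProto() *ZoneConfig { return z.zc }

// zoneConfigGetter abstracts the lookup; a nil result with a nil error
// means no zone configuration exists for the descriptor ID.
type zoneConfigGetter func(id uint32) (*ZoneConfigWithRaw, error)

// currentZoneOrDefault returns the stored zone configuration, or an empty
// default when none exists, mirroring the repeated snippet.
func currentZoneOrDefault(get zoneConfigGetter, id uint32) (*ZoneConfig, error) {
	raw, err := get(id)
	if err != nil {
		return nil, err
	}
	if raw == nil {
		return &ZoneConfig{}, nil
	}
	return raw.ZoneConfigProto(), nil
}
```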

Collaborator Author

@rafiss rafiss Nov 7, 2023

Since we are talking about only 5 lines of code, used in 4 places, my preference would be to keep the API simple. IMO the added maintenance overhead of making the API handle this usage, which is very similar to an existing API function, is not worth it. This article elaborates on some reasons why it can be better to add code duplication rather than introducing the wrong (or too many) API abstractions: https://sandimetz.com/blog/2016/1/20/the-wrong-abstraction?duplication

Contributor

@Xiang-Gu Xiang-Gu left a comment

Preemptive commenting/reviewing :) All of this looks good to me; thanks for doing this. Do you plan to add two more tests for the node-restart and restore case?

@rafiss
Collaborator Author

rafiss commented Nov 7, 2023

Do you plan to add two more tests for the node-restart and restore case?

yeah, these tests are on the way. I created this draft PR to see if any existing tests were affected.

@rafiss rafiss force-pushed the fix-add-drop-region-validation branch 4 times, most recently from 0f51b22 to d5336be Compare November 7, 2023 22:52
@rafiss rafiss marked this pull request as ready for review November 8, 2023 05:34
@rafiss rafiss requested review from a team as code owners November 8, 2023 05:34
@rafiss rafiss requested review from DarrylWong, renatolabs, dt and Xiang-Gu and removed request for a team November 8, 2023 05:34
@dt dt requested review from msbutler and removed request for dt November 8, 2023 13:20
Collaborator

@msbutler msbutler left a comment

I only closely reviewed the c2c test. Thank you for doing this!


// As a sanity check, drop a region on the source cluster.
c.SrcTenantSQL.ExecSucceedsSoon(c.T, `ALTER DATABASE many DROP REGION "venus"`)
// THIS TIMES OUT.
Collaborator

I think the last two lines here should be removed. But also, is it unexpected that a user could run this twice?

Collaborator Author

aren't these being executed on different clusters?

will remove the comment

Collaborator

doh. you're right.

Collaborator

i have the memory and reading ability of a goldfish.

}()

// Check how MR primitives have replicated to non-mr stand by cluster
t.Run("mr db only with primary region", func(t *testing.T) {
Collaborator

out of curiosity: were you surprised that this worked before this change?

Collaborator Author

that checks out with my understanding. the validation logic previously was only validating the final state of the zone config, and this test was ending up with a zone config with no regions to validate.

@rafiss rafiss force-pushed the fix-add-drop-region-validation branch from d5336be to ec2e42f Compare November 8, 2023 17:25
Contributor

@Xiang-Gu Xiang-Gu left a comment

Thanks for doing this. I left a comment and another question is "do you think we should also have a test case for the backup/restore case?" Did I miss it somehow?

// This test tests that we can add and drop regions even if the locality flags
// of a node no longer match the regions that already were added to the
// database.
func runMismatchedLocalityTest(ctx context.Context, t test.Test, c cluster.Cluster) {
Contributor

Overall I don't have any objection to how you implemented this function. I'm just curious what you think of a test implementation like the following:

create 3 nodes with locality, "east", "central", "west", respectively

add primary region "east", add region "central", add region "west"

restart all three nodes with a completely different set of localities, "mars", "jupiter", "venus"

drop region "central", drop region "west", drop (primary) region "east"

add primary region "mars", add region "jupiter", add region "venus"

Do you think a test like this is a bit more "stressful"?
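The sequence proposed above could be expressed as an ordered list of SQL statements, roughly as in this sketch. It is an illustration under assumptions, not the roachtest from this PR: the function names and the database name are hypothetical, and the node restart with new locality flags happens outside the SQL (between the initial adds and the drops), so it appears only as a comment.

```go
package main

import "fmt"

// addAll sets the first region as primary and adds the rest.
func addAll(db string, regions []string) []string {
	var stmts []string
	for i, r := range regions {
		if i == 0 {
			stmts = append(stmts, fmt.Sprintf(`ALTER DATABASE %s SET PRIMARY REGION %q`, db, r))
			continue
		}
		stmts = append(stmts, fmt.Sprintf(`ALTER DATABASE %s ADD REGION %q`, db, r))
	}
	return stmts
}

// regionStatements returns the statements for the proposed plan in order.
// oldRegions match the localities the nodes start with; newRegions match
// the localities the nodes are restarted with.
func regionStatements(db string, oldRegions, newRegions []string) []string {
	// Add the original regions while the localities still match.
	stmts := addAll(db, oldRegions)

	// (All nodes are restarted with the new localities at this point.)

	// Drop the non-primary regions first, then the primary region last.
	if len(oldRegions) > 0 {
		for _, r := range oldRegions[1:] {
			stmts = append(stmts, fmt.Sprintf(`ALTER DATABASE %s DROP REGION %q`, db, r))
		}
		stmts = append(stmts, fmt.Sprintf(`ALTER DATABASE %s DROP REGION %q`, db, oldRegions[0]))
	}

	// Add regions that match the new localities.
	return append(stmts, addAll(db, newRegions)...)
}
```

Calling `regionStatements("d", []string{"east", "central", "west"}, []string{"mars", "jupiter", "venus"})` yields nine statements, in the same order as the steps listed above.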

Collaborator Author

that test seems fine with me. i can change it to that

Collaborator Author

@rafiss rafiss left a comment

"do you think we should also have a test case for the backup/restore case?"

I didn't think this was important to include since we have the mismatched locality test, which is a more direct way to test what we wanted

@rafiss rafiss force-pushed the fix-add-drop-region-validation branch from ec2e42f to 21eec89 Compare November 8, 2023 18:26
@rafiss rafiss requested a review from Xiang-Gu November 8, 2023 18:29
Contributor

@Xiang-Gu Xiang-Gu left a comment

LGTM!

@rafiss rafiss force-pushed the fix-add-drop-region-validation branch from 21eec89 to 5a877aa Compare November 8, 2023 19:28
@rafiss rafiss force-pushed the fix-add-drop-region-validation branch from 5a877aa to 4283d2e Compare November 8, 2023 20:34
@rafiss
Collaborator Author

rafiss commented Nov 8, 2023

tftr!

bors r+

@craig
Contributor

craig bot commented Nov 8, 2023

Build succeeded:

@craig craig bot merged commit 8be1f7f into cockroachdb:master Nov 8, 2023
7 of 8 checks passed

blathers-crl bot commented Nov 8, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 4283d2e to blathers/backport-release-22.2-113956: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


error creating merge commit from 4283d2e to blathers/backport-release-23.1-113956: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@rafiss rafiss deleted the fix-add-drop-region-validation branch November 13, 2023 19:43
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.2.x Flags PRs that need to be backported to 23.2.
Projects
None yet
4 participants