ADD/DROP REGION stuck if node restarts (or is restored) with different --locality flag #113324

Closed
Xiang-Gu opened this issue Oct 30, 2023 · 3 comments · Fixed by #113956
Labels
A-multiregion (Related to multi-region), C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.), T-sql-foundations (SQL Foundations Team, formerly SQL Schema + SQL Sessions)

Xiang-Gu commented Oct 30, 2023

Suppose we have a multi-region (MR) cluster in which one of the databases, db, has oldRegion set as its primary region. What happens if we stop the node(s) and restart them with a different --locality flag?

We should allow this. After the restart, the zone configuration of db still has its old settings, including constraints=[region=oldRegion]. That leaves an unfulfillable allocation constraint, and the allocator falls back to placing replicas wherever it can. However, if we then try to add one of the new regions to db, the ADD REGION gets stuck because validation logic complains about the unfulfillable constraint, with an error like:

non-cancelable: constraint \"+region=aws-us-east-2\" matches no existing nodes within the cluster - did you enter it correctly?

This is a regression compared to v21.x or whenever this worked.
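
To make the failure mode concrete, here is a minimal sketch of why the old constraint matches nothing after the restart. It is not CockroachDB's code: `Node`, `parseLocality`, and `matchesAnyNode` are hypothetical names, and the matching rule (a `+key=value` constraint is satisfied when some node has that exact locality tier) is a simplifying assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical, simplified view of one node's --locality flag.
type Node struct {
	Locality map[string]string
}

// parseLocality splits a --locality value such as
// "region=us-west-1,zone=us-west-1b" into its key/value tiers.
func parseLocality(s string) map[string]string {
	tiers := make(map[string]string)
	for _, tier := range strings.Split(s, ",") {
		if k, v, ok := strings.Cut(tier, "="); ok {
			tiers[k] = v
		}
	}
	return tiers
}

// matchesAnyNode reports whether a required constraint such as
// "+region=us-east-1" is satisfied by at least one node's locality.
// (Assumed matching rule, for illustration only.)
func matchesAnyNode(constraint string, nodes []Node) bool {
	k, v, ok := strings.Cut(strings.TrimPrefix(constraint, "+"), "=")
	if !ok {
		return false
	}
	for _, n := range nodes {
		if got, found := n.Locality[k]; found && got == v {
			return true
		}
	}
	return false
}

func main() {
	// The single node was restarted with a new --locality flag.
	nodes := []Node{{Locality: parseLocality("region=us-west-1,zone=us-west-1b")}}

	// The zone configuration still carries the old region's constraint.
	for _, c := range []string{"+region=us-east-1", "+region=us-west-1"} {
		fmt.Printf("%s matches a node: %t\n", c, matchesAnyNode(c, nodes))
	}
}
```

In this model the stale `+region=us-east-1` constraint matches no node, which is the condition the error message reports.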

One solution is to relax the validation so that it applies only to the regions being added, not to existing ones, since in this scenario it is possible (and fine) that existing regions no longer match any node.

We also run into exactly the same scenario (and problems) when restoring into a cluster whose regions differ from those in place when the backup was taken.
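
The proposed relaxation could look roughly like the sketch below. This is an illustration under stated assumptions, not the change that was eventually merged: `validateRegions` and its parameters are hypothetical, and the cluster's regions are modeled as a plain set of names.

```go
package main

import (
	"fmt"
	"sort"
)

// validateRegions is a hypothetical sketch of the relaxed check. Only the
// regions being added must exist in the cluster. Regions already on the
// database (existing) are deliberately not re-validated, because their nodes
// may have been restarted or restored with a different locality.
func validateRegions(existing, adding []string, clusterRegions map[string]bool) error {
	var missing []string
	for _, r := range adding {
		if !clusterRegions[r] {
			missing = append(missing, r)
		}
	}
	if len(missing) > 0 {
		sort.Strings(missing)
		return fmt.Errorf("regions %v match no existing nodes within the cluster", missing)
	}
	return nil
}

func main() {
	// After the restart, only us-west-1 exists in the cluster.
	cluster := map[string]bool{"us-west-1": true}
	existing := []string{"us-east-1"} // stale region still on the database

	// Adding a region that exists succeeds despite the stale one.
	fmt.Println(validateRegions(existing, []string{"us-west-1"}, cluster))

	// Adding a region that does not exist is still rejected.
	fmt.Println(validateRegions(existing, []string{"eu-west-1"}, cluster))
}
```

The point of the design is that a typo in a newly requested region is still caught, while a stale region no longer blocks the schema change.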

Discovered from this escalation, in which we included one workaround.
See slack thread https://cockroachlabs.slack.com/archives/C2C5FKPPB/p1698680009075359 for more details.

Jira issue: CRDB-32870

Epic CRDB-33032

@Xiang-Gu Xiang-Gu added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Oct 30, 2023
@Xiang-Gu Xiang-Gu changed the title ADD REGION stuck if node restarts (or is restored) with different --locality flag ADD REGION stuck if node restarts (or is restored) with different --locality flag Oct 30, 2023
@rafiss rafiss added the A-multiregion Related to multi-region label Oct 31, 2023
rafiss commented Oct 31, 2023

@Xiang-Gu can you provide step-by-step instructions on how to reproduce this? Then we can determine exactly when the regression occurred, and assign this to a different team if we need to.

@rafiss rafiss changed the title ADD REGION stuck if node restarts (or is restored) with different --locality flag ADD/DROP REGION stuck if node restarts (or is restored) with different --locality flag Nov 2, 2023
Xiang-Gu commented Nov 2, 2023

Definitely!

Terminal 1:
$ cockroach start-single-node --locality=region=us-east-1,zone=us-east-1b --insecure

Terminal 2:
$ cockroach sql --url='the-connection-string-from-terminal-1'

SET CLUSTER SETTING cluster.organization = 'Cockroach Labs - Production Testing';
SET CLUSTER SETTING enterprise.license = 'find-your-enterprise-license';
ALTER DATABASE defaultdb PRIMARY REGION 'us-east-1';
\q

Terminal 1:
Press Ctrl+C to kill the cockroach process, then:
$ cockroach start-single-node --locality=region=us-west-1,zone=us-west-1b --insecure   # restart it with a different region

Terminal 2:
$ cockroach sql --url='the-connection-string-from-terminal-1'

ALTER DATABASE defaultdb ADD REGION 'us-west-1';     -- stuck; if you inspect the job, it fails with the error above (in the issue description) and retries indefinitely

rafiss commented Nov 2, 2023

Sharing a similar example from @smcvey

1. Have three nodes, with localities set to region=west, region=central and region=east respectively.
2. Make a database: CREATE DATABASE db PRIMARY REGION "central";
3. Add the other regions: ALTER DATABASE db ADD REGION "west"; ALTER DATABASE db ADD REGION "east";
4. In my test, I also created two tables, one regional by row, one the default regional by table, but this step may not be necessary.
5. Remove the --locality flags from each node and restart them all.
6. Attempt to drop a region: ALTER DATABASE db DROP REGION "west"; <--- This hangs forever
