Data may be stored on only a single server shortly after startup #10657

Open
aphyr opened this issue May 30, 2019 · 5 comments

aphyr commented May 30, 2019

TiDB 3.0.0-rc.2, by design, starts up with a single region that has only one replica, regardless of the configured target number of replicas. PD then gradually adds replicas until the target replica count is reached. In addition, any regions split from this initial region start with the same number of replicas as the region they were split from, until PD can expand them.

This is not a problem in itself, but it does lead to an awkward possibility: in the early stages of a TiDB cluster, data may be acknowledged, but stored only on a single node, when the user expected that data to be replicated to multiple nodes. A single-node failure during that period could destroy acknowledged writes, or render the cluster partly, or totally, unusable. In our tests with Jepsen (which use small regions and are extra-sensitive to this phenomenon), a single-node network partition as late as 500 seconds into the test can result in a total outage, because some regions are only replicated to 1 or 2, rather than 3, nodes.

The configuration parameter for replica count is called max-replicas, which is sort of an odd name, because regions could have exactly max-replicas (the happy case), fewer than max-replicas (e.g. PD hasn't gotten around to resizing that region yet), or more than max-replicas (e.g. during handoff when a node is declared dead). It might be best to call this target-replicas?

I'd also like to suggest that when a cluster has a configured replica count, TiDB should disallow transactions on regions which don't have at least that many replicas in their Raft group. That'd prevent the possibility of a single-node failure destroying committed data, which is something I'm pretty sure users don't expect to be possible!
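For illustration, here's a rough sketch in Go of the kind of guard this suggests. The Region type, checkReplication helper, and field names are hypothetical stand-ins, not TiDB internals.

package main

import (
	"errors"
	"fmt"
)

// Region is a simplified stand-in for a Raft region as seen by the store.
type Region struct {
	ID     uint64
	Voters int // non-learner peers currently in the Raft group
}

var errUnderReplicated = errors.New("region is under-replicated; refusing transaction")

// checkReplication is the hypothetical pre-commit check suggested above:
// refuse transactions that touch a region with fewer voters than the
// configured target replica count.
func checkReplication(r Region, targetReplicas int) error {
	if r.Voters < targetReplicas {
		return fmt.Errorf("%w: region %d has %d/%d voters",
			errUnderReplicated, r.ID, r.Voters, targetReplicas)
	}
	return nil
}

func main() {
	fresh := Region{ID: 4, Voters: 1} // a freshly bootstrapped region
	if err := checkReplication(fresh, 3); err != nil {
		fmt.Println(err)
	}
}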

@aphyr aphyr changed the title TiDB data may be stored on only a single server Data may be stored on only a single server shortly after startup May 30, 2019
@gregwebs gregwebs added type/bug The issue is confirmed as a bug. and removed type/suggestion labels May 31, 2019

aphyr commented May 31, 2019

I've been experimenting with different settings to see how long it takes for all regions to be fully replicated. With the settings we were using (region-max-keys = 5 & region-split-keys = 2), it could take over an hour for the cluster to converge. While increasing the region size reduces the number of regions, and generally speeds up convergence, it still looks like TiDB can take 80+ minutes to fully replicate regions with region-max-keys = 500 and region-split-keys = 400.

In this case, it looks like most regions got to 3 or 2 replicas early in the test, but for some reason new regions keep getting created and old ones destroyed, while no progress is being made on the original under-replicated regions. I've set up jepsen 0268340 to log region IDs and replica counts: in an hour and 20 minutes, we go from

2019-05-31 12:19:50,727{GMT}	INFO	[jepsen node n1] tidb.db: :region-replicas {4 1,
 16 3,
 18 3,
 20 3,
 22 3,
 24 3,
 28 3,
 30 3,
 32 3,
 34 3,
 36 3,
 38 2,
 40 3,
 42 2,
 44 2,
 46 1,
 48 1,
 208 3}

to

INFO [2019-05-31 13:40:03,747] jepsen node n2 - tidb.db :region-replicas {4 2,
 16 3,
 18 3,
 20 3,
 22 3,
 24 3,
 28 3,
 30 3,
 32 3,
 34 3,
 36 3,
 38 2,
 40 3,
 42 2,
 44 2,
 46 2,
 48 2,
 4055 3}

That final region, id=4055, gets replaced by a new region with a higher ID every few seconds. I'm not sure why this is the case--we're not actually making any writes, or even connecting clients to this cluster at this point. It looks pretty well stuck. Full logs are here: 20190531T121838.000-0400.zip

With Jepsen 0268340, try something like this to reproduce. It may take a few runs--it doesn't seem to get stuck every time.

lein run test -w append --time-limit 30 --concurrency 2n --version 3.0.0-rc.2 --nemesis partition-pd-leader --nemesis-interval 10 --nemesis-schedule fixed --test-count 30

siddontang (Member) commented

PTAL @nolouch


nolouch commented Jun 3, 2019

@aphyr How do you check the bootstrap step? You report that it takes 80+ minutes, but from the logs it looks like it only took about one minute; for example, region 46:

[2019/05/31 09:19:49.294 -07:00] [INFO] [operator_controller.go:215] ["add operator"] [region-id=46] [operator="\"make-up-replica (kind:region,replica, region:46(17,1), createAt:2019-05-31 09:19:49.29461602 -0700 PDT m=+62.228962485, cu
[2019/05/31 09:19:49.294 -07:00] [INFO] [operator_controller.go:317] ["send schedule command"] [region-id=46] [step="add learner peer 214 on store 9"] [source=create]
[2019/05/31 09:19:49.308 -07:00] [INFO] [cluster_info.go:546] ["region ConfVer changed"] [region-id=46] [detail="Add peer:{id:214 store_id:9 is_learner:true }"] [old-confver=1] [new-confver=2]
[2019/05/31 09:19:49.308 -07:00] [INFO] [operator_controller.go:317] ["send schedule command"] [region-id=46] [step="add learner peer 214 on store 9"] [source=heartbeat]
[2019/05/31 09:19:53.328 -07:00] [INFO] [operator_controller.go:317] ["send schedule command"] [region-id=46] [step="add learner peer 214 on store 9"] [source=heartbeat]
[2019/05/31 09:19:53.396 -07:00] [INFO] [operator_controller.go:317] ["send schedule command"] [region-id=46] [step="promote learner peer 214 on store 9 to voter"] [source=heartbeat]
[2019/05/31 09:19:53.413 -07:00] [INFO] [cluster_info.go:546] ["region ConfVer changed"] [region-id=46] [detail="Remove peer:{id:214 store_id:9 is_learner:true },Add peer:{id:214 store_id:9 }"] [old-confver=2] [new-confver=3]
[2019/05/31 09:19:53.413 -07:00] [INFO] [operator_controller.go:93] ["operator finish"] [region-id=46] [operator="\"make-up-replica (kind:region,replica, region:46(17,1), createAt:2019-05-31 09:19:49.29461602 -0700 PDT m=+62.228962485,

You can grep for make-up-replica and find that the last such log entry is at 2019/05/31 09:20:01, while the PD start time is:

[2019/05/31 09:18:47.077 -07:00] [INFO] [util.go:59] ["Welcome to Placement Driver (PD)"]

A better way to bootstrap the cluster is to wait for the first region to bring its replica count up to the target before starting TiDB, because TiDB will actively split regions by creating some system tables even without any writes.
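A minimal sketch of such a wait loop in Go: poll PD's region listing and block until every region has at least the target number of non-learner peers before starting TiDB. The /pd/api/v1/regions path and the JSON field names (regions, peers, is_learner) are assumptions based on PD 3.0's HTTP API and may need adjusting for other versions.

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type peer struct {
	IsLearner bool `json:"is_learner"`
}

type region struct {
	ID    uint64 `json:"id"`
	Peers []peer `json:"peers"`
}

type regionsResp struct {
	Regions []region `json:"regions"`
}

// fullyReplicated reports whether every region has at least target voters
// (peers not flagged as learners).
func fullyReplicated(pdAddr string, target int) (bool, error) {
	resp, err := http.Get(pdAddr + "/pd/api/v1/regions")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var rr regionsResp
	if err := json.NewDecoder(resp.Body).Decode(&rr); err != nil {
		return false, err
	}
	for _, r := range rr.Regions {
		voters := 0
		for _, p := range r.Peers {
			if !p.IsLearner {
				voters++
			}
		}
		if voters < target {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	// Poll PD until all regions reach 3 voters, then it is safe to start TiDB.
	for {
		ok, err := fullyReplicated("http://127.0.0.1:2379", 3)
		if err == nil && ok {
			break
		}
		time.Sleep(time.Second)
	}
	fmt.Println("all regions fully replicated; safe to start TiDB")
}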


aphyr commented Jun 3, 2019

How do you check the bootstrap step?

We check for bootstrapping by performing an HTTP GET of PD's /regions API after starting PD, TiKV, and TiDB, and checking to ensure that all regions have at least 3 members which are not flagged with is_learner. See db.clj for details.

from the logs it looks like it only took about one minute; for example, region 46

Region 46 starting quickly is great, but I'm concerned that the highest-numbered region never seems to stabilize--do you know what might be going on there?

A better way to bootstrap the cluster is to wait for the first region to bring its replica count up to the target before starting TiDB, because TiDB will actively split regions by creating some system tables even without any writes.

OK! I'll rewrite the setup code to block before starting TiDB. Do you think TiDB running could be preventing the final region from converging?


nolouch commented Jun 5, 2019

The final region always covers the last range [last_key, ""). Whenever a new table is created, or a record is inserted whose primary key is larger than any existing key in the last-created table, the encoded key must be larger than last_key, so the range splits into two ranges, [last_key, new_encode_key) and [new_encode_key, ""), and the ID of the final region is renewed.
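As a toy illustration of that behaviour (the key names and ID allocation below are made up, not PD's actual logic): every key that sorts after last_key splits the open-ended final range and mints a new final-region ID.

package main

import "fmt"

// span is a toy stand-in for a region's key range; end == "" means "to +inf".
type span struct {
	id         uint64
	start, end string
}

var nextID uint64 = 4055 // illustrative starting ID, echoing the logs above

// splitLast splits the open-ended final region at newKey. The right-hand half
// keeps the open end and gets a fresh ID, which is why the "last" region's ID
// keeps changing.
func splitLast(last span, newKey string) (left, right span) {
	left = span{id: last.id, start: last.start, end: newKey}
	nextID++
	right = span{id: nextID, start: newKey, end: ""}
	return left, right
}

func main() {
	last := span{id: nextID, start: "t_100_r_5", end: ""}
	// e.g. a row with a bigger primary key, then a brand-new table
	for _, k := range []string{"t_100_r_9", "t_101_r_1"} {
		var left span
		left, last = splitLast(last, k)
		fmt.Printf("split: [%q, %q) id=%d | [%q, +inf) id=%d\n",
			left.start, left.end, left.id, last.start, last.id)
	}
}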

@jebter jebter added the sig/transaction SIG:Transaction label Nov 11, 2020
@cfzjywxk cfzjywxk added status/discussion and removed severity/major sig/transaction SIG:Transaction type/bug The issue is confirmed as a bug. labels Dec 28, 2020