Data may be stored on only a single server shortly after startup #10657

Open
aphyr opened this issue May 30, 2019 · 5 comments

aphyr commented May 30, 2019

TiDB 3.0.0-rc.2, by design, starts up with a single region that has only one replica, regardless of the configured target number of replicas. PD then gradually adds replicas until the target replica count is reached. In addition, any regions split from this initial region start with the same number of replicas as the region they were split from, until PD can expand them.

This is not a problem in itself, but it does lead to an awkward possibility: in the early stages of a TiDB cluster, data may be acknowledged, but stored only on a single node, when the user expected that data to be replicated to multiple nodes. A single-node failure during that period could destroy acknowledged writes, or render the cluster partly, or totally, unusable. In our tests with Jepsen (which use small regions and are extra-sensitive to this phenomenon), a single-node network partition as late as 500 seconds into the test can result in a total outage, because some regions are only replicated to 1 or 2, rather than 3, nodes.

The configuration parameter for replica count is called max-replicas, which is sort of an odd name, because regions could have exactly max-replicas (the happy case), fewer than max-replicas (e.g. PD hasn't gotten around to resizing that region yet), or more than max-replicas (e.g. during handoff when a node is declared dead). It might be best to call this target-replicas?

I'd also like to suggest that when a cluster has a configured replica count, TiDB should disallow transactions on regions which don't have at least that many replicas in their Raft group. That'd prevent the possibility of a single-node failure destroying committed data, which is something I'm pretty sure users don't expect to be possible!
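For illustration, here's a rough sketch in Go of the kind of guard this suggests. The Region type, checkReplication helper, and field names are hypothetical stand-ins, not TiDB internals.

package main

import (
	"errors"
	"fmt"
)

// Region is a simplified stand-in for a Raft region as seen by the store.
type Region struct {
	ID     uint64
	Voters int // non-learner peers currently in the Raft group
}

var errUnderReplicated = errors.New("region is under-replicated; refusing transaction")

// checkReplication is the hypothetical pre-commit check suggested above:
// refuse transactions that touch a region with fewer voters than the
// configured target replica count.
func checkReplication(r Region, targetReplicas int) error {
	if r.Voters < targetReplicas {
		return fmt.Errorf("%w: region %d has %d/%d voters",
			errUnderReplicated, r.ID, r.Voters, targetReplicas)
	}
	return nil
}

func main() {
	fresh := Region{ID: 4, Voters: 1} // a freshly bootstrapped region
	if err := checkReplication(fresh, 3); err != nil {
		fmt.Println(err)
	}
}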

@aphyr aphyr changed the title TiDB data may be stored on only a single server Data may be stored on only a single server shortly after startup May 30, 2019
@gregwebs gregwebs added type/bug The issue is confirmed as a bug. and removed type/suggestion labels May 31, 2019

aphyr commented May 31, 2019

I've been experimenting with different settings to see how long it takes for all regions to be fully replicated. With the settings we were using (region-max-keys = 5 & region-split-keys = 2), it could take over an hour for the cluster to converge. While increasing the region size reduces the number of regions, and generally speeds up convergence, it still looks like TiDB can take 80+ minutes to fully replicate regions with region-max-keys = 500 and region-split-keys = 400.

In this case, it looks like most regions got to 3 or 2 replicas early in the test, but for some reason new regions keep getting created and old ones destroyed, while no progress is being made on the original under-replicated regions. I've set up jepsen 0268340 to log region IDs and replica counts: in an hour and 20 minutes, we go from

2019-05-31 12:19:50,727{GMT}	INFO	[jepsen node n1] tidb.db: :region-replicas {4 1,
 16 3,
 18 3,
 20 3,
 22 3,
 24 3,
 28 3,
 30 3,
 32 3,
 34 3,
 36 3,
 38 2,
 40 3,
 42 2,
 44 2,
 46 1,
 48 1,
 208 3}

to

INFO [2019-05-31 13:40:03,747] jepsen node n2 - tidb.db :region-replicas {4 2,
 16 3,
 18 3,
 20 3,
 22 3,
 24 3,
 28 3,
 30 3,
 32 3,
 34 3,
 36 3,
 38 2,
 40 3,
 42 2,
 44 2,
 46 2,
 48 2,
 4055 3}

That final region, id=4055, gets replaced by a new region with a higher ID every few seconds. I'm not sure why this is the case--we're not actually making any writes, or even connecting clients to this cluster at this point. It looks pretty well stuck. Full logs are here: 20190531T121838.000-0400.zip

With Jepsen 0268340, try something like this to reproduce. It may take a few runs--it doesn't seem to get stuck every time.

lein run test -w append --time-limit 30 --concurrency 2n --version 3.0.0-rc.2 --nemesis partition-pd-leader --nemesis-interval 10 --nemesis-schedule fixed --test-count 30

siddontang (Member) commented

PTAL @nolouch


nolouch commented Jun 3, 2019

@aphyr How do you check the bootstrap step? You report that it takes 80+ minutes, but from the logs it looks like it only took about one minute; for example, region 46:

[2019/05/31 09:19:49.294 -07:00] [INFO] [operator_controller.go:215] ["add operator"] [region-id=46] [operator="\"make-up-replica (kind:region,replica, region:46(17,1), createAt:2019-05-31 09:19:49.29461602 -0700 PDT m=+62.228962485, cu
[2019/05/31 09:19:49.294 -07:00] [INFO] [operator_controller.go:317] ["send schedule command"] [region-id=46] [step="add learner peer 214 on store 9"] [source=create]
[2019/05/31 09:19:49.308 -07:00] [INFO] [cluster_info.go:546] ["region ConfVer changed"] [region-id=46] [detail="Add peer:{id:214 store_id:9 is_learner:true }"] [old-confver=1] [new-confver=2]
[2019/05/31 09:19:49.308 -07:00] [INFO] [operator_controller.go:317] ["send schedule command"] [region-id=46] [step="add learner peer 214 on store 9"] [source=heartbeat]
[2019/05/31 09:19:53.328 -07:00] [INFO] [operator_controller.go:317] ["send schedule command"] [region-id=46] [step="add learner peer 214 on store 9"] [source=heartbeat]
[2019/05/31 09:19:53.396 -07:00] [INFO] [operator_controller.go:317] ["send schedule command"] [region-id=46] [step="promote learner peer 214 on store 9 to voter"] [source=heartbeat]
[2019/05/31 09:19:53.413 -07:00] [INFO] [cluster_info.go:546] ["region ConfVer changed"] [region-id=46] [detail="Remove peer:{id:214 store_id:9 is_learner:true },Add peer:{id:214 store_id:9 }"] [old-confver=2] [new-confver=3]
[2019/05/31 09:19:53.413 -07:00] [INFO] [operator_controller.go:93] ["operator finish"] [region-id=46] [operator="\"make-up-replica (kind:region,replica, region:46(17,1), createAt:2019-05-31 09:19:49.29461602 -0700 PDT m=+62.228962485,

You can grep for make-up-replica and find that the last such log entry is at 2019/05/31 09:20:01, while the PD start time is:

[2019/05/31 09:18:47.077 -07:00] [INFO] [util.go:59] ["Welcome to Placement Driver (PD)"]

A better way to bootstrap the cluster is to wait for the first region to bring its replica count up to the target before starting TiDB, because TiDB will actively split regions by creating some system tables even without any writes.
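A minimal sketch of such a wait loop in Go: poll PD's region listing and block until every region has at least the target number of non-learner peers before starting TiDB. The /pd/api/v1/regions path and the JSON field names (regions, peers, is_learner) are assumptions based on PD 3.0's HTTP API and may need adjusting for other versions.

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type peer struct {
	IsLearner bool `json:"is_learner"`
}

type region struct {
	ID    uint64 `json:"id"`
	Peers []peer `json:"peers"`
}

type regionsResp struct {
	Regions []region `json:"regions"`
}

// fullyReplicated reports whether every region has at least target voters
// (peers not flagged as learners).
func fullyReplicated(pdAddr string, target int) (bool, error) {
	resp, err := http.Get(pdAddr + "/pd/api/v1/regions")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var rr regionsResp
	if err := json.NewDecoder(resp.Body).Decode(&rr); err != nil {
		return false, err
	}
	for _, r := range rr.Regions {
		voters := 0
		for _, p := range r.Peers {
			if !p.IsLearner {
				voters++
			}
		}
		if voters < target {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	// Poll PD until all regions reach 3 voters, then it is safe to start TiDB.
	for {
		ok, err := fullyReplicated("http://127.0.0.1:2379", 3)
		if err == nil && ok {
			break
		}
		time.Sleep(time.Second)
	}
	fmt.Println("all regions fully replicated; safe to start TiDB")
}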


aphyr commented Jun 3, 2019

How do you check the bootstrap step?

We check for bootstrapping by performing an HTTP GET of PD's /regions API after starting PD, TiKV, and TiDB, and checking to ensure that all regions have at least 3 members which are not flagged with is_learner. See db.clj for details.

from the logs it looks like it only took about one minute; for example, region 46

Region 46 starting quickly is great, but I'm concerned that the highest-numbered region never seems to stabilize--do you know what might be going on there?

A better way to bootstrap the cluster is to wait for the first region to bring its replica count up to the target before starting TiDB, because TiDB will actively split regions by creating some system tables even without any writes.

OK! I'll rewrite the setup code to block before starting TiDB. Do you think TiDB running could be preventing the final region from converging?


nolouch commented Jun 5, 2019

The final region always covers the last range [last_key, ""). Whenever a new table is created, or a record is inserted whose primary key is larger than any existing key in the last-created table, the encoded key must be larger than last_key, so the range splits into two ranges, [last_key, new_encode_key) and [new_encode_key, ""), and the ID of the final region is renewed.
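As a toy illustration of that behaviour (the key names and ID allocation below are made up, not PD's actual logic): every key that sorts after last_key splits the open-ended final range and mints a new final-region ID.

package main

import "fmt"

// span is a toy stand-in for a region's key range; end == "" means "to +inf".
type span struct {
	id         uint64
	start, end string
}

var nextID uint64 = 4055 // illustrative starting ID, echoing the logs above

// splitLast splits the open-ended final region at newKey. The right-hand half
// keeps the open end and gets a fresh ID, which is why the "last" region's ID
// keeps changing.
func splitLast(last span, newKey string) (left, right span) {
	left = span{id: last.id, start: last.start, end: newKey}
	nextID++
	right = span{id: nextID, start: newKey, end: ""}
	return left, right
}

func main() {
	last := span{id: nextID, start: "t_100_r_5", end: ""}
	// e.g. a row with a bigger primary key, then a brand-new table
	for _, k := range []string{"t_100_r_9", "t_101_r_1"} {
		var left span
		left, last = splitLast(last, k)
		fmt.Printf("split: [%q, %q) id=%d | [%q, +inf) id=%d\n",
			left.start, left.end, left.id, last.start, last.id)
	}
}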

@jebter jebter added the sig/transaction SIG:Transaction label Nov 11, 2020
@cfzjywxk cfzjywxk added status/discussion and removed severity/major sig/transaction SIG:Transaction type/bug The issue is confirmed as a bug. labels Dec 28, 2020