BR suffers a 15x performance regression when single TiKV node down #42973

YuJuncen · 2023-04-12T06:52:19Z

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Create a cluster with 4 or more TiKV nodes.
Shut down one TiKV node.
Execute the backup.

2. What did you expect to see? (Required)

The backup should be slightly slower than backing up a healthy cluster.

3. What did you see instead (Required)

The backup speed is about 15x slower than backing up a healthy cluster. (4 mins vs 1 hour)

4. What is your TiDB version? (Required)

Near master, but this problem is not strongly tied to any particular version.

YuJuncen · 2023-04-12T10:05:34Z

The reason is:

When we dial a host that is down, the dial takes a long time to fail.

BR should skip such unreachable stores, and it does implement this:

tidb/br/pkg/backup/push.go

Lines 80 to 82 in 0548d61

    
    if s.GetState() != metapb.StoreState_Up {
        logutil.CL(lctx).Warn("skip store", zap.Stringer("State", s.GetState()))
        continue

But in our case, the store state returned to BR is still Up even though pd-ctl reports Down.
Consequently, backing up any range means waiting about 30s (the dial timeout) for the connection to that store to fail. These waits cannot overlap either, because StoreManager serializes them under mutual exclusion.
So if a store is down, each object may take at least 1 min to back up. My TPCC workload has 116 objects (ranges), so a full backup of the cluster takes about 1 hour.

YuJuncen · 2023-04-12T10:14:03Z

BTW, the current implementation jumps directly to the fine-grained backup when any store is unreachable:

tidb/br/pkg/backup/push.go

Lines 85 to 90 in 0548d61

    
    if err != nil {
        // BR should be able to backup even some of stores disconnected.
        // The regions managed by this store can be retried at fine-grained backup then.
        logutil.CL(lctx).Warn("fail to connect store, skipping", zap.Error(err))
        return nil
    }

This may slow the backup down further in this scenario.

YuJuncen · 2023-04-12T10:15:11Z

There isn't a trivial fix for this problem. I think we must first figure out why PD returns the wrong store state (note that StoreState_Up is the default value for StoreState).
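The point about the default value can be illustrated with a small sketch. The types below are local stand-ins written for this example, not the generated metapb code; the only property that matters is that Up is the zero value of the enum.

```go
package main

import "fmt"

// StoreState is a hypothetical stand-in for the protobuf enum.
type StoreState int32

const (
	StoreStateUp        StoreState = iota // 0: the zero value, so also the default
	StoreStateOffline                     // 1
	StoreStateTombstone                   // 2
)

// Store is a hypothetical stand-in for the store metadata message.
type Store struct {
	ID    uint64
	State StoreState
}

func main() {
	var s Store // State was never filled in by anyone
	// An unset state is indistinguishable from an explicit Up.
	fmt.Println(s.State == StoreStateUp)
}
```

So a check of the form `state != Up` cannot tell "the store is up" apart from "nobody set the state", which is why it never skips the down store here.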

tonyxuqqi · 2023-04-17T06:47:30Z

StoreState_Up

@nolouch, please take a look. The question is: what is the best way to tell whether a store is healthy? StoreState_Up is apparently not always accurate, since a store may take 30 minutes to become StoreState_Down.

bufferflies · 2023-04-17T07:00:32Z

BR uses the PD client to get store information over gRPC, whereas pd-ctl uses the HTTP interface. In the HTTP handler, PD checks whether the store is down:
https://github.com/tikv/pd/blob/c40e319f50822678cda71ae62ee2fd70a9cac010/server/api/store.go#L144-L151
The gRPC handler does not contain this check:
https://github.com/tikv/pd/blob/c40e319f50822678cda71ae62ee2fd70a9cac010/server/api/store.go#L144-L151

YuJuncen added the type/bug label Apr 12, 2023

jebter added the component/br and severity/critical labels Apr 17, 2023

ti-chi-bot added the may-affects-5.1, may-affects-5.2, may-affects-5.3, may-affects-5.4, may-affects-6.1 and may-affects-6.5 labels Apr 17, 2023

YuJuncen mentioned this issue Apr 17, 2023

backup: check the store state by last heartbeat #43099

Merged

1 task

ti-chi-bot closed this as completed in #43099 Apr 19, 2023

ti-chi-bot pushed a commit that referenced this issue Apr 19, 2023

backup: check the store state by last heartbeat (#43099)

f22ae5f

close #42973

This was referenced Apr 19, 2023

backup: check the store state by last heartbeat (#43099) #43213

Closed

backup: check the store state by last heartbeat (#43099) #43215

Open

backup: check the store state by last heartbeat (#43099) #43216

Merged

ti-chi-bot mentioned this issue Apr 19, 2023

backup: check the store state by last heartbeat (#43099) #43217

Merged

1 task

YuJuncen added affects-7.1 and removed affects-7.1 labels May 8, 2023

ti-chi-bot bot pushed a commit that referenced this issue May 23, 2023

backup: check the store state by last heartbeat (#43099) (#43217)

e2bee76

close #42973

ti-chi-bot bot pushed a commit that referenced this issue Jun 30, 2023

backup: check the store state by last heartbeat (#43099) (#43216)

75c751b

close #42973
