support deterministic failover schedule for placement rules #37251

Open
Tracked by #18030
morgo opened this issue Aug 21, 2022 · 18 comments
Labels
type/enhancement The issue or PR belongs to an enhancement.

Comments

@morgo
Contributor

morgo commented Aug 21, 2022

Enhancement

My deployment scenario involves two "primary" regions in AWS:

  • us-east-1
  • us-west-2

I have been experimenting with placement rules with a third region: us-east-2. This region should only be used for quorum, as there are no application servers hosted in it. So I define a placement policy as follows:

CREATE PLACEMENT POLICY `defaultpolicy` PRIMARY_REGION="us-east-1" REGIONS="us-east-1,us-west-2,us-east-2";

Because the pd-server supports a weight concept, when us-east-1 fails I can make the pd-leader election deterministic so that us-west-2 becomes the leader. However, there is no deterministic behavior for where the leaders of regions governed by defaultpolicy will go. They will likely be balanced across us-west-2 and us-east-2, which is not the desired behavior.
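For reference, the pd-leader part can be made deterministic with member leader priorities, something like this (the member names here are placeholders):

pd-ctl member leader_priority pd-us-east-1 5   // highest priority: preferred PD leader
pd-ctl member leader_priority pd-us-west-2 4   // next in line on failover
pd-ctl member leader_priority pd-us-east-2 1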

Ideally I want the priority for the leader to follow the order of the region list. This means that us-west-2 would become the new leader for all regions. Perhaps this could be conveyed with syntax like:

CREATE PLACEMENT POLICY `defaultpolicy` PRIMARY_REGION="us-east-1" REGIONS="us-east-1,us-west-2,us-east-2" SCHEDULE="DETERMINISTIC";

In fact, if this worked deterministically for both leader-scheduling and follower-scheduling, an extension of this is that I could create the following:

CREATE PLACEMENT POLICY `defaultpolicy` PRIMARY_REGION="us-east-1" REGIONS="us-east-1,us-west-2,us-east-2,us-west-1" SCHEDULE="DETERMINISTIC";

Since the default number of followers is 2, it would mean that us-west-1 won't get regions scheduled unless one of the other regions fails, which suits me perfectly. It also means that commit latency is only initially bad when failover to us-west-2 first occurs. Over time, as regions are migrated to us-west-1, performance should be ~restored, since quorum can be achieved on the west coast.

This is a really common deployment pattern in the continental USA, so I'm hoping it can be implemented :-)

@morgo morgo added the type/enhancement label Aug 21, 2022
@morgo
Contributor Author

morgo commented Sep 1, 2022

An alternative to this proposal is to use the leader-weight property that pd can set on stores. But it currently doesn't work as expected:

  1. Assume I have a placement group of PRIMARY_REGION="us-east-1" REGIONS="us-east-1,us-west-2,us-east-2"
  2. I set the leader weight to zero on all stores in us-east-2 (see the pd-ctl sketch after this list).
  3. When us-east-1 fails, leaders are randomly scattered across us-west-2 and us-east-2.
  4. The leader balance scheduler does not apply until the cluster is healthy again, preventing failover.
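A sketch of the pd-ctl commands for step 2 (the store IDs below are placeholders for the us-east-2 stores):

pd-ctl store                 // look up store IDs and their labels first
pd-ctl store weight 4 0 1    // set leader-weight to 0, keep region-weight at 1, for each us-east-2 store
pd-ctl store weight 5 0 1
pd-ctl store weight 6 0 1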

The reason is that in (3) the new leader is chosen by an election within the TiKV Raft group, which has no knowledge of (or concern for) leader-weight. What I would like to suggest is that if a heartbeat is received from a leader on a store with zero leader-weight, a forced leader transfer occurs.

I took a look at a quick hack to do this, but it didn't work :-) I'm hoping someone who knows pd better can help here.
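Roughly, the shape of the check I have in mind (an illustrative sketch only; the types and helpers below are hypothetical, not PD's real API):

package main

// Hypothetical sketch: when a region heartbeat shows the leader sitting on a
// store whose leader-weight is zero, schedule a forced leader transfer to a
// voter on a store with a positive leader-weight.

type Store struct {
    ID           uint64
    LeaderWeight float64
}

type Region struct {
    ID            uint64
    LeaderStoreID uint64
    VoterStoreIDs []uint64
}

type Cluster interface {
    GetStore(id uint64) *Store
    TransferLeader(regionID, toStoreID uint64) // enqueue a transfer-leader operator
}

// onRegionHeartbeat forces a leader transfer when the current leader lives on
// a zero-leader-weight store and a better-weighted voter exists.
func onRegionHeartbeat(c Cluster, r *Region) {
    leaderStore := c.GetStore(r.LeaderStoreID)
    if leaderStore == nil || leaderStore.LeaderWeight > 0 {
        return // leader placement is acceptable
    }
    for _, id := range r.VoterStoreIDs {
        if s := c.GetStore(id); s != nil && s.LeaderWeight > 0 {
            c.TransferLeader(r.ID, id)
            return
        }
    }
    // No eligible target: leave the leader where it is rather than lose availability.
}

func main() {}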

@nolouch
Member

nolouch commented Sep 2, 2022

Hi @morgo, if you set the leader weight to zero, the score calculation becomes count/weight (with the weight treated as 1e-6). The balance-leader scheduler will transfer leaders from us-east-2 to us-west-2 in step 3, but it may not be able to transfer all of them out, because the balance-leader scheduler's goal is only to balance the score.

An alternative method is to use label-property. Rather than relying on the leader score, it will always transfer leaders out of the reject-leader stores to other stores. The operators:

pd-ctl scheduler add label-scheduler                              // activate the label-scheduler
pd-ctl config set label-property reject-leader region us-east-2   // leaders on us-east-2 stores will be transferred to the other regions (e.g. us-west-2) during failover

You can check the implementation in: https://github.com/tikv/pd/blob/master/server/schedulers/label.go#L117-L124
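To revert after failback (a sketch; please double-check the exact syntax against your pd-ctl version):

pd-ctl config delete label-property reject-leader region us-east-2  // stop rejecting leaders on the us-east-2 stores
pd-ctl scheduler remove label-scheduler                             // remove the scheduler added above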

@morgo
Contributor Author

morgo commented Sep 2, 2022

This is great! Thank you @nolouch

@nolouch
Member

nolouch commented Sep 2, 2022

I tested this scenario with the rule and this scheduler, and found that label-scheduler does not work as well as we expect. The log:

[2022/09/02 11:45:59.171 +08:00] [DEBUG] [label.go:139] ["fail to create transfer label reject leader operator"] [error="cannot create operator: target leader is not allowed"]

It shows that PD tried to create an operator but failed. This is caused by the placement rule explicitly specifying that this store should hold followers, so the error is reasonable.

After I changed the policy, it works.

For failover, the placement policy should change from:

CREATE PLACEMENT POLICY primary_east PRIMARY_REGION="us-east-1" REGIONS="us-east-1,us-east-2,us-west-2";

to

CREATE PLACEMENT POLICY primary_east_2 LEADER_CONSTRAINTS="[+region=us-east-1]" FOLLOWER_CONSTRAINTS="{+region=us-east-2: 1,+region=us-west-2: 1}"

You can check the raw rules in PD to see the difference between them.

The raw rules in PD will change from:

 {
    "group_id": "TiDB_DDL_71",
    "id": "table_rule_71_0",
    "index": 40,
    "start_key": "7480000000000000ff4700000000000000f8",
    "end_key": "7480000000000000ff4800000000000000f8",
    "role": "voter",
    "count": 1,
    "label_constraints": [
      {
        "key": "region",
        "op": "in",
        "values": [
          "us-east-1"
        ]
      },
      {
        "key": "engine",
        "op": "notIn",
        "values": [
          "tiflash"
        ]
      }
    ],
    "create_timestamp": 1662089499
  },
  {
    "group_id": "TiDB_DDL_71",
    "id": "table_rule_71_1",
    "index": 40,
    "start_key": "7480000000000000ff4700000000000000f8",
    "end_key": "7480000000000000ff4800000000000000f8",
    "role": "follower",
    "count": 2,
    "label_constraints": [
      {
        "key": "region",
        "op": "in",
        "values": [
          "us-east-2",
          "us-west-1"
        ]
      },
      {
        "key": "engine",
        "op": "notIn",
        "values": [
          "tiflash"
        ]
      }
    ]
  },

to

  {
    "group_id": "TiDB_DDL_71",
    "id": "table_rule_71_0",
    "index": 40,
    "start_key": "7480000000000000ff4700000000000000f8",
    "end_key": "7480000000000000ff4800000000000000f8",
    "role": "leader",
    "count": 1,
    "label_constraints": [
      {
        "key": "region",
        "op": "in",
        "values": [
          "us-east-1"
        ]
      },
      {
        "key": "engine",
        "op": "notIn",
        "values": [
          "tiflash"
        ]
      }
    ],
    "version": 1,
    "create_timestamp": 1662089499
  },
  {
    "group_id": "TiDB_DDL_71",
    "id": "table_rule_71_1",
    "index": 40,
    "start_key": "7480000000000000ff4700000000000000f8",
    "end_key": "7480000000000000ff4800000000000000f8",
    "role": "voter",
    "count": 1,
    "label_constraints": [
      {
        "key": "region",
        "op": "in",
        "values": [
          "us-east-2"
        ]
      },
      {
        "key": "engine",
        "op": "notIn",
        "values": [
          "tiflash"
        ]
      }
    ],
    "version": 1,
    "create_timestamp": 1662089499
  },
  {
    "group_id": "TiDB_DDL_71",
    "id": "table_rule_71_2",
    "index": 40,
    "start_key": "7480000000000000ff4700000000000000f8",
    "end_key": "7480000000000000ff4800000000000000f8",
    "role": "voter",
    "count": 1,
    "label_constraints": [
      {
        "key": "region",
        "op": "in",
        "values": [
          "us-west-1"
        ]
      },
      {
        "key": "engine",
        "op": "notIn",
        "values": [
          "tiflash"
        ]
      }
    ],
    "create_timestamp": 1662092753
  },
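To inspect these raw rules on a running cluster (the group ID TiDB_DDL_71 comes from the output above and will differ per table):

pd-ctl config placement-rules show --group=TiDB_DDL_71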

@nolouch
Member

nolouch commented Sep 2, 2022

Hi, @morgo. An easier way is to just use one policy (works across the 3 regions), like:

CREATE PLACEMENT POLICY  primary_east_backup_west LEADER_CONSTRAINTS="[+region=us-east-1,-region=us-east-2]" FOLLOWER_CONSTRAINTS="{+region=us-east-2: 1,+region=us-west-2: 1}"
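A quick usage sketch (the database and table names here are placeholders):

-- attach the policy to a database (tables created in it inherit the policy) or to individual tables
ALTER DATABASE app PLACEMENT POLICY=primary_east_backup_west;
ALTER TABLE app.orders PLACEMENT POLICY=primary_east_backup_west;
-- inspect the policy definition
SHOW CREATE PLACEMENT POLICY primary_east_backup_west;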

If you want to apply this at the cluster level, I think we can use a raw placement rule, for example:

[
 {
    "group_id": "cluster_rule",
    "id": "cluster_rule_0_primary",
    "index": 500,
    "start_key": "",
    "end_key": "",
    "role": "leader",
    "count": 1,
    "label_constraints": [
      {
        "key": "region",
        "op": "in",
        "values": [
          "us-east-1"
        ]
      },
      {
        "key": "region",
        "op": "notIn",
        "values": [
          "us-east-2"
        ]
      },
      {
        "key": "engine",
        "op": "notIn",
        "values": [
          "tiflash"
        ]
      }
    ]
  },
  {
    "group_id": "cluster_rule",
    "id": "cluster_rule_1_us_west_2",
    "index": 500,
    "start_key": "",
    "end_key": "",
    "role": "voter",
    "count": 1,
    "label_constraints": [
      {
        "key": "region",
        "op": "in",
        "values": [
          "us-west-2"
        ]
      },
      {
        "key": "engine",
        "op": "notIn",
        "values": [
          "tiflash"
        ]
      }
    ]
  },
  {
    "group_id": "cluster_rule",
    "id": "cluster_rule_2_us_east_2",
    "index": 500,
    "start_key": "",
    "end_key": "",
    "role": "follower",
    "count": 1,
    "label_constraints": [
      {
        "key": "region",
        "op": "in",
        "values": [
          "us-east-2"
        ]
      },
      {
        "key": "engine",
        "op": "notIn",
        "values": [
          "tiflash"
        ]
      }
    ]
  }
]

@morgo
Contributor Author

morgo commented Sep 2, 2022

If you want to apply this at the cluster level, I think we can use a raw placement rule, for example:

I'd prefer to keep it in SQL rules, so it's easier for other users on my team to change them if needed. It's okay though; the only other schema I need to change is mysql. This is actually important because SHOW VARIABLES reads from the mysql.tidb table for the GC variables. Since various client libraries run SHOW VARIABLES LIKE 'x' on a new connection, if this table isn't in the primary region it is going to cause performance problems.

It can be done with:

mysql -e "ALTER DATABASE mysql PLACEMENT POLICY=defaultpolicy;"
for TABLE in `mysql mysql -BNe "SHOW TABLES"`; do
  mysql mysql -e "ALTER TABLE $TABLE PLACEMENT POLICY=defaultpolicy;"
done;
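To verify it took effect (a sketch; SHOW PLACEMENT also reports the scheduling state):

SHOW PLACEMENT FOR DATABASE mysql;
-- or list everything with a placement policy attached:
SHOW PLACEMENT;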

@nolouch
Member

nolouch commented Sep 2, 2022

Well, do you think a system level is needed for placement rules in SQL? Would it be more friendly for your scenario?
Actually, if we supported the system level, there would be only one rule in PD. But in the current approach it will create many rules in PD, about 3 raw rules per table, and I worry about the burden of too many rules.

@morgo
Contributor Author

morgo commented Sep 2, 2022

This is essentially this feature request: #29677

There are some strange behaviors that would need to be worked out, but yes: I think a system-level feature has merit.

@nolouch
Member

nolouch commented Sep 9, 2022

Hi, @morgo. An easier way is to just use one policy (works across the 3 regions), like:

CREATE PLACEMENT POLICY primary_east_backup_west LEADER_CONSTRAINTS="[+region=us-east-1,-region=us-east-2]" FOLLOWER_CONSTRAINTS="{+region=us-east-2: 1,+region=us-west-2: 1}"

If you want to apply this at the cluster level, I think we can use a raw placement rule, for example:

@morgo I confirmed that this placement policy cannot achieve automatic switching. The problem is that the placement policy needs to distinguish voter from follower; setting a follower raw rule for us-east-2 means it cannot become the leader.
So if you want to use SQL, you still need to use the label scheduler as described in #37251 (comment).

@nolouch
Member

nolouch commented Sep 9, 2022

BTW, do you need to set the placement policy for the metadata? I think metadata access, like reading schema information, may otherwise cause performance problems.

@nolouch
Member

nolouch commented Sep 14, 2022

BTW, do you need to set the placement policy for the metadata? I think metadata access, like reading schema information, may otherwise cause performance problems.

I really suggest using a cluster-level setting with raw placement rules for this scenario for now; in my tests it hit fewer problems. Use pd-ctl to set rules for the key range from "" to "", like:

// set the rule group  https://docs.pingcap.com/tidb/dev/configure-placement-rules#use-pd-ctl-to-configure-rule-groups
>> pd-ctl config placement-rules rule-group set cluster_rule 2 true
>> cat cluster_rule_group.json
{
    "group_id": "cluster_rule",
    "group_index": 2,
    "group_override": true,
    "rules": [
        {
            "group_id": "cluster_rule",
            "id": "cluster_rule_0_primary_leader",
            "index": 1,
            "start_key": "",
            "end_key": "",
            "role": "leader",
            "count": 1,
            "label_constraints": [
                {
                    "key": "region",
                    "op": "in",
                    "values": [
                        "us-east-1"
                    ]
                }
            ]
        },
        {
            "group_id": "cluster_rule",
            "id": "cluster_rule_1_primary_voter",
            "index": 1,
            "start_key": "",
            "end_key": "",
            "role": "voter",
            "count": 1,
            "label_constraints": [
                {
                    "key": "region",
                    "op": "in",
                    "values": [
                        "us-east-1"
                    ]
                }
            ]
        },
        {
            "group_id": "cluster_rule",
            "id": "cluster_rule_3_us_east_2",
            "index": 1,
            "start_key": "",
            "end_key": "",
            "role": "voter",
            "count": 2,
            "label_constraints": [
                {
                    "key": "region",
                    "op": "in",
                    "values": [
                        "us-east-2"
                    ]
                }
            ]
        },
        {
            "group_id": "cluster_rule",
            "id": "cluster_rule_2_us_west_2",
            "index": 1,
            "start_key": "",
            "end_key": "",
            "role": "follower",
            "count": 1,
            "label_constraints": [
                {
                    "key": "region",
                    "op": "in",
                    "values": [
                        "us-west-2"
                    ]
                }
            ]
        }
    ]
}

// apply the rule for the group https://docs.pingcap.com/tidb/dev/configure-placement-rules#use-pd-ctl-to-batch-update-groups-and-rules-in-groups
>>  pd-ctl config placement-rules rule-bundle set cluster_rule --in="cluster_rule_group.json"
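To verify the bundle was applied (a sketch; see the linked docs for the exact flags in your pd-ctl version):

>> pd-ctl config placement-rules rule-bundle get cluster_rule
>> pd-ctl config placement-rules show --group=cluster_rule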

@tonyxuqqi

I confirmed that this placement policy cannot achieve automatic switching. The problem is that the placement policy needs to distinguish voter from follower; setting a follower raw rule for us-east-2 means it cannot become the leader.

@nolouch Is this follower rule enforced on the PD side or in the TiKV Raft protocol?

@morgo
Contributor Author

morgo commented Sep 14, 2022

@nolouch Happy to try with one policy. I'm getting an error with what you pasted above though :(

$ ./bin/pd-ctl config placement-rules rule-bundle set cluster_rule --in="cluster_rule_group.json"
json: cannot unmarshal array into Go value of type struct { GroupID string "json:\"group_id\"" }

Using pd-ctl from v6.2.0.

@nolouch
Member

nolouch commented Sep 14, 2022

@morgo Sorry, I updated the comment in #37251 (comment). You can try again.

@kolbe
Contributor

kolbe commented Sep 14, 2022

To be used in a Kubernetes environment (until pingcap/tidb-operator#4678 is implemented), "region" should be changed to "topology.kubernetes.io/region".
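For example, the "region" constraints in the earlier JSON would become something like this (a sketch, assuming tidb-operator's default region label):

{
  "key": "topology.kubernetes.io/region",
  "op": "in",
  "values": [
    "us-east-1"
  ]
}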

@SunRunAway
Contributor

@morgo
Since you never read from us-east-2, is it better to set us-east-2 as a witness, if possible?

@kolbe
Contributor

kolbe commented Sep 16, 2022

@SunRunAway what is "witness"? This is not mentioned anywhere in our documentation.

@SunRunAway
Contributor

@kolbe I'm discussing a feature that is under development; see tikv/tikv#12876
