Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

document and improve HA algorithm #670

Merged
merged 4 commits into from
Jul 29, 2019
Merged

document and improve HA algorithm #670

merged 4 commits into from
Jul 29, 2019

Conversation

cofyc
Copy link
Contributor

@cofyc cofyc commented Jul 18, 2019

What problem does this PR solve?

fixes #652

What is changed and how does it work?

Check List

Tests

  • Unit test

Code changes

  • Has Go code change

Does this PR introduce a user-facing change?:

NONE

@cofyc
Copy link
Contributor Author

cofyc commented Jul 19, 2019

/run-e2e-tests

@xiaojingchen
Copy link
Contributor

LGTM

weekface
weekface previously approved these changes Jul 19, 2019
if podsCount+1 >= int(replicas+1)/2 {
// When replicas equals 3, pods on each node should not be greater than 1.
maxPodsPerNode := 1
if replicas > 3 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 is hard-coded, but it should be gathered from configuration (in case the user sets it to 5)

Copy link
Contributor Author

@cofyc cofyc Jul 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 3 is the number of copies of data regions in TiKV. Is it possible to has more than 3 copies when the replicas of TiKV instances increases, e.g. replicates one data region into more than 3 instances? cc @tennix @weekface

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 is the tikv replicas.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pr just fix the corner case when replicas is 4.

@cofyc
Copy link
Contributor Author

cofyc commented Jul 23, 2019

it has some conflicts, I will resolve it later

@cofyc
Copy link
Contributor Author

cofyc commented Jul 23, 2019

@xiaojingchen @weekface conflicts are resolved, PTAL again. Thanks!
@gregwebs I'm going to fix the bug first.

@cofyc
Copy link
Contributor Author

cofyc commented Jul 23, 2019

/run-e2e-tests

maxPodsPerNode := 1
if replicas > 3 {
// When replicas is greater than 3, we allow more than one pods on one node.
maxPodsPerNode = int((replicas + 1) / 2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if replicas =5 , the maxPodsPerNode would be 3. Is it right?

Copy link
Contributor Author

@cofyc cofyc Jul 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems not right... 2, 3 will be possible when there are only two nodes.

int((replicas + 1) / 2) should be ceil(replicas / 3)

replicas maxPodsPerNode three nodes less than three nodes
< 3 not limited, because HA is not possible possible, no HA is required possible, no HA is required
= 3 1 1, 1, 1 not possible, requires HA in three nodes
4 2 1, 1, 2 not possible, requires HA in three nodes
5 2 1, 2, 2 not possible, requires HA in three nodes
6 2 2, 2, 2 not possible, requires HA in three nodes
7 3 2, 2, 3 not possible, requires HA in three nodes
8 3 2, 3, 3 not possible, requires HA in three nodes
8 3 3, 3, 3 not possible, requires HA in three nodes

what do you think? @weekface

Copy link
Contributor

@weekface weekface Jul 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the table above, the maxPodsPerNode is 2 when replicas is 4, this is not HA for pd.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the maxPodsPerNode is right when replicas is 4, because in three nodes, 1, 1, 2 is ok, but when there are two instances in two nodes (1, 1), maxPodsPerNode must be 1 at this moment to force the scheduler to find a node which has no TiKV pod.

I guess we need to update the HA algorithm to take more information into account.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maxPodsPerNode is always 1 when scheduled TiKV pods less than 3?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current algorithm is ok for pd, but not tikv. We need a new HA algorithm for tikv.

@cofyc cofyc changed the title fix the corner case when replicas is 4 WIP: fix the corner case when replicas is 4 Jul 23, 2019
@cofyc cofyc changed the title WIP: fix the corner case when replicas is 4 document and improve HA algorithm Jul 23, 2019
@cofyc
Copy link
Contributor Author

cofyc commented Jul 23, 2019

@weekface @xiaojingchen PTAL again.

@cofyc cofyc force-pushed the fix652 branch 2 times, most recently from cd80f80 to 131f76d Compare July 23, 2019 12:28
if component != label.PDLabelVal && component != label.TiKVLabelVal {
glog.V(4).Infof("component %s is ignored in HA predicate", component)
return nodes, nil
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a safe guard and improves readability

name: "three nodes, three pods scheduled on these three nodes, replicas is 4, can't scheduled",
podFn: newHAPDPod,
name: "three nodes, three pods scheduled on these three nodes, replicas is 4, return all the three nodes",
podFn: newHATiKVPod,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this case is for TiKV, we need a TiKV pod

@weekface
Copy link
Contributor

/run-e2e-tests

* ---------------------------
* 1 1
* 2 1
* 3 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please align

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I mixed tab and spaces, fixed now.

@cofyc
Copy link
Contributor Author

cofyc commented Jul 29, 2019

/run-e2e-tests

1 similar comment
@cofyc
Copy link
Contributor Author

cofyc commented Jul 29, 2019

/run-e2e-tests

@cofyc
Copy link
Contributor Author

cofyc commented Jul 29, 2019

@weekface should this be merged before 1.0 GA?

@weekface
Copy link
Contributor

Yes, merging it to master.

But I will not cherry-pick it to the release-1.0. The release-1.0 is frozen.

@sre-bot
Copy link
Contributor

sre-bot commented Jul 30, 2019

cherry pick to release-1.0 failed

@sre-bot
Copy link
Contributor

sre-bot commented Jul 30, 2019

cherry pick to release-1.1 failed

@weekface
Copy link
Contributor

weekface commented Aug 1, 2019

/run-cherry-picker

2 similar comments
@weekface
Copy link
Contributor

weekface commented Aug 1, 2019

/run-cherry-picker

@cofyc
Copy link
Contributor Author

cofyc commented Aug 5, 2019

/run-cherry-picker

cofyc added a commit to cofyc/tidb-operator that referenced this pull request Aug 5, 2019
* fix the corner case when replicas is 4

* improve HA algorithm for TiKV/PD
cofyc added a commit that referenced this pull request Aug 6, 2019
* fix the corner case when replicas is 4

* improve HA algorithm for TiKV/PD
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Can't schedule 4 TiKV instances on 3 K8s nodes
5 participants