Fix #9696 - apiserver outage when replacing or scaling control plane nodes #9701

holmesb · 2023-01-24T12:21:11Z

Fixed by bouncing apiserver static pods sequentially instead of all at once when there are etcd node changes. Retain the faster, old method for use by non-HA apiserver.
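
As a rough illustration of the sequential approach described above (a minimal Python sketch of the control flow, not the Ansible tasks in this PR), each node is restarted and then polled until healthy before the next one is touched. The names `rolling_restart`, `restart` and `is_healthy` are hypothetical; in practice the health check would query the node's local apiserver health endpoint.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    nodes: Iterable[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    retries: int = 30,
    delay: float = 0.0,
) -> List[str]:
    """Restart one node at a time, moving on only after it reports healthy."""
    done: List[str] = []
    for node in nodes:
        restart(node)
        # Poll this node's health before touching the next one.
        for _ in range(retries):
            if is_healthy(node):
                break
            time.sleep(delay)
        else:
            # Never became healthy: stop so the remaining nodes keep serving.
            raise RuntimeError(f"{node} did not become healthy; stopping rollout")
        done.append(node)
    return done
```

Stopping on the first unhealthy node is a design choice in this sketch: the remaining, untouched apiservers keep serving traffic while the failure is investigated.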

No longer change (and restart) running apiserver static pods if only the order of the etcd servers has changed.
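
One way to picture the order-insensitive check is the small Python sketch below. It assumes the etcd endpoints are available as the comma-separated string that the apiserver's `--etcd-servers` flag takes; the function name `etcd_servers_changed` is hypothetical and is not taken from this PR.

```python
def etcd_servers_changed(current: str, desired: str) -> bool:
    """Return True only if the set of etcd endpoints differs, ignoring order."""

    def parse(value: str) -> frozenset:
        # Split the comma-separated list and drop empty or padded entries.
        return frozenset(u.strip() for u in value.split(",") if u.strip())

    return parse(current) != parse(desired)
```

With a check like this, a reordered list compares as unchanged, so the manifest would not need rewriting and the static pod would not restart.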

/kind bug

What this PR does / why we need it:
Fixes loss of apiserver when etcd nodes are scaled or CP1 is replaced.

Which issue(s) this PR fixes:
Fixes #9696

Does this PR introduce a user-facing change?:
NONE

linux-foundation-easycla · 2023-01-24T12:21:14Z

The committers listed above are authorized under a signed CLA.

✅ login: holmesb (646bc92)

k8s-ci-robot · 2023-01-24T12:21:20Z

Hi @holmesb. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2023-01-24T12:29:40Z

Keywords which can automatically close issues and at(@) or hashtag(#) mentions are not allowed in commit messages.

The list of commits with invalid commit messages:

0a9b600 Fix apiserver outage when replacing or scaling control plane nodes #9696 by bouncing apiserver static pods sequentially instead of all at once when there are etcd node changes. Retain the faster, old method for use by non-HA apiserver.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

holmesb · 2023-01-25T14:33:23Z

Fixed author\CLA.

holmesb · 2023-01-25T16:20:26Z

CI has a "command-instead-of-module" Ansible lint error because I'm using sed & curl instead of lineinfile & uri. But we must query the local endpoint (127.0.0.1/healthz) before moving onto the next CP node (apiserver static pod). Can't use throttle at the block level, otherwise I'd couple a lineinfile with a uri task. Open to suggestions. Maybe I should split this into its own play, then I could use serial. Or can we just ignore this lint error?

floryut · 2023-01-25T16:40:34Z

> CI has a "command-instead-of-module" Ansible lint error because I'm using sed & curl instead of lineinfile & uri. But we must query the local endpoint (127.0.0.1/healthz) before moving onto the next CP node (apiserver static pod). Can't use throttle at the block level, otherwise I'd couple a shell with a uri task. Open to suggestions. Maybe I should split this into its own play, then I could use serial. Or can we just ignore this lint error?

ignoring the error is fine for this case I'd say, as long as there is a valid reason it's fine for me

holmesb · 2023-02-03T17:33:58Z

Any news with this one @cyclinder @jayonlau? It would be good to avoid downtime every time control plane nodes are changed. There are no breaking changes.

holmesb · 2023-02-17T16:57:23Z

Any news with this pls @floryut ? K8s API server going offline every CP change isn't exactly "Production Ready". I'd like to avoid having to add this as a patch to Kubespray every release\build.

floryut · 2023-02-20T09:10:34Z

> Any news with this pls @floryut ? K8s API server going offline every CP change isn't exactly "Production Ready". I'd like to avoid having to add this as a patch to Kubespray every release\build.

@holmesb Sorry I was pretty much waiting for you to fix the CI not passing 😆
Can you take a look ?

holmesb · 2023-02-20T20:38:35Z

The ansible lint message? I thought you said "ignoring the error is fine for this case I'd say". We're already using sed & curl elsewhere in this repo, so I can't envisage any issues.

floryut · 2023-02-20T21:31:12Z

> The ansible lint message? I thought you said "ignoring the error is fine for this case I'd say". We're already using sed & curl elsewhere in this repo, so I can't envisage any issues.

Yes, ignoring the error, meaning adding a noqa in the code on this specific line 😄

linux-foundation-easycla · 2023-02-20T22:06:32Z

The committers listed above are authorized under a signed CLA.

✅ login: holmesb (6d1a5c1)

k8s-ci-robot · 2023-02-20T22:15:34Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: holmesb
Once this PR has been reviewed and has the lgtm label, please assign luckysb for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

linux-foundation-easycla · 2023-02-21T16:38:05Z

The committers listed above are authorized under a signed CLA.

✅ login: holmesb (bad2f15)

holmesb · 2023-02-22T09:38:28Z

Passing CI now @floryut

k8s-triage-robot · 2023-05-23T10:29:10Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

jcpunk · 2023-05-31T13:28:26Z

Can you rebase off head?

k8s-ci-robot · 2023-06-05T09:05:42Z

Keywords which can automatically close issues and at(@) mentions are not allowed in the title of a Pull Request.

You can edit the title by writing /retitle in a comment.

When GitHub merges a Pull Request, the title is included in the merge commit. To avoid invalid keywords in the merge commit, please edit the title of the PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

holmesb · 2023-06-07T08:54:41Z

> Can you rebase off head?

Done @jcpunk

k8s-triage-robot · 2023-07-07T09:49:46Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle rotten
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

VannTen · 2023-11-26T10:48:25Z

IMO, we should just revert #8253 instead. It's documented here that scale.yml is not intended for control-plane components, and you should use cluster/upgrade-cluster.
Reverting that PR would mean the api-server conf change is handled one node at a time during the normal playbook upgrade path, which would avoid the downtime mentioned.

VannTen · 2023-12-21T15:30:29Z

I think this has been fixed by the linked PR above, so we'll close that.
Feel free to reopen if still needed.
Thanks for the work regardless o/
/close

k8s-ci-robot · 2023-12-21T15:30:35Z

@VannTen: Closed this PR.

In response to this:

> I think this has been fixed by the linked PR above, so we'll close that.
> Feel free to reopen if still needed.
> Thanks for the work regardless o/
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2023-12-21T15:30:38Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added the do-not-merge/invalid-commit-message label Jan 24, 2023

k8s-ci-robot requested review from cyclinder and jayonlau January 24, 2023 12:21

k8s-ci-robot added the cncf-cla: no and needs-ok-to-test labels Jan 24, 2023

k8s-ci-robot added the size/M label Jan 24, 2023

holmesb force-pushed the bh/api_server_outage_fix branch 2 times, most recently from 3b858f1 to d56fa4c Compare January 25, 2023 14:30

k8s-ci-robot added cncf-cla: yes and removed cncf-cla: no labels Jan 25, 2023

holmesb force-pushed the bh/api_server_outage_fix branch 2 times, most recently from 07f3f96 to 5800a7c Compare February 20, 2023 22:06

holmesb force-pushed the bh/api_server_outage_fix branch from 5800a7c to 6d1a5c1 Compare February 20, 2023 22:15

k8s-ci-robot removed the cncf-cla: no label Feb 20, 2023

k8s-ci-robot added the cncf-cla: yes label Feb 20, 2023

k8s-ci-robot added size/M and removed size/XS labels Feb 20, 2023

holmesb force-pushed the bh/api_server_outage_fix branch 2 times, most recently from 6470788 to 0f65697 Compare February 21, 2023 11:36

k8s-ci-robot added cncf-cla: no and removed cncf-cla: yes labels Feb 21, 2023

holmesb force-pushed the bh/api_server_outage_fix branch from edbfc11 to bad2f15 Compare February 21, 2023 16:39

k8s-ci-robot added cncf-cla: yes and removed cncf-cla: no labels Feb 21, 2023

k8s-ci-robot added the lifecycle/stale label May 23, 2023

Fix issue 9696 by bouncing apiserver static pods sequentially instead…

646bc92

… of all at once when there are etcd node changes. Retain the faster, old method for use by non-HA apiserver. No longer change running apiserver static pod if only the order of etcd servers have changed.

holmesb force-pushed the bh/api_server_outage_fix branch from bad2f15 to 646bc92 Compare June 5, 2023 09:05

k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Jul 7, 2023

VannTen mentioned this pull request Nov 26, 2023

Revert "Update etcd-servers for apiserver (#8253)" #10652

Merged

k8s-ci-robot closed this Dec 21, 2023

k8s-ci-robot added the needs-rebase label Dec 21, 2023
