rolling-update heuristic needs improvement #489

Closed
nicolasbelanger opened this issue Sep 22, 2016 · 10 comments

@nicolasbelanger

nicolasbelanger commented Sep 22, 2016

This is a follow-up to issue #284.

The rolling-update for masters in different zones needs to wait for at least one master to be fully ready before moving on. I tested an upgrade from v1.3.7 to v1.4.0-beta.10:

NAME                STATUS       NEEDUPDATE  READY  MIN  MAX
master-us-west-2a   NeedsUpdate  1           0      1    1
master-us-west-2b   NeedsUpdate  1           0      1    1
master-us-west-2c   NeedsUpdate  1           0      1    1
nodes               NeedsUpdate  6           0      6    6
I0922 10:17:21.875868    1617 rollingupdate_cluster.go:195] Stopping instance "i-00a61d4642492fe95" in AWS ASG "nodes.qa.k8s"
I0922 10:17:21.875884    1617 rollingupdate_cluster.go:195] Stopping instance "i-014733a8fabec86a7" in AWS ASG "master-us-west-2a.masters.qa.k8s"
I0922 10:18:22.092917    1617 rollingupdate_cluster.go:195] Stopping instance "i-00c054ee7fbf2acb5" in AWS ASG "nodes.qa.k8s"
I0922 10:18:22.329016    1617 rollingupdate_cluster.go:195] Stopping instance "i-061357ef39a05806c" in AWS ASG "master-us-west-2b.masters.qa.k8s"
I0922 10:19:22.323679    1617 rollingupdate_cluster.go:195] Stopping instance "i-043cf9fbcd22f847c" in AWS ASG "nodes.qa.k8s"
I0922 10:19:22.774816    1617 rollingupdate_cluster.go:195] Stopping instance "i-0cf1e9b9bd562be81" in AWS ASG "master-us-west-2c.masters.qa.k8s"
I0922 10:20:27.805410    1617 rollingupdate_cluster.go:195] Stopping instance "i-084a220e29795bf42" in AWS ASG "nodes.qa.k8s"
I0922 10:21:33.376359    1617 rollingupdate_cluster.go:195] Stopping instance "i-09ef93661287935ac" in AWS ASG "nodes.qa.k8s"
I0922 10:22:38.845650    1617 rollingupdate_cluster.go:195] Stopping instance "i-0fb6a7a8ba7ace9f0" in AWS ASG "nodes.qa.k8s"

Unfortunately, by the time master-us-west-2c.masters.qa.k8s is taken down, master-us-west-2a.masters.qa.k8s has not yet fully started. At that point no pods can be scheduled and the service goes down.
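
For what it's worth, the behaviour being asked for boils down to something like the sketch below. This is only a rough illustration with made-up helper names (isMasterReady, stopInstance, rollMasters), not actual kops code:

```go
// Rough sketch of the desired heuristic: terminate one master, then block
// until its replacement in that zone reports Ready (or a timeout expires)
// before touching the next zone. All helper names here are hypothetical.
package rollingupdate

import (
	"fmt"
	"time"
)

// isMasterReady would poll the Kubernetes API for the replacement master's
// node conditions in the given zone; stubbed out for the sketch.
func isMasterReady(zone string) bool { return false }

// stopInstance stands in for the ASG instance termination call.
func stopInstance(id string) { fmt.Println("stopping instance", id) }

// rollMasters terminates one master per zone, waiting for the previous
// zone's replacement to become Ready before moving on.
func rollMasters(mastersByZone map[string]string, timeout time.Duration) error {
	for zone, instanceID := range mastersByZone {
		stopInstance(instanceID)
		deadline := time.Now().Add(timeout)
		for !isMasterReady(zone) {
			if time.Now().After(deadline) {
				return fmt.Errorf("master in %s not ready after %s", zone, timeout)
			}
			time.Sleep(30 * time.Second)
		}
	}
	return nil
}
```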

Let me know if you need more details.

@justinsb justinsb added this to the 1.3.0 milestone Sep 24, 2016
@chrislovecnm
Contributor

Upgrades are on my list. Will ping you if I need more details.

@chrislovecnm chrislovecnm self-assigned this Oct 15, 2016
@chrislovecnm chrislovecnm modified the milestones: backlog, 1.3.0 Oct 15, 2016
@dwradcliffe
Contributor

Even if you have just one master, it starts killing workers before the new master is ready.

@chrislovecnm
Contributor

@dwradcliffe working on it this week ;) we're gonna make it a bunch better :P

@dwradcliffe
Contributor

Sweet! Happy to test it when you're ready.

@chrislovecnm
Contributor

I am wondering if launching a job in the cluster to upgrade itself has value currently. Probably phase two.

@RichardBronosky
Contributor

@chrislovecnm what did you decide on this? What is the alternative to launching a job in the cluster? Having the workstation poll for progress and take steps in sequence? That would be discarding everything we have learned about message/job queues.

@chrislovecnm
Contributor

This is phase 1: #1134

@chrislovecnm
Contributor

chrislovecnm commented Dec 14, 2016

So, still a work in progress and it needs more TLC, but it is much, much better now:

  • cordons the node and attempts to drain it
  • we upgrade the node even if the drain fails
  • we validate the entire cluster after every new node, with a timeout

Same pattern: upgrade the masters first, and then the nodes.

It does not quite scale to 100s of nodes with this pattern without a long run time, and we have ideas for that. But much better.
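
For illustration only, the per-node flow above could look roughly like this; cordonAndDrain, terminateInstance, validateCluster, and rollGroup are hypothetical stand-ins, not the actual kops implementation:

```go
// Rough sketch of the flow described above: cordon/drain, replace the
// instance, then validate the whole cluster before moving to the next node.
// Helper names are hypothetical.
package rollingupdate

import (
	"fmt"
	"time"
)

func cordonAndDrain(node string) error            { return nil } // cordon, then best-effort drain
func terminateInstance(node string) error         { return nil } // replace the instance via its ASG
func validateCluster(timeout time.Duration) error { return nil } // poll until nodes and system pods are healthy

// rollGroup applies the pattern to one instance group; callers would run it
// for the master groups first and then for the nodes.
func rollGroup(nodes []string, validateTimeout time.Duration) error {
	for _, node := range nodes {
		// A failed drain is logged but does not abort the upgrade.
		if err := cordonAndDrain(node); err != nil {
			fmt.Printf("drain of %s failed, continuing: %v\n", node, err)
		}
		if err := terminateInstance(node); err != nil {
			return err
		}
		// Validate the entire cluster after every replacement, with a timeout.
		if err := validateCluster(validateTimeout); err != nil {
			return fmt.Errorf("cluster failed validation after replacing %s: %v", node, err)
		}
	}
	return nil
}
```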

> discard everything we learned about job queues?

More context please.

@chrislovecnm
Contributor

@nicolasbelanger can we close this now?

@nicolasbelanger
Author

@chrislovecnm yep, tx.
