rolling-update heuristic needs improvement #489

Closed
nicolasbelanger opened this issue Sep 22, 2016 · 10 comments

@nicolasbelanger

nicolasbelanger commented Sep 22, 2016

This is a follow-up to issue #284.

The rolling-update for masters in different zones needs to wait for at least one master to be fully ready before moving on. I tested an upgrade from v1.3.7 to v1.4.0-beta.10:

NAME                STATUS       NEEDUPDATE  READY  MIN  MAX
master-us-west-2a   NeedsUpdate  1           0      1    1
master-us-west-2b   NeedsUpdate  1           0      1    1
master-us-west-2c   NeedsUpdate  1           0      1    1
nodes               NeedsUpdate  6           0      6    6
I0922 10:17:21.875868    1617 rollingupdate_cluster.go:195] Stopping instance "i-00a61d4642492fe95" in AWS ASG "nodes.qa.k8s"
I0922 10:17:21.875884    1617 rollingupdate_cluster.go:195] Stopping instance "i-014733a8fabec86a7" in AWS ASG "master-us-west-2a.masters.qa.k8s"
I0922 10:18:22.092917    1617 rollingupdate_cluster.go:195] Stopping instance "i-00c054ee7fbf2acb5" in AWS ASG "nodes.qa.k8s"
I0922 10:18:22.329016    1617 rollingupdate_cluster.go:195] Stopping instance "i-061357ef39a05806c" in AWS ASG "master-us-west-2b.masters.qa.k8s"
I0922 10:19:22.323679    1617 rollingupdate_cluster.go:195] Stopping instance "i-043cf9fbcd22f847c" in AWS ASG "nodes.qa.k8s"
I0922 10:19:22.774816    1617 rollingupdate_cluster.go:195] Stopping instance "i-0cf1e9b9bd562be81" in AWS ASG "master-us-west-2c.masters.qa.k8s"
I0922 10:20:27.805410    1617 rollingupdate_cluster.go:195] Stopping instance "i-084a220e29795bf42" in AWS ASG "nodes.qa.k8s"
I0922 10:21:33.376359    1617 rollingupdate_cluster.go:195] Stopping instance "i-09ef93661287935ac" in AWS ASG "nodes.qa.k8s"
I0922 10:22:38.845650    1617 rollingupdate_cluster.go:195] Stopping instance "i-0fb6a7a8ba7ace9f0" in AWS ASG "nodes.qa.k8s"

Unfortunately, by the time master-us-west-2c.masters.qa.k8s is taken down, master-us-west-2a.masters.qa.k8s has not yet fully started. At that point no pods can be scheduled and the service goes down.
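
For what it's worth, the behaviour being asked for boils down to something like the sketch below. This is only a rough illustration with made-up helper names (isMasterReady, stopInstance, rollMasters), not actual kops code:

```go
// Rough sketch of the desired heuristic: terminate one master, then block
// until its replacement in that zone reports Ready (or a timeout expires)
// before touching the next zone. All helper names here are hypothetical.
package rollingupdate

import (
	"fmt"
	"time"
)

// isMasterReady would poll the Kubernetes API for the replacement master's
// node conditions in the given zone; stubbed out for the sketch.
func isMasterReady(zone string) bool { return false }

// stopInstance stands in for the ASG instance termination call.
func stopInstance(id string) { fmt.Println("stopping instance", id) }

// rollMasters terminates one master per zone, waiting for the previous
// zone's replacement to become Ready before moving on.
func rollMasters(mastersByZone map[string]string, timeout time.Duration) error {
	for zone, instanceID := range mastersByZone {
		stopInstance(instanceID)
		deadline := time.Now().Add(timeout)
		for !isMasterReady(zone) {
			if time.Now().After(deadline) {
				return fmt.Errorf("master in %s not ready after %s", zone, timeout)
			}
			time.Sleep(30 * time.Second)
		}
	}
	return nil
}
```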

Let me know if you need more details.

@justinsb justinsb added this to the 1.3.0 milestone Sep 24, 2016
@chrislovecnm
Contributor

Upgrades are on my list. Will ping you if I need more details.

@chrislovecnm chrislovecnm self-assigned this Oct 15, 2016
@chrislovecnm chrislovecnm modified the milestones: backlog, 1.3.0 Oct 15, 2016
@dwradcliffe
Contributor

Even if you have just one master, it starts killing workers before the new master is ready.

@chrislovecnm
Contributor

@dwradcliffe working on it this week ;) we're gonna make it a bunch better :P

@dwradcliffe
Contributor

Sweet! Happy to test it when you're ready.

@chrislovecnm
Contributor

I am wondering if launching a job in the cluster to upgrade itself has value currently. Probably phase two.

@RichardBronosky
Contributor

@chrislovecnm what did you decide on this? What is the alternative to launching a job in the cluster? Having the workstation poll for progress and take steps in sequence? That would be discarding everything we have learned about message/job queues.

@chrislovecnm
Contributor

This is phase 1: #1134

@chrislovecnm
Contributor

chrislovecnm commented Dec 14, 2016

So, still a work in progress and it needs more TLC, but it is much, much better now:

  • cordons the node and attempts to drain it
  • we upgrade the node even if the drain fails
  • we validate the entire cluster after every new node, with a timeout

Same pattern: upgrade the masters first, and then the nodes.

It does not quite scale to 100s of nodes with this pattern without a long run time, and we have ideas for that. But much better.
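
For illustration only, the per-node flow above could look roughly like this; cordonAndDrain, terminateInstance, validateCluster, and rollGroup are hypothetical stand-ins, not the actual kops implementation:

```go
// Rough sketch of the flow described above: cordon/drain, replace the
// instance, then validate the whole cluster before moving to the next node.
// Helper names are hypothetical.
package rollingupdate

import (
	"fmt"
	"time"
)

func cordonAndDrain(node string) error            { return nil } // cordon, then best-effort drain
func terminateInstance(node string) error         { return nil } // replace the instance via its ASG
func validateCluster(timeout time.Duration) error { return nil } // poll until nodes and system pods are healthy

// rollGroup applies the pattern to one instance group; callers would run it
// for the master groups first and then for the nodes.
func rollGroup(nodes []string, validateTimeout time.Duration) error {
	for _, node := range nodes {
		// A failed drain is logged but does not abort the upgrade.
		if err := cordonAndDrain(node); err != nil {
			fmt.Printf("drain of %s failed, continuing: %v\n", node, err)
		}
		if err := terminateInstance(node); err != nil {
			return err
		}
		// Validate the entire cluster after every replacement, with a timeout.
		if err := validateCluster(validateTimeout); err != nil {
			return fmt.Errorf("cluster failed validation after replacing %s: %v", node, err)
		}
	}
	return nil
}
```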

> discard everything we learned about job queues?

More context please.

@chrislovecnm
Contributor

@nicolasbelanger can we close this now?

@nicolasbelanger
Author

@chrislovecnm yep, tx.
