Downtime while performing upgrade #35

gaieges · 2017-02-22T03:28:38Z

Problem

Deploying via the following method results in a period during which the load balancer throws 500 errors, because it cannot connect to hosts that were shut off (after the new containers have already been created).

My deploy approach:

rancher-compose up --force-upgrade --pull -d

With the following rancher-compose.yml settings:

version: '2'
services:
  myapp:
    scale: 2
    upgrade_strategy:
      start_first: true

Expected behavior

During the entire time that the nodes are doing their rolling deploy, I should get no 500 errors.

Solution

Not sure where the problem is here, but I believe it's that rancher-traefik doesn't watch for events in real time and regenerate the config when an event occurs.

I say this because I've noticed that after a few minutes, once the 'reload' command is issued and the change shows up in the logs, the service starts working fine again.
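
To illustrate the suspected cause, here is a minimal Python sketch. It is not rancher-traefik's actual code: it only models a proxy whose backend list is regenerated on a fixed interval and counts the requests that arrive while the config still lists a stopped container. The function name, the timings, and the constant request rate are all hypothetical.

```python
def requests_in_stale_window(stop_time: int, refresh_interval: int,
                             requests_per_second: int = 1) -> int:
    """Count requests arriving while the proxy config is stale.

    Assumption (hypothetical model): the config is regenerated only at
    multiples of refresh_interval seconds, and the old container stops
    at stop_time seconds. A refresh_interval of 0 stands for an
    event-driven reload that happens the moment the container stops.
    """
    if refresh_interval <= 0:
        # Event-driven model: no window in which the config is stale.
        return 0
    # First refresh at or after the stop (ceiling division).
    next_refresh = -(-stop_time // refresh_interval) * refresh_interval
    return (next_refresh - stop_time) * requests_per_second
```

Under these assumptions, a container stopped shortly after a refresh leaves the config stale for almost the whole interval, which would match the errors disappearing only after the next reload.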

Willing to make a PR but want to ensure that I have the design of this right.


kelchm · 2017-02-24T15:06:27Z

Very interested in this as well. Let me know if there is anything I can do to help with testing.

gaieges · 2017-02-24T17:36:49Z

Chatted with @rawmind0 separately about this. He mentioned that this would essentially be handled in a first-class way by the traefik guys: traefik/traefik#1173

I'm likely to take that approach instead of putting effort in here.

kelchm · 2017-02-27T18:56:54Z

Thanks for the link @gaieges, the Rancher provider for traefik looks very promising.

rawmind0 · 2017-02-28T11:55:23Z

@gaieges thanks for sharing the info in the thread.

I was talking with the traefik team, and traefik v1.2.0-rc2, which includes the rancher backend, may be released this week. I'll test it as soon as I can and update the catalog package accordingly.

Best regards....

miguelpeixe · 2017-05-01T18:39:46Z

Any update on this matter? I'm still getting downtime while performing start-before-stopping upgrades. This is also an issue with rancher's haproxy load balancer.

I was hoping that traefik could be the alternative.

gaieges · 2017-05-01T18:51:54Z

I don't think @rawmind0 is going to put any more effort into this project (but I'll let him speak to that) in favor of the traefik solution they've recently implemented.

I've been trying to use the native traefik support for rancher, and while it works, it's not entirely there yet so I haven't flipped over to using that in production yet. Either way - probably best to look towards using the native traefik approach vs this.

kelchm · 2017-05-01T18:55:15Z

@miguelpeixe, I've made some improvements to the rancher provider in Traefik to improve the situation. While this does not give you true zero downtime deployments out of the box, it's a big improvement.

I think the best way to do blue green deployments with Rancher and Traefik is to actually spin up a new stack rather than upgrading the existing stack.

This allows Traefik to evaluate the health of the stack as a whole and only cut over traffic to the new stack once the entire stack is healthy. I'm working on a complete writeup of how I've been using this, but feel free to reach out to me if you have any questions.
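
A minimal sketch of that cut-over rule, under stated assumptions: this is not Traefik's implementation, and the function name, the stack names, and the shape of the health data are hypothetical. It only shows the idea of keeping traffic on the current stack until every container in the new stack reports healthy.

```python
def pick_live_stack(current: str, candidate: str,
                    health: dict[str, list[bool]]) -> str:
    """Return the name of the stack that should receive traffic.

    health maps a stack name to the health check results of its
    containers (hypothetical data shape, for illustration only).
    """
    candidate_checks = health.get(candidate, [])
    # Cut over only when the candidate has containers and all are healthy.
    if candidate_checks and all(candidate_checks):
        return candidate
    return current
```

For example, `pick_live_stack("blue", "green", {"green": [True, False]})` keeps traffic on `"blue"` until the second container in the new stack becomes healthy.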

miguelpeixe · 2017-05-01T19:17:50Z

Thanks for the updates and tips, @gaieges and @kelchm. I'll work on setting up native traefik and temporarily use a new stack for upgrades.

rayout · 2017-10-06T10:47:21Z

Any news for this?

rawmind0 · 2017-11-07T08:28:04Z

Hi all,

From alpine-traefik release 1.4.0-3 onward, the traefik built-in rancher integration is supported, both metadata and api. The community-catalog is also already updated. Three rancher integrations are now available: metadata, api (traefik built-in), or external (rancher-traefik).

Note that the labels are different with the traefik built-in integration: https://docs.traefik.io/configuration/backends/rancher/#labels-overriding-default-behaviour
Metadata with longpoll is the preferred integration; it's working very well. :)

Also, I made a PR with a refactor of the rancher integration. It has been merged and will be included in the next traefik release: traefik/traefik#2291

Best regards...

rawmind0 closed this as completed Nov 18, 2017
