This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

Document required manual calico 2.6.3 -> calico 3.1.1 upgrade when upgrading from < 0.17.0-provisioned clusters #3208

Merged
merged 4 commits into from
Jul 25, 2018

oivindoh
Contributor

@oivindoh oivindoh commented Jun 7, 2018

What this PR does / why we need it:
Clusters deployed by acs-engine < 0.17.0 with calico enabled run calico 2.6.3. When such a cluster is upgraded with 0.17.0 or later, the calico addon manifest is 3.1.x, and migration is not supported from releases prior to 2.6.5, so some manual steps are needed to get up and running on 3.1.x. See Issue #3191

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #
fixes #3191

Special notes for your reviewer:
I'm not entirely certain where this belongs (whether in examples/networkpolicy or examples/k8s-upgrade).

If applicable:

  • documentation
  • unit tests
  • tested backward compatibility (ie. deploy with previous version, upgrade with this branch)

Release note:

…er upgrading a cluster created with acs-engine prior to 0.17.0
@acs-bot

acs-bot commented Jun 7, 2018

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: oivindoh
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: cecilerobertmichon

Assign the PR to them by writing /assign @cecilerobertmichon in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@acs-bot acs-bot added the size/S label Jun 7, 2018
@msftclas

msftclas commented Jun 7, 2018

CLA assistant check
All CLA requirements met.

@codecov

codecov bot commented Jun 7, 2018

Codecov Report

Merging #3208 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #3208   +/-   ##
=======================================
  Coverage   52.31%   52.31%           
=======================================
  Files         103      103           
  Lines       15458    15458           
=======================================
  Hits         8087     8087           
  Misses       6643     6643           
  Partials      728      728


acs-engine releases starting with 0.17.0 now produce an addon manifest for calico in `/etc/kubernetes/addons/calico-daemonset.yaml` containing calico 3.1.x, and an `updateStrategy` of `RollingUpdate`.

To get up and running with the new version of calico after upgrading a cluster with acs-engine `0.17.0` and up, follow these steps:
Contributor

Thank you @oivindoh! Looks great. Could you please add something like "As per the Calico v3.0.1 release notes, Calico v3.0.x includes breaking changes.

Some highlights include:

  • You must upgrade to Calico v2.6.5 before you can upgrade to v3.0.1 (see https://docs.projectcalico.org/v3.0/getting-started/kubernetes/upgrade/).
  • Calico deployments that access the etcd datastore directly must complete a one-time migration.
  • You must convert any customized Calico manifests via calicoctl convert before you can use them with v3.0.1.

Here are some instructions to get up and running with the new version of Calico […]" here?

Contributor

So people can have context on the breaking changes (and know it's a calico breaking change, not acs-engine)

Contributor Author

Good point! I added some text to be more explicit about this being due to calico breaking changes and not acs-engine 👍

…ico, point to documentation for calico k8s upgrade and calico convert.
@acs-bot acs-bot added size/M and removed size/S labels Jun 8, 2018
@oivindoh
Contributor Author

oivindoh commented Jun 8, 2018

@dtzar could you sanity-check this?

@dtzar
Contributor

dtzar commented Jun 8, 2018

@oivindoh Thanks for the docs. Although I'm sure the steps you list work, this does not follow the upgrade guidance provided by Calico. I know there are significant changes to the RBAC config as well as the manifest - see the changes in my PR which did the upgrade to get an idea.

Soo... I wouldn't approve this guidance as-is. It should map directly to what is on Calico's webpage. You could truncate what you have and just punt to the Calico upgrade webpage, noting that we use the Kubernetes datastore, policy-only configuration, or be more specific and follow their flow/guidance in a way that is specific to acs-engine deployments.

@oivindoh
Contributor Author

@dtzar I guess I'm not entirely clear on where I diverge from the linked upgrade guidance - applying the 3.x manifest with node/cni changed to 2.6.10 and 2.0.6 effectively performs step 1, and handles upgrade to 2.6.5+ in the process. Applying that manifest again with node/cni 3.1.1 performs the rest of the steps required, keeping cluster-cidr as appropriate.

What I wasn't clear on after reading Calico docs (and what I wanted to document here) was the how of upgrading to 2.6.5+, given that I now had a cluster with a 3.1.1 manifest to be managed as an addon and calico 2.6.3 actually running in the cluster.

@acs-bot acs-bot added size/S and removed size/M labels Jun 11, 2018
Contributor

@dtzar dtzar left a comment

Let me know if this is more clear now or if you still have questions


To get up and running with the new version of calico after upgrading a cluster with acs-engine `0.17.0` and up, follow these steps:
1. To update to `2.6.5+` in preparation of an upgrade to 3.1.x as specified, edit `/etc/kubernetes/addons/calico-daemonset.yaml` on a master node, replacing `calico/node:v3.1.1` with `calico/node:v2.6.10` and `calico/cni:v3.1.1` with `calico/cni:v2.0.6`. Run `kubectl apply -f /etc/kubernetes/addons/calico-daemonset.yaml`.
Contributor

The place where you diverge from the upgrade guidance is here because you are only replacing the image versions, not the entire template/manifest. The calico-daemonset.yaml is a merge of the Calico RBAC + manifest files and needs more changes than just bumping the image versions.


`YYYY-MM-DD HH:MM:SS.FFF [INFO][n] health.go 150: Overall health summary=&health.HealthReport{Live:true, Ready:true}`
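
A minimal sketch of checking calico-node log output for a health summary line like the one above. `calico_node_ready` is a hypothetical helper name, and the pod name, namespace, and container name in the usage comment are assumptions rather than values taken from the acs-engine manifest.

```shell
# Hypothetical helper: reads log text on stdin and succeeds if any line
# contains the health summary with both Live:true and Ready:true.
calico_node_ready() {
  # -F matches the string literally, so the braces and dots are not
  # interpreted as regular-expression syntax.
  grep -qF 'health.HealthReport{Live:true, Ready:true}'
}

# Possible usage (names below are assumptions, adjust to your cluster):
#   kubectl logs -n kube-system <calico-node-pod> -c calico-node | calico_node_ready \
#     && echo "calico-node reports healthy"
```

Checking the log text rather than only the pod phase follows the thread's own wording, which points at this log line as the readiness signal.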

3) Edit `/etc/kubernetes/addons/calico-daemonset.yaml` on the master node again, replacing `calico/node:v2.6.10` with `calico/node:v3.1.1` and `calico/cni:v2.0.6` with `calico/cni:v3.1.1`. Run `kubectl apply -f /etc/kubernetes/addons/calico-daemonset.yaml`.
2. To complete the upgrade to 3.1.x, edit `/etc/kubernetes/addons/calico-daemonset.yaml` on the master node again, replacing `calico/node:v2.6.10` with `calico/node:v3.1.1` and `calico/cni:v2.0.6` with `calico/cni:v3.1.1`. Run `kubectl apply -f /etc/kubernetes/addons/calico-daemonset.yaml`.
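
The two image-tag swaps described in the steps above could be scripted along these lines. This is only a sketch: `swap_calico_images` is a hypothetical helper, not part of acs-engine, and it changes nothing but the image tags, which is exactly the limitation raised in the review comments.

```shell
# Hypothetical helper: rewrite the calico/node and calico/cni image tags in a
# manifest file, leaving a .bak copy of the previous contents next to it.
# Args: <manifest> <node_from> <node_to> <cni_from> <cni_to>
swap_calico_images() {
  local manifest="$1" node_from="$2" node_to="$3" cni_from="$4" cni_to="$5"
  cp "$manifest" "$manifest.bak"
  # Read from the backup and write the edited manifest back in place.
  # Dots in the tags act as regex wildcards, which is harmless for these tags.
  sed \
    -e "s|calico/node:${node_from}|calico/node:${node_to}|g" \
    -e "s|calico/cni:${cni_from}|calico/cni:${cni_to}|g" \
    "$manifest.bak" > "$manifest"
}

# Step 1, on a master node (move to 2.6.5+ first):
#   swap_calico_images /etc/kubernetes/addons/calico-daemonset.yaml v3.1.1 v2.6.10 v3.1.1 v2.0.6
#   kubectl apply -f /etc/kubernetes/addons/calico-daemonset.yaml
# Step 2, once calico-node is healthy again (return to 3.1.1):
#   swap_calico_images /etc/kubernetes/addons/calico-daemonset.yaml v2.6.10 v3.1.1 v2.0.6 v3.1.1
#   kubectl apply -f /etc/kubernetes/addons/calico-daemonset.yaml
```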
Contributor

Same thing here.

@dtzar
Contributor

dtzar commented Jun 11, 2018

@oivindoh I read through the document again end-to-end and I think I realize what you're suggesting now. The missing explanation is that I think you're saying to push through with the actual acs-engine upgrade (which will break Calico) and then go into each master node and do the following steps, correct?

If so, then yes the only thing I'm unsure of is if there are any potential problems with the temporary upgrade step you have of 2.6.10 using a 3.x RBAC/Policy template. Feels like a safer route would be to upgrade to 2.6.10 on the existing cluster (so the manifest is much closer), then upgrade acs-engine, then the new 3.x template might just work as-is then.

@oivindoh
Contributor Author

oivindoh commented Jun 12, 2018

@dtzar Exactly. However, upgrading the cluster shouldn't break calico: it will happily continue to run using its old 2.6.3 definitions until the moment you apply the addon manifest manually, since the addon's mode is not set to anything but EnsureExists, which effectively keeps addon manager off the table.

I should update the document and remove the propagate to each master comment, since the masters will already have the updated manifest anyway (we have opted for 3.1.3 instead of the supplied 3.1.1 for richer network policy support, so needed to propagate).

I didn't do any form of extended validation after applying 2.6.10 other than quick manual tests to see whether services were still available and that no alerts were triggering before updating to 3.x, so I can't really vouch for zero negative effects during that step, but I could not detect anything, FWIW.

3.1.1 and 3.1.3 have definitely been chugging along happily after the upgrade process.
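
The addon-manager point in this comment can be checked on a master before applying anything. A minimal sketch, assuming the manifest declares its mode through the standard `addonmanager.kubernetes.io/mode` label; `manifest_is_ensure_exists` is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if the given manifest declares the addon
# manager mode EnsureExists (quoted or unquoted value).
manifest_is_ensure_exists() {
  grep -Eq 'addonmanager\.kubernetes\.io/mode:[[:space:]]*"?EnsureExists"?' "$1"
}

# Possible usage on a master node:
#   manifest_is_ensure_exists /etc/kubernetes/addons/calico-daemonset.yaml \
#     && echo "addon manager will not overwrite the running calico objects"
```

With `EnsureExists`, addon manager only creates missing objects and does not reconcile existing ones, which is why the running 2.6.3 objects stay untouched until the manifest is applied by hand.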

@dtzar
Contributor

dtzar commented Jun 12, 2018

@oivindoh - I wouldn't recommend using the new yaml manifest with 2.6.10 images since it does have significant manifest/rbac changes. Would you be willing to test out this flow and update the document?

  1. Update the daemonset yaml (pre-acs-engine-upgrade) to 2.6.10 with the steps as you described.
  2. Upgrade / deploy to newer 3.x calico using acs-engine (it should just work at this point...)

Also - I just issued #3257 which upgrades to the latest 3.1.3 so you can update your document accordingly and not have to do another step.

@jackfrancis
Member

I think given that acs-engine itself does not provide cluster lifecycle configuration management (such that this would be taken care of more elegantly), and that the versions of Calico will continue to grow, let's just accept this doc as-is. It might not be 100% precise, but it has a good chance of helping someone out in the future who is running an acs-engine-built Calico 2 cluster and who is unable to tear down and recreate a new cluster.

@jackfrancis jackfrancis merged commit 0c87623 into Azure:master Jul 25, 2018
PaulCharlton added a commit to ElementAnalytics/acs-engine that referenced this pull request Jul 26, 2018
* 'master' of https://github.com/Azure/acs-engine: (59 commits)
  Docs: Update user guide list to include Windows, update description of clusters (Azure#3473)
  update to Azure CNI v1.0.10 (Azure#3551)
  Adding 'make dev' equivalent for Windows (Azure#3471)
  print out ubuntu ver in e2e (Azure#3555)
  fix an issue where networkPlugin was not defined correctly when using calico or cilium (Azure#3271)
  Bump ginkgo to a tagged release (Azure#3554)
  Reenable AzureFile tests for Windows on K8s 1.11.1, resolves Azure#3439 (Azure#3496)
  removing rbac error checking from merge fn (Azure#3530)
  Change dns healthcheck to look at external domain (Azure#3282)
  DOCUMENTATION: Fix Documented Default Value for clusterSubnet (Azure#3474)
  Document required manual calico 2.6.3 -> calico 3.1.1 upgrade when upgrading from < 0.17.0-provisioned clusters (Azure#3208)
  revert --image-pull-policy=IfNotPresent for win (Azure#3553)
  --image-pull-policy=IfNotPresent for kubectl run commands (Azure#3552)
  Kubernetes: --max-pods=30 should be Azure CNI-only (Azure#3543)
  disable Azure CNI network monitor addon default (Azure#3550)
  only do az vm list for k8s (Azure#3540)
  Retire Swarm E2E for PR test coverage (Azure#3539)
  retire Azure CDN for container image repository proxying (Azure#3535)
  removed datadisk to allow scale after upgrade (Azure#3482)
  Pump k8s-azure-kms version (Azure#3531)
  ...
juan-lee pushed a commit that referenced this pull request Aug 1, 2018
jackfrancis pushed a commit to jackfrancis/acs-engine that referenced this pull request Aug 3, 2018
Development

Successfully merging this pull request may close these issues.

Calico (2.6.3 -> 3.1.1) upgrade fails between acs-engine 0.16.0 and 0.18.1
6 participants