
Make sure upgrades from v0.1.0 work well #484

Closed
iaguis opened this issue May 26, 2020 · 14 comments

@iaguis
Contributor

iaguis commented May 26, 2020

Before releasing v0.2.0, we should make sure upgrades from v0.1.0 work automatically or, if user intervention is needed, document the required steps.

@iaguis iaguis added area/components Items related to components area/kubernetes Core Kubernetes stuff area/ux User Experience labels May 26, 2020
@iaguis iaguis added this to the v0.2.0 milestone May 26, 2020
@iaguis iaguis self-assigned this May 26, 2020
@iaguis
Contributor Author

iaguis commented May 26, 2020

I've done some initial research on this; here are my findings.

Control plane (and kubelet) updates

This worked fine without any changes.

Components

Updates for some components didn't work, for two reasons.

Helm resource conflicts

For example:

Applying component 'metrics-server'...
FATA[0001] updating chart failed: rendered manifests contain a new resource that already exists. Unable to continue with update: existing resource conflict: kind: RoleBinding, namespace: kube-system, name: metrics-server-auth-reader  args="[metrics-server]" command="lokoctl component apply"

AFAIU this happens because some of the resources we create changed their apiVersion, and Helm is not able to deal with that.

A way to handle this was added in helm/helm#7649, which was included in Helm v3.2.0.

Helm doesn't support replacing resources with immutable fields

Some updates change resources that have immutable fields, so trying to patch them fails:

Applying component 'cert-manager'...
FATA[0006] updating chart failed: cannot patch "cert-manager-cainjector" with kind Deployment: Deployment.apps "cert-manager-cainjector" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"cainjector", "app.kubernetes.io/instance":"cert-manager", "app.kubernetes.io/name":"cainjector"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable && cannot patch "cert-manager" with kind Deployment: Deployment.apps "cert-manager" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"controller", "app.kubernetes.io/instance":"cert-manager", "app.kubernetes.io/name":"cert-manager"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable && cannot patch "cert-manager-webhook" with kind Deployment: Deployment.apps "cert-manager-webhook" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"webhook", "app.kubernetes.io/instance":"cert-manager", "app.kubernetes.io/name":"webhook"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable  args="[cert-manager]" command="lokoctl component apply"

Helm doesn't yet have a way to handle these, but there's an open PR that adds one, scheduled for Helm v3.3.0.

It works by deleting and recreating the object when there's a conflict on an immutable field.
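Until that lands, the manual workaround amounts to deleting the conflicting objects ourselves and re-running the apply so Helm creates them fresh. A rough sketch for the cert-manager case (Deployment names taken from the error above; note this causes downtime for those pods):

# Delete the Deployments whose selector changed, then re-apply the component.
kubectl -n cert-manager delete deployment cert-manager cert-manager-cainjector cert-manager-webhook
lokoctl component apply cert-manager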

Fixing Helm issues

I have a branch where I updated the Helm version we use to v3.2.1.

We already have a fork of Helm, so I rebased our Helm patch to v3.2.1 and added the "recreate" PR that adds support for replacing resources with immutable fields: https://github.com/kinvolk/helm/tree/iaguis/helm-3.2.1-upgrade-rollback

Here's my Lokomotive branch with all this: https://github.com/kinvolk/lokomotive/tree/iaguis/upgrade-helm

Testing it all

To test this, I created a Packet cluster with Lokomotive v0.1.0 with as many components as I could (I based it on our current CI configuration).

Then I ran lokoctl cluster apply with my branch.

Here are the things I needed to do and what I've found.

Annotate and label some objects

To make Helm adopt objects whose apiVersion changed, we need to add some annotations and labels to them.

I found four components that needed this. The per-component commands are below, and a consolidated script version follows the httpbin section.

Dex

kubectl -n dex label ingress dex app.kubernetes.io/managed-by=Helm 
kubectl -n dex annotate ingress dex meta.helm.sh/release-name=dex
kubectl -n dex annotate ingress dex meta.helm.sh/release-namespace=dex

Gangway

kubectl -n gangway label ingress gangway app.kubernetes.io/managed-by=Helm
kubectl -n gangway annotate ingress gangway meta.helm.sh/release-name=gangway
kubectl -n gangway annotate ingress gangway meta.helm.sh/release-namespace=gangway

Metrics Server

kubectl -n kube-system label rolebinding metrics-server-auth-reader app.kubernetes.io/managed-by=Helm
kubectl -n kube-system annotate rolebinding metrics-server-auth-reader meta.helm.sh/release-namespace=kube-system
kubectl -n kube-system annotate rolebinding metrics-server-auth-reader meta.helm.sh/release-name=metrics-server

httpbin

kubectl -n httpbin label ingress httpbin app.kubernetes.io/managed-by=Helm
kubectl -n httpbin annotate ingress httpbin meta.helm.sh/release-namespace=httpbin
kubectl -n httpbin annotate ingress httpbin meta.helm.sh/release-name=httpbin
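For convenience, the same adoption steps can be written as one loop over (namespace, kind, object, release name, release namespace) tuples; this is just a consolidated sketch of the commands listed above:

# Columns: namespace kind object release-name release-namespace
while read -r ns kind obj rel relns; do
  kubectl -n "$ns" label "$kind" "$obj" app.kubernetes.io/managed-by=Helm
  kubectl -n "$ns" annotate "$kind" "$obj" \
    meta.helm.sh/release-name="$rel" meta.helm.sh/release-namespace="$relns"
done <<'EOF'
dex ingress dex dex dex
gangway ingress gangway gangway gangway
kube-system rolebinding metrics-server-auth-reader metrics-server kube-system
httpbin ingress httpbin httpbin httpbin
EOF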

Transient failure

cert-manager

The first time you try to update cert-manager you get this:

FATA[0094] updating chart failed: failed to recreate "letsencrypt-production" with kind ClusterIssuer: Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: dial tcp 10.3.34.225:443: connect: connection refused && failed to recreate "letsencrypt-staging" with kind ClusterIssuer: Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: dial tcp 10.3.34.225:443: connect: connection refused  args="[]" command="lokoctl component apply"

It works after the first failure.

Maybe there's a way to retry here so users don't have to run apply twice?
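Until lokoctl retries on its own, a small wrapper like this would paper over the transient webhook failure (just a sketch; the retry count and sleep interval are arbitrary):

# Retry the component apply a few times until the cert-manager webhook is reachable again.
for i in 1 2 3; do
  lokoctl component apply cert-manager && break
  echo "apply failed, retrying in 30s..." >&2
  sleep 30
done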

Changes required to cluster configuration

External DNS

owner_id is now required, so I had to add it.

We should call this out in the release notes.

Changes recommended in cluster configuration

Dex + Gangway

We should note in the release notes that we recommend Dex+Gangway users add the "oidc" field to the cluster section once #182 is merged.

Components support monitoring now

Some components gained support for the service_monitor flag in #200. We should call it out in the release notes.

Conclusion

After upgrading Helm and adding the "recreate" PR I mentioned, I could successfully update a Lokomotive v0.1.0 cluster to current master with some extra steps and one hiccup that required me to run lokoctl cluster apply a second time.

I think next steps should be:

@invidian
Member

Nice work @iaguis. IMO upgrading Helm to the latest version and pulling in this extra recreate patch seems reasonable.

@invidian
Member

invidian commented Jun 8, 2020

cert-manager
The first time you try to update cert-manager you get this:

FATA[0094] updating chart failed: failed to recreate "letsencrypt-production" with kind ClusterIssuer: Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: dial tcp 10.3.34.225:443: connect: connection refused && failed to recreate "letsencrypt-staging" with kind ClusterIssuer: Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: dial tcp 10.3.34.225:443: connect: connection refused  args="[]" command="lokoctl component apply"

It works after the first failure.
Maybe there's a way to retry here so users don't have to run apply twice?

Hm, maybe the default strategy used for the webhook deployment makes it unavailable for some time, hence it times out? Or maybe it does not wait until it becomes ready or something? I'll test.

@invidian
Member

invidian commented Jun 8, 2020

Hm, maybe the default strategy used for the webhook deployment makes it unavailable for some time, hence it times out? Or maybe it does not wait until it becomes ready or something? I'll test.

Okay, this happens because we re-create the Deployment object, which causes all its pods to shut down. And we re-create it because of:

FATA[0002] updating chart failed: cannot patch "cert-manager-cainjector" with kind Deployment: Deployment.apps "cert-manager-cainjector" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"cainjector", "app.kubernetes.io/instance":"cert-manager", "app.kubernetes.io/name":"cainjector"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable && cannot patch "cert-manager" with kind Deployment: Deployment.apps "cert-manager" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"controller", "app.kubernetes.io/instance":"cert-manager", "app.kubernetes.io/name":"cert-manager"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable && cannot patch "cert-manager-webhook" with kind Deployment: Deployment.apps "cert-manager-webhook" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"webhook", "app.kubernetes.io/instance":"cert-manager", "app.kubernetes.io/name":"webhook"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable  args="[cert-manager]" command="lokoctl component apply"

One way I could think of to retry updates would be to have a controller which continuously ensures that the configuration is up to date, but this is beyond our capabilities right now.
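In the meantime, a user hitting this can wait for the re-created webhook Deployment to become ready and then re-run the apply, roughly:

# Wait for the webhook Deployment (name taken from the error above) to be ready, then re-apply.
kubectl -n cert-manager rollout status deployment cert-manager-webhook --timeout=120s
lokoctl component apply cert-manager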

@invidian
Member

Issue while upgrading an AWS cluster:

Error: Missing required argument

  on ../lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers.tf line 1, in module "workers":
   1: module "workers" {

The argument "pool_name" is required, but no definition was found.


Error: Missing required argument

  on ../lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers.tf line 1, in module "workers":
   1: module "workers" {

The argument "cluster_name" is required, but no definition was found.


Error: Missing required argument

  on ../lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers.tf line 1, in module "workers":
   1: module "workers" {

The argument "lb_arn" is required, but no definition was found.


Error: Unsupported argument

  on ../lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers.tf line 3, in module "workers":
   3:   name   = var.cluster_name

An argument named "name" is not expected here.

Solution: either remove the lokomotive-assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers.tf file from the assets:

rm lokomotive-assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers.tf

Or back up lokomotive-assets/terraform/terraform.tfstate, remove lokomotive-assets, restore the state, and then run the upgrade; see the sketch below.
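As a sketch, that second option could look like this (assuming the default lokomotive-assets layout from the paths above):

# Back up the Terraform state, drop the stale assets, restore the state, then upgrade.
cp lokomotive-assets/terraform/terraform.tfstate terraform.tfstate.bak
rm -rf lokomotive-assets
mkdir -p lokomotive-assets/terraform
cp terraform.tfstate.bak lokomotive-assets/terraform/terraform.tfstate
lokoctl cluster apply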

@invidian
Member

This was needed to upgrade Calico:

kubectl apply -f ~/data/workspaces/lokoctl/lokomotive-assets/cluster-assets/charts/kube-system/calico/crds/kubecontrollersconfigurations.yaml
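To confirm the new CRD is in place afterwards (the CRD name here is assumed from the chart file name above):

# Assumed CRD name, derived from the kubecontrollersconfigurations.yaml chart file.
kubectl get crd kubecontrollersconfigurations.crd.projectcalico.org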

@invidian
Member

I also hit hashicorp/terraform-provider-aws#8305: I had to run terraform destroy -target to destroy the LBs and let them be re-created. I'm not sure how much impact that has on an operating cluster that receives traffic :/
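For reference, the targeted destroy looked roughly like the following; the exact resource addresses depend on the module layout, so treat these as hypothetical examples and check terraform state list first:

# Hypothetical resource addresses; adjust to the actual module path in your state.
terraform destroy -target='module.workers.aws_lb_target_group.workers-http' \
                  -target='module.workers.aws_lb_target_group.workers-https'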

@johananl
Member

johananl commented Jun 16, 2020

I'm not sure how much impact that has on operating cluster, which receives traffic :/

Destroying an AWS LB is highly disruptive. LBs typically take some time to get destroyed (mainly due to automatic connection draining) and then more time to get re-created. Ideally we shouldn't touch any AWS LBs when upgrading clusters. I also don't see a logical reason to do so (but I'm aware there are currently technical problems around that).

In general, LBs usually remain untouched when maintaining compute nodes. An EC2 machine can get gracefully detached from an LB, updated/replaced/whatever and then re-attached.
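For NLB target groups, that detach/re-attach can be done with the AWS CLI, roughly like this (the target group ARN and instance ID are placeholders):

# Drain an instance from a target group, do the maintenance, then re-register it.
aws elbv2 deregister-targets --target-group-arn "$TARGET_GROUP_ARN" --targets Id="$INSTANCE_ID"
# ... update or replace the node ...
aws elbv2 register-targets --target-group-arn "$TARGET_GROUP_ARN" --targets Id="$INSTANCE_ID"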

@invidian
Member

More findings from AWS:

  • the worker autoscaling groups are re-created because of this commit: 43fda3c. We could try adding create_before_destroy to make sure some workers stay running, but I'm not sure how feasible that is.
  • updating prometheus-operator failed for me with the following error:
FATA[0671] Applying component configuration failed: updating chart failed: failed to recreate "prometheus-operator-alertmanager.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://prometheus-operator-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=30s: dial tcp 10.3.114.181:443: connect: connection refused && failed to recreate "prometheus-operator-etcd" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://prometheus-operator-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=30s: dial tcp 10.3.114.181:443: connect: connection refused && failed to recreate "prometheus-operator-general.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://prometheus-operator-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=30s: dial tcp 10.3.114.181:443: connect: connection refused && failed to recreate "prometheus-operator-k8s.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://prometheus-operator-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=30s: dial tcp 10.3.114.181:443: connect: connection refused && failed to recreate "prometheus-operator-kube-apiserver.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://prometheus-operator-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=30s: dial tcp 10.3.114.181:443: connect: connection refused && failed to recreate "prometheus-operator-kube-prometheus-node-recording.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://prometheus-operator-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=30s: dial tcp 10.3.114.181:443: connect: connection refused && failed to recreate "prometheus-operator-kube-scheduler.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://prometheus-operator-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=30s: dial tcp 10.3.114.181:443: connect: connection refused && failed to recreate "prometheus-operator-kubernetes-absent" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://prometheus-operator-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=30s: dial tcp 10.3.114.181:443: connect: connection refused && failed to recreate "prometheus-operator-kubernetes-apps" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://prometheus-operator-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=30s: dial tcp 10.3.114.181:443: connect: connection refused && failed to recreate "prometheus-operator-kubernetes-resources" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://prometheus-operator-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=30s: dial tcp 10.3.114.181:443: connect: connection refused && failed to recreate "prometheus-operator-kubernetes-storage" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://prometheus-operator-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=30s: dial tcp 10.3.114.181:443: connect: connection refused && failed to recreate "prometheus-operator-kubernetes-system-apiserver" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://prometheus-operator-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=30s: dial tcp 10.3.114.181:443: connect: connection refused && failed to recreate "prometheus-operator-kubernetes-system-controller-manager" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://prometheus-operator-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=30s: dial tcp 10.3.114.181:443: connect: connection refused && failed to recreate "prometheus-operator-kubernetes-system-kubelet" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://prometheus-operator-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=30s: dial tcp 10.3.114.181:443: connect: connection refused && failed to recreate "prometheus-operator-kubernetes-system-scheduler" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://prometheus-operator-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=30s: dial tcp 10.3.114.181:443: connect: connection refused && failed to recreate "prometheus-operator-kubernetes-system" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://prometheus-operator-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=30s: dial tcp 10.3.114.181:443: connect: connection refused && failed to recreate "prometheus-operator-node-exporter.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://prometheus-operator-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=30s: dial tcp 10.3.114.181:443: connect: connection refused  args="[]" command="lokoctl cluster apply"
  • it seems the changes introduced in 0a895bb cause the issue with the aws_lb_target_group resource. With the following patch, it is possible to create new target groups while keeping the old ones around, perhaps to be removed in the next release. This way, we don't try to modify the existing target groups but replace them with new ones. I'm not sure what consequences this has for incoming traffic.
diff --git a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/ingress.tf b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/ingress.tf
index 16b42576..9693170a 100644
--- a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/ingress.tf
+++ b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/ingress.tf
@@ -5,7 +5,7 @@ resource "aws_lb_listener" "ingress-http" {

   default_action {
     type             = "forward"
-    target_group_arn = aws_lb_target_group.workers-http.arn
+    target_group_arn = aws_lb_target_group.workers_http.arn
   }
 }

@@ -16,7 +16,7 @@ resource "aws_lb_listener" "ingress-https" {

   default_action {
     type             = "forward"
-    target_group_arn = aws_lb_target_group.workers-https.arn
+    target_group_arn = aws_lb_target_group.workers_https.arn
   }
 }

@@ -73,3 +73,55 @@ resource "aws_lb_target_group" "workers-https" {
     PoolName    = var.pool_name
   }
 }
+
+resource "aws_lb_target_group" "workers_http" {
+  vpc_id      = var.vpc_id
+  target_type = "instance"
+
+  protocol = "TCP"
+  port     = 30080
+
+  # HTTP health check for ingress
+  health_check {
+    protocol = "TCP"
+    port     = 30080
+
+    # NLBs required to use same healthy and unhealthy thresholds
+    healthy_threshold   = 3
+    unhealthy_threshold = 3
+
+    # Interval between health checks required to be 10 or 30
+    interval = 10
+  }
+
+  tags = {
+    ClusterName = var.cluster_name
+    PoolName    = var.pool_name
+  }
+}
+
+resource "aws_lb_target_group" "workers_https" {
+  vpc_id      = var.vpc_id
+  target_type = "instance"
+
+  protocol = "TCP"
+  port     = 30443
+
+  # HTTP health check for ingress
+  health_check {
+    protocol = "TCP"
+    port     = 30443
+
+    # NLBs required to use same healthy and unhealthy thresholds
+    healthy_threshold   = 3
+    unhealthy_threshold = 3
+
+    # Interval between health checks required to be 10 or 30
+    interval = 10
+  }
+
+  tags = {
+    ClusterName = var.cluster_name
+    PoolName    = var.pool_name
+  }
+}
diff --git a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/outputs.tf b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/outputs.tf
index 69e6a589..5bbc7149 100644
--- a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/outputs.tf
+++ b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/outputs.tf
@@ -1,9 +1,9 @@
 output "target_group_http" {
   description = "ARN of a target group of workers for HTTP traffic"
-  value       = aws_lb_target_group.workers-http.arn
+  value       = aws_lb_target_group.workers_http.arn
 }

 output "target_group_https" {
   description = "ARN of a target group of workers for HTTPS traffic"
-  value       = aws_lb_target_group.workers-https.arn
+  value       = aws_lb_target_group.workers_https.arn
 }
diff --git a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/workers.tf b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/workers.tf
index 26668621..c7e341be 100644
--- a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/workers.tf
+++ b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/workers.tf
@@ -19,6 +19,8 @@ resource "aws_autoscaling_group" "workers" {
   target_group_arns = flatten([
     aws_lb_target_group.workers-http.id,
     aws_lb_target_group.workers-https.id,
+    aws_lb_target_group.workers_http.id,
+    aws_lb_target_group.workers_https.id,
     var.target_groups,
   ])
  • after the above patch is applied to the cluster, the following cleanup is possible:
diff --git a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/ingress.tf b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/ingress.tf
index 9693170a..3070d967 100644
--- a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/ingress.tf
+++ b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/ingress.tf
@@ -20,60 +20,6 @@ resource "aws_lb_listener" "ingress-https" {
   }
 }

-resource "aws_lb_target_group" "workers-http" {
-  vpc_id      = var.vpc_id
-  target_type = "instance"
-
-  protocol = "TCP"
-  port     = 80
-
-  # HTTP health check for ingress
-  health_check {
-    protocol = "HTTP"
-    port     = 10254
-    path     = "/healthz"
-
-    # NLBs required to use same healthy and unhealthy thresholds
-    healthy_threshold   = 3
-    unhealthy_threshold = 3
-
-    # Interval between health checks required to be 10 or 30
-    interval = 10
-  }
-
-  tags = {
-    ClusterName = var.cluster_name
-    PoolName    = var.pool_name
-  }
-}
-
-resource "aws_lb_target_group" "workers-https" {
-  vpc_id      = var.vpc_id
-  target_type = "instance"
-
-  protocol = "TCP"
-  port     = 443
-
-  # HTTP health check for ingress
-  health_check {
-    protocol = "HTTP"
-    port     = 10254
-    path     = "/healthz"
-
-    # NLBs required to use same healthy and unhealthy thresholds
-    healthy_threshold   = 3
-    unhealthy_threshold = 3
-
-    # Interval between health checks required to be 10 or 30
-    interval = 10
-  }
-
-  tags = {
-    ClusterName = var.cluster_name
-    PoolName    = var.pool_name
-  }
-}
-
 resource "aws_lb_target_group" "workers_http" {
   vpc_id      = var.vpc_id
   target_type = "instance"
diff --git a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/workers.tf b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/workers.tf
index c7e341be..594492c7 100644
--- a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/workers.tf
+++ b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/workers.tf
@@ -17,8 +17,6 @@ resource "aws_autoscaling_group" "workers" {

   # target groups to which instances should be added
   target_group_arns = flatten([
-    aws_lb_target_group.workers-http.id,
-    aws_lb_target_group.workers-https.id,
     aws_lb_target_group.workers_http.id,
     aws_lb_target_group.workers_https.id,
     var.target_groups,

@invidian
Member

With the following patch, the upgrade process is much smoother and the autoscaling group is not recreated:

diff --git a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/ingress.tf b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/ingress.tf
index 10ad5c8a..0c5103b2 100644
--- a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/ingress.tf
+++ b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/ingress.tf
@@ -5,7 +5,7 @@ resource "aws_lb_listener" "ingress-http" {

   default_action {
     type             = "forward"
-    target_group_arn = aws_lb_target_group.workers-http.arn
+    target_group_arn = aws_lb_target_group.workers_http.arn
   }
 }

@@ -16,11 +16,11 @@ resource "aws_lb_listener" "ingress-https" {

   default_action {
     type             = "forward"
-    target_group_arn = aws_lb_target_group.workers-https.arn
+    target_group_arn = aws_lb_target_group.workers_https.arn
   }
 }

-resource "aws_lb_target_group" "workers-http" {
+resource "aws_lb_target_group" "workers_http" {
   vpc_id      = var.vpc_id
   target_type = "instance"

@@ -45,7 +45,7 @@ resource "aws_lb_target_group" "workers-http" {
   }
 }

-resource "aws_lb_target_group" "workers-https" {
+resource "aws_lb_target_group" "workers_https" {
   vpc_id      = var.vpc_id
   target_type = "instance"

diff --git a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/outputs.tf b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/outputs.tf
index 69e6a589..5bbc7149 100644
--- a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/outputs.tf
+++ b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/outputs.tf
@@ -1,9 +1,9 @@
 output "target_group_http" {
   description = "ARN of a target group of workers for HTTP traffic"
-  value       = aws_lb_target_group.workers-http.arn
+  value       = aws_lb_target_group.workers_http.arn
 }

 output "target_group_https" {
   description = "ARN of a target group of workers for HTTPS traffic"
-  value       = aws_lb_target_group.workers-https.arn
+  value       = aws_lb_target_group.workers_https.arn
 }
diff --git a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/workers.tf b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/workers.tf
index 26668621..908820dd 100644
--- a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/workers.tf
+++ b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/workers.tf
@@ -1,6 +1,7 @@
 # Workers AutoScaling Group
 resource "aws_autoscaling_group" "workers" {
-  name = "${var.cluster_name}-${var.pool_name}-workers"
+  #name = "${var.cluster_name}-${var.pool_name}-workers"
+  name = "${var.pool_name}-worker"

   # count
   desired_capacity          = var.worker_count
@@ -17,8 +18,8 @@ resource "aws_autoscaling_group" "workers" {

   # target groups to which instances should be added
   target_group_arns = flatten([
-    aws_lb_target_group.workers-http.id,
-    aws_lb_target_group.workers-https.id,
+    aws_lb_target_group.workers_http.id,
+    aws_lb_target_group.workers_https.id,
     var.target_groups,
   ])
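If Terraform still plans to replace the renamed target groups after this patch, moving the existing state entries should avoid recreating them; a sketch, with resource addresses assumed from the module layout above (check terraform state list for the real ones):

terraform state mv 'module.workers.aws_lb_target_group.workers-http' 'module.workers.aws_lb_target_group.workers_http'
terraform state mv 'module.workers.aws_lb_target_group.workers-https' 'module.workers.aws_lb_target_group.workers_https'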

@invidian
Member

With the patch posted above, the worker group can be gracefully replaced (the new one is created before the previous one is destroyed). How about we use this patch for the next release, then rename the worker group in a later release and say in the documentation that users should upgrade to v0.2.0 before upgrading to v0.2.1?

@iaguis
Contributor Author

iaguis commented Jun 17, 2020

I tested the patch and it works better.

How about we use this patch for the next release, then rename the worker group in a later release and say in the documentation that users should upgrade to v0.2.0 before upgrading to v0.2.1?

Then the worker group will be recreated when updating to v0.2.1, or am I missing something? In any case I'm fine with using this patch for now.

@invidian
Member

Then the worker group will be recreated when updating to v0.2.1, or am I missing something? In any case I'm fine with using this patch for now.

Yes, the worker group would have to be re-created in the next release.

@iaguis
Contributor Author

iaguis commented Jun 18, 2020

After #638, and following the steps mentioned in #609, upgrades work fine. Closing this now.

@iaguis iaguis closed this as completed Jun 18, 2020