Make sure upgrades from v0.1.0 work well #484
I've done some initial research on this; here are my findings.

### Control plane (and kubelet) updates

This worked fine without any changes.

### Components

Updates for some components didn't work, for two reasons.

#### Helm resource conflicts

For example:

AFAIU this happens because some of the resources we create changed apiVersion and Helm is not able to deal with that. A way to handle this was added in helm/helm#7649, which was included in Helm v3.2.0.

#### Helm doesn't support replacing resources with immutable fields

Some updates change resources that have immutable fields, so trying to do that results in a failure:
Helm doesn't support a way to handle these, but there's a PR open that does just that, scheduled for v3.3.0. The way it works is that it deletes and recreates the object if there's a conflict.

### Fixing Helm issues

I have a branch where I updated the Helm version we use to v3.2.1. We already have a fork of Helm, so I rebased our Helm patch onto v3.2.1 and added the "recreate" PR that adds support for replacing resources with immutable fields: https://github.com/kinvolk/helm/tree/iaguis/helm-3.2.1-upgrade-rollback

Here's my Lokomotive branch with all of this: https://github.com/kinvolk/lokomotive/tree/iaguis/upgrade-helm

### Testing it all

To test this, I created a Packet cluster with Lokomotive v0.1.0 with as many components as I could (I based it on our current CI configuration) and then ran the update. Here are the things I needed to do and what I've found.

#### Annotate and label some objects

To make Helm adopt objects that changed apiVersion, we need to add some annotations and labels to them. I found 4 components that needed this (a sketch of the required metadata follows the list):

- Dex
- Gangway
- Metrics Server
- httpbin
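Helm 3.2's adoption logic (from helm/helm#7649) looks for release ownership metadata on the existing objects. As a rough sketch, assuming the Dex Deployment lives in a `dex` namespace and the Helm release is also called `dex` (the real object kinds, namespaces and release names depend on the component), the metadata can be added like this:

```sh
# Hedged sketch: mark an existing object so Helm 3.2+ is willing to adopt it.
# Namespace, object and release names here are assumptions; adjust per component.
kubectl -n dex annotate deployment dex \
  meta.helm.sh/release-name=dex \
  meta.helm.sh/release-namespace=dex
kubectl -n dex label deployment dex \
  app.kubernetes.io/managed-by=Helm
```

The same pattern applies to the other three components, just with their own namespaces, objects and release names.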
#### Transient failure

##### cert-manager

The first time you try to update cert-manager you get this:
It works after the first failure. Maybe there's a way to retry here so users don't have to run apply twice?
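Until there's built-in retry support, a small client-side retry around the component update works around this. A minimal sketch, assuming the component is updated with `lokoctl component apply` (the exact subcommand is an assumption and may differ between Lokomotive versions):

```sh
# Hedged sketch: retry the cert-manager component update a few times, since
# the first attempt can fail while the webhook deployment is being recreated.
for attempt in 1 2 3; do
  if lokoctl component apply cert-manager; then
    break
  fi
  echo "attempt ${attempt} failed, retrying in 30s..." >&2
  sleep 30
done
```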
#### Changes required to cluster configuration

##### External DNS

We should call this out in the release notes.

#### Changes recommended in cluster configuration

##### Dex + Gangway

We should add to the release notes that we recommend adding the "oidc" field in the cluster section for Dex + Gangway users after #182 is merged.

##### Components support monitoring now

Some components gained support for monitoring.

### Conclusion

After upgrading Helm and adding the "recreate" PR I mentioned, I could successfully update a Lokomotive v0.1.0 cluster to current master, with some extra steps and one hiccup that required me to run apply twice. I think the next steps should be:
Nice work @iaguis. IMO upgrading Helm to the latest version and pulling in this extra patch for recreate seems reasonable.
Hm, maybe the default strategy used for the webhook deployment makes it unavailable for some time, hence it times out? Or maybe it does not wait until it becomes ready or something? I'll test.
Okay, this happens because we re-create the deployment object, which causes all pods to shut down. And we re-create it because of:
One way I could think of for retrying updates would be to have a controller which would continuously ensure that the configuration is up to date, but this is beyond our capabilities right now.
Issue while upgrading an AWS cluster:

Error: Missing required argument
on ../lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers.tf line 1, in module "workers":
1: module "workers" {
The argument "pool_name" is required, but no definition was found.
Error: Missing required argument
on ../lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers.tf line 1, in module "workers":
1: module "workers" {
The argument "cluster_name" is required, but no definition was found.
Error: Missing required argument
on ../lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers.tf line 1, in module "workers":
1: module "workers" {
The argument "lb_arn" is required, but no definition was found.
Error: Unsupported argument
on ../lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers.tf line 3, in module "workers":
3: name = var.cluster_name
An argument named "name" is not expected here.
Solution: either remove the stale file:

rm lokomotive-assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers.tf

or back it up before re-running the update.
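If you'd rather keep a copy instead of deleting outright, something like this works (the backup path below is just an example, not part of the original report):

```sh
# Keep a copy of the stale module file before removing it, in case you want
# to compare or roll back; the backup destination is arbitrary.
cp lokomotive-assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers.tf \
   /tmp/workers.tf.v0.1.0.bak
rm lokomotive-assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers.tf
```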
This was needed to upgrade Calico:
|
I also hit hashicorp/terraform-provider-aws#8305; to work around it I had to run:
Destroying an AWS LB is highly disruptive. LBs typically take some time to get destroyed (mainly due to automatic connection draining) and then more time to get re-created. Ideally we shouldn't touch any AWS LBs when upgrading clusters. I also don't see a logical reason to do so (but I'm aware there are currently technical problems around that). In general, LBs usually remain untouched when maintaining compute nodes. An EC2 machine can get gracefully detached from an LB, updated/replaced/whatever and then re-attached.
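For reference, that detach/re-attach flow can be done with the plain AWS CLI; the target group ARN and instance ID below are placeholders, not values from this cluster:

```sh
# Hedged sketch: drain an instance out of an NLB target group, do the
# maintenance, then register it again. ARN and instance ID are placeholders.
aws elbv2 deregister-targets \
  --target-group-arn "$TARGET_GROUP_ARN" \
  --targets Id=i-0123456789abcdef0
# ...update or replace the instance here...
aws elbv2 register-targets \
  --target-group-arn "$TARGET_GROUP_ARN" \
  --targets Id=i-0123456789abcdef0
```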
More findings from AWS:
diff --git a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/ingress.tf b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/ingress.tf
index 16b42576..9693170a 100644
--- a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/ingress.tf
+++ b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/ingress.tf
@@ -5,7 +5,7 @@ resource "aws_lb_listener" "ingress-http" {
default_action {
type = "forward"
- target_group_arn = aws_lb_target_group.workers-http.arn
+ target_group_arn = aws_lb_target_group.workers_http.arn
}
}
@@ -16,7 +16,7 @@ resource "aws_lb_listener" "ingress-https" {
default_action {
type = "forward"
- target_group_arn = aws_lb_target_group.workers-https.arn
+ target_group_arn = aws_lb_target_group.workers_https.arn
}
}
@@ -73,3 +73,55 @@ resource "aws_lb_target_group" "workers-https" {
PoolName = var.pool_name
}
}
+
+resource "aws_lb_target_group" "workers_http" {
+ vpc_id = var.vpc_id
+ target_type = "instance"
+
+ protocol = "TCP"
+ port = 30080
+
+ # HTTP health check for ingress
+ health_check {
+ protocol = "TCP"
+ port = 30080
+
+ # NLBs required to use same healthy and unhealthy thresholds
+ healthy_threshold = 3
+ unhealthy_threshold = 3
+
+ # Interval between health checks required to be 10 or 30
+ interval = 10
+ }
+
+ tags = {
+ ClusterName = var.cluster_name
+ PoolName = var.pool_name
+ }
+}
+
+resource "aws_lb_target_group" "workers_https" {
+ vpc_id = var.vpc_id
+ target_type = "instance"
+
+ protocol = "TCP"
+ port = 30443
+
+ # HTTP health check for ingress
+ health_check {
+ protocol = "TCP"
+ port = 30443
+
+ # NLBs required to use same healthy and unhealthy thresholds
+ healthy_threshold = 3
+ unhealthy_threshold = 3
+
+ # Interval between health checks required to be 10 or 30
+ interval = 10
+ }
+
+ tags = {
+ ClusterName = var.cluster_name
+ PoolName = var.pool_name
+ }
+}
diff --git a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/outputs.tf b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/outputs.tf
index 69e6a589..5bbc7149 100644
--- a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/outputs.tf
+++ b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/outputs.tf
@@ -1,9 +1,9 @@
output "target_group_http" {
description = "ARN of a target group of workers for HTTP traffic"
- value = aws_lb_target_group.workers-http.arn
+ value = aws_lb_target_group.workers_http.arn
}
output "target_group_https" {
description = "ARN of a target group of workers for HTTPS traffic"
- value = aws_lb_target_group.workers-https.arn
+ value = aws_lb_target_group.workers_https.arn
}
diff --git a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/workers.tf b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/workers.tf
index 26668621..c7e341be 100644
--- a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/workers.tf
+++ b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/workers.tf
@@ -19,6 +19,8 @@ resource "aws_autoscaling_group" "workers" {
target_group_arns = flatten([
aws_lb_target_group.workers-http.id,
aws_lb_target_group.workers-https.id,
+ aws_lb_target_group.workers_http.id,
+ aws_lb_target_group.workers_https.id,
var.target_groups,
])
diff --git a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/ingress.tf b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/ingress.tf
index 9693170a..3070d967 100644
--- a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/ingress.tf
+++ b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/ingress.tf
@@ -20,60 +20,6 @@ resource "aws_lb_listener" "ingress-https" {
}
}
-resource "aws_lb_target_group" "workers-http" {
- vpc_id = var.vpc_id
- target_type = "instance"
-
- protocol = "TCP"
- port = 80
-
- # HTTP health check for ingress
- health_check {
- protocol = "HTTP"
- port = 10254
- path = "/healthz"
-
- # NLBs required to use same healthy and unhealthy thresholds
- healthy_threshold = 3
- unhealthy_threshold = 3
-
- # Interval between health checks required to be 10 or 30
- interval = 10
- }
-
- tags = {
- ClusterName = var.cluster_name
- PoolName = var.pool_name
- }
-}
-
-resource "aws_lb_target_group" "workers-https" {
- vpc_id = var.vpc_id
- target_type = "instance"
-
- protocol = "TCP"
- port = 443
-
- # HTTP health check for ingress
- health_check {
- protocol = "HTTP"
- port = 10254
- path = "/healthz"
-
- # NLBs required to use same healthy and unhealthy thresholds
- healthy_threshold = 3
- unhealthy_threshold = 3
-
- # Interval between health checks required to be 10 or 30
- interval = 10
- }
-
- tags = {
- ClusterName = var.cluster_name
- PoolName = var.pool_name
- }
-}
-
resource "aws_lb_target_group" "workers_http" {
vpc_id = var.vpc_id
target_type = "instance"
diff --git a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/workers.tf b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/workers.tf
index c7e341be..594492c7 100644
--- a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/workers.tf
+++ b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/workers.tf
@@ -17,8 +17,6 @@ resource "aws_autoscaling_group" "workers" {
# target groups to which instances should be added
target_group_arns = flatten([
- aws_lb_target_group.workers-http.id,
- aws_lb_target_group.workers-https.id,
aws_lb_target_group.workers_http.id,
aws_lb_target_group.workers_https.id,
    var.target_groups,
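An alternative to carrying both the old and the new target groups across two applies would be to move the existing resources to their new addresses directly in the Terraform state, assuming we can run Terraform against the generated assets. A minimal sketch; the module address is an assumption and depends on how the workers module is instantiated:

```sh
# Hedged sketch: rename the target group resources in state instead of
# destroying and recreating them. The module path is an assumption.
terraform state mv \
  'module.workers.aws_lb_target_group.workers-http' \
  'module.workers.aws_lb_target_group.workers_http'
terraform state mv \
  'module.workers.aws_lb_target_group.workers-https' \
  'module.workers.aws_lb_target_group.workers_https'
```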
With the following patch, the upgrade process is much smoother and the autoscaling group is not recreated:

diff --git a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/ingress.tf b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/ingress.tf
index 10ad5c8a..0c5103b2 100644
--- a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/ingress.tf
+++ b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/ingress.tf
@@ -5,7 +5,7 @@ resource "aws_lb_listener" "ingress-http" {
default_action {
type = "forward"
- target_group_arn = aws_lb_target_group.workers-http.arn
+ target_group_arn = aws_lb_target_group.workers_http.arn
}
}
@@ -16,11 +16,11 @@ resource "aws_lb_listener" "ingress-https" {
default_action {
type = "forward"
- target_group_arn = aws_lb_target_group.workers-https.arn
+ target_group_arn = aws_lb_target_group.workers_https.arn
}
}
-resource "aws_lb_target_group" "workers-http" {
+resource "aws_lb_target_group" "workers_http" {
vpc_id = var.vpc_id
target_type = "instance"
@@ -45,7 +45,7 @@ resource "aws_lb_target_group" "workers-http" {
}
}
-resource "aws_lb_target_group" "workers-https" {
+resource "aws_lb_target_group" "workers_https" {
vpc_id = var.vpc_id
target_type = "instance"
diff --git a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/outputs.tf b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/outputs.tf
index 69e6a589..5bbc7149 100644
--- a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/outputs.tf
+++ b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/outputs.tf
@@ -1,9 +1,9 @@
output "target_group_http" {
description = "ARN of a target group of workers for HTTP traffic"
- value = aws_lb_target_group.workers-http.arn
+ value = aws_lb_target_group.workers_http.arn
}
output "target_group_https" {
description = "ARN of a target group of workers for HTTPS traffic"
- value = aws_lb_target_group.workers-https.arn
+ value = aws_lb_target_group.workers_https.arn
}
diff --git a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/workers.tf b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/workers.tf
index 26668621..908820dd 100644
--- a/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/workers.tf
+++ b/assets/lokomotive-kubernetes/aws/flatcar-linux/kubernetes/workers/workers.tf
@@ -1,6 +1,7 @@
# Workers AutoScaling Group
resource "aws_autoscaling_group" "workers" {
- name = "${var.cluster_name}-${var.pool_name}-workers"
+ #name = "${var.cluster_name}-${var.pool_name}-workers"
+ name = "${var.pool_name}-worker"
# count
desired_capacity = var.worker_count
@@ -17,8 +18,8 @@ resource "aws_autoscaling_group" "workers" {
# target groups to which instances should be added
target_group_arns = flatten([
- aws_lb_target_group.workers-http.id,
- aws_lb_target_group.workers-https.id,
+ aws_lb_target_group.workers_http.id,
+ aws_lb_target_group.workers_https.id,
var.target_groups,
])
After the patch posted above, the worker group can be gracefully replaced (the new one is created before the previous one is destroyed). How about we use this patch for the next release, then rename the worker group in a later release, and in the documentation say that users should upgrade through the intermediate release first?
I tested the patch and it works better.
Then the worker group will be recreated when updating to v0.2.1, or am I missing something? In any case I'm fine with using this patch for now.
Yes, the worker group would have to be re-created in the next release.
Before releasing v0.2.0 we should make sure updates from v0.1.0 work automatically or, if user intervention is needed, document the steps required.