Terraform module for a GKE Kubernetes Cluster in GCP
If you want to utilize this feature, make sure to declare a helm provider in your terraform configuration as follows.
provider "helm" {
  version = "2.1.2" # see https://github.com/terraform-providers/terraform-provider-helm/releases
  kubernetes {
    host                   = module.gke_cluster.cluster_endpoint
    token                  = data.google_client_config.google_client.access_token
    cluster_ca_certificate = module.gke_cluster.cluster_ca_certificate
  }
}
Pay attention to the gke_cluster module output variables used here.
Drop the use of attributes such as node_count_initial_per_zone and/or node_count_current_per_zone (if any) from the list of objects in var.node_pools.
While performing this upgrade, if you are using the namespace variable, you may run into one or more of the following errors:
- namespaces is forbidden
- User "system:serviceaccount:devops:default" cannot create resource "namespaces" in API group ""
- User "system:serviceaccount:devops:default" cannot get resource "namespaces" in API group ""
- Get "http://localhost/api/v1/namespaces/<namespace_name>": dial tcp 127.0.0.1:80: connect: connection refused
In order to fix this, you need to declare a kubernetes provider in your terraform configuration like the following.
provider "kubernetes" {
  version                = "1.13.3" # see https://github.com/terraform-providers/terraform-provider-kubernetes/releases
  load_config_file       = false
  host                   = module.gke_cluster.cluster_endpoint
  token                  = data.google_client_config.google_client.access_token
  cluster_ca_certificate = module.gke_cluster.cluster_ca_certificate
}
data "google_client_config" "google_client" {}
Pay attention to the gke_cluster module output variables used here.
This upgrade performs 2 changes:
- Move the declaration of kubernetes secrets into the declaration of kubernetes namespaces
  - see the Pull Request description at #7
- Ability to create multiple ingress IPs for istio
  - read below
Detailed steps provided below:
- Upgrade gke_cluster module version to 2.7.1
- Run terraform plan
  - DO NOT APPLY this plan
  - the plan may show that some istio resource(s) (if any are used) will be destroyed
  - we want to avoid any kind of destruction and/or recreation
  - P.S. to resolve any changes proposed for kubernetes_secret resource(s), please refer to this Pull Request description instead
- Set the istio_ip_names variable with at least one item as ["ip"]
  - this is so that the istio IP resource name is backward-compatible
- Run terraform plan
  - DO NOT APPLY this plan
  - now, the plan may show that a static_istio_ip resource (if any is used) will be destroyed and recreated under a new named index
  - we want to avoid any kind of destruction and/or recreation
  - P.S. to resolve any changes proposed for kubernetes_secret resource(s), please refer to this Pull Request description instead
- Move the terraform states
  - notice that the plan says your existing static_istio_ip resource (let's say istioIpX) will be destroyed and a new static_istio_ip resource (let's say istioIpY) will be created
  - pay attention to the array indexes:
    - the *X resources (the ones to be destroyed) start with array index [0], although the displayed plan may not show [0]
    - the *Y resources (the ones to be created) will show the array index as a new named index
  - Use terraform state mv to manually move the state of istioIpX to istioIpY
    - refer to https://www.terraform.io/docs/commands/state/mv.html to learn more about how to move Terraform state positions
    - once a resource is moved, it will say Successfully moved 1 object(s).
  - The purpose of this change is detailed in this wiki.
- Run terraform plan again
  - the plan should now show that no changes are required
  - this confirms that you have successfully moved all your resources' states to their new position as required by v2.7.1.
- DONE
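
The state move above can be sketched roughly as follows. The two addresses are assumptions for illustration only (in particular, the google_compute_address resource type is a guess); copy the exact addresses from your own terraform plan output. The move_state helper is hypothetical: it prints the command for review and only runs it when APPLY=1 is set.

```shell
# Hypothetical helper: print the state move for review, run it only when APPLY=1.
move_state() {
  if [ "${APPLY:-0}" = "1" ]; then
    terraform state mv "$1" "$2"
  else
    printf "terraform state mv '%s' '%s'\n" "$1" "$2"
  fi
}

# Assumed addresses; replace both with the ones shown in your plan.
# Single quotes keep the shell from touching [0] and ["ip"].
move_state \
  'module.gke_cluster.google_compute_address.static_istio_ip[0]' \
  'module.gke_cluster.google_compute_address.static_istio_ip["ip"]'
```

Running the script as-is only prints the command. Once the addresses match your plan, rerun it with APPLY=1 and then run terraform plan again to confirm that no changes are proposed.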
This upgrade will move the terraform states of arrays of ingress IPs and k8s namespaces from numbered indexes to named indexes. The purpose of this change is detailed in this wiki.
- Upgrade gke_cluster module version to 2.5.1
- Run terraform plan
  - DO NOT APPLY this plan
  - the plan will show that several resources will be destroyed and recreated under new named indexes
  - we want to avoid any kind of destruction and/or recreation
- Move the terraform states
  - notice that the plan says your existing static_ingress_ip resource(s) (let's say ingressIpX) will be destroyed and new static_ingress_ip resource(s) (let's say ingressIpY) will be created
  - also notice that the plan says your existing kubernetes_namespace resource(s) (let's say namespaceX) will be destroyed and new kubernetes_namespace resource(s) (let's say namespaceY) will be created
  - P.S. if you happen to have multiple static_ingress_ip resource(s) and kubernetes_namespace resource(s), then the plan will show these destructions and recreations multiple times. You will need to move the states for EACH of the respective resources one-by-one.
  - pay attention to the array indexes:
    - the *X resources (the ones to be destroyed) start with array index [0], although the displayed plan may not show [0]
    - the *Y resources (the ones to be created) will show array indexes as new named indexes
  - Use terraform state mv to manually move the states of each of ingressIpX to ingressIpY, and to move the states of each of namespaceX to namespaceY
    - refer to https://www.terraform.io/docs/commands/state/mv.html to learn more about how to move Terraform state positions
    - once a resource is moved, it will say Successfully moved 1 object(s).
    - repeat until all relevant states are moved to their desired positions
- Run terraform plan again
  - the plan should now show that no changes are required
  - this confirms that you have successfully moved all your resources' states to their new position as required by v2.5.1.
- DONE
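
When several resources have to be moved, it may help to write the old/new address pairs down first and generate the commands from that list. The sketch below assumes hypothetical addresses throughout: the google_compute_address type, the namespaces resource name, and the "ingress" and "devops" keys are placeholders, not values taken from the module. Take the real pairs from your terraform plan output.

```shell
# Assumed "old new" address pairs, one pair per line, separated by a space.
PAIRS='module.gke_cluster.google_compute_address.static_ingress_ip[0] module.gke_cluster.google_compute_address.static_ingress_ip["ingress"]
module.gke_cluster.kubernetes_namespace.namespaces[0] module.gke_cluster.kubernetes_namespace.namespaces["devops"]'

# Hypothetical helper: read pairs from stdin and print one state move per pair.
plan_moves() {
  while read -r from to; do
    [ -n "$from" ] || continue   # skip blank lines
    printf "terraform state mv '%s' '%s'\n" "$from" "$to"
  done
}

printf '%s\n' "$PAIRS" | plan_moves
```

The script only prints the commands. Review them, run them one by one, and finish with terraform plan to check that nothing is left to destroy or recreate.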
This upgrade process will:
- drop the use of auxiliary node pools (if any)
- create a new node pool under terraform's array structure
- migrate existing deployments/workloads from old node pool to new node pool
- delete old standalone node pool as it's no longer required
Detailed steps provided below:
- While on v2.2.2, remove the variables create_auxiliary_node_pool and auxiliary_node_pool_config.
  - run terraform plan & terraform apply
  - this will remove any auxiliary_node_pool that may have been there
- Upgrade gke_cluster module to v2.3.1 and set variable node_pools with its required params.
  - value of node_pool_name for the new node pool must be different from the name of the old node pool
  - run terraform plan & terraform apply
  - this will create a new node pool as per the specs provided in node_pools.
- Migrate existing deployments/workloads from old node pool to new node pool.
  - check status of nodes
    kubectl get nodes
    - confirm that all nodes from all node pools are shown
    - confirm that all nodes have status Ready
  - check status of pods
    kubectl get pods -o=wide
    - confirm that all pods have status Running
    - confirm that all pods are running on nodes from the old node pool
  - cordon the old node pool
    for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=<OLD_NODE_POOL_NAME> -o=name); do kubectl cordon "$node"; done
    - replace <OLD_NODE_POOL_NAME> with the correct value
  - check status of nodes
    kubectl get nodes
    - confirm that all nodes from the old node pools have status Ready,SchedulingDisabled
    - confirm that all nodes from the new node pools have status Ready
  - check status of pods
    kubectl get pods -o=wide
    - confirm that all pods still have status Running
    - confirm that all pods are still running on nodes from the old node pool
  - initiate rolling restart of all deployments
    kubectl rollout restart deployment <DEPLOYMENT_1_NAME> <DEPLOYMENT_2_NAME> <DEPLOYMENT_3_NAME>
    - replace <DEPLOYMENT_*_NAME> with correct names of existing deployments
  - check status of pods
    kubectl get pods -o=wide
    - confirm that some pods have status Running while some new pods have status ContainerCreating
    - confirm that the new pods with status ContainerCreating are scheduled on nodes from the new node pool
    - repeat status checks until all pods have status Running and all pods are running on nodes from the new node pool only
  - drain the old node pool
    for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=<OLD_NODE_POOL_NAME> -o=name); do kubectl drain --force --ignore-daemonsets --delete-local-data --grace-period=10 "$node"; done
    - replace <OLD_NODE_POOL_NAME> with the correct value
    - confirm that the response says evicting pod or evicted for all remaining pods in the old node pool
    - this step may take some time
  - Migration complete
- Upgrade gke_cluster module to v2.4.2 and remove use of any obsolete variables.
  - remove from the module the standalone variables that are no longer used for the standalone node pool, such as machine_type, disk_size_gb, node_count_initial_per_zone, node_count_min_per_zone, node_count_max_per_zone, node_count_current_per_zone
  - run terraform plan & terraform apply
  - this will remove the old node pool completely
- DONE
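
Before removing the old node pool in the last step, it can be worth checking what is still scheduled on it. The sketch below is one possible way to do that. The pods_on_pool helper and the OLD_NODE_POOL_NAME environment variable are hypothetical names introduced here; the node label is the one used in the cordon and drain commands above. Pods managed by DaemonSets are expected to remain, since the drain ignores them. Note also that newer kubectl releases rename the --delete-local-data flag used above to --delete-emptydir-data.

```shell
# Hypothetical helper: list pods (all namespaces) still placed on nodes of a pool.
pods_on_pool() {
  pool="$1"
  for node in $(kubectl get nodes -l "cloud.google.com/gke-nodepool=${pool}" -o name); do
    # "-o name" yields "node/<name>"; the field selector needs the bare node name.
    kubectl get pods --all-namespaces -o wide --no-headers \
      --field-selector "spec.nodeName=${node#node/}"
  done
}

# Runs only when a pool name is supplied through the environment.
if [ -n "${OLD_NODE_POOL_NAME:-}" ]; then
  pods_on_pool "$OLD_NODE_POOL_NAME"
fi
```

If the output lists only DaemonSet-managed pods (or nothing), the old node pool no longer carries your workloads.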
This upgrade assigns network tags to the node pool nodes. The upgrade process will:
- Create an auxiliary node pool.
- Move all workloads from the existing node pool to the auxiliary node pool
- Assign network tags to the existing node pool (which causes destruction and recreation of that node pool)
- Move all workloads back from the auxiliary node pool into the new node pool (which now has network tags)
- Then delete auxiliary node pool.
- While at v1.2.9, set create_auxiliary_node_pool to True - this will create a new additional node pool according to the values of var.auxiliary_node_pool_config before proceeding with the breaking change.
  - Run terraform apply
- Migrate all workloads from existing node pool to the newly created auxiliary node pool
- Follow these instructions
- Upgrade gke_cluster module to v1.3.0 - this will destroy and recreate the GKE node pool while the auxiliary node pool from step 1 continues to serve the requests of the GKE cluster
  - Run terraform apply
- Migrate all workloads back from the auxiliary node pool to the newly created node pool
- Follow these instructions
- While at v1.3.0, set create_auxiliary_node_pool to False - this will destroy the auxiliary node pool that was created in step 1 as it is no longer needed now
  - Run terraform apply
- Done
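
Since the upgrade to v1.3.0 destroys the existing node pool, it seems prudent to confirm that the auxiliary node pool is fully schedulable before migrating workloads onto it. A minimal sketch of such a check follows. The pool_is_ready helper and the AUX_NODE_POOL_NAME environment variable are hypothetical names, and the sketch assumes GKE's cloud.google.com/gke-nodepool node label. A cordoned node (Ready,SchedulingDisabled) deliberately counts as not ready here.

```shell
# Hypothetical helper: succeed only if the pool has at least one node
# and every node reports exactly "Ready" in the STATUS column.
pool_is_ready() {
  kubectl get nodes -l "cloud.google.com/gke-nodepool=$1" --no-headers |
    awk '{ n++; if ($2 != "Ready") bad++ } END { exit ((n > 0 && bad == 0) ? 0 : 1) }'
}

# Runs only when a pool name is supplied through the environment.
if [ -n "${AUX_NODE_POOL_NAME:-}" ]; then
  if pool_is_ready "$AUX_NODE_POOL_NAME"; then
    echo "auxiliary node pool is ready"
  else
    echo "auxiliary node pool is NOT ready"
  fi
fi
```

The same check can be reused after the upgrade to confirm that the recreated node pool is ready before moving workloads back.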