Magnum cluster upgrade still references old template #317

Closed
jessica-hofmeister opened this issue Mar 6, 2024 · 4 comments · Fixed by #318 or #349

Comments

@jessica-hofmeister

Create two cluster templates: one with the ubuntu-2204-kube-v1.27.3 image and one with the ubuntu-2204-kube-v1.27.4 image. Create a cluster with the 1.27.3 template. After it comes up healthy, upgrade it to the 1.27.4 template. The cluster reaches UPGRADE_COMPLETE status, but still references the old template.

Relevant images:

openstack image list        
+--------------------------------------+-------------------------------------+--------+
| ID                                   | Name                                | Status |
+--------------------------------------+-------------------------------------+--------+
| 3097dcc8-8c26-473c-a04b-55fe88a65e13 | ubuntu-2204-kube-v1.27.3            | active |
| 4baeb5c2-42cb-457e-81cf-885b100c19a2 | ubuntu-2204-kube-v1.27.4            | active |
+--------------------------------------+-------------------------------------+--------+ 

Create a cluster template for 1.27.3:

openstack coe cluster template create \
  --image 3097dcc8-8c26-473c-a04b-55fe88a65e13 \
  --coe kubernetes \
  --flavor m1.medium \
  --master-flavor m1.medium \
  --external-network public \
  --master-lb-enabled \
  --floating-ip-disabled \
  --network-driver calico \
  --docker-storage-driver overlay2 \
  --label kube_tag=v1.27.3 \
  --label boot_volume_size=40 \
  --label boot_volume_type=rbd1 \
  --label master_lb_floating_ip_enabled=false \
  --label audit_log_enabled=true \
  --label os_distro=ubuntu \
  test-v1.27.3

Create a cluster template for 1.27.4:

openstack coe cluster template create \
  --image 4baeb5c2-42cb-457e-81cf-885b100c19a2 \
  --coe kubernetes \
  --flavor m1.medium \
  --master-flavor m1.medium \
  --external-network public \
  --master-lb-enabled \
  --floating-ip-disabled \
  --network-driver calico \
  --docker-storage-driver overlay2 \
  --label kube_tag=v1.27.4 \
  --label boot_volume_size=40 \
  --label boot_volume_type=rbd1 \
  --label master_lb_floating_ip_enabled=false \
  --label audit_log_enabled=true \
  --label os_distro=ubuntu \
  test-v1.27.4 

Create a cluster using the test-v1.27.3 template:

openstack coe cluster create \
 --cluster-template test-v1.27.3 \
 --master-count 1 \
 --node-count 1 \
 --fixed-network dev-k8s \
 --keypair svc-account \
 --floating-ip-disabled \
 test-cluster 

Wait for the cluster to come up healthy:

kubectl get nodes
NAME                                          STATUS   ROLES                  AGE     VERSION
kube-db673-control-plane-ndpfj-fqf56          Ready    control-plane,master   4m47s   v1.27.3
kube-db673-default-worker-infra-g28k4-5psxz   Ready    worker                 3m43s   v1.27.3 

List the openstack templates:

openstack coe cluster template list
+--------------------------------------+--------------+------+
| uuid                                 | name         | tags |
+--------------------------------------+--------------+------+
| 68813151-763b-4fb5-b2e0-c254f1ad4b42 | test-v1.27.3 | None |
| 063f2a66-a994-4d81-aa00-0442359a333e | test-v1.27.4 | None |
+--------------------------------------+--------------+------+

See what template the cluster has currently:

openstack coe cluster show test-cluster -f value -c cluster_template_id            
68813151-763b-4fb5-b2e0-c254f1ad4b42 

Upgrade the cluster to the test-v1.27.4 template:

openstack coe cluster upgrade test-cluster test-v1.27.4
Request to upgrade cluster test-cluster has been accepted. 
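
While waiting, the upgrade status can also be polled directly from Magnum; something like the following (just the standard output filtering options) should eventually report UPGRADE_COMPLETE:

openstack coe cluster show test-cluster -f value -c status -c status_reason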

Wait for the cluster to finish upgrading:

kubectl get nodes
NAME                                          STATUS   ROLES                  AGE   VERSION
kube-db673-control-plane-7bkxd-55gnv          Ready    control-plane,master   28m   v1.27.4
kube-db673-default-worker-infra-zq68k-jtl8z   Ready    worker                 24m   v1.27.4 

The actual instances show that they are using the image from the test-v1.27.4 template.

See what template the cluster has currently:

openstack coe cluster show test-cluster -f value -c cluster_template_id
68813151-763b-4fb5-b2e0-c254f1ad4b42 

Notice that Magnum still reports the current template as test-v1.27.3, even though the upgrade to test-v1.27.4 completed.
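
To double-check which template that UUID maps to (just resolving the UUID from the template list above):

openstack coe cluster template show 68813151-763b-4fb5-b2e0-c254f1ad4b42 -f value -c name
test-v1.27.3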

And now...the test that shows it breaks:
Attempt to scale up the cluster to 2 worker nodes:

openstack coe cluster resize test-cluster 2
Request to resize cluster test-cluster has been accepted. 

The cluster resize actually fails. The details after attempting the resize are below; note the difference between coe_version and the kube_tag label (a quicker way to pull just those fields follows the output).

openstack coe cluster show test-cluster -f yaml                        
status: UPDATE_FAILED
health_status: HEALTHY
cluster_template_id: 68813151-763b-4fb5-b2e0-c254f1ad4b42
node_addresses: []
uuid: 9f5066cf-3985-4a75-b350-f731370e3d7b
stack_id: kube-db673
status_reason: 'admission webhook "validation.cluster.cluster.x-k8s.io" denied the
  request: Cluster.cluster.x-k8s.io "kube-db673" is invalid: spec.topology.version:
  Invalid value: "v1.27.3": version cannot be decreased from "1.27.4" to "1.27.3"'
created_at: '2024-03-04T17:21:28+00:00'
updated_at: '2024-03-04T18:29:17+00:00'
coe_version: v1.27.4
labels:
  audit_log_enabled: 'true'
  boot_volume_size: '40'
  boot_volume_type: rbd1
  kube_tag: v1.27.3
  master_lb_floating_ip_enabled: 'false'
  os_distro: ubuntu
labels_overridden: {}
labels_skipped: {}
labels_added: {}
fixed_network: dev-k8s
fixed_subnet: null
floating_ip_enabled: false
faults:
  default-worker: 'admission webhook "validation.cluster.cluster.x-k8s.io" denied
    the request: Cluster.cluster.x-k8s.io "kube-db673" is invalid: spec.topology.version:
    Invalid value: "v1.27.3": version cannot be decreased from "1.27.4" to "1.27.3"'
keypair: svc-account
api_address: https://172.22.4.228:6443
master_addresses: []
master_lb_enabled: true
create_timeout: 60
node_count: 1
discovery_url: null
docker_volume_size: null
master_count: 1
container_version: null
name: test-cluster
master_flavor_id: m1.medium
flavor_id: m1.medium
health_status_reason:
  kube-db673-default-worker-v58w9-vphx2-59nqs.Ready: 'True'
  kube-db673-ppj9w-fztwp.Ready: 'True'
project_id: 402f35ab1fa340d5834c55e6a2d4c32f 
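
As mentioned above, a quicker way to see just the two conflicting fields (assuming the usual -c column filtering applies to coe cluster show as well):

openstack coe cluster show test-cluster -f yaml -c coe_version -c labels

coe_version reports v1.27.4 while labels.kube_tag is still v1.27.3, which matches the stale template reference shown earlier.
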
@jessica-hofmeister
Author

One other interesting thing: doing a server show on one of the instances in the cluster shows image_id: None (since it is booted from a volume), while the UI shows the 1.27.4 image.

| id            | 48a735e5-f3ea-4e4c-8e00-fc33525582f3 |
| image         | N/A (booted from volume)             |
| imageRef      | None                                 |
| image_id      | None                                 |
| instance_name | None                                 |

(screenshot attached: the UI showing the instance's 1.27.4 image)
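
Since the instances are booted from volume, the image details live on the boot volume rather than on the server record. A rough way to confirm what the node actually booted from (<boot-volume-id> is a placeholder for the instance's root volume) is something like:

openstack volume show <boot-volume-id> -c volume_image_metadata

The volume_image_metadata field should include the source image name, which is presumably where the UI is picking up the 1.27.4 image.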

@mnaser
Member

mnaser commented Mar 7, 2024

Oh, this might be far more complicated. I think during an upgrade, update_cluster_status might be running somewhere else and overriding the save that happens here:

https://github.com/openstack/magnum/blob/35374b4380db673f9b61cb18da0f9382dcc00fce/magnum/conductor/handlers/cluster_conductor.py#L368-L369

We actually need to set up some lock, or have the cluster update sync code pull the cluster template from the Magnum resource.

@jessica-hofmeister
Author

After scaling the magnum conductor down to a single replica, we retested and the results are exactly the same: each node upgrades to the new Kubernetes version, but the cluster itself still references the old template ID.
