
AWX task and web pods incorrectly scale up during restore #1567

Closed
3 tasks done
flxbwr opened this issue Sep 25, 2023 · 4 comments
flxbwr commented Sep 25, 2023

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX Operator is open source software provided for free and that I might not receive a timely response.

Feature Summary

Hello!
Recently, I performed an automated restore of AWX (to verify the consistency of our backups) and found that the standard restore tool does not work.
AWX Operator version: 2.5.2
Kubernetes/Platform version: 1.25.5
While investigating the cause, I found two problems:

  1. An incorrect variable, {{ ansible_operator_meta.name }}, in the task "Scale down Deployment for migration" in awx-operator/roles/restore/tasks/postgres.yml. I can see that this has already been fixed in 2.5.3. I applied that fix to my installation, but it did not solve the problem, so I continued looking.
  2. By the time awx-operator/roles/restore/tasks/postgres.yml runs, the AWX CR with the replicas value taken from the backup is already deployed in my cluster. Based on it, the awx-operator creates the "awx-task" and "awx-web" Deployments with the replicas value from the AWX CR. Before starting pg_restore, the restore role scales the "awx-task" and "awx-web" Deployments down to replicas = 0 so that they do not interfere with recovery:
```yaml
- name: Scale down Deployment for migration
  k8s_scale:
    api_version: apps/v1
    kind: Deployment
    name: "{{ item }}"
    namespace: "{{ ansible_operator_meta.namespace }}"
    replicas: 0
    wait: yes
  loop:
    - "{{ deployment_name }}-task"
    - "{{ deployment_name }}-web"
  when: this_deployment['resources'] | length
```

This is where the second problem lies: that step is not enough.
In my case, pg_restore takes more than 10 minutes. During this time the awx-operator again reads the replicas value from the AWX CR, edits the "awx-task" and "awx-web" Deployments back to their original replica counts, and the resulting pods connect to a database that has not yet been restored. As a result, the database is not restored correctly.

I solved my problem as follows: until the database is fully restored, I set replicas = 0 in the AWX CR so that the pods do not try to connect to the database. As soon as the DB recovery process ends, I set the AWX CR's replicas back to their original value.
This solution let me restore AWX successfully on a schedule.
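For reference, a minimal sketch of that workaround as Ansible tasks. This is my own illustration, not code from awx-operator: the instance name `awx-demo`, the namespace, and the restored replica count are assumptions for the example.

```yaml
# Sketch of the workaround: zero out replicas in the AWX CR before
# pg_restore, then patch the original value back afterwards.
# "awx-demo" and "replicas: 1" are placeholder values.
- name: Scale the AWX CR down before pg_restore
  kubernetes.core.k8s:
    api_version: awx.ansible.com/v1beta1
    kind: AWX
    name: awx-demo                # assumed instance name
    namespace: awx
    definition:
      spec:
        replicas: 0

# ... run pg_restore here ...

- name: Restore the original replica count after pg_restore
  kubernetes.core.k8s:
    api_version: awx.ansible.com/v1beta1
    kind: AWX
    name: awx-demo
    namespace: awx
    definition:
      spec:
        replicas: 1               # the value saved before scaling down
```

Because the operator derives the Deployment replica counts from the CR, patching the CR (rather than the Deployments) keeps the reconcile loop from undoing the scale-down mid-restore.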

It seems to me that the approach described in the second point would be useful in awx-operator itself.
I am ready to prepare a pull request myself if you think this approach is correct.

@fosterseth
Member

@flxbwr good find!

@TheRealHaoLiu had some ideas for approaching this problem

@TheRealHaoLiu
Member

I would love to implement a "pause" annotation for the AWX resource and have the awx-operator stop reconciling the AWX resource while it's paused.

When the restore finishes, we can unpause by removing the annotation, which should trigger the operator to re-reconcile the AWX resource and restore all the changes we made to the deployment.
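From the user's side, such a pause might look like the following. The annotation key `awx.ansible.com/paused` is purely hypothetical here, not a confirmed operator API:

```yaml
# Hypothetical: an AWX CR carrying a pause annotation that the operator
# would check before reconciling. The annotation key is an assumption.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx-demo
  annotations:
    awx.ansible.com/paused: "true"   # hypothetical key; operator skips reconcile while set
spec:
  replicas: 1
```

Unpausing would then just be removing the annotation (e.g. `kubectl -n awx annotate awx awx-demo awx.ansible.com/paused-`, with the same hypothetical key), after which the operator would reconcile the resource back to its declared state.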

@TheRealHaoLiu TheRealHaoLiu changed the title Restore role AWX task and web pods incorrectly scale up during restore Oct 6, 2023
@rooftopcellist
Member

@flxbwr Very good find indeed! I saw this in a couple of other issues as well and was able to reproduce it after populating my AWX db with ~2 Gi of data, then trying a backup and restore. A db of that size can make pg_restore last long enough that the installer role interferes, as you said.

We just merged a fix that pauses reconciliation if there is a Restore object whose deployment_name matches the AWX instance currently being reconciled.
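As a rough illustration of that check (a sketch of the idea, not the actual merged code; task names and the exact filter are assumptions):

```yaml
# Sketch: at the top of the AWX reconcile role, look for AWXRestore
# objects targeting this instance and stop reconciling if one exists.
- name: Look for restores targeting this deployment
  kubernetes.core.k8s_info:
    api_version: awx.ansible.com/v1beta1
    kind: AWXRestore
    namespace: "{{ ansible_operator_meta.namespace }}"
  register: restore_objects

- name: Pause reconciliation while a restore for this instance is in progress
  meta: end_play
  when: restore_objects.resources
        | selectattr('spec.deployment_name', 'defined')
        | selectattr('spec.deployment_name', 'equalto', ansible_operator_meta.name)
        | list | length > 0
```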

Thank you for your thorough explanation and workaround on this issue! I think we can close it now. Please open a new issue if you still see this behavior after the change.


flxbwr commented Dec 11, 2023

@rooftopcellist Thank you for solving the issue!
