I understand that AWX Operator is open source software provided for free and that I might not receive a timely response.
Feature Summary
Hello!
Recently, I performed an automated restore of AWX (to check the consistency of our backups).
I found that the standard restore tool does not work.
AWX Operator version: 2.5.2
Kubernetes/Platform version: 1.25.5
Investigating the reasons for this, I found two problems:
1. Incorrect variable {{ ansible_operator_meta.name }} in the task "- name: Scale down Deployment for migration" in awx-operator/roles/restore/tasks/postgres.yml. I can see that this problem has already been fixed in 2.5.3.
I applied that fix in my installation, but it did not solve the problem, so I continued looking for a solution.
2. At the time awx-operator/roles/restore/tasks/postgres.yml runs, the AWX custom resource, with the replicas value taken from the backup, is already deployed in my cluster. Based on it, the awx-operator creates the "awx-task" and "awx-web" deployments with that replicas value. Before starting pg_restore, you scale the "awx-task" and "awx-web" deployments to replicas = 0 so that they do not interfere with recovery.
Here lies the second problem. This step is not enough.
In my case, pg_restore takes more than 10 minutes. During this time, the awx-operator again reads the replicas value from the AWX custom resource, sets the replicas of the "awx-task" and "awx-web" deployments back to the original value, and the "awx-task" and "awx-web" pods start and connect to a database that has not yet been restored. As a result, the database is not restored correctly.
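The race described above can be illustrated with a small toy model. This is only a sketch, not operator code: the function name, the one-minute time step, and the 4-minute reconcile interval are assumptions made for illustration. It shows the Deployment replica count over a 10-minute restore when the operator keeps copying the value from the AWX resource.

```python
def replicas_during_restore(restore_minutes, reconcile_every, cr_replicas):
    """Return the Deployment replica count seen at each minute of the restore.

    Toy model: the restore role has already scaled the Deployment to 0, and
    every `reconcile_every` minutes (an assumed interval) the operator resets
    the Deployment to the replicas value stored in the AWX resource.
    """
    deployment_replicas = 0  # state right after the scale-down task
    timeline = []
    for minute in range(1, restore_minutes + 1):
        if minute % reconcile_every == 0:
            # Operator pass: Deployment is overwritten from the AWX resource.
            deployment_replicas = cr_replicas
        timeline.append(deployment_replicas)
    return timeline


# Only the Deployment was scaled down; the AWX resource still says 1.
only_deployment_scaled = replicas_during_restore(10, 4, cr_replicas=1)

# For contrast: the AWX resource itself holds 0 for the whole restore.
cr_also_scaled = replicas_during_restore(10, 4, cr_replicas=0)
```

In the first case the pods come back partway through the restore; in the second they stay down until the value in the AWX resource is changed again.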
I solved my problem as follows: until the database is fully restored, I set the replicas value in the AWX custom resource to 0 so that the pods do not try to connect to the database. As soon as the database restore finishes, I set the replicas value back to its original state.
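A minimal sketch of this workaround, driven through kubectl from Python. It is not the script used in this issue. The helper names, the `run_restore` callable, the resource name `awxs.awx.ansible.com`, and the fallback to 1 replica when `spec.replicas` is unset are all assumptions for illustration.

```python
import json
import subprocess

AWX_RESOURCE = "awxs.awx.ansible.com"  # assumed resource name of the AWX CRD


def patch_replicas_cmd(name, namespace, replicas):
    """Build a kubectl command that sets spec.replicas on the AWX resource."""
    patch = json.dumps({"spec": {"replicas": replicas}})
    return ["kubectl", "patch", AWX_RESOURCE, name, "-n", namespace,
            "--type", "merge", "-p", patch]


def get_replicas_cmd(name, namespace):
    """Build a kubectl command that prints the current spec.replicas."""
    return ["kubectl", "get", AWX_RESOURCE, name, "-n", namespace,
            "-o", "jsonpath={.spec.replicas}"]


def restore_with_cr_scaled_down(name, namespace, run_restore, run=subprocess.run):
    """Hold the AWX resource at 0 replicas while run_restore() executes."""
    result = run(get_replicas_cmd(name, namespace),
                 check=True, capture_output=True, text=True)
    out = result.stdout.strip()
    original = int(out) if out else 1  # assumption: unset means 1 replica
    run(patch_replicas_cmd(name, namespace, 0), check=True)
    try:
        run_restore()  # hypothetical callable that waits for the restore to finish
    finally:
        # Put the original value back even if the restore fails.
        run(patch_replicas_cmd(name, namespace, original), check=True)
```

The `run` parameter exists only so that the sequence of commands can be checked without a cluster; by default it is `subprocess.run`.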
This solution helped me to successfully restore AWX on schedule.
It seems to me that the approach described for the second problem would be useful to apply in awx-operator.
I am ready to prepare a pull request myself if you think this approach is correct.
I would love to implement a "pause" annotation for the awx resource and have the awx-operator stop reconciling the awx resource while it is paused. When the restore finishes, we can unpause by removing the annotation, which should trigger the operator to re-reconcile the awx resource and restore all the changes we have made to the deployment.
@flxbwr Very good find indeed! I saw this in a couple of other issues as well and was able to reproduce it after populating my AWX db with ~2Gi of data, then trying a backup and restore. The size of the db can cause pg_restore to run long enough that the installer role interferes, as you said.
We just merged a fix which implements a pause if there is a Restore object with deployment_name that matches the AWX instance that is currently being reconciled.
Thank you for your thorough explanation and work-around on this issue! I think we can close it now. Please open a new issue if you still see the issue with this change.
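The pause condition described in this comment can be sketched as a simple predicate. This is an illustration only, not the merged operator code: the `RestoreObject` class and the function name are hypothetical. The thread also does not say whether the real check considers if the restore has already completed, so that is left out here.

```python
from dataclasses import dataclass


@dataclass
class RestoreObject:
    """Hypothetical stand-in for a Restore object found in the namespace."""
    name: str
    deployment_name: str  # the AWX instance this restore targets


def should_pause_reconcile(awx_name, restores):
    """Return True if any Restore object targets the AWX instance being reconciled."""
    return any(r.deployment_name == awx_name for r in restores)


# Usage: skip reconciling "awx-prod" while a matching Restore object exists.
restores = [RestoreObject(name="restore-1", deployment_name="awx-prod")]
pause_prod = should_pause_reconcile("awx-prod", restores)
pause_dev = should_pause_reconcile("awx-dev", restores)
```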
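For reference, the scale-down task from awx-operator/roles/restore/tasks/postgres.yml that the issue body refers to:

    - name: Scale down Deployment for migration
      k8s_scale:
        api_version: apps/v1
        kind: Deployment
        name: "{{ item }}"
        namespace: "{{ ansible_operator_meta.namespace }}"
        replicas: 0
        wait: yes
      loop:
      when: this_deployment['resources'] | length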