
AWX task and web pods incorrectly scale up during restore #1567

Closed
3 tasks done
flxbwr opened this issue Sep 25, 2023 · 4 comments
flxbwr commented Sep 25, 2023

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX Operator is open source software provided for free and that I might not receive a timely response.

Feature Summary

Hello!
Recently, I performed an automated restore of AWX (to verify the consistency of our backups) and found that the standard restore tool does not work.
AWX Operator version: 2.5.2
Kubernetes/Platform version: 1.25.5
While investigating the cause, I found two problems:

  1. An incorrect variable, {{ ansible_operator_meta.name }}, in the task "Scale down Deployment for migration" in awx-operator/roles/restore/tasks/postgres.yml. I can see that this has already been fixed in 2.5.3. I applied that fix to my installation, but it did not solve the problem, so I continued looking.
  2. By the time awx-operator/roles/restore/tasks/postgres.yml runs, the AWX CR with the replicas value taken from the backup is already deployed in my cluster. Based on it, the awx-operator creates the "awx-task" and "awx-web" Deployments with the replicas value from the AWX CR. Before starting pg_restore, the restore role scales the "awx-task" and "awx-web" Deployments down to replicas = 0 so that they do not interfere with recovery:
```yaml
- name: Scale down Deployment for migration
  k8s_scale:
    api_version: apps/v1
    kind: Deployment
    name: "{{ item }}"
    namespace: "{{ ansible_operator_meta.namespace }}"
    replicas: 0
    wait: yes
  loop:
    - "{{ deployment_name }}-task"
    - "{{ deployment_name }}-web"
  when: this_deployment['resources'] | length
```

This is where the second problem lies: that step is not enough.
In my case, pg_restore takes more than 10 minutes. During this time the awx-operator again reads the replicas value from the AWX CR, edits the "awx-task" and "awx-web" Deployments back to their original replica counts, and the resulting pods connect to a database that has not yet been restored. As a result, the database is not restored correctly.

I solved my problem as follows: until the database is fully restored, I set replicas = 0 in the AWX CR so that the pods do not try to connect to the database. As soon as the DB recovery process ends, I set the AWX CR's replicas back to their original value.
This solution let me restore AWX successfully on a schedule.
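For reference, a minimal sketch of that workaround as Ansible tasks. This is my own illustration, not code from awx-operator: the instance name `awx-demo`, the namespace, and the restored replica count are assumptions for the example.

```yaml
# Sketch of the workaround: zero out replicas in the AWX CR before
# pg_restore, then patch the original value back afterwards.
# "awx-demo" and "replicas: 1" are placeholder values.
- name: Scale the AWX CR down before pg_restore
  kubernetes.core.k8s:
    api_version: awx.ansible.com/v1beta1
    kind: AWX
    name: awx-demo                # assumed instance name
    namespace: awx
    definition:
      spec:
        replicas: 0

# ... run pg_restore here ...

- name: Restore the original replica count after pg_restore
  kubernetes.core.k8s:
    api_version: awx.ansible.com/v1beta1
    kind: AWX
    name: awx-demo
    namespace: awx
    definition:
      spec:
        replicas: 1               # the value saved before scaling down
```

Because the operator derives the Deployment replica counts from the CR, patching the CR (rather than the Deployments) keeps the reconcile loop from undoing the scale-down mid-restore.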

It seems to me that the approach described in the second point would be useful in awx-operator itself.
I am ready to prepare a pull request myself if you think this approach is correct.

@fosterseth
Member

@flxbwr good find!

@TheRealHaoLiu had some ideas for approaching this problem

@TheRealHaoLiu
Member

I would love to implement a "pause" annotation for the AWX resource and have the awx-operator stop reconciling the AWX resource while it's paused.

When the restore finishes, we can unpause by removing the annotation, which should trigger the operator to re-reconcile the AWX resource and restore all the changes we made to the deployment.
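From the user's side, such a pause might look like the following. The annotation key `awx.ansible.com/paused` is purely hypothetical here, not a confirmed operator API:

```yaml
# Hypothetical: an AWX CR carrying a pause annotation that the operator
# would check before reconciling. The annotation key is an assumption.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx-demo
  annotations:
    awx.ansible.com/paused: "true"   # hypothetical key; operator skips reconcile while set
spec:
  replicas: 1
```

Unpausing would then just be removing the annotation (e.g. `kubectl -n awx annotate awx awx-demo awx.ansible.com/paused-`, with the same hypothetical key), after which the operator would reconcile the resource back to its declared state.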

@TheRealHaoLiu TheRealHaoLiu changed the title Restore role AWX task and web pods incorrectly scale up during restore Oct 6, 2023
@rooftopcellist
Member

@flxbwr Very good find indeed! I saw this in a couple of other issues as well and was able to reproduce it after populating my AWX db with ~2 Gi of data, then trying a backup and restore. A db of that size can make pg_restore last long enough that the installer role interferes, as you said.

We just merged a fix that pauses reconciliation if there is a Restore object whose deployment_name matches the AWX instance currently being reconciled.
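As a rough illustration of that check (a sketch of the idea, not the actual merged code; task names and the exact filter are assumptions):

```yaml
# Sketch: at the top of the AWX reconcile role, look for AWXRestore
# objects targeting this instance and stop reconciling if one exists.
- name: Look for restores targeting this deployment
  kubernetes.core.k8s_info:
    api_version: awx.ansible.com/v1beta1
    kind: AWXRestore
    namespace: "{{ ansible_operator_meta.namespace }}"
  register: restore_objects

- name: Pause reconciliation while a restore for this instance is in progress
  meta: end_play
  when: restore_objects.resources
        | selectattr('spec.deployment_name', 'defined')
        | selectattr('spec.deployment_name', 'equalto', ansible_operator_meta.name)
        | list | length > 0
```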

Thank you for your thorough explanation and workaround on this issue! I think we can close it now. Please open a new issue if you still see this behavior after the change.


flxbwr commented Dec 11, 2023

@rooftopcellist Thank you for solving the issue!
