AWX jobs can't tolerate the K8s master nodes restart or termination #13350
Comments
Presumably this is due to how we use the k8s logging api to obtain the job results. We did ship a recent patch that will attempt to reconnect when the log stream has been terminated unexpectedly, but we haven't tested under these conditions. @AlanCoding @fosterseth From failed_jobs_api_description.txt - I can see that this is another example where we are shoving the entirety of the stdout into result_traceback, which has to be this code: awx/awx/main/tasks/receptor.py Lines 428 to 431 in b7f2825
This also makes me wonder if we might be overwriting another error that happened here: awx/awx/main/tasks/callback.py Lines 207 to 210 in 893dba7
Does anyone watching this issue feel comfortable patching and building a custom AWX image? We might be able to provide some guidance on what to try. I probably won't have time to look into this myself before sometime early next year. |
I'd like to get some clarification on something here. Are nodes where AWX itself is running getting killed? Or just nodes where the Kubernetes API server is running? |
Only master nodes, where the K8s API server is running. |
@shanemcd, I haven't worked with custom AWX images before. Still, since I have already tested many scenarios with different AWX and K8s versions, I think I could join you in troubleshooting, patching images, and checking the results. We are very interested in resolving this issue as soon as possible. |
@elibogomolnyi How long does the k8s api stay unavailable for? In Receptor 1.3.0 we shipped this patch that attempts to recover when the log stream is unexpectedly terminated. Can you please verify that your control plane ee has this version of Receptor?
Apologies, but you are asking for too much here. As I said before - I do not have time to look into this too deeply right now. I'm only working 2 more days before stepping away from work until sometime in early January. If you are unable to troubleshoot and resolve this problem yourself, perhaps a short-term solution would be to deploy into a distro of Kubernetes that does not have the auto-update behavior. |
After the master node is terminated, it stays unavailable for 5 minutes. It is also worth mentioning that when I terminate the master node, all AWX jobs get terminated almost immediately (20 seconds after the termination signal that I send to the node), so it doesn't seem like any retry mechanism is working in this case.
The receptor version is 1.3.0+g8f8481c
I fully understand, and maybe I expressed myself poorly. You said that you might be able to provide some guidance on what to try. If you think this guidance could help someone who is not a contributor to this project and doesn't have much experience with it resolve the issue, please let me know what we can try, and we will work through it with our team. And by the way, happy holidays!
Since the aim of our project is a migration to EKS, we can't deploy into a distro of Kubernetes that does not have the auto-update behavior. But we might postpone the production migration until this issue is resolved. If we could make it happen faster, we would be glad to contribute. |
PR to address the result_traceback bug here #12961 |
Could you explain how it is related? Do you think it might fix the issue caused by the master node restart? |
@elibogomolnyi sorry, the PR I linked is for the result_traceback bug that Shane pointed out |
Hi @shanemcd, please tell me if there is anything else we can do to help to fix this bug. |
Hi AWX community and team, I hope to hear from you soon. Thanks, |
Following the conversation with @TheRealHaoLiu about this issue, we ran some additional tests: We also checked that during the master node termination, we can still access the K8s API. We ran the "kubectl get nodes" command continuously, and it was never interrupted. So the Kubernetes API kept working. I am attaching the instructions for deploying the kOps cluster and AWX for full error reproduction: We get the following error when one of the master nodes gets terminated: |
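For anyone who wants to repeat the availability check described above, here is a minimal sketch of such a probe. It is not the script used in the tests; the function names `kubectl_get_nodes` and `probe_api`, the attempt count, and the interval are assumptions chosen for illustration. It assumes `kubectl` is on the PATH and already configured for the cluster.

```python
import subprocess
import time
from typing import Callable, List, Tuple


def kubectl_get_nodes() -> bool:
    """Return True if `kubectl get nodes` exits successfully."""
    try:
        result = subprocess.run(
            ["kubectl", "get", "nodes"],
            capture_output=True,
            timeout=10,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A timeout or a missing kubectl binary counts as a failed check.
        return False
    return result.returncode == 0


def probe_api(
    check: Callable[[], bool],
    attempts: int,
    interval_seconds: float,
) -> List[Tuple[int, bool]]:
    """Run `check` repeatedly and record (attempt number, success) pairs."""
    results = []
    for attempt in range(attempts):
        results.append((attempt, check()))
        time.sleep(interval_seconds)
    return results
```

Calling `probe_api(kubectl_get_nodes, attempts=300, interval_seconds=1.0)` while a master node is being terminated and then filtering for `False` entries shows whether any request failed. Note that a probe like this only issues short requests, so it would not necessarily notice a long-lived connection, such as a log stream, being closed.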
I put up a very rough PR to test whether catching the GOAWAY error and retrying will help work around this problem |
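The idea of catching a GOAWAY error and retrying can be sketched roughly as follows. This is not the code from the PR (Receptor is written in Go); it is a Python illustration under stated assumptions: `StreamClosedError`, `is_goaway`, `follow_logs`, and the `open_stream(skip)` callable are all hypothetical names, and the sketch assumes the error text contains the word "GOAWAY".

```python
import time
from typing import Callable, Iterator, List


class StreamClosedError(Exception):
    """Hypothetical error raised when the log stream is cut off."""


def is_goaway(err: Exception) -> bool:
    # Assumption: the transport surfaces the GOAWAY condition in the error text.
    return "GOAWAY" in str(err)


def follow_logs(
    open_stream: Callable[[int], Iterator[str]],
    max_retries: int = 5,
    delay_seconds: float = 1.0,
) -> List[str]:
    """Collect log lines, reopening the stream after a GOAWAY error.

    `open_stream(skip)` is a hypothetical callable that yields log lines,
    skipping the first `skip` lines that were already received.
    """
    lines: List[str] = []
    retries = 0
    while True:
        try:
            for line in open_stream(len(lines)):
                lines.append(line)
                retries = 0  # progress was made, so reset the failure counter
            return lines  # the stream ended normally
        except StreamClosedError as err:
            # Give up on unrelated errors or after too many consecutive failures.
            if not is_goaway(err) or retries >= max_retries:
                raise
            retries += 1
            time.sleep(delay_seconds)
```

Resuming from the number of lines already received is one simple way to avoid duplicating output after a reconnect; a real implementation might resume from a timestamp instead.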
@elibogomolnyi thanks for helping us identify the specific error we are encountering. Here's a test image with my code change: quay.io/haoliu/awx-ee:goaway. Can you replace the
@TheRealHaoLiu, now it works like a charm with the kOps cluster! The job keeps running. It will take some time for us to check this issue with the EKS cluster since it requires cooperation from the AWS support side. But as far as I understand, it should also fix the EKS issue. I appreciate your help; it is a very important fix. When can it be merged? |
Hi @TheRealHaoLiu, together with the AWS support team I've checked how AWX works with EKS, and everything works like a charm with your fix. Thanks to the community for moving this PR forward so fast. If this PR is already merged into Receptor's devel branch, does it mean the next AWX version will already contain this change? |
It is also worth mentioning that with the customized image, AWX can tolerate the EKS master node termination but can't tolerate the EKS control plane upgrade. It is not a problem for us since the EKS upgrade requires maintenance and downtime, but it is good to know about it. |
Interesting, I tested this with an OCP upgrade and it held up pretty well... Have you tried using the graceful termination feature for AWX and a PodDisruptionBudget in kube? I'm working on something to show how to make AWX tolerate a kube upgrade with no downtime |
What do you observe during the EKS control plane upgrade? Is the API server still reachable? |
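For readers unfamiliar with PodDisruptionBudget, here is a minimal sketch that builds a `policy/v1` manifest as JSON. The name `awx-task-pdb`, the `awx` namespace, and the label selector are assumptions for illustration only; check the labels on your own AWX pods before using anything like this.

```python
import json


def pod_disruption_budget(
    name: str,
    namespace: str,
    match_labels: dict,
    min_available: int,
) -> dict:
    """Build a policy/v1 PodDisruptionBudget manifest as a plain dict."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": match_labels},
        },
    }


# Hypothetical name, namespace, and labels; adjust to your deployment.
manifest = pod_disruption_budget(
    name="awx-task-pdb",
    namespace="awx",
    match_labels={"app.kubernetes.io/name": "awx-task"},
    min_available=1,
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`. Keep in mind that a PodDisruptionBudget only limits voluntary evictions, such as a node drain during an upgrade; it does not protect pods from an abrupt node termination.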
Hi @TheRealHaoLiu, I didn't try to use the graceful termination and PodDisruptionBudget, but I can do it when we continue our performance tests.
I didn't check it myself, but according to the AWS EKS documentation, the EKS API might be unreachable during this process. If the API is not accessible for some time during the upgrade, does it mean that AWX can't reconnect to it? https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html |
Won't add coverage for this issue
Please confirm the following
Bug Summary
When one of the master nodes (control plane nodes, in the case of EKS) gets terminated or restarted, all the AWX jobs related to this node (we don't know how they are linked to this master node, but we can definitely see that they are) also get terminated. We see the "Error" status for these jobs in the UI. We checked this behavior with the following configurations:
The problem is more severe for EKS clusters since AWS sometimes brings down the master nodes to perform package upgrades, and we can't control it. As a result, whenever it happens, the jobs that are somehow connected to the restarted or terminated master node are killed with "Error" without any discoverable reason.
AWX version
21.10.1
Select the relevant components
Installation method
Kubernetes
Modifications
no
Steps to reproduce
Expected results
The AWX jobs keep running.
Actual results
The AWX jobs are terminated; in the UI, we can see "Error" without any logs.
Additional information
failed_jobs_api_description.txt