
Playbook failures with logs stopping after 5 hours #622

Closed
jessesimpson36 opened this issue Oct 28, 2021 · 5 comments

jessesimpson36 commented Oct 28, 2021

ISSUE TYPE
  • Bug Report
SUMMARY

After a playbook has been running for 5 hours, AWX stops providing logs, reports the job as a failure, and gives no error message.

I've spent the past few days searching for various timeouts in SSH, TCP, and the Kubernetes configuration for the AWX operator, but I can't seem to find anything. Wireshark does not show any TCP FIN or TCP RST packets until after AWX has reported the playbook as a failure. I've tried setting the following SSH client options:

 -o ControlMaster=yes -o ControlPersist=80000s -o ConnectTimeout=3600s -o ConnectionAttempts=5 -o ServerAliveInterval=300s -o ServerAliveCountMax=5 

That doesn't seem to help either, and Kubernetes marks the pod as killed after 5 hours.

ENVIRONMENT
  • AWX version: 19.4.0 (awx image version)
  • Operator version: 0.14.0
  • Kubernetes version: microk8s v1.22.2
  • AWX install method: microk8s
STEPS TO REPRODUCE

I believe this can be reproduced with a really long playbook, perhaps one with a single step that lasts 5 hours, such as sleep 18000, though I have not tried this.

EXPECTED RESULTS

Playbook continues executing, or reports an error.

ACTUAL RESULTS

The playbook stops mid-log, AWX marks the job as failed, and the Kubernetes pod is killed with

Reason:          Killed
ADDITIONAL INFORMATION

My setup uses an external PostgreSQL instance. The playbook itself mostly does backup-related tasks like executing pg_dumpall, which takes a long time because the database is quite large.

AWX-OPERATOR LOGS

None right now. If I can reproduce this with a long sleep statement, I'll share that, since I don't want to expose anything from my company.
Update: job_5544.txt

Are there timeouts I'm not aware of, or a place where I can get more insight into what the "automation-job" pods are actually doing? I've tried SSHing into the pod and executing the last command that AWX reports, and that seems to work just fine.

@jessesimpson36 jessesimpson36 changed the title Playbook failures with no logs after 5 hours Playbook failures with logs stopping after 5 hours Oct 28, 2021
@jessesimpson36 (Author)

Screenshot: 2021-10-28-165834_grim

I have 4 hosts in this particular playbook, and it just stops after 2 hosts have completed. No errors, and the job is marked as failed.


jessesimpson36 commented Oct 29, 2021

job_5544.txt

Here's the output of a job that ran a 5-hour-long sleep command in an Ansible playbook.

Ansible playbook:


---
- hosts: custom_directories
  become: yes
  tasks:
    - name: Really long ssh command
      shell:
        executable: /bin/bash
        cmd: "sleep 18000"
    - name: Another shorter ssh command
      shell:
        executable: /bin/bash
        cmd: "sleep 1800"
    

It fails at 5 hours.

@TheRealHaoLiu (Member)

We are working on it...

In the current implementation of Kubernetes (specifically the kubelet) there is a hard-coded time limit for connections:

https://github.com/kubernetes/kubernetes/blob/70e73a6cdaab5dae41357e3a25d6ac60aff62438/pkg/kubelet/server/server.go#L166-L168

At a high level, the approach to solving this problem is to stream logs with Kubernetes --timestamps=true, remember the last timestamp we received, and use --since-time to resume from just before the last log message we received, repeating until we detect that the job is actually Completed (or Failed, or deleted).
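
For illustration, here is a minimal sketch of that resume loop using client-go. This is not the actual receptor code: the function name, namespace, and pod name are placeholders, and error handling, long-line handling, and de-duplication are simplified.

// Sketch: resume a pod log stream past the kubelet's hard-coded connection timeout.
package logstream

import (
	"bufio"
	"context"
	"fmt"
	"strings"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func streamJobLogs(ctx context.Context, clientset kubernetes.Interface, namespace, pod string) error {
	var sinceTime *metav1.Time // nil on the first pass: start from the beginning of the log

	for {
		opts := &corev1.PodLogOptions{
			Follow:     true,
			Timestamps: true,      // prefix every line with an RFC3339 timestamp
			SinceTime:  sinceTime, // resume from just before the last line we saw
		}
		stream, err := clientset.CoreV1().Pods(namespace).GetLogs(pod, opts).Stream(ctx)
		if err != nil {
			return err
		}

		scanner := bufio.NewScanner(stream)
		for scanner.Scan() {
			// Each line looks like "2021-10-28T16:58:34.123456789Z <message>".
			ts, msg, ok := strings.Cut(scanner.Text(), " ")
			if !ok {
				continue
			}
			if t, err := time.Parse(time.RFC3339Nano, ts); err == nil {
				mt := metav1.NewTime(t)
				sinceTime = &mt // remember how far we got
			}
			fmt.Println(msg)
		}
		stream.Close()

		// The stream ended. If the pod is still running, this was most likely the
		// kubelet timeout, so loop around and reconnect with SinceTime set.
		p, err := clientset.CoreV1().Pods(namespace).Get(ctx, pod, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if p.Status.Phase == corev1.PodSucceeded || p.Status.Phase == corev1.PodFailed {
			return nil
		}
	}
}

Because SinceTime only has one-second precision, a few lines around the resume point can be delivered twice, so a real implementation also has to detect and drop the duplicates.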

We ran into some “gotchas” during the implementation of the fix:

  • Pod logs: long lines are corrupted when using timestamps=true, which prevented us from successfully parsing the log message.
    Fixed in kubernetes/kubernetes#113481; the fix will be available in the following Kubernetes releases: 1.23.14, 1.24.8, and 1.25.4.

  • We also needed to fix something in ansible-runner so that we don't get duplicated timestamps on the last message containing the zipdata: ansible/ansible-runner#1161

Here's our receptor PR, still WIP. The green-path test is now finished; we still need to make sure that if we are deployed on an old Kubernetes we preserve the previous behavior and don't barf:

ansible/receptor#683

@shanemcd (Member)

Closing as a dupe of the other issues mentioned above.

@TheRealHaoLiu (Member)

Fixed in ansible/receptor#683.
