
Playbook failures with logs stopping after 5 hours #622

Closed
jessesimpson36 opened this issue Oct 28, 2021 · 5 comments

jessesimpson36 commented Oct 28, 2021

ISSUE TYPE
  • Bug Report
SUMMARY

After a playbook has been running for 5 hours, AWX stops providing logs, reports the job as a failure, and gives no error message.

I've spent the past few days searching for various timeouts in SSH, TCP, and the Kubernetes configuration for the AWX operator, but I can't seem to find anything. Wireshark does not show any TCP FIN or TCP RST packets until after AWX has reported the playbook as a failure. I've tried setting the following SSH client options:

 -o ControlMaster=yes -o ControlPersist=80000s -o ConnectTimeout=3600s -o ConnectionAttempts=5 -o ServerAliveInterval=300s -o ServerAliveCountMax=5 

That doesn't seem to help either, and Kubernetes marks the pod as killed after 5 hours.

ENVIRONMENT
  • AWX version: 19.4.0 (awx image version)
  • Operator version: 0.14.0
  • Kubernetes version: microk8s v1.22.2
  • AWX install method: microk8s
STEPS TO REPRODUCE

I believe this can be reproduced with a really long playbook, perhaps one with a single step that lasts 5 hours, such as sleep 18000, though I have not tried this.

EXPECTED RESULTS

Playbook continues executing, or reports an error.

ACTUAL RESULTS

The playbook stops mid-log, AWX marks the job as failed, and the Kubernetes pod is killed with

Reason:          Killed
ADDITIONAL INFORMATION

My setup uses an external PostgreSQL instance. The playbook itself mostly does backup-related tasks like executing pg_dumpall, which takes a long time because the database is quite large.

AWX-OPERATOR LOGS

None right now. If I can reproduce this with a long sleep statement, I'll share that, since I don't want to expose anything from my company.
Update: job_5544.txt

Are there timeouts I'm not aware of, or a place where I can get more insight into what the "automation-job" pods are actually doing? I've tried SSHing into the pod and executing the last command that AWX reports, and that seems to work just fine.

@jessesimpson36 jessesimpson36 changed the title Playbook failures with no logs after 5 hours Playbook failures with logs stopping after 5 hours Oct 28, 2021
@jessesimpson36 (Author)

Screenshot: 2021-10-28-165834_grim

I have 4 hosts in this particular playbook, and it just stops after 2 hosts have completed. No errors, and the job is marked as failed.


jessesimpson36 commented Oct 29, 2021

job_5544.txt

Here's the output of a job that ran a 5-hour-long sleep command in an Ansible playbook.

Ansible playbook:


---
- hosts: custom_directories
  become: yes
  tasks:
    - name: Really long ssh command
      shell:
        executable: /bin/bash
        cmd: "sleep 18000"
    - name: Another shorter ssh command
      shell:
        executable: /bin/bash
        cmd: "sleep 1800"
    

It fails at 5 hours.

@TheRealHaoLiu (Member)

We are working on it...

In the current implementation of Kubernetes (specifically the kubelet) there is a hard-coded time limit for connections:

https://github.com/kubernetes/kubernetes/blob/70e73a6cdaab5dae41357e3a25d6ac60aff62438/pkg/kubelet/server/server.go#L166-L168

At a high level, the approach to solving this problem is to stream logs with Kubernetes --timestamps=true, remember the last timestamp we received, and use --since-time to resume from just before the last log message we received, repeating until we detect that the job is actually Completed (or Failed, or deleted).
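
For illustration, here is a minimal sketch of that resume loop using client-go. This is not the actual receptor code: the function name, namespace, and pod name are placeholders, and error handling, long-line handling, and de-duplication are simplified.

// Sketch: resume a pod log stream past the kubelet's hard-coded connection timeout.
package logstream

import (
	"bufio"
	"context"
	"fmt"
	"strings"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func streamJobLogs(ctx context.Context, clientset kubernetes.Interface, namespace, pod string) error {
	var sinceTime *metav1.Time // nil on the first pass: start from the beginning of the log

	for {
		opts := &corev1.PodLogOptions{
			Follow:     true,
			Timestamps: true,      // prefix every line with an RFC3339 timestamp
			SinceTime:  sinceTime, // resume from just before the last line we saw
		}
		stream, err := clientset.CoreV1().Pods(namespace).GetLogs(pod, opts).Stream(ctx)
		if err != nil {
			return err
		}

		scanner := bufio.NewScanner(stream)
		for scanner.Scan() {
			// Each line looks like "2021-10-28T16:58:34.123456789Z <message>".
			ts, msg, ok := strings.Cut(scanner.Text(), " ")
			if !ok {
				continue
			}
			if t, err := time.Parse(time.RFC3339Nano, ts); err == nil {
				mt := metav1.NewTime(t)
				sinceTime = &mt // remember how far we got
			}
			fmt.Println(msg)
		}
		stream.Close()

		// The stream ended. If the pod is still running, this was most likely the
		// kubelet timeout, so loop around and reconnect with SinceTime set.
		p, err := clientset.CoreV1().Pods(namespace).Get(ctx, pod, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if p.Status.Phase == corev1.PodSucceeded || p.Status.Phase == corev1.PodFailed {
			return nil
		}
	}
}

Because SinceTime only has one-second precision, a few lines around the resume point can be delivered twice, so a real implementation also has to detect and drop the duplicates.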

We ran into some “gotchas” during the implementation of the fix:

  • Pod logs: long lines are corrupted when using timestamps=true, which prevented us from successfully parsing the log message.
    Fixed in kubernetes/kubernetes#113481; the fix will be available in the following Kubernetes releases: 1.23.14, 1.24.8, and 1.25.4.

  • We also needed to fix something in ansible-runner so that we don't get duplicated timestamps on the last message containing the zipdata: ansible/ansible-runner#1161

Here's our receptor PR, still WIP. The green-path test is now finished; we still need to make sure that if we are deployed on an old Kubernetes we preserve the previous behavior and don't barf:

ansible/receptor#683

@shanemcd (Member)

Closing as a dupe of the other issues mentioned above.

@TheRealHaoLiu (Member)

Fixed in ansible/receptor#683.
